Prospective validation of the NCI Breast Cancer Risk Assessment Tool (Gail Model) on 40,000 Australian women

There is a growing interest in delivering more personalised, risk-based breast cancer screening protocols. This requires population-level validation of practical models that can stratify women into breast cancer risk groups. Few studies have evaluated the Gail model (NCI Breast Cancer Risk Assessment Tool) in a population screening setting; we validated this tool in a large, screened population. We used data from 40,158 women aged 50–69 years (via the lifepool cohort) participating in Australia’s BreastScreen programme. We investigated the association between Gail scores and future invasive breast cancer, comparing observed and expected outcomes by Gail score ranked groups. We also used machine learning to rank Gail model input variables by importance and then assessed the incremental benefit in risk prediction obtained by adding variables in order of diminishing importance. Over a median of 4.3 years, the Gail model predicted 612 invasive breast cancers compared with 564 observed cancers (expected/observed (E/O) = 1.09, 95% confidence interval (CI) 1.00–1.18). There was good agreement across decile groups of Gail scores (χ2 = 7.1, p = 0.6) although there was some overestimation of cancer risk in the top decile of our study group (E/O = 1.65, 95% CI 1.33–2.07). Women in the highest quintile (Q5) of Gail scores had a 2.28-fold increased risk of breast cancer (95% CI 1.73–3.02, p < 0.0001) compared with the lowest quintile (Q1). Compared with the median quintile, women in Q5 had a 34% increased risk (95% CI 1.06–1.70, p = 0.014) and those in Q1 had a 41% reduced risk (95% CI 0.44–0.79, p < 0.0001). Similar patterns were observed separately for women aged 50–59 and 60–69 years. The model’s overall discrimination was modest (area under the curve (AUC) 0.59, 95% CI 0.56–0.61). A reduced Gail model excluding information on ethnicity and hyperplasia was comparable to the full Gail model in terms of correctly stratifying women into risk groups. 
This study confirms that the Gail model (or a reduced model excluding information on hyperplasia and ethnicity) can effectively stratify a screened population aged 50–69 years according to the risk of future invasive breast cancer. This information has the potential to enable more personalised, risk-based screening strategies that aim to improve the balance of the benefits and harms of screening.


Background
National guidelines and programmes for universal age-based breast cancer screening were established in many countries following trials showing reduced breast cancer mortality [1][2][3][4]. However, increasing evidence on measurable risk factors for breast cancer [5,6] and growing concern about overdiagnosis [7,8] and the appropriateness of mammography for women with dense breasts [9,10] have fuelled interest in more personalised, risk-stratified screening protocols that better optimise the balance of the benefits and harms of screening [11]. A number of countries have established nationally co-ordinated screening programmes. Australia, for example, has a breast cancer screening programme (BreastScreen Australia) offering free biennial mammograms targeted towards women aged 50-74 years (extended from 50-69 years in mid-2015) with participation of approximately 55% [12]. Similar programmes have been established in the UK, Canada, Europe, and elsewhere. While risk-stratified screening intervals and more intensive surveillance for high-risk women or women with high mammographic density have been proposed [13], there are no widespread protocols for tailored breast cancer screening in Australia or internationally.
Risk-stratified screening protocols require accurate estimates of risk using data that can be readily obtained by population-based programmes. The Gail model [14][15][16] is relatively simple, requiring minimal information on the family history of cancer. The original model estimated absolute risk of invasive and in-situ breast cancers [17], and was later modified [18] and incorporated into the National Cancer Institute's Breast Cancer Risk Assessment Tool (hereafter referred to as the Gail model) and used for predicting invasive breast cancer risk for women without a personal history of breast cancer [19]. The Gail model has performed well on white women residing in the US and Europe [20][21][22], with poorer performance in women of other ethnic backgrounds, such as African American, Hispanic, Asian, and Pacific Islander women [23][24][25]. In Australia, the performance of the Gail model has been assessed for high-risk women [26] and women younger than 60 years of age [27].
The lifepool cohort comprises 53,800 women recruited since 2010 primarily from the Australian population-based mammography screening programme to facilitate research into breast cancer screening, epidemiology, and genetics. Using data from baseline questionnaires, we generated Gail risk estimates for active breast cancer screening participants in the historical target age range for screening (50-69 years) and compared predicted and observed risk of incident invasive breast cancer. In addition, we evaluated risk estimates from reduced Gail models, assessing the incremental benefit obtained by adding variables to the model in order of diminishing contribution to risk estimation.

Study participants
Lifepool commenced recruitment in May 2010, restricted to women aged at least 40 years at enrolment. Up to January 2015, recruitment was primarily through an invitation included in appointment letters for women attending subsequent rounds of screening at the BreastScreen programme based in the Australian state of Victoria (BreastScreen Victoria). Other methods of recruitment were publicity at women's health events, referrals by participants to friends and family, and inclusion as a research project on the national database Register4 [28] in July 2012. On enrolment, lifepool participants complete a detailed 'baseline' questionnaire capturing socio-demographic, lifestyle, and health-related information. Further details on the cohort including the questionnaire and other material can be found on-line (http://www.lifepool.org). The lifepool cohort is regularly linked to BreastScreen Victoria records and to the Victorian Cancer Registry to update information on the occurrence of any cancer diagnosed within the state of Victoria.

Data provided for this analysis
Complete questionnaire data were provided for this study for all participants who completed the baseline questionnaire up to 11 September 2016. Lifepool also provided linked data comprising: 1) BreastScreen Victoria screening episodes up to 27 June 2017 with information on screening dates and cancer diagnoses (screen-detected or interval cancer, diagnosis date, invasive or in situ); and 2) Victorian Cancer Registry breast cancer diagnoses (date, invasive or in situ) and, for women with any cancer registration, death records (date, cause of death). Lifepool also provided participant withdrawals, ad hoc death notifications, and cancer diagnoses outside Victoria. Data provision is described in Additional file 1.

Statistical analyses

Gail scores
Gail risk scores were assigned using the source code available on the National Cancer Institute website [19], which generates the probability of breast cancer for some specified integer year in the future (e.g. 5-year risk), or to a fixed age in years for a study population.
To evaluate the Gail model as a potential tool for assessing the risk of future breast cancer following a clear screen, we restricted our analyses to women aged 50-69 years who had had a screening episode with a benign final outcome within ±60 days of completing their baseline study questionnaire ('reference screen') and, as per the model's specification, no personal history of invasive breast cancer, ductal carcinoma in situ (DCIS), or lobular carcinoma in situ (LCIS) prior to that screen.
We did not use the 'family membership' field in the Gail model source code designed for generating scores for groups of women (which would combine risk information from identified family members in the study group) as this information was unavailable in our data. Most race/ethnicity categories within the Gail model did not map to the ethnic profile of Australian women; as a best approximation, women who self-reported any Asian ethnicity were assigned to the Gail category 'Asian-American' (relabelled to 'Asian') and all other women to the category 'White' (labelled 'Mixed').
We generated the 5-year probability of breast cancer ('scores') for each woman and compared incident invasive breast cancer outcomes by quantile groups of risk (partitioned by group-level quintiles and/or deciles), for three age ranges (50-69/50-59/60-69 years). Hazard functions were censored to diagnosis (invasive or in situ), death, or 31 December 2016 (whichever occurred first). Quantile groups (i.e. quintiles and deciles) were generated for each age range analysed to reflect how the Gail model would assign women to risk groups if used on specific age groups. Receiver operating characteristic (ROC) curves were generated for outcomes against continuous Gail scores for women with a minimum follow-up period of 3 years. To compare observed and estimated diagnoses, we generated the Gail predicted probability of breast cancer for each woman for her observation period by linear interpolation between annual Gail estimates. Of note, the order of Gail scores does not change with the specified duration of future risk, so women would be ranked the same whether we described 1-year, 5-year or 10-year risk. However, the expected number of cancers in this study is dependent on the follow-up time for each woman, so that women with the same rank of baseline risk but different observation periods (e.g. 3 years versus 6 years) would have a different probability of a cancer being observed during the follow-up period. We then summed these observation period-based probabilities for each 5-year risk quantile group to generate the expected number of cancers within that group, and compared this with the observed number of cancers using chi-squared tests and ratios of expected to observed cancers (confidence intervals (CIs) calculated as for Constantino et al. [29]). Statistical tests used Stata 15 software (StataCorp, College Station, TX, USA).
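The interpolation and summation steps described above can be illustrated with a minimal Python sketch. It is not the analysis code used in this study (which was run in Stata); the function names, the assumption that cumulative risk is zero at time zero, and the annual risk values in the usage lines are all hypothetical and chosen only for illustration.

```python
def interpolated_risk(annual_risk, followup_years):
    """Cumulative risk at a (possibly fractional) follow-up time.

    annual_risk[k] is the model's cumulative probability of cancer by the
    end of year k + 1 (illustrative values only). Risk at time 0 is
    assumed to be 0, and values between whole years are linearly
    interpolated.
    """
    if not 0 <= followup_years <= len(annual_risk):
        raise ValueError("follow-up lies outside the annual estimates")
    knots = [0.0] + list(annual_risk)
    lower = int(followup_years)          # last whole year reached
    if lower == len(annual_risk):        # exactly at the final estimate
        return knots[-1]
    fraction = followup_years - lower    # part of the next year observed
    return knots[lower] + fraction * (knots[lower + 1] - knots[lower])


def expected_cases(women):
    """Sum observation-period probabilities over (annual_risk, years) pairs."""
    return sum(interpolated_risk(risk, years) for risk, years in women)


# Hypothetical woman: cumulative risk of 0.3% per year over 5 years.
toy_risk = [0.003, 0.006, 0.009, 0.012, 0.015]
print(interpolated_risk(toy_risk, 3.5))
print(expected_cases([(toy_risk, 3.5), (toy_risk, 5.0)]))
```

Because each woman contributes the probability for her own observation period, two women with identical 5-year scores but different follow-up times add different amounts to the expected count for their quantile group.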

Reduced variable Gail models
We evaluated Gail models using a reduced number of input variables, starting with the most important predictor of cancer risk in this cohort as identified using a machine learning approach. To maximise information to train and validate machine learning, we extended the dataset to all ages and women with invasive cancer diagnosed at the baseline mammogram (Fig. 1). The eight Gail variables ('features') were ranked using the feature importance function in XGBoost (version 0.72) implemented in Python (version 3.4). We conducted 100 extractions of training and test datasets. For each extraction, we randomly selected a test set (N = 6131) comprising a representative balance of cases (women who developed breast cancer) and controls (women who did not develop breast cancer) and a corresponding training set (N = 16,269) weighted to have a ratio of 1:9 cases to controls. The model was trained on each training dataset and validated on the corresponding test dataset, generating 100 ranks of variable importance which were then combined in a single ranking of variables according to the number of times each variable appeared in that ranking. Gail scores were calculated for each model by step-wise addition of variables according to that ranking (Models 1-8), with these scores then categorised into quantile groups and then evaluated under a hazards framework as for the whole model.
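As an illustration of how 100 per-run rankings might be combined into a single ranking, the following Python sketch orders variables by the position each occupied most often, breaking ties by mean position. This is one plausible reading of the aggregation rule, not a confirmed reproduction of it; the function name, the tie-break, and the shortened variable names are hypothetical. The toy input uses the 62% frequency reported in the Results for the two variables that exchanged places.

```python
from collections import Counter


def aggregate_rankings(rankings):
    """Combine several rankings (most important variable first) into one.

    Each variable is placed by its most frequent (modal) position across
    runs; ties are broken by mean position. This rule is an assumption
    made for illustration.
    """
    variables = rankings[0]
    position_counts = {variable: Counter() for variable in variables}
    for ranking in rankings:
        for position, variable in enumerate(ranking, start=1):
            position_counts[variable][position] += 1

    def sort_key(variable):
        counts = position_counts[variable]
        # Most frequent position; the earlier position wins a tie.
        modal = max(counts, key=lambda p: (counts[p], -p))
        mean = sum(p * n for p, n in counts.items()) / len(rankings)
        return (modal, mean)

    return sorted(variables, key=sort_key)


# Toy input: two variables swap second and third place in 38 of 100 runs.
runs = (
    [["age", "first_live_birth_age", "age_at_menarche"]] * 62
    + [["age", "age_at_menarche", "first_live_birth_age"]] * 38
)
print(aggregate_rankings(runs))
```

In the study itself, each per-run ranking came from the XGBoost feature importance function applied to a freshly drawn training set; only the aggregation step is sketched here.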

Cohort characteristics
A total of 40,158 women (75% of the cohort) were included in our analyses. Major exclusions were: 2806 women who resided outside the state of Victoria at the time of completing their questionnaire because their subsequent diagnoses were unlikely to appear on Victorian screening and cancer registry records; 988 women who were not linked to screening records; 3085 women who did not have a baseline screening mammogram within 60 days of completing their questionnaire; and 169 women with a personal history of breast cancer prior to their reference screen. We excluded a further 262 women who had had a breast cancer diagnosis (205 invasive and 57 DCIS) at their reference screen, and 5965 women outside the historical BreastScreen target age range of 50-69 years at their reference screen for logistic regression analyses (however, these women were included in the machine-learning sample). No women remaining in the sample had a LCIS diagnosis at or prior to their reference screen. Additional exclusions are presented in Fig. 1.
During a median follow-up of 4.3 years, 564 women (1.4%) were diagnosed with invasive breast cancer (Table 1). The median time from the reference screen to diagnosis was 813 days (2.2 years), with a maximum of 5.3 years. Three women were diagnosed with incident LCIS (one with subsequent invasive breast cancer within the follow-up period), and 243 deaths from all causes were reported of which eight were due to breast cancer. Gail model variables for this group are described in Table 2. Women who developed invasive breast cancer were older at enrolment, more likely to have first-degree female relatives with breast cancer, and were more likely to have had a breast biopsy. Approximately 3% of all participants were of Asian ethnicity; however, it should be noted that women in the 'mixed' group were ethnically heterogeneous. Nearly all women (95%) attended screening during the follow-up period (Table 1).

Cancer incidence
Observed and expected diagnoses are shown as rates according to decile groups of Gail model-predicted 5-year risk in Fig. 2, with ratios of expected to observed invasive cancers (E/O) according to quantile groups of predicted 5-year risk shown in Table 3. Overall, the model was generally well calibrated with some evidence of over-prediction in women at the highest level of risk; 612 cases were predicted compared with 564 cases observed, corresponding to an expected-to-observed ratio of 1.09 (95% CI 1.00-1.18). Expected and observed outcomes by quintile groups differed significantly overall (χ2 = 23.0, p < 0.0001). E/O did not differ significantly for quantile groups Q1-Q4 and D9; however, the Gail model overestimated risk for women in decile group D10 (E/O 1.65, 95% CI 1.33-2.07), leading to a net overestimation in group Q5 (E/O 1.40, 95% CI 1.20-1.64). Similar patterns persisted within age groups 50-59 and 60-69 years (E/O 1.08, 95% CI 0.96-1.23, and 1.09, 95% CI 0.97-1.22, respectively).
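For readers wishing to check the overall calibration figures, the Python sketch below computes an expected-to-observed ratio with a 95% CI. It assumes the observed count is Poisson distributed and uses a normal approximation on the log scale; this formula is an assumption about the method and has not been confirmed against reference [29]. The function name is hypothetical; the counts 612 and 564 are those reported above.

```python
import math


def eo_ratio_ci(expected, observed, z=1.96):
    """E/O ratio with an approximate CI.

    Assumes a Poisson-distributed observed count, so the standard error
    of log(E/O) is taken as 1 / sqrt(observed). Illustrative sketch only.
    """
    ratio = expected / observed
    half_width = z / math.sqrt(observed)
    return ratio, ratio * math.exp(-half_width), ratio * math.exp(half_width)


ratio, low, high = eo_ratio_ci(expected=612, observed=564)
print(f"E/O = {ratio:.2f} (95% CI {low:.2f}-{high:.2f})")
# prints "E/O = 1.09 (95% CI 1.00-1.18)"
```

Under this assumption the sketch reproduces the overall interval reported above, and the width of the interval depends only on the number of observed cancers, which is why the decile-level intervals are considerably wider.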

Reduced Gail model
Machine learning models ranked the importance of Gail model variables as ordered in Table 4 (age being the most important). Most variables were consistently ranked for the 100 runs, except for 'first live birth age' and 'age at menarche' which exchanged places having a 62% frequency of ranking in second and third positions, respectively. Hazard ratios for each quintile group were found to vary as the first four variables were progressively added (Models 1-5) but changed little with the addition of further variables (Models 6-8); Model 5 (incorporating number of biopsies) led to a more accurate ranking of observed outcomes than Models 1-4 (Fig. 3). For Model 5, women in group Q5 had a 2.28-fold higher risk of developing invasive breast cancer compared with women in Q1 (95% CI 1.73-3.01) (Table 4). Of note, when the number of first-degree relatives was added (Model 4), the expected values increased greatly in the upper decile but the observed values did not rise to match (E/O for D10 was 0.99-1.03 for Models 1-3, then 1.51-1.66 for Models 4-8). Therefore, Model 4 appears comparable to the full Gail model in terms of stratifying women into risk groups.

Discussion
Comparing outcomes arising within a maximum of 6.5 years follow-up, we found that women aged 50-69 years within the highest quintile of Gail risk scores (Q5) had more than double the risk of invasive breast cancer compared with women in the lowest quintile (Q1). Compared with women in the median-risk group (Q3), Q1 had a 40% reduced risk and Q5 a 34% increased risk of incident invasive breast cancer. This suggests that the existing Gail model is suitable for assigning women into groups at significantly different risk of invasive breast cancer in the 5 years following a negative screen.
We found good overall agreement between expected and observed cases of invasive breast cancer, confirming absolute risk estimates over an average of 4.3 years of follow-up except for women in the upper decile of Gail scores; while these women were appropriately classified as the highest-risk group, their absolute Gail risk scores overestimated the observed outcomes (Fig. 2 and Table 3). This may be due to the exclusion of higher-risk women such as women with cancer diagnosed at the first-round or other prior screening episodes and/or women who attend high-risk services rather than BreastScreen due to a family history or identified increased genetic risk of breast cancer. This latter theory is supported by the increase in expected cancers in group D10 with the addition of family history to the reduced Model 5, without a concomitant increase in the observed number of cancers in that group. Therefore, using the Gail model in this population is expected to rank women well into the quantile groups examined; however, for women assigned to the highest decile of risk (> 3% estimated 5-year risk) a more detailed risk assessment or alternative models incorporating additional family history information might be considered, such as that proposed by Pfeiffer et al. [30]. The current Gail model does not incorporate high-risk gene mutations such as BRCA1/2; in Australia, such women are referred to more intensive surveillance outside the BreastScreen programme. Of note, the ethnicity variable was ranked with low importance in our machine learning models, reflecting poor correspondence between Australian ethnicity groups and the Gail 'race' variable values. A modified ethnicity variable suited to the local population may  [31].
Using machine learning, a reduced model resulted in hazard ratios comparable to the full Gail model, suggesting that a simplified model (e.g. limited to age, first live birth age, age at menarche, number of first-degree female relatives with breast cancer, and possibly history of biopsy) could be equally effective in this population while saving significant effort and resources. Unsurprisingly, the stepwise addition of the variable 'had biopsy' made little difference since the number of biopsies was already included. The ethnicity variable would hold more value if the Gail model was modified to suit Australian ethnicity categories.
The modest discriminatory accuracy of the Gail model (AUC = 0.59) is consistent with a recent meta-analysis of European validation studies (pooled AUC = 0.58) [32], confirming that risk information should be conveyed clearly and carefully to ensure that it is understood to apply to group-level rather than individual-level risk. However, group-level estimates such as a 5-year risk of less than 1% for women in the lowest quintile versus more than 3% for women in the upper decile (Table 3) are meaningful for group-level health advice and interventions, such as the potential value of more personalised screening protocols targeted to specific risk groups.
This study has various strengths. Analyses are based on data from a large prospective cohort of actively screened participants, with questionnaires completed during 2010-2014 and outcomes recorded up to end 2016, and therefore results are highly relevant to contemporary screening populations and programmes. Cancer outcomes were identified through direct linkage with cancer registrations, and screening histories by direct linkage with the screening programme. We accounted for censoring by using hazards models, and we report outcomes for groups based on quintile and decile values to demonstrate potential applications for this tool not only to identify women at very high risk of breast cancer but also to identify women at medium and reduced risk of breast cancer.
Our study has several limitations. Firstly, we did not have records of cancers diagnosed outside the state of Victoria, although these are likely to be few. Secondly, we did not have complete death records. Based on Australian deaths data [33] (average death rates for 2010-2012 by 5-year age group applied to observed person-years to the end of 2016), the expected number of all-cause deaths in this cohort is approximately 724 (versus 243 recorded deaths). Our 'expected' cancers will therefore be slightly overestimated due to overestimated exposure time to risk of breast cancer for women without a cancer registered in Victoria. This may help explain why the expected number of cancers exceeded the observed number. However, because other-cause death is unlikely to be strongly associated with the Gail model within the age group examined, confounding would be minimal. Another limitation relates to the generalisation to the whole screened population; our sample is drawn from BreastScreen participants who consented to participate in the lifepool cohort and these women may be more willing and/or able than other BreastScreen participants to provide the information required for the Gail model.
This study contributes to the international body of evidence on the validity of the Gail model as well as providing information on the model's applicability in a population breast screening setting. Although several validation studies of Gail model predictions on prospective cohorts have been conducted [32], limited validation studies have been performed on women attending routine breast cancer screening [14][34][35][36][37][38]. This is the first validation study applied to a population of breast cancer screening participants in Australia.
Table 3 Comparison of expected and observed cases of invasive breast cancer, and hazard ratios for observed cases, according to Gail model predicted 5-year risk for all women by age group, and for group level risk quintiles (Q1 to Q5) and, within Q5, the upper two deciles of risk (D9 and D10)
As appropriate for validating a predictive tool, our analysis excluded from our study group women with a breast cancer diagnosis at or prior to their 'baseline' lifepool recruitment screen; it is possible that the observed rates of cancer would be slightly different if the risk tool was applied to all women at first-round screening, or if the risk tool was applied to the general population (e.g. through general practice). Since its inception, the Gail model has been modified to account for the variation in breast cancer risk observed in various populations [23][24][25]. Risk prediction can be improved by combining the Gail model with mammographic density [21,34] and genetic factors [27,38]. Future work by our group will extend the use of machine learning methods to generate breast cancer risk prediction models based on lifepool cohort data, optimally combining clinical, genetic, mammographic density, and behavioural risk factors. We will also report outcomes for younger and older women, by mode of detection (screen, interval or other), and incidence of DCIS as the lifepool cohort matures.

Conclusions
The findings from this study indicate that the Gail model, or a simplified version of this model, is an effective tool for stratifying active breast cancer screening participants aged 50-69 years to groups according to risk of invasive breast cancer diagnosed up to 5 years following risk assessment.

Additional file
Additional file 1: Table S1.

The funder had no role in the design of the study, the collection, analysis, or interpretation of the data, the writing of the manuscript, or the decision to submit the manuscript for publication.

Availability of data and materials
The data that support the findings of this study are available from the corresponding author, but restrictions apply to the availability of these data which were used under license for the current study and thus are not publicly available. However, data are available from the authors upon reasonable request and with permission of the lifepool cohort study.

Authors' contributions
CN, IC, PJ, and GBM conceptualised the study. LD, SC, GL, PP, and CW were involved in data acquisition and cleaning. PP analysed the data for machine learning and CN conducted all other analyses. CN, PP, LSV, IC, and PJ interpreted the data. LSV drafted the manuscript with substantial contribution from CN. All authors read, critically reviewed, and approved the final manuscript.

Ethics approval and consent to participate
Lifepool was approved by the Peter MacCallum Cancer Centre Human Research Ethics Committee (reference 0966) on 22 January 2010. Women provided informed consent prior to enrolment to lifepool.

Consent for publication
Not applicable.