
Development and external validation of a breast cancer absolute risk prediction model in Chinese population



In contrast to developed countries, breast cancer in China is characterized by a rapidly escalating incidence rate over the past two decades, a lower survival rate, and vast geographic variation. However, no validated risk prediction model is yet available in China to aid early detection.


A large nationwide prospective cohort, China Kadoorie Biobank (CKB), was used to evaluate relative and attributable risks of invasive breast cancer. A total of 300,824 women free of any prior cancer were recruited during 2004–2008 and followed up to Dec 31, 2016. Cox models were used to identify breast cancer risk factors and build a relative risk model. Absolute risks were calculated by incorporating national age- and residence-specific breast cancer incidence and non-breast cancer mortality rates. We used an independent large prospective cohort, Shanghai Women’s Health Study (SWHS), with 73,203 women to externally validate the calibration and discriminating accuracy.


During a median of 10.2 years of follow-up in the CKB, 2287 cases were observed. The final model included age, residence area, education, BMI, height, family history of overall cancer, parity, and age at menarche. The model was well-calibrated in both the CKB and the SWHS, yielding expected/observed (E/O) ratios of 1.01 (95% confidence interval (CI), 0.94–1.09) and 0.94 (95% CI, 0.89–0.99), respectively. After eliminating the effect of age and residence, the model showed moderate discriminating accuracy, comparable with that of some previous externally validated models. The adjusted areas under the receiver operating characteristic curve (AUC) were 0.634 (95% CI, 0.608–0.661) and 0.585 (95% CI, 0.564–0.605) in the CKB and the SWHS, respectively.


Based only on non-laboratory predictors, our model has a good calibration and moderate discriminating capacity. The model may serve as a useful tool to raise individuals’ awareness and aid risk-stratified screening and prevention strategies.


Breast cancer is the most common and most rapidly increasing female malignancy in China [1]. Compared with developed countries, breast cancer in China is characterized by a rapidly increasing incidence rate, a lower survival rate, and vast geographic variation. The annual percent increase in breast cancer incidence was 4.5% and 9.1% in urban and rural areas of China, respectively [2]. In 2015, there were 304,000 newly diagnosed cases and 70,000 deaths from breast cancer, with an incidence rate of 54.3 per 100,000 in urban areas and 34.5 per 100,000 in rural areas [3]. The 5-year relative survival rates during 2003–2015 ranged only from 73.1% to 82.0% in Chinese women (55.9% to 72.9% for rural women), substantially lower than the 90% reported for American women [4]. Early detection is the cornerstone of preventing morbidity and mortality due to breast cancer. However, it has been impeded by low individual awareness and the lack of a national screening program.

Following the pioneering model derived by Gail et al. in 1989 [5], multiple models have been developed [6]. However, most models were developed in Western populations and may not be applicable to Chinese women; this includes the Gail model modified for Chinese-Americans [7]. A previous meta-analysis showed that these models tended to overestimate the risk of Asian women [8], and some predictors, such as the number of prior breast biopsies, are not available for most Chinese women. Several models have also been developed in China [9,10,11,12,13,14,15]. However, most of them were developed using a case-control design, which is subject to selection and recall bias. Additionally, all these studies were conducted with participants from the eastern provinces of China, where breast cancer incidence rates are higher than those in other areas of China [1]. More importantly, of the seven models, only one, which was developed in Shandong province, has been externally validated, and only in a small cohort with 34 cases. Therefore, a validated breast cancer risk prediction model based on data from Chinese women with good generalizability is both timely and much needed.

In this study, we used data from a large nationwide prospective cohort, the China Kadoorie Biobank (CKB), together with national age- and residence (urban and rural)-specific invasive breast cancer incidence rates and non-breast cancer mortality rates, to develop a risk prediction model that accounts for competing risk. We then used data from another large prospective cohort, the Shanghai Women’s Health Study (SWHS), to independently validate the model.


Data for model development

Data from the CKB, a large-scale prospective study, were used to derive the relative risk (RR) model [16]. The study took place in 10 study sites, 5 in urban areas (Qingdao, Harbin, Haikou, Suzhou, Liuzhou) and 5 in rural areas (Pengzhou, Tianshui, Hui county, Tongxiang, Liuyang) of China. The regions were selected according to local disease patterns, exposure to certain risk factors, population stability, quality of death and disease registries, local commitment, and capacity. Potentially eligible participants were identified through official residential records. Invitation letters (with study information leaflets) were delivered door-to-door by local community leaders or health workers. The estimated population response rate was ~30% (26–38% in the five rural areas and 16–50% in the five urban areas). Overall, a total of 512,715 participants aged 30–79 years, including 302,510 (59.0%) women, were recruited during 2004–2008. All participants completed a questionnaire and had physical measurements taken.

Incident cases of invasive breast cancer and mortality were identified chiefly through the linkage with the national health insurance claim database and disease registries, supplemented with local residential records and annual active confirmation. The International Classification of Diseases, 10th Revision was used to code all breast cancer (C50) by trained staff who were blinded to baseline information. We excluded women who had missing data for any reproductive factors or who provided implausible data on age at menarche or age at first live birth. We further excluded women who reported previous histories of cancer at baseline or had missing data for body mass index (BMI), leaving 300,824 women in the analysis.

Data for external validation

Independent data from the SWHS was used to externally validate the derived model based on CKB data [17]. In brief, 74,942 women were recruited from seven urban communities in Shanghai, China during 1996–2000.

At baseline, all information involved in the current analysis was collected through in-person interviews and anthropometric measures following standard protocol. Incident breast cancer cases (ICD-9 code 174) were identified by a combination of active re-surveys every 2 to 4 years and annual linkage with the Shanghai Cancer Registry and the Shanghai death certificate registry. The cancer diagnosis was verified through home visits and reviews of medical charts obtained from the hospitals where the patients were diagnosed. Applying the same exclusion criteria as the CKB data, 73,203 SWHS participants were included.

Statistical methods

Relative risk prediction model

Participants were considered at risk from enrollment to the diagnosis of invasive breast cancer, death, loss to follow-up, or Dec 31, 2016, whichever came first. A Cox proportional hazards model was used to estimate hazard ratios as the metric of relative risk (RR) for each variable in the model, with age as the timescale, stratified jointly by the 10 study sites and age at enrollment in 5-year intervals (i.e., 100 strata to control for confounding by age and study site).

We initially considered the following variables to construct the model: education, tobacco smoking, alcohol drinking, total physical activity, consumption of soybean, BMI, height, first-degree family history of overall cancer, menopausal status, number of live births, age at menarche, total duration of breastfeeding, and usage of contraceptives. Because we did not collect information on family history of breast cancer, we used the family history of overall cancer as a surrogate. The continuous variables were converted to categorical variables to reduce overfitting. Cutoffs of BMI were chosen according to the well-established criteria for Chinese adults [18]. Quartiles of height were used as cutoffs of height. For other predictors, cutoffs were chosen as those at which the model achieved the smallest Bayesian Information Criterion (BIC). We assessed the proportional hazards assumption using Schoenfeld residuals. In line with previous studies [19, 20], we found that only BMI was subject to time-varying effects. Therefore, we further split follow-up time into two age intervals at 50 years and added an interaction term between attained age and BMI. We first assessed all variables with P < 0.05 together in the model. Variable selection was repeated using stepwise backward elimination, which yielded the same result. The variables were converted to ordinal variables if their RRs were proportional to levels and no evidence of nonlinearity was detected using fractional polynomials. All first-order interactions were tested one by one using the likelihood ratio test comparing models with and without the interaction term. For all variables in the final model, the lowest risk category was regarded as the reference group, to facilitate population attributable risk (PAR) computation.

Given the higher incidence rate of breast cancer in urban areas than in rural areas, we also attempted to build residence (urban/rural)-specific models, i.e., variable selection and estimation of predictor coefficients were done separately in the urban and rural datasets. Interestingly, we found that the relative risks were similar between urban and rural areas, and there was no significant interaction between area and risk factors (see Additional file 1). Therefore, we used the same set of relative risk estimates for all participants in the CKB to maintain model parsimony and to estimate hazard ratios more reliably.

Absolute risk projection

We used an approach similar to that described by Gail et al. to project absolute risk from initial age to final age [5, 21]. Briefly, the absolute risk that a woman who is age a and who has risk factors x will develop breast cancer by age a + τ is

$$ \mathrm{P}\left(a,\tau ,x\right)={\int}_a^{a+\tau }{h}_1\left(t,x\right)\exp \left[-{\int}_a^t\left({h}_1\left(u,x\right)+{h}_2(u)\right)\mathrm{d}u\right]\mathrm{d}t $$

where h1(t, x) is the age-specific hazard of developing breast cancer and h2(t) is the age-specific hazard for competing causes at age t. We can estimate h1(t, x) = h10(t)RR(x) as the product of the age- and residence-specific baseline hazard h10(t) and the relative risk RR(x) from the relative risk model described above. RR(x) is constant across ages for all risk factors x except BMI, which has two different RRs for women < 50 and ≥ 50 years old.
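The integral above can be approximated numerically once the hazards are tabulated by age. The following is a minimal sketch, assuming the hazards are constant within each 1-year age band so that each band has a closed-form contribution; the function name `absolute_risk` and all rates in the example are hypothetical and are not taken from the study or from Additional file 2.

```python
import math


def absolute_risk(age, tau, rr, h10, h2):
    """Approximate P(a, tau, x), holding hazards constant within 1-year age bands.

    age, tau -- integer starting age a and projection length tau, in years
    rr       -- function: age -> RR(x) at that age (lets the BMI RR switch at 50)
    h10      -- function: age -> baseline breast cancer hazard per person-year
    h2       -- function: age -> competing-mortality hazard per person-year
    """
    cancer_free_alive = 1.0  # probability of reaching the start of the band
    risk = 0.0
    for t in range(age, age + tau):
        h1 = h10(t) * rr(t)
        total = h1 + h2(t)
        if total > 0.0:
            # Closed-form integral of h1 * exp(-total * s) for s in [0, 1].
            risk += cancer_free_alive * (h1 / total) * (1.0 - math.exp(-total))
        cancer_free_alive *= math.exp(-total)
    return risk


# Invented rates for illustration only (not the NCCR or Yearbook values).
example = absolute_risk(
    age=45,
    tau=10,
    rr=lambda t: 1.5 if t < 50 else 2.0,  # hypothetical RR(x), changing at age 50
    h10=lambda t: 30e-5,                   # hypothetical baseline hazard
    h2=lambda t: 200e-5,                   # hypothetical competing mortality
)
print(f"10-year absolute risk: {example:.4%}")
```

Passing the relative risk as a function of age is one way to accommodate the two BMI-specific RRs; with no competing mortality and a constant hazard h, the sketch reduces to 1 − exp(−hτ), which is a convenient sanity check.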

To obtain a robust and generalizable model, we calculated the baseline age- and residence-specific hazards h10(t) by multiplying age-specific incidence rates in 2014 from the National Central Cancer Registry of China (NCCR) [22] by one minus the population attributable risk (PAR). The PAR was estimated using the formula described by Bruzzi et al. [23] and can be interpreted as the fraction of breast cancer incidence that would have been avoided during follow-up if all six predictors in the relative risk model (i.e., education, BMI, height, family history of overall cancer, parity, and age at menarche) had taken their lowest risk category. A PAR of 1 indicates that all breast cancer incidence is attributable to these factors, while a PAR of 0 indicates that none is. The distributions of risk factors differed across the four groups defined by attained age (below/above 50 years old) and residence (urban/rural), so we estimated the PAR separately in these four groups. Further, death from causes other than breast cancer, the risk of which increases with age, prevents the occurrence of breast cancer. To account for this competing risk, we calculated age- and residence-specific non-breast cancer mortality rates, h2(t), as age- and residence-specific all-cause mortality rates in 2014 from the Health Statistics Yearbook [24] minus age- and residence-specific breast cancer mortality rates in 2014 from the NCCR. These incidence and mortality rates are listed in Additional file 2.
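As a rough illustration of this step, the sketch below computes a PAR and the resulting baseline hazard. It assumes the commonly quoted summation form of the Bruzzi et al. formula, PAR = 1 − Σ_j ρ_j / RR_j, where ρ_j is the proportion of cases in joint risk-factor stratum j and RR_j is that stratum's relative risk versus the lowest-risk reference; the text cites the method without printing the formula, so this form, the function names, and every number below are assumptions for illustration.

```python
def bruzzi_par(case_fractions, relative_risks):
    """PAR = 1 - sum_j rho_j / RR_j over joint risk-factor strata.

    case_fractions -- proportion of cases in each stratum (must sum to 1)
    relative_risks -- RR of each stratum versus the lowest-risk reference
    """
    if len(case_fractions) != len(relative_risks):
        raise ValueError("one relative risk is needed per stratum")
    if abs(sum(case_fractions) - 1.0) > 1e-9:
        raise ValueError("case fractions must sum to 1")
    return 1.0 - sum(rho / rr for rho, rr in zip(case_fractions, relative_risks))


def baseline_hazard(incidence_rate, par):
    """h10(t) = population incidence rate at age t multiplied by (1 - PAR)."""
    return incidence_rate * (1.0 - par)


# Hypothetical three-stratum example; none of these values come from the study.
par = bruzzi_par(case_fractions=[0.2, 0.5, 0.3], relative_risks=[1.0, 2.0, 3.0])
h10 = baseline_hazard(incidence_rate=60e-5, par=par)
print(f"PAR = {par:.2f}, baseline hazard = {h10 * 1e5:.1f} per 100,000")
```

In the study, this calculation would be repeated in each of the four age-by-residence groups, each with its own case distribution.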

As a sensitivity analysis, we built an absolute risk model using breast cancer incidence rates and non-breast cancer mortality rates from the CKB cohort to assess calibration in the internal validation. As another sensitivity analysis, we built an absolute risk model using breast cancer incidence rates and non-breast cancer mortality rates from Shanghai for the external validation (calibrated model) to evaluate whether robust local rates, if available, can improve model performance.


The above development process was first carried out using the whole CKB dataset and then repeated in a random two-thirds of the CKB data (derivation subcohort). We found that the same set of predictors was selected and that the RRs for the predictors were similar in the two datasets (Additional file 3). We used a data-splitting approach for internal validation, i.e., the model was fitted to a random two-thirds of the CKB data and evaluated on the remaining one-third (test subcohort). To obtain more precise estimates of model parameters, we still used the model developed from the whole CKB dataset for external validation in the SWHS dataset.

We assessed calibration by comparing the expected number of breast cancer cases (E) with the observed number (O), overall and for subgroups defined by predictors. A calibration plot was drawn to examine the agreement across deciles of predicted risk in the total population. The projected probability of breast cancer was calculated from the age at enrollment to the younger of either the age at last follow-up or the age on Dec 31, 2016, for the CKB participants or Dec 31, 2014, for the SWHS participants. The 10-year projected risk was also estimated. The 95% confidence intervals (CIs) of E/O ratios were calculated based on the Poisson distribution. An E/O ratio above one indicates that the model overestimates cancer risk, and an E/O ratio below one indicates that the model underestimates cancer risk.

Discrimination was quantified by calculating the area under the receiver-operating characteristic curve (AUC), also known as the c-statistic, for the 10-year risk model. The age- and residence-adjusted AUC was also assessed to eliminate the effect of age and residence. A higher AUC indicates higher discriminating ability; random classification results in an AUC of 0.5 and perfect discrimination in an AUC of 1. To further assess the discriminating accuracy, we estimated the RRs comparing different quintiles of predicted risk. We also estimated a range of performance indices corresponding to a series of cut-offs ranging from 0.4% to 2% in both the CKB and the SWHS. The indices included the percentage of the population classified as high risk, sensitivity, specificity, positive/negative predictive value (PPV/NPV), and the number needed to be screened to confirm one case in the next 10 years (NNS, one divided by the PPV).
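The discrimination and cut-off indices described above can be sketched as follows. This is an illustrative implementation only: it computes the unadjusted AUC as the probability that a case has a higher predicted risk than a non-case (ties counted as one half) and does not reproduce the age- and residence-adjusted AUC; treating a predicted risk at or above the cut-off as "high risk" is an assumption, and the function names and data are hypothetical.

```python
def auc(risks, events):
    """Unadjusted AUC: P(case risk > non-case risk), with ties counted as 1/2."""
    cases = [r for r, e in zip(risks, events) if e]
    noncases = [r for r, e in zip(risks, events) if not e]
    wins = 0.0
    for c in cases:
        for n in noncases:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cases) * len(noncases))


def cutoff_metrics(risks, events, cutoff):
    """Screening indices when women with risk >= cutoff are labelled high risk.

    Assumes every cell of the 2x2 table is non-empty (no zero denominators).
    """
    pairs = list(zip(risks, events))
    tp = sum(1 for r, e in pairs if r >= cutoff and e)
    fp = sum(1 for r, e in pairs if r >= cutoff and not e)
    fn = sum(1 for r, e in pairs if r < cutoff and e)
    tn = sum(1 for r, e in pairs if r < cutoff and not e)
    ppv = tp / (tp + fp)
    return {
        "high_risk_fraction": (tp + fp) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": ppv,
        "npv": tn / (tn + fn),
        "nns": 1.0 / ppv,  # number needed to screen to confirm one case
    }


# Invented 10-year predicted risks and outcomes for six women (1 = case).
risks = [0.001, 0.004, 0.006, 0.012, 0.015, 0.020]
events = [0, 0, 1, 0, 1, 1]
print(auc(risks, events))
print(cutoff_metrics(risks, events, cutoff=0.01))
```

The pairwise loop is quadratic in cohort size, which is fine for a toy example; for cohorts of the size used here a rank-based computation would be the practical choice.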

The calculation of absolute risk was performed using SAS (version 9.4, SAS Institute Inc.), and all other statistical analyses were performed using Stata (version 14, StataCorp).


Of the 300,824 women in the CKB cohort included in the RR model development, the mean age at recruitment was 51.4 years. Compared with those in rural areas, women in urban areas were older, more educated, more overweight or obese, taller, and were more likely to have positive overall cancer family history, early age at menarche, and less likely to have multiple children (Table 1). Compared with women in urban areas of the CKB, women in the SWHS had similar ages, BMI, and number of live births, but tended to be more educated, taller, to have more relatives diagnosed with cancer, and to have an earlier age at menarche.

Table 1 Baseline characteristics of women by residence and dataset in China Kadoorie Biobank (CKB) and Shanghai Women’s Health Study (SWHS)

During a median of 10.2 years of follow-up in the CKB, 2287 women developed invasive breast cancer. The final age- and study site-stratified model included education, BMI, height, family history of cancer, parity, and age at menarche (Table 2). The association between BMI and breast cancer risk was non-significant in women younger than 50 years and positive in women above this age (the test for interaction was significant). No other significant interaction between predictors was found. Based on the relative risk model and the distribution of risk factors, the PARs estimated in urban areas were 0.74 for women younger than 50 years and 0.76 for women 50 years and older. The corresponding PAR estimates in rural areas were 0.63 and 0.65, indicating that a smaller fraction of cases was attributable to the six predictors in the relative risk model in rural areas.

Table 2 Age- and study site-stratified RR (95% CI) for breast cancer in China Kadoorie Biobank

Of the 73,203 women in the SWHS, 1409 were diagnosed with breast cancer during a median of 16.1 years of follow-up. The CKB model predicted 1320 cases in the SWHS, yielding an E/O of 0.94 (95% CI, 0.89 to 0.99). The number of cases was statistically significantly underestimated among women aged 60 years and older, women with lower education, women shorter than 150.2 cm, women without family history of overall cancer, women with multiple live births, and women with age at menarche at 15–16 years. The model statistically significantly overestimated risk for women with 2 or more affected first-degree relatives. For all other categories, there was good agreement between the observed and predicted number of breast cancers (Table 3). The calibration plot showed agreement across deciles of predicted risk, except for the second-lowest decile (Fig. 1b). We further recalculated the absolute risk using Shanghai local rates and found a better calibration, with an E/O (95% CI) overall of 1.01 (0.96–1.06) (see Additional file 4).
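For readers who want to check the overall SWHS calibration figures, the sketch below recomputes the E/O ratio from the expected (1320) and observed (1409) counts given above. The methods state only that the intervals were based on the Poisson distribution; the log-scale approximation E/O × exp(±1.96 × √(1/O)) used here is an assumption, although it agrees with the reported interval for these counts. The function name `eo_ratio_ci` is hypothetical.

```python
import math


def eo_ratio_ci(expected, observed, z=1.96):
    """E/O ratio with an approximate CI, treating O as Poisson and E as fixed.

    Uses the log-scale approximation: E/O * exp(+/- z * sqrt(1 / O)).
    """
    ratio = expected / observed
    half_width = z * math.sqrt(1.0 / observed)
    return ratio, ratio * math.exp(-half_width), ratio * math.exp(half_width)


# Overall SWHS counts quoted in the text: E = 1320, O = 1409.
ratio, lower, upper = eo_ratio_ci(expected=1320, observed=1409)
print(f"E/O = {ratio:.2f} (95% CI, {lower:.2f}-{upper:.2f})")
# prints: E/O = 0.94 (95% CI, 0.89-0.99)
```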

Table 3 Expected and observed number of breast cancer in test subcohort of China Kadoorie Biobank (CKB) and Shanghai Women’s Health Study (SWHS)
Fig. 1 Area under the receiver-operating characteristic curve (AUC) and calibration plot for 10-year breast cancer risk model. Test subcohort of China Kadoorie Biobank (a, c). Shanghai Women’s Health Study (b, d)

As a reference, we also present calibration results for the test subcohort of the CKB study (Table 3 and Fig. 1a). Overall, the CKB model predicted 760 cases in the CKB test subcohort, yielding an E/O (95% CI) of 1.01 (0.94–1.09). The model statistically significantly overestimated the risk of women in rural areas but underestimated the risk in urban areas. In the sensitivity analysis, we recalculated the absolute risk using CKB rates (see Additional file 4), and found the calibrated E/Os were 1.03 (0.95–1.13) and 0.99 (0.88–1.12) for participants in the urban and rural areas, respectively.

Discriminating accuracy of the 10-year risk model is presented in Table 4 and Fig. 1c, d. The overall AUC was 0.658 (95% CI, 0.631–0.684) in the CKB test subcohort and attenuated to 0.634 (95% CI, 0.608–0.661) after adjusting for age and residence. External validation resulted in an overall unadjusted AUC of 0.573 (95% CI, 0.553–0.593) and an age-adjusted AUC of 0.585 (95% CI, 0.564–0.605).

Table 4 Discrimination of the CKB 10-year prediction model in the test subcohort of China Kadoorie Biobank (CKB) and Shanghai Women’s Health Study (SWHS)

Compared with women in the lowest quintile of 10-year predicted risk, the adjusted RR for women in the highest quintile was 6.74 in the CKB (95% CI, 4.57–9.92) and 2.55 in the SWHS (95% CI, 2.06–3.16) (Table 5). Larger RRs were observed in women aged 50 years and older and in women in urban areas. The stratifying efficiency of our model at different 10-year predicted risk cut-offs in the CKB and SWHS is shown in Additional files 5 and 6.

Table 5 Age- and residence-adjusted RR (95% CI) by quantiles of predicted risk in the test subcohort of China Kadoorie Biobank (CKB) and Shanghai Women’s Health Study (SWHS)


We developed a prediction model for invasive breast cancer among Chinese women aged 30 years and older using data from a large nationwide prospective cohort and validated its performance in an independent cohort in Shanghai. The model includes six factors in the relative risk prediction (education, BMI, height, family history of overall cancer, parity, and age at menarche) and two additional factors in the absolute risk prediction (age and residence area). The model was well-calibrated in both the CKB and SWHS cohorts, though there was under- or overestimation of risk in some risk factor strata. After eliminating the effect of age and residence, the adjusted AUCs were 0.634 and 0.585 in the CKB and SWHS, respectively, which are comparable with those of some previous externally validated models [9, 25].

Overall, our model fit well in the CKB and underestimated the risk of urban women in the SWHS by 6%. To achieve good generalizability, we applied China’s national age- and residence (urban/rural)-specific rates in the absolute risk calculation, instead of regional rates as in previous studies in China [9,10,11,12,13,14,15]. Therefore, the agreement of the national rates with rates in the validation datasets may play a major role in the calibration. The CKB’s cancer incidence and mortality rates were consistent with national rates during 2008–2013 [26], resulting in the excellent calibration in the CKB. Despite the overall concordance, the model overestimated the risk of women in rural areas but underestimated the risk in urban areas, reflecting higher incidence rates in urban areas and lower rates in rural areas in the CKB cohort than the corresponding national rates (see Additional file 2). Interestingly, although the SWHS women were recruited around 10 years before the CKB in Shanghai, one of the most developed cities in China, the CKB model still provided acceptable calibration in the SWHS cohort. The slight underestimation was caused by higher incidence rates of breast cancer in Shanghai. In our sensitivity analyses recalculating the absolute risk using local rates, the above-mentioned calibration errors diminished, confirming that our relative risk model was robust and the errors were solely caused by the mismatch between national rates and local rates (see Additional file 4). A previous meta-analysis showed that the Asian American Breast Cancer Study model (AABCS), or Gail model for Asian Americans, overestimated breast cancer risk for Asian women (pooled E/O = 1.82, 95% CI 1.31–2.51) [7, 8]. This overestimation was also observed in a recent cohort study in China (E/O = 2.39, 95% CI 1.71–3.46) [9]. Similarly, we applied the AABCS model to the CKB and SWHS data and found E/O ratios of 1.89 (95% CI, 1.82–1.97) and 1.16 (95% CI, 1.10–1.23) for the CKB and SWHS, respectively. We further recalibrated the AABCS model using rates from China and still found miscalibration overall (CKB: E/O [95% CI], 0.94 [0.90–0.98]; SWHS: 0.67 [0.63–0.71]) and for most subgroups defined by the predicted risk deciles (see Additional file 7).

In the external validation, we found a moderate AUC of 0.585, which was better than or equivalent to those of the AABCS model [8, 9, 25]. Matsuno et al. reported that the AUC of the AABCS model (including age at menarche, age at first live birth, number of affected mothers, sisters, and daughters with breast cancer, and number of previous benign biopsies) was 0.614 (95% CI 0.587–0.640) in the validation among Asian-Americans [7], but the AUC decreased to 0.54 in two independent validations conducted in China [9] and Korea [25]. We found that the age- and residence-adjusted AUCs of both the original AABCS model and the calibrated AABCS model in the CKB and the SWHS data were all around 0.54 (see Additional file 7). To our knowledge, only one model developed in China has been externally validated, with a higher AUC (0.64, 95% CI 0.55–0.72), but the small number of cases in its validation set and the shared location of its derivation and validation sets limited the robustness of the results [9]. Although several models in China achieved statistically significantly higher AUCs by additionally including genetic information, the lack of external validation precludes direct comparison with our model [11, 14, 15].

The development of the CKB risk prediction model has several public health implications. First, our model, with its moderate discriminating ability and good calibration, can facilitate allocation of preventive resources under monetary and medical constraints and aid risk-based screening strategies [27]. China’s 2019 breast cancer screening guidelines recommended an opportunity for screening for women with average risk aged 40–44 years and biennial screening for women aged 45–69 years, which is mainly done by mammography and supplemented with breast ultrasonography and magnetic resonance imaging [28]. However, such an age-based screening strategy ignores the large variation in breast cancer risk in the population [29]. Given the limited medical and economic resources in China, it is more cost-effective to adopt a risk-based screening strategy that allocates resources to intensive screening for women at high risk and less frequent screening for women at low risk. Second, at the individual level, our model can be used for individual risk counseling and to promote a healthy lifestyle. Knowing their own cancer risk may motivate obese women to lose weight. Third, as described by Gail et al., our model can also aid in designing preventive trials and estimating the absolute burden in a specific population [27].

Our study has several strengths. We used data from the largest nationwide prospective cohort study in China to develop the relative risk model, augmented it with China’s national incidence and mortality rates, and validated it in another large prospective cohort study. These methods help make our model robust and potentially generalizable to both rural and urban areas in China. Also, all predictors in the model are non-invasive and easy to measure at low cost, which makes the model easily applicable to the general population. We plan to develop an online risk calculator to promote its use.

However, our study has several limitations. First, several established risk factors were not included in the model. Although several studies included alcohol [29,30,31], the low prevalence of alcohol intake in the CKB (see Table 1) precluded its inclusion. Additionally, we did not have data on family history of breast cancer, so we used a family history of all cancers as a surrogate to capture the inherited susceptibility to breast cancer as much as possible. This surrogate may not be accurate, which may explain why the risk was overestimated in women with two or more family members having cancers. The history of benign breast diseases was not collected in the CKB, and we think it might not be reliably collected in the general Chinese population. Second, cumulative evidence has shown heterogeneous associations of epidemiological factors with estrogen receptor (ER)-specific breast cancer, though some factors are common to both ER-positive and ER-negative breast cancers [32, 33]. We did not build ER-specific models due to the lack of information on subtypes of breast cancer in the current database of the CKB cohort. Since the majority of breast cancers in Chinese women are ER-positive (80.3% in women < 50 years and 76.8% in women 50 or older) [34], our model might primarily apply to ER-positive breast cancer. Finally, we only externally validated our model in urban Shanghai, which has one of the highest incidence rates in China. Therefore, further validation of our model in other regions, especially in rural regions, is still needed.


In summary, we have developed and validated a breast cancer risk prediction model that only relies on non-laboratory predictors. The model has a good calibration and a moderate discriminating capacity. The model may serve as a useful tool to raise individuals’ awareness and to identify women who may benefit from breast cancer screening in China. To improve the model discriminating accuracy, further studies can add genetic and epigenetic predictors for breast cancer, as well as mammographic density. Validation of our model in other regions of China, especially rural areas, is also desirable to evaluate the robustness of the CKB model.

Availability of data and materials

Details of how to access China Kadoorie Biobank data and details of the data release schedule are available from



Abbreviations

AABCS: Asian American Breast Cancer Study model
AUC: Area under the receiver-operating characteristic curve
BIC: Bayesian Information Criterion
BMI: Body mass index
CI: Confidence interval
CKB: China Kadoorie Biobank
E: Expected number of breast cancer cases
ICD-10: International Classification of Diseases, 10th Revision
NCCR: National Central Cancer Registry
NNS: Numbers needed to be screened to confirm one case
NPV: Negative predictive value
O: Observed number of breast cancer cases
PPV: Positive predictive value
RR: Relative risk
SWHS: Shanghai Women’s Health Study
PAR: Population attributable risk


  1. Chen W, Zheng R, Baade PD, Zhang S, Zeng H, Bray F, et al. Cancer statistics in China, 2015. CA Cancer J Clin. 2016;66(2):115–32.

    Article  PubMed  Google Scholar 

  2. Sun KX, Zheng RS, Gu XY, Zhang SW, Zeng HM, Zou XN, et al. Incidence trend and change in the age distribution of female breast cancer in cancer registration areas of China from 2000 to 2014. Zhonghua Yu Fang Yi Xue Za Zhi. 2018;52(6):567–72.

    Article  CAS  PubMed  Google Scholar 

  3. Zheng RS, Sun KX, Zhang SW, Zeng HM, Zou XN, Chen R, et al. Report of cancer epidemiology in China, 2015. Zhonghua Zhong Liu Za Zhi. 2019;41(1):19–28.

    CAS  PubMed  Google Scholar 

  4. Siegel RL, Miller KD, Jemal A. Cancer statistics, 2019. CA Cancer J Clin. 2019;69(1):7–34.

    Article  PubMed  Google Scholar 

  5. Gail MH, Brinton LA, Byar DP, Corle DK, Green SB, Schairer C, et al. Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. J Natl Cancer Inst. 1989;81(24):1879–86.

    Article  CAS  PubMed  Google Scholar 

  6. Cintolo-Gonzalez JA, Braun D, Blackford AL, Mazzola E, Acar A, Plichta JK, et al. Breast cancer risk models: a comprehensive overview of existing models, validation, and clinical applications. Breast Cancer Res Treat. 2017;164(2):263–84.

    Article  PubMed  Google Scholar 

  7. Matsuno RK, Costantino JP, Ziegler RG, Anderson GL, Li H, Pee D, et al. Projecting individualized absolute invasive breast cancer risk in Asian and Pacific Islander American women. J Natl Cancer Inst. 2011;103(12):951–61.

    Article  PubMed  PubMed Central  Google Scholar 

  8. Wang X, Huang Y, Li L, Dai H, Song F, Chen K. Assessment of performance of the Gail model for predicting breast cancer risk: a systematic review and meta-analysis with trial sequential analysis. Breast Cancer Res. 2018;20(1):18.

    Article  PubMed  PubMed Central  Google Scholar 

  9. Wang L, Liu L, Lou Z, Ding L, Guan H, Wang F, et al. Risk prediction for breast cancer in Han Chinese women based on a cause-specific Hazard model. BMC Cancer. 2019;19(1):128.

    Article  PubMed  PubMed Central  Google Scholar 

  10. Wu F, He D, Zhao G, Fang H, Xu W. Risk factors of breast cancer and a risk predictive model for Chinese women in Shanghai, China. Chin J Cancer Prev Treat. 2017;24(12):795–801,807.

  11. Hsieh YC, Tu SH, Su CT, Cho EC, Wu CH, Hsieh MC, et al. A polygenic risk score for breast cancer risk in a Taiwanese population. Breast Cancer Res Treat. 2017;163(1):131–8.

  12. Wang F, Dai J, Li M, Chan WC, Kwok CC, Leung SL, et al. Risk assessment model for invasive breast cancer in Hong Kong women. Medicine (Baltimore). 2016;95(32):e4515.

  13. Wang Y, Gao Y, Battsend M, Chen K, Lu W, Wang Y. Development of a risk assessment tool for projecting individualized probabilities of developing breast cancer for Chinese women. Tumour Biol. 2014;35(11):10861–9.

  14. Dai J, Hu Z, Jiang Y, Shen H, Dong J, Ma H, et al. Breast cancer risk assessment with five independent genetic variants and two risk factors in Chinese women. Breast Cancer Res. 2012;14(1):R17.

  15. Zheng W, Wen W, Gao YT, Shyr Y, Zheng Y, Long J, et al. Genetic and clinical predictors for breast cancer risk assessment and stratification among Chinese women. J Natl Cancer Inst. 2010;102(13):972–81.

  16. Chen Z, Chen J, Collins R, Guo Y, Peto R, Wu F, et al. China Kadoorie Biobank of 0.5 million people: survey methods, baseline characteristics and long-term follow-up. Int J Epidemiol. 2011;40(6):1652–66.

  17. Zheng W, Chow WH, Yang G, Jin F, Rothman N, Blair A, et al. The Shanghai Women’s Health Study: rationale, study design, and baseline characteristics. Am J Epidemiol. 2005;162(11):1123–31.

  18. National Health and Family Planning Commission of the People's Republic of China. Criteria of weight for adults (WS/T 428–2013). Beijing: Standards Press of China; 2013.

  19. van den Brandt PA, Spiegelman D, Yaun SS, Adami HO, Beeson L, Folsom AR, et al. Pooled analysis of prospective cohort studies on height, weight, and breast cancer risk. Am J Epidemiol. 2000;152(6):514–27.

  20. World Cancer Research Fund/American Institute for Cancer Research. Continuous Update Project Expert Report 2018. Diet, nutrition, physical activity and oesophageal cancer.

  21. Pfeiffer RM, Park Y, Kreimer AR, Lacey JV Jr, Pee D, Greenlee RT, et al. Risk prediction for breast, endometrial, and ovarian cancer in white women aged 50 y or older: derivation and validation from population-based cohort studies. PLoS Med. 2013;10(7):e1001492.

  22. Li H, Zheng RS, Zhang SW, Zeng HM, Sun KX, Xia CF, et al. Incidence and mortality of female breast cancer in China, 2014. Chin J Oncol. 2018;40(3):166–71.

  23. Bruzzi P, Green SB, Byar DP, Brinton LA, Schairer C. Estimating the population attributable risk for multiple risk factors using case-control data. Am J Epidemiol. 1985;122(5):904–14.

  24. National Health and Family Planning Commission of the People's Republic of China. Health Statistics Yearbook (2015). Beijing, China: Peking Union Medical College Press; 2015.

  25. Min JW, Chang MC, Lee HK, Hur MH, Noh DY, Yoon JH, et al. Validation of risk assessment models for predicting the incidence of breast cancer in Korean women. J Breast Cancer. 2014;17(3):226–35.

  26. Pan R, Zhu M, Yu C, Lv J, Guo Y, Bian Z, et al. Cancer incidence and mortality: a cohort study in China, 2008–2013. Int J Cancer. 2017;141(7):1315–23.

  27. Gail MH, Pfeiffer RM. Breast cancer risk model requirements for counseling, prevention, and screening. J Natl Cancer Inst. 2018;110(9):994–1002.

  28. China Anti-Cancer Association, National Clinical Research Center for Cancer. Breast cancer screening guideline for Chinese Women. Cancer Biol Med. 2019;16(4):822–4.

  29. Maas P, Barrdahl M, Joshi AD, Auer PL, Gaudet MM, Milne RL, et al. Breast cancer risk from modifiable and nonmodifiable risk factors among white women in the United States. JAMA Oncol. 2016;2(10):1295–302.

  30. Wang S, Ogundiran T, Ademola A, Olayiwola OA, Adeoye A, Sofoluwe A, et al. Development of a breast cancer risk prediction model for women in Nigeria. Cancer Epidemiol Biomarkers Prev. 2018;27(6):636–43.

  31. Petracci E, Decarli A, Schairer C, Pfeiffer RM, Pee D, Masala G, et al. Risk factor modification and projections of absolute breast cancer risk. J Natl Cancer Inst. 2011;103(13):1037–48.

  32. Colditz GA, Rosner BA, Chen WY, Holmes MD, Hankinson SE. Risk factors for breast cancer according to estrogen and progesterone receptor status. J Natl Cancer Inst. 2004;96(3):218–28.

  33. Yang XR, Chang-Claude J, Goode EL, Couch FJ, Nevanlinna H, Milne RL, et al. Associations of breast cancer risk factors with tumor subtypes: a pooled analysis from the Breast Cancer Association Consortium studies. J Natl Cancer Inst. 2011;103(3):250–63.

  34. Zhu X, Ying J, Wang F, Wang J, Yang H. Estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2 status in invasive breast cancer: a 3,198 cases study at National Cancer Center, China. Breast Cancer Res Treat. 2014;147(3):551–5.


The most important acknowledgment is to the participants in the study and the members of the survey teams in each of the 10 regional centres, as well as to the project development and management teams based at Beijing, Oxford and the 10 regional centres.

Members of the China Kadoorie Biobank collaborative group

International Steering Committee: Junshi Chen, Zhengming Chen (PI), Robert Clarke, Rory Collins, Yu Guo, Liming Li (PI), Jun Lv, Richard Peto, Robin Walters. International Co-ordinating Centre, Oxford: Daniel Avery, Ruth Boxall, Derrick Bennett, Yumei Chang, Yiping Chen, Zhengming Chen, Robert Clarke, Huaidong Du, Simon Gilbert, Alex Hacker, Mike Hill, Michael Holmes, Andri Iona, Christiana Kartsonaki, Rene Kerosi, Ling Kong, Om Kurmi, Garry Lancaster, Sarah Lewington, Kuang Lin, John McDonnell, Iona Millwood, Qunhua Nie, Jayakrishnan Radhakrishnan, Paul Ryder, Sam Sansome, Dan Schmidt, Paul Sherliker, Rajani Sohoni, Becky Stevens, Iain Turnbull, Robin Walters, Jenny Wang, Lin Wang, Neil Wright, Ling Yang, Xiaoming Yang. National Co-ordinating Centre, Beijing: Zheng Bian, Yu Guo, Xiao Han, Can Hou, Jun Lv, Pei Pei, Chao Liu, Canqing Yu. 10 Regional Co-ordinating Centres: Qingdao CDC: Zengchang Pang, Ruqin Gao, Shanpeng Li, Shaojie Wang, Yongmei Liu, Ranran Du, Yajing Zang, Liang Cheng, Xiaocao Tian, Hua Zhang, Yaoming Zhai, Feng Ning, Xiaohui Sun, Feifei Li. Licang CDC: Silu Lv, Junzheng Wang, Wei Hou. Heilongjiang Provincial CDC: Mingyuan Zeng, Ge Jiang, Xue Zhou. Nangang CDC: Liqiu Yang, Hui He, Bo Yu, Yanjie Li, Qinai Xu, Quan Kang, Ziyan Guo. Hainan Provincial CDC: Dan Wang, Ximin Hu, Jinyan Chen, Yan Fu, Zhenwang Fu, Xiaohuan Wang. Meilan CDC: Min Weng, Zhendong Guo, Shukuan Wu, Yilei Li, Huimei Li, Zhifang Fu. Jiangsu Provincial CDC: Ming Wu, Yonglin Zhou, Jinyi Zhou, Ran Tao, Jie Yang, Jian Su. Suzhou CDC: Fang Liu, Jun Zhang, Yihe Hu, Yan Lu, Liangcai Ma, Aiyu Tang, Shuo Zhang, Jianrong Jin, Jingchao Liu. Guangxi Provincial CDC: Zhenzhu Tang, Naying Chen, Ying Huang. Liuzhou CDC: Mingqiang Li, Jinhuai Meng, Rong Pan, Qilian Jiang, Jian Lan, Yun Liu, Liuping Wei, Liyuan Zhou, Ningyu Chen, Ping Wang, Fanwen Meng, Yulu Qin, Sisi Wang. Sichuan Provincial CDC: Xianping Wu, Ningmei Zhang, Xiaofang Chen, Weiwei Zhou. Pengzhou CDC: Guojin Luo, Jianguo Li, Xiaofang Chen, Xunfu Zhong, Jiaqiu Liu, Qiang Sun. Gansu Provincial CDC: Pengfei Ge, Xiaolan Ren, Caixia Dong. Maiji CDC: Hui Zhang, Enke Mao, Xiaoping Wang, Tao Wang, Xi Zhang. Henan Provincial CDC: Ding Zhang, Gang Zhou, Shixian Feng, Liang Chang, Lei Fan. Huixian CDC: Yulian Gao, Tianyou He, Huarong Sun, Pan He, Chen Hu, Xukui Zhang, Huifang Wu, Pan He. Zhejiang Provincial CDC: Min Yu, Ruying Hu, Hao Wang. Tongxiang CDC: Yijian Qian, Chunmei Wang, Kaixu Xie, Lingli Chen, Yidan Zhang, Dongxia Pan, Qijun Gu. Hunan Provincial CDC: Yuelong Huang, Biyun Chen, Li Yin, Huilin Liu, Zhongxi Fu, Qiaohua Xu. Liuyang CDC: Xin Xu, Hao Zhang, Huajun Long, Xianzhi Li, Libo Zhang, Zhe Qiu.


This work was supported by the National Natural Science Foundation of China (91846303), and DH was supported by the Breast Cancer Research Foundation. The CKB baseline survey and the first re-survey were supported by a grant from the Kadoorie Charitable Foundation in Hong Kong. The long-term follow-up is supported by grants (2016YFC0900500, 2016YFC0900501, 2016YFC0900504) from the National Key R&D Program of China, the National Natural Science Foundation of China (81390540, 81390541, 81390544), and the Chinese Ministry of Science and Technology (2011BAI09B01). The SWHS was funded by the National Institutes of Health/National Cancer Institute (UM1 CA182910 and R37CA70867).

Author information


LL, DH, JL, and CY conceived and designed the study. LL, ZC, and JC, as members of the CKB steering committee, designed and supervised the conduct of the CKB study, obtained funding, and, together with JL, YG, ZB, HD, LY, YC, HG, and PL, acquired the data for the CKB study. WZ, YG, YX, and XS designed and supervised the conduct of the SWHS. YTH and YZH analyzed the CKB data, and DH, WW, and FZ analyzed the SWHS data. YTH wrote the first draft of the manuscript. LL and DH contributed to the interpretation of the results and critical revision of the manuscript for important intellectual content and approved the final version of the manuscript. All authors reviewed and approved the final manuscript. LL and DH are the guarantors.

Corresponding authors

Correspondence to Dezheng Huo or Liming Li.

Ethics declarations

Ethics approval and consent to participate

The study protocol of the CKB was approved by the Ethics Review Committee of the Chinese Center for Disease Control and Prevention (Beijing, China: 005/2004) and the Oxford Tropical Research Ethics Committee, University of Oxford (UK: 025–04). All participants provided written informed consent before taking part in the study. Written informed consent was obtained from all participants of the SWHS and the SWHS study was approved by the institutional review boards at all participating institutions.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Show comparison of Age-adjusted RR (95% CI) for breast cancer among women in urban and rural areas of China Kadoorie Biobank.

Additional file 2.

Show age- and residence-specific breast cancer incidence rates and mortality rates of non-breast cancer per 100,000 person-years by data sources.

Additional file 3.

Show age- and site-adjusted RR (95% CI) from the derivation subcohort and the whole China Kadoorie Biobank.

Additional file 4.

Show expected and observed number of breast cancer in the test subcohort of China Kadoorie Biobank and Shanghai Women’s Health Study using the corresponding local rates.

Additional file 5.

Show performance of the breast cancer prediction model across different predicted risk cutoffs in the China Kadoorie Biobank.

Additional file 6.

Show performance of the breast cancer prediction model across different predicted risk cutoffs in the Shanghai Women's Health Study.

Additional file 7.

Show validation of the Asian American Breast Cancer Study model for predicting individual breast cancer risk in China Kadoorie Biobank and Shanghai Women's Health Study.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Han, Y., Lv, J., Yu, C. et al. Development and external validation of a breast cancer absolute risk prediction model in Chinese population. Breast Cancer Res 23, 62 (2021).
