Evaluation and comparison of different breast cancer prognosis scores based on gene expression data
Breast Cancer Research volume 25, Article number: 17 (2023)
Breast cancer is one of the three most common cancers worldwide and is the most common malignancy in women. Treatment approaches for breast cancer are diverse. Clinicians must balance risks and benefits when deciding treatments, and models have been developed to support this decision-making. Genomic risk scores (GRSs) may offer greater clinical value than standard clinicopathological models, but there is limited evidence as to whether these models perform better than the current clinical standard of care.
PREDICT and GRSs were adapted using data from the original papers. Univariable Cox proportional hazards models were produced with breast cancer-specific survival (BCSS) as the outcome. Independent predictors of BCSS were used to build multivariable models with PREDICT. Signatures which provided independent prognostic information in multivariable models were incorporated into the PREDICT algorithm and assessed for calibration, discrimination and reclassification.
EndoPredict, MammaPrint and Prosigna demonstrated prognostic power independent of PREDICT in multivariable models for ER-positive patients; no score predicted BCSS in ER-negative patients. Incorporating these models into PREDICT had only a modest impact upon calibration (with absolute improvements of 0.2–0.8%), discrimination (with no statistically significant c-index improvements) and reclassification (with 4–10% of patients being reclassified).
Addition of GRSs to PREDICT had limited impact on model fit or treatment received. This analysis does not support widespread adoption of current GRSs based on our implementations of commercial products.
Breast carcinoma is an unregulated growth of cells within the functional units of breast epithelium. It is one of the three most common cancers worldwide (with an estimated 2.26 million new cases in 2020) and the most common malignancy in women [2, 3]. Treatment approaches for breast cancer are diverse: in the UK between 2013 and 2014, 81% of breast cancer patients had surgery as part of their primary treatment regimen, 63% had radiotherapy, 34% had chemotherapy, and 62% had endocrine therapy.
Clinicians managing breast cancer must balance the risks of treatment against the potential benefits. Prognostic scores have been developed to assist in these predictions. These include clinical risk scores and genomic risk scores (GRSs). PREDICT is one example of a clinical score. It is based on a multivariable Cox proportional hazards model incorporating patient age, tumour size, tumour grade, tumour protein expression (ER, HER2 and KI67), positive nodes and mode of diagnosis [6, 7]. It provides estimates of absolute treatment benefit for hormone therapy, chemotherapy, adjuvant trastuzumab and bisphosphonate therapy. PREDICT is recommended by the National Institute for Health and Care Excellence (NICE) as a tool for supporting clinical decisions on adjuvant treatment benefit and has been endorsed by the American Joint Committee on Cancer. The underlying model is flexible, enabling additional biomarkers to be incorporated.
GRSs (also called genomic prognostic scores) based on RNA gene expression data were developed in response to the concern that clinicopathological features are imperfect estimators of disease risk and chemosensitivity. They have the theoretical advantages of optimal use of continuous variables and added robustness (e.g. by gathering information on ER activity through a cluster of genes). Many GRSs have been developed, but few are commercially available and fewer still are endorsed by clinical bodies. At present, only Oncotype DX, EPClin and Prosigna have been approved for use in clinical practice in the UK under specific circumstances. Another signature, MammaPrint, was deemed clinically effective but not cost-effective.
Several key metrics are used to assess model fit, including calibration, discrimination and reclassification. Calibration is defined as the agreement between observed outcomes and predictions, often presented as an absolute difference in values. Discrimination is a model’s ability to differentiate between those with and without the outcome, typically expressed using c-indices. Reclassification refers to the movement of individuals between risk categories with the introduction of a new prediction model (or extension of a model through the addition of new variables) [17, 19]. Even if the calibration or discrimination of a model does not change, a change in risk categories may result in an individual receiving different treatment according to clinical guidelines.
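To make the discrimination metric concrete, the following is a minimal Python sketch of Harrell's c-index for right-censored data. It is an illustration only, not the implementation used in this study; the function name and the simplified handling of tied follow-up times are assumptions.

```python
from itertools import combinations

def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times: follow-up time per patient
    events: 1 if the event (e.g. breast cancer death) was observed, 0 if censored
    risk_scores: higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient a has the shorter follow-up.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # Simplification (assumption): pairs with tied times are skipped, and a
        # pair is comparable only if the earlier time is an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half-concordant
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking of patients by risk.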
There is a paucity of evidence comparing GRSs to current validated clinical scores. Previous studies comparing GRSs to clinicopathological scores use models less comprehensive than PREDICT and tend to reduce continuous variables into categories [20,21,22,23,24,25,26,27,28].
This study aims to assess whether GRSs provide any additional clinical benefit beyond PREDICT, the current standard of care, using a head-to-head comparison in an external cohort. It also analyses the impact on model fit of incorporating GRSs into the PREDICT algorithm. GRSs included for comparison were those referred to by NICE in their most recent guideline: EPClin, Oncotype DX, Prosigna and MammaPrint.
Material and methods
Linked clinical and gene expression data were obtained from the METABRIC study [29, 30], described in detail elsewhere. The hazard ratio (HR) functions from PREDICT version 2.1 were used in this analysis. Surrogate KI67 status was calculated using gene expression data for MKI67, the gene which codes the KI67 protein, using the mclust package. Proportions of KI67 status grouped by cancer grade, stage, number of lymph nodes positive and hormonal status were similar to those previously reported.
Four GRSs were adapted in line with the specifications of their respective papers: EndoPredict, Oncotype DX (ODx), Prosigna and MammaPrint. Code was adapted from the genefu R package to make it suitable for use on z-score normalised expression data. 10-year breast cancer-specific survival (BCSS) was the outcome of interest, defined as the percentage of patients who did not die from breast cancer over ten years.
Building Cox proportional hazards models
Cox proportional hazards models were built using the survival package. The primary outcome of interest was breast cancer-specific death. Separate models were built for ER-positive and ER-negative patients, since the baseline hazard is different in these two groups [7, 12]. In all models, the PREDICT prognostic index was constrained to have a coefficient of one and it was included as an offset. This avoids overfitting and serves as an independent validation of the PREDICT model. No constraints were placed upon GRSs as this information was unavailable, and so this does not serve as an independent validation of these scores as used clinically. Unlike in the original PREDICT model, follow-up was not censored at 15 years.
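The role of the offset can be illustrated with a small Python sketch that maximises the Cox partial likelihood for a single covariate while holding the offset's coefficient fixed at one. This is a simplified stand-in for the R survival package used in the study: the function name is hypothetical, only one free coefficient is estimated, and tied event times are handled with a Breslow-style risk set.

```python
import math

def fit_cox_with_offset(times, events, x, offset, n_iter=25):
    """Newton-Raphson estimate of one Cox coefficient for x.

    The linear predictor is offset + beta * x, so the offset (for example the
    PREDICT prognostic index) enters with a coefficient fixed at one.
    """
    beta = 0.0
    n = len(times)
    for _ in range(n_iter):
        grad = 0.0
        info = 0.0
        for i in range(n):
            if not events[i]:
                continue
            # Risk set: everyone still under follow-up at patient i's event time.
            s0 = s1 = s2 = 0.0
            for j in range(n):
                if times[j] >= times[i]:
                    w = math.exp(offset[j] + beta * x[j])
                    s0 += w
                    s1 += w * x[j]
                    s2 += w * x[j] * x[j]
            mean_x = s1 / s0
            grad += x[i] - mean_x          # score contribution
            info += s2 / s0 - mean_x ** 2  # observed information contribution
        if info == 0.0:
            break
        step = grad / info
        beta += step
        if abs(step) < 1e-10:
            break
    return beta
```

Because the offset is never re-estimated, only the added covariate can adapt to the dataset, which is why the constrained term acts as an external validation while the unconstrained term does not.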
Univariable models were built for each GRS in turn. Multivariable models were built using PREDICT plus a GRS to assess whether the prognostic information provided by GRSs was independent of PREDICT. Since PREDICT is already a validated multivariable model which incorporates key clinical factors known to be associated with breast cancer prognosis, no additional terms were included in the model to prevent overfitting; for this reason, EndoPredict was used in place of EPClin, and the ROR-C score chosen for Prosigna, in multivariable models. Models were built using single GRSs, since multiple scores are unlikely to be used simultaneously in a clinical setting due to prohibitive cost.
Adding GRS terms into the PREDICT algorithm
We also assessed the impact of including GRSs on the calibration, discrimination and reclassification of the PREDICT algorithm. GRS terms were incorporated as additional terms into PREDICT after rescaling such that the average HR across the GRS distribution was one. This ensures that the baseline hazard used in PREDICT is appropriate.
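One plausible reading of this rescaling step is sketched below in Python: the GRS log-hazard term is shifted by a constant so that the mean hazard ratio over the cohort equals one. The function name is hypothetical and the authors' exact procedure may differ in detail.

```python
import math

def centre_log_hazard(beta, scores):
    """Return per-patient log-HR terms shifted so that the mean HR is one.

    beta: GRS coefficient from the multivariable model
    scores: GRS value for each patient in the cohort
    """
    raw = [beta * s for s in scores]
    mean_hr = sum(math.exp(r) for r in raw) / len(raw)
    shift = math.log(mean_hr)
    # Subtracting log(mean HR) divides every hazard ratio by the mean HR.
    return [r - shift for r in raw]
```

With this centring, adding the GRS term leaves the cohort-average hazard unchanged, so the PREDICT baseline hazard does not need to be re-estimated.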
To account for differences in follow-up time, the expected 10-year survival probability of each patient was calculated using PREDICT. Calibration was reported as the absolute difference in 10-year BCSS between the predicted results (the mean survival of all patients as reported by each algorithm) and the observed results (calculated using the survival package). Discrimination was reported for each algorithm in turn by producing a univariable Cox proportional hazards model, and statistical significance tested using the survcomp package. Goodness of fit was reported using log-likelihoods and tested using one-way ANOVA tests. Log-likelihoods are equivalent to Akaike information criterion in this case since all models have the same number of variables.
In order to account for the effect of using the same observations to estimate the hazard ratios and to measure model performance, we computed the optimism using a bootstrap procedure adapted from the rms R package. We resampled 100 times from the original dataset, fitted the model for each of these samples and compared the performance estimated in the bootstrap sample with a testing sample that contained the observations not sampled in each iteration. The difference between them is the optimism and gives an indication of the amount of overfitting in the model.
The effect of second-generation chemotherapy upon BCSS was used to assess reclassification. Locally, the Cambridge Breast Unit uses PREDICT to stratify patients into three groups according to the predicted benefit of adjuvant chemotherapy: absolute increases in BCSS of < 3%, 3–5% and > 5%. The first group is usually not offered chemotherapy, and chemotherapy is recommended in the third group; for the middle group, a discussion of the pros and cons of treatment is conducted. These thresholds were used to categorise patients; reclassification was assessed using reclassification tables.
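The calibration measure described above can be sketched in Python as the mean predicted 10-year survival minus a Kaplan-Meier estimate of observed survival. This is an illustrative sketch with hypothetical function names, not the study code, which relied on the R survival package.

```python
def kaplan_meier_survival(times, events, horizon):
    """Kaplan-Meier estimate of the survival probability at `horizon`."""
    event_times = sorted({t for t, e in zip(times, events) if e and t <= horizon})
    surv = 1.0
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if e and u == t)
        surv *= 1.0 - deaths / at_risk
    return surv

def calibration_difference(predicted_survival, times, events, horizon=10.0):
    """Mean predicted survival minus observed survival at `horizon`.

    A negative value means the algorithm underestimates survival.
    """
    mean_predicted = sum(predicted_survival) / len(predicted_survival)
    return mean_predicted - kaplan_meier_survival(times, events, horizon)
```

Censored patients contribute to the risk set until they leave follow-up, which is how differences in follow-up time are accommodated.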
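The resampling scheme can be sketched generically in Python as follows. The `fit` and `score` arguments are caller-supplied placeholders (for example a Cox fit and a c-index); the function name and defaults are assumptions, and this is not the rms-derived code used in the study.

```python
import random

def bootstrap_optimism(data, fit, score, n_boot=100, seed=0):
    """Average (bootstrap-sample score - out-of-bag score) over resamples.

    fit(rows) -> model and score(model, rows) -> float are placeholders
    supplied by the caller.
    """
    rng = random.Random(seed)
    n = len(data)
    gaps = []
    for _ in range(n_boot):
        # Draw n row indices with replacement.
        drawn = [rng.randrange(n) for _ in range(n)]
        in_bag = set(drawn)
        train = [data[i] for i in drawn]
        # Rows never drawn in this iteration form the testing sample.
        test = [data[i] for i in range(n) if i not in in_bag]
        if not test:
            continue  # only plausible when n is very small
        model = fit(train)
        gaps.append(score(model, train) - score(model, test))
    if not gaps:
        raise ValueError("no resample left any out-of-bag rows")
    return sum(gaps) / len(gaps)
```

A positive result indicates that performance measured on the fitting data is flattered relative to unseen observations.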
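A reclassification table built on these thresholds might look like the following Python sketch. The category labels and function names are illustrative, and treating benefits of exactly 3% and exactly 5% as intermediate is an assumption about the boundaries.

```python
from collections import Counter

def benefit_category(benefit_pct):
    """Map predicted absolute chemotherapy benefit (%) to a clinical group."""
    if benefit_pct < 3.0:
        return "low"           # chemotherapy usually not offered
    if benefit_pct <= 5.0:
        return "intermediate"  # pros and cons discussed
    return "high"              # chemotherapy recommended

def reclassification_table(original_benefit, modified_benefit):
    """Count patients by (original category, modified category)."""
    if len(original_benefit) != len(modified_benefit):
        raise ValueError("inputs must have one value per patient")
    return Counter(
        (benefit_category(o), benefit_category(m))
        for o, m in zip(original_benefit, modified_benefit)
    )

def count_reclassified(table):
    """Number of patients whose category changed."""
    return sum(n for (before, after), n in table.items() if before != after)
```

Off-diagonal cells of the table are the patients whose treatment recommendation could change when a GRS is added.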
All analyses were conducted in RStudio (version 4.1.0, RStudio, Inc., MA, USA); analysis code and data are provided as Additional file 1.
Study population characteristics
Matched clinical outcome and gene expression data were available for 1980 patients in the METABRIC cohort (Table 1). Median follow-up in the study population was 9.56 years (range 0–29.2 years). There were 646 breast cancer-specific deaths during the study period.
Cox proportional hazards models
Additional file 2: Table S1 summarises key metrics from univariable Cox proportional hazards models. In the ER-positive cohort, all scores except MammaPrint had statistically significant HRs. No GRS had a significant HR in ER-negative patients.
The discrimination of PREDICT (c-index 0.687) was better than GRSs in ER-positive cases; MammaPrint was the best GRS (0.652). GRS discrimination was poor for ER-negative patients with PREDICT performing substantially better (0.667).
In multivariable models, EndoPredict and MammaPrint statistically significantly improved the fit of PREDICT in ER-positive patients (Table 2). While adding Prosigna significantly improved model fit in ER-negative patients, the overall hazard ratio remained non-significant. There was no significant change in discrimination with the addition of GRSs in either ER-positive or ER-negative patients.
Modified PREDICT algorithm
The PREDICT algorithm was modified to incorporate GRS coefficients from the multivariable models. All modified algorithms underestimated 10-year absolute survival in the METABRIC cohort (Table 3). Survival in the METABRIC cohort at 10 years was 74.0% for ER-positive patients and 58.5% in ER-negative patients. Estimates in ER-positive patients underestimated survival by between 12.2% and 13%, with the closest estimate from PREDICT + MammaPrint. Estimates in ER-negative patients underestimated survival by between 2.3% and 8.8%, with the closest estimate from PREDICT + Prosigna. Subgroup analyses are reported in Additional file 2: Figures S1–S6.
Including GRSs in PREDICT resulted in statistically significant improvements in model fit. Point estimates of discrimination were improved in ER-positive patients with the inclusion of any GRS. However, none of these changes were statistically significant.
The majority of patients remained in the same clinical group when using the original and modified forms of the PREDICT model. A total of 1878 patients were included in these analyses, with the remaining 102 excluded due to missing event data. There were 74 (4%) reclassifications when Oncotype DX was included in PREDICT. This was fewer than for EndoPredict (132; 7%), MammaPrint (154; 8%) and Prosigna (183; 10%).
The most important clinical category of ER-positive patients to consider is intermediate benefit, since the benefit of adjuvant chemotherapy is unclear in this group. Reclassification varied by GRS (Table 4). Similar numbers of patients were reclassified into and out of the intermediate benefit category with Oncotype DX (40 vs. 34) and EndoPredict (71 vs. 61). More patients were reclassified out of intermediate benefit than into it using MammaPrint (66 vs. 88), while more were classified as intermediate when using Prosigna (102 vs. 80).
No patients were reclassified from high to low benefit or vice versa when Oncotype DX, EndoPredict or MammaPrint were used; with Prosigna, 1 patient was reclassified from low to high benefit.
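As a worked check of the reclassification figures quoted in the text, the short Python sketch below confirms that the movements into and out of the intermediate category, plus the single low-to-high move for Prosigna, add up to the reported totals and percentages of 1878 patients. Treating each total as the sum of these three kinds of movement is an assumption made here for illustration.

```python
TOTAL_PATIENTS = 1878

# (into intermediate, out of intermediate, low <-> high, reported total)
REPORTED = {
    "Oncotype DX": (40, 34, 0, 74),
    "EndoPredict": (71, 61, 0, 132),
    "MammaPrint": (66, 88, 0, 154),
    "Prosigna": (102, 80, 1, 183),
}

def check_counts(reported=REPORTED, total=TOTAL_PATIENTS):
    """Verify the component counts and return rounded percentages per GRS."""
    summary = {}
    for name, (into, out, extreme, stated) in reported.items():
        if into + out + extreme != stated:
            raise ValueError(f"counts for {name} do not sum to {stated}")
        summary[name] = round(100 * stated / total)
    return summary
```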
Overall, EndoPredict, MammaPrint and Prosigna demonstrated prognostic power independent of PREDICT in multivariable models in ER-positive patients; however, discrimination was not significantly improved. Incorporating GRSs into the PREDICT algorithm did not improve calibration, with underestimation of 10-year BCSS in the METABRIC cohort in ER-positive and ER-negative cohorts. Measures of discrimination were not significantly changed. GRS inclusion caused 4–10% of patients to be reclassified into different clinical categories.
This analysis addresses some key gaps in the current evidence base. In their most recent guideline on the topic, NICE states that “there are no data available to compare the tumour profiling tests with PREDICT, or to define the clinical risk groups using PREDICT”. Previous studies tended to compare GRS against one another or against suboptimal clinicopathological parameters. This study used the current clinical standard of care, making it easier to assess whether clinical predictions are improved.
By creating modified versions of the PREDICT algorithm which accepted GRSs as an additional term, this analysis leveraged absolute risk predictions from PREDICT to allow inferences about the impact of GRSs on model fit. Although this is not the same as calculating the calibration of the scores themselves, it nonetheless demonstrates what the impact of combining PREDICT and GRSs would be.
PREDICT had poor calibration in this analysis, underestimating 10-year BCSS by 13%. Independent validation of PREDICT by Gray et al. on the Scottish Cancer Registry showed much better calibration, with 5% overestimation of 5-year mortality and 2% underestimation of 10-year mortality. The differences in findings between this analysis and previous work may be due to differences in the outcome of interest (overall survival versus BCSS) or the cohorts themselves (as outcomes in the general population may be different to those in a highly selected cohort like METABRIC, from the UK and Canada). It may also be due to calibration drift, where risk estimates change over time due to changes in population characteristics or disease incidence. This may also explain the smaller underestimation of survival in patients with ER-negative disease (Additional file 2: Figure S1), for whom treatment advances have been more limited.
There is a need to develop prognostic molecular scores which take into account the standard clinicopathological variables already used in a clinical setting and then explore how much additional benefit is gained from the inclusion of genomic signatures. In this way, the developed scores will reflect real-world clinical practice and be more relevant to clinical decision-making.
Limitations and future work
A key limitation of this work is that surrogate GRSs were used rather than their commercial counterparts due to cost considerations. Although all GRSs in this study were derived from published papers in line with the original authors’ instructions, there is nonetheless a risk that these differ from commercial scores. Future work should use commercial scores where possible.
Due to the time period in which METABRIC patients were recruited, certain treatments (e.g. bisphosphonates) were less commonly used. There is also a risk of time-varying confounding due to available treatment regimens changing over time (and affecting patient survival). This is likely given that breast cancer survival rates have improved dramatically over the past decades, a finding attributed in major part to improved treatment [1, 2].
The use of pre-processed METABRIC data presented some challenges for conversion of GRSs. Changing the type of data normalisation post hoc is challenging and risks introducing further biases. These issues continue to exist even if raw data are analysed; the only way to eliminate them is to use the genomic test with the official platform. This was not feasible due to the high cost of requesting such testing.
Only BCSS was considered in this analysis. Clinical outcome measures used in cancer are diverse and include measures of survival, disease recurrence and disease-free survival. Although these outcomes are correlated to some extent, future studies would benefit from considering multiple clinical outcomes to ensure that these findings are consistent and sustained across clinically relevant subgroups.
This study likely underestimates PREDICT’s performance, since several variables required for optimal prediction (screening status, KI67 status and bisphosphonate use) were unavailable. Although alternative methods were used to infer KI67 status, these are unlikely to be as accurate as established histopathological techniques.
Similarly, this analysis likely overestimates GRS prognostic power, since GRS coefficients were unconstrained, while PREDICT was constrained to one. This has the effect of allowing GRS coefficients to be re-estimated in the current dataset, effectively creating an overfitting problem whereby the predictive power of included variables is overestimated. The impact of this may be quite large: when univariable PREDICT models were built using the unconstrained variable, model fit was dramatically improved compared to the constrained model (log-likelihood −2823.6 versus −2860.2).
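The size of this log-likelihood gap can be put in context with a likelihood ratio statistic. The Python sketch below is an illustration that is not reported in the study; it assumes the constrained model is nested in the unconstrained one with a single extra free parameter, so the statistic is referred to a chi-squared distribution with one degree of freedom.

```python
import math

def likelihood_ratio_test_1df(loglik_restricted, loglik_full):
    """Likelihood ratio statistic and p-value for one extra free parameter."""
    stat = 2.0 * (loglik_full - loglik_restricted)
    if stat < 0:
        raise ValueError("the full model should not fit worse than the restricted one")
    # For one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Constrained PREDICT (coefficient fixed at one) versus unconstrained PREDICT.
stat, p_value = likelihood_ratio_test_1df(-2860.2, -2823.6)
```

The statistic is 2 × (2860.2 − 2823.6) = 73.2, far beyond the conventional 5% critical value of 3.84 for one degree of freedom.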
Although this study found that there were largely non-significant changes in model performance as a result of incorporation of genomic prognostic scores, these changes need to be modelled economically. In particular, the small number of reclassifications may be important at a health system level if they result in improvements in patient care (which reduce long-term costs of readmission, for example).
This study evaluates the impact of adding GRSs to current standards of care for breast cancer predictive modelling using key model performance metrics (calibration, discrimination and reclassification). Three GRSs (EndoPredict, MammaPrint and Prosigna) demonstrated power to predict BCSS in breast cancer independent of PREDICT. However, incorporating these models into PREDICT had only a modest impact upon calibration (underestimating 10-year BCSS by around 12%), discrimination (with c-indices non-significantly different to the original PREDICT algorithm) and reclassification (with 4–10% of patients reclassified into different clinical categories). Performance was much better in ER-positive than ER-negative patients. Although these small improvements in model fit might be clinically useful, economic analyses are needed to assess whether this justifies the increased cost.
Availability of data and materials
All data generated or analysed during this study are included in this published article (and its supplementary information files).
References
1. Winters S, Martin C, Murphy D, Shokar NK. Breast cancer epidemiology, prevention, and screening. Prog Mol Biol Transl Sci. 2017;151:1–32.
2. Harbeck N, Gnant M. Breast cancer. Lancet. 2017;389:1134–50.
3. Ferlay J, Colombet M, Soerjomataram I, Parkin DM, Piñeros M, Znaor A, et al. Cancer statistics for the year 2020: an overview. Int J Cancer. 2021;149:778–89.
4. National Cancer Registration and Analysis Service, Cancer Research UK. Chemotherapy, radiotherapy and tumour resections in England: 2013–2014 workbook. 2017.
5. Bastiaannet E, Charman J, Johannesen TB, Schrodi S, Siesling S, van Eycken L, et al. A European, observational study of endocrine therapy administration in patients with an initial diagnosis of hormone receptor-positive advanced breast cancer. Clin Breast Cancer. 2018;18:e613–9.
6. Wishart GC, Azzato EM, Greenberg DC, Rashbass J, Kearins O, Lawrence G, et al. PREDICT: a new UK prognostic model that predicts survival following surgery for invasive breast cancer. Breast Cancer Res. 2010;12:R1.
7. Candido dos Reis FJ, Wishart GC, Dicks EM, Greenberg D, Rashbass J, Schmidt MK, et al. An updated PREDICT breast cancer prognostication and treatment benefit prediction model with independent validation. Breast Cancer Res. 2017;19:58.
8. NICE. Early and locally advanced breast cancer: diagnosis and management (NG101). NICE; 2020.
9. American Joint Committee on Cancer. AJCC cancer staging manual. Berlin: Springer; 2017.
10. Waks AG, Winer EP. Breast cancer treatment: a review. JAMA. 2019;321:288–300.
11. Gyorffy B, Hatzis C, Sanft T, Hofstatter E, Aktas B, Pusztai L. Multigene prognostic tests in breast cancer: past, present, future. Breast Cancer Res. 2015;17:1–7.
12. Chia SKL. Clinical application and utility of genomic assays in early-stage breast cancer: key lessons learned to date. Curr Oncol. 2018;25:S125–30.
13. Filipits M, Rudas M, Jakesz R, Dubsky P, Fitzal F, Singer CF, et al. A new molecular predictor of distant recurrence in ER-positive, HER2-negative breast cancer adds independent information to conventional clinical risk factors. Clin Cancer Res. 2011;17:6012–20.
14. Parker JS, Mullins M, Cheung MCU, Leung S, Voduc D, Vickery T, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol. 2009;27:1160–7.
15. NICE. Tumour profiling tests to guide adjuvant chemotherapy decisions in early breast cancer [DG34]. NICE; 2018.
16. Van’t Veer LJ, Dai H, Van de Vijver MJ, He YD, Hart AAM, Mao M, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415:530–6.
17. Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21:128–38.
18. Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. JAMA. 1982;247:2543–6.
19. Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation. 2007;115:928–35.
20. Beumer IJ, Persoon M, Witteveen A, Dreezen C, Chin SF, Sammut SJ, et al. Prognostic value of MammaPrint in invasive lobular breast cancer. Biomark Insights. 2016;11:139–46.
21. Buus R, Sestak I, Kronenwett R, Denkert C, Dubsky P, Krappmann K, et al. Comparison of EndoPredict and EPclin With Oncotype DX recurrence score for prediction of risk of distant recurrence after endocrine therapy. J Natl Cancer Inst. 2016;108:djw149.
22. Drukker CA, Elias SG, Nijenhuis MV, Wesseling J, Bartelink H, Elkhuizen P, et al. Gene expression profiling to predict the risk of locoregional recurrence in breast cancer: a pooled analysis. Breast Cancer Res Treat. 2014;148:599–613.
23. Martin M, Brase JC, Calvo L, Krappmann K, Ruiz-Borrego M, Fisch K, et al. Clinical validation of the EndoPredict test in node-positive, chemotherapy-treated ER+/HER2- breast cancer patients: results from the GEICAM 9906 trial. Breast Cancer Res. 2014;16:R38.
24. Yao K, Goldschmidt R, Turk M, Wesseling J, Stork-Sloots L, de Snoo F, et al. Molecular subtyping improves diagnostic stratification of patients with primary breast cancer into prognostically defined risk groups. Breast Cancer Res Treat. 2015;154:81–8.
25. Zhang Y, Schnabel CA, Schroeder BE, Jerevall P-L, Jankowitz RC, Fornander T, et al. Breast cancer index identifies early-stage estrogen receptor-positive breast cancer patients at risk for early- and late-distant recurrence. Clin Cancer Res. 2013;19:4196–205.
26. Zhao X, Rodland EA, Sorlie T, Vollan HKM, Russnes HG, Kristensen VN, et al. Systematic assessment of prognostic gene signatures for breast cancer shows distinct influence of time and ER status. BMC Cancer. 2014;14:211.
27. Filipits M, Nielsen TO, Rudas M, Greil R, Stoger H, Jakesz R, et al. The PAM50 risk-of-recurrence score predicts risk for late distant recurrence after endocrine therapy in postmenopausal women with endocrine-responsive early breast cancer. Clin Cancer Res. 2014;20:1298–305.
28. Gnant M, Filipits M, Greil R, Stoeger H, Rudas M, Bago-Horvath Z, et al. Predicting distant recurrence in receptor-positive breast cancer patients with limited clinicopathological risk: using the PAM50 Risk of Recurrence score in 1478 postmenopausal patients of the ABCSG-8 trial treated with adjuvant endocrine therapy alone. Ann Oncol. 2014;25:339–45.
29. Curtis C, Shah SP, Chin SF, Turashvili G, Rueda OM, Dunning MJ, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature. 2012;486:346–52.
30. Rueda OM, Sammut S-J, Seoane JA, Chin S-F, Caswell-Jin JL, Callari M, et al. Dynamics of breast-cancer relapse reveal late-recurring ER-positive genomic subgroups. Nature. 2019;567:399–404.
31. Scrucca L, Fop M, Murphy TB, Raftery AE. Mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J. 2016;8:289–317.
32. Abubakar M, Orr N, Daley F, Coulson P, Ali HR, Blows F, et al. Prognostic value of automated KI67 scoring in breast cancer: a centralised evaluation of 8088 patients from 10 study groups. Breast Cancer Res. 2016;18:1–13.
33. Gendoo DMA, Ratanasirigulchai N, Schröder MS, Paré L, Parker JS, Prat A, et al. Genefu: an R/Bioconductor package for computation of gene expression-based signatures in breast cancer. Bioinformatics. 2016;32:1097–9.
34. Therneau TM. A package for survival analysis in R [Internet]. Compr. R Arch. Netw. Comprehensive R Archive Network (CRAN); 2021. Cited 28 June 2021. https://cran.r-project.org/package=survival.
35. Schröder MS, Culhane AC, Quackenbush J, Haibe-Kains B. survcomp: an R/Bioconductor package for performance assessment and comparison of survival models. Bioinformatics. 2011;27:3206–8.
36. Harrell FEJ. Regression modeling strategies. Berlin: Springer; 2001.
37. Loh S-W, Rodriguez-Miguelez M, Pharoah P, Wishart G. A comparison of chemotherapy recommendations using Predict and Adjuvant models. Eur J Surg Oncol. 2011;37:S21–2.
38. Gray E, Marti J, Brewster DH, Wyatt JC, Hall PS. Independent validation of the PREDICT breast cancer prognosis prediction tool in 45,789 patients using Scottish Cancer Registry data. Br J Cancer. 2018;119:808–14.
39. Van Calster B, McLernon DJ, Van Smeden M, Wynants L, Steyerberg EW, Bossuyt P, et al. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17:1–7.
40. Qin S, Kim J, Arafat D, Gibson G. Effect of normalization on statistical and biological interpretation of gene expression profiles. Front Genet. 2013;4:160.
OMR was supported by the NIHR Cambridge Biomedical Research Centre (BRC-1215-20014) and the UKRI (UK; MC_UU_00002/16).
Ethics approval and consent to participate
Consent for publication
PDP receives a share of fees received by Cambridge Enterprises for licensing of PREDICT. AC and OMR have no conflicts of interest to declare.
Cite this article
Chowdhury, A., Pharoah, P.D. & Rueda, O.M. Evaluation and comparison of different breast cancer prognosis scores based on gene expression data. Breast Cancer Res 25, 17 (2023). https://doi.org/10.1186/s13058-023-01612-9