Prognostic signatures in breast cancer: correlation does not imply causation
© BioMed Central Ltd 2012
Published: 19 June 2012
Testing the statistical associations between microarray-based gene expression signatures and patient outcome has become a popular approach to infer biological and clinical significance of laboratory observations. Venet and colleagues recently demonstrated that the majority of randomly generated gene signatures are significantly associated with outcome of breast cancer patients, and that this association stems from the fact that a large proportion of the transcriptome is significantly correlated with proliferation, a strong predictor of outcome in breast cancer patients. These findings demonstrate that a statistical association between a gene signature and disease outcome does not necessarily imply biological significance.
Breast cancer encompasses a plethora of distinct diseases characterised by different biological features and clinical outcomes [1–3]. Microarray-based gene expression profiling studies have played a pivotal role in unravelling the molecular and clinical diversity of the disease (for a review see ). These studies led to the development of a molecular classification of breast cancer , where the different molecular subtypes identified were found to be associated with distinct clinical outcomes [5, 6], and to the development of numerous multigene predictors (that is, gene signatures) of outcome, which were initially reported to outperform the current clinicopathological algorithms to define the prognosis of breast cancer patients [7, 8] (reviewed in [3, 9]).
Microarrays have also played a pivotal role in addressing one of the major bottlenecks in translational research: ascribing relevance in the human disease context of results obtained from in vitro studies and animal models. The availability of multiple gene expression datasets with patient follow-up in the public domain allowed the investigation of whether a microarray-based signature derived from a set of laboratory experiments would have biological significance. For instance, a signature derived from tumour-initiating breast cancer cells was shown to be of prognostic significance in a publicly available microarray dataset, and this was used as the basis to suggest that the tumourigenic breast cancer cell signature 'may detect transcriptional profiles associated with mutations that arrest cells in an immature state of differentiation and function as markers of more aggressive tumors' .
In their recent paper, Venet and colleagues made the intriguing observation that gene signatures developed to identify phenomena completely unrelated to cancer - such as the effect of postprandial laughter on peripheral blood mononuclear cells, the localisation of skin fibroblasts or social defeat derived from mouse brains - were significantly associated with outcome in a cohort of 295 breast cancer patients of the Netherlands Cancer Institute (NKI-295). In addition, out of 1,890 gene signatures deposited in the Molecular Signatures Database, 67% were associated with breast cancer outcome at P < 0.05, and 23% were associated at P < 10⁻⁵. Because this large number of significant signatures might simply reflect the enrichment of the Molecular Signatures Database for cancer-related signatures, the authors generated, for each Molecular Signatures Database signature, a signature of identical size but composed of randomly selected genes. Strikingly, out of these randomly derived signatures, 77% were associated with outcome at P < 0.05 and 30% were associated at P < 10⁻⁵. Furthermore, the authors went on to show that, of 47 published prognostic signatures (derived either to find better prognostic tools or, in most cases, to suggest the biological relevance of laboratory findings), only 18 performed statistically better than the best 5% of random gene signatures of the same size.
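The comparison against random signatures of identical size can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the mean-expression score and the use of an absolute Pearson correlation with a binary outcome are assumptions made here for brevity, whereas a real analysis would use a survival model on follow-up data.

```python
import math
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def signature_score(expr, genes):
    """Per-patient mean expression over the genes in a signature."""
    n_patients = len(next(iter(expr.values())))
    return [mean(expr[g][i] for g in genes) for i in range(n_patients)]


def association(expr, genes, outcome):
    """Toy statistic (an assumption): |r| between score and a 0/1 outcome."""
    return abs(pearson(signature_score(expr, genes), outcome))


def random_signature_benchmark(expr, signature, outcome, n_random=200, seed=0):
    """Return the observed statistic and its empirical P value against
    random signatures of the same size drawn from all measured genes."""
    rng = random.Random(seed)
    all_genes = sorted(expr)
    observed = association(expr, signature, outcome)
    as_strong = sum(
        association(expr, rng.sample(all_genes, len(signature)), outcome) >= observed
        for _ in range(n_random)
    )
    # Add-one correction keeps the empirical P value strictly above zero.
    return observed, (as_strong + 1) / (n_random + 1)
```

Here `expr` is assumed to map each gene name to a list of expression values, one per patient. A signature would be judged informative only if its empirical P value is small, that is, if few random signatures of the same size do as well.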
A critically relevant observation made by Venet and colleagues is that >90% of randomly generated signatures containing >100 genes were associated with outcome of breast cancer patients. Further, up to 26% of all probes within the microarray platform used for the analysis of the samples from the NKI-295 dataset were significantly associated with outcome on univariate analysis. Even when a more stringent measure (that is, the q value) was used to account for false discoveries stemming from multiple comparisons, 17% of all probe sets were significantly associated with outcome. What are the statistical and/or biological reasons for these observations?
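The q value mentioned above is a false discovery rate measure. As a rough illustration of how such a correction tightens a list of per-probe P values, the sketch below applies the Benjamini-Hochberg step-up adjustment. This is a related and widely used procedure chosen here for simplicity; it is not necessarily the estimator used in the study, and the function name is hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def fraction_significant(pvalues, threshold=0.05):
    """Share of tests whose adjusted P value falls below the threshold."""
    adjusted = bh_adjust(pvalues)
    return sum(a < threshold for a in adjusted) / len(adjusted)
```

Applied to one P value per probe, `fraction_significant` gives the kind of figure quoted above: the proportion of probes that remain associated with outcome after accounting for multiple comparisons.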
Given that previous studies had revealed that proliferation is the main and shared determinant of the prognostic accuracy of multigene predictors of outcome in breast cancer patients [3, 12–14], the authors developed a proliferation metagene called meta-PCNA. This metagene was composed of the top 1% of genes whose expression was most positively correlated with that of the proliferating cell nuclear antigen gene (PCNA) across 36 normal tissues. Venet and colleagues confirmed that proliferation is a major prognostic determinant of outcome in unstratified breast cancer patients. The meta-PCNA metagene was then used to adjust the expression data of breast cancer gene signatures, which dramatically reduced the association with outcome of most published and random signatures.
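The two steps described above, building a metagene from the genes most correlated with PCNA and then adjusting expression data for it, can be sketched as follows. The details are assumptions of this sketch rather than the authors' documented method: the metagene is summarised by the per-sample median of its genes, and adjustment is done by taking residuals from a simple least-squares fit. All function names are hypothetical.

```python
import math
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def build_metagene(normal_expr, anchor="PCNA", top_fraction=0.01):
    """Genes most positively correlated with the anchor across normal tissues.

    The anchor correlates perfectly with itself and is therefore included;
    keeping it is a choice made in this sketch.
    """
    ref = normal_expr[anchor]
    ranked = sorted(normal_expr, key=lambda g: pearson(normal_expr[g], ref),
                    reverse=True)
    n_top = max(1, round(top_fraction * len(normal_expr)))
    return ranked[:n_top]


def metagene_values(tumour_expr, genes):
    """Per-sample metagene value (assumed here: median over member genes)."""
    n_samples = len(next(iter(tumour_expr.values())))
    return [median(tumour_expr[g][i] for g in genes) for i in range(n_samples)]


def adjust_for(values, covariate):
    """Residuals of the least-squares fit values ~ a + b * covariate."""
    mx, my = mean(covariate), mean(values)
    sxx = sum((x - mx) ** 2 for x in covariate)
    slope = sum((x - mx) * (y - my) for x, y in zip(covariate, values)) / sxx
    return [y - my - slope * (x - mx) for x, y in zip(covariate, values)]
```

In use, `adjust_for` would be applied to each gene (or signature score) in the tumour dataset with the metagene values as the covariate, so that whatever association with outcome remains cannot be attributed to the linear effect of proliferation.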
So why do random gene signatures with >100 genes correlate with breast cancer patient outcome? The crux of the problem appears to be the large number of proliferation-related genes in the breast cancer transcriptome itself: the authors found that 58% of the microarray probes used for the analysis of the NKI-295 dataset were correlated with meta-PCNA. Virtually any large collection of genes will therefore inevitably be enriched for proliferation-related genes. Moreover, given that there are many genes whose expression levels correlate with cell cycle and/or proliferation but whose main biological functions/gene ontology may not be related to these phenomena, any attempt to remove known proliferation-related genes as defined by gene ontology is likely to be futile. While this does not imply that the published signatures lack prognostic value, the unifying feature among them is the effect of proliferation, and the signal of additional biological relevance beyond this is minimal.
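The reasoning above can be made concrete with a small simulation. It is a toy model built on stated assumptions, not a reanalysis of the NKI-295 data: 58% of simulated genes carry a weak positive loading (0.3, chosen arbitrarily) on a single latent "proliferation" factor, the rest are pure noise, and a signature is scored by its mean expression.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_genes=1000, n_patients=200, frac_linked=0.58, loading=0.3, seed=0):
    """Simulate expression in which a fraction of genes tracks one latent factor."""
    rng = random.Random(seed)
    latent = [rng.gauss(0, 1) for _ in range(n_patients)]
    n_linked = round(frac_linked * n_genes)
    expr = []
    for g in range(n_genes):
        b = loading if g < n_linked else 0.0  # unlinked genes are pure noise
        expr.append([b * z + rng.gauss(0, 1) for z in latent])
    return latent, expr


def mean_abs_correlation(latent, expr, size, n_signatures=20, seed=1):
    """Average |r| between the latent factor and random signature scores."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_signatures):
        genes = rng.sample(range(len(expr)), size)
        score = [sum(expr[g][i] for g in genes) / size for i in range(len(latent))]
        total += abs(pearson(latent, score))
    return total / n_signatures


latent, expr = simulate()
for size in (1, 10, 100):
    print(size, round(mean_abs_correlation(latent, expr, size), 2))
```

Under these assumptions, averaging over more genes cancels gene-specific noise while the shared factor accumulates, so larger random signatures track the latent factor more closely even though each individual gene does so only weakly.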
Arguably, one of the major contributions of Venet and colleagues was to bring to the attention of the breast cancer research community the limitations of an approach ever so familiar in this day and age: using microarrays to suggest that a mechanism is relevant to human breast cancer from the finding that a gene expression marker for this mechanism predicts outcome of breast cancer patients. Their study has also reminded us of the old maxim that 'correlation does not imply causation'. The assessment of the expression levels of a gene or gene signature may be clinically useful without yielding interesting biological or mechanistic insights. Conversely, an association between a gene signature derived from laboratory experiments and the prognosis of breast cancer patients does not necessarily imply that the genes composing that signature are of biological significance to the disease.
NKI-295: the Netherlands Cancer Institute cohort of 295 breast cancer patients
PCNA: proliferating cell nuclear antigen.
The authors' work is supported by Breakthrough Breast Cancer. BW is funded by a Cancer Research UK postdoctoral fellowship. The authors acknowledge NHS funding for the NIHR Biomedical Research Centre.