Prognostic signatures in breast cancer: correlation does not imply causation

Testing the statistical associations between microarray-based gene expression signatures and patient outcome has become a popular approach to infer biological and clinical significance of laboratory observations. Venet and colleagues recently demonstrated that the majority of randomly generated gene signatures are significantly associated with outcome of breast cancer patients, and that this association stems from the fact that a large proportion of the transcriptome is significantly correlated with proliferation, a strong predictor of outcome in breast cancer patients. These findings demonstrate that a statistical association between a gene signature and disease outcome does not necessarily imply biological significance.

Breast cancer encompasses a plethora of distinct diseases characterised by different biological features and clinical outcomes [1-3]. Microarray-based gene expression profiling studies have played a pivotal role in unravelling the molecular and clinical diversity of the disease (for a review see [3]). These studies led to the development of a molecular classification of breast cancer [4], in which the different molecular subtypes identified were found to be associated with distinct clinical outcomes [5,6], and to the development of numerous multigene predictors (that is, gene signatures) of outcome, which were initially reported to outperform the current clinicopathological algorithms to define the prognosis of breast cancer patients [7,8] (reviewed in [3,9]).
Microarrays have also played a pivotal role in addressing one of the major bottlenecks in translational research: ascribing relevance in the human disease context to results obtained from in vitro studies and animal models. The availability of multiple gene expression datasets with patient follow-up in the public domain allowed the investigation of whether a microarray-based signature derived from a set of laboratory experiments would have biological significance. For instance, a signature derived from tumour-initiating breast cancer cells was shown to be of prognostic significance in a publicly available microarray dataset, and this was used as the basis to suggest that the tumourigenic breast cancer cell signature 'may detect transcriptional profiles associated with mutations that arrest cells in an immature state of differentiation and function as markers of more aggressive tumors' [10].
In their recent paper [11], Venet and colleagues made the intriguing observation that gene signatures developed to identify phenomena completely unrelated to cancer (such as the effect of postprandial laughter on peripheral blood mononuclear cells, the localisation of skin fibroblasts, or social defeat in mouse brains) were significantly associated with outcome in a cohort of 295 breast cancer patients of the Netherlands Cancer Institute (NKI-295) [8]. In addition, it was also shown that, out of 1,890 gene signatures deposited in the Molecular Signatures Database, 67% were associated with breast cancer outcome at P < 0.05, and 23% were associated at P < 10⁻⁵. The large number of signatures significantly associated with outcome might have been due to the enrichment of the Molecular Signatures Database with cancer-related signatures; hence the authors generated, for each Molecular Signatures Database signature, a signature of identical size but composed of randomly selected genes. Strikingly, out of these randomly derived signatures, 77% were associated with outcome at P < 0.05 and 30% were associated at P < 10⁻⁵. Furthermore, the authors went on to show that only 18 of the 47 published prognostic signatures (which were either derived for the purpose of finding better prognostic tools or, in most cases, used to suggest biological relevance of laboratory findings) performed statistically better than the best 5% of random gene signatures of the same size [11].
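The comparison of a signature against random signatures of matched size can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in [11]: the data are simulated, the signature score is taken to be the mean expression of its genes, and the absolute Pearson correlation with a continuous outcome stands in for a survival statistic. All function names and parameter values are hypothetical.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0

def association(expr, genes, outcome):
    """Assumed stand-in statistic: |correlation| of the mean signature expression with outcome."""
    n_patients = len(outcome)
    score = [sum(expr[g][i] for g in genes) / len(genes) for i in range(n_patients)]
    return abs(pearson(score, outcome))

def fraction_random_at_least_as_good(expr, outcome, signature, n_random=200, seed=0):
    """Fraction of same-size random signatures whose statistic matches or beats the candidate's."""
    rng = random.Random(seed)
    all_genes = sorted(expr)
    observed = association(expr, signature, outcome)
    hits = 0
    for _ in range(n_random):
        random_signature = rng.sample(all_genes, len(signature))
        if association(expr, random_signature, outcome) >= observed:
            hits += 1
    return hits / n_random

def simulate(n_patients=100, n_genes=300, frac_prolif=0.5, seed=1):
    """Simulated cohort (an assumption): a latent proliferation score drives outcome
    and the first frac_prolif of genes; the remaining genes are pure noise."""
    rng = random.Random(seed)
    prolif = [rng.gauss(0, 1) for _ in range(n_patients)]
    outcome = [p + rng.gauss(0, 1) for p in prolif]
    expr = {}
    for g in range(n_genes):
        load = 0.5 if g < n_genes * frac_prolif else 0.0
        expr[f"g{g}"] = [load * p + rng.gauss(0, 1) for p in prolif]
    return expr, outcome

expr, outcome = simulate()
# A 20-gene signature drawn only from genes unrelated to proliferation.
unrelated_signature = [f"g{i}" for i in range(280, 300)]
print(fraction_random_at_least_as_good(expr, outcome, unrelated_signature))
```

In this simulated setting, a signature built from genes unrelated to outcome is expected to be matched or beaten by most random signatures, which is the situation a matched-size random benchmark is meant to expose.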
A critically relevant set of observations made by Venet and colleagues includes the fact that >90% of randomly generated signatures containing >100 genes were shown to be associated with outcome of breast cancer patients [11]. Further, up to 26% of all probes within the microarray platform used for the analysis of the samples from the NKI-295 dataset were significantly associated with outcome on univariate analysis. Even when more stringent parameters (that is, the q value) were used to account for the false discovery stemming from multiple comparisons, 17% of all probe sets were shown to be significantly associated with outcome [11]. What are the statistical and/or biological reasons for these observations?
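To make the multiple-comparison correction concrete, the sketch below adjusts a list of per-probe P values for false discovery and reports the fraction that remains significant. The text specifies only that a q value was used; the Benjamini-Hochberg procedure shown here is an assumption chosen as one standard way to control the false discovery rate, and the function names are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values, in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices from smallest to largest P
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted values are capped at 1
    # Walk from the largest P value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def fraction_below(values, alpha):
    """Fraction of values that are at or below the threshold alpha."""
    return sum(v <= alpha for v in values) / len(values)

# Hypothetical per-probe P values, for illustration only.
probe_pvalues = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(probe_pvalues)
print(fraction_below(probe_pvalues, 0.03), fraction_below(adjusted, 0.03))
```

Because adjustment can only raise each value, the fraction of probes called significant after correction is never larger than the unadjusted fraction, in line with the drop from 26% to 17% reported above.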
Given that previous studies had revealed that proliferation is the main and shared determinant of the prognostic accuracy of multigene predictors of outcome in breast cancer patients [3,12-14], the authors developed a proliferation metagene called meta-PCNA. This metagene was composed of the top 1% of genes whose expression was most positively correlated with the expression of the proliferating cell nuclear antigen (PCNA) across 36 normal tissues. Venet and colleagues confirmed that proliferation is a major prognostic determinant of outcome in unstratified breast cancer patients [11]. The meta-PCNA metagene was then used to adjust the expression data of breast cancer gene signatures, which resulted in a dramatic reduction in the association between most published and random signatures and outcome.
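The construction and use of a metagene of this kind can be sketched as follows. The ranking by positive correlation with PCNA and the top 1% cut-off come from the description above; summarising the metagene as the per-sample median and adjusting by taking residuals from a simple linear regression are assumptions made for this illustration, as are all function names and the toy values.

```python
from math import sqrt
from statistics import median

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denom = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return sxy / denom if denom > 0 else 0.0

def top_correlated_genes(normal_expr, anchor="PCNA", fraction=0.01):
    """Genes (other than the anchor) most positively correlated with the anchor
    across normal tissues; keeps the top `fraction`, and at least one gene."""
    others = [g for g in normal_expr if g != anchor]
    ranked = sorted(others, key=lambda g: pearson(normal_expr[g], normal_expr[anchor]),
                    reverse=True)
    return ranked[:max(1, int(len(others) * fraction))]

def metagene(tumour_expr, genes):
    """Assumed summary: median expression of the metagene members in each sample."""
    n_samples = len(tumour_expr[genes[0]])
    return [median(tumour_expr[g][i] for g in genes) for i in range(n_samples)]

def residuals(values, covariate):
    """Assumed adjustment: residuals of a least-squares fit of values on the covariate."""
    n = len(covariate)
    mx, my = sum(covariate) / n, sum(values) / n
    sxx = sum((x - mx) ** 2 for x in covariate)
    slope = sum((x - mx) * (y - my) for x, y in zip(covariate, values)) / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(covariate, values)]

# Toy data (hypothetical): four normal tissues, then three tumour samples.
normal = {"PCNA": [1, 2, 3, 4], "A": [2, 4, 6, 8], "B": [4, 3, 2, 1], "C": [1, 3, 2, 4]}
members = top_correlated_genes(normal, fraction=0.67)
tumours = {"A": [1.0, 5.0, 3.0], "B": [9.0, 9.5, 8.0], "C": [3.0, 1.0, 4.0]}
index = metagene(tumours, members)
print(members, residuals(tumours["B"], index))
```

After adjustment, each gene's residual expression is uncorrelated with the metagene by construction, so any remaining association with outcome cannot be attributed to the linear component of proliferation captured by the metagene.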
So why do random gene signatures with >100 genes correlate with breast cancer patient outcome? The crux of the problem appears to be the large number of proliferation-related genes in the breast cancer transcriptome itself, given that the authors found that 58% of the microarray probes used for the analysis of the NKI-295 dataset were correlated with meta-PCNA [11]. Virtually any large collection of genes will therefore inevitably be enriched for proliferation-related genes. Moreover, given that there are many genes whose expression levels correlate with cell cycle and/or proliferation but whose main biological functions/gene ontology may not be related to these phenomena, any attempt to remove known proliferation-related genes as defined by gene ontology is likely to be futile [11]. While this does not imply that the published signatures lack prognostic value, the underlying unifying feature among them is the effect of proliferation, and the signal of additional biological relevance beyond this is minimal.
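A small simulation illustrates why larger random signatures track proliferation more closely. It is a sketch under simplifying assumptions rather than a reanalysis of the NKI-295 data: 58% of simulated genes (the proportion quoted above) carry a modest positive loading on a latent proliferation score, the remaining genes are pure noise, and a signature is summarised by the mean expression of its genes. The loading, cohort size and function names are hypothetical.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denom = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return sxy / denom if denom > 0 else 0.0

def simulate_expression(n_genes=500, n_patients=80, frac_prolif=0.58, loading=0.3, seed=0):
    """Genes x patients matrix in which the first frac_prolif of genes are weakly
    driven by a latent proliferation score (assumed, all-positive loadings)."""
    rng = random.Random(seed)
    prolif = [rng.gauss(0, 1) for _ in range(n_patients)]
    n_linked = int(n_genes * frac_prolif)
    expr = []
    for g in range(n_genes):
        w = loading if g < n_linked else 0.0
        expr.append([w * p + rng.gauss(0, 1) for p in prolif])
    return expr, prolif

def mean_abs_correlation(expr, prolif, size, n_signatures=30, seed=1):
    """Average |correlation| between random signature means and the proliferation score."""
    rng = random.Random(seed)
    n_patients = len(prolif)
    total = 0.0
    for _ in range(n_signatures):
        genes = rng.sample(range(len(expr)), size)
        score = [sum(expr[g][i] for g in genes) / size for i in range(n_patients)]
        total += abs(pearson(score, prolif))
    return total / n_signatures

expr, prolif = simulate_expression()
for size in (5, 20, 100):
    print(size, round(mean_abs_correlation(expr, prolif, size), 2))
```

Averaging over more genes cancels gene-specific noise while the shared proliferation component remains, so the correlation with proliferation rises with signature size even though each gene is only weakly linked to it.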
Arguably, one of the major contributions of Venet and colleagues was to bring to the attention of the breast cancer research community the limitations of an approach ever so familiar in this day and age: using microarrays to suggest that a mechanism is relevant to human breast cancer from the finding that a gene expression marker for this mechanism predicts outcome of breast cancer patients [11]. Their study has also reminded us of the old maxim that 'correlation does not imply causation'. The assessment of the expression levels of a gene or gene signature may be clinically useful without yielding interesting biological or mechanistic insights. On the other hand, an association between a gene signature derived from laboratory experiments and the prognosis of breast cancer patients does not necessarily imply that the genes which compose a given signature are of biological significance to the disease.
Abbreviations
NKI-295, The Netherlands Cancer Institute cohort of 295 breast cancer patients; PCNA, proliferating cell nuclear antigen.

Competing interests
The authors declare that they have no competing interests.