Expression genomics in breast cancer research: microarrays at the crossroads of biology and medicine

Genome-wide expression microarray studies have revealed that the biological and clinical heterogeneity of breast cancer can be partly explained by information embedded within a complex but ordered transcriptional architecture. Comprising this architecture are gene expression networks, or signatures, reflecting biochemical and behavioral properties of tumors that might be harnessed to improve disease subtyping, patient prognosis and prediction of therapeutic response. Emerging 'hypothesis-driven' strategies that incorporate knowledge of pathways and other biological phenomena in the signature discovery process are linking prognosis and therapy prediction with transcriptional readouts of tumorigenic mechanisms that better inform therapeutic options.


Introduction
DNA microarrays are tools for assessing the functional dynamics of genes and genomes in a highly parallel fashion. Historically defined as ordered collections of DNA probes for the specific detection of complementary DNA targets, microarrays enable genome-wide surveys of the relative abundance of mRNA transcripts, the high-resolution mapping of genomic copy number alterations, the identification of binding sites of nucleic acid-binding proteins, and the comprehensive analysis of single-nucleotide polymorphisms (SNPs). Although microarray technology and its applications have evolved considerably over the years to meet a growing range of genomic challenges [1], the classical format for microarrays in interrogating the transcriptome (that is, expression microarrays) has been a key technology for discovery in functional and medical genomics.
Since the mid-1990s, expression microarrays have been extensively applied to the study of cancer, and no cancer type has seen as much genomic attention as breast cancer. The most prolific area of breast cancer genomics has been the elucidation and interpretation of gene expression patterns that underlie biological and clinical properties of tumors. In a seminal study that analyzed expression profiles of primary breast tumors, Perou and colleagues [2] showed that the vast and complex transcriptional data generated by microarrays contained discernible patterns of gene expression that related to tumor biology and behavior. Through hierarchical cluster analysis, numerous 'gene clusters' could be recognized as biologically distinct networks reflecting the phenotypic wiring of individual tumors. These 'molecular portraits' revealed information on multiple biological tiers, from broad tumorigenic properties to discrete biochemical pathways to intra-tumor tissue heterogeneity, and led to the discovery of an 'intrinsic' gene subset that could distinguish between multiple new cancer subtypes on the basis of fundamental tumor properties associated with cell-type origin. These subtypes, termed Luminal A/ER+, Luminal B/ER+, Normal Breast-like, ERBB2+, and Basal-like (that is, the Perou-Sorlie subtypes), were subsequently shown to be stable and reproducible classes observable in different patient populations, and correlated significantly with tumor recurrence and patient survival [3,4].
Together, these studies provided early evidence that the transcriptional circuitry of breast cancer, as revealed by microarrays, could not only provide novel insights into the biology of cancer but could also accurately identify certain previously discernible clinical phenotypes (for example estrogen receptor (ER) status, HER2/neu expression, and proliferation rate) and robustly define new molecularly informed classifications that delineate novel disease entities associated with patient outcomes.
More recently, new investigative techniques have begun to refine our understanding of the breast cancer oncotranscriptome and how it relates to tumor biology and behavior. From this vantage point, the intersections between pathological mechanisms and clinical endpoints are being explored with new vigor. Traditional microarray methods for uncovering prognostic expression signatures, based primarily on empirical associations not requiring plausible biological relevance of the markers used, are now sharing the stage with mechanistically motivated strategies driven by knowledge of oncogenic pathways and processes. Increasingly, experimental approaches show that pathobiological simulations performed in vitro reveal transcriptional configurations predictive of tumor biology in vivo. Together, these functional genomics strategies are changing the scientific process of breast cancer biomarker discovery, towards one that incorporates mechanistic knowledge.
Breast Cancer Research Vol 9 No 2 Miller and Liu

Patient prognosis
The work by Perou, Sorlie and colleagues demonstrated the power of expression genomics to stratify patients clinically on the basis of the complex molecular configurations of their tumors. Questions remained, however, about the practical utility of the Perou-Sorlie subtypes in prognosis, and whether other genomic strategies might provide greater prognostic resolution in certain clinically challenging patient subpopulations.
Van 't Veer and colleagues [5] and Wang and colleagues [6] both focused on the identification of gene expression 'signatures' (rather than tumor subtypes) that could predict outcome in patients with early-stage breast cancer (N0, T1/T2), the majority of whom would unnecessarily receive systemic adjuvant therapy according to conventional guidelines. Working with primary tumor material from patients who did not receive adjuvant systemic therapy, each group identified and validated a prognostic signature capable of predicting 5-year disease recurrence [5-7]. The signature by van 't Veer and colleagues (otherwise known as the Amsterdam signature) consisted of 70 genes, whereas the predictor of Wang and colleagues (otherwise known as the Rotterdam signature) was composed of 76 genes: 60 for prognosis of patients with ER-positive tumors, and 16 for prognosis of those with ER-negative disease. In each case, the prognostic power of the signature was independent of, and even superior to, conventional risk factors (such as tumor size, histologic grade, and patient age), and, in comparison with the St Gallen's and National Institutes of Health consensus guidelines for establishing patient eligibility for adjuvant chemotherapy, the signatures were better at predicting which patients should not receive adjuvant therapy (and similar at predicting who should receive adjuvant therapy), potentially sparing a significant fraction of 'cured' patients from overtreatment.
Both the Amsterdam and Rotterdam signatures have now been further validated in large multicenter investigations that confirm the prognostic advantages of the expression signatures over conventional guidelines for selecting patients for adjuvant systemic therapy [8,9]. The Amsterdam signature has now been marketed for clinical use through the Amsterdam-based diagnostics company, Agendia, founded in 2003 by the Netherlands Cancer Institute (NKI).
An interesting aspect of these two studies is that although the two gene lists were derived from the same basic scientific question and using similar patient cohorts, only three genes were found in common to both signatures [6]. Various technical differences have been proposed to account for this discrepancy, but others have noted that if one looks beyond the genes to the pathways they represent, multiple pathways can be found in common between the signatures, indicating that the signatures and their predictive powers may converge on the same underlying biology [6]. Although the endpoints of these two investigations were clinical in nature, a compelling biological interpretation of the results has emerged: that early primary tumors may already possess the hardwiring necessary for future metastasis, thus countering the view that metastatic potential is an acquired trait that develops later in the course of tumorigenesis and in a rare subpopulation of cells.

Tailored treatment
If early-stage primary breast tumors are already hardwired for metastatic potential, might their propensity for therapeutic response also be molecularly ingrained, and measurable via a transcriptional readout? Valuable evidence supporting this hypothesis was first demonstrated in the context of diffuse large-B-cell lymphoma (DLBCL). Alizadeh and colleagues [10] used expression microarrays to elucidate transcriptional patterns that could dichotomize DLBCL samples into at least two distinct classes reflecting different aspects of normal B-cell physiology. One class showed expression of genes commonly induced in germinal-center B cells (the GCB-like class), whereas the other was characterized by expression of genes associated with mitogenic stimulation of blood B cells (termed the activated B-cell (ABC)-like class). Importantly, these two classes showed distinct clinical behaviors after chemotherapy; patients with GCB-like disease had twice the 5-year survival rate of those with ABC-like disease [10-12].
Because NF-κB activity is critical for the development and survival of normal B cells and is known to be important in several cancer types, Davis and colleagues [13] investigated the possibility that the NF-κB pathway might be differentially activated between GCB-like and ABC-like forms of DLBCL. Indeed, the authors identified, in the microarray data, a handful of NF-κB target genes that were significantly differentially expressed between the two groups, with higher expression in the ABC-like class. Using cell lines representative of the two classes, the authors showed that constitutive NF-κB activity was required for survival of the ABC-like class, but not the GCB-like class. That NF-κB can protect cells from death induced by certain chemotherapeutics may partly explain the poor survival outcomes observed in the ABC-like class. Moreover, the results suggest that patients of the poor-outcome ABC-like class, as defined by gene expression profiling, may derive benefit from treatment with NF-κB inhibitors that are known to work in synergy with chemotherapy to enhance cell death. This hypothesis is currently under investigation in a phase II clinical trial at the National Cancer Institute, Rockville, MD, USA.
In the context of breast cancer, several expression profiling studies have provided preliminary evidence for the existence of therapy-predictive signatures. These studies have relied primarily on empirical approaches that assess, either directly or indirectly, tumor sensitivity to drugs. The direct approach is prospective, involving expression analysis of preoperative tumor biopsies taken in the neoadjuvant setting, and subsequent 'supervised' class prediction to determine whether a multigene predictor can distinguish tumors that will show complete pathologic response (pCR) from those that will exhibit residual or progressive disease. So far, this approach has been used in several contexts to elucidate therapy-predictive signatures for treatments such as docetaxel [14,15], T/FAC (paclitaxel, 5-fluorouracil, adriamycin, and cyclophosphamide) [16,17], AC (adriamycin and cyclophosphamide) [18], and AT (adriamycin and paclitaxel) [19]. Although each study has reported the discovery of predictive genes with some promising classification accuracy, in most cases little or no independent validation has yet been reported. In the largest and most validated of these studies, Hess and colleagues [17] discovered a 30-probe predictor of pCR after T/FAC therapy that, in validation, showed high sensitivity for identifying pCR cases (92%) and a high negative predictive value for predicting cases that exhibited residual disease (96%). In comparison with the predictive power of conventional variables, this result could be viewed as a marginal, but valuable, prognostic improvement, but it will require further validation in larger cohorts to demonstrate significant clinical value.
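For readers less familiar with the two metrics quoted above, the following minimal Python sketch shows how sensitivity and negative predictive value are computed from a two-by-two table of predicted versus observed response. The function name `response_metrics` and the counts are hypothetical, chosen only for illustration; they are not data from the study by Hess and colleagues.

```python
def response_metrics(tp, fn, tn, fp):
    """Return (sensitivity, negative predictive value).

    tp: pCR cases predicted as pCR
    fn: pCR cases predicted as residual disease
    tn: residual-disease cases predicted as residual disease
    fp: residual-disease cases predicted as pCR
    """
    sensitivity = tp / (tp + fn)  # fraction of true pCR cases that were detected
    npv = tn / (tn + fn)          # fraction of 'residual disease' calls that were correct
    return sensitivity, npv


# Hypothetical counts for illustration only (not taken from the source).
sens, npv = response_metrics(tp=9, fn=1, tn=24, fp=6)
print(f"sensitivity = {sens:.2f}, NPV = {npv:.2f}")
```

Note that neither metric depends on the number of false positives, so a predictor can score well on both while still assigning many non-responders to the predicted-pCR group.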
A more indirect approach to identifying therapy-predictive genes involves the retrospective analysis of historical samples in which patient outcome data can be used as an approximate measure of therapeutic response. An advantage of this approach is that it uses a long-term measurement of therapeutic efficacy, such as whether or not the cancer returns over time, rather than a short-term pathologic response that does not always correlate with future outcome. However, a drawback is that the line between therapy prediction and patient prognosis is blurred. Whereas a relapsing cancer can be viewed as a therapy failure, one that does not return may have been successfully treated at surgery and may thus have no bearing on the effectiveness of adjuvant therapy. Nevertheless, prediction of therapy failure can indicate the need for a more aggressive treatment strategy. Studies pursuing this line of investigation have described a 2-gene test [20] and a 21-gene test [21], both for tamoxifen failure, that, when validated, outperformed conventional predictors of recurrence. Not found in these studies, however, was direct evidence that alternative therapies would provide benefit for these patients. In a follow-up to the latter study, Paik and colleagues [22] showed a significant interaction between the 21-gene test and combined tamoxifen and chemotherapy (cyclophosphamide, methotrexate and fluorouracil or methotrexate and fluorouracil), suggesting that women predicted to fail tamoxifen treatment could potentially benefit from additional chemotherapy.
Ultimately, prognostic signatures resulting from empirical methods that group tumors into biologically uncharacterized classes (such as 'responders' and 'nonresponders') may be performance limited. The molecular heterogeneity of breast cancer suggests that the biological programs driving tumor progression are both numerous and diverse, and these programs, operating independently or in aggregate, may dictate how a tumor or subgroup of tumors will progress clinically or will respond to certain drugs. The ability to define these circuitries biologically, parse them out at the transcriptional level, and assess their prognostic associations will allow the identification of tumor subtypes based on pathway activities that not only predict for tumor behaviors but also explain them.

Surrogate signatures
In breast cancer, several clinicopathological markers are frequently used alone or in combination to assess patient risk. For example, lymph node stage, tumor size, and histologic grade are important elements of the major prognostic indices, whereas ER status is widely regarded as the primary predictor of response to hormonal (antiestrogen) therapy. Microarray data sets from large studies of breast cancer have provided unique opportunities to investigate the relationships between gene expression patterns and these clinical/laboratory parameters. These studies have revealed several underlying signatures associated with the primary physiology of the tumor with important prognostic and predictive implications, and suggest that the sum of multiple gene-expression measurements may provide greater diagnostic precision than the biochemical or morphological marker on which they are based.
Perhaps the most apparent and widely observed of these expression signatures is the one that reflects ER status. Composed of hundreds of genes that include known direct and indirect targets of the ER, this signature is strongly correlated with clinical measurements of ER (for example by immunohistochemistry, ligand-binding assay, and enzyme immunoassay) and faithfully partitions tumors into ER-positive and ER-negative classes with reproducible accuracy [2,5,23]. This close link between the signature and ER status is further demonstrated by the observation that the relative levels of the ER signature genes are predictive of ER protein levels as measured by enzyme immunoassay in a panel of human breast tumors [24]. Even the expression of the ER gene itself (as measured by microarray) is highly correlated with ER status [2], leading some groups, for data analysis purposes, to substitute microarray-based ER expression levels for clinical measures of ER status in the absence of clinical data [5,7]. Because the ER transcript is itself a central figure in the ER signature, together with a number of known ER target genes, it is plausible that the transcriptional activity of ER drives expression of the ER signature genes. In this context, the signature could be viewed as a functional readout of ER activity. Recently, the ER signature was analyzed in a cohort of ER-positive tumors and found to be prognostic of disease-free survival in patients receiving adjuvant tamoxifen monotherapy [25], suggesting that a gene-expression-based readout of ER functionality may be a greater predictor of antiestrogen response than a measure based on ER protein level alone.
In a study aimed at understanding the relevance of p53 status in breast cancer prognosis, we recently identified a 32-gene signature capable of distinguishing p53 mutant and p53 wildtype breast tumors with moderate (85%) accuracy [26]. Subsequent analysis of the misclassified tumors, however, shed light on the reason for classification failure. Misclassified wildtype tumors (that is, with the mutant-like signature; n = 26) showed highly significant underexpression of several known direct target genes of p53, as well as the p53 gene itself, whereas p53 mutant tumors with the wildtype-like signature (n = 12) showed significantly higher expression of the p53 target genes than other mutant tumors.
Furthermore, in an independent study of p53 activity, over half of the p53 signature genes identified in the breast tumors were found to be significantly modulated by p53 activation in HCT116 colorectal cancer cells [27]. These observations suggest that the signature, as a gauge of p53 transcriptional endpoints, may be more tuned to p53 function than mutational status as ascertained by the gold standard for mutational analysis, direct sequencing. Moreover, survival analysis of patients with p53 wildtype tumors showed that those with the mutant-like signature had a significantly shorter interval to disease-specific death than those with the wildtype-like profile. In several independent breast cancer cohorts, this signature of p53 deficiency was highly correlated with metastatic recurrence and therapeutic failure, regardless of treatment type, and remained a significant prognostic predictor in multivariate analyses with conventional risk factors, whereas p53 mutational status alone did not. Together, these observations suggest that an expression signature derived from the molecular differences between p53 mutant and wildtype tumors may provide a more comprehensive and clinically useful readout of p53 functionality than mutational status alone.
In a similar vein, we and others have recently investigated the clinical utility of gene expression patterns associated with the histologic grade of breast cancer. Although histologic grade is widely regarded as a strong indicator of disease recurrence, its acceptance as a routine prognostic variable has been limited by the subjective nature of the grading process and its history of inter-observer variability. Recently, a 5-gene genetic grade signature [28] and a 97-gene genomic grade index [29] have been identified, both capable of discriminating grade I and grade III tumors with high accuracy, and partitioning intermediate grade II tumors into grade I-like and grade III-like classes with enhanced prognostic resolution. Patients with grade II disease classified as grade I-like and grade III-like showed significantly different 10-year survival curves, similar to those of patients with histologic grade I and grade III tumors, respectively. Moreover, in multivariate analyses with conventional prognostic variables, we found that the genetic grade signature remained highly significant, even outperforming lymph node status and tumor size in most cohorts analyzed [28]. That most of these signature genes have known functions in cell-cycle-related processes and are significantly correlated with tumor mitotic index and Ki67 scores (A. Ivshina, personal communication) suggests that these grade-associated signatures are also markers of proliferation.
Thus, multigene predictors that objectively capture the prognostic essence of histologic grade and cellular proliferation have surprising precision in assessing risk of recurrence, particularly for women with grade II disease. Indeed, from a purely prognostic perspective, these studies suggest that there is no grade II, only shades of low and high grade. Furthermore, from a biological perspective, these findings offer insight into the pathobiological nature of breast cancer, suggesting that tumors of low and high grade may reflect independent biological entities rather than a continuum through which cancer progresses.

Parsing pathways
The expression signatures derived from ER status, p53 mutation, and histologic grade are products of 'bottom-up' analytical strategies [30] that are biologically motivated rather than empirically derived. These strategies first define relationships between a physiologic or biochemical phenomenon and patterns of gene expression, then use the expression patterns to predict the relative contribution of the phenomenon or pathway to clinical tumor behavior. In contrast to 'top-down' strategies that identify predictive signatures in the absence of biological input, the bottom-up approach has several advantages. First, by defining the downstream genes, insights into the molecular underpinnings of a discrete pathophysiologic phenomenon (such as an oncogenic pathway) are obtained. Second, the transcript levels of the genes themselves can be used to predict the extent of pathway activation in individual tumors, with the potential to select patients for pathway-targeted therapies. Third, such signatures can be assessed singly or in parallel to study the individual and combinatorial effects of distinct pathways on tumor aggressiveness, patient outcome or therapeutic response, in contrast to the dilution of individual pathway contributions that occurs in signatures derived from empirically based methods.
Desai and colleagues at the National Cancer Institute (USA) were among the first to investigate the global transcriptional outputs of multiple oncogenic pathways and their discriminatory powers [31]. Profiling breast tumors of transgenic mice harboring different mammary-gland-specific oncotransgenes (MMTV-Ha-ras, MMTV-neu, MMTV-myc, MMTV-polyoma middle T antigen, C3T-SV40 large T antigen and WAP-SV40 large T antigen), the authors identified expression cassettes unique to the different transgenes, indicating that transcriptional fingerprints of the earliest initiating oncogenic events could be identified within primary tumors.
Building on this concept, Joseph Nevins and colleagues at Duke University have recently published a series of reports that illustrate a systematic approach to the discovery and clinical application of pathway-specific and drug-specific signatures. Using primary mouse embryo fibroblasts [32] and human mammary epithelial cells [33] transfected with oncogenes such as HRAS, MYC, E2F and SRC, the authors identified expression signatures that distinguished oncogene-activated cells from controls. These signatures, representing transcriptional readouts of pathway activity derived in vitro, were then tested for their ability to assess pathway activation states in vivo with the use of mouse and human primary tumors previously characterized for aberrations in these pathways. The relative probability of pathway activation (or deregulation) was then estimated by comparing the configuration of the tumor profiles with that of the (in vitro) pathway-activated signatures. In this manner, the authors demonstrated that, on a probability scale, pathway activity could be predicted in vivo with significant accuracy. When applied to data sets of breast, ovarian and lung tumors, hierarchical clustering of the relative probabilities of pathway activation (as measured for multiple pathway signatures) could distinguish between patient subgroups with significantly different survival rates, demonstrating a strong association between multimodal pathway deregulation and clinical tumor behavior [33]. Moreover, when applied to a panel of cancer cell lines with known sensitivities to pathway-specific compounds (for example, for Ras and Src), the signatures were found to be significantly correlated with drug response [33].
These results demonstrate that expression signatures anchored to pathway activation states may aid in our biological understanding of tumor behavior and potentiate a means for selecting patients who will respond to pathway-specific therapies. Furthermore, where traditional classification methods have involved assigning patients (or tumors) to classes with definitive boundaries, assessing the likelihood that a tumor or patient will exhibit a certain trait (such as pathway deregulation or survival), as demonstrated in these studies, translates class prediction to a probability scale whereby sensitivity relative to specificity may be adjusted according to clinical need.
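The trade-off described above can be illustrated with a minimal Python sketch. It assumes that each tumor has already been assigned a predicted probability of carrying a trait, and shows how moving the decision cutoff exchanges sensitivity for specificity. The function name `sensitivity_specificity`, the probabilities, and the true states are all hypothetical; the sketch does not reproduce the statistical models used in the cited studies.

```python
def sensitivity_specificity(probabilities, labels, threshold):
    """Call a tumor positive when its probability >= threshold, then score the calls.

    probabilities: predicted probability of the trait for each tumor
    labels: True if the tumor truly has the trait, otherwise False
    Returns (sensitivity, specificity).
    """
    pairs = list(zip(probabilities, labels))
    tp = sum(1 for p, y in pairs if y and p >= threshold)
    fn = sum(1 for p, y in pairs if y and p < threshold)
    tn = sum(1 for p, y in pairs if not y and p < threshold)
    fp = sum(1 for p, y in pairs if not y and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predicted probabilities and true trait states for eight tumors.
probs = [0.95, 0.80, 0.65, 0.40, 0.55, 0.30, 0.20, 0.10]
truth = [True, True, True, True, False, False, False, False]

for cutoff in (0.25, 0.50, 0.75):
    sens, spec = sensitivity_specificity(probs, truth, cutoff)
    print(f"cutoff {cutoff:.2f}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

A low cutoff favors sensitivity (few positive tumors are missed), whereas a high cutoff favors specificity (few negative tumors are called positive); which is preferable depends on the clinical consequences of each kind of error.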
Taking these concepts further, Potti and colleagues [34] combined microarray data from the NCI-60 cell lines with historical pharmacologic data generated from the NCI-60 panel at the National Cancer Institute to define expression signatures capable of discriminating between cell lines that are sensitive to various drugs and those that are resistant. In this manner, drug response signatures were obtained for compounds such as docetaxel, topotecan, adriamycin, paclitaxel, 5-fluorouracil, and cyclophosphamide. The predictive capacity of these signatures was then validated by using two types of independent data set: first, those composed of cell line expression profiles generated in independent pharmacologic studies, and second, those composed of primary tumor profiles taken in the context of neoadjuvant therapy. Remarkably, with the latter validation approach, these predictors derived in vitro achieved more than 80% accuracy in each of five independent neoadjuvant studies involving breast and ovarian cancer patients treated with docetaxel, topotecan, adriamycin, or paclitaxel. However, it should be noted that the separation of patients into predicted response groups (sensitive versus resistant) was based on a 'best-fit' line; nevertheless, in each case this line fell close to the 50% probability score, thus introducing only a small bias into the reported accuracies. Furthermore, the authors showed that multiple drug response signatures could be combined to predict sensitivity to multidrug regimens such as T/FAC and FAC (5-fluorouracil, adriamycin, and cyclophosphamide), again with more than 80% accuracy.
Finally, the authors superimposed predictions based on the two types of signature: for drug response and for oncogenic pathways. In one example they found a significant association between predicted activation of the phosphoinositide 3-kinase (PI3-kinase) pathway and predicted docetaxel resistance in the NCI-60 data set. In a separate group of lung cancer cell lines, this association not only remained significant but the cells predicted to be PI3-kinase activated were significantly sensitive to a PI3-kinase inhibitor. This demonstrates that the drug response and pathway activation signatures can not only be used individually to predict treatment outcomes, but can also be combined for insight into the mechanisms modulating drug sensitivity. Together, these studies present a rational knowledge-based approach to individualized treatment, whereby the combinatorial analysis of biologically and experimentally defined expression signatures might one day guide therapeutic decisions that are truly tailored to the unique molecular anatomy of an individual's tumor.
Moving forward with in vitro-based models for building genomic predictors, several important considerations regarding system design and prediction accuracy must be addressed. What is the optimal number of models (namely cell lines, pathway targets, and so on), and how much biological diversity should be included in the system? What phenotypic endpoints should be used (IC50? LC50? a specific time point?) and how do these relate to tumor pharmacokinetics or pathway activation states? How do different classification strategies compare with respect to the robustness and accuracy of the genomic predictors they generate?

Mining mechanisms
The vast quantities of data generated from large-scale expression profiling studies provide a rich ground for exploring the complex and conditional relationships that exist between genes, their expression patterns, and tumor phenotypes. These relationships, although complex, exhibit a natural order governed by biological rules. This order is manifested in the hierarchical structure of gene-gene correlations from which the various prognostic expression signatures have been mined. Although bottom-up investigations have elucidated the biology underlying several of these signatures, most multigene expression patterns associated with prognosis remain biologically anonymous. Understanding this biology, and the transcriptional mechanisms regulating these signatures, may lead to the discovery of new oncogenic pathways and therapeutic targets.
To explore the diversity of gene correlations that underlie the clinical behavior of cancer, we have analyzed large microarray data sets of primary breast tumors for genes that are both coordinately expressed (in clusters) and individually related to clinical outcomes, and have discovered numerous distinct expression cassettes that may signify clinically relevant pathways in breast carcinogenesis (Figure 1). However, a biological definition of these pathways and the mechanisms that regulate them requires more than simple inference: it requires the integration of multiple forms of information (for example biological, clinical, and genomic) coupled with statistical and experimental validation methods.
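Some of the per-gene steps given in the Figure 1 legend (natural-log transformation, mean centering, Pearson correlation, dichotomization of patients at the mean expression value, and selection of clusters with an average correlation above 0.5) can be sketched in a few lines of Python. This is a minimal illustration only: the helper names and intensity values are hypothetical, and the hierarchical clustering and Cox regression steps of the actual analysis are not reproduced here.

```python
import math
from itertools import combinations


def log_center(values):
    """Natural-log-transform one probe set's values, then mean-center them."""
    logged = [math.log(v) for v in values]
    mean = sum(logged) / len(logged)
    return [x - mean for x in logged]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_pairwise_correlation(profiles):
    """Average Pearson correlation over all pairs of profiles in a cluster."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def split_by_mean(values):
    """Mark each sample as above-mean (True) or below-mean (False) expression."""
    mean = sum(values) / len(values)
    return [v > mean for v in values]


# Hypothetical intensities for three probe sets measured across five tumors.
cluster = [
    log_center([120.0, 340.0, 90.0, 560.0, 210.0]),
    log_center([200.0, 610.0, 150.0, 980.0, 400.0]),
    log_center([75.0, 180.0, 60.0, 330.0, 140.0]),
]
keep = mean_pairwise_correlation(cluster) > 0.5  # threshold stated in the legend
groups = split_by_mean(cluster[0])               # patient groups for a survival test
print(keep, groups)
```

In a full analysis the two patient groups defined by `split_by_mean` would then be compared for differences in survival, and only probe sets passing that test would be carried forward to clustering.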
Early microarray studies involving breast cancer cell lines identified a large cluster of coordinately expressed genes associated with cell proliferation rates [35]. Later dubbed the proliferation signature, these genes have since been linked to various aspects of tumorigenesis in breast and other cancer types including neoplastic transformation [36], histologic grade [28,29,37], and poor patient survival [38,39]. (Cluster 4 in Figure 1 represents this signature.) For statistical support of the notion that this signature reflects cellular proliferation in primary breast tumors, we analyzed various subsets of these signature genes for correlations with different forms of biological and clinicopathological information. Gene ontology analysis of the signature genes consistently resulted in the significant enrichment of proliferative processes such as mitosis, cytokinesis, chromosomal segregation, chromatin packaging and remodeling, and DNA metabolism and replication (LDM and ETL, unpublished results). Using clinical tumor annotations, we found significant correlations between expression of the signature genes and pathologic markers of proliferation including Ki67, S-phase fraction and mitotic index (LDM and ETL, unpublished results), further supporting the link between gene expression and tumor cell proliferation. Furthermore, a significant fraction of these signature genes have been observed in cell synchronization experiments involving HeLa cells (cervical carcinoma) as being expressed periodically at specific phases of the cell cycle [40]. Thus, as illustrated in this simple example, the integrative analysis of functional, clinical, and experimental information can provide substantial support for the hypothesis that an expression signature reflects a specific biological phenomenon: in this case, the proliferative capacity of tumor cells.
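Enrichment of a gene ontology term within a signature is commonly assessed with a hypergeometric (one-sided Fisher) test. The text above does not state which test was used, so the following Python sketch should be read as one plausible approach under that assumption rather than as the authors' method. The function name `enrichment_p_value` and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    population: number of annotated genes represented on the array
    annotated:  number of those genes that carry the term of interest
    sample:     number of genes in the signature
    overlap:    number of signature genes that carry the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10,000 annotated genes on the array, 300 of them mitosis
# genes, and a 100-gene signature that contains 20 mitosis genes.
p = enrichment_p_value(population=10_000, annotated=300, sample=100, overlap=20)
print(f"enrichment P = {p:.3g}")
```

Because many terms are usually tested at once, P values obtained in this way would normally be corrected for multiple testing before a term is reported as enriched.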
Integration of additional forms of data, such as genomic sequence, location, and copy number alterations, can potentially expose the transcriptional mechanisms that regulate the expression of these correlated genes. For example, Gasch and Eisen [41], exploring mechanisms of gene co-regulation in yeast, demonstrated that promoter analysis of coordinately expressed genes could reveal significant enrichments of binding motifs specific for the transcription factor(s) responsible for the observed coordinate expression. However, despite the success of this approach in identifying gene regulatory mechanisms in organisms of lower complexity [42], it has so far shown little success in elucidating transcriptional mechanisms in cancer, perhaps owing in part to the greater complexity and lack of spatial compactness of human gene promoters. In a recent study by Kristensen and colleagues [43], the impact of genetic variation on breast cancer gene expression was examined. Using a panel of 50 primary human tumors with matched patient blood samples, the authors found that selected germline SNPs at putative regulatory loci in 115 of 203 candidate genes (of the reactive oxygen species pathway) showed highly significant associations with microarray expression patterns, indicative of both cis-acting and trans-acting effects. In some instances, transcripts associated with SNPs in trans showed significant enrichment for certain gene ontology terms and pathways, suggesting linkages between SNPs and the activity of biological programs. This work indicates that the coordinate expression of genes in breast cancer may be markedly influenced by genetic variation at gene regulatory loci, and opens up a new avenue for the discovery of transcriptional regulatory mechanisms and genetic biomarkers in breast cancer.
Alterations in chromosomal copy number are also manifested in the gene expression patterns of breast cancer. In Figure 1, for example, clusters 7 and 11 are significantly enriched for genes mapping to cytobands 17q12 and 16p13, respectively (see Figure 1 legend). Both loci are frequently amplified in breast cancer, suggesting that the correlated expression of these genes may be explained, in large part, by the transcriptional consequences of genomic amplification. This hypothesis is supported by the work of Pollack and colleagues [44], who first examined the intersection between expression array and array comparative genomic hybridization (CGH) data from breast cancer cell lines and primary breast tumors, and observed that more than 60% of high-level copy-number gains coincided with the coordinate overexpression of involved genes, producing, in effect, a residual expression footprint of a genomic amplicon. The integrative analysis of high-resolution array CGH and microarray expression data is now frequently applied to investigations of the mechanistic context of genomic aberrations. In breast cancer, focused studies on 17q12 and 8p11 have revealed new oncogene candidates in which amplification and overexpression are highly correlated [45,46]. Genes identified by this strategy, such as LSM1, BAG4, and C8orf4 on 8p11, have subsequently been shown to drive neoplastic transformation in vitro, and when expressed in combination can induce growth that is independent of both growth factors and anchorage to substrate [47].
Figure 1. Clustergram of diverse gene expression signatures prognostic of breast cancer recurrence. Tumors (n = 251; columns) and gene probe sets (n = 816; rows) of the Uppsala cohort (GEO ID GSE3494) [26] were hierarchically clustered by using Pearson correlation and average linkage analysis. Probe-set values were natural-log-transformed and mean centered before clustering. Initially, all 44,928 probe sets (on the Affymetrix U133A and U133B arrays) were assessed for survival correlations as follows. The expression value for each gene was used to dichotomize patients into below-mean and above-mean expression groups. The two groups were then assessed for differences in distant metastasis-free survival (DMFS) by Cox regression analysis. Probe sets significantly associated with DMFS (that is, with likelihood-ratio test P values of less than 0.05) were hierarchically clustered as described above, and clusters with average correlations of more than 0.5 were selected for inclusion in the figure. Probe sets within clusters were then averaged for each tumor, and cluster survival associations were determined as described above.
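Some of the preprocessing steps named in the Figure 1 legend can be sketched in a few lines. The sketch below covers natural-log transformation with mean centering, the split of patients into below-mean and above-mean groups, Pearson correlation, and a cluster-level average correlation to compare against the 0.5 cutoff. It omits the Cox regression and the hierarchical clustering themselves. The function names are hypothetical, and taking "average correlation" to mean the average over all probe-set pairs is an assumption, as is assigning values equal to the mean to the below-mean group.

```python
from itertools import combinations
from math import log, sqrt


def log_mean_center(values):
    """Natural-log-transform one probe set's values, then subtract the mean."""
    logged = [log(v) for v in values]
    mean = sum(logged) / len(logged)
    return [x - mean for x in logged]


def dichotomize_by_mean(values):
    """Return True for tumors with above-mean expression, False otherwise."""
    mean = sum(values) / len(values)
    return [v > mean for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_pairwise_correlation(profiles):
    """Average Pearson correlation over all pairs of probe sets in a cluster."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Hypothetical cluster of three probe sets measured in four tumors.
cluster = [log_mean_center(p) for p in (
    [120.0, 340.0, 90.0, 800.0],
    [60.0, 150.0, 55.0, 420.0],
    [200.0, 510.0, 180.0, 990.0],
)]
keep_cluster = mean_pairwise_correlation(cluster) > 0.5
```

The boolean groups returned by `dichotomize_by_mean` would then serve as the single covariate in a survival model, which needs a dedicated statistics package.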
The intersection between gene amplification and overexpression has also been exploited to uncover transcriptional regulators of a prognostic expression signature in breast cancer. In a series of studies, Howard Chang and colleagues explored the relationship between wound healing and cancer progression [30,48,49]. Initial microarray analysis defined an expression signature of serum response in fibroblasts that, when applied to breast and other epithelial cancer data sets, seemed indicative of tumors exhibiting an active wound response [48]. This wound response signature was subsequently found to be prognostic of survival for patients with breast, lung, and gastric cancers [30,48].
To uncover the transcriptional mechanisms driving expression of the wound response genes, Adler and colleagues [49] used a genetic linkage approach (stepwise linkage analysis of microarray signatures (SLAMS)) involving the integration of gene expression and array CGH data. Considering the possibility that the origin of the wound response signature may be rooted in chromosomal alterations, the authors identified genes with patterns of copy number gain or loss that significantly distinguished breast tumors positive and negative for the wound signature. They observed an enrichment of genes localized to 8q and amplified in tumors with the activated wound response. Analysis of the distributions of 8q-amplified genes within tumor groups led the authors to infer a possible regulatory interaction between components of 8q24 and 8q13. Closer examination of the expression patterns of the amplified genes revealed that the MYC gene on 8q24 was the one most highly induced in fibroblasts upon serum stimulation, and the CSN5 gene on 8q13 was the one most highly correlated with the wound signature, suggesting a synergistic role for these two proteins in modulating the expression of the wound signature genes. MYC encodes an oncogenic transcription factor frequently amplified in breast cancer, and CSN5 encodes the catalytic subunit of the COP9 signalosome, a multifunctional activator of cullin-based ubiquitin ligases.
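The first step described above, asking whether a gene's copy number distinguishes signature-positive from signature-negative tumors, can be illustrated with an exact two-sided permutation test on the difference in group means. This is not the SLAMS procedure of Adler and colleagues; the choice of test, the function name, and the copy-number values are all assumptions made for illustration.

```python
from itertools import combinations


def exact_permutation_p(group_a, group_b):
    """Exact two-sided permutation P value for a difference in group means.

    Every way of relabeling len(group_a) of the pooled values as group A
    is enumerated, so this is practical only for small sample sizes.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def abs_mean_diff(sum_a):
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_mean_diff(sum(group_a))
    n_extreme = 0
    n_relabelings = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        # A small tolerance guards against floating-point ties.
        if abs_mean_diff(sum_a) >= observed - 1e-12:
            n_extreme += 1
        n_relabelings += 1
    return n_extreme / n_relabelings


# Hypothetical log2 copy-number ratios for one 8q gene.
wound_positive = [1.0, 1.2, 0.9]
wound_negative = [0.0, 0.1, -0.1]
p_value = exact_permutation_p(wound_positive, wound_negative)
```

Repeating such a test across all genes with copy-number data, with correction for multiple testing, would yield a list of candidate loci of the kind the authors then examined for enrichment on 8q.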
To test for a functional interaction, Adler and colleagues overexpressed MYC and CSN5 in noncancerous MCF10A breast epithelial cells. Co-expression of MYC and CSN5, but not the expression of a green fluorescent protein control or of either gene alone, resulted in the induction of more than 75% of the 255 genes overexpressed in the activated wound signature, as well as significant increases in cellular proliferation and invasion through Matrigel that were consistent with the association between the activated wound response and more aggressive disease. Thus, from in silico prediction to experimental validation, Adler and colleagues demonstrate a methodology of integrative genomic analysis that can facilitate the discovery of complex transcriptional mechanisms regulating gene expression signatures. The added complexity here is that the expression phenotype is manifested only when two cooperating gene products are activated together, a synthetic or conditional effect.

Future challenges
Expression arrays began simply as a method of multiplexing single gene discovery, akin to running several thousand quantitative RNA dot-blots. From this one-dimensional approach evolved the current state of the art: expression profiling to uncover pathway regulation of gene expression and to define molecular classes on the basis of integration of the total signals experienced by the cancer cell. Fundamental to this transition has been the ability to analyze and model complex systems, made possible by mathematical algorithms coupled with computational capacity. It is in this realm of complexity analysis that the future of array-based expression genomics will lie. One can clearly see some of the more immediate areas of expansion.
First, data content can increase. Other characteristics of the transcriptome such as exon usage and noncoding RNAs (including microRNAs) are not well covered by the existing array technologies and their inclusion would inevitably result in greater precision and comprehensiveness. Exon junctions could conceivably be included in the battery of tests yet to be applied. Of course, this will require greater array capacity in terms of encompassing more probes in smaller spaces. Given the advances in microelectronics, those possibilities are currently available but are perhaps not cost-effective for broad biological experimentation.
Second, the analytical systems can be more informed. Although the output of each probe can be treated as independent of that of every other probe, the degrees of freedom of transcriptional systems are, biologically, already constrained by biochemical and even evolutionary reality. Thus, gene X is always coordinately expressed with gene Y, or gene A is always upstream of gene B, or proteins C, E, and F are always in a complex and function only as a unit, never alone. These genetic, biochemical, or physiologic relationships, validated by other means, can be incorporated as 'priors' as we seek higher orders of interaction.
Last, metadata sets will emerge that will markedly expand the ability to validate and to model transcriptional networks of biological and clinical significance. This is already taking place with Oncomine [36], and follows the success of other genomic databases. As a result of standardization, the availability of large numbers of data sets describing the transcriptional behavior of breast cancers has permitted the validation of local observations in silico. In the context of prognosis, the performance of expression signatures can now be validated in and compared across numerous independent cohorts [4,26,28,50], and analyzed in combination for synergistic interactions [30]. At some point, the content of the expression metadata sets for breast cancer will be large enough to sustain continuous activity in data mining, hypothesis generation, and validation. This requires the inclusion of detailed clinical information. In some research communities, this metadata set approach is more advanced. Comparative and evolutionary geneticists use the growing number of complete genomes in publicly available databases as their primary substrate for investigation. In molecular epidemiology, whole-genome SNP databases with linked clinical data are being made available to qualified researchers for analysis and data mining.
These trends will have a great impact on breast cancer research. The advantage will be the ability to be comprehensive and yet precise at the same time, and the speed of discovery will be breathtaking. The challenge, however, will shift to organizational issues. How fast can we validate new marker sets? What kind of incentives can we use to encourage groups to share primary data? How can we sustain teams of computer scientists, basic molecular biologists, molecular pathologists, and oncologists to meet these challenges?