Array-CGH and breast cancer

The introduction of comparative genomic hybridization (CGH) in 1992 opened new avenues in genomic investigation; in particular, it advanced analysis of solid tumours, including breast cancer, because it obviated the need to culture cells before their chromosomes could be analyzed. The current generation of CGH analysis uses ordered arrays of genomic DNA sequences and is therefore referred to as array-CGH or matrix-CGH. It was introduced in 1998, and further increased the potential of CGH to provide insight into the fundamental processes of chromosomal instability and cancer. This review provides a critical evaluation of the data published on array-CGH and breast cancer, and discusses some of its expected future value and developments.


Introduction
The precise aetiology of the majority of breast cancers is elusive, which contrasts with the estimated 5-10% that are caused by inherited mutations. It is clear that breast cancer presents as a collection of distinct disease types that differ in disease progression, treatment response and disease-free survival. In addition to conventional pathology, emerging technologies, including micro-arrays, have proven to be excellent tools in enhancing our understanding of individual breast cancers and providing assistance in treatment decision making in clinically relevant subgroups [1]. Initial observations, and indeed most of our current understanding of chromosome abnormalities, came from conventional cytogenetics. The importance of DNA copy number alterations has been demonstrated in many tumours [2]. More recently, the results of array comparative genomic hybridization (array-CGH) analysis of tumour tissues have been described from several perspectives, including identification of subgroups (class discovery); identification of genes that are involved in tumour progression, metastasis and treatment response; identification of candidate oncogenes (in genetic amplifications) or tumour suppressors (in homozygous deletions); and classification of hereditary cancers. A few studies have also described CGH-based distinctions between sporadic and familial cases of breast cancer. Among familial cases, further classification into BRCA1, BRCA2, or 'other genetic risk factor(s)' is useful both for these families in general and for potential additional gene discovery.
Array-CGH technology is a fairly recent and important upgrade to the groundbreaking (conventional, metaphase chromosome) CGH technology [3,4], and it is applied to detect chromosomal DNA copy number alterations (CNA) in cells or tissues that occur as a result of genomic instability. CGH technology caused a paradigm shift in (clinical) cytogenetics because it avoids the need to culture (tumour) cells. This is perhaps the main reason why there was considerable over-representation of information on nonsolid tumours in cytogenetic databases. Array-CGH can measure genomewide copy numbers in an unprecedented and objective manner. More importantly, complex karyotypes (a hallmark of many tumours) could be analyzed, including those too complex for G-banding, because CGH readout is done on normal metaphase chromosomes. Last but not least, CGH was a spectacular step forward because it preceded and did not require the completion of the human genome (draft) sequence [5]. It was not until 1998, with knowledge of the human sequence, that array-CGH began to replace conventional CGH [6], again with some major improvements.
Breast Cancer Research Vol 8 No 3 van Beers and Petra M Nederlof
summarized as follows: (novel) gene discovery in relation to subtypes, stage, or prognosis of breast cancer; and building class discovery tools for classification of independent breast cancers, regardless of the primary aim of identifying genes.

Technical issues
The effective resolution, sensitivity and reproducibility of array-CGH are constantly being enhanced. Although the maximal technical resolution of CGH is now approximately 140 kb [13], the resolution of the majority of published array-CGH studies is in the order of 1 or 2 Mb. This is the average spacing of probes along the genome but varies considerably by region. The sensitivity of nearly all platforms is sufficient to detect one-copy gains and losses, provided that at least 70% of the cells in the sample are tumour cells. Thus, the ability to detect an aberration decreases with any combination of the following parameters: smaller aberration size (in bases), smaller aberration amplitude (copy number) and smaller tumour subclone representation.
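The dependence of sensitivity on tumour cell content can be made concrete with a small calculation. The sketch below is illustrative only: it assumes a diploid reference, an otherwise diploid tumour genome and an ideal (uncompressed) hybridization response, and the function name `expected_log2_ratio` is our own, not part of any published protocol.

```python
import math

def expected_log2_ratio(tumour_copies, tumour_fraction, normal_copies=2):
    """Expected log2(test/reference) ratio for a region that has
    `tumour_copies` copies in tumour cells, when only `tumour_fraction`
    of the cells in the sample are tumour cells (assumption: the rest
    are normal cells with `normal_copies` copies)."""
    if not 0.0 <= tumour_fraction <= 1.0:
        raise ValueError("tumour_fraction must be between 0 and 1")
    mean_copies = (tumour_fraction * tumour_copies
                   + (1.0 - tumour_fraction) * normal_copies)
    if mean_copies == 0:
        return float("-inf")  # homozygous loss in a pure tumour sample
    return math.log2(mean_copies / normal_copies)

# One-copy gain (3 copies) and one-copy loss (1 copy) at decreasing purity.
for fraction in (1.0, 0.7, 0.4):
    gain = expected_log2_ratio(3, fraction)
    loss = expected_log2_ratio(1, fraction)
    print(f"tumour fraction {fraction:.0%}: gain {gain:+.2f}, loss {loss:+.2f}")
```

Under these assumptions the expected signal of a one-copy change shrinks towards zero as normal cells dilute the sample, which is why a minimum tumour cell content is usually required.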

Choice of probe sets
Whether to employ probe sets including large insert clones such as bacterial artificial chromosome (BAC), yeast artificial chromosome, cosmid and fosmid, as opposed to oligonucleotide (generally 60-70 mers) probe sets, is a major consideration in the design of array-CGH studies. Most characteristics of various platforms are described in detail elsewhere [11]. Clearly, using oligonucleotides provides the greatest flexibility in terms of availability, which is limited only by knowledge of the DNA sequence for the species under investigation. Careful selection of probes is important because the genome contains many pseudogenes and duplicons; this has been done for human and mouse up to about 44k [14-16]. The disadvantage of using short 60-mer probes is the low hybridization signal, which in theory is about 2000 times lower than when using the much larger insert BAC clones, which in turn suffer from variable but high DNA repeat content that must be quenched by Cot1-DNA (the dominant fraction of human repetitive DNA sequence) in the hybridization mix. In practice, to deal with the low signal to noise ratio, a moving average of about 1-5 Mb is often applied, which in effect decreases resolution. The proprietary technology of Agilent (Palo Alto, CA, USA) to print very high-density arrays enables genome-wide oligoarray-CGH to be conducted on one microscope slide; this contrasts with the genome-wide tiling (32k) BAC array, which uses two slides for the same coverage, doubling the cost for the profile. One way to overcome this problem is to use a 1 Mb resolution for the complete genome followed by tiling resolution for the entire genome [17] or for selected regions of interest [18].
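The trade-off between noise reduction and resolution introduced by a moving average can be illustrated as follows. This is a minimal sketch, assuming probes on a single chromosome with positions given in Mb; the function name and the toy values are hypothetical and do not come from any of the cited platforms.

```python
def moving_average(positions_mb, log2_ratios, window_mb=1.0):
    """Replace each probe value by the mean of all probes lying within
    +/- window_mb / 2 of it (single chromosome assumed)."""
    half = window_mb / 2.0
    smoothed = []
    for centre in positions_mb:
        in_window = [ratio for pos, ratio in zip(positions_mb, log2_ratios)
                     if abs(pos - centre) <= half]
        smoothed.append(sum(in_window) / len(in_window))
    return smoothed

# Toy profile: a single-probe gain at 1 Mb flanked by unchanged probes,
# and an isolated probe at 10 Mb with no neighbours inside the window.
positions = [0.0, 1.0, 2.0, 10.0]
ratios = [0.0, 0.6, 0.0, 1.0]
print(moving_average(positions, ratios, window_mb=2.0))
```

In this toy profile the single-probe gain is averaged with its unchanged neighbours and loses most of its amplitude, which shows how smoothing lowers the effective resolution for small aberrations.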

Data analysis
Once an array-CGH profile has been established using any of a range of adequate protocols available [19-27], one may decide to use segmentation statistics to determine the boundaries between and the copy number levels of the aberrant genomic regions. The underlying biology of copy numbers has been modelled and assumes that CGH data are a combination of underlying DNA copy number with a component of Gaussian noise. One such algorithm uses the maximum likelihood criterion adjusted using a penalization term for taking into account model complexity to define breakpoints [28]. Further, in-depth analyses as well as state-of-the-art implementation of more complex assumptions have been described [28-30]. These methods essentially aim to 'de-noise' CGH data to enhance breakpoint identification, such that candidate regions can be identified with greater precision [18,31-40]. With respect to sensitivity in detecting gains and losses, it is also critical to use reference samples that contain a pool of DNA from at least six unrelated healthy individuals to suppress the (false) discovery of aberrations in the control sample, caused by the existence of 'normal' interindividual variation in copy numbers [41]. This is in accordance with what others have experienced, namely that advancing to array-CGH leads to much more data, of similar quality. Having more data per sample poses a risk for over-fitting, in other words a risk for finding significant differences between groups by chance alone. One way in which studies are coping with over-fitting is to increase outlier thresholds (e.g. from log2ratio = 0.2 to 0.3) or to require a certain minimum proportion (say 33%) of tumours to share a particular aberration. This effectively restricts the number of features (probes) left for comparison but it may not always be the optimal choice. Nevertheless, this is a method for finding the dominant features first.
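The idea of penalized maximum likelihood segmentation can be sketched in a few lines. With Gaussian noise of constant variance, maximizing the likelihood is equivalent to minimizing the within-segment squared error, so the sketch below minimizes squared error plus a fixed penalty per segment by dynamic programming. It is a generic illustration under that assumption, not the algorithm of reference [28]; the function name, the penalty value and the toy data are hypothetical.

```python
def segment(values, penalty):
    """Return the breakpoints (indices at which a new segment starts)
    that minimise: sum of within-segment squared error
    + penalty * number of segments."""
    n = len(values)
    # Prefix sums allow the squared error of any segment in O(1).
    s1, s2 = [0.0], [0.0]
    for v in values:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(i, j):
        """Squared error of values[i:j] around its own mean."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    best = [0.0] + [float("inf")] * n   # best[j]: optimal score of values[:j]
    last = [0] * (n + 1)                # last[j]: start of the final segment
    for j in range(1, n + 1):
        for i in range(j):
            score = best[i] + cost(i, j) + penalty
            if score < best[j]:
                best[j], last[j] = score, i

    breakpoints = []
    j = n
    while j > 0:                        # walk back through the segment starts
        if last[j] > 0:
            breakpoints.append(last[j])
        j = last[j]
    return sorted(breakpoints)

# Toy profile: six unchanged probes followed by six probes of a gain.
profile = [0.02, -0.01, 0.03, 0.0, -0.02, 0.01,
           0.58, 0.61, 0.60, 0.62, 0.59, 0.60]
print(segment(profile, penalty=0.1))
```

A larger penalty yields fewer segments; choosing it is the model-complexity trade-off that the penalization term is meant to control.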
We believe that it may be rather optimistic to set a minimum number of samples for identifying significant aberrations at, for instance, 33% because this would easily ignore (not detect) rare breast cancer subtypes that tend to be under-represented in any particular study.
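The effect of such a recurrence filter on rare aberrations can be illustrated with a toy example. The sketch below assumes one list of log2 ratios per tumour, all in the same probe order; the function name, thresholds and data are hypothetical and serve only to show how a 33% requirement discards an aberration that is present in a small subgroup.

```python
def recurrent_probes(profiles, threshold=0.3, min_fraction=0.33):
    """Keep probes that are gained (ratio >= threshold) or lost
    (ratio <= -threshold) in at least `min_fraction` of the tumours.
    Returns {probe_index: (fraction_gained, fraction_lost)}."""
    n_tumours = len(profiles)
    kept = {}
    for probe, ratios in enumerate(zip(*profiles)):
        gained = sum(r >= threshold for r in ratios) / n_tumours
        lost = sum(r <= -threshold for r in ratios) / n_tumours
        if max(gained, lost) >= min_fraction:
            kept[probe] = (gained, lost)
    return kept

# Six hypothetical tumours, three probes. Probe 0 is gained in four
# tumours; probe 2 is strongly gained, but in a single tumour only.
profiles = [
    [0.45,  0.02, -0.05],
    [0.38, -0.01,  0.60],
    [0.05,  0.00,  0.01],
    [0.52,  0.03, -0.02],
    [0.00, -0.04,  0.03],
    [0.41,  0.01,  0.00],
]
print(sorted(recurrent_probes(profiles)))                    # 33% requirement
print(sorted(recurrent_probes(profiles, min_fraction=0.1)))  # relaxed requirement
```

With the 33% requirement only the common gain survives; the strong gain found in one of six tumours, which could mark a rare subtype, is kept only when the requirement is relaxed.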

Array-CGH and expression arrays
Study of gene expression using array technology has garnered enormous popularity, and continues to generate high-impact classifiers of tumour type and disease progression [3,42-44]. One must keep in mind that expression arrays and array-CGH measure different things, which are not always easily correlated. Expression arrays measure relative abundance of specific mRNA transcripts and CGH measures relative copy numbers of genomic DNA regions in a sample. The expression levels per gene vary enormously even in normal cells, depending, for instance, on cell type and cell cycle. The chromosomal fragment (or gene) copy numbers as measured by array-CGH show much less variation, being one or two copies in normal cells, and perhaps ranging between none and 20 copies in tumour cells. It is therefore difficult to correlate expression data with genomic data directly.
Several studies that have correlated DNA copy number data (CGH) with expression profiles have been published [37,45,46]. They indicate a significant correlation between copy number and gene expression. In regions of high-level amplification, 44-62% of genes were reported to be overexpressed [45,46], compared with just 12% of all genes. The relation may in some cases be less straightforward, probably due to the complexity of gene regulation, including pathway feedback loops. An illustration of this is that breast cancers with ERBB2 (17q21) amplification almost invariably overexpress the gene, which contrasts with only 55% of cases with a gain of ERBB1 (7p12) [47]. Arrays that support measurement of DNA as well as RNA include cDNA, oligo-, and single nucleotide polymorphism (SNP) arrays. The benefits of dual use are limited because CGH signals become critically low on short probes; short-probe (high-complexity) arrays tend to be much more expensive than BAC arrays; and the reference sample in expression profiling is not defined but at best held constant across a series of experiments, as opposed to using a completely 2n (normal karyotype) reference for array-CGH.
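The kind of comparison behind the reported percentages can be sketched as follows. The example assumes paired per-gene log2 ratios for copy number and expression; the thresholds, gene values and function name are hypothetical and are not those used in the cited studies.

```python
def fraction_overexpressed(copy_log2, expr_log2,
                           amp_threshold=1.0, expr_threshold=1.0):
    """Return (fraction of amplified genes that are overexpressed,
    fraction of all genes that are overexpressed)."""
    pairs = list(zip(copy_log2, expr_log2))
    amplified = [expr for copy, expr in pairs if copy >= amp_threshold]
    over_in_amplified = sum(expr >= expr_threshold for expr in amplified)
    over_overall = sum(expr >= expr_threshold for _, expr in pairs)
    in_amplified = over_in_amplified / len(amplified) if amplified else 0.0
    return in_amplified, over_overall / len(pairs)

# Eight hypothetical genes: the first three lie in an amplified region.
copy_number = [1.2, 1.5, 1.1, 0.0, 0.1, -0.2, 0.0, 0.05]
expression  = [1.5, 0.2, 2.0, 0.1, 1.2, -0.5, 0.0, 0.3]
in_amplified, overall = fraction_overexpressed(copy_number, expression)
print(f"overexpressed among amplified genes: {in_amplified:.0%}")
print(f"overexpressed among all genes: {overall:.0%}")
```

The second gene in this toy set is amplified without being overexpressed, mirroring the observation that amplification does not always translate into higher transcript levels.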
Another important advantage of CGH compared with expression arrays is that CGH can be conducted reliably using archival formalin-fixed paraffin-embedded material [26,48-50]. Although there is evidence that archival material is also suited to expression analysis, it is only with recently fixed tissues and only with much effort [51].

Perspective from the tumour (cell line) to DNA copy number
One major application of CGH is in the analysis of cancer genomes for genome-wide copy number changes. The resolution of classic chromosomal CGH is limited by the highly condensed state of metaphase chromosomes and further by the optical resolution of the microscopes used to capture the hybridization signals. In addition, chromosome condensation may vary from metaphase spread to metaphase spread. The analysis software normalizes all chromosomes and assumes linear condensation of all chromosomes. As a result, the precise location of aberrations becomes uncertain, which reduces the true effective resolution. In contrast, the resolution of array-CGH is limited only by the density and average length of the probe set printed on the array. Compared with metaphase-CGH, one problem, especially with earlier versions of array-CGH, was the uncertainty regarding the genomic location of the clones, which is dependent on the version of the 'genome build' of the human genome. The ability to detect a change of one copy for both technologies is roughly similar but highly influenced by the frequent admixture of nontumour cells in the sample, as well as by heterogeneity of the tumour tissue (i.e. generally not all cells will have the same ploidy for all chromosomal regions). Both factors are important in the analysis of breast cancer, because breast tumours are heterogeneous and may contain significant proportions of normal cells. Therefore, as a rule CGH should be performed only on tissue samples containing 70% or more tumour cells, or one must meet this requirement through enrichment by macrodissection, (laser capture) microdissection, or flow sorting (fluorescence-activated cell sorting), followed by some means of DNA amplification [27,52].
Contrary to mRNA content, which for a given gene in a cell can range between no copies (not expressed) up to many thousands, the chromosome content of normal (reference) cells is extremely stable (2n, or diploid); when it is unstable, as is frequent in cancer cells, it at least remains quite discrete. This means that extensive regions of the genome are usually still balanced as in the diploid (2n or unchanged) or tetraploid (duplicated) state. Unbalanced regions exist as homozygous loss (0 copies), heterozygous loss (1 copy, or loss of heterozygosity [LOH]), gain (~3 or 4 copies), or amplification (~5 or more copies). The distinction between diploid and triploid or tetraploid genomes cannot be detected by array-CGH but is best determined by some independent cell-based method such as fluorescence in situ hybridization (FISH).
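Why a balanced change in ploidy is invisible to array-CGH follows from the relative nature of the measurement. The sketch below assumes that equal amounts of test and reference DNA are hybridized and that the profile is normalized so that the bulk of the genome sits at a log2 ratio of zero, which makes each ratio relative to the sample's own ploidy; the function name is hypothetical.

```python
import math

def cgh_log2_ratio(region_copies, genome_ploidy):
    """Log2 ratio of a region relative to the sample's overall ploidy
    (assumption: normalisation centres the bulk of the genome at 0)."""
    if region_copies == 0:
        return float("-inf")
    return math.log2(region_copies / genome_ploidy)

# A balanced region looks identical in diploid and tetraploid genomes ...
print(cgh_log2_ratio(2, 2), cgh_log2_ratio(4, 4))
# ... and a 1.5-fold excess gives the same ratio in either background.
print(cgh_log2_ratio(3, 2), cgh_log2_ratio(6, 4))
```

Under this assumption, three copies in a diploid tumour and six copies in a tetraploid tumour are indistinguishable, so an independent cell-based method is needed to establish absolute copy numbers.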

And from DNA copy number back to the tumour
If we reverse the perspective, array-CGH data can also tell us something about tumours. This is similar to how cytogenetic data were used to define subgroups in breast cancer that correlated with histological type, grade and mitotic activity of the tumour [53,54]. CGH has been used to classify breast tumours [55,56]. Because breast cancer is a disease with high levels of chromosome instability, it can readily be studied by CGH. Results from various classification studies have indicated that many gains and losses show recurrences anywhere between 20% and 80% of all tumours in a class, depending on the region and the cancer subtype investigated. Some recurrences exhibit sufficient differential gains or losses between classes to permit their use for classification, as in the case of inherited BRCA1 and BRCA2 mutations [55,56]. Certain regions such as 1q and 8q almost exclusively exhibit gain and rarely loss, whereas 16q often shows loss but hardly ever gain.
Robust classification procedures require more than counting frequencies of specific aberrations in various tumour types. It is important to appreciate that even a significant correlation of, for instance, 16q loss, which is more frequent in lobular breast cancer [57], provides limited statistical prediction power for an individual prospective case. This requires more rigorous, iterative methods of feature and model selection, followed by cross-validations and external validations. Provided that the correct methods are used, we believe that array-CGH can be useful for identifying distinct breast tumour subtypes, and can help to define them further, similar to the study reported by Jonsson and coworkers [58].
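The difference between counting frequencies and estimating predictive power can be illustrated with leave-one-out cross-validation. The sketch below uses a simple nearest-centroid rule as a stand-in classifier; it is not the method of any of the cited studies, and the class labels, feature values and function names are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Assign `sample` to the class whose mean profile is closest."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def squared_distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return min(sorted(centroids),
               key=lambda label: squared_distance(centroids[label], sample))

def leave_one_out_accuracy(x, y):
    """Hold out each tumour in turn, train on the rest, test on it."""
    correct = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_x, train_y, x[i]) == y[i]:
            correct += 1
    return correct / len(x)

# Hypothetical log2 ratios at two regions for six tumours of two classes.
features = [[-0.5, 0.1], [-0.4, 0.0], [-0.6, 0.2],
            [0.1, 0.4], [0.0, 0.5], [0.2, 0.3]]
labels = ["A", "A", "A", "B", "B", "B"]
print(leave_one_out_accuracy(features, labels))
```

Any feature selection has to be repeated inside each round of the loop, using the training tumours only; selecting features on the full data set first would leak information from the held-out case and inflate the estimate. An external validation set remains necessary in addition.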

CGH and cancer
Cancer is the result of a myriad of genetic and epigenetic alterations. Identification of the causal perturbations that confer malignant transformation is a central goal in cancer biology. CGH is a powerful tool for investigating tumours in a genome-wide manner for such candidate regions. This strategy was successfully used to identify c-Myc, Her2Neu, Rab25 and a range of other potential oncogenes [32,59,60]. Vice versa, beginning with knowledge of the causal gene for the tumour, CGH has been useful in elucidating the extent of specific as well as recurrent aberrations in tumours such as those associated with mutations in BRCA1, BRCA2, or P53 in both mouse [61] and human [55,56,58,62,63]. How genomic instability occurs is still not fully understood. Recently, a number of possible mechanisms focusing on the fidelity of chromosome segregation and/or DNA repair have been proposed [64,65], which may hold some indirect clues as to how breast tumour CGH profiles in BRCA1 mutation carriers are different from those in sporadic cases, but they still fail to explain why. While progress is made to elucidate further the precise nature of genomic instability, the resulting specificity of CGH profiles has already been put to good use.
Available online http://breast-cancer-research.com/content/8/3/210

Classification
For instance, CGH profiles have been used in the classification of breast tumours of unknown causality into BRCA1 mutation carriers and noncarriers [56]; to delineate the relationship between synchronous, recurrent and/or metastatic tumours [55,66-71]; and to define the recurrent aberrations that appear to be associated with certain clinical types of breast cancer (e.g. ductal tumours) or with prognosis or clinical course, or both [17,57,72-79] (Table 1). The study by Rennstam and coworkers [72] was performed using metaphase-CGH, but it clearly demonstrated differential 5-year survival statistics (56% versus 96%) for distinct CNA tumour types that were independent of more conventional markers such as grade, and progesterone receptor or node status. Jones and coworkers [73] used CGH to subclassify 86 breast tumours of grade III and basal type into groups with shorter (3.5 years) and longer (15 years) survival.
Others have successfully used metaphase and/or array-CGH profiles for classification and for mutation pre-screening. Both Wessels and coworkers [56] and Alvarez and colleagues [77] conducted studies that were effective in identifying BRCA1-associated tumours based on CGH profiles of the tumours. Wessels and coworkers [56] identified 33 out of 34 proven BRCA1 cases and assigned 10 false-positives among bilateral breast cancer cases (enriched for elevated risk), four of which have since been proven to be true BRCA1 mutation carriers (personal communication). This CGH-driven classification has been repeated by Van Beers and coworkers [55], who reported specific CGH aberrations for BRCA2-related breast tumours. Also along these lines, Alvarez and colleagues [77] built a predictor for BRCA1 and found nearly half of the 'BRCA1-like' CGH profiles in their 'BRCAX' breast cancer type to be hypermethylated on the BRCA1 promoter, suggesting that loss of BRCA1 could have been an initiating event in these tumours.

Candidate gene searches
A number of results of array-CGH analysis of breast tumours and breast cancer cell lines have been published [19,21,22,25,27,80-88] (Table 1). Some of these results, together with large amounts of conventional CGH, FISH and SKY data of breast cancer, are freely available through the Progenetix repository [89,90]; NCI and NCBI's SKY/M-FISH and CGH Database (2001) [91]; Charité, Berlin University [92]; and in the supplementary data of many individual publications. Cytogenetic data have been described [53,54] and are available for more than 1000 cases at NCBI. The following discussion illustrates the use and usefulness of array-CGH for various breast cancer genomes.
A detailed study of 31 advanced archival breast tumours conducted by Nessling and coworkers [47] elegantly demonstrates how specificity, sensitivity and resolution are increased in (matrix) array-CGH compared with conventional CGH by using the same samples on both platforms. They identified 44 genes by array-CGH and verified all of them by PCR in 31 breast tumours. Many of these genes are located in common altered regions in breast cancer, such as AIB1 (amplified in breast cancer-1) at 20q12 (68% of their cases). A novel and significant finding was gain of 6p21, containing CCND3 and p21/WAF1, in 45% of their cases. There is ample evidence implicating these genes in cancer, and so a gain of 6p21 suggests a role for CCND3 and p21/WAF1 in these cancers also.
Cowell and coworkers [93], using FISH, observed a single translocation t(3;9) in the MCF10 'normal breast' cell line associated with an immortalized phenotype. During further passages this immortalized cell lineage (MCF10A) acquired additional alterations including t(6;19) and gain of 1q, detected by CGH. This may suggest that the 1q gain in breast cancer is an early event and thus may explain why it is so common (> 40% of all breast cancers). It is quite remarkable for a metaphase-CGH study, even after verification of the predominant rearrangements by FISH, to close in on potential tumour genes, such as TAFA1 and p16 in this study. A similar approach looking for recurrent alterations among selected (familial) breast cancers by metaphase-CGH did find such a candidate region but it failed to identify predisposing mutations. It is reasonable to assume that array-CGH outperforms conventional CGH in this respect, and is more efficient as a first step toward evaluating candidate regions, mainly because of increased resolution. Such candidate gene searches by array-CGH will be valuable in identifying regions and ultimately genes that are associated with specific phenotypes such as cell growth, anchorage-independent survival and metastasis capacity, which are relevant to and could translate into differential clinical treatment. Although counterintuitive, it might be important to recognize that regions of recurrent aberrations may not contain the mutations sought at all. This is exemplified by the recurrent aberrations in BRCA1 mutant tumours occurring on chromosomes 3, 5 and 10, which are more frequent than aberrations at the BRCA1 locus itself [56]. Thus, CGH has helped to uncover some of the intrinsic complexity of tumour chromosome behaviour, which is still poorly understood.
There are several effective ways to avoid some of the risks involved in searching complexly rearranged genomes for candidate genes. One is to map global gene expression onto genomic positions using comparative expressed sequence hybridization (CESH) [94]. Another is to focus on known candidate regions and simultaneously monitor gene expression as a filtering step to exclude genes. This was the approach used by Garcia and coworkers [18], who were prompted by results reported by Ray and colleagues [95] to study gene expression and CGH at near-tiling resolution of 33 primary breast cancers, 27 breast cancer cell lines and 20 primary ovarian cancers at chromosome 8p11-12 (about 10 Mb), which is a gene-dense region that has been implicated in various tumour types. By cross-comparison they were able to define a minimal region of common amplification that contained four overexpressed and therefore candidate oncogenes, namely FLJ14299, C8orf2, BRF2 and RAB11FIP. This important finding requires confirmation in independent series of breast cancer, but the study clearly demonstrates the power of the CGH approach in combination with other assays.
At the same time, Prentice and coworkers [17] reported the same amplified region in 24% of all 382 cases of breast cancer examined by FISH on tissue micro-arrays followed by BAC array-CGH in five cases. They reported a minimal common amplified region in all five cases centromeric to NRG and FGFR1 (which are also frequently involved in breast cancer) that contains just three genes, namely FLJ14299, SPFH2 (C8orf2) and PROSC. Interestingly, only FLJ14299 and SPFH2 overlap with findings reported by Garcia and coworkers, suggesting that amplification of one of these, or both, could represent functional alterations in breast cancer. The only information available for FLJ14299 is that it resembles a C2H2 zinc-finger type transcription factor that is conserved in Drosophila and zebrafish (Danio rerio). For SPFH2, a potential membrane association is predicted but unproven (GeneCards; www.genecards.org). Although high-level amplification in this region correlated with poor survival in both studies, further studies will be necessary to determine whether these genes have the capacity of an oncogene. Another candidate gene, namely Rab25, was identified through array-CGH by Cheng and coworkers [32]. Rab25 was found to stimulate anchorage-independent cell survival and was thus characterized as a potential driver of ovarian and breast tumour development. Those investigators showed that the copy number of this 1q22 region correlated with differential disease-free survival in ovarian cancer patients, further suggesting that 1q22 is associated with tumour aggressiveness.
The above examples illustrate the usefulness of array-CGH in cancer genome research and justify further investigation involving mapping more candidate cancer genes in a relatively straightforward and high-throughput manner.

Hereditary breast (and ovarian) cancer and array-CGH
Jonsson and coworkers [58] reported array-CGH findings in 26 cases of hereditary breast cancer and 26 cases of sporadic breast cancer. They reported one recurrent amplicon (3q27.1-3) in 71% of 14 BRCA1 breast tumours and one recurrent amplicon (17q23.3-24.2) in more than 75% of 12 BRCA2 breast tumours. They further identified a set of 169 BAC clones out of their total 3.6k probe set, with sufficiently differential log ratios between BRCA1, BRCA2 and sporadic cancer to permit its use for classification of hereditary cases. However, this classifier will only gain validity after independent CGH profiles (a validation set) can verify it.
Wessels and colleagues [56] built a molecular classifier with a performance of 84% for detecting BRCA1 tumours. This classifier was made using 36 breast tumours from proven BRCA1 mutation carriers and 30 breast tumours from 30 independent bilateral (elevated risk) breast cancer patients. This classifier was built from metaphase-CGH but performs well on array-CGH data from BRCA1, BRCA2, sporadic and BRCAX cases (van Beers EH, unpublished data). A classifier for BRCA2 has been more problematic. Using metaphase-CGH, a number of significant and recurrent genomic aberrations in 25 BRCA2 tumours were described [55] but the collective predictive power fell short of building a reliable classifier. We believe that the improved resolution together with the improved reproducibility of array-CGH might generate the data necessary for a BRCA2 classifier and possibly also for other subtypes in the future.

Prognostic information from CGH data
Callagy and coworkers [96] have made an important contribution to clinical prognostication using an array-CGH based study. They asked how array-CGH profiles differed between short-term survivors (< 5 years) and long-term survivors (> 10 years). They found that a trio of genes (TOP2A, ERBB2 and EMS1) identified a statistically significant (P = 0.01) difference in prognosis between these subgroups. The differential expression for these genes was further substantiated by tissue microarray FISH. Quite surprisingly, their good and bad prognosis groups did not exhibit differences in grade, size, or oestrogen receptor status, features that are currently most widely accepted for clinical prognostication. Because results were not stratified by tumour type (40 of 52 tumours were invasive ductal carcinoma), this study does not permit a definitive conclusion to be drawn about whether this prognostic set of three genes will be equally valuable in all breast cancer types. It is worth mentioning that their CGH probe set consisted of just 57 cancer-gene selected loci. This suggests that larger probe sets could identify more markers, and more relevant ones.
Cingoz and colleagues [57] reported a number of recurrent CGH observations that appear to correlate with certain disease characteristics. Their list of most frequent gains includes 1q (55%), 8q (52%) and 20q (29%), which is similar to many other breast cancer CGH studies and can be considered general breast cancer copy number gains. Other regions were distinct to subgroups, such as 16q loss in seven out of nine (78%) cases of invasive lobular carcinoma compared with five out of 18 (28%) cases of invasive ductal carcinoma. Since invasive lobular and ductal carcinoma are easily distinguished by pathologists, their findings are probably most important to our understanding of the intrinsic biology of these different tumour types rather than being beneficial for diagnosis or clinical decision making.
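The counts quoted above for 16q loss (seven of nine lobular versus five of 18 ductal carcinomas) can be checked with an exact test on the 2 x 2 table. The sketch below is our own illustration using a two-sided Fisher's exact test, which is not necessarily the test applied in the original study; the function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    sum the probabilities of all tables with the same margins that are
    no more probable than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denominator = comb(total, row1)

    def probability(k):
        return comb(col1, k) * comb(total - col1, row1 - k) / denominator

    observed = probability(a)
    lowest = max(0, row1 - (total - col1))
    highest = min(row1, col1)
    return sum(probability(k) for k in range(lowest, highest + 1)
               if probability(k) <= observed * (1 + 1e-9))

# Rows: lobular, ductal. Columns: 16q loss, no 16q loss.
p_value = fisher_exact_two_sided(7, 2, 5, 13)
print(f"two-sided P = {p_value:.3f}")
```

On these counts the difference falls below the conventional 0.05 level, yet the numbers are small, which is in line with the caution that a significant association in a series does not by itself give strong predictive power for an individual case.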

Pathways of instability
It has long been recognized that several different pathways of genomic instability exist. Probably most apparent is the difference between chromosomal instability and microsatellite instability, the former being more frequent in breast cancer and the latter being found more often in relation to MLH1 mutations that occur frequently in colon cancers [97]. It seems likely that certain tumours arrive at an aneuploid state through a series of events starting with telomere dysfunction, followed by polyploidization as a cellular rescue/survival event. Then, as a direct result chromosomes suffer numerous additional breakage-fusion events, resulting in seemingly chaotic accumulation of chromosome gains and losses [98]. This scenario is thought to occur frequently in adenocarcinomas, including breast cancer, with high-level aneuploidy. With respect to breast cancer, at least two distinct genome instability pathways exist and are described in detail in a recent review by Reis-Filho and coworkers [7].

Some related array-based methods
Any review of the analysis of chromosomal copy numbers in breast cancer by array-CGH would be incomplete without mention of some related technologies (e.g. genome-wide LOH analysis, SNP [haplotype] analysis, CESH [94,99] [also called expressive genomic hybridization [100]], large-scale PCR [101] and expression array studies) that often have similar or overlapping goals, including classification, clinical prognostication, defining treatment subgroups, finding novel genes, and so on.
Several studies have already demonstrated the feasibility of performing genome-wide LOH and SNP analysis, for example by using Illumina bead-array technology on formalin-fixed paraffin-embedded tumour tissues [102-104]. These technologies have a clear advantage over (array) CGH in that they extract haplotype information in addition to chromosome copy number. This is especially relevant in chromosomal uniparental isodisomy, inherited or acquired through mitotic recombination resulting in variably extended stretches of homozygosity (i.e. LOH) in the absence of copy number changes. This has been observed in breast cancer and may be more frequent than is generally acknowledged [105].

Conclusion
Array-CGH is a reliable, sensitive and high-resolution method that is highly automated compared with metaphase-CGH, which it is now replacing; thus, array-CGH is expected to generate an enormous amount of data in the coming years.
One limitation of all current breast cancer array-CGH studies is the limited number of samples used per study. It poses the classic risk for over-fitting for as long as the average study contains two orders of magnitude more features (i.e. probes) than samples. One way to tackle this limitation is to have available the combined data sets or a repository for meta-analyses of CGH data, histological data, immunostaining data, clinical data, and so forth.
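The scale of the problem can be shown with simple arithmetic. The sketch below assumes independent tests and no true differences between groups, which is a simplification because neighbouring probes are correlated; the Bonferroni correction shown is one standard remedy that we introduce here for illustration, and the function names are hypothetical.

```python
def expected_false_positives(n_features, alpha=0.05):
    """Expected number of features passing a per-feature significance
    level `alpha` by chance alone (independent tests assumed)."""
    return n_features * alpha

def bonferroni_threshold(n_features, alpha=0.05):
    """Per-feature significance level that keeps the chance of any
    false positive at or below `alpha` (Bonferroni correction)."""
    return alpha / n_features

# A probe set of 3,600 features, comparable in size to a 3.6k BAC array.
n_probes = 3600
print(expected_false_positives(n_probes))
print(bonferroni_threshold(n_probes))
```

With a few thousand probes and a few dozen tumours, well over a hundred probes would be expected to differ 'significantly' between any two groups by chance, which is why independent validation sets and pooled repositories matter.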
This review has described the two separate goals of CGH in breast cancer: gene discovery and class discovery. For both we have given successful examples. An example of gene identification is TAFA1, the significance of which in cell transformation was found by array-CGH and was verified extensively with other methods [93]. These data clearly prioritize this locus for further functional studies of this relatively uncharacterized gene. An example of class prediction by array-CGH is the study by Jonsson and coworkers [58], who built classifiers for BRCA1 and BRCA2 breast tumours.
One further conclusion of our review of the array-CGH literature is that the known heterogeneity of breast cancer also seems reflected in array-CGH data, such that multiple types of profiles have been reported, sometimes with clear but sometimes with less clear associations with known types of breast cancer. The fact is that the number of described subtypes of breast cancer increases with increasing sensitivity and resolution of newer methods. Although novel technology has no a priori knowledge of the problem analyzed, one would generally be satisfied if a new technology would permit enhanced stratification of disease entities. However, in breast cancer we see a trend toward discovery of more subgroups that sometimes appear independent of more traditional classification features, such as grade, size and immunochemistry (e.g. P53, oestrogen receptor, progesterone receptor, ERBB2) [96]. One interpretation is that the within-group variability, for instance among all grade III cases or all oestrogen receptor negative cases, is still quite large. In our opinion, this could account for the independent clustering when using CGH data compared with other tumour characteristics, even when the same tumour set is studied.
In the near future, when increasing numbers of studies generate higher resolution data in conjunction with allele-specific information, such as produced by SNP arrays or Illumina bead arrays, it may become possible also to elucidate some effects of genetic backgrounds or genotypes on CGH as a proxy for genomic instability. Nevertheless, genetic background appears to impact on CGH profiles [106], and this type of information should hopefully teach us more about the biology of chromosomal instability. This touches on a possible relation between CGH profiles and human genetic diversity that can best be studied on genome-wide SNP/CGH arrays and could be extremely useful in locating putative (median and low risk) breast cancer genes.
Despite the current trend in which oligo array-CGH is gradually replacing BAC array-CGH because of its flexibility (clone-less, fast and custom print-on-demand, and probably the most powerful advantage of assessing copy number and allele information simultaneously), it seems that BAC arrays will remain important for profiling DNA from formalin-fixed material.

Competing interests
The authors declare that they have no competing interests.