Mapping the location of recurring amplicons in array-CGH data
© BioMed Central 2005
Published: 17 June 2005
Copy number alterations (CNAs) are believed to constitute key genetic alterations in the cellular transformation of many tumors [1]. Microarray-based comparative genomic hybridization (array-CGH) allows the construction of high-resolution genome-wide maps of copy number alterations, and statistical software packages are available for exploring and analysing array-CGH data (see, for example, [2, 3]). These packages help delineate the boundaries of CNAs in individual tumors and thereby help localize and identify potential oncogenes and tumor suppressor genes. Although CNAs vary widely in size and location, some genomic regions are known to be altered much more frequently than others. Mapping these CNA hotspots helps to locate genes of potential importance to tumor development and to identify alterations that form key steps in that development. There is, however, a need for consistent ways of combining array-CGH results from different arrays. Here, we present a statistical modelling-based approach to this problem.
Suppose that for each gene (clone) on an array we have a binary (0/1) variable indicating whether the gene is amplified or not. Such data may be constructed from array-CGH data using one of the aforementioned software packages. Each tumor may then be represented by an m-dimensional binary vector, where m is the number of genes on the array. For an experiment involving n tumors we thus have a set of m-dimensional vectors z1, ..., zn, which we consider to be realizations from a multivariate distribution P(z). We consider three models for P(z) of increasing sophistication. The first assumes complete independence between genes, the second assumes a Markov-chain dependence structure and the third assumes a Markov Random Field dependence structure [4]. We demonstrate how P(z) can be estimated in each case and show that, by suitable constrained maximization of P(z), we may determine genomic intervals in which copy number alteration is likely to occur.
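As a rough illustration of the first two models, the sketch below estimates P(z) from a small set of binary vectors under the independence assumption and under a first-order Markov chain. It is not the authors' implementation: the function names, the toy data, the additive smoothing constant `eps` and the choice of position-specific transition probabilities are all assumptions made for the example. The Markov Random Field model and the constrained maximization step are not covered.

```python
import math

# Toy data (hypothetical): 4 tumors, 4 genes ordered by genomic position.
tumors = [
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
]

def fit_independence(vectors, eps=0.5):
    """Per-gene amplification probabilities; eps is an assumed smoothing term."""
    n, m = len(vectors), len(vectors[0])
    return [(sum(z[j] for z in vectors) + eps) / (n + 2 * eps) for j in range(m)]

def log_prob_independence(z, p):
    """log P(z) when genes are treated as independent Bernoulli variables."""
    return sum(math.log(pj if zj else 1.0 - pj) for zj, pj in zip(z, p))

def fit_markov_chain(vectors, eps=0.5):
    """Estimate P(z_1 = 1) and, for each later gene j, P(z_j = 1 | z_{j-1} = a).

    Position-specific transitions are an assumption of this sketch.
    """
    n, m = len(vectors), len(vectors[0])
    init = (sum(z[0] for z in vectors) + eps) / (n + 2 * eps)
    trans = []
    for j in range(1, m):
        row = []
        for a in (0, 1):
            prev = [z for z in vectors if z[j - 1] == a]
            ones = sum(z[j] for z in prev)
            row.append((ones + eps) / (len(prev) + 2 * eps))
        trans.append(row)  # row[a] = P(z_j = 1 | z_{j-1} = a)
    return init, trans

def log_prob_markov(z, init, trans):
    """log P(z) under the first-order Markov chain fitted above."""
    lp = math.log(init if z[0] else 1.0 - init)
    for j in range(1, len(z)):
        p1 = trans[j - 1][z[j - 1]]
        lp += math.log(p1 if z[j] else 1.0 - p1)
    return lp

p = fit_independence(tumors)
init, trans = fit_markov_chain(tumors)
print(log_prob_independence([0, 1, 1, 0], p))
print(log_prob_markov([0, 1, 1, 0], init, trans))
```

Comparing the two log-probabilities for the same vector shows how the Markov-chain model treats a contiguous run of amplified genes differently from the independence model.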
The method is demonstrated (for all three models) on simulated binary copy number status data for varying numbers of genes and tumors. We also demonstrate its use on real array-CGH data that have been processed by CGH-Explorer [2] in order to obtain a binary copy number status vector for each tumor.
We have proposed a novel statistical method for the derivation of probable intervals of CNA, based on copy number status data from a sample of tumors. The method is based on a probabilistic model for the copy number status in a tumor, and we have discussed three models of increasing sophistication. The most basic of the three models corresponds to simply reporting all genes that are amplified in at least k% of the tumors. The other two models take into consideration the important fact that neighboring genes are not, in general, altered independently of each other. Utilizing this property of copy number data allows derivation of probable intervals of CNA that are less prone to noise degradation than alternative methods. In addition, results are derived in the context of a well-defined probabilistic framework and are therefore more easily interpretable.
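The most basic model, which reports all genes amplified in at least k% of the tumors, can be sketched as follows. The function names and the toy data are hypothetical, and merging adjacent reported genes into intervals is an illustrative step added here; it assumes the genes are listed in genomic order.

```python
def frequent_genes(vectors, k):
    """Flag each gene that is amplified in at least k percent of the tumors."""
    n, m = len(vectors), len(vectors[0])
    return [100.0 * sum(z[j] for z in vectors) / n >= k for j in range(m)]

def to_intervals(flags):
    """Merge runs of consecutive flagged genes into (start, end) index pairs."""
    intervals, start = [], None
    for j, flagged in enumerate(flags):
        if flagged and start is None:
            start = j
        elif not flagged and start is not None:
            intervals.append((start, j - 1))
            start = None
    if start is not None:
        intervals.append((start, len(flags) - 1))
    return intervals

# Toy data (hypothetical): 4 tumors, 6 genes ordered by genomic position.
tumors = [
    [0, 1, 1, 0, 0, 1],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 1],
    [1, 1, 1, 0, 0, 0],
]

print(to_intervals(frequent_genes(tumors, 50)))  # [(1, 3), (5, 5)]
print(to_intervals(frequent_genes(tumors, 75)))  # [(1, 2)]
```

With the 50% threshold an isolated gene is reported as its own interval, which illustrates why a model that ignores dependence between neighboring genes can be sensitive to noise.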
1. Lengauer C, Kinzler KW, Vogelstein B: Genetic instabilities in human cancers. Nature. 1998, 396: 643-649. 10.1038/25292.
2. Lingjærde OC, Baumbusch LO, Liestøl K, Glad IK, Børresen-Dale AL: CGH-Explorer: a program for analysis of array-CGH data. Bioinformatics. 2005, 21: 821-822.
3. Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R: A method for calling gains and losses in array CGH data. Biostatistics. 2005, 6: 45-58. 10.1093/biostatistics/kxh017.
4. Cressie NAC: Statistics for Spatial Data. 1993, New York: John Wiley & Sons.