Next-generation sequencing

Next-generation sequencing (also known as massively parallel sequencing) technologies are revolutionising our ability to characterise cancers at the genomic, transcriptomic and epigenetic levels. Cataloguing all mutations, copy number aberrations and somatic rearrangements in an entire cancer genome at base pair resolution can now be performed in a matter of weeks. Furthermore, massively parallel sequencing can be used as a means for unbiased transcriptomic analysis of mRNAs, small RNAs and noncoding RNAs, genome-wide methylation assays and high-throughput chromatin immunoprecipitation assays. Here, I discuss the potential impact of this technology on breast cancer research and the challenges that come with this technological breakthrough.


Introduction
Since the publication of the first draft of the human genome sequence [1,2], the field of genomics has changed dramatically. Most importantly, the availability of this information has led to a technological boom, with the development of high-throughput methods that could be used to interrogate the wealth of data available in the human genome and transcriptome. The fields of genomic and transcriptomic science have expanded at an unprecedented pace.
In the past decade we have witnessed the rise of microarrays, a technology that has been extensively applied to the study of cancer genomes and transcriptomes. Of all solid cancers, breast cancer has been the most comprehensively studied using these methods. Although some of the promises of microarrays have not materialised in the time frame that some proponents of this technology foresaw, the high-throughput data generated in microarray-based experiments have changed the way breast cancer is perceived [3,4]. The approach has brought to the forefront of cancer research the concept of breast cancer heterogeneity: that distinct molecular subtypes of breast cancer are underpinned by distinct genetic and epigenetic aberrations, and that distinct subtypes of breast cancer may have their prognosis and response to therapy governed by distinct molecular pathways and networks [5,6]. It should be noted, however, that microarray-based expression profiling and comparative genomic hybridisation provide data with important limitations. For instance, microarray-based expression profiling only provides a semiquantitative assessment of gene expression; it is limited by the nature of the probes included in the platform and their sensitivity and specificity. Comparative genomic hybridisation and SNP array analysis have provided a wealth of data on gene copy number aberrations in breast cancer and have helped identify potential therapeutic targets for subgroups of breast cancer patients; however, these technologies do not provide any information about structural genomic aberrations and base pair mutations [7].
An ideal tool for the genetic characterisation of cancers is one that could provide information about copy number aberrations, allelic information, somatic rearrangements and base pair mutations in a single experiment [7]. Furthermore, data generated with such technology should be presented in such a way that the presence of cells other than cancer cells in the samples would not constitute an insurmountable hurdle. Such a tool, a few years ago, would belong to the realms of science fiction.
Technology, however, has evolved at an unprecedented pace. We are currently witnessing yet another molecular revolution, one that will most certainly dwarf the paradigm shifts brought about by the introduction of microarrays: the advent of massively parallel sequencing (also known as next-generation sequencing). This technology allows for the accrual of qualitative and quantitative information about any type of nucleic acid in a given sample at an incredible throughput while incurring relatively limited costs (reviewed in [8][9][10][11][12][13]).
Advances in instrumentation coupled with the development of high-performance computing and bioinformatics have reduced the cost of sequencing. However, increases in the throughput of Sanger DNA sequencing are achieved by the use of additional sequencers in parallel, owing to the requirement of gel electrophoresis or additional wells for the capillary sequencing of each reaction.
Using different approaches, massively parallel sequencing methods overcome the limited scalability of traditional Sanger sequencing by creating micro-reactors and/or attaching the DNA molecules to be sequenced to solid surfaces or beads, allowing for millions of sequencing reactions to happen in parallel. At present, there are four technologies commercially available and several other promising approaches are in various stages of development and implementation (Table 1) (reviewed in [8][9][10][11][12][13]). The current generation of massively parallel sequencers has led to a quantum leap in our ability to sequence genomes, so much so that 10-fold coverage of the human genome (30 Gb DNA sequence) can be obtained in a single run for no more than US$15,000 to US$20,000. (Note that the Human Genome Sequencing Consortium generated 3 Gb at the cost of approximately US$3 billion and took 13 years!) Perhaps more important than the sequencing throughput provided by this technology and its relatively low cost compared with traditional sequencing methods is the type of data it generates. Instead of long reads generated from a PCR-amplified sample, massively parallel sequencing methods provide much shorter reads (~21 to ~400 base pairs), but millions of them [8][9][10][11][12][13]. Unlike previous sequencing methods that required DNA amplification (that is, the final sequence was representative of the modal population of DNA templates), sequencing can now be performed from single DNA molecules. The short reads generated in the sequencing of each DNA molecule can be counted and quantified, allowing the identification of mutations in nonmodal populations of cells (that is, identification of a somatic mutation in a small subpopulation of cells immersed in a modal population with wild-type sequences) and accurate copy number assessment of each genomic region ([14] and references therein).
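The coverage arithmetic quoted above can be checked with a minimal sketch. It assumes a haploid genome size of roughly 3 Gb (as stated in the text), reads of a single fixed length, and average coverage defined simply as total sequenced bases divided by genome size; the function names are illustrative and do not come from any sequencing software.

```python
import math

HUMAN_GENOME_BP = 3_000_000_000  # ~3 Gb, the genome size quoted in the text


def total_bases(genome_size_bp: int, fold_coverage: int) -> int:
    """Total sequence needed for a given average fold coverage."""
    return genome_size_bp * fold_coverage


def reads_required(genome_size_bp: int, fold_coverage: int, read_length_bp: int) -> int:
    """Number of fixed-length reads needed to reach the target coverage."""
    return math.ceil(total_bases(genome_size_bp, fold_coverage) / read_length_bp)


if __name__ == "__main__":
    print(total_bases(HUMAN_GENOME_BP, 10))          # 30000000000, i.e. 30 Gb
    print(reads_required(HUMAN_GENOME_BP, 10, 400))  # 75000000
    print(reads_required(HUMAN_GENOME_BP, 10, 21))   # 1428571429
```

Under these simplifying assumptions, 10-fold coverage requires tens of millions of reads at the long end of the quoted read-length range and over a billion at the short end, which illustrates why throughput rather than read length defines these platforms.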
In addition, with the recent introduction of approaches that allow both ends of a DNA molecule to be sequenced (that is, paired end massively parallel sequencing or mate pair sequencing), it has become possible to detect balanced and unbalanced somatic rearrangements (for example, those that give rise to fusion genes) in a genome-wide fashion [12,14,15].
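The principle behind rearrangement detection with paired end reads can be sketched as follows: when the two ends of one DNA fragment map to different chromosomes, or much further apart than the fragment size allows, the pair is a candidate for spanning a rearrangement breakpoint. The code below is a deliberately simplified illustration; the ReadPair structure, the 500 base pair threshold and the category names are assumptions made for this example, and real analyses also consider read orientation, mapping quality and support from multiple independent pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    """Mapped positions of the two ends of one sequenced DNA fragment (hypothetical)."""
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int


def classify_pair(pair: ReadPair, max_insert_bp: int = 500) -> str:
    """Label a pair as concordant or as a candidate rearrangement signal."""
    if pair.chrom1 != pair.chrom2:
        return "interchromosomal"      # ends map to different chromosomes
    if abs(pair.pos2 - pair.pos1) > max_insert_bp:
        return "distant"               # same chromosome, but too far apart
    return "concordant"                # consistent with the expected fragment size


if __name__ == "__main__":
    pairs = [
        ReadPair("chr1", 1_000, "chr1", 1_300),
        ReadPair("chr1", 1_000, "chr1", 2_000_000),
        ReadPair("chr1", 1_000, "chr17", 5_000),
    ]
    for pair in pairs:
        print(classify_pair(pair))
```

A single discordant pair may be a mapping or library artefact; clusters of pairs that agree on the same pair of loci make much stronger candidates.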
Not surprisingly, this massive increase in throughput has come at a cost, with the accuracy of each short read being significantly lower than the output generated from Sanger sequencing. Although this is circumvented by the depth of sequencing (that is, multiple reads of the same region), it is accepted that physical validation using traditional sequencing methods is required. Note that each type of next-generation sequencing leads to specific types of artefacts (reviewed in [8][9][10][11][12][13]); however, as we are writing the book on next-generation sequencing as we go along, one should be aware of unexpected artefacts, and new findings should be interpreted with caution.
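Why depth compensates for the lower accuracy of individual reads can be illustrated with a simple binomial model. The sketch below assumes that errors occur independently across reads at a fixed per-base error rate; the 1% rate used in the example is illustrative rather than a property of any platform. The independence assumption is exactly what platform-specific systematic artefacts violate, which is one reason physical validation remains necessary.

```python
from math import comb


def prob_at_least_k_errors(depth: int, error_rate: float, k: int) -> float:
    """Binomial probability that at least k of `depth` reads are wrong at one position.

    Assumes independent errors with the same per-base error rate in every read.
    """
    return sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(k, depth + 1)
    )


if __name__ == "__main__":
    # Chance that erroneous reads make up at least half of the reads at a position.
    for depth in (1, 10, 30):
        k = (depth + 1) // 2
        print(depth, prob_at_least_k_errors(depth, 0.01, k))
```

Under this model, the chance that erroneous reads reach half of the reads at a position falls steeply as depth increases, whereas a systematic artefact that recurs in every read would not be diluted in this way.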
What can be done with massively parallel sequencing?
Next-generation sequencing has already been applied to resequencing studies, which have led to sequencing of complete normal and cancer genomes being performed in a matter of weeks [16][17][18]. Massively parallel sequencing can be employed for the simultaneous characterisation of cancer genomes in terms of somatic base pair and in-del mutations, balanced and unbalanced rearrangements, and copy number changes in a single experiment [14,18]. Apart from sequencing whole genomes, massively parallel sequencing can be coupled with DNA capturing methods for focused analysis of specific genomic regions, specific genes or the whole exome [19]. In fact, the Breast Cancer International Cancer Genome Consortium has pledged to sequence the genomes of 1,500 breast cancers [20]. This study will provide a comprehensive catalogue of the genetic alterations found in breast cancer in general and in the different subtypes of the disease.
Massively parallel sequencing can be applied to germline DNA for gene association studies and for the analysis of cancer genomes [8][9][10][11][12][13][14], and may constitute a paradigm shift in the way mutations that cause rare diseases are identified. In fact, the power of this technology to unravel genes whose germline mutations cause rare Mendelian disorders is exemplified by the identification of MYH3 germline mutations as a cause of Freeman-Sheldon syndrome through the targeted sequencing of all protein-coding regions (exomes) of four individuals with this syndrome and eight unrelated individuals [19]. When interpreting the results of targeted exome and whole genome sequencing studies of a small number of subjects, investigators will have to deal with the previously underestimated number of private SNPs and copy number DNA polymorphisms. The 1000 Genomes Project, however, will provide a more complete catalogue of SNPs, copy number polymorphisms, and short insertion and deletion polymorphisms in the general population [21], which may facilitate the discovery of pathogenic germline mutations.
In addition to the ability to sequence DNA, massively parallel sequencing can be applied to sequencing RNA [22]. Four main applications have already been developed, namely digital gene expression, RNA sequencing, paired end RNA sequencing, and small and noncoding RNA sequencing. An in-depth discussion of these methods and their impact on our ability to perform transcriptomic analyses is beyond the scope of this short communication, and readers are referred to excellent reviews on this topic [13,22]. Suffice it to say [...], which are RNA molecules resulting from the co-splicing of two genes that are contiguous in the genome in the absence of a structural genomic aberration. When combined with DNA massively parallel sequencing, RNA sequencing has the potential to unravel RNA editing events, such as the nonsynonymous transcript editing of the COG3 and SRP9 genes in a metastatic invasive lobular carcinoma [18]. Furthermore, massively parallel sequencing studies of noncoding and small RNAs coupled with the results of the ENCODE project [28] are likely to reveal a level of transcriptional regulation far beyond our current models.
Modifications of the protocols for massively parallel sequencing also allow for an unbiased assessment of DNA methylation [29,30] and histone acetylation, and are likely to replace microarrays in the analysis of high-throughput chromatin immunoprecipitation assays [13,31]. Next-generation sequencing is also replacing microarrays in high-throughput RNA interference screens: one can perform genome-wide screens to identify genes that interfere with the viability of cancer cells using pools of short hairpin RNAs, and the results can be deconvoluted using next-generation sequencing [32]. This latter approach is likely to provide a wealth of information on genes that are selectively required for cancer cell survival and potential drug targets.

Massively parallel sequencing: opportunities and challenges
The multiple applications and uses of massively parallel sequencing are likely to reshape several aspects of breast cancer research. Given the unprecedented ability to identify mutations, copy number aberrations and somatic rearrangements in cancer genomes, the information accrued by massively parallel sequencing of breast cancers may lead to a paradigm shift in the way breast cancers are classified. In fact, this technology offers a unique opportunity to move from the current descriptive and prognostic classification systems to a functional genomic taxonomy that is based on the molecular aberrations that drive specific subgroups of cancers, in a way akin to the classification system currently used for leukaemias and lymphomas. With the availability of information on the genetic alterations required for the survival of cells of a given cancer, tumours may be classified according to the genetic aberrations they harbour, according to the molecular networks activated or inactivated by these genetic aberrations, and, importantly, according to the agents to which these tumours are sensitive.
Studies performing large-scale conventional sequencing of breast cancers [33,34] revealed that a relatively small number of genes are frequently mutated and a large number of genes are rarely mutated in breast cancer. It should be noted, however, that the number of mutations found in oestrogen receptor-negative breast cancer cell lines [34] was higher than that found in an oestrogen receptor-positive breast cancer [18]. It is therefore plausible that different types of breast cancer are driven by distinct constellations of genetic aberrations. Even tumours of the same type, however, may be characterised by mutations of distinct genes in the same or complementary molecular networks, which would result in a similar phenotype.
Recent whole-genome characterisation of M1 leukaemias [35,36] and of a metastatic deposit of an invasive lobular carcinoma of the breast [18] has demonstrated the power of this technology for the identification of novel potential mutations that drive specific subtypes of complex and heterogeneous diseases such as leukaemias and breast cancer, and has demonstrated how the mutational spectrum of a cancer evolves over time. Furthermore, next-generation sequencing analysis of cancer types whose tumours are rather homogeneous in terms of their molecular makeup, such as some special types of breast cancer [37][38][39][40][41], may lead to the identification of pathognomonic genomic alterations, in a way akin to C134W FOXL2 mutations in granulosa cell tumours of the ovary [42]. These driver genetic alterations (for example, mutations, amplifications and fusion genes) have the potential of being exploited as therapeutic targets.
Although the presence of non-neoplastic tissues (that is, stroma, inflammatory infiltrate and entrapped normal tissues) represents a challenge for the analysis of the genomes of preinvasive lesions, primary breast cancers and their metastatic deposits, there is evidence to suggest that if a tumour is sequenced at a sufficient depth then accurate sequences at base pair resolution can be obtained and somatic mutations identified [18].
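The effect of non-neoplastic cells on mutation detection can be made concrete with the standard mixture calculation below. It is a sketch under stated assumptions: the mutation is clonal (present in every cancer cell), normal cells are diploid at the locus, and reads are sampled in proportion to the DNA present. The function names and the example figures (40% tumour cell content, five variant reads) are illustrative and are not taken from the studies cited.

```python
import math


def expected_vaf(purity: float, mutant_copies: int = 1,
                 tumour_copy_number: int = 2, normal_copy_number: int = 2) -> float:
    """Expected fraction of reads carrying a clonal somatic mutation.

    purity is the fraction of cells in the sample that are cancer cells.
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    mutant_dna = purity * mutant_copies
    total_dna = purity * tumour_copy_number + (1.0 - purity) * normal_copy_number
    return mutant_dna / total_dna


def depth_for_expected_variant_reads(min_variant_reads: int, vaf: float) -> int:
    """Depth at which the expected number of variant reads reaches the target."""
    return math.ceil(min_variant_reads / vaf)


if __name__ == "__main__":
    vaf = expected_vaf(purity=0.4)  # heterozygous mutation, 40% tumour cells
    print(vaf)
    print(depth_for_expected_variant_reads(5, vaf))
```

Because the expected variant fraction falls with tumour cell content, the depth needed to observe a given number of mutant reads rises accordingly; this is an expectation only, and sampling variation means that greater depth is needed in practice.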
Another important application of massively parallel sequencing, owing to its ability to deep sequence specific genomic regions, is the identification of secondary mutations as mechanisms of resistance to specific agents [43,44]. There are several lines of evidence to demonstrate that de novo and acquired resistance to some targeted therapies is driven by secondary mutations in the target genes (for example, the T790M mutation in the EGFR gene causing resistance to anti-epidermal growth factor receptor agents [45], and secondary KIT mutations leading to resistance to imatinib mesylate and sorafenib [46]) or in genes whose inactivation is synthetically lethal in the presence of the targeted therapy (for example, BRCA2 and BRCA1 revertant mutations as a mechanism of resistance to platinum salts and poly(ADP-ribose) polymerase inhibitors [47][48][49]).
It should be noted, however, that the deluge of data derived from next-generation sequencing studies might take a relatively long time to be translated into information that is clinically relevant. Given that each cancer genome may have in excess of 10,000 somatic mutations, it is unclear how much validation through the identification of recurrent mutations [14] or by laborious functional studies will be required to separate driver mutations (that is, those that either confer a growth/survival advantage on a tumour or are required by the cancer cells for the maintenance of their malignant behaviour) from passenger alterations (that is, genomic noise). Furthermore, next-generation sequencing is likely to unravel a much greater complexity of the normal human genome in terms of SNPs and copy number polymorphisms [50], some of which may be confined to some somatic tissues in the same individual [51,52]. Massively parallel sequencing will require high-performance computing and bioinformatic support that is far beyond that available to most research laboratories. Furthermore, quality control and standardisation of the massively parallel sequencing experiments and data reporting are important issues to consider. Finally, the ethical aspects of next-generation sequencing are by no means trivial, and readers are referred to excellent reviews covering these aspects [9,11].

Conclusion
One could argue that massively parallel sequencing is not only an end, but also a means for performing experiments that may answer questions that could not even be asked previously. The revolution that is likely to be brought about by massively parallel sequencing methods is akin to the revolution fostered by the introduction of the PCR in the 1980s. It is undeniable that this technology will constitute a quantum leap in breast cancer basic and translational research; however, numerous challenges lie ahead. We ought to learn from our recent experience with microarrays, and avoid any sort of unjustified overoptimism. The greatest danger of using this revolutionary technology is that it comes with new problems; if we move too quickly, the lessons we are beginning to learn from previous high-throughput studies may be forgotten when massively parallel sequencing is applied to clinical and translational questions.