Transcriptomic changes in human breast cancer progression as determined by serial analysis of gene expression

Introduction Genomic and transcriptomic alterations affecting key cellular processes such us cell proliferation, differentiation and genomic stability are considered crucial for the development and progression of cancer. Most invasive breast carcinomas are known to derive from precursor in situ lesions. It is proposed that major global expression abnormalities occur in the transition from normal to premalignant stages and further progression to invasive stages. Serial analysis of gene expression (SAGE) was employed to generate a comprehensive global gene expression profile of the major changes occurring during breast cancer malignant evolution. Methods In the present study we combined various normal and tumor SAGE libraries available in the public domain with sets of breast cancer SAGE libraries recently generated and sequenced in our laboratory. A recently developed modified t test was used to detect the genes differentially expressed. Results We accumulated a total of approximately 1.7 million breast tissue-specific SAGE tags and monitored the behavior of more than 25,157 genes during early breast carcinogenesis. We detected 52 transcripts commonly deregulated across the board when comparing normal tissue with ductal carcinoma in situ, and 149 transcripts when comparing ductal carcinoma in situ with invasive ductal carcinoma (P < 0.01). Conclusion A major novelty of our study was the use of a statistical method that correctly accounts for the intra-SAGE and inter-SAGE library sources of variation. The most useful result of applying this modified t statistics beta binomial test is the identification of genes and gene families commonly deregulated across samples within each specific stage in the transition from normal to preinvasive and invasive stages of breast cancer development. Most of the gene expression abnormalities detected at the in situ stage were related to specific genes in charge of regulating the proper homeostasis between cell death and cell proliferation. The comparison of in situ lesions with fully invasive lesions, a much more heterogeneous group, clearly identified as the most importantly deregulated group of transcripts those encoding for various families of proteins in charge of extracellular matrix remodeling, invasion and cell motility functions.


Introduction
Invasive ductal breast carcinoma (IDC) is the most common malignancy of the breast, accounting for ~80% of all invasive breast tumors [1]. Although an issue of much controversy over the years, there is now general agreement and overwhelming histopathological and genetic evidence indicating that most invasive breast carcinomas appear to develop gradually from defined precursor lesions [2]. However, it also became clear that progression toward more aggressive phenotypes is not obligatory [3]. It is further evident that many genetic abnormalities underlying tumor progression are probably phenotypically silent.
Numerous molecular genetic changes have been reported as relevant in human breast carcinogenesis, such as anomalies affecting cell proliferation, apoptosis and invasion [4]. Preinvasive breast lesions such as high-grade ductal carcinoma in situ (DCIS) are known to have acquired a myriad of genomic and transcriptomic changes, but as their name implies they are not yet invasive. The development of the ability to invade surrounding tissues is perhaps the most critical event in cancer progression. Among proposed invasion-related genes with reported altered expression in tumor cells are cell adhesion molecules, proteases and cytoskeletal molecules that may influence motility [5]. Identifying the key and most common gene expression abnormalities involved in the transition steps from preinvasion to a fully invasive phenotype is an extremely important topic of research and the main objective of the present report. Studies on this area may provide clues to better diagnose premalignant lesions at high risk of progression and may aid in achieving a better understanding of critical early molecular mechanisms involved in breast cancer evolution.
Serial analysis of gene expression (SAGE) is a comprehensive profiling method that allows for global, unbiased and quantitative characterization of transcriptomes [6,7]. SAGE provides a statistical description of the mRNA population present in a cell without prior selection of the genes to be studied, and this constitutes a major advantage. In this sense, only open systems can identify expressed genes that have not yet been cloned or partially sequenced. A second major advantage is that the information generated is digital in format, and can be directly compared with data generated from any other laboratory or with data available in public databases such as the Cancer Genome Anatomy Project http://cgap.nci.nih.gov/SAGE.
To perform a comparative SAGE analysis of normal, preinvasive and invasive lesions, we used a modified t test that we have recently developed [8]. This method has the advantage of taking into account both the intra-sample and inter-sample variability, identifying 'common patterns' of gene changes systematically occurring across samples. Most of the tests developed for measuring differential expression in SAGE data focus on capturing the first type of variation correctly, but tend to neglect the second type [9,10]. The aim of the present study was to provide a statistically robust global gene expression analysis on the progression of breast cancer using the described statistical approach comparing breast normal and tumor SAGE libraries obtained from public databases combined with additional SAGE libraries recently generated in our laboratory.

SAGE libraries
To perform the comparative analysis of different stages of breast cancer progression, we combined SAGE libraries available in public databases with breast cancer libraries generated and sequenced at our own laboratory. To this end, 12 SAGE libraries of breast tissues (four normal breast tissues, six DCIS tissues and two IDC tissues) were downloaded from the Cancer Genome Anatomy Project -SAGE Genie database (libraries generated at the Polyak Laboratory, Dana-Farber Cancer Institute, Boston, MA, USA). We used 11 additional breast cancer SAGE libraries generated by ourselves, at an approximate resolution of 100,000 SAGE tags per library. All IDC SAGE libraries used in this study were from lymph node-negative, estrogen receptor-positive and progesterone receptor-positive tumor samples, with a tumor size classification of T1 or T2 (i.e. T1-T2 N0 M0). Table 1 summarizes all the SAGE libraries used in this comparative analysis.

SAGE methodology
For the SAGE libraries generated in our laboratory we followed standard methods. Briefly, total RNA was extracted from snap-frozen tissues using TRIzol (Invitrogen, San Francisco, CA, USA). SAGE library construction was performed with the I-SAGE kit (Invitrogen) according to the manufacturer's protocol and introducing only minor modifications. The anchoring enzyme was NlaIII and the tagging enzyme used was BsmFI. Concatemerized ditags were cloned into pZERO-1 and sequenced with an ABI 3700 DNA Analyzer (Applied Biosystems, Foster City, CA, USA).

SAGE data processing
SAGE tags were extracted from sequencing files using the SAGE2000 software version 4.0 (a kind gift from Dr K. Kinzler, John Hopkins School of Medicine, Baltimore, MD, USA). Tag abundances for all libraries were normalized to a total of 100,000 tags (at which level a tag present 10 times has an abundance of 0.01%). Tag to gene assignments as well as additional annotations using public databases (e.g. Gene Ontology, Locus Link, Unigene cluster) were performed, using web-based SAGE library tools developed by ourselves http://spi.mdacc.tmc.edu/bitools/ about/sage_lib_tool.html. In our comparison we used only tags with only one reliable assigned gene.

Statistical analysis of SAGE libraries
To compare the 23 SAGE libraries, we utilized a modified t test recently developed by us [8]. This analysis allowed us to identify SAGE tags with significantly different expression levels (P < 0.01) between normal tissue and DCIS and between DCIS and IDC. Tags with total counts of less than three in all libraries were filtered out before the analysis. In order to enable visualization and illustration of our analyses, we utilized the TIGR MultiExperiment Viewer (MeV 2.2) software (The Institute for Genomic Research, Rockville, MD, USA). This tool was employed for normalization and average clustering of the SAGE data.
The aim of the heat maps presented is simply to organize and illustrate the data by graphical means. Briefly, the normalization included logarithmic transformation followed by median centering by samples and genes. We used standard average hierarchical clustering techniques to classify and illustrate further the differences found by the modified t test, showing the clusters of differentially coexpressed genes between the normal tissue, DCIS and IDC groups.

Generation and analysis of SAGE libraries
The primary goal of our study was to identify the most commonly occurring transcriptome changes in the transition from normal breast epithelium to DCIS and invasive carcinoma. To this end, SAGE data obtained from 11 breast cancer libraries generated in our laboratory (1,090,815 tags) were combined and compared with data available in the public domain (661,863 tags), thus generating a data-set of almost 1.7 million breast cancer and normal specific tags, representing approximately 25,157 transcripts from a total of 23 libraries (Table 1).
Our statistically stringent analysis revealed 52 transcripts commonly deregulated across the board when comparing normal tissue with DCIS ( Fig. 1), and 149 transcripts when comparing DCIS with IDC (P < 0.01) (Fig. 2) (see additional data files 1 and 2 for additional information with statistical cutoff at P < 0.05). Selected genes based on relative abundance, highly statistical differences and high fold changes between compared groups are sorted and represented in Tables 2 and 3. As expected, we detected various ribosomal genes among the most abundant transcripts in all the breast SAGE libraries, and these genes were highly upregulated in the invasive carcinomas. This agrees with the previous global expression profiles and with the comparisons of cancers and the corresponding normal tissues in general [7,11,12].
To simplify illustration of the data, ribosomal genes are not included in the figures and tables.

Global comparison of normal tissues and DCIS
Among the 52 transcripts detected as differentially expressed in DCIS (P < 0.01), 36 were downregulated transcripts and 16 were upregulated transcripts in these lesions when compared with normal breast epithelial cells and mammary epithelial organoids ( Fig. 1 and Table 2). We defined and classified the 52 genes differentially expressed into the nine functional categories [13] shown in Fig. 3a.
Clusterin was one of the most dramatically downregulated genes (-63.9-fold; P = 0.0036) in DCIS libraries. This gene encodes a heterodimeric, highly conserved, secreted glycoprotein. Alterations in Clusterin expression and/or protein maturation are linked to changes in tissue growth or regression, which may be related to specific proapoptotic or antiapoptotic protein isoforms [15]. Clusterin was reported as overexpressed during tissue and cell involution, and was downregulated in esophageal squamous cell carcinoma and prostate carcinoma, suggesting that this expression alteration could be a general phenomenon during tumor progression [16,17]. On the contrary, and in contrast to these and our observations, Redondo and colleagues reported increased Clusterin expression in Hierarchical clustering of the most commonly different expressed genes between normal breast tissue and ductal carcinoma in situ (DCIS) groups (P < 0.01) Hierarchical clustering of the most commonly different expressed genes between normal breast tissue and ductal carcinoma in situ (DCIS) groups (P < 0.01). Color scale at bottom of picture is used to represent expression level: low expression is represented by green, and high expression is represented by red.

Figure 2
Hierarchical clustering of the most commonly differentially expressed genes between ductal carcinoma in situ (DCIS) and invasive ductal carcinoma (IDC) groups (P < 0.01) Hierarchical clustering of the most commonly differentially expressed genes between ductal carcinoma in situ (DCIS) and invasive ductal carcinoma (IDC) groups (P < 0.01). Color scale at bottom of picture is used to represent expression level: low expression is represented by green, and high expression is represented by red. breast carcinoma samples [18]. The reason for this discrepancy is unclear at this point. The role of Clusterin in cell survival, cell death and neoplastic transformation remains controversial [15].
Another commonly observed downregulated gene in DCIS libraries was NSEP-1 (-36.2-fold; P = 0.0001). Also known as YB1, NSEP-1 is a member of the highly conserved Ybox family of proteins, which regulate the transcription of several genes associated with cell death including both fas, a cell death-associated receptor, and the tumor suppressor gene p53 [19]. The decrease in expression of NSEP-1 transcripts could play an important role in the early stages of breast carcinogenesis in order to overcome cell proliferation controls.
Interestingly, and as previously observed, we also detected significant downregulation of various cytokines and chemokines: interleukin enhancer binding factor 2 (ILF2), interleukin 13 receptor alpha 1 (IL13RA1), leukemia inhibitory factor (LIF), cardiotrophin-like cytokine (CLC), chemokine C-C ligand 2 (CCL2), and chemokine C-X-C ligand 1 (CXCL1). All these cytokines and chemokines are highly expressed in normal mammary epithelium and are dramatically downregulated in the DCIS samples. These differentially expressed genes were detected within a range of 0.02 <P < 0.05 by means of the modified t test analysis. These small secretory molecules, although usually linked to inflammatory processes, could also play important autocrine and/or paracrine roles in the physiology of normal mammary epithelial cells in particular because receptors for these cytokines are also normally found expressed in normal breast epithelial cells [20]. Some of these molecules (e.g. CXCL1, LIF) appear to play important roles in the normal periodic cycles of growth and involution of the mammary gland following pregnancy and lactation. They may thus be part of the physiologic mechanisms associated with the massive apoptosis observed during involution [21,22]. Unfortunately we understand very little of the relevance of their intriguing de facto silencing in expression, both in in situ as well as in invasive breast cancer lesions.
Interestingly, we also detected a series of transcripts commonly overexpressed in the DCIS samples: polycyctic kidney disease 1-like (PKD1-like), START domain containing 10 (STARD10), EPS8-like2 (EPS8L2), and KIAA0545 protein (Fig. 1d). One of these genes, EPS8-like2, encodes a protein that is related to epidermal growth factor receptor pathway substrate 8 (EPS8), and was shown to be essential in Ras/PI3K to Rac signaling [23]. PKD1-like encodes a member of the polycystin protein family. Members of this protein family may function in cell development and morphology, and may modulate intracellular calcium homoeostasis and other signal transduction pathways [24,25]. Although the PKD1 gene has been associated with cancer mechanisms, this homologous family member has not been implicated in carcinogenesis processes to the best of our knowledge. KIAA0545, also known as signalinduced proliferation-associated 1 like 3 (SIPA1L3), is a member of the Sipa1 family and encodes a protein bearing a domain highly homologous to the catalytic region of human Rap1 GTPase-activating protein (Rap1GAP).   Sipal1 is involved in the regulation of the Ras-mediated signal transduction pathway for cell proliferation and cell cycle progression [26]. These genes could be involved in signaling pathways that lead to cell proliferation, but their potential role in malignant transformation remains unknown.

Differentially expressed genes associated with NF-κB and tumor necrosis factor pathways
One of the transcripts observed to be most differentially expressed when comparing normal tissue with DCIS was NFKBIA (better known as IκBα), demonstrating a 150-fold higher expression (P < 0.0001) in normal mammary epithelial cells (Table 2 and Fig. 1b). NFKBIA is a member of IκB family genes that play a critical role in regulating the activity of the NF-κB transcription factor [27,28]. NF-κB plays a major role in diverse biological processes such as cell proliferation, differentiation, apoptosis and metastasis [29,30]. NF-κB is also required to prevent cell death induced by tumor necrosis factor (TNF) [31].
Interestingly, and perhaps pointing to connected pathways and related outcomes, we also detected a strong decrease in the expression levels of TNFRSF10 (29-fold; P < 0.0003), LITAF/PIG7 (29-fold; P < 0.0003) and TNFAIP3 (eightfold; P < 0.0083) transcripts in the DCIS group. The protein encoded by TNFRSF10, also known as TRAIL/ APO2, is a member of the TNF-receptor superfamily and contains an intracellular death domain. This receptor can be activated by TNF-related apoptosis inducing ligand and its role is to transduce apoptosis signals [32,33]. LITAF/ PIG7, a transcription factor, termed lipopolysaccharide-induced TNF-alpha factor, also found downregulated, was reported to regulate TNF-alpha gene expression playing a major role in TNF-alpha activation [34]. This gene, also known as P53-induced gene 7 (PIG7), has been shown to be induced by p53 when apoptosis is triggered, and therefore could also play a role in programmed cell death [35]. The concerted decline of these transcripts early in breast tumor progression appears conducive to a virtual silencing of apoptosis induction pathways and a consequential net increase in cell proliferation. In other words, the homeostasis of proliferation cell death normally operating in the breast epithelium is altered and inclined towards a net gain in cell numbers via multiple signaling pathways.

Global comparison of in situ and invasive carcinomas
We found 149 transcripts differentially expressed between DCIS and IDC at P < 0.01. All of these genes were found overexpressed commonly at the invasive stage (Fig. 2). Table 2 summarizes the 52 most commonly overexpressed genes in invasive carcinoma lesions. We defined and classified the 149 genes differentially expressed in 10 functional categories [13] as shown in Fig. 3b. Interestingly, we found that 37% of these upregulated transcripts are related to the cell cycle (12%), extracellular matrix or secreted proteins (13%), cell adhesion and motility (6%), and signal transduction (6%).
We were also able to detect 31 underexpressed genes in invasive carcinomas when compared with DCIS, but only when the stringency of the statistical comparison was dropped to within the 95% confidence interval (i.e. P <
The first of these transcripts, TM4SF1, was also the most dramatically downregulated gene in DCIS when compared with normal breast libraries (-442.6-fold; P = 0.0083). The transmembrane proteins TM4SF1, also known as the tetraspanin superfamily, are implicated in diverse signal trans-duction events that play a role in the regulation of cell development, cell proliferation, differentiation and motility [36]. The tetraspanins are associated with adhesion receptors of the integrin family and regulate integrindependent cell migration [36]. In the present study, the loss in gene expression of TM4SF1, from normal breast tissue to invasive carcinomas, appears to be a common event in the progression of breast carcinomas. In addition, downregulated levels of the TRAF4 transcript could cooperate in the evolution from DCIS to invasive carcinomas. TRAF4 is a proapoptotic gene member of the TRAF family of adaptor proteins that mediate cellular signaling by binding to various members of the tumor necrosis family receptor superfamily and interleukin-1/Toll-like receptor superfamily [37]. Interestingly, a recent study showed that overexpression of TRAF4 can induce apoptosis, playing a role in p53-mediated proapoptotic signaling in response to cellular stress [38].

Differentially expressed genes related with extracellular matrix remodeling and invasion processes
During their metastatic conversion, epithelial carcinoma cells acquire the ability to invade the surrounding tissues and later disseminate to secondary organs mostly via lymphatic vessels. The metastatic process is not just a function of acquisition of novel migratory and invasive properties by the epithelial tumor cells; the surrounding stroma also plays a critical role in this process [2]. Dramatic changes take place in order to remodel the extracellular matrix environment in response to the infiltrating cancer cells (desmoplastic reaction) [39][40][41]. In this sense, we identified high expression levels of several transcripts that could be a reflection of the host stromal response, such as collagen 1α1, collagen 1α2, collagen 3α1, collagen 6α1, fibronectin I, fibrilli, microfibrillar-associated protein 2, and Spondin 2.
It is known that the proteolytic degradation of the extracellular matrix is more than the simple removal of a physical barrier to invasion; such processes and the increased expression of the involved genes are known to also significantly influence mechanisms controlling cell proliferation [42]. Matrix metalloproteinases are zinc-dependent endopeptidases involved in matrix degradation and tissue remodeling [43]. These endopeptidases are capable of degrading both the extracellular matrix and basement membrane, physical barriers that play important roles in preventing against expanding growth and migration of cancer cells [44]. It is therefore widely accepted that overexpression of matrix metalloproteinases is associated with cancer-cell invasion and metastasis. A member of the matrix metalloproteinase family (MMP-2) was highly expressed (29.1fold; P = 0.0008) in IDC libraries in comparison with in situ carcinomas. MMP-2 has been shown overexpressed in various human tumors, including breast cancer [45,46].  To no surprise and as observed in other studies, we also detected significant increases in SPARC (286-fold; P = 0.0003) and a new related gene SPARC-like1 (21.7-fold; P = 0.005) among the groups of genes upregulated in invasive lesions. The SPARC gene encodes for a secreted protein acid rich in cysteines also known as osteonectin [47]. This protein is involved in a variety of diverse biological processes including tissue remodeling, cell adhesion, proliferation, differentiation, matrix synthesis/turnover, angiogenesis and tumor cell migration and invasion [47]. Overexpression of the SPARC gene has been reported associated with melanoma and metastatic carcinomas of the breast, and increased SPARC expression has been observed in conjunction with increased c-Jun and Fra-1 expression in a panel of invasive breast cancer cell lines [48].
Human SPARC-like1, also known as mast9 or hevin, is a member of the SPARC protein family. Interestingly, previous reports indicated downregulation of SPARC-like1 in prostate and colon carcinomas [49,50]. Contrary to these observations, we observed consistent high expression of this transcript across all IDC libraries. Functional assays suggest that SPARC-like1 may serve as an antagonist to cell adhesion, playing a key role in the inhibition of attachment, and may facilitate spreading of endothelial cells on fibronectin substrates [51].
Taken together, these expression profiles suggest that MMP-2, SPARC and SPARC-like1 are probably critical mediators of extracellular matrix remodeling and are all important in facilitating breast cancer invasion and progression.
Other genes commonly expressed at high levels in invasive carcinomas and of much lower expression in DCIS and normal breast tissues include lumican (LUM/LDC) (56.7-fold; P = 0.0011), versican (CSPG2) (18.2-fold; P = 0.0017), vimentin (VIM) (17.7-fold; P = 0.0014), decorin (DCN/ PG2) (13.9-fold; P = 0.0007) and adlican (DKFZp564I1922) (12-fold; P = 0.0086). Lumican and decorin are members of the small leucine-rich proteoglycan family of proteins [40]. Several studies have demonstrated that small leucine-rich proteoglycan proteins can modulate cellular behavior, including cell migration and proliferation during tumor growth. Furthermore, the high expression level of lumican was associated with high tumor grade and was expressed specifically in breast cancer tissues, but not in normal breast tissues, suggesting that lumican is differentially expressed during breast tumor progression [40,52]. These findings suggest that lumican may play an important role in breast cancer growth.
Recent studies have suggested that expression of increased amounts of versican, a chondroitin sulphate proteoglycan, in neoplastic tissues may play a role in promoting tumor cell proliferation and migration [53]. Abnormal versican deposition has been observed in a number of tumor types, including breast cancer [54]. Furthermore, it has been suggested that the versican-rich extracellular matrices exert an anti-adhesive effect on cells, thus facilitating tumor-cell migration and invasion [55]. Schematic model portraying some of the most significant transcriptomic changes observed in breast cancer progression Schematic model portraying some of the most significant transcriptomic changes observed in breast cancer progression. DCIS, ductal carcinoma in situ; IDC, invasive ductal carcinoma.

R511
Vimentin is a type III intermediate filament normally expressed in cells of mesenchymal origin [56]. However, numerous studies have now demonstrated that vimentin can also be expressed in epithelial cells involved in physiological or pathological processes requiring epithelial cell migration [57]. Vimentin has indeed been described in migratory epithelial cells involved in embryological and organogenesis processes and tumor invasion [58]. Also, vimentin antisense transfection in vimentin-expressing breast cell lines was shown to reduce their in vitro invasiveness or migration, strongly emphasizing a functional contribution of vimentin to epithelial cell invasion/migration [59].

Conclusions
Using comprehensive gene expression profiling by means of SAGE combined with a recently developed statistical approach, we identified the most consistent and statistically significant changes occurring in breast cancer progression detected by this methodology. A comparison of the genes identified in our DCIS and IDC analysis with previous observations [11,12,14,41] revealed expected similarities. More importantly, several genes were identified in our analysis that were not previously reported or detected in other SAGE studies. This suggests that the comparative analysis we performed of normal breast tissue, DCIS and invasive carcinomas by means of the modified t test appears statistically rigorous and applicable to SAGE studies in which multiple libraries are compared.
In the present study we observed that deregulation of genes involved in the control of cell proliferation, apoptosis and mammary gland development are frequently altered at the in situ stage (Fig. 5). Meanwhile, alterations in the expression of genes related to the cell cycle and extracellular matrix remodeling (proteinases, collagenases, cysteine proteinases), and several transcripts related to cell adhesion and motility, were abundantly deregulated at the invasive carcinoma stage (Fig. 5). Additional analysis and validation of the identified genes will be required to determine the clinical value, and to determine whether they may constitute novel targets for translational research.

Competing interests
None declared.

Additional files
The following Additional files are available online: