Next generation sequencing-based expression profiling identifies signatures from benign stromal proliferations that define stromal components of breast cancer

Introduction Multiple studies have shown that the tumor microenvironment (TME) of carcinomas can play an important role in the initiation, progression, and metastasis of cancer. Here we test the hypothesis that specific benign fibrous soft tissue tumor gene expression profiles may represent distinct stromal fibroblastic reaction types that occur in different breast cancers. The discovered stromal profiles could classify breast cancer based on the type of stromal reaction patterns in the TME. Methods Next generation sequencing-based gene expression profiling (3SEQ) was performed on formalin fixed, paraffin embedded (FFPE) samples of 10 types of fibrous soft tissue tumors. We determined the extent to which these signatures could identify distinct subsets of breast cancers in four publicly available breast cancer datasets. Results A total of 53 fibrous tumors were sequenced by 3SEQ with an average of 29 million reads per sample. Both the gene signatures derived from elastofibroma (EF) and fibroma of tendon sheath (FOTS) demonstrated robust outcome results for survival in the four breast cancer datasets. The breast cancers positive for the EF signature (20-33% of the cohort) demonstrated significantly better outcome for survival. In contrast, the FOTS signature-positive breast cancers (11-35% of the cohort) had a worse outcome. Conclusions We defined and validated two new stromal signatures in breast cancer (EF and FOTS), which are significantly associated with prognosis. Our group has previously identified novel cancer stromal gene expression signatures associated with outcome differences in breast cancer by gene expression profiling of three soft tissue tumors, desmoid-type fibromatosis (DTF), solitary fibrous tumor (SFT), and tenosynovial giant cell tumor (TGCT/CSF1), as surrogates for stromal expression patterns. By combining the stromal signatures of EF and FOTS, with our previously identified DTF and TGCT/CSF1 signatures we can now characterize clinically relevant stromal expression profiles in the TME for between 74% to 90% of all breast cancers.


Introduction
The tumor microenvironment (TME) is comprised of fibroblasts, endothelial cells, immune cells and extracellular matrix (ECM). The cells within the TME have been demonstrated to play significant roles in the development and progression of cancer [1][2][3][4][5][6][7][8], but most studies view each of these cell groups as a relatively uniform population that is similar across different patients.
Profiling the non-neoplastic cells within the TME directly is difficult due to the variety and relative paucity of these cells in the tissue and practical issues with the isolation of these cells. Our approach rests on the hypothesis that, similar to lymphomas where each tumor is a clonal outgrowth of a particular lymphoid cell type, each soft tissue tumor type can also be regarded as a clonal outgrowth of a particular connective tissue cell type to represent subclasses of stromal proliferation that occur in epithelial tumors.
In our approach, soft tissue tumors (STTs) which are a homogenous collection of a single mesenchymal cell type have phenotypes distinct from each other, can be easily profiled and act as "discovery tools" for various types of TME expression patterns to yield a relatively uniform signature. Using gene array-based expression profiling of fresh frozen specimens of fibroblastic tumors (desmoid type fibromatosis-DTF and solitary fibrous tumor-SFT) and macrophage-rich tumors (tenosynovial giant cell tumor-TGCT/CSF1), we previously discovered novel types of stromal reaction patterns that emphasize the variations in the fibroblast and macrophage compartment in breast cancer between different patients [1,2,8,9]. The biological significance of the identification of these stromal reaction patterns was borne out by the fact that several of these stromal expression patterns have prognostic significance independent from traditional prognosticators such as tumor size, tumor grade and even lymph node status [1,2,8].
Based on our previous findings, the DTF signature robustly defined a stromal pattern for 25 to 35% of invasive breast cancers [2], while the TGCT/CSF1 signature was found in 17 to 28% of breast cancers [1]. However, a significant number of breast cancers were not classified by these signatures. In order to find additional stromal patterns we performed gene expression profiling on a spectrum of fibroblastic lesions. As most of these lesions are quite small, they are routinely entirely submitted as formalin fixed paraffin embedded (FFPE) tissue. Here, we have applied an RNA-Seq method for the expression profiling of archival, FFPE tissue, termed 3SEQ (3′end RNA sequencing), which we have previously developed. This method can be used to perform global gene expression profiling of FFPE material [10][11][12], as well as to discover and characterize expression levels for lncRNAs [13]. Within this study, we have used 3SEQ to determine specific gene expression signatures for 10 types of fibroblastic tumors and found that 2 can identify breast cancers with distinct clinical outcome. Taken together with the two previously identified stromal signatures (DTF and TGCT/CSF1 signatures), the combined four stromal signatures now classify 74% to 90% of breast carcinomas.

Samples selection, treatment and 3SEQ analysis
Paraffin blocks from 53 fibrous tumors were collected from the Department of Pathology at Stanford University Hospital, with Health Insurance Portability and Accountability Act (HIPAA)-compliant Stanford University Medical Center Institutional Review Board approval. And the tissues were collected with a waiver of consent due to the archival nature of the specimens. The samples consisted of collagenous fibroma (FC, six cases), elastofibroma (EF, four cases), infantile digital fibromatosis (IF, three cases), palmar fibromatosis (PF, eight cases), nasopharyngeal angiofibroma (NPAF, six cases), fibroma of tendon sheath (FOTS, four cases), nodular fasciitis (NF, six cases), dermatofibrosarcoma protuberans (DFSP, four cases), desmoid type fibromatosis (DTF, seven cases) and solitary fibrous tumor (SFT, five cases). Multiple 2 mmdiameter cores were taken from diagnostic FFPE material for RNA isolation as reported previously [12].
3SEQ sequencing and data analysis 3SEQ libraries for next generation sequencing-based expression profiling were built according to previously described methods [10][11][12][13] and protocols in our lab website [14], and then were sent to Stanford Center for Genomics and Personalized Medicine to be sequenced directionally (36 bp) from 5′end of mRNA fragments towards their poly (A) ends using Illumina GA IIx machines (Illumina, Inc., San Diego, CA, USA).
Sequence reads, after filtering for read quality, were refiltered by fastx (fastx_artifacts_filter) [15], and mapped to the transcriptome (refMrna, downloaded from the UCSC genome browser, www.genome.ucsc.edu/) by using SOAP2, allowing at most two mismatches [16]. Total numbers of sequence reads for each gene symbol from the transcriptome mapping were determined and were used to create the gene-expression profile matrix. Read counts from each library were normalized to transcripts per million reads (TPM). Adequacy of sequencing depth for each library was assessed using an estimation of library saturation for transcriptome detection, which showed that our sequencing depths were sufficient to detect an average of 67% of reference mRNAs. The distances between sequencing reads and 3′ ends of mRNAs were measured to be around 100 to 200 nt as we expected based on the length of mRNA after shearing.
Hierarchical clustering was performed as previously described (filter by SD100, adjust data by "log transform data" and "center genes", then perform Hierarchical clustering with "Spearman Rank Correlation" and "Centroid Linkage") [8] and a clustered heatmap was visualized with Java TreeView (http://sourceforge.net/projects/jtreeview/ files/). Two class SAM analysis (Significance Analysis of Microarrays) [17] was used to identify genes expressed differentially between each type of fibrous tumor versus the other fibrous tumors with FDR of 0.05. Genes that were specific for a particular tumor type and that were highly expressed were used to define the fibrous tumor signature for each tumor type. All gene expression profiling data used for this study have been deposited in the Gene Expression Omnibus (GEO) and are publicly accessible through GSE42948. The 3SEQ and SAM analyses scripts for this study are attached as Additional file 1.

Analysis of breast cancer microarray expression data
Four publicly available breast cancer expression datasets with clinical follow-up information were used for this study (NKI [18], GSE1456 [19], GSE3494 [20] and GSE4922 [21]). The combined sets of positively expressed genes that are unique for each fibrous tumor signature were identified in each of the four breast cancer datasets. Using only the expression data for these genes, the breast cancers were grouped by unsupervised hierarchical clustering using Cluster 3 software (http://bonsai.hgc.jp/~mdehoon/ software/cluster/). The resulting heatmap was visualized with Java TreeView. We then identified the breast cancer cases that were most closely associated with each fibrous tumor signature. These cases were visually identified in the heatmap as sharing the largest group of genes with high expression. These observations were validated by calculating the volume of expression (the averaged expression values of each probe within the gene set after removing all the negative values in the matrix) to confirm that the visually identified group of cases had indeed the highest levels of expression relative to other groups of cases.
After studying 10 fibrous tumor signatures in four breast cancer expression datasets, we identified the fibrous tumor signatures which significantly stratify breast cancers into two groups with different outcomes; only the EF and FOTS signatures divided each of four breast cancer datasets into two groups with statistically significant outcome differences in the four datasets. To refine our analysis, we determined the "core" genes for EF and FOTS signatures: genes that are consistently coordinately and highly expressed in the breast cancer datasets. We defined the EF and FOTS core genes as those exhibiting greater than 0.1 correlated expression and present in at least three of the four gene clusters in four datasets [1,2].
For the association analysis between core gene signatures, a chi-square test was performed using the software GraphPad (GraphPad Software, Inc., La Jolla, CA, USA). Kaplan-Meier plots were used to compute the survival curves, and Log-rank (Mantel-Cox) Test was used to determine the statistical significance of survival between groups by using GraphPad Prism5 software. Univariate and multivariate analysis by the Cox proportional hazard method was performed by using the survival package in R. Analysis of gene ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathway was done by using DAVID Bioinformatics Resources version 6.7 (http://david.abcc.ncifcrf.gov/).

Expression of fibrous tumor gene signatures in breast cancer
Gene expression profiling by 3SEQ was performed on 53 fibrous tumors representing 10 groups of benign fibrous soft tissue tumors (Figures 1 and 2, see Additional files 2, 3, 4). For each diagnostic entity, up to 2,598 genes were highly expressed relative to all other tumors based on two-class SAM analysis, defining 10 unique fibrous tumor gene signatures (Table 1, see Additional file 5).
To the fibrous tumor gene signatures in breast cancer, we analyzed four publicly available breast cancer datasets (GSE1456, GSE4922, GSE3494 and the NKI Dataset). Each breast cancer dataset was clustered separately with the gene sets specific for each of the 10 fibrous tumor signatures. We identified the fibrous tumor genes coordinately over-expressed in a subset of breast cancers for each dataset as representing the stromal expression pattern in the breast cancers that is most similar to that particular fibroblastic tumor. Four fibrous tumor signatures did not show coordinated expression patterns in the four breast cancer datasets. Another four fibrous tumor signatures demonstrated coordinated expression patterns in the breast cancer datasets, but did not consistently show a statistically significant difference in the prognosis between the two groups of breast cancers defined by these fibrous tumor signatures. On the contrary, both the EF and FOTS gene signatures demonstrated meaningful outcome results for survival in all four breast cancer datasets (see Additional file 6).

Core gene sets of FOTS and EF and their implication in the prognosis of breast cancer
To extend our analysis, we determined the fibrous tumor "core" gene signature as being the genes coordinately over expressed in all four breast cancer datasets as we previously had defined the core gene set for DTF signature [2], by requiring that each "core" gene has to be present in at least three of the four gene clusters in four datasets. For each core gene signature and breast cancer dataset, the group of breast cancer samples with the highest aggregate of expression of fibrous tumor signature genes was identified as having a correlation >0.2 ( Figure 3). The breast cancer subgroups were assessed for clinicopathologic correlations in each of the four  Unsupervised hierarchical clustering of 53 tumors (columns) using 961 genes (rows). Within the heatmap, red represents higher expression, black represents mean expression, green represents lower expression, and gray represents normalized expressional level zero. Note: In several instance tumors of the same diagnosis appear adjacent to each other in the figure. It should be noted that the arms of the dendrogram can rotate freely and that it is the length of the dendrogram arms that determines the similarity between cases. To better understand the functions of the two new stromal core gene set signatures (EF and FOTS), we performed Gene Ontology (GO) and KEGG pathway analysis. The results show that the EF core genes are significantly enriched in biological processes including 'response to wounding' and BMP signaling, were enriched in Pathways including Tyrosine metabolism, Complement and coagulation cascades (see Additional file 8). For the FOTS core gene set, biological processes, including glycolysis, were enriched based on the 16 core genes (see Additional file 8), KEGG pathways including "Glycolysis/Gluconeogenesis" and "Fructose and mannose metabolism" were enriched (see Additional file 8).

Independent signatures of FOTS and EF with DTF fibroblast and TGCT/CSF1 macrophage signatures
Previously, our group reported that the DTF fibroblast signature is associated with good outcome in breast cancer [2,8], and that the TGCT/CSF1 macrophage signature is associated with higher tumor grade in breast cancer [1]. We compared these signatures with those of EF and FOTS to determine the degree of overlap between cases with either of the two good prognosis signatures (EF and DTF fibroblast) and the overlap between cases with either of the bad prognosis signatures (FOTS and TGCT/CSF1 macrophage) in the four independent breast cancer datasets (NKI, GSE1456, GSE3494 and GSE4922). In order to make these signatures more comparable and reproducible, we defined the DTF fibroblast breast cancers and TGCT/CSF1 macrophage breast cancers by the same criteria as the criteria for EF/FOTS-like cases (cases with high expression of signature and >0.2 correlation within clusters). Chisquare test indicated that the association between DTF fibroblast cases and EF-like cases is not statistically significant in three of four datasets although there are overlapping assignments by these two signatures. Likewise, Figure 3 Hierarchical clustering of four breast cancer datasets by FOTS/EF core gene sets. Within the heatmap, red represents higher expression levels, green represents lower expression levels, and black represents mean expression levels. Sample clusters designated with Orange and Blue trees are FOTS + breast cancers and EF + breast cancers in GSE1456 (A), GSE3494 (B), GSE4922 (C) and NKI dataset (D).
the association between TGCT/CSF1 macrophage cases and FOTS-like cases is also not statistically significant in three of four datasets (see Additional file 10). These data show that the signatures associated with good outcome (DTF and EF) identify non-overlapping sets of tumors and that similarly the two signatures associated with poor outcome (TGCT/CSF1 and FOTS) also describe distinct sets of tumors.
To examine the degree to which the two new signatures provide additional information to the previously identified DTF/TGCT/CSF1 signatures, we stratified breast cancer into four categories based on DTF/EF signatures for good outcome (DTF+/EF+, DTF+/EF-, DTF-/EF+, DTF-/EF-) or FOTS/TGCT/CSF1 signatures for poor outcome (FOTS+/TGCT/CSF1+, FOTS-/TGCT/ CSF1+, FOTS+/TGCT/CSF1-, FOTS-/TGCT/CSF1-) in the four breast cancer datasets. EF+/DTF + breast cancers account for 11 to 16% of the cases, while EF-/ DTF-breast cancers account for 44 to 47% of the cases. FOTS+/TGCT/CSF1+ breast cancers account for 6 to 12% of the cases, whereas FOTS-/TGCT/CSF1-breast cancers account for 41 to 65% of the cases in the four breast cancer datasets (see Additional file 7). The prognosis of these four categories for DTF/EF signatures or FOTS/TGCT/CSF1 signatures was assessed by overall survival (OS), disease specific survival (DSS), and disease free survival (DFS) by the pooled cases of four datasets ( Figure 5). EF-/DTF-breast cancers demonstrated significantly worst outcome in overall survival, disease specific survival and disease free survival. EF + breast cancers showed the best outcome in overall survival, disease specific survival and disease free survival, independent of DTF signature status, while the EF-/DTF + breast cancers showed better outcome than EF-/DTF-breast cancers. FOTS-/TGCT/CSF1-breast cancers demonstrated significantly better outcome in overall survival, disease specific survival and disease free survival. The FOTS+/TGCT/ CSF1-breast cancer showed the worst outcome in overall survival, disease specific survival and disease free survival.

Discussion
Stromal components within the TME are known to be involved in cancer initiation, progression and prognosis [1][2][3][4]6,7,9,[22][23][24][25][26][27]. In many studies, the different cellular components of the TME are treated as relatively invariable factors that are assumed to play a similar role in tumor samples from different patients. However, through systematic analysis of breast cancer H&E images with a novel machine learning based method, C-Path, we have recently shown that the morphological features of the tumor stroma vary markedly between tumor samples and that they are not only significantly associated with survival in breast cancer, but that their impact on outcome is even stronger than the features of the epithelial component itself [9].
By clustering breast carcinoma expression profile datasets using only the genes that are specific for distinct fibroblastic tumors, we can observe subsets of cancer that contain different fibroblastic subtypes in the tumor stroma. In contrast, hierarchical clustering that uses all genes in the dataset obtained from an entire tumor specimen usually groups samples together based predominantly on the gene expression pattern in carcinoma cells as these cells often represent the majority of the cells within a tumor and often show the most variation in expression patterns. Thus, their transcript levels represent the strongest signal in the sample. As a result, differences between tumors based on their stromal expression patterns are often not observed in datasets where the entire gene expression profile of the sample is used.
It is difficult to obtain gene expression profiles from normal fibroblast subtypes as in normal and tumor tissue they are typically closely associated with other cell types, such as epithelial cells, and techniques, such as micro-dissection, are laborious. Our approach rests on the hypothesis that, similar to lymphomas where each tumor is a clonal outgrowth of a particular lymphoid cell type, each fibroblast tumor type can be regarded as a clonal outgrowth of a particular connective tissue cell type [28]. Fibroblastic tumors thus represent neoplasms of different normal fibroblast subtypes and differentially express genes typical for various fibroblast functions. Moreover, each fibroblastic tumor type represents a largely homogenous population of cells and can be robustly profiled. By using this approach, our group has previously demonstrated that a specific stromal gene expression pattern, the DTF fibroblast signature, could robustly and reproducibly define a subgroup of breast cancer patients with good prognosis [2,8] and that a second stromal pattern, the TGCT/CSF1 macrophage signature, is associated with breast cancers of a higher tumor grade, with decreased expression of ER/PR, and increased mutations of TP53 [1]. Subsequent studies have shown that these different TME variants can even be identified in cases of pre-invasive ductal carcinoma [6]. These findings indicated that the type of TME can vary between patients and that expression profiles obtained from STTs form a useful tool to distinguish these TME variants.
Our prior studies allowed us to identify distinct TME subtypes in up to 50% of breast cancers. In order to extend our findings we determined the gene expression profile for an additional eight fibroblastic tumors. Previous studies required fresh frozen tumor samples, but for many of the fibroblastic lesions we intended to analyze only FFPE material was available. We therefore used a novel gene expression profiling approach (3SEQ) that uses next generation sequencing of RNA fragments purified by oligo-dT selection from FFPE material (Additional file 12). Applying eight novel fibrous signatures to four publicly available breast cancer expression profiling datasets, we found that three of these signatures were not expressed in the breast cancers in a coordinated manner. Of the five remaining signatures, three did not show differences in outcome analysis. In contrast, the EF and FOTS signatures could stratify the breast cancer samples into two groups, through highly coordinated gene expression with consistent association with outcome in all the four breast cancer datasets. The EF signature positive breast cancers demonstrated good outcome, while the FOTS signature positive breast cancers showed bad outcome. In this study, the SFT signature is significantly associated with worse outcome only in the NKI dataset, but there is no clear pattern in the other three datasets, consistent with our previous findings [2]. The current DTF-(3SEQ) signature, which is similar to the DTF signature previously defined against SFT [2,8], is associated with good outcome in three of the four breast cancer datasets, though the association in this analysis is not statistically significant. The difference in the significances of old and new signatures related to outcome can be explained by the fact that the genes differentially expressed for a particular lesion is to a great extent determined by the other samples in the dataset to which it is compared. The original DTF was determined through a comparison with SFT only while the currently defined DTF-(3SEQ) signature was determined through a comparison with a much larger number of distinct fibrous tumor types. As a result, the current DTF-(3SEQ) signature contains 42 genes from the comparison between DTF and the other nine types of fibrous tumors including SFT, while the original DTF signature contains 237 genes from the comparison between DTF and only one other type of fibrous tumor, SFT.
The EF and the previously identified DTF fibroblast signature both identify good outcome in breast cancer, while the FOTS and the previously identified TGCT/ CSF1 macrophage signature both identify bad outcome in breast cancer. In order to explore the relationships between the good or bad signatures, we compared the breast cancer sample assignments between them. The comparison results showed that 11 to 16% of breast cancers were positive for both EF and DTF core gene sets, while 44 to 47% of breast cancers were negative for both, 23 to 33% of breast cancers were EF-/DTF+, and 9 to 20% of breast cancers were EF+/DTF-. In addition, 6 to 12% of breast cancers were positive for both of FOTS and TGCT/CSF1 core gene sets, 41 to 65% of breast cancers were negative for both FOTS and TGCT/CSF1 core gene sets, 5 to 24% of breast cancers were FOTS+/TGCT/ CSF1-, and 23 to 24% of breast cancers were FOTS-/ TGCT/CSF1+. EF+/DTF + breast cancer cases were more likely to be ER+/PR+, low grade, with less lymph node than EF-/DTF-breast cancers. FOTS+/TGCT/CSF1+ breast cancers are more likely to be ER-/PR-, high grade, base-like breast cancers than FOTS-/TGCT/CSF1-breast cancer cases.
In order to test the prognosis power of the combined core gene sets of EF, FOTS, DTF and TGCT/ CSF1, we pooled the outcome data from four breast cancer datasets in overall survival (OS), disease specific survival (DSS) and disease free survival (DFS). Kaplan-Meier analysis for the combined core gene sets in the pooled dataset showed that EF-/DTFbreast cancers were associated with worst outcome in OS, DSS and DFS, while EF + breast cancers, no matter whether they are DTF-or DTF+, were associated with better outcome in OS, DSS and DFS. FOTS-/ TGCT/CSF1-breast cancers were associated with better outcome in OS, DSS and DFS, while FOTS +/TGCT/CSF1-breast cancers were associated with worse outcome in OS, DSS and DFS.
To better understand the potential functions of the two new stromal core gene set signatures (EF and FOTS) in breast cancer, Gene Ontology (GO) and KEGG PATHWAY analysis were performed, which show that the 41 EF core genes are significantly enriched in biological processes including 'response to wounding' and BMP signaling. Within the 41 EF core genes, almost one fourth of them (9/41), such as AOC3, AOX1, C6, CFD, CFH, CLU, GSN, LYVE1 and MECOM, were related to 'response to wounding'. The association with wound healing has been previously identified in a study of prognostic gene signatures in breast cancer [29]. However, a comparison between the EF gene list and the previously identified gene signatures of the "minimum number core serum response" (CSR) genes [29] necessary for tumor classification, we found that only one gene, GSN, was shared between EF core signature and CSR genes (from quiescent samples, in contrast to activated samples). The BMP pathway is well known to modulate cross-talk between stromal cells and epithelial cells [30][31][32], and three genes (CHRDL1, GREM2 and MSX1) involving BMP signaling were comprised of the EF core signature. For the FOTS core gene set, biological processes, including glycolysis, were enriched based on the 16 core genes. This suggests that the increase in glycolysis, which serves a critical role in cancer cell growth and invasion [33], may also influence stromal fibroblast. FOXM1 is an example of FOTS core genes involved in both of epithelial cancer cells and the cancer associated fibroblasts. By comparing the gene expression between isolated breast cancer-associated fibroblasts (CAFs) and normal mammary fibroblasts (NFs) isolated from the same patient, Mercier et al. found that FOXM1 was up-regulated in CAFs rather than the NFs [34]. TPI1 gene in FOTS core signature was up-regulated in the CL4 fibroblast, which could supply epithelial cancer cells with pyruvate/lactate as "fuel" to help epithelial cancer cells to escape the anti-angiogenic treatment [35]; therefore, the TPI1 targeted therapy (or FOTS gene targeted therapy) for cancer-associated fibroblast will be an actionable way to control the progression of cancer in conjunction with anti-angiogenic therapy.