Current single sample pathway assignment tools are unstable
The currently available tools including PLAGE [1], z-score [2], ssGSEA [3], and GSVA [5] infer activation scores for a given patient relative to a large cohort of patient profiles. We asked if, and to what degree, scores determined in this relativistic manner, were affected by perturbations to the composition of patients in the comparative dataset. Towards this end, the activation of gene signatures from MSigDB C2 (literature curated) [12] were measured using the METABRIC dataset [13] in three distinct ways: using (1) patients with only ER-positive samples, (2) patients with only ER-negative samples, and (3) all patients. Figure 1a depicts the distribution of the absolute differences between the activation scores obtained by each tool using only ER-positive samples versus the scores obtained for all patients from METABRIC. This is repeated to compare ER-negative-specific scores versus all patients from METABRIC.
We observed a significant difference in the inferred activation score for all of the tested approaches (Wilcoxon’s P value < 0.0001 for all), with the largest differences observed for the ER-positive versus pan-METABRIC comparisons (Fig. 1a). Although the scores obtained from the four tools are incomparable, the results show that scores are influenced by perturbations in the patient composition of the dataset, establishing that none of these current approaches are absolute [9]. Analogous results were obtained when the dataset was stratified by HER2 status and grade (Additional file 2: Figure S1), suggesting that many clinicopathological factors may influence pathway activation tools in this manner. Concretely, if those approaches were to be used in a clinical context, conclusions about the activation status of any given pathway would be greatly influenced by the comparative “control” group used by the treating clinician.
A simple method to measure signature activation
The instability described above is analogous to what we and others have previously reported in the context of bioinformatics tools such as prediction analysis of microarray 50 (PAM50) that infer breast cancer subtype from gene expression profiles [9, 10]. In our previous work (AIMS), we developed a so-called absolute method that used only a gene expression profile from the target patient without recourse to or need of a larger compendium of patient profiles for comparative analysis [9]. Our goal here is to generalize and improve this framework to allow such absolute assignments to be made for any given molecular pathway or process for which there is a suitable gene signature.
The construction of our tool requires several components: (1) a collection of gene signatures that cover relevant biological processes including the hallmark molecular pathways/processes of breast cancer; (2) a suitably large collection of gene expression profiles of clinical breast cancer samples (ideally generated via both microarray and RNA-seq technologies) partitioned into training and validation datasets; and (3) a “gold standard” set of positive and negative patient profiles for each pathway with known activation status.
With respect to (1), we curated a collection of 6466 gene signatures from various databases of gene signatures including MSigDB [11] and GeneSigDB [14], which we complemented with our own in-house work to curate gene signatures especially applicable to breast cancer (n = 188, see “Methods” and Fig. 2a). With respect to (2), our training dataset comprised 4510 gene expression profiles obtained from 10 different cohorts on six different platforms (Table 1). We used our previously published dataset for validation (Affymetrix Gene ST, n = 429 patients [12]) and the most recent RNA-seq data from The Cancer Genome Atlas (TCGA) project (Table 1). We next describe how we estimate our gold standard.
Estimation of the activation levels of each gene signature
Our goal is to train our AIPS algorithm to accurately infer the activation status of a given pathway within an expression profile of a patient. In order to train our algorithm, we require examples of patients that have activation statuses that we believe to be correct for each gene signature of interest. Furthermore, the examples must cover all possible states (e.g. high, low, and latent activation). The nature of human clinical samples, however, does not allow us to determine the activation status of a pathway in a direct, rigorous manner. Therefore our gold standard learning set must be comprised of estimations of statuses across the relevant expression datasets (item (2) as mentioned previously).
For each biological process of interest (item (1) as mentioned previously), we applied a de novo non-parametric rank-based method that partitions the patients in our dataset into three classes depending on the pattern of expression exhibited by the genes within the signature. The three classes correspond to those patients that appear to have high activation of the signature, low activation of the signature, and a set of patients where the expression of the genes within the signature lose their characteristic pattern of pairwise correlation (Fig. 1b provides an example). The latter class is assigned to patients where the corresponding gene expression patterns are pairwise independent, thus supporting neither high nor low activation of the underlying pathway.
This de novo non-parametric test, referred to as the ROI at quantile q (ROIq), proceeds as follows. In a univariate fashion, each gene within a given signature is used to rank all patients from the lowest to the highest expression. In some cases, the direction of expression of each gene within the signature is known a priori (e.g. the gene is overexpressed or underexpressed in samples with activation of the target pathway). Before ranking, we first negate any expression measurements for genes that are known to be underexpressed: such genes that are negatively correlated with activation of the signature, order the patients in the reverse order. Now for each patient, the ranks of all k genes from the signature are summed (see Additional file 1 for full details). The patients are then linearly ordered from the lowest to the highest rank. The approach of mapping expression data to a linear order, which has been used previously in breast cancer [15], makes intuitive sense as we can view the expression of each molecular process or pathway as having a state between “turned off” and “turned on” completely. Figure 1b depicts a proof-of-concept linear ordering for an estrogen response signature from Doane et al. [16] using the METABRIC dataset (Table 1). Broadly speaking, such linear orders highlight patients at the left hand side that have low or negative expression of the signature, patients at the right hand side that have high or positive expression of the signature, and a region in the middle corresponding to patients with gene expression patterns that are independent. We refer to this as the observed linear order.
The second step in the ROIq procedure identifies the left and right boundaries of the low and high regions within the observed linear order. This is done via a permutation test where an “artificial” patient “n + 1” is created. Each of the k genes in the signature rank patient n + 1 with a uniformly randomly chosen number from (0… n + 1). Summing the randomized rank over all k genes in the signature, the position of patient n + 1 is computed within the observed linear order. This is repeated a suitably large number of times (e.g. n = 10,000). The ROIq is defined as the region that contains q% of the randomly generated samples (Fig. 1b bottom and see Additional file 1).
As expected, the patient ordering at ROI95 for the estrogen response signature depicted in Fig. 1b is strongly associated with breast cancer subtype as defined by ER and HER2 status. In particular, the low activation region of the ordering (left) is enriched for ER-negative and/or HER2-positive tumors (Fisher’s exact test, P < 0.000001, Fig. 1b, c), whereas the high activation region corresponds almost exclusively to ER-positive/HER2-negative tumors (Fisher’s exact test, P < 0.000001, Fig. 1b, c). Given that ER-positive status is strongly associated with good outcome in breast cancer [12, 17, 18], the patient partition produced by the ROI95 is strongly prognostic (log-rank P < 0.0001, Fig. 1d). Although only a single proof of concept example, the results suggest that the ROIq approach is capable of assigning pathway activation in breast cancer expression datasets. A more thorough investigation of the ROI95 is presented in Additional file 1. The analyses suggest that the ROI95 approach can faithfully recapitulate the low, independent, and high partitions of patients over a large range of biologically plausible parameters. For example, using simulated data, we tested the impact of several parameters including, for example, the gene signature size, fraction of patients in each category, and the strength of the signal on the capacity of the ROI95 approach to correctly assign patients in the low, independent, and high categories. We confirmed using this simulated set of data that the ROI95 is a robust approach within a wide range of parameters (see Additional file 1 and Additional file 2: Figure S2 and S3).
Identification of informative and non-informative gene signatures
To better ensure that the ROI95 is accurately determining pathway status, we applied the method to all gene signatures in our collection (n = 6466, Fig. 2a) using the METABRIC dataset (n = 1992). The fraction of low, independent, and high samples across all signatures in our collection is presented in Fig. 2b. As a control, the gene labels of the METABRIC dataset were randomly permuted. This procedure should, with high probability, break the vast majority of gene-gene correlations within signatures, causing the fraction of uninformative genes to rise. We should then observe an increase in the independent partition of the ROI95 with a concomitant decrease in the size of the low and high partitions. The results depicted in Fig. 2b confirm this, and suggest that for a large proportion of the signatures, the ROI95 method is indeed assigning activation status in a very non-random fashion.
The results depicted in Fig. 2b also suggest that the ROI95 method assigned almost every patient to the intermediate partition for some signatures. In other words, the ROI95 method applied to these specific gene signatures was not distinguishable from random expression patterns. We removed all such gene signatures from further consideration, in particular, a gene signature was removed when the fraction of samples in the ROI95 region exceed 0.8, as this is no better than partitions generated by random sampling. This led to a list of 3472 signatures that we considered informative in the context of breast cancer. A cutoff on the ROI95 region will exclude gene signatures activated or repressed in less than 20% of samples. Although more liberal thresholds could be used when studying an individual gene signature, we chose this conservative threshold here to enable our high-throughput global analyses.
Given that our gene signatures were collected from various sources, we asked whether any particular source was enriched for uninformative signatures. Of the remaining informative signatures, pathway databases such as BioCarta, Kyoto Encyclopedia of genes and genomes (KEGG) and Gene Ontology (GO) have higher fractions of signatures that have near random behavior (bottom of Fig. 2c). We note that sources that contributing signatures from transcriptional profiling have a higher proportion of non-random signatures (top of Fig. 2c).
Absolute single-sample gene signature activation in breast cancer
Based on the aforementioned results, we used the ROI95 method with the 3472 informative signatures to the training and validation datasets for calling signature activation levels using only the expression profile of a given single patient. The approach used here broadly follows our AIMS method that infers breast cancer subtype (Fig. 3a; also [9]).
First, the ROI95 is applied to each informative signature across 10 expression datasets generated from several microarrays (one-color and two-color) and RNA-seq platforms totaling 4510 samples (Table 1). This large and diverse training dataset provides us with more confidence that biases for specific clinicopathological or other patient variables are ablated, or at least reduced [9]. Our learning set consists of activation statuses (low, independent, and high) for 3472 signatures in each of 4510 patients from the training set.
Now, for each signature a model is learnt as follows. First, we consider every possible pair of genes x and y in the expression dataset, and we ask how strongly the rule “if expression of x is greater than y, then classify signature as high” is associated with our gold standard learning set. This is repeated for a modification of the rule where x is less than y, and for the other activation statuses of independent and low. Then, for each signature, the K most highly associated rules are chosen from the ranked list of all possible gene pair rules. The K rules are combined into a single probabilistic model using a naive Bayes’ classifier, and validated on an independent dataset (n = 742 samples, Table 1) [12].
The last step of our approach consists in selecting only those models with strong agreement with the ROI95 approach using a cutoff of 0.05 on the Bonferroni adjusted Kappa’s statistics P value. The two filtering steps that consist of first filtering out non-informative gene sets and then keeping models with significant agreement are essential to provide a set of reliable models (Fig. 3a).
Approximately 50% of informative signatures are amenable to absolute assignment
To ensure that our models are applicable across different technologies, we only retained models that significantly agreed with the gold standard in all 10 of the training datasets (kappa statistics Bonferroni-adjusted P value <0.05). This resulted in the retention of 1733 models (1733/3472, approximately 50%). We observed that the retained models had better agreement with the gold standard in the validation dataset in comparison with the models that were removed (Fig. 3b). This observation suggests that our training procedure did not introduce any significant over-fitting as selected models behave similarly in the training and validation sets and also all the models obtain a significant kappa statistic in the validation set (kappa statistics P values <0.01 for all).
We also stress here that the validation dataset contains data generated on a microarray platform (Affymetrix Gene ST) not present in the training dataset, suggesting that AIPS assignments are applicable on technologies not utilized in the training procedure. AIPS correctly assigns activation status for samples either assigned low or high activation (mostly red (approximately 80%) for the low-low and high-high lines in Fig. 3c). About 60% of samples were assigned to the independent class over all the 1773 models,
For the samples in the gold standard that were assigned independent status, AIPS correctly assigned this status in 60% of the cases, suggesting that predictions made for samples in the independent class are generally less reliable than predictions made for the low or high classes (mostly yellow for the “ind.-ind.” tagged line in Fig. 3c). Importantly, we rarely observed cases were AIPS predicted high activation when the gold standard was low, and vice versa (mostly blue for the low-high and high lines in Fig. 3c).
Last, we note that the AIPS models used fewer genes to infer pathway activation status than the original ROI95 method to generate the gold standard (median 50 versus 200 genes for AIPS versus ROI95 respectively; Wilcoxon’s test P < 0.0001, Fig. 3d).
Overall, these analyses confirmed that AIPS could accurately recapitulate the assignments of the gold standard. The 1733 AIPS models are listed in Additional file 3: Table S1 and pathway activation assignments can be computed for new individual samples using our AIPS R package [19].
Absolute assignment of EGFR pathway activation using AIPS
Epidermal growth factor receptor (EGFR) is well-studied in breast cancer with high activation of this pathway associated with poor patient outcome [20, 21]. We examined the behavior of our AIPS-EGFR model in the McGill validation dataset (Fig. 3e–g). We observed that the activation of samples at the far left and right (low and high respectively) are nearly perfectly inferred by AIPS (kappa = 0.54, P < 0.0001) with the majority of disagreements related to samples in the independent region (Fig. 3e–g). Figure 3f depicts the simple binary rules used by the AIPS model for the EGFR signature across the patient samples. There is a large cluster of EGFR-high patients associated with the PAM50 basal-like (BasalL) subtype, and a second large cluster of EGFR-low patients associated with luminal A and luminal B subtypes, a finding consistent with previous studies [20, 21].
Interestingly, gene set enrichment analysis of the genes selected to participate in the binary rules revealed an enrichment for genes upregulated by EGFR in MCF7 cells (FDR q value = 1.45e-17) [22]. Furthermore, all of these genes are on the right side (or left side) of binary rules associated (or not-associated) with high EGFR activation (rules marked by an asterisk in the heatmap of Fig. 3f). Although AIPS selects gene pair rules for each model from the large space of all possible gene pairs, it still surprisingly often selects genes that were present in the original signature, and therefore are likely good markers of the underlying biological processes. The enrichment of genes from the original signature was also reported for other “absolute” models such as AIMS [9]. Almost all of our AIPS models had such enrichment (1335 out of 1733 models (77%), Additional file 2: Figure S5). It is important to note that although most models statistically significantly overlapped the original signature, the number of genes from the original signature was still below 10%, suggesting that AIPS models do require many other genes to mimic the ROI95 assignments.
We asked if the absolute nature of the AIPS method would result in a more consistent EGFR model across gene expression platforms. In particular, we asked if our AIPS model inferred the same activation status for the EGFR pathway in both the microarray and RNA-seq platform for the same patient. Using TCGA data for 398 patients [23], AIPS assignments agreed on 87% of patients between both platforms (Fig. 3h, kappa = 0.81, P < 0.0001). Systematic analysis over the entire partitions (n = 1733 models) revealed that this agreement value is representative of almost all the partitions induced by AIPS and is significantly different from a random distribution (Fig. 3i, Wilcoxon’s test P < 0.0001, all kappa statistics P < 0.0001) supporting the argument that absolute assignments are robust across multiple platforms [9]. Together these results suggest that AIPS is capable of inferring signature activation levels with comparable performance to relativistic tools but with the added benefits of an absolute single sample approach.
AIPS assignments agree with whole-cohort inferred pathway scores
Our goal was to compare AIPS assignments with a second approach from the literature that takes full advantage of an entire dataset to assign signature activation scores. In particular, we used 21 non-redundant scores from the publication of Gatza et al. generated from breast cancer RNA-seq expression data from the TCGA project (n = 456) [24]. Concomitantly, we estimated activation status using our AIPS models for these signatures on the same patients. Overall good agreement between AIPS and the pathway scores from Gatza et al. was observed (Fig. 4a) although the two approaches are quite dissimilar.
Figure 4a suggests that well-known breast cancer biological processes are recapitulated by AIPS assignments. For example, patients with the luminal A or B subtype (LumA or LumB) are mostly assigned to the AIPS-high class for the ER gene signature, consistent with the fact that subtypes are enriched form ER-positive patients [9, 25]. Also, the AIPS assignments are in good agreement with the proliferation, ESC human and MYC DUKE pathway scores as these processes are known to be associated with the highly proliferative basal-like (BasalL), Her2 and LumB subtypes [25, 26]. We also observed a significant proportion of Her2-positive patients assigned to the AIPS-high Her2 gene signature. The interferon alpha and gamma, STAT1 and TNF alpha pathway scores are in good agreement with the AIPS assignments; these processes are associated with the BasalL subtype [27]. The P53 WT signature from AIPS is in good agreement with the pathway scores and is enriched for the LumA subtype that has been shown to be depleted of P53 mutations [23].
Generally, if AIPS modules and the pathway scores of Gatza et al. are in good agreement, then the patients within the high class of AIPS should also have the highest pathway scores according to Gatza et al. We tested the agreement between these two approaches and observed a strong relationship (Wilcoxon’s P < 1.4e-14 for all, Additional file 2: Figure S4A). Overall this analysis suggests that the “single-sample” AIPS approach is in good agreement with an approach that uses an entire cohort of samples to judge activation.
Sample partitions induced by AIPS are prognostic and associated with breast cancer subtypes
As there are many pathways and process that are known to vary in their expression across breast cancer subtypes, we investigated the relationship between patient subtype (called using the AIMS tool [9] or using clinical information) and the entire patient partitions generated by AIPS on the McGill dataset (Table 1).
We first studied the relationship between the partitions induced by AIPS and survival for the different molecular and clinical subtypes (Fig. 5a, b). We noticed that almost half (42.4%) of the partitions are significantly associated with survival if the analysis is performed on the entire cohort (Fig. 5a, b; Additional file 3: Table S1). This number drops drastically if we restrict the survival analysis to patients of given subtype. For example, only around 5% of the partitions are significantly associated with survival for the BasalL, Her2 and Luminal A and to close to nothing for the LumB and normal-like (NormalL) subtypes. Similarly, for the clinical subtypes, we found between 30 and 50% of partitions associated with the ER-positve, Her2-negative and ER-positive/Her2-negative subtypes. Those numbers drops between 3 and 6% for the ER-negative, Her2-positive and ER-negative/Her2-negative subtypes. We found almost no partitions associated with the ER-positive/Her2-positive and ER-negative/Her2-positive clinical subtypes (Fig. 5a, b; Additional file 3: Table S1).
We also studied the association between AIPS partitions and the molecular, ER, Her2 and proliferation score [25] using Fisher’s exact test (Fig. 5a, b). We found that between 85 and 92% of the partitions are associated with the different grouping with the exception of the Her2 subtype and the clinical features (Fig. 5a, b). Almost all (92.4%) partitions are associated with proliferation.
We also examined the frequency of patients classified as low over the entire set of 1733 AIPS models (Fig. 5c). Patients with the LumA and NormalL subtyptes obtained a significantly higher number of assignments compared to the those with the remaining subtypes, with an increase of between 200 and 250 in low assignments (Fig. 5c).
AIPS applied on single RNA-seq samples can be performed in a timely fashion
AIPS should enable a significant amount of information to be extracted from single RNA-seq samples and the growing number of single-cell sequencing datasets. We measured the time necessary to obtain AIPS partitions from a single RNA-seq profile using Kallisto, a fast pseudoalignment program used to obtain transcript quantification from sequencing data [28].
Using 100-bp paired-end sequencing data (median of 155 million paired-ends per patient [28], n = 96, Fig. 5d) we monitored the time taken to obtain transcript quantification and AIPS partitions using a single central processing unit (CPU). Overall, it required a median 21.6 minutes to obtain AIPS partitions for an individual patient directly from raw paired-end sequencing data (Fig. 5e).
Given that different transcript quantification pipelines return slightly different results, we evaluated the agreement between AIPS partitions made using Kallisto quantification to partitions made using an alternative approach (Bowtie2 [29] + RSEM [30]). Overall, we found significant agreement between AIPS partitions made from the two quantification approaches (Wilcoxon’s test P < 0.0001 and all kappa statistics P < 0.0001) with a median agreement of approximately 85% (Fig. 5f). Together this establishes that AIPS could be applied in a time-effective manner on single-sample RNA-seq data with the aid of a sufficiently fast pseudoalignment program e.g. Kallisto.