Common variants in breast cancer risk loci predispose to distinct tumor subtypes

Genome-wide association studies (GWAS) have identified multiple common breast cancer susceptibility variants. Many of these variants have differential associations by estrogen receptor (ER) status, but how these variants relate with other tumor features and intrinsic molecular subtypes is unclear. Among 106,571 invasive breast cancer cases and 95,762 controls of European ancestry with data on 173 breast cancer variants identified in previous GWAS, we used novel two-stage polytomous logistic regression models to evaluate variants in relation to multiple tumor features (ER, progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2) and grade) adjusting for each other, and to intrinsic-like subtypes. Eighty-five of 173 variants were associated with at least one tumor feature (false discovery rate < 5%), most commonly ER and grade, followed by PR and HER2. Models for intrinsic-like subtypes found nearly all of these variants (83 of 85) associated at p < 0.05 with risk for at least one luminal-like subtype, and approximately half (41 of 85) of the variants were associated with risk of at least one non-luminal subtype, including 32 variants associated with triple-negative (TN) disease. Ten variants were associated with risk of all subtypes in different magnitude. Five variants were associated with risk of luminal A-like and TN subtypes in opposite directions. This report demonstrates a high level of complexity in the etiology heterogeneity of breast cancer susceptibility variants and can inform investigations of subtype-specific risk prediction.


Introduction
Breast cancer represents a heterogenous group of diseases with different molecular and clinical features [1]. Clinical assessment of estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2) and histological grade are routinely determined to inform treatment strategies and prognostication [2]. Combined, these tumor features define five intrinsic-like subtypes (i.e., luminal A-like, luminal B-like/HER2-negative, luminal B-like/HER2-positive, HER2-positive/non-luminal, and triple-negative) that are correlated with intrinsic subtypes defined by gene expression panels [2,3]. Most known breast cancer risk or protective factors are related to luminal or hormone receptor (ER or PR) positive tumors, whereas less is known about the etiology of triple-negative (TN) tumors, an aggressive subtype [4,5].
Breast cancer genome-wide association studies (GWAS) have identified over 170 common susceptibility variants, most of them single nucleotide polymorphisms (SNPs), of which many are differentially associated with ER-positive than ER-negative disease [6][7][8]. These include 20 variants that primarily predispose to ER-negative or TN disease [7,8]. However, few studies have evaluated variant associations with other tumor features, or simultaneously studied multiple, correlated tumor markers to identify source(s) of etiologic heterogeneity [7,[9][10][11][12][13]. We recently developed a two-stage polytomous logistic regression method that efficiently characterizes etiologic heterogeneity while accounting for tumor marker correlations and missing tumor data [14,15]. This method can help describe complex relationships between susceptibility variants and multiple tumor features, helping to clarify breast cancer subtype etiologies and increasing the power to generate more accurate risk estimates between susceptibility variants and less common subtypes. We recently demonstrated the power of this method in a GWAS to identify novel breast cancer susceptibility accounting for tumor heterogeneity [15].
In this report, we sought to expand our understanding of etiologic heterogeneity across breast cancer subtypes, by applying the two-stage polytomous logistic regression methodology to a large study population from the Breast Cancer Association Consortium (BCAC) for detailed characterization of risk associations with 173 breast cancer risk variants identified by GWAS [6,7] by tumor subtypes defined by ER, PR, HER2 and tumor grade. Ahearn et al. Breast Cancer Research (2022) 24:2

Study population and genotyping
The study population and genotyping are described in previous publications [6,7] and in the Additional file 3: Methods. We included invasive cases and controls from 81 BCAC studies with genotyping data from two Illumina genome-wide custom arrays, the iCOGS and OncoArray (106,571 cases (OncoArray: 71,788; iCOGS: 34,783) and 95,762 controls (OncoArray: 58,134; iCOGS: 37,628); Additional file 1: Table S1). All subjects in the study population were female and of European ancestry, with European ancestry determined by ancestry informative GWAS markers as previously described [6]. We evaluated 173 breast cancer risk variants that were identified in or replicated by prior BCAC analyses to be associated with breast cancer risk at a p-value threshold p < 5.0 × 10 -8 [6,7]. Most of these variants (n = 153) were identified because of their association with risk of overall breast cancer, and a small number of variants (n = 20) were identified because of their association specific to ER-negative breast cancer (Additional file 1: Table S2). These 173 variants have not previously been simultaneously investigated for evidence of tumor heterogeneity with multiple tumor markers [6,7,15,16]. Genotypes for the variants marking the 173 susceptibility loci were determined by genotyping with the iCOGS and the OncoArray arrays and imputation to the 1000 Genomes Project (Phase 3) reference panel.

Statistical analysis
An overview of the analytic strategy is shown in Fig. 1 and a detailed discussion of the statistical methods, including the two-stage polytomous logistic regression, are provided in the Additional file 3: Methods and elsewhere [14,15]. Briefly, we used two-stage polytomous regression models that allow modelling of genetic association of breast cancer accounting for underlying heterogeneity in associations by combinations of multiple tumor markers using a parsimonious decomposition of subtypespecific case-control odds-ratio parameters in terms of marker-specific case-case odd-ratio parameters [14,15]. We introduced further parsimony by using the mixedeffect formulation of the model that allows ER-specific case-case parameters to be treated as fixed and similar parameters for other markers (PR, HER2 and grade (as an ordinal variable)) as random. We used an expectation-maximization (EM) algorithm [17] for parameter estimation under this model to account for missing data in tumor characteristics.
Our primary aim was to identify which of 173 known breast cancer susceptibility variants showed heterogenous risk associations by ER-, PR-and HER2-status and tumor grade. This was tested using a global heterogeneity test by ER, PR, HER2 and/or grade, with a mixed-effect two-stage polytomous model (model 1), fitted separately for each variant. The global null hypothesis was that there was no difference in risk of breast cancer associated with the variant genotype across any of the tumor features being evaluated. We accounted for multiple testing (173 tests, one for each variant) of the global heterogeneity test using a false discovery rate (FDR) < 5% under the Benjamini-Hochberg procedure [18].
For the variants showing evidence of global heterogeneity after FDR adjustment, we further evaluated which of the tumor features contributed to the heterogeneity by fitting a fixed-effects two-stage model (model 2) that simultaneously tested for associations with each tumor feature (this model was fitted for each variant separately). We used a threshold of p < 0.05 for markerspecific tumor heterogeneity tests to describe which specific tumor marker(s) contributed to the observed heterogeneity, adjusting for the other tumor markers in the model. This p-value threshold was used only for descriptive purposes, as the primary hypotheses were tested using the FDR-adjusted global test for heterogeneity described above.
To help describe sources of heterogeneity from different tumor characteristics in models 2 and 3, we performed cluster analyses based on Euclidean distance calculated from the absolute z-statistics that were estimated by the individual marker-specific tumor heterogeneity tests (model 2) and the case-control associations with risk of intrinsic-like subtypes (model 3). The clusters were used only for presentation purposes and were not intended to suggest strictly defined categories, nor are they intended to suggest the variants are associated with tumor markers through similar biological mechanisms. Clustering was performed in R using the function Heatmap as implemented by the package "Complex Heatmap" version 3.1 [19]. Additional details for calculating Euclidean distance using absolute z-statistics are provided in Additional file 3: Methods.
We performed sensitivity analyses, in which we estimated the ORs and 95% CI between the variants and the intrinsic-like subtypes by implementing a standard polytomous model that defined the intrinsic-like subtypes using only the available tumor markers data (not using the EM algorithm to account for missing data in tumor markers). We analyzed OncoArray and iCOGS array data separately for all analyses, adjusting for the first ten principal components for ancestry-informative variants, and then meta-analyzed the results.

Results
The mean (SD) ages at diagnosis (cases) and enrollment (controls) were 56.6 (12.2) and 56.4 (12.2) years, respectively. Among cases with information on the corresponding tumor marker, 81% were ER-positive, 68% PR-positive, 83% HER2-negative and 69% grade 1 or 2 (Table 1; see Additional file 1: Table S1 for details by study). Additional file 1: Table S3 shows the correlation between the tumor markers. ER was positively correlated with PR (r = 0.61) and inversely correlated with HER2 (r = -0.16) and grade (r = -0.39). The most common intrinsic-like subtype was luminal A-like (54%), followed by TN (14%), luminal B-like/HER2-negative (13%), Luminal B-like/HER2-positive (13%) and HER2positive/non-luminal (6%; Table 1). These frequencies varied across BCAC studies because the studies were diverse in both design and country of origin (Additional file 1: Table S1). Notably, there is little populationbased data on the frequencies of intrinsic-like subtypes [20,21]. The overall frequencies in our study population are generally similar to those reported by SEER for non-Hispanic white females and the Scottish cancer registry [20,21]; however, given the diverse sources of our data, they are not directly comparable to country-specific cancer registries. Figure 1 shows an overview of the analytic strategy and results from three main analyses performed separately for each variant: 1) global test for heterogeneity by all tumor markers (model 1; primary hypothesis), 2) markerspecific tumor test for heterogeneity for each marker, adjusting for the others (model 2), and 3) estimation of case-control ORs (95%CIs) by intrinsic-like subtypes (model 3) and by grade (model 4).

Global test for heterogeneity by tumor markers (primary hypothesis)
Mixed-effects two-stage models (model 1) were fitted for each of the 173 variants separately and included terms for ER, PR, HER2 and grade to test for global heterogeneity by any of the tumor features (case-case comparison). This model identified 85 of 173 (49.1%) variants with evidence of heterogeneity by at least one tumor feature (FDR < 5%; Figs. 1, 2; Additional file 1: Fig. S1).

Marker-specific tumor test for heterogeneity for each marker, adjusting for other markers
Fixed-effects two-stage models (model 2) were used to test which of the correlated tumor markers was responsible for the observed global heterogeneity (case-case comparison). Figure 2 and Additional file 1: Fig. S1 show results of these analyses clustered by case-case z-values of associations between susceptibility variants and each tumor marker for the 173 variants. For the 85 variants with observed global heterogeneity, these analyses identified ER and grade as the two features that most often contributed to the observed heterogeneity (45 and 33 variants had marker-specific p < 0.05 for ER and grade, respectively), and 29 variants were associated with more than one tumor feature (Figs. 1, 2, Additional file 1: Fig. S1). Eighteen of these 85 variants showed no associations with any individual tumor marker at p < 0.05 (Fig. 2, Additional file 1: Fig. S1). Twenty-one variants were associated at p < 0.05 only with ER, 12 variants only with grade, four variants only with PR and one variant only with HER2 (Fig. 2, Additional file 1: Fig. S1, see footnotes).

Estimation of case-control ORs (95%CIs) by intrinsic-like subtypes (model 3)
Fixed-effects two-stage models for intrinsic-like subtypes (model 3) were fitted for each of the 85 variants with evidence of global heterogeneity to estimate ORs (95% CIs) for risk associations with each subtype (casecontrol comparisons). Additional file 1: Fig. S2 shows a summary of these analyses for the 85 variants, clustered by case-control z-value of association between susceptibility variants and breast cancer intrinsic-like subtypes, and Additional file 2: Fig. S3 shows forest plots for associations with risk by tumor subtypes. Nearly all (83 of 85) variants were associated with risk (p < 0.05) for at least one luminal-like subtype, and approximately half (41 of 85) of the variants were associated with risk of at least one non-luminal subtype, including 32 variants that were associated with risk of TN disease (Fig. 1, Additional file 1: Fig. S2 footnote 'h'). Ten variants were associated with risk of all subtypes (Fig. 1 variants associated with different patterns of associations with intrinsic subtypes (Fig. 3 a-d).
Three variants 19p13.11-rs67397200, 5p15.33-rs10069690 and 1q32.11-rs4245739 showed the strongest evidence of associations with risk of TN disease. All three of these variants showed weaker or no evidence of associations with risk of the other subtypes (Fig. 3b, Additional file 1: Fig. S2). Two variants in low LD (r 2 = 0.17) at 6q25, rs9397437 and rs3757322, and a third variant in 6q25, rs2747652, which was not in LD (r 2 < 0.01) with rs9397437 or rs3757322, showed strong evidence of being associated with risk of all subtypes. rs9397437 and rs3757322 were most strongly associated with risk of TN disease. rs2747652 was most strongly associated with risk of HER2-positive subtypes (Figs. 3c, Additional file 1: Fig. S2).
Five variants were associated with risk of luminal A-like disease in an opposite direction to their association with risk of TN disease. 1q32.1-rs6678914, 2p23.2-rs4577244, and 19p13.11-rs67397200 had weaker evidence of associations with risk of luminal A-like disease compared to associations with risk of TN disease, and 10p12.31-rs7072776 and 22q12.1-rs17879961 (I157T) had stronger evidence of an association with risk of luminal A-like disease compared to their association with risk of TN disease (Fig. 3d, Additional file 1: Fig. S2, for rs67397200 see Fig. 3b).

Estimation of case-control ORs (95%CIs) by tumor grade (model 4)
Case-control associations by tumor grade for the 12 variants that were observed associated at p < 0.05 only with grade in case-case comparisons are shown in Additional file 2: Fig. S4. 13q13.1-rs11571833, 1p22.3-rs17426269 and 11q24.3-rs11820646 showed stronger evidence for predisposing to risk of high-grade subtypes, and the remaining variants showed stronger evidence for predisposing to risk of low-grade subtypes.
When limiting analyses to cases with intrinsic-like subtypes defined only by available tumor marker data, results from case-control analyses were similar, but less precise than results from the two-stage polytomous regression model using the EM algorithm to account for missing tumor marker data (Additional file 1: Table S4).

Discussion
This study demonstrates the extent and complexity of genetic etiologic heterogeneity among 173 breast cancer risk variants by multiple tumor characteristics, using novel methodology in the largest and the most comprehensive investigation conducted to date. We found compelling evidence that about half of the investigated breast cancer susceptibility loci (85 of 173 variants) predispose to tumors with different characteristics. We identified tumor grade, along with confirming ER status, as important determinants of etiologic heterogeneity. Associations with individual tumor features translated into differential associations with the risk of intrinsic-like subtypes defined by their combinations.
Many of the variants with evidence of global heterogeneity predisposed to risk of multiple subtypes, but with different magnitudes. For example, three variants identified in early GWAS for overall breast cancer, FGFR2 (rs35054928 and rs2981578) [22,23] and 8q24.21 (rs13281615) [22], were associated with luminal-like and HER2-positive/non-luminal subtypes, but not with TN disease. rs4784227 located near TOX3 [22,24] and rs62355902 located in a MAP3K1 [22] regulatory element, were associated with risk of all five subtypes. Of the five variants found associated in opposite directions with luminal A-like and TN disease, we previously reported rs6678914 and rs4577244 to have opposite effects between ER-negative and ER-positive tumors [7]. rs17879961 (I157T), a likely causal [16] missense variant located in a CHEK2 functional domain that reduces or abolishes substrate binding [25], was previously reported to have opposite directions of effects on lung adenocarcinoma and lung squamous cell carcinoma and for Table 1 Distribution of estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2), and grade and the intrinsic-like subtypes for cases of invasive breast cancer in studies from the Breast Cancer Consortium Association lung cancer between smokers and non-smokers [26,27]. Moreover, the risk association of rs17879961 has been reported to vary across tissue locations/cell-types, as this variant has been associated with a higher risk of pancreatic ductal adenocarcinoma [28], chronic lymphocytic leukemia [29], and colorectal cancer [30], and also associated with a lower risk of aerodigestive squamous cell carcinoma [31] and ovarian cancer [32]. To our knowledge, rs67397200 and rs7072776 have not previously been shown to be associated with subtypes in opposite directions. In a prior breast cancer GWAS that applied the two-stage polytomous model for risk variant discovery, we also identified five variants associated with risk of luminal A-like and TN disease in opposite directions [15]. Overall, these findings suggest that the same biological pathway has opposite effects on the susceptibility to different tumor types. This interpretation is supported by functional characterization of rs36115365, a variant on 5p15.33, which was found to have similar cis-regulatory effects on TERT in multiple cancers cell lines from different cancers, but was associated with a higher risk of pancreatic and testicular cancer and a lower risk of lung cancer [33]. Alternatively, a causal variant may differently influence cis-gene regulation and/or alter different biological pathways depending on the cell or tissue of origin [34]. Further studies of these variants are required to clarify the biological mechanisms for these apparent cross-over effects.
In prior ER-negative GWAS, we identified 20 variants that predispose to ER-negative disease, of which five variants were only or most strongly associated with risk of TN disease (rs4245739, rs10069690, rs74911261, rs11374964, and rs67397200) [7,8]. We confirmed these five variants to be most strongly associated with TN disease. The remaining previously identified 15 variants all showed associations with risk of non-luminal subtypes, especially TN disease, and for all but four variants (rs17350191, rs200648189, rs6569648, and rs322144), evidence of global heterogeneity was observed.
Little is known regarding PR and HER2 as sources of etiologic heterogeneity independent of ER status. Of the four variants that showed evidence of heterogeneity only according to PR, rs10759243 [6,35], rs11199914 [36] and rs72749841 [6] were previously found primarily associated with risk of ER-positive disease, and rs10816625 was found to be associated with risk of ER-positive/PRpositive tumors, but not other ER/PR combinations [12]. rs10995201 was the only variant found in case-case comparisons to be solely associated with HER2 status, although the evidence was not strong, requiring further confirmation. Previously, rs10995201 showed no evidence of being associated with ER status [37]. Most variants associated with PR or HER2, had not been investigated for PR or HER2 heterogeneity while adjusting for ER [9][10][11][12][13]. We previously reported rs10941679 to be associated with PR-status, independent of ER, and also with grade [10]. We also found suggestive evidence of PR-specific heterogeneity for 16q12-rs3803662 [13], which is in high LD (r 2 = 0.78) with rs4784227 (TOX3), a variant strongly associated with PR status. Our findings for rs2747652 are also consistent with a prior BCAC fine-mapping analysis across the ESR1 locus, which found rs2747652 to be associated with risk of the HER2positive/non-luminal subtype and high grade independent of ER [9]. rs2747652 overlaps an enhancer region and is associated with reduced ESR1 and CCDC170 expression [9].
Histologic grade is a composite of multiple tumor characteristics, including mitotic count, nuclear pleomorphism, and degree of tubule or gland formation, therefore susceptibility variants associated with tumor grade could affect multiple biological pathways [38]. Evidence from comparisons of tumor morphology and genomic and molecular alterations suggest that tumor grade is likely a 'stable' tumor feature and does not progress from lowto high-grade [39][40][41][42], thus the variants associated with grade are likely not associated with grade progression. Among the 12 variants identified with evidence of heterogeneity by grade only, rs17426269, rs11820646, and rs11571833 were most strongly associated with risk of grade 3 disease. rs11571833 lies in the BRCA2 coding region and produces a truncated form of the protein [43] and has been shown to be associated with both risk of TN disease and risk of serous ovarian tumors, both of which tend to be high-grade [44]. To our knowledge, rs17426269 and rs11820646 have not been investigated in relation to grade heterogeneity. The remaining nine variants were all more strongly associated with grade 1 or grade 2 disease. Six of these variants were previously reported to be associated primarily with ER-positive disease [6,36,45,46], highlighting the importance of accounting for multiple tumor characteristics to better illuminate heterogeneity sources.
We identified 18 variants with evidence of global heterogeneity (FDR < 5%), but no significant (marker-specific p < 0.05) associations with any of the individual tumor characteristic(s). This is likely explained by the fact that the test for association with specific tumor markers using fixed-effects models is less powerful than mixed-effects models used to test the primary hypothesis of global heterogeneity by any tumor marker [14].
To help describe and visualize the strength of the evidence for common heterogeneity patterns, we performed clustered analyses of z-values for tumor marker-specific heterogeneity tests and case-control associations with risk of intrinsic-like subtypes. Because they are based on z-values, these clusters reflect differences in sample size and statistical power to detect associations between variants and specific tumor subtypes. Thus, clusters should not be interpreted as strictly defined categories.
A major strength of our study is our large sample size of over 100,000 breast cancer cases with tumor marker information, and a similar number of controls, making this the largest, most comprehensive breast cancer heterogeneity investigation. Our application of the two-stage polytomous logistic regression enabled adjusting for multiple, correlated tumor markers and accounting for missing tumor marker data. This is a more powerful and efficient modeling strategy for identifying heterogeneity sources among highly correlated tumor markers, compared with standard polytomous logistic regression [14,15]. In simulated and real data analyses, we have demonstrated that in the presence of heterogenous associations across subtypes, the two-stage model is more powerful than polytomous logistic regression for detecting heterogeneity. Moreover, we have demonstrated that in the presence of correlated markers, the two-stage model, incorporating all markers simultaneously, has a much better ability to distinguish the true source(s) of heterogeneity than testing for heterogeneity by analyzing one marker at a time [14,15]. In prior analyses, we showed that the two-stage polytomous regression is a powerful approach to identify susceptibility variants that display tumor heterogeneity [15]. Notably, in this prior investigation we excluded the genomic regions in which the 173 variants that were investigated in this work are located [15].
Our study also has some limitations. First, many breast cancer cases from studies included in this report had missing information on one or more tumor characteristics. ER tumor status data was available for 81% of cases, but missing data for the other tumor markers ranged from 27 to 46%. To address this limitation, we implemented an EM algorithm that allowed a powerful analysis to incorporate cases with missing tumor characteristics under the assumption that tumor characteristics are missing at random (MAR), i.e., the underlying reason for missing data may depend on observed tumor markers or/and covariate values, but not on the missing values themselves [47]. If this assumption is violated it can lead to an inflated type-one error [14]. However, in the context of genetic association testing, the missingness mechanism would also need to be related to the genetic variants under study, which is unlikely. The 88 variants that did not meet the p-value threshold for significant heterogeneity in the global test, are likely to represent a combination of variants that are associated with risk of all investigated tumor subtypes with similar effects and variants for which we lacked power to detect evidence of global heterogeneity due to weak effect sizes or uncommon allele frequencies. In addition, our study focused on investigating ER, PR, HER2, and grade as heterogeneity sources; future studies with more detailed tumor characterization could reveal additional etiologic heterogeneity sources.

Conclusion
Our findings provide insights into the complex etiologic heterogeneity patterns of common breast cancer susceptibility loci. These findings may inform future studies, such as fine-mapping and functional analyses to identify the underlying causal variants, clarifying biological mechanisms that drive genetic predisposition to breast cancer subtypes. Moreover, these analyses provide precise relative risk estimates for different intrinsic-like subtypes that could improve the discriminatory accuracy of subtype-specific polygenic risk scores [48].
Additional file 1. Figures S1 and S2 and Table S1- Additional file 4. Funding and Acknowledgement. This file contains the additional funding not included in the main text, the acknowledgments, and the names of the people in the collaboration groups.