Genome-wide and transcriptome-wide association studies of mammographic density phenotypes reveal novel loci

Background Mammographic density (MD) phenotypes, including percent density (PMD), area of dense tissue (DA), and area of non-dense tissue (NDA), are associated with breast cancer risk. Twin studies suggest that MD phenotypes are highly heritable. However, only a small proportion of their variance is explained by identified genetic variants. Methods We conducted a genome-wide association study, as well as a transcriptome-wide association study (TWAS), of age- and BMI-adjusted DA, NDA, and PMD in up to 27,900 European-ancestry women from the MODE/BCAC consortia. Results We identified 28 genome-wide significant loci for MD phenotypes, including nine novel signals (5q11.2, 5q14.1, 5q31.1, 5q33.3, 5q35.1, 7p11.2, 8q24.13, 12p11.2, 16q12.2). Further, 45% of all known breast cancer SNPs were associated with at least one MD phenotype at p < 0.05. TWAS further identified two novel genes (SHOX2 and CRISPLD2) whose genetically predicted expression was significantly associated with MD phenotypes. Conclusions Our findings provided novel insight into the genetic background of MD phenotypes, and further demonstrated their shared genetic basis with breast cancer. Supplementary Information The online version contains supplementary material available at 10.1186/s13058-022-01524-0.

dense with white appearance, while adipose tissue is radiologically lucent with dark appearance on a mammogram [1]. Mammographic density (MD) has been widely established as one of the strongest risk factors for breast cancer [2][3][4], the most common cancer type among women in the USA [5]. Specifically, quantitative MD measures, including mammographic dense area (DA), non-dense area (NDA), and the percentage of dense area in the whole breast (PMD), have all been independently associated with breast cancer [3,[6][7][8]. In analyses adjusting for body mass index (BMI) and age, women with higher DA and PMD have an elevated risk of breast cancer, while NDA is associated with a decreased breast cancer risk.
Twin studies indicate that genetic factors explain a large fraction of the variation in MD phenotypes, with heritability estimates for DA, NDA, and PMD, after adjusting for age and individual-specific shared environmental factors, exceeding 60% [9][10][11]. Previous genome-wide association studies (GWAS) have identified 46 genetic variants that are significantly (p < 5 × 10 −8 ) associated with at least one MD phenotype, including 27 associated with DA, 17 associated with NDA, and 20 associated with PMD [12][13][14][15][16][17]. Importantly, many of these variants have also been discovered as the susceptible loci of breast cancer, suggesting the critical role played by MD as an intermediate phenotype for the disease. However, only a small fraction of the variance of MD phenotypes can be explained by these significant variants [14,17]. To enhance our understanding of the genetic basis of MD, additional GWAS with larger sample sizes is needed.
In the present study, we conducted GWAS and a transcriptome-wide association study (TWAS) in up to 27,900 European ancestry women with the goal of identifying novel loci associated with MD phenotypes.

Study population and data collection
We conducted a GWAS for three MD phenotypes (DA, NDA, and PMD) using data from 21 studies which provided individual-level genotype and phenotype data (Additional file 2: Table S1) as well as nine additional studies which provided GWAS summary statistics (Additional file 2: Table S2), under the Breast Cancer Association Consortium (BCAC) and the Markers of Density Consortium (MODE). The overall study sample comprised of 6666 breast cancer cases and 21,234 controls. All the individuals had PMD data collected, while the DA and NDA measures were only available in a proportion of the study population. The final sample sizes used in the meta-analyses were 24,579 (DA), 24,689 (NDA), and 27,900 (PMD). For breast cancer cases, mammograms collected prior to the cancer diagnosis were used for density assessment. Study-specific approaches to obtain quantitative measures of MD phenotypes are summarized in Additional file 2: Table S1, and studyspecific protocols for MD measurement are given in the Additional file 1. Most of studies included in our analysis used CUMULUS, a computer-assisted semi-automated thresholding software [18]. Age and BMI at time of mammogram collection were included as covariates in the GWAS. For participants with missing BMI at mammogram (N = 1767), self-reported BMI within five years of mammogram collection was used as an approximation.
Individual-level genotype data were generated with either the iCOGs [19] or OncoArray [20] arrays. We applied standard quality control filters as described elsewhere [19]. Genotype data were imputed to 1000 Genomes phase 3 version 5 using IMPUTE2 [21]. Genotype dosage, ranging between 0 and 2, was generated for imputed variants. Single nucleotide polymorphisms (SNPs) with low imputation quality (INFO < 0.3) or with a minor allele frequency (MAF) < 1% were excluded. Approximately 9.8 million variants were included in the association analysis. Genomic positions of the variants were based on Genome Reference Consortium GRCh37 (hg19).

Genome-wide association study (GWAS)
We conducted study-specific multivariable adjusted linear regression analysis for each MD phenotype. All MD phenotypes were square-root transformed before analysis as this resulted in distributions that were close to normal. Age and 1/BMI at mammogram, and the first ten ancestry informative principal components, as previously described [14], were included as covariates in each regression model. Analyses were performed using R 3.6.1 (R Foundation). We then combined study-specific GWAS results with previously derived GWAS summary statistics using a sample-size weighted meta-analysis (the 'SAMPLESIZE' scheme as implemented in METAL [22]). To be included in the meta-analysis, a variant needed to have a valid Z-statistic from at least three individual studies and a minimum sample size of 3000. Regional association plot for each genome-wide significant locus in the meta-analysis was generated using the LocusZoom software [23].

Sensitivity and conditional analysis
The majority of studies included in our analysis were population-based or breast cancer nested case-control studies. To assess if any identified SNP-MD associations was an artifact resulting from oversampling of breast cancer cases in our population, we replicated the association analysis for all genome-wide significant SNPs in controls only (N = 21,234), as a sensitivity analysis. As mammographic NDA is strongly associated with BMI, we also assessed the association between significant NDA loci and BMI among 13,915 individuals with NDA, BMI, and genotype data available.
To quantify the number of independent signals in each significant GWAS locus, we performed a conditional analysis using the COndition and JOint analysis tool implemented in the Genome-wide Complex Trait Analysis software (COJO-GCTA) [24]. Since COJO-GCTA requires beta and standard error estimates, which were not available in our sample-size weighted meta-analysis data, we performed a standard error weighted meta-analysis with the normalized square-rooted MD phenotypes (per study, [sqrt-MD − mean(sqrt-MD)]/stderr(sqrt-MD)) as the outcomes. For each locus, we defined the lead SNP as the first independent signal, and performed the conditional analysis for SNPs located within + / − 500 kb. The top-ranked SNP with conditional p value < 10 −5 was added to the independent signal list, and the conditional analysis was run again for rest of the SNPs. The conditional analysis was halted when no variant reached the threshold of conditional p value < 10 −5 . For the loci with multiple independent signals identified in the conditional analysis, all signals are annotated on the regional association plot.

Breast cancer association analysis
We examined the association between MD phenotypeassociated SNPs and breast cancer risk, overall and by estrogen receptor (ER) status, using publicly available breast cancer GWAS summary statistics [19], based on 122,977 cases (including 69,501 ER-positive and 21,468 ER-negative cases) and 105,974 controls of European ancestry from the BCAC. We also assessed if known breast cancer SNPs [25] were associated with MD phenotypes.

Exploratory bioinformatics analysis
We used linkage disequilibrium (LD) score regression to estimate the SNP heritability (h 2 SNP ) of MD phenotypes [26]. We partitioned the h 2 SNP by 74 functional genomic categories [27], and estimated the heritability enrichment for each category. We quantified the genome-wide genetic correlation between each MD phenotype and breast cancer [19,28]. We also estimated the local genetic correlation between MD phenotypes and overall breast cancer using ρHESS [29,30], which estimates the local shared heritability between two traits across 1703 independent genomic blocks, based on LD in European ancestry populations [31]. We defined statistically significant local genetic correlations as p < 0.05/1703 = 2.94 × 10 −5 .

Transcriptome-wide association study (TWAS)
To estimate the association between imputed gene expression and MD phenotypes, we conducted a transcriptome-wide association analysis (TWAS). We used genotype and gene expression data in mammary tissue from 396 individuals collected by the GTEx consortium (Release V8) to build gene-specific SNP prediction models of gene expression [32]. Predictive models were built based on variants located + / − 500 kb of each gene, using three different approaches (Top 1, Elastic Net [33], and LASSO [34]). Gene-specific expression levels were then imputed with the model showing the highest predictive R-square based on cross-validation. A total of 7284 genes with nominally significant (p < 0.01) heritability were included in the association analysis for each MD phenotype. Construction of predictive models and association analysis using GWAS summary statistics were performed using the R-based pipeline FUSION [34]. A significance threshold of p < 0.05/(7284*3) = 2.29 × 10 −6 was utilized to identify statistically significant associations between imputed gene expression levels and MD phenotypes.

Replication of novel GWAS and TWAS findings
Replication analyses of the novel GWAS loci were performed using data from a previous GWAS meta-analysis of mammographic density phenotypes in an independent population of 24,192 European ancestry women participating in the Kaiser Permanente Northern California (KPNC) Research Program on Genes, Environment and Health (RPGEH) [17]. Briefly, MD phenotypes were measured using Cumulus6 on a single craniocaudal view from 20,311 Hologic and 3881 GE full-field digital mammography (FFDM) exams. MD phenotypes were transformed separately within each cohort to attain standard normal distributions and to facilitate meta-analysis and interpretation of effect sizes in SD units. Genotypes were assayed using the Affymetrix Axiom array with > 650,000 variants, and imputed using the 1000 Genomes Project Phase III reference panel. Allele dosage effects were estimated using linear regression models adjusted for age at mammography, ln(BMI), the first ten principal components of European ancestry, genotyping reagent kit, and image batch separately in the Hologic and GE cohorts, and the estimates were combined using inverse-variance weighted meta-analysis.
Replication analyses of the novel TWAS loci were performed in 24,158 women from the Kaiser RPGEH mammographic density GWAS [17] with genotypes imputed using the Haplotype Reference Consortium reference panel for single-nucleotide variants, and 1000 Genomes Project Phase III reference panel for indels [35]. Expression levels of 7 genes (MTMR11, SHOX2, CRISPLD2, Table 1 Lead SNPs of the genome-wide significant loci identified in the GWAS meta-analysis of mammographic dense area (DA), nondense area (NDA) and percent mammographic density (PMD) 1 Genomic positions based on build GRCh37/hg19 2 Closest gene to the lead variant 3 Coded as reference allele/effect allele 4 Minor allele frequency, based on the European Ancestry population in the 1000 Genome project 5 Z-scores were obtained using sample size weighted meta-analysis of GWAS. The multivariate linear regression model used in GWAS analysis adjusted for age and BMI at mammogram, as well as the first ten principal components representing population structure. Additive inheritance model was used. Square-root transformed mammographic density phenotypes were used as the outcome variable 6 Whether the locus is novel (Yes) or has previously been reported (No) to be associated with at least one MD phenotype in the literature SMIM25, TMEM184B, EP300, and DESI1) were estimated using the PredictDB GTEx v8 Elastic Net models for mammary tissue [33], which did not include MRPL23-AS1. Associations of the predicted gene expression levels with the standardized MD phenotypes were estimated using linear regression models adjusted for age at mammography, ln(BMI), the first ten principal components of European ancestry, genotyping reagent kit, and image batch separately in the Hologic (n = 20,282) and GE (n = 3876) cohorts, and the estimates were combined using inverse-variance weighted meta-analysis.

Results
Our study population was on average 56.6 years old and had an average BMI of 26.5 kg/m 2 at the time of mammogram. The mean DA, NDA, and PMD were 28.5 cm 2 , 120.4 cm 2 , and 23.4%, respectively. Age and BMI at mammogram, as well as the square-root transformed MD measures, all approximately followed a normal distribution (Additional file 2: Figure S1). Genomic inflation factors (λ GC ) were between 1.11 and 1.13 (Additional file 2: Figure S2), with LD-score regression intercepts between 1.05 and 1.06, suggesting that the observed genomic inflation is partly driven by the polygenic effects of many variants [26].
Sensitivity analysis based on 21,234 controls showed consistent direction of effect and comparable effect size to the main analysis (Additional file 2: Table S5). None of the NDA-associated SNPs were associated with BMI at p < 0.05 (Additional file 2: Table S6), suggesting that observed SNP-NDA associations were not due to residual confounding with BMI.
We captured the lead SNPs from 46 distinct loci that have previously been reported to associate with at least one MD phenotype. We investigated their associations (N = 63) with the corresponding MD phenotype based on our study (Additional file 2: Table S7). We were able to replicate 26 out of 28 DA SNPs, 13 out of 16 NDA SNPs, and 15 out of 19 PMD SNPs at p < 0.05. Among these, 10 DA SNPs, 2 NDA SNPs, and 5 PMD SNPs were found with genome-wide significance at p < 5 × 10 −8 .

Association of MD significant loci with breast cancer
We assessed whether the identified MD phenotypeassociated SNPs were also associated with breast cancer risk. Of the 28 lead SNPs, 13 were associated with overall breast cancer risk at genome-wide significance ( Table 2). In addition, one SNP (22q13.1) was significantly associated with ER-positive breast cancer (p = 5.6 × 10 −8 for overall breast cancer). For nine of these 14 SNPs, the direction of association was consistent with that expected (i.e., the same direction for DA, PMD and breast cancer risk, and opposite direction for NDA and breast cancer risk), while for five SNPs the direction was the opposite; some of these conflicting results have been observed previously [14]. An additional seven lead SNPs were associated with breast cancer risk at p < 0.05. We also assessed the associations between 205 independent genome-wide  [19] and MD phenotypes (Fig. 2, Additional file 2: Table S8) at p < 0.05, 63 (31%, 48 with consistent direction as expected) breast cancer variants were associated with DA, 36 (18%, 20 with opposite direction as expected) with NDA, and 62 (30%, 49 with consistent direction as expected) with PMD, respectively. In total, 92 (45%, 67 with expected direction) breast cancer variants were associated with at least one MD phenotype.
We further quantified the genetic correlation between MD phenotypes and breast cancer risk (Fig. 3). We estimated the local genetic correlation between MD phenotypes and overall breast cancer by partitioning the genome into 1,703 independent blocks. In total, we identified nine significant pairwise local genetic correlations between MD phenotypes and overall breast cancer (DA: 6q25.1, 10q21.2, 11p15.5, 12p11.2, 22q13.2; NDA: 8p11.23; PMD: 5q33.3, 6q25.1, 10q21.2) (Additional file 1: Figure S9). All nine regions harbored at least one genome-wide significant locus for a MD phenotype and were directionally consistent with the breast cancer association.

TWAS of MD phenotypes
Finally, we performed a TWAS investigating associations between the imputed expression of 7284 genes and MD phenotypes (Additional file 1: Figure S10, Additional file 2: Tables S10-S12), and identified significant associations with eight genes (Table 3). Six genes were either   located in (MTMR11, SMIM25, and TMEM184B) or within the 1 Mb of the GWAS loci (EP300, DES11, and MRPL23-AS1). The imputed expression of two additional genes was associated with MD phenotypes, including SHOX2 (positively associated with NDA) and CRISPLD2 (negatively associated with PMD). We replicated our TWAS findings using individual-level data for 24,158 women from an independent GWAS to impute expression for seven of the identified genes with available models in PredictDB [17,33]. Five of the seven genes, except EP300 and DESI1, were replicated at p < 0.05 with a consistent direction of effect (Additional file 2: Table S13).

Discussion
We conducted a GWAS for three MD phenotypes in 27,900 European ancestry women. We identified 28 distinct loci that were associated with at least one MD phenotype at genome-wide significance. Nine of these have not previously been reported to be associated with mammographic density. In addition, 14 of the 28 loci were also associated with breast cancer risk at genome-wide significance. We quantified the genetic correlation between MD phenotypes and breast cancer, further establishing the shared genetic basis between MD phenotypes and breast cancer risk. Finally, we conducted a TWAS and identified two additional novel associations between imputed expression level and MD phenotypes. Previous GWAS based on data from MODE/BCAC identified 12 MD loci [12][13][14][15][16] and a recent GWAS of MD based on 24,192 women further discovered 31 novel loci [17]. In addition, GWAS investigating volumetric MD revealed one novel locus for percent dense volume (HABP2 at 10q25.3) and two loci for absolute dense volume (INHBB at 2q14.2, LINC01483 at 17q24.3) [36]. Previous studies support the association for DA lead SNP rs150249911, which is an intronic variant of the ITGA1 gene at 5q11.2. ITGA1-coded integrin α1 protein upregulated following the expression of estrogen receptor β, a marker of breast cancer [37]. SNP rs413472 (5q14.1) is located in the SSBP2 gene which has previously been implicated in breast cancer (p = 4.00 × 10 −5 ) in Indonesian women [38]. DA lead SNP rs10155920 is in a long non-coding RNA located downstream of the EGFR gene at chromosome 7p11.2. EGFR is one of the most wellstudied signaling pathways that contributes to the invasion, dissemination, and metastasis of breast tumors [39]. Breast cancer fine-mapping analysis has identified DA lead SNP rs7297051 as one of the four independent association signals of breast cancer at chromosome 12p11 [40], which is approximately 50 kb upstream of the PTHLH gene. PTHLH encodes parathyroid hormonerelated protein (PTHrP), which aids in normal mammary gland development [41]. PTHrP has also been related to the prognosis of breast cancer, as its occasional secretion by tumor cells may promote osteoclastic activity and contribute to osteolytic bone metastases [42]. PMD lead SNP rs11646715 is in the FTO gene at chromosome 16q12.2 which encodes an mRNA demethylase and is well-known for its association with fat mass and obesity. Research has indicated that the FTO gene may play a role in cellular sensing of macronutrients and may be involved in the regulation of cell growth, which can at least partly explain its relationship with both obesity and breast cancer [43]. As we adjusted for BMI in our GWAS model, and further, rs11646715 was not associated with BMI in our data, the mechanisms underlying the associations between genetic variation in this region and MD likely differ from its effect on adiposity. Future studies are essential to elucidate potential biological mechanisms that link these genes to MD and ultimately breast cancer susceptibility. The novel DA locus at 8q24 has previously been associated with multiple types of cancer, including breast cancer [44]. To rule out that our finding was an artifact due to oversampling of breast cancer cases, we assessed the association with DA using controls only and observed a significant association (p = 5.64 × 10 −4 ). Interestingly, rs58847541 is associated with breast cancer but in the opposite direction to the effect on DA.
Thirteen of the 28 GWAS loci were also associated with overall breast cancer risk with genome-wide significance, and had little difference in effect by cancer subtype. Among these, we observed multiple unexpected inconsistency in the direction of associations between MD-associated loci and breast cancer risk, including DA and PMD loci 1q21.2 and 2q14.2, DA loci 8q24.13 and 22q13.2, and NDA and PMD locus 8q11.23. The underlying biological mechanisms driving these discrepancies are unclear, but one potential explanation is that these loci may be involved in multiple pathways across life stages, which differentially affect breast development and the risk of breast cancer. Furthermore, the MD phenotypes we studied were radiologic reflection of the underlying breast tissue composition, which made it difficult to distinguish the epithelium from stroma tissue of the breast. We also investigated the association between 205 independent breast cancer SNPs and MD phenotypes and found that 45% of the variants were associated with at least one MD phenotype at p < 0.05. The local genetic correlation analysis in our study highlighted specific loci at which MD phenotypes and breast cancer showed evidence of shared heritability (DA: ESR1, ZNF365, LSP1, and MKL1; NDA: 8p11.23; PMD: ESR1 and ZNF365). These observations reinforce the strong shared genetic basis between mammographic density and breast cancer.
The SNP heritability (h 2 SNP ), which can be interpreted as the proportion of phenotypic variance explained by the additive effects of all genotyped variants, was estimated to be 0.32 for DA, 0.24 for NDA, and 0.27 for PMD. Our estimates were slightly lower than those previously reported [17,36], perhaps due to differences in the study populations or methodology [45]. Twin studies have estimated the heritability of the three MD phenotypes to all exceed 60% [9,10]. The difference between the SNPheritability estimates in our analysis and the estimates from twin studies may reflect the effect of rare variants not being genotyped and not in LD with any genotyped variants, or may be due to non-additive genetic effects, interactions between genetic variants and environmental factors, or uncontrolled shared environmental factors in the twin studies.
Our TWAS identified eight genes for which imputed expression levels were significantly associated with MD phenotypes. Six of these were either located in or close to the identified GWAS loci, suggesting the observed  genotype-phenotype association may be mediated through gene expression. Two additional genes SHOX2 and CRISPLD2 were associated with NDA and PMD, respectively, and replicated in an independent study. Future studies are thus needed to elucidate the biological bases of these findings.
Our study has several strengths. It is the largest GWAS of mammographic density to date, enabling us to discover nine novel MD loci. We performed sensitivity analyses using controls, which reaffirmed that all significant associations were not spurious artifacts due to oversampling of cases. However, a few studies included in our study used the thresholding approach other than CUMU-LUS, which may cause inconsistency in the measurement of MD phenotypes and thus lead to biased results. Also, although previous studies have demonstrated that the MD measurement collected by CUMULUS was highly reproducible [8,46], it is important to acknowledge that it was a reader-dependent approach and thus might inevitably be subjective to measurement error. Another weakness with our study is the lack of diversity, as our study sample only included women of European ancestry. Considering that the risk of breast cancer attributable to mammographic density may differ among racial/ethnic groups [47], future efforts should be made to collect mammogram and genotype data from racially diverse populations.

Conclusion
In this study, we conducted a GWAS and TWAS of MD phenotypes using 27,900 women of European ancestry. Our study improved our understanding about the genetic background of MD phenotypes, and reinforced the evidence of their shared genetic basis with breast cancer risk.

Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Declarations Ethics approval and consent to participate
The use of data has been approved by each participating studies. See Supplementary Information. Table 3 Genes with significant association between genetically predicted gene expression and MD phenotypes (DA, NDA, PMD), based on transcriptome-wide association study (TWAS) 1