Identification of Novel Common Breast Cancer Risk Variants in Latinas at the 6q25 Locus

Background: Breast cancer is a partially heritable trait and over 180 common genetic variants have been associated with breast cancer in genome wide association studies (GWAS). We have previously performed breast cancer GWAS in Latinas and identified a strongly protective single nucleotide polymorphism (SNP) at 6q25 with the protective minor allele originating from Indigenous American ancestry. Here we report on additional GWAS and replication in Latinas. Methods: We performed GWAS in 2385 cases and 7342 controls who were either U.S. Latinas or Mexican women. We replicated 2412 cases and 1620 controls of U.S Latina, Mexican, and Colombian women. In addition, we replicated the top novel variants in study of African American and African women and in one study of Chinese women. In each dataset we used logistic regression models to test the association between SNPs and breast cancer risk and corrected for genetic ancestry using either principal components or genetic ancestry inferred from ancestry informative markers using a model based approach. Results: We identified 3 SNPs (p=1.9×10-8 - 2.8×10-8) at 6q25 locus not in linkage disequilibrium (LD) with variants previously reported at this locus. These SNPs were in high LD with each other, with the top SNP, rs3778609, associated with breast cancer with an odds ratio (OR) and 95% confidence interval (95% CI) of 0.75 (0.68-0.83). In a replication in women of Latin American origin, we also observed a consistent effect (OR: 0.88; 95% CI: 0.78-0.99; p=0.037). Since the minor allele was common in East Asians and African American but not European ancestry populations, we replicated in a meta-analysis of those populations and also observed a consistent effect (OR 0.94; 95% CI: 0.91 – 0.97; p=0.013). Conclusion: The effect size of this variant is relatively large compared to other common variants associated with breast cancer and adds to evidence about the importance of the 6q25 locus for breast cancer susceptibility. Our finding also highlights the utility of performing additional searches for genetic variants for breast cancer in non-European populations.

Hispanic/Latino populations are the second largest ethnic group in the U.S. [32] and yet have been understudied in genome wide association studies [33]. Latinos are a population of mixed ancestry with European, Indigenous American and African ancestral contributions [34][35][36][37].
Since there are no large studies of breast cancer in Indigenous American populations, studies in Latinos may identify novel variants associated breast cancer unique to or substantially more common in this population. We have previously used an admixture mapping approach to search for breast cancer susceptibility loci in Latinas and identified a large region at 6q25 where Indigenous American ancestry was associated with decreased risk of breast cancer [22].
Subsequently, we identified a SNP (rs140068132) that was common (minor allele frequency ~0.1) only in Latinas with Indigenous American ancestry and was associated with substantially lower risk of breast cancer, particularly estrogen receptor (ER) negative breast cancer and with lower mammographic density [23]. However, the variant we identified did not completely explain the risk associated with locus specific ancestry at 6q25 in Latinas, suggesting that other variants may account for this risk. We set out to fine-map and identify additional variants at 6q25 associated with breast cancer risk among Latinas.

Methods:
Populations: controls. Samples from this study were used as part of the initial discovery set. GALA1: GALA1 is a family-based study (including children with asthma and their parents) of pediatric asthma in Latino Americans. We included 112 females of self-reported Mexican origin from the GALA1 study to our set of population controls. The individuals are between 11 and 42 years of age (85% are older than 20 years). Samples from this study were used as part of the initial discovery set.
Breast Cancer Family Registry (BCFR): The BCFR is an international, National Cancer Institute (NCI)-funded family study that has recruited and followed over 13,000 breast cancer families and breast cancer cases with strong likelihood of genetic contribution to disease45. The present study includes samples from the population-based Northern California site of the BCFR. Cases aged 18-64 years diagnosed from 1995 to 2007 were ascertained through the Greater Bay Area Cancer Registry. Cases with indicators of increased genetic susceptibility (diagnosis at the age of <35 years, bilateral breast cancer with the first diagnosis at the age of <50 years, a personal history of ovarian or childhood cancer and a family history of breast or ovarian cancer in first-degree relatives) were oversampled. Cases not meeting these criteria were randomly sampled46. Population controls were identified through random-digit dialing and frequency-matched on 5-year age groups to cases diagnosed from 1995 to 1998. We included 641 cases and 61 controls who self-identified as Latina or Hispanic from this study. Samples from this study were used as part of the initial discovery set.
Multiethnic Cohort (MEC): The MEC is a large prospective cohort study in California (mainly Los Angeles County) and Hawaii. The breast cancer study is a nested case-control study including women with invasive breast cancer diagnosed at the age of >45 years and controls matched on age (within 5 years) and self-identified ethnicity47. For the current study, we used data and genetic data from 546 Latina women with breast cancer and 558 matched Latina controls. We also included an additional 1,941 controls who self-identified as Hispanic/Latino from this study (935 of these controls are men) selected as part of a GWAS of type 2 diabetes [38]. Samples from this study were used as part of the initial discovery set.
Research Project on Genes Environment and Health (RPGEH): The RPGEH is a large cohort study of over 100,000 men and women of all racial/ethnic groups who are members of the Kaiser Permanente Health Plan (additional recruitment criteria?). This analysis focuses only on women who are of self-reported Latina/Hispanic ethnicity (N=3801). We included both incident and prevalent cases (total N=225) in our analyses. We identified 44 women who were also included the SFBCS. The genetic data from these participants were included as part of the RPGEH since we considered the Affymetrix Lat array as a more comprehensive array.
Samples from this study were used as part of the initial discovery set. Replication genotyping: The CAMA samples which were not included in the GWAS and the CCGRN samples, were genotyped using Taqman probes for rs3778609 . The CAMA samples included 106 ancestry informative markers from genotyped on a Sequenome platform as previously described [41]. CCGRN samples included 100 ancestry informative markers that were included as part of a sequencing project. The sequence data were aligned to Hg37 using Burrows-Wheeler Alignment and genotype calls were made using Haplotypecaller which is part of the GATK platform.

Analysis:
Genotyping Quality Control and Imputation: Samples with >5% missing genotypes were removed from each dataset. We dropped variants with >5% missing data from each dataset.
Since excess homozygosity is more common in populations with substructure, particularly with ancestry informative markers, we did not use deviation from Hardy-Weinberg equilibrium as a criteria for excluding markers. All datasets were entered mapped to Hg19. Each dataset was then phased using SHAPEIT and imputed using the Haplotype Reference Consortium (HRC) with Minimac3 [45]. For the MEC datasets which included both 660K and 2.5M arrays, we used the overlapping SNPs (N=192,795) and imputed from those since we found that if we imputed them separately and then analyzed them together we got a large number of false positives. Each of the remaining GWAS datasets was submitted to the HRC server individually for imputation. Only variants with imputation quality scores of R2>0.5 were selected for additional analysis.
Genotype imputation for the ROOT consortium was conducted using the IMPUTE2 We used KING [43] to identify relative pairs either within the RPGEH cohort or between the RPGEH and SFBCS and/or NC-BCFR and performed the same analysis within the MEC and the CAMA study. We identified pairs of individuals with kinship coefficient >0.2 and dropped one from each of these pairs. If a relative pair included a case and control then we excluded the control. If a relative pair includes two cases or two controls we randomly dropped one of them. We dropped 127 individuals to eliminate all closely related individuals from the combined RPGEH, SFBCS and NC-BCFR.
Empirical Assessment of Imputation accuracy: We genotyped rs3778609, the top novel SNP, in the CAMA study in samples that also had GWAS data and checked the concordance between genotyped and imputed results. We found excellent concordance between the imputed and genotyped data with 1361/1369 (99.4%) concordance between the genotyped and imputed datasets.
Genetic Ancestry Inference We implemented principal component analysis to assess genetic ancestry in each of the discovery datasets in unrelated individuals. To do so, we first LD pruned typed SNPs with r 2 > 0.2 in PLINK. With the remaining data, we determined the principal components (PC) using EIGENSTRAT [44] within smartpca. For the replication datasets, we used ancestry informative markers and used the program ADMIXTURE [45] to calculate genetic ancestry, assuming a 3 population model with ancestry from African, European and Native American populations.
Association Testing: We performed single variant association testing using logistic regression models and adjusting for PC's 1-10 in PLINK [46]. For the replication datasets we entered ancestry into the model as covariates. We also performed association testing separately for estrogen receptor (ER)-positive and ER-negative breast cancer using this approach. In each analysis we also included study as a covariate.
To calculate linkage disequilibrium (LD), we calculated R 2 in the controls in our dataset using PLINK. We then performed conditional analyses by entering the most significant SNP in the model as a covariate in addition to PC's 1-10.
Power: Based on the sample size for discovery (2396 cases and 7468 controls) we had ~80% power to detect an odds ratio of 1.25, 1.355 and 1.475 with allele frequencies of 0.4, 0.2, and 0.1 respectively.

Results:
Individual Association Analyses: We conducted a meta-analysis across four GWAS discovery studies (Table 1) and identified 28 variants with genome-wide significant p-values at the 6q25 susceptibility region(Supplementary Table 1). No additional genome wide significant SNPs were identified. The top variants in the region included rs140068132 and rs147157845 which are in near perfect LD (r 2 =0.96) and which we have previously reported as genome-wide significant in this population [23]. Of the 28 top variants, 25 were in strong linkage disequilibrium (r 2 >0.4) with rs140068132, and 3 were in low LD (r 2 <0.2). These SNPs, rs3778609, rs7771984 and rs6914438 have a minor allele frequency of 0.19 and are in near perfect LD (r 2 =0.99). The minor allele of these SNPs are associated with lower risk of breast cancer, and the odds ratio (OR) for rs3778609 was 0.75 (95% CI: 0.68 -0.93, p=1.9x10 -8 ; Table 2). These SNPs are also independent (r 2 <0.2) of previously reported SNPs at this locus (Supplementary Table 2).
Another SNP, rs851983, was associated with a near genome-wide significant level of association (MAF=0.35; OR: 1.24, 95% CI: 1.25-1.34, p= 5.6x10 -8 Table 2). However, this SNP is in strong LD with SNPs that were previously reported (Supplementary Table 2). We performed conditional analyses by entering rs140068132 and other top SNPs at this locus in joint models. We found that rs3778609, rs7771984 and rs6914438 all remained nominally significant in joint models adjusting for rs140068132 (Table 2), although the adjusted odds ratios for these 3 SNPs were attenuated (OR~0.81-0.84; p<=0.001). We also found that rs851983 remained nominally significant in joint models with rs140068132 with mild attenuation.
When we included 3 SNPs that best represent each of the signals from each set of associated variants (rs140068132, rs3778609 and rs851983) in the same model all of the SNPs remained nominally significant with minimal attenuation of the odds ratios (compared to models including just pairs of variants; Table 2). Technical Validation and Replication: We used data from the portion of the CAMA study that did not have GWAS data, The COLUMBUS study and the CCGRN to replicate the association with rs3778609. We found a consistent direction in all 3 studies and a nominally significant association in a meta-analysis of the 3 studies (n= cases; n=controls; OR=0.88, 95% CI: 0.78-0.99, p=0.037, Table 3).  Table 3). Only rs851984 was significant in our study. However, nearly all of the others were directionally consistent and the 95% confidence intervals overlapped with the results from previous studies.

Association with Estrogen receptor subtypes:
We analyzed the association for each of the top SNPs separately and jointly by ER-status. As we have previously reported the minor (low risk) allele of rs140068132 is associated with a significantly lower odds ratio for ERnegative than for ER-positive breast cancer. We also found a significantly stronger effect size for ER-negative breast cancer for rs3778609. The effect size for rs851983 is also greater for ER-negative breast cancer; however, there is no significant difference between ER-negative and ER-positive breast cancer for this SNP.  table 4). We also examined rs3778610 which is in strong LD with these SNPs and had near genome-wide significant associations in our discovery sample (Supplementary Table 1). The strongest association in non-Latina populations was with rs3778610 which was particularly stronger in African ancestry populations (Supplementary table 4).

Discussion:
We have previously reported on a SNP at 6q25 associated with a minor allele that is unique to Indigenous American populations and associated with decreased risk of breast cancer [23]. Here, we investigate this locus in greater depth in an expanded sample size of Latina breast cancer cases and controls, the largest sample size of this population analyzed for breast cancer risk to date. We have identified several SNPs that are genome-wide significant and associated with breast cancer at this locus independently of other SNPs at this region previously reported by us and those previously reported by other groups at this locus. These SNPs are located in the region of ~152.13 -152.14 MB (Hg19). Replication in African American and Asian samples supports the association with this SNP. In addition, we have also shown that these novel variants at this locus have significantly stronger effect sizes on ER-negative breast cancer. Prior studies have also demonstrated a stronger effect size with ER-negative breast cancer for most variants at the 6q25 locus, consistent with our data [18,27].
Prior studies in other populations have reported a series of independent SNPs affecting breast cancer risk [11,18,[24][25][26][27]31]. The variants previously reported in other populations are not significant in our studies but most are directionally consistent. A combined fine-mapping and functional study of SNPs mapped in other populations at this locus found that they affect expression of ESR1, RMND1 and CCDC170 [27]. Since the new variants we report are common only in non-European ancestry populations, there is limited data to explore the potential effects of these variants on gene expression.
Our study is limited by sample size. Therefore, it is possible that we have missed other variants at this locus. In fact, several previously reported variants have odds ratios with point estimates that are close to the previously reported results, but have 95% confidence intervals that include 1, as would be expected with insufficient power. The effect size we observed in the replication dataset is substantially lower than in the discovery dataset, likely due to winner's curse. However, even if we take the replication odds ratio (0.88) as the closest to the true effect size of these SNPs, this is still a relatively large effect for a common variant. It is likely that there are other variants that have not yet been identified in European GWAS due to low allele frequency and that could be identified in Latinas where they are more common. Larger studies of Latina women are needed to identify these variants.

Conclusion
Our study demonstrates additional unique associations with variants at 6q25 and breast cancer risk. This further highlights the important contribution of this locus to breast cancer susceptibility, particularly ER-negative breast cancer susceptibility. Additional fine-mapping and functional studies are needed to elucidate all of the causal variants in our population.
However, the variants we identified in this study can be useful to add to the increasing pool of common variants coming from GWAS and will be particularly useful to risk stratify women of Latin American ancestry for breast cancer risk.

List of abbreviations:
GWAS This work was funded in part by Grants from the National Cancer Institute (K24CA169004, R01CA120120 to E Ziv and R01CA184545 to E Ziv and S Neuhausen). J Hoffman was supported by R25CA112355. The San Francisco Bay Area Breast Cancer Study was supported by the National Cancer Institute (CA63446 and CA77305), the US Department