A genome-wide association study identifies a breast cancer risk variant in ERBB4 at 2q34: results from the Seoul Breast Cancer Study

Introduction Although approximately 25 common genetic susceptibility loci have been identified to be independently associated with breast cancer risk through genome-wide association studies (GWAS), the genetic risk variants reported to date only explain a small fraction of the heritability of breast cancer. Furthermore, GWAS-identified loci were primarily identified in women of European descent. Methods To evaluate previously identified loci in Korean women and to identify additional novel breast cancer susceptibility variants, we conducted a three-stage GWAS that included 6,322 cases and 5,897 controls. Results In the validation study using Stage I of the 2,273 cases and 2,052 controls, seven GWAS-identified loci [5q11.2/MAP3K1 (rs889312 and rs16886165), 5p15.2/ROPN1L (rs1092913), 5q12/MRPS30 (rs7716600), 6q25.1/ESR1 (rs2046210 and rs3734802), 8q24.21 (rs1562430), 10q26.13/FGFR2 (rs10736303), and 16q12.1/TOX3 (rs4784227 and rs3803662)] were significantly associated with breast cancer risk in Korean women (Ptrend < 0.05). To identify additional genetic risk variants, we selected the most promising 17 SNPs in Stage I and replicated these SNPs in 2,052 cases and 2,169 controls (Stage II). Four SNPs were further evaluated in 1,997 cases and 1,676 controls (Stage III). SNP rs13393577 at chromosome 2q34, located in the Epidermal Growth Factor Receptor 4 (ERBB4) gene, showed a consistent association with breast cancer risk with combined odds ratios (95% CI) of 1.53 (1.37-1.70) (combined P for trend = 8.8 × 10-14). Conclusions This study shows that seven breast cancer susceptibility loci, which were previously identified in European and/or Chinese populations, could be directly replicated in Korean women. Furthermore, this study provides strong evidence implicating rs13393577 at 2q34 as a new risk variant for breast cancer.


Introduction
Breast cancer, one of the most common malignancies among women worldwide, is a complex polygenic disease in which genetic factors play a significant role in the disease etiology [1,2]. So far, genome-wide association studies (GWASs) have reported over 40 common low-penetrance variants in 25 loci that are associated with the breast cancer risk reported in the National Human Genome Research Institute catalog [3]. The most strongly and consistently associated single-nucleotide polymorphisms (SNPs) reside in intron 2 of the receptor tyrosine kinase FGFR2 (rs2981582) at 10q26. 13 and near the 5' end of the TOX3 gene at 16q12.1 (rs3803662) [4][5][6][7][8][9]. With the exception of three studies conducted among Asian women [10][11][12], all other previously published GWASs have been conducted primarily in women of European descent. Several studies, including our study, have investigated previously identified loci in European populations in other ethnic groups and validated the initial findings [13,14]. However, newly discovered loci initially identified in women of European descent tend to be weakly associated with breast cancer in women of Asian descent [10,11] or could not be confirmed in Asians because of the difference in linkage disequilibrium (LD) patterns between ethnic populations, suggesting that additional genetic variants for Asian women remain to be discovered.
In this study, we conducted a three-stage GWAS to identify common breast cancer susceptibility loci and to validate the previously reported loci by using Affymetrix Genome-Wide Human SNP Array 6.0 (Affymetrix, Inc., Santa Clara, CA, USA) with 2,273 patients with breast cancer from the Seoul Breast Cancer Study (SeBCS) and 2,052 healthy controls from a large urban cohort, the Korea Genome Epidemiology Study (KoGES), as stage I. By analyzing data from two replication stages that consisted of 4,049 cases and 3,845 controls, we found strong evidence for a new genetic variant that may be associated with breast cancer risk among Asian women.

Study population
A genome-wide association scan (stage I) was conducted with 2,385 patients with breast cancer from the SeBCS and 2,392 healthy controls from a large urban cohort that is participating in the KoGES (Supplementary methods in Additional file 1). For stage II, 2,052 cases were selected among the patients who were the participants in SeBCS but not included in stage I and 2,169 controls from another cohort recruited from two small cities with both urban and rural areas as part of KoGES. For stage III, 1,997 cases were selected from two independent breast cancer studies -the Korean Hereditary Breast Cancer (KOHBRA) study (n = 1,289) and the Yonsei Breast Cancer Study (n = 708) -and 1,676 controls were selected from health examinees from rural populations to study the risk factors for chronic diseases. Detailed descriptions of these participants are provided in Supplementary methods in Additional file 1, and descriptive statistics of the study subjects are shown in Table S2 of Additional file 2. The study protocols were approved by the institutional review boards of Seoul National University Hospital (institutional review board # H-0503-144-004) and each collaborating institute (Description of study participants in Additional file 1). Informed consent was obtained from all participants.

Genotyping
A GWAS (stage I) was performed by a single platform by using the Affymetrix Genome-Wide Human SNP Array 6.0 chip (Affymetrix, Inc.). In total, 4,777 samples were genotyped by using 500 ng of genomic DNA from peripheral blood. The Birdseed V2 algorithm was used to call the genotypes [15].
In total, 30 quality control (QC) samples were genotyped by using Affymetrix SNP Array 6.0. The average concordance rate between the QC samples was 99.8%. For internal validation of the Affymetrix SNP Array 6.0 platform, 12 SNPs were genotyped for all subjects by SNPstream UHT (12-plex, SNP-IT assay, Orchid Biosciences, Princeton, NJ). Samples of subjects that had a genotype call rate of below 95%, a high heterozygosity rate, or an incorrectly imputed gender were excluded. Calculated genome-wide average identity by state (IBS) between each pair of individuals was used to identify individuals who appeared to be in relationships with first-degree relatives or in relationships with more distant relatives whose clusters were tightly linked to the first-degree relatives. Pairwise IBS between individuals was calculated by using a subset of pruned markers (74,965 SNPs) that are in approximate linkage equilibrium. IBS analysis was performed by using the PLINK software package. Multidimensional scaling analyses based on pairwise IBS showed that, apart from some outliers, all subjects clustered closely with HapMap Asians ( Figure S2 of Additional file 3). Subjects with a cancer history and patients with diagnosed benign breast cancer were subsequently excluded. Finally, 4,325 individuals remained in the association analyses (Table S1 of  Additional file 2).
To ensure quality data for SNPs, SNPs were excluded if they met any of the following QC criteria: (a) deviation from the Hardy-Weinberg equilibrium P value of less than 10 -6 , (b) a genotype call rate of less than 95%, (c) a minor allele frequency (MAF) of less than 1%, (d) a poor cluster plot, (e) filtering out differential missingness between cases and controls (P < 10 -4 ), and (f) multiple positioning or mitochondrial SNPs or both. In total, 555,525 Affymetrix SNP Array 6.0 SNPs were used for the final association analyses (Table S1 of Additional file 2).
Genotyping for stages II and III was performed by using the 5' exonucleaseassay (TaqMan) employing the ABI Prism 7900HT Sequence Detection System (Applied Biosystems, Foster City, CA) in accordance with the instructions of the manufacturer. Primers and probes were supplied directly by Applied Biosystems (Foster City, CA, USA) as Assays-By-Design. For QC, about 2.2% of the samples were genotyped repeatedly. Only those SNPs that satisfied a concordance rate of greater than 99% in duplicates and a genotype success rate of greater than 99% were included in the subsequent replication phases.
Single-nucleotide polymorphism selection for validation study using stage I We selected the SNPs if they were implicated in previous GWASs and reported from the National Human Genome Research Institute catalog [16]. We included SNPs only if they had an assigned reference allele, a defined MAF, and an estimated odds ratio (OR) or a beta-coefficient and a 95% confidence interval (CI). For the SNPs that were not genotyped by using Affymetrix SNP Array 6.0 or successfully imputed (imputation QC r 2 was less than 0.3), we selected the best tagging SNPs on the basis of the LD metrics (r 2 and D'). Thus, we selected rs10736303 for the SNPs at 10q21.13 (rs1219648, rs2981572, rs2981585, and rs2981579) and rs10483813 for 14q24.1/RAD51L1 rs999737. We did not include the SNP rs614367 at 11q13.3/MYEOV, CCND1, ORAOV1, FGF19, FGF4, FGF3 since the MAF of rs614367 is very low in this study (MAF = 0.002) [9].

Single-nucleotide polymorphism selection for replication in stage II
After stage I, SNPs for replication were selected on the basis of the following criteria among the 555,525 SNPs that were directly genotyped and that had passed the QC procedure: SNPs (a) with an MAF of more than 5% for either cases or controls, (b) with very clear genotyping clusters, (c) with a P trend of per-allele OR of not more than 5 × 10 -4 , and (d) not in strong LD (r 2 < 0.5) with any of the GWAS-identified risk variants. Additionally, to select the SNPs resided in SNP cluster we selected the loci within which at least two SNPs had a P trend of per-allele OR of not more than 5 × 10 -3 . When multiple SNPs showed LD within 100 kb (r 2 > 0.2), the SNP with the lowest P trend was selected for replication.

Single-nucleotide polymorphism imputation
To infer the genotype of SNPs that were not observed in the Affymetrix Genome-Wide Human SNP Array 6.0 used in the present study, SNP imputation was carried out by using the hidden Markov model as implemented in MACH 1.0 [17,18]. Imputation was based on 555,525 autosomal SNPs that were genotyped in stage I and that had passed the QC procedure, and the phased CHB + JPT data from HapMap Phase II (release 22) were used as the reference panel, which consisted of over 2.4 million SNPs. In total, 2,210,823 SNPs showed an imputation quality score (r 2 ) of at least 0.30. The average r 2 of SNPs not found on the array but included in the validation study (15 SNPs in Table 1) is 0.97.

Bioinformatics
The LD metrics (r 2 and D') between SNPs were calculated by using Haploview version 4.2 software (Whitehead Institute, Cambridge, MA, USA) based on release 27, NCBI Build 36. Regional plots were drawn by using LocusZoom standalone version 1.1 [19] based on Hap-Map Phase II JPT+CHB for all SNPs in Figure S3 of Additional file 3 except for rs10736303, which was based on 1000 Genome August 2009 JTP+CHB.

Statistical analyses
Association on breast cancer risk was estimated by ORs and 95% CIs while assuming an additive model by logistic regression analysis adjusted for age. GWAS stage (stage 1) analyses were conducted primarily by using PLINK program version 1.06 [20] for directly genotyped SNPs, and the MACH2dat program [17,18] was used for imputed SNPs. For the comparison of the observed and expected distribution of test statistics, quantile-quantile analysis of 2-degrees-of-freedom logistic regression statistics was applied. The genomic inflation factor lamda (λ) was calculated as 1.043, suggesting that the population substructure, if any, should not have a substantial effect on result ( Figure S1 of Additional file 3).
To control the risk of false discoveries, the multiplecomparison-adjusted P values for stage II, stage III, and combined analysis were calculated by the Benjamini-Hochberg false discovery rate method [21]. Cochran's Q statistic to test for heterogeneity and the I 2 statistic to quantify the proportion of the total variation due to heterogeneity across all stages were calculated. Combined statistics were estimated from meta-analysis while assuming a fixed-effects model since there was no evidence of heterogeneity. To assess the relationship according to estrogen receptor (ER) or progesterone receptor (PR) status or both, ORs and 95% CIs were estimated after stratification by hormone receptor status. Case-only P value was used to test for heterogeneity and

Validation of previously identified breast cancer susceptibility loci in Korean women
We found that multiple genomic locations were potentially associated with the risk of breast cancer ( Figure  1). Table 1 presents the test of associations between the previously reported loci and breast cancer risk in 2,257 cases and 2,052 controls. Among the 27 SNPs reported in published GWASs, 10 SNPs located in seven loci showed significant associations with breast cancer risk (P trend < 0.05). Plots of regional LD and strength of association by chromosomal position for these seven loci of interest are shown in Figure S3 of Additional file 3.
smaller than that of previous reports. However, 14q24.1/ RAD51L1 rs10483813 (proxy of rs999737) showed no significant association with an OR of 1.21. Additionally, none of the remaining 17 SNPs was replicated in our study (P > 0.05), and the effect sizes were estimated to be less than 1.10, which is lower than that of the estimated ones in women of European descent. We further assessed these associations according to ER and PR status (Table S3 of Additional file 2). The 5q12/MRPS30 rs7716600 and rs4415084 exhibited a stronger association with ER + than with ERtumors (P heterogeneity = 0.02 and P heterogeneity = 0.05, respectively). Although no overall associations were found for 2q35 rs13387042 and 10p15.1/ANKRD16, FBXO18 rs2380205, stratified analysis revealed a statistically significant association with ER + (P trend = 0.03 and P trend = 0.04) or PR + (P trend = 0.05 and P trend = 0.05) tumors but not with ERor PRtumors. The test for heterogeneity was significant only for 10p15.1/ANKRD16, FBXO18 rs2380205 (P heterogeneity ER+ vs. ER-= 0.030 and P heterogeneity PR+ vs. PR-= 0.040). Additionally, two SNPs at 6q25.1/ESR1 (rs2046210 and rs3784805) exhibited a stronger association with ERthan with ER + tumors, although the differences were not statistically significant. There were no differences in associations by ER or PR status for the remaining SNPs.

Analysis of 2q34 rs13393577 and breast cancer risk
To search for additional independent genetic risk variants in Korean women, we selected the most suggestive 17 SNPs from stage I and genotyped in an independent set of 2,052 cases and 2,169 controls (stage II). Among the 17 SNPs evaluated in stage II, only one SNP, rs13393577 at 2q34, was significantly associated with breast cancer risk. The estimated ORs for rs13393577 were as follows: OR heterozygote = 1.39 (95% CI = 1.16 to 1.67; P = 2.05 × 10 -5 ), OR homozygote = 5.57 (95% CI = 2.26 to 13.7; P = 4.10 × 10 -3 ), and OR per-allele = 1.51 (95% CI = 1.29 to 1.78; P trend = 1.1 × 10 -5 ). SNP rs13393577 and three more SNPs (3q26.32 rs3806685, 6q25.1 rs9498283, and 17q24.3 rs11077488) showing marginally significant associations in stage II (P trend < 0.10) were further evaluated in stage III, which included 1,997 cases and 1,676 controls. Again, rs13393577 showed a significant association with breast cancer risk ( Table 2). The P trend reached 8.8 × 10 -14 in the combined analysis (Figure 2). The SNP rs3806685 at 3q26.32 was also associated with breast cancer risk with an OR per-allele of 1.18 (95% CI = 1.04 to 1.34; P trend = 1.8 × 10 -2 ) in stage III; however, the combined OR was not significant (P trend = 5.9 × 10 -1 ). The other two SNPs showed no significant association in stage III or in the combined analysis (P > 0.05). The test for heterogeneity suggested no difference in the genetic effects across ER or PR status for rs13393577 (data not shown). To capture additional signals for rs13393577, we investigated the SNPs nearby rs13393577 (Figure 3). Among the directly genotyped SNPs, two SNPs are in strong LD (r 2 > 0.8) with rs13393577, and the smallest P values were 1.2 × 10 -3 for rs6756468 (r 2 = 1.00) and 2.6 × 10 -2 for rs16848753 (r 2 = 0.81). SNP rs6756468 is in tight LD with rs13393577 located in the LD block containing the mir-548f-2 gene. Several imputed SNPs in high LD with rs13393577 were nominally associated with breast cancer risk (P < 0.05) (Table S5 of Additional file 2).

Discussion
In the present study, we conducted a three-stage GWAS in Korean women (6,322 cases and 5,897 controls). We not only confirmed previously identified loci in Europeans or Chinese populations or both but also found rs13393577 at 2q34/ERBB4 as a new breast cancer susceptibility variant in Korean women.
In the validation study, we evaluated whether 27 SNPs in the 20 GWAS-identified loci were also relevant in our population using stage I and identified that 10 SNPs at seven loci were significantly associated with breast cancer risk. As anticipated, the strongest and the most significant results were observed in rs2046210 at 6q25.1/ ESR1 and rs4784227 at 16q12.1/TOX3, and these results are slightly similar to those of the magnitude and direction of previous reports conducted in a Chinese population [10,11]. For the SNPs rs2048671 at 7q32.3/NR and rs10822013 at 10q21.2/ZNF365, which were also identified in Asians, we recently reported significant associations with breast cancer risk through multi-stage GWAS with a cumulative sample size up to over 34,000 East Asian subjects (OR per-allele = 1.10; 95% CI = 1.07 to 1.14; P trend = 5.87 × 10 -9 and OR = 1.08; 95% CI = 1.04 to 1.11; P trend = 6.21 × 10 -6 ) [12]. However, the associations of these SNPs were not significant in this study, possibly because of its limited power.
In addition, we could not evaluate the SNPs (rs1219648, rs2981572, rs2981585, and rs2981579) previously identified within intron 2 of FGFR2 at 10q21.13, because they were not genotyped or successfully imputed (imputation QC r 2 < 0.3). Thus, we selected rs10736303 as the best tagging SNP capturing 10q26.13/ FGFR2 since it is in high LD with the reported SNPs with pairwise r 2 values of 0.67 for rs2981579 (r 2 = 0.48 in CHB+JTP; r 2 = 0.74 in CEU), 0.57 for rs2981575 (r 2 = 0.36 in CHB+JPT; r 2 = 0.72 in CEU), and 0.53 for rs1219648 (r 2 = 0.29 in CHB+JPT; r 2 = 0.72 in CEU) on the basis of our data. Furthermore, the rs10736393 is located at intron 2 of FGFR2 within the sequences conserved across all placental mammals and suggested to be a functional variant to regulate FGFR expression by generating a putative ER-binding site [5]. In the present study, the rs10736393 G allele was significantly associated with increased breast cancer risk with an effect size of rs2981579 that was the same as in a previous report [8]. However, the recently added SNP, rs10510102, located in the 300-kb telomeric region of intron 2 of FGFR2 but not with a genome-wide significance level (P = 1.6 × 10 -6 ), was not replicated in the present study [24]. Subgroup analysis revealed that some of the validated associations differed by ER or PR status. Recent studies showed stronger associations with ER + than with ERtumors for several loci -rs13387042(2q35), rs4973768 (3p24), rs889312 (5q11.2/MAP3K1), rs7716600 (5q12/ MRP30), rs13281615 (8q24), rs1219648 and rs2981582 (10q26.13/FGFR2), and rs3803662 (16q12) -and with PR + than with PRtumors for rs2981582 (10q26.13/ FGFR2) [6,25,26]. Among these loci, rs13387042 (2q35), rs7716600, and rs4415084 (5q12/MRP30) showed significantly different associations by ER status, and rs4973768 (3p24) also showed a stronger association with ER + tumors, although the test for heterogeneity was not significant. The association of rs2380205 (10p15.1/ANKRD16, FBXO18) with breast cancer risk was also stronger for ER + or PR + tumors than with negative tumors, and this heterogeneity in association remains to be evaluated in other populations.
The stronger association of rs2046210 at 6q25.1/ESR1 with ERthan with ER + tumors has been well documented [10]. In the present study, we could also observe this heterogeneity for rs2046210 and its nearby SNP rs3784805, although the differences were not statistically significant. Direct replication in some of the loci showing significant differences in associations according to ER or PR status provides further support for the hypothesis that intrinsic subtypes of breast cancer should have different etiologic pathways; thus, the polygenic component of these subtypes of breast cancer should be different [27].
There are several potential reasons for the failure of validation for previously identified loci in women of European descent. First, several risk variants could escape detection because of the limited statistical power caused by either low allele frequency or a very small effect size of the initial findings. There are several SNPs of which the allele frequencies in Koreans are substantially lower than in Europeans: SNP rs11249433 at 1p11.2/NOTCH2, FCGR1B (4% versus 39%), rs1011970 at 9p21.3/CDKN2A, CDKN2B (7% versus 17%), rs865686 at 9q31.2/KLF4, RAD23B, ACTL7A (7% versus 24%), rs10995190 at 10q21.2/ZNF365 (2% versus 15%), and rs10483813, proxy of rs999737, at 14q24.1/RAD51L (3% versus 24%). Thus, we have only 8% to 30% of the statistical power to detect the reported effect sizes of 1.06 to 1.16 for these SNPs with the current sample size. We could not exclude the possibility that the effect size of the original reports could be represented as exaggerated ORs caused by 'winner's curse'. Second, a difference in underlying genomic structure between ethnicities could produce the bias to cover SNPs tagging the causal variants, although the reported SNPs could work effectively in women of European descent. Another possibility is that some of the variants evaluated may not be strongly associated with breast cancer risk in Asian women such as shown in the null association of rs2180341 (6q22.33). For rs3180341, we had a statistical power of 80% to detect an OR as small as 1.15; furthermore, the lack of an association has been shown in a study conducted in a Chinese population [13]. Moreover, the risk profiles of genetic variants could be manifested differently in different ethnic populations, assuming that the relative contribution of the risk variants to carcinogenic pathways of breast cancer varies between different populations. Finally, the interactions of environmental exposures, lifestyle, or other effect modifiers and even the difference in breast cancer prevalence could have an effect on the penetrance of these alleles.
The ERBB4, harboring rs13393577 in the first intron at 2q34, is a member of the epidermal growth factor (EGF/ERBB) family of receptor tyrosine kinases, which are key activators of signaling pathways involved in cell division, migration, adhesion, differentiation, and apoptosis [28]. It is reported that ERBB4 is frequently overexpressed in breast cancer, and the expression of transcripts encoding the cleavable ERBB4 isoforms was associated with ER expression and a high histological grade of differentiation [29]. Rokavec and colleagues [30] identified the presence of five germ-line variants in the ERBB4 5'-untranslated region and reported that one of these variants (ERBB4 -782T > G) was associated with breast cancer risk from the different promoter activity according to the different allele. However, rs13393577 is not in LD with ERBB4 -782T > G; thus, the potential influence of rs13393577 is unlikely to be mediated through this previously reported variant.
We conducted an in silico functional analysis to assess the potential biological function of rs13393577. The rs13393577 C allele had no predicted binding site, whereas several transcription factors were predicted to bind the rs13393577 T allele implementing six highscoring binding sites (maximum score = 92.7 points; minimum score = 85.9 points) [31]. In agreement with this, FASTSNP scored rs13393577 as 1-2 (intronic enhancer) [32]. Additionally, Murabito and colleagues [33] have shown that three SNPs in ERBB4 (rs905883, rs7564590, and rs7558615) were associated with breast cancer risk in a family-based GWAS that included 58 breast cancer cases, although no association was attained with genome-wide significance level. Among these variants, rs7564590 is in moderate LD with rs13393577 (r 2 = 0.44 in CHB+JPT and r 2 = 0.25 in CEU) whereas the other two SNPs (rs905883 and rs755861515) are in very weak LD with rs13393577 (all r 2 < 0.02 in CHB+JPT and CEU). Thus, if both rs13393577 and rs7564590 are not themselves functional, they might be in high LD with the true causal variants. Additionally, we could not exclude the possibility that the strong association shown in rs13393577 is related to the function of the mir-548f-2 gene harboring SNP rs6956468, which is in tight LD with rs13393577.

Conclusions
In summary, we have confirmed 10 SNPs in seven loci of breast cancer risk which were initially identified in European or Chinese populations or both and provided additional evidence confirming the heterogeneity in the risk of different tumor subtypes for common breast cancer susceptibility variants. Moreover, we identified rs13393577 in ERBB4 located at 2q34 as a new breast cancer susceptibility variant. Future studies, including fine mapping, functional assay, and a replication study with large sample sizes from diverse ethnic populations, are needed to validate our results.

Additional material
Additional file 1: Supplementary Methods. Description of study participants. Genotyping and quality control procedures.