Plasma metabolomics profiles and breast cancer risk



Breast cancer (BC) is the most common cancer in women and incidence rates are increasing; metabolomics may be a promising approach for identifying the drivers of the increasing trends that cannot be explained by changes in known BC risk factors.


We conducted a nested case–control study (median follow-up 6.3 years) within the New York site of the Breast Cancer Family Registry (BCFR) (n = 40 cases and 70 age-matched controls). We performed a metabolome-wide association study using untargeted metabolomics, coupling hydrophilic interaction liquid chromatography (HILIC) and C18 chromatography with high-resolution mass spectrometry (LC-HRMS), to identify BC-related metabolic features.


We found eight metabolic features associated with BC risk. For the four metabolites negatively associated with risk, the adjusted odds ratios (ORs) ranged from 0.31 (95% confidence interval (CI): 0.14, 0.66) (L-Histidine) to 0.65 (95% CI: 0.43, 0.98) (N-Acetylgalactosamine), and for the four metabolites positively associated with risk, ORs ranged from 1.61 (95% CI: 1.04, 2.51) (m/z: 101.5813, RT: 90.4 s, 1,3-dibutyl-1-nitrosourea, a potential carcinogen) to 2.20 (95% CI: 1.15, 4.23) (11-cis-Eicosenoic acid). These results were no longer statistically significant after adjusting for multiple comparisons. Adding the BC-related metabolic features to a model that included age and the Breast and Ovarian Analysis of Disease Incidence and Carrier Estimation Algorithm (BOADICEA) risk score improved the accuracy of BC prediction from an area under the curve (AUC) of 66% to 83%.


If replicated in larger prospective cohorts, these findings offer promising new ways to identify exposures related to BC and improve BC risk prediction.


Breast cancer (BC), the most common cancer and the leading cause of cancer death in women worldwide [1], is increasing in incidence over time, and established risk factors cannot account for this increase [2]. The metabolic phenotype is the metabolite profile of an individual, which is influenced by genetic and environmental factors [3]. Characterization of metabolic processes may provide new insights into risk factors for breast carcinogenesis [4]. With the rapid evolution of high-resolution mass spectrometry (HRMS) technologies over the past decade, metabolomics has become a promising approach to gaining comprehensive insight into the etiological pathways leading to BC [4,5,6]. The small-molecule profile of blood obtained by untargeted metabolomics provides an integrated readout of the body's chemical burden and its endogenous metabolic response.

At least 10 prospective metabolomic studies of BC risk using pre-diagnostic plasma (n = 88–1691 cases), with mean follow-up ranging from 4 to 21 years, have been carried out [7,8,9,10,11,12,13,14,15,16]. Most prior studies [7, 8, 11, 12, 15, 16], however, focused on postmenopausal women, and none focused on women at high risk due to their family history. These studies reported that metabolites such as sex steroid-related metabolites, glycerolipids, and cholesteryl esters were altered several years prior to BC diagnosis, suggesting that metabolomics is potentially a powerful approach to identify metabolomic biomarkers that are altered during BC development and before clinical symptoms appear. In addition, studies found that several BC-associated metabolic features were correlated with diet [15] or body mass index (BMI) [16]. For example, a nested case–control study identified 113 nutritional metabolites and found that 3 metabolic features, including saturated fatty acids (from fats/oils), vitamin E derivatives (from desserts or vitamin supplements), and androgens (from alcohol), were associated with BC, with odds ratios (ORs) ranging from 0.6 to 2.2 [15]. These studies highlighted the associations between baseline plasma metabolomic signatures and BC risk and suggested potential metabolic pathways as a promising avenue for discovering therapeutic targets for prevention.

Women with a family history of BC are two to four times more likely to develop the disease than women with no family history [17]. BC risk associated with family history varies with the age of the individual, the number of affected relatives, and the age at which the relatives were diagnosed with BC [18, 19]. Our prior study estimated lifetime risk based on the Breast and Ovarian Analysis of Disease Incidence and Carrier Estimation Algorithm (BOADICEA) for women enrolled in the Breast Cancer Family Registry (BCFR) and found substantial variation in absolute risk among participants [17]. Therefore, the BCFR is a unique cohort in which to identify biomarkers for women across the risk continuum. The goal of this pilot study, which employed a prospective nested case–control design, was to interrogate the relationship between metabolomic features and breast cancer risk in pre-diagnostic plasma of women enrolled in the New York site of the BCFR, a registry of individuals within families with breast and/or ovarian cancer [17, 20].

Materials and methods

Study design

We conducted a prospective study among women unaffected with BC at enrollment within the New York site of the BCFR (for details see [21]). At recruitment, eligible participants completed a questionnaire that included information on demographics, lifestyle, environmental factors, and family history of cancer [20]. All BCFR participants were asked to provide a 30 ml blood sample at baseline recruitment. Biospecimens were processed according to a common standardized protocol and stored at −80 °C until metabolomic analysis.

We actively follow participants for subsequent information on cancer incidence and vital status and attempt to verify cancers through pathology reviews and medical records. For the present nested case–control study, we analyzed data for 40 prospectively ascertained BC cases and 70 age-matched (± 5 years) controls. Of these 40 cases, 17 were diagnosed with BC within 5 years, 17 between 5 and 10 years, and six more than 10 years after blood draw. This study was approved by Columbia University’s Institutional Review Board. All methods were performed in accordance with the relevant guidelines and regulations.

Liquid chromatography-high resolution mass spectrometry (LC-HRMS) analysis

To interrogate circulating metabolic differences, we conducted global metabolomics of blood plasma samples using a liquid chromatography-high resolution mass spectrometry (LC-HRMS)-based metabolomics workflow [22]. For sample pretreatment, blood plasma samples were thawed on ice; 50 µL was aliquoted and extracted with 100 µL ice-cold acetonitrile (ACN) pre-spiked with the internal standard mix (final ACN:sample, 2:1, v/v). After centrifugation, supernatants were collected, of which 10 µL was injected for LC-MS analysis. The analytes were chromatographically separated, ionized and analyzed on a Thermo Fisher Scientific Vanquish dual chromatograph coupled to a high-resolution accurate-mass (HRAM) quadrupole-Orbitrap Q-Exactive HF-X mass spectrometer (Waltham, MA, USA) under two complementary modes: hydrophilic interaction liquid chromatography-electrospray ionization mass spectrometry in positive ion mode (HILIC) and C18-electrospray ionization mass spectrometry in negative ion mode (C18). The HILIC column uses a polar stationary phase that retains polar species well (e.g., primary metabolites including many organic acids and amino acids), while the C18 column uses a nonpolar stationary phase that separates less polar species well; using both columns allows us to cover a broad range of metabolites. For both modes, we operated the instrument in full scan mode at 120,000 mass resolution (full width at half maximum, FWHM), scanning a mass-to-charge (m/z) range of 85–1,275. For quality assurance and quality control (QA/QC), extracts of NIST1953 plasma (Gaithersburg, MD, USA) and BioIVT plasma (New Cassel, NY, USA) were injected intermittently with sample extracts. Other QA/QC procedures included timely mass calibration, sample randomization, blinding of technicians to the case–control status of samples, method blanks, and triplicate sample injection. We further performed stringent post-acquisition data procedures, including triplicate sample filtering (keeping only features detected in at least 2 of 3 injections), replicate median summarization, and ComBat correction (accounting for batch effects). The resultant dataset after QA/QC reached a CV of 7.92% of total ion chromatogram (TIC) intensity using all features for QC samples, and the averaged pairwise Pearson correlation within QC samples had a mean of 0.95 and %CV range of 0.78–1.
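The replicate handling described above can be illustrated with a minimal Python sketch. It assumes that a non-detect is recorded as None or 0 and that the median is taken over detected injections only; neither detail is stated in the text, and the function names and intensities are hypothetical.

```python
from statistics import mean, median, stdev


def summarize_replicates(replicates, min_detected=2):
    """Collapse replicate injections of one feature in one sample.

    `replicates` holds one intensity per injection; None or 0 marks a
    non-detect (an assumption). Returns the median of the detected values,
    or None when fewer than `min_detected` injections detected the feature.
    """
    detected = [x for x in replicates if x is not None and x > 0]
    if len(detected) < min_detected:
        return None
    return median(detected)


def percent_cv(values):
    """Coefficient of variation in percent: 100 * sample SD / mean."""
    return 100.0 * stdev(values) / mean(values)


# Hypothetical intensities for triplicate injections
kept = summarize_replicates([1200.0, 1100.0, None])   # detected in 2 of 3
dropped = summarize_replicates([900.0, None, None])   # detected in 1 of 3
qc_cv = percent_cv([100.0, 110.0, 90.0])              # e.g. TIC across QC runs
print(kept, dropped, round(qc_cv, 2))
```

The same `percent_cv` helper could be applied to the total ion chromatogram intensities of the QC injections to obtain a figure comparable to the reported CV.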

Data processing and analysis

We converted the acquired RAW format data to mzXML format in ProteoWizard msConvert, then extracted and aligned mass spectral features separately for each mode using apLCMS [23] with modifications by xMSanalyzer [24]. We used ComBat [25] for batch correction. The resulting feature tables consisted of 5,992 HILIC and 5,780 C18 features, each with accurate m/z, retention time (RT), and peak intensity (i.e., peak area, as a semi-quantitative measure for statistics) for individual ion features in each sample; these are referred to as m/z features hereafter. For QC purposes, we filtered the feature tables to remove peaks that were detected in fewer than 20% of study samples (i.e., consistently detected in analytical replicates of at least one participant's sample). We did not observe statistically significant differences in the number of missing features between cases and controls. We retained a total of 2,264 metabolic features for HILIC and 2,988 metabolic features for C18 for data analysis; remaining values below the detection limit were imputed with half the minimum of the non-missing values. Prior to statistical analysis, we log10-transformed and Pareto-scaled peak intensities [26]. To annotate compound structures for these detectable metabolic features, we used a multi-layered approach and assigned annotation confidence according to the Schymanski scale [27], following the guidelines of the Metabolomics Standards Initiative (MSI) [28]. Briefly, we referenced an in-house m/z-RT library that was established from over 900 authentic chemical standards (level 1) and applied de novo annotation (level 4) by matching accurate m/z against annotations from Mummichog pathway analysis (10 ppm). We filtered out unlikely annotations by (1) focusing exclusively on the ESI adduct species [M + H]+, [M + H-H2O]+, [M]+, [M-H]−, and [M-H-H2O]−, and (2) filtering on machine learning-predicted RT, using a bidirectional recurrent neural network (BRNN) for HILIC RT and a random forest for C18 RT.
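The 10 ppm accurate-mass matching against the listed adduct species can be sketched as follows. This is an illustration under stated assumptions, not the Mummichog implementation: the helper names are hypothetical, the physical constants are rounded, and the neutral monoisotopic mass of histidine (C6H9N3O2) was computed here from its formula rather than taken from the study.

```python
PROTON = 1.007276     # proton mass in Da (rounded)
ELECTRON = 0.000549   # electron mass in Da (rounded)
WATER = 18.010565     # monoisotopic mass of H2O in Da (rounded)

# Shift added to the neutral monoisotopic mass M for each adduct species
ADDUCT_SHIFTS = {
    "[M+H]+": PROTON,
    "[M+H-H2O]+": PROTON - WATER,
    "[M]+": -ELECTRON,
    "[M-H]-": -PROTON,
    "[M-H-H2O]-": -PROTON - WATER,
}


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def match_adducts(observed_mz, neutral_mass, tol_ppm=10.0):
    """Return {adduct: ppm error} for adducts within tol_ppm of observed_mz."""
    hits = {}
    for adduct, shift in ADDUCT_SHIFTS.items():
        err = ppm_error(observed_mz, neutral_mass + shift)
        if abs(err) <= tol_ppm:
            hits[adduct] = err
    return hits


HISTIDINE_MASS = 155.069477  # assumed neutral monoisotopic mass, C6H9N3O2
hits = match_adducts(138.0662, HISTIDINE_MASS)
print(hits)
```

Under these assumptions, the HILIC feature reported at m/z 138.0662 falls within the tolerance only for the water-loss adduct [M+H-H2O]+ of histidine. In practice a match like this would still need the RT filter described above before being accepted.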

Absolute risk of BC

We assessed the 1-year risk of breast cancer using familial pedigree and vital status data, encompassing cancer diagnoses, age at diagnosis, and information on BRCA1 and BRCA2 mutations. We used the BOADICEA model [29] and entered the resulting probability as a continuous risk score in subsequent regression analyses. Inputs to the BOADICEA algorithm include age at baseline, year of birth, first-, second-, and third-degree relatives with BC, identical twin with BC, age at cancer diagnosis, bilateral BC, ovarian cancer, pancreatic cancer, prostate cancer, molecular subtype of breast tumors, vital status of family members, BRCA1 and BRCA2 mutation status, and Ashkenazi Jewish heritage [30].

Statistical methods

We used the Wilcoxon rank test to compare metabolic feature levels between cases and controls, and used an unadjusted p < 0.05 to select candidate metabolic features for further multivariable logistic regression analysis. We also conducted partial least squares-discriminant analysis (PLS-DA) to examine the metabolic features by case and control groups while adjusting for confounding factors. Specifically, for data pretreatment, we removed potential batch effects by ComBat [31] normalization (using xMSanalyzer [24]), imputed zero values with half of the minimum within-sample peak intensity, and conducted log-transformation and Pareto scaling of the alignment datasets (HILIC and C18 separately). We then performed linear regression adjusting for potential confounding variables including age (continuous, years), BMI (continuous, kg/m2), smoking, alcohol drinking, and menopausal status; the fit of the model was checked, and the resultant residuals were retrieved for PLS-DA using mixOmics. The variable importance in the projection (VIP) in the PLS-DA was retrieved in R using the PLSDA.VIP() function of mixOmics [32], and the VIP scores were plotted to help rank the top candidate metabolic ion features contributing the most to the PLS-DA classification. We performed pathway enrichment analysis in MetaboAnalyst 5.0 using the mixed mode (combining HILIC and C18 data) as input, and applied the Mummichog [33] algorithm (p-value cutoff 0.05) to identify the most enriched metabolic pathways, referencing MFN (Homo sapiens), a human genome-scale metabolic model from the original mummichog package that has been manually curated from various sources including KEGG, BiGG and the Edinburgh model.
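The pretreatment steps named above (half-minimum imputation, log transformation, Pareto scaling) can be sketched in a few lines of Python. The sketch assumes Pareto scaling in its usual form, mean-centering followed by division by the square root of the sample standard deviation, and it leaves open whether the vector passed in is one sample or one feature. Function names and intensities are hypothetical.

```python
import math
from statistics import mean, stdev


def impute_half_min(values):
    """Replace zeros or None with half of the smallest positive value."""
    positive = [v for v in values if v is not None and v > 0]
    fill = min(positive) / 2.0
    return [v if (v is not None and v > 0) else fill for v in values]


def log10_pareto(values):
    """log10-transform, mean-center, then divide by sqrt(sample SD)."""
    logged = [math.log10(v) for v in values]
    center = mean(logged)
    scale = math.sqrt(stdev(logged))
    return [(x - center) / scale for x in logged]


raw = [1000.0, 0.0, 100.0, 10000.0]   # hypothetical peak intensities
filled = impute_half_min(raw)          # the zero becomes 50.0
scaled = log10_pareto(filled)
print(filled)
print([round(x, 3) for x in scaled])
```

Dividing by the square root of the standard deviation, rather than the standard deviation itself, keeps more of the original intensity structure than unit-variance scaling while still damping the influence of very abundant features.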

We used logistic regression to calculate odds ratios (ORs) and 95% confidence intervals (CIs) for the association of individual metabolic features with BC diagnosis. Model 1 adjusted for age at blood draw (continuous). Model 2 adjusted for age and the BOADICEA 1-year breast cancer risk score (continuous). Model 3 included the variables in Model 2 plus BMI (continuous), race and ethnicity, alcohol, smoking status (never, former, or current), and menopausal status (pre- or postmenopausal). For ROC analysis, we fitted three logistic regression models: Model 1, including age (continuous, years); Model 2, including age and BOADICEA 1-year risk score; and Model 3, including age, BOADICEA risk score, and six metabolic features. All variables in these models were modeled as continuous rather than categorical, which makes it less likely that the model was over-fitted given the small sample size. We also conducted a sensitivity analysis by excluding the 4 cases diagnosed with breast cancer within 1 year after blood collection. Analyses were done in SAS (v. 9.4).
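For readers less familiar with how an OR and its interval follow from a fitted logistic model, the conversion can be sketched as below. It assumes Wald intervals, OR = exp(β) with limits exp(β ± 1.96·SE); the text does not state which interval type was used. The coefficient and standard error are hypothetical values, back-calculated here so that the output roughly matches one reported interval, and are not taken from the authors' models.

```python
import math


def odds_ratio_ci(beta, se, z=1.959964):
    """Odds ratio and Wald 95% CI from a logistic coefficient and its SE."""
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )


# Hypothetical coefficient per unit of a scaled metabolite feature
or_est, ci_low, ci_high = odds_ratio_ci(beta=-0.431, se=0.210)
print(round(or_est, 2), round(ci_low, 2), round(ci_high, 2))
```

An interval whose upper limit sits just below 1, as in this example, corresponds to a nominal p-value only slightly under 0.05, which is why such associations are sensitive to multiple-comparison adjustment.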


Table 1 presents the baseline characteristics for cases and controls. The mean ages were 45.2 ± 11.4 years for cases and 46.4 ± 13.4 years for controls. The average age at BC diagnosis was 51.6 ± 12.5 years. Twenty-four cases (60%) and 49 controls (71.0%) were pre-menopausal at baseline.

Table 1 Characteristics of study subjects, New York site of the BCFR

We detected and aligned 11,772 m/z features (5,992 HILIC and 5,780 C18) in the untargeted plasma metabolomic profiling. Of these, 5,252 (2,264 HILIC and 2,988 C18) were detected in at least 80% of samples. A non-parametric test found that 289 metabolic features (135 HILIC and 154 C18) (Fig. 1A and B) differed between cases and controls at an unadjusted p < 0.05. Thirty-two metabolic features (17 HILIC and 15 C18) had fold changes (FCs) above 1.5 (cases > controls) or below 0.667 (cases < controls) (Fig. 2A and B). Table 2 presents the changes in annotated metabolic features with significant fold changes (original p < 0.05) between cases and controls [34]. Among HILIC features, 4 were higher and 13 were lower in cases than in controls. Among C18 features, 12 were higher and 3 were lower in cases than in controls.
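The two-step screen described above (unadjusted p < 0.05, then a fold-change cutoff) can be sketched as follows. Several choices are assumptions made for illustration: the test is implemented as a two-sided Wilcoxon rank-sum test with a tie-corrected normal approximation and no continuity correction, and the fold change is taken as the ratio of group means on the raw intensity scale. The study may have used different variants. All names and data are hypothetical.

```python
import math
from statistics import mean


def rank_sum_test(cases, controls):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(cases), len(controls)
    pooled = sorted(cases + controls)
    ranks, tie_term, i = {}, 0.0, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2.0   # average of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        i = j
    u = sum(ranks[x] for x in cases) - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - n1 * n2 / 2.0) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2.0))


def fold_change(cases, controls):
    """Ratio of mean intensity in cases to mean intensity in controls."""
    return mean(cases) / mean(controls)


def is_candidate(cases, controls, alpha=0.05, fc_hi=1.5, fc_lo=0.667):
    """Apply the p-value screen and the fold-change cutoff together."""
    fc = fold_change(cases, controls)
    return rank_sum_test(cases, controls) < alpha and (fc > fc_hi or fc < fc_lo)


cases = [5.1, 6.3, 7.2, 8.0, 9.4, 10.5]      # hypothetical intensities
controls = [1.0, 2.2, 3.1, 4.0, 4.8, 5.0]
print(rank_sum_test(cases, controls), fold_change(cases, controls))
print(is_candidate(cases, controls))
```

Requiring both criteria keeps features that are consistently shifted and shifted by a meaningful amount, which is what the volcano plot displays on its two axes.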

Fig. 1

Manhattan plots of the metabolome-wide association study. Features highlighted in purple indicate original p < 0.05 in the Wilcoxon rank test.

Fig. 2

Volcano plot of metabolites/features. 

Table 2 Fold change of the significant metabolic features

Table 3 presents the ORs of BC risk for the annotated metabolic features. The ORs of log-transformed metabolic features ranged from 0.31 to 2.20 in Model 3. For the metabolites negatively associated with risk, the ORs ranged from 0.31 (95% CI: 0.14, 0.66) for a HILIC feature (m/z: 138.066, RT: 25.4 s, L-Histidine) to 0.65 (95% CI: 0.43, 0.98) for a HILIC feature (m/z: 222.0984, RT: 27.5 s, N-Acetylgalactosamine). For the metabolites positively associated with risk, ORs ranged from 1.61 (95% CI: 1.04, 2.51) for a HILIC feature (m/z: 101.58, RT: 90.4 s, 1,3-Dibutyl-1-nitrosourea) to 2.20 (95% CI: 1.15, 4.23) for a C18 feature (m/z: 346.246, RT: 126 s, 11-cis-Eicosenoic acid). These results were no longer statistically significant after adjusting for multiple comparisons.
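The text does not say which multiple-comparison procedure was applied. As one common possibility, the Benjamini-Hochberg false discovery rate adjustment can be sketched as below; this is an assumed method shown for illustration only, with hypothetical p-values.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw_p = [0.01, 0.04, 0.03, 0.20]          # hypothetical unadjusted p-values
adj_p = benjamini_hochberg(raw_p)
print([round(p, 4) for p in adj_p])
```

With thousands of features tested, each nominal p-value is scaled up by a large factor, so associations with confidence limits close to 1 would be expected to lose significance under this or any similar adjustment.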

Table 3 Breast cancer risk for 12 metabolic features, nested case–control study within the New York Site of the BCFR

We calculated the area under the receiver operating characteristic curve (AUC) to evaluate the performance of our classifier. The AUC of the model that included age and the BOADICEA 1-year risk score improved from 0.66 to 0.83 once our six candidate metabolites were incorporated into the model (Fig. 3). We did not include two metabolic features, glucose (m/z: 181.0721, RT: 33 s) and caffeine (m/z: 195.0878, RT: 31.6 s), because both were highly correlated with L-Histidine (m/z: 138.0662, RT: 25.4 s), with correlation coefficients of 0.83 (p < 0.0001) and 0.96 (p < 0.0001), respectively. We also conducted a sensitivity analysis excluding the four cases diagnosed with breast cancer within 1 year after blood collection. The results were similar (data not shown).
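The AUC has a direct interpretation as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control, with ties counted as one half. A minimal sketch of that pairwise calculation follows; the function name and predicted probabilities are hypothetical, and this is not the SAS procedure used in the study.

```python
def auc(case_scores, control_scores):
    """Fraction of case-control pairs ranked correctly (ties count one half)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical predicted probabilities from a fitted model
case_pred = [0.9, 0.7, 0.4]
control_pred = [0.8, 0.3, 0.2, 0.4]
print(auc(case_pred, control_pred))
```

With 40 cases and 70 controls there are 2,800 such pairs, so comparing two models amounts to comparing how many of those pairs each model orders correctly.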

Fig. 3

Receiver operating characteristic curves of a model with breast cancer risk factors and a model with breast cancer risk factors plus six metabolite features, New York site of the BCFR. The metabolomic panel includes 1,3-Dibutyl-1-nitrosourea, L-Histidine, N(6)-Methyllysine, N-Acetylgalactosamine, 11-cis-Eicosenoic acid and LysoPE(0:0/24:6(6Z,9Z,12Z,15Z,18Z,21Z))

In addition to the metabolome-wide association analysis, we used a supervised classification approach, PLS-DA, to differentiate cases and controls based on their metabolic profiles. Figure 4 presents the PLS-DA score plots; for both HILIC and C18 features, cases and controls form two clusters with some overlap. Figure 5 and Supplementary Table 1 present the results from the pathway enrichment analysis based on the Mummichog algorithm [33]. The main pathways associated with BC include arginine and proline metabolism and urea cycle/amino group metabolism.

Fig. 4

Partial least squares-discriminant analysis (PLS-DA) of plasma metabolomic data comparing breast cancer cases and unaffected controls under two complementary modes of analysis: (A) hydrophilic interaction liquid chromatography (HILIC) with positive ESI and (B) C18 chromatography with negative ESI

Fig. 5

Pathway analysis of the plasma metabolome comparing breast cancer cases and unaffected controls based on the Mummichog algorithm. The P-values are from Fisher’s exact test applied to an enrichment test of individual metabolic features on pathways, mapping m/z-matched metabolites against a permutation procedure to reduce Type I error while adopting a more conservative version of Fisher’s test to increase the robustness of the test


We conducted a metabolome-wide association study based on an untargeted metabolomics workflow and identified eight BC-related metabolic features that were statistically significantly different between cases and controls. One of the identified features is an amino acid and another is a lipid. In addition, we identified metabolic features related to diet as well as potential carcinogens. Pathway enrichment analysis identified a range of pathways linked to both amino acid metabolism (e.g., arginine and proline metabolism) and lipid metabolism (e.g., glycerophospholipid metabolism). Our findings suggest that those metabolites and associated pathways are worthy of further evaluation using targeted, quantitative metabolomics analyses for BC risk. However, we recognize that these differences were not statistically significant after adjustment for multiple comparisons; these preliminary findings therefore need to be tested and validated in larger prospective studies of BC.

1,3-Dibutyl-1-nitrosourea has demonstrated carcinogenic potential in animal models [35,36,37,38]. Specifically, exposure of rats to different doses of 1,3-dibutyl-1-nitrosourea via drinking water resulted in a dose–response relationship with mammary tumors [36]. Other cancers such as leukemia and vaginal tumors were also observed in rats with high exposure [39]. However, additional data are needed for the International Agency for Research on Cancer (IARC) to determine whether a probable or possible carcinogen is carcinogenic in humans. To the best of our knowledge, this study provides the first human data on an association of 1,3-dibutyl-1-nitrosourea with BC, demonstrating the utility of the approach in identifying potential environmental exposures associated with the disease.

Dietary polyunsaturated fatty acids have been postulated as a modifiable factor that could influence cancer risk [40]. However, evidence for the effects of polyunsaturated fats such as omega-3 and omega-6 fatty acids on cancer risk is conflicting [41,42,43]. Dietary intake of trans fatty acids was associated with a slightly increased risk of BC (HR = 1.09, 95% CI: 1.01, 1.17) in the European Prospective Investigation into Cancer and Nutrition (EPIC) [44]. A systematic review and meta-analysis of randomized trials of omega-3, omega-6 and total dietary polyunsaturated fat and cancer incidence concluded that increasing omega-3 has little or no effect on BC incidence (RR = 1.03, 95% CI: 0.89, 1.20) [45]. By measuring serum phospholipid fatty acid composition among women in the E3N study, Chajes et al. found that increasing levels of palmitoleic acid, a trans-monounsaturated fatty acid, were associated with an increased risk of BC (OR = 1.7) [46]. We found that both eicosapentaenoic acid, an omega-3 fatty acid, and 11-cis-eicosenoic acid, an omega-9 fatty acid, were associated with an increased risk of BC. Because omega-3 has been suggested as a supplement for BC prevention, a compensatory mechanistic route may occur in BC cases. Our finding needs to be validated in cohorts with a larger sample size.

LysoPE(0:0/24:6(6Z,9Z,12Z,15Z,18Z,21Z)), a lysophospholipid, is classified as a lipid mediator and elicits many biological effects, such as cell proliferation and migration, that are critically required for tumor formation and metastasis [47]. We found that higher LysoPE(0:0/24:6(6Z,9Z,12Z,15Z,18Z,21Z)) was associated with higher BC risk. Alterations of lysoPC and lysoPE were observed in serum and plasma collected from BC patients [13, 48, 49] as well as in breast tumor tissue [49]. It has been suggested that lipid oversupply enhances cancer cell proliferation by providing the raw materials needed to generate new cells [50]. Chronic lipid oversupply might increase BC risk, perhaps by supplying energy and nutrients to growing tumors.

L-Histidine is an essential amino acid with unique roles in proton buffering, metal ion chelation, and scavenging of reactive oxygen and nitrogen species [51]. Histidine supplementation suppressed inflammation and improved insulin resistance in obese women with metabolic syndrome in a randomized controlled trial [52]. Histidine was associated with a decreased risk of BC (OR = 0.91, 95% CI: 0.84, 0.99) in a metabolome-wide association study within EPIC; however, the association was no longer statistically significant after adjustment for multiple comparisons [14]. Another metabolome-wide association study found that histidine was associated with an increased risk of BC among premenopausal women in the French E3N cohort [12].

Diabetes was associated with triple-negative breast cancer in a prospective analysis of the Sister Cohort [53]. Long term use of metformin has been associated with decreased risk of ER-positive BC [53]. Impaired glucose was associated with a non-statistically significant 40 percent higher BC risk in a cohort of 7,894 women aged 45–64 years from four US communities [54]. The inverse association between glucose and BC risk is challenging to interpret as the biospecimens were collected from non-fasting individuals in our study.

The epidemiological evidence on coffee consumption and BC risk is conflicting [55]. The EPIC study found an association between coffee intake and lower postmenopausal BC risk (HR = 0.90, 95% CI: 0.82, 0.98) [56], whereas there was no evidence for an association in a cohort of 57,075 postmenopausal women [57]. Overall, current studies of coffee and BC have assessed coffee consumption by self-reported questionnaire. One suggestion is that risk may differ with the rate of caffeine metabolism [58]. Further biomarker studies measuring caffeine metabolites are needed to better characterize any preventive effect of caffeine in BC development.

In addition to the metabolome-wide association analysis identifying individual metabolic features associated with BC, pathway enrichment analysis showed that selected metabolic pathways, such as arginine and proline metabolism and the urea cycle, might be altered in early breast tumorigenesis [59,60,61,62]. Untargeted metabolomics is a hypothesis-generating strategy to discover early signs of metabolome-wide perturbations in BC development. Measuring metabolomic profiles may be a potential screening tool to identify higher-risk individuals [63, 64]. Perturbations in fatty acid, arginine, and proline metabolism were found in plasma from BC cases at the time of cancer diagnosis [64, 65]. Our findings could provide insights for the identification of pathways for BC development.

Due to the sample size limitations, we opted not to explore the metabolite profiles by BC molecular subtypes. The results of our study also need to be interpreted with caution. The metabolomic features were only measured in non-fasting blood samples from a single timepoint for each participant, and we saw three metabolic features (L-Histidine, glucose, and caffeine) were positively correlated with each other. Although it is likely that most of the endogenous metabolites are biologically reproducible within a 2-year period [66], further studies are needed to examine the effect of blood collection conditions such as seasonal variation or fasting time. In addition, six metabolite features remain statistically significantly different between cases and controls after adjusting for selected risk factors; however, there might be some unadjusted confounding factors.

Accurately identifying high-risk individuals is essential for effective primary prevention (e.g., chemoprevention) [67,68,69,70,71], and for risk-based screening options [72, 73] which emphasize risk rather than age for optimal screening outcomes. BC risk assessment models used in the clinic have only very modest discriminatory accuracy, in the range of 65% [74,75,76,77], meaning that 35% of women are misclassified. Inaccurate risk assessment means that women are either subject to over-treatment with biopsies and multiple screens or under-treatment with missed opportunities for optimal prevention, including chemoprevention. The most widely known and most commonly used model for BC risk assessment is the Breast Cancer Risk Assessment Tool (BCRAT, or Gail model) [78, 79], which, although well-calibrated, has only modest discriminatory accuracy at the individual level (AUC ~ 0.6–0.65) [80,81,82]. Recently, modest improvements were achieved by incorporating polygenic risk scores [83], epigenetic markers [84], and lifestyle factors [85]. Metabolome studies identified diet- and/or lifestyle-related metabolic features and their associations with breast cancer [16, 86, 87]. Metabolomics can detect metabolic shifts resulting from lifestyle behaviors and may provide insight into the relevance of these changes to carcinogenesis. In addition, metabolomics analysis can also identify metabolic features associated with environmental exposures, such as polycyclic aromatic hydrocarbons (PAHs) [88]. Our prior study showed that women with a higher risk of BC based on their genetic factors are more susceptible to PAH exposure [89]. Incorporating metabolite markers related to modifiable factors might result in substantially greater magnitudes of association with BC risk.

Strengths of our study include the collection of plasma before diagnosis (range of 1–15 years), and the use of an untargeted metabolomic approach allowing us to identify novel contributors to BC. In summary, our study identified selected metabolic pathways and potential exposure factors related to breast cancer. If replicated in larger prospective cohorts, these findings offer promising new ways to identify environmental exposures related to BC and improve BC risk prediction.

Availability of data and materials

No datasets were generated or analysed during the current study.

Code availability statement

The underlying code for this study is available upon reasonable request.



Abbreviations

AUC: Area under the curve
BC: Breast cancer
BCFR: Breast Cancer Family Registry
BMI: Body mass index
BOADICEA: Breast and Ovarian Analysis of Disease Incidence and Carrier Estimation Algorithm
CI: Confidence interval
LC-MS: Liquid chromatography-mass spectrometry
LC-HRMS: Liquid chromatography-high resolution mass spectrometry
HILIC: Hydrophilic interaction liquid chromatography
MSI: Metabolomics Standards Initiative
OR: Odds ratio
PLS-DA: Partial least squares-discriminant analysis
RT: Retention time
VIP: Variable importance in the projection


