Risk prediction models with incomplete data with application to prediction of estrogen receptor-positive breast cancer: prospective data from the Nurses' Health Study

Abstract

Introduction: A number of breast cancer risk prediction models have been developed to provide insight into a woman's individual breast cancer risk. Although circulating levels of estradiol in postmenopausal women predict subsequent breast cancer risk, whether the addition of estradiol levels adds significantly to a model's predictive power has not previously been evaluated.

Methods: Using linear regression, the authors developed an imputed estradiol score using measured estradiol levels (the outcome) and both case status and risk factor data (for example, body mass index) from a nested case-control study conducted within a large prospective cohort study and used multiple imputation methods to develop an overall risk model including both risk factor data from the main cohort and estradiol levels from the nested case-control study.

Results: The authors evaluated the addition of imputed estradiol level to the previously published Rosner and Colditz log-incidence model for breast cancer risk prediction within the larger Nurses' Health Study cohort. The follow-up was from 1980 to 2000; during this time, 1,559 invasive estrogen receptor-positive breast cancer cases were confirmed. The addition of imputed estradiol levels significantly improved risk prediction; the age-specific concordance statistic increased from 0.635 ± 0.007 to 0.645 ± 0.007 (P < 0.001) after the addition of imputed estradiol.

Conclusion: Circulating estradiol levels in postmenopausal women appear to add to other lifestyle factors in predicting a woman's individual risk of breast cancer.


Introduction
Breast cancer risk prediction models have been developed for use as an entry criterion into breast cancer chemoprevention trials (for example, National Surgical Adjuvant Breast and Bowel Project tamoxifen trial and the Study of Tamoxifen and Raloxifene), in counseling women on the potential use of chemopreventives, and to provide insight into a woman's individual breast cancer risk [1][2][3][4]. The initial Gail model incorporated a subset of breast cancer risk factors, namely age, age at menarche, age at first birth, family history of breast cancer or of atypical hyperplasia, and history of breast biopsies [5,6]. Subsequently, several groups have developed more extensive statistical models that incorporate a greater number of breast cancer risk factors [1,4].
In postmenopausal women, circulating levels of estradiol predict subsequent breast cancer risk [7][8][9][10], particularly for estrogen receptor (ER)-positive disease [9]. However, plasma estradiol levels have not previously been evaluated within risk prediction models, and whether including them would improve a model's predictive power is unknown. Plasma estradiol is available only from the Nurses' Health Study nested case-control data set. We initially attempted to evaluate the addition of estradiol concentrations to the Rosner and Colditz risk prediction model using data from the Nurses' Health Study nested case-control data set. However, with the relatively modest size of the nested case-control data set and the large number of parameters to be estimated, a number of the risk factor parameters did not adequately reflect those from the parent cohort. Thus, development of an accurate risk prediction tool requires a large sample size, as in the main Nurses' Health Study cohort. The purpose of this paper is to describe a methodology for developing a risk prediction rule when one or more predictors are incompletely observed and to apply it to assess the predictive power of plasma estradiol after adjusting for standard breast cancer risk factors as well as the effect of other risk factors adjusted for plasma estradiol.

Abbreviations: BBD = benign breast disease; BMI = body mass index; C statistic = concordance statistic; CI = confidence interval; ER = estrogen receptor; PMH = postmenopausal hormone; Q1, Q2, Q3, and Q4 = first, second, third, and fourth quartile; RR = relative risk.

Cohort
The Nurses' Health Study cohort was established in 1976, when 121,700 female US nurses (30 to 55 years old) responded to a mailed questionnaire that inquired about reproductive history and a range of lifestyle factors in addition to disease diagnoses [11,12]. Follow-up questionnaires have been mailed biennially to update exposure information and any major medical events. Deaths are reported by family members or the postal service or are identified by a search of the National Death Index. We estimate that mortality ascertainment is 98% complete [13,14]. This investigation was approved by the Brigham and Women's Hospital institutional review board.

Identification of breast cancer cases
On each questionnaire, we inquired whether breast cancer had been diagnosed and, if so, the date of diagnosis. All women who reported breast cancer (or the next of kin for decedents) were contacted for permission to review medical records to confirm the diagnosis. We include only invasive cases of breast cancer confirmed by the pathology report. ER and progesterone receptor status of the tumor was determined from the medical record. In this report, we evaluate ER+ cases only, as we previously observed that plasma estradiol most strongly predicted this tumor subtype [9].

Population for analysis
The population of women used in this analysis has been described in detail in several previous publications [1,15]. Briefly, we excluded women with unknown, inconsistent, or out-of-range reports for height, weight in 1976 or at age 18, age at menarche or menopause or each pregnancy, parity, and duration or type of postmenopausal hormone (PMH) use (n = 42,886). Additionally, women with a simple hysterectomy (and hence unknown age at menopause) (n = 10,301) were excluded. Participants who were ineligible for the study (for example, prevalent cancer in 1976) or who had no follow-up after 1978 (n = 2,360) were excluded. In the current analysis, women who were premenopausal throughout follow-up were excluded (n = 6,342); women who became postmenopausal during follow-up contributed person-time from that point onward. Overall, 59,812 participants remained for this analysis. These women contributed 750,086 person-years from 1980 to 2000, during which 1,559 incident invasive ER+ breast cancer cases occurred.

Blood subcohort and nested case-control study
From 1989 to 1990, 32,826 cohort members provided blood samples. Informed consent was obtained from each participant; details about the blood collection methods have been published previously [16,17]. Briefly, women arranged to have their blood drawn and shipped with an icepack via overnight courier to our laboratory, where it was processed and archived in liquid nitrogen freezers. Estradiol is stable in cooled whole blood for 24 to 48 hours [18]. At blood collection, women completed a short questionnaire that included questions on recent use of PMH (within the last 3 months). Follow-up of the blood study cohort was 99% in 2000.
In the current analyses, we used a previously described nested case-control study of sex steroids and breast cancer risk with cases diagnosed after blood collection through 31 May 1998 [9,16]. In addition, cases diagnosed up through 31 May 2000 and their matched controls (that is, a 2-year extension of the published report [9]) are included. At blood collection, cases and controls were postmenopausal, were not recent users of PMH, and had no prior diagnosed cancer (except nonmelanoma skin cancer). Control subjects were matched by age, month/year and time of day of blood collection, and fasting status and had not been diagnosed with breast cancer before the diagnosis date of their matched case. To mimic the larger population used in the risk prediction modeling, only cases and controls meeting the inclusion criteria described above were included (for example, no prior simple hysterectomy). Women were considered postmenopausal if they reported having a natural menopause (for example, no menstrual cycles during the previous 12 months) or had a bilateral oophorectomy. In all, 164 ER+ cases and 346 controls were included.

Laboratory assays
Estradiol was measured by radioimmunoassay following extraction and celite column chromatography, as previously described [9]. The coefficient of variation was less than or equal to 11%.

Description of the risk prediction model
We fit the log-incidence model of breast cancer to incident ER+ cases, as previously described [1,15]. We assume that incidence at time t, $I_t$, is proportional to the number of cell divisions, $C_t$, accumulated throughout life up to age t; that is,

$$I_t = k \, C_t. \tag{1}$$

The cumulative number of breast cell divisions is factored as follows:

$$C_t = C_0 \prod_{i=0}^{t-1} \frac{C_{i+1}}{C_i} = C_0 \prod_{i=0}^{t-1} \lambda_i. \tag{2}$$

Thus, $\lambda_i = C_{i+1}/C_i$ represents the rate of increase of breast cell divisions from age i to age i+1. $\log(\lambda_i)$ is assumed to be a linear function of risk factors that are relevant at age i. The set of risk factors and their magnitude may vary according to the stage of reproductive life. Details of the representation of the $C_i$, and the full form of the overall model, are given in [1,15].

The general rationale for a log-incidence model is that the number of precancerous cells increases multiplicatively with time but that historical exposures differentially affect the rate of increase. Specifically, for breast cancer, the number of precancerous cells is assumed to increase annually at the rate of $\exp(\beta_0)$ prior to menopause for nulliparous women, at the rate of $\exp(\beta_0 + \beta_1 s)$ prior to menopause for parous women with parity = s, and so forth. Finally, the number of precancerous cells increases immediately after the first birth by $\exp[\beta_2 (t_1 - t_0)]$. The incidence rate of breast cancer is assumed to be approximately proportional to the number of precancerous cells.
The log-incidence model was fit using iteratively reweighted least squares with PROC NLIN in SAS (SAS version 6.12; SAS Institute Inc., Cary, NC, USA) (1996). The parameters of the model are readily interpretable in a relative risk (RR) context. For example, $\exp(-\beta_0)$ = RR for a 1-year increase in age at menarche among nulliparous women, $\exp[-(\beta_0 + \beta_2)]$ = RR for a 1-year increase in age at menarche among parous women, and so forth. In this analysis, women were followed until they had an event (ER+ breast cancer) or were censored if they developed (a) ER− breast cancer, (b) breast cancer in which ER status is unknown, or (c) other types of cancer except nonmelanoma skin cancer or (d) if they died.
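As an illustration only, the following Python sketch shows how a multiplicative accumulation of yearly rates leads to the RR interpretation described above. It assumes a deliberately simplified two-phase model for a nulliparous woman (one yearly log-rate before menopause, another after, with accumulation starting at menarche). The coefficient values, the function names, and the constant `LOG_K_C0` are hypothetical and are not estimates from this study.

```python
import math

def yearly_log_rates(age, age_menarche, age_menopause, beta_pre, beta_post):
    """Return ln(lambda_i) for each year i from menarche up to (not including) `age`.

    Simplifying assumption: one rate before menopause, another after.
    """
    return [beta_pre if i < age_menopause else beta_post
            for i in range(age_menarche, age)]

def log_incidence(log_k_c0, log_lambdas):
    """ln(I_t) = ln(k * C_0) + sum of the yearly ln(lambda_i) terms."""
    return log_k_c0 + sum(log_lambdas)

# Hypothetical values chosen only to make the example concrete.
BETA_PRE, BETA_POST = 0.08, 0.02
LOG_K_C0 = -12.0

# Two otherwise identical women aged 60 with menopause at 50,
# differing only in age at menarche (13 versus 14).
early = log_incidence(LOG_K_C0, yearly_log_rates(60, 13, 50, BETA_PRE, BETA_POST))
late = log_incidence(LOG_K_C0, yearly_log_rates(60, 14, 50, BETA_PRE, BETA_POST))

# A 1-year delay in menarche removes one premenopausal year of accumulation,
# so the RR equals exp(-BETA_PRE).
rr_menarche = math.exp(late - early)
print(rr_menarche, math.exp(-BETA_PRE))
```

Because the log-incidence is a sum of yearly terms, each risk factor's contribution can be read off as a ratio of incidences, which is why the model parameters translate directly into RRs.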

Imputation and inclusion of estradiol in the risk prediction model
Ideally, we would have estradiol levels measured on each main study participant at several points in time. However, since this was not possible, we used an indirect approach to impute estradiol. Let x = estradiol and z = other covariates in the risk prediction model.
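From the main study, we can obtain Pr(D|z), given under the rare disease assumption by:

$$\Pr(D \mid z) \cong \exp(\alpha + \beta z). \tag{4}$$

We want to estimate Pr(D|x,z), where under the rare disease assumption

$$\Pr(D \mid x, z) \cong \exp(\alpha^* + \beta^* z + \delta^* x). \tag{5}$$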
From the blood study, we can estimate δ* based on conditional logistic regression. Indeed, in principle, we could also estimate β* from the blood study, but the estimates would be very imprecise because of the small sample size. Therefore, we used the main study population to estimate the parameters in Equation 5 by estimating x for all subjects in the main study based on a linear regression derived from the blood study:

$$x = \gamma_0 + \gamma_1 y + \gamma_2 Z_{imp} + e, \tag{6}$$

where x = ln(estradiol) as a continuous variable, y = 1 if case and 0 if control, $Z_{imp}$ = a subset of the other covariates Z in the risk prediction model, and $\gamma_0$, $\gamma_1$, and $\gamma_2$ denote the regression coefficients. $Z_{imp}$ was ascertained by first forcing in y and then using stepwise-up regression to determine the subset of components of Z in the main study which were significantly associated with x at the 5% level.

In the blood study, estradiol levels on average were higher for cases than controls. The rationale for including y as a covariate in Equation 6 is to account for this relationship in the main study as well. In addition, because there is substantial overlap between the estradiol distribution of cases and controls, we used an imputation strategy to estimate x by adding error to the prediction such that for each main study participant we obtain

$$\hat{x}_i = \hat{\gamma}_0 + \hat{\gamma}_1 y_i + \hat{\gamma}_2 Z_{imp,i} + e_i, \tag{7}$$

where (a) $e_i$ = an error term that is normally distributed with mean 0 and variance $\sigma^2$, (b) $y_i$ = 1 if a breast cancer case and = 0 otherwise, (c) $\sigma^2$ is estimated from Equation 6, and $e_i$ is obtained by the RANNOR function of SAS so as to add error to the estimate of x for individual women. We then fit the model in Equation 5 using $\hat{x}$ instead of x, thus obtaining the model

$$\Pr(D \mid \hat{x}, z) \cong \exp(\alpha^{**} + \beta^{**} z + \delta^{**} \hat{x}). \tag{8}$$

Since the parameter estimates in Equation 8 may be influenced by the random error introduced in Equation 7, we repeated this imputation approach four additional times and used multiple imputation [19] to combine estimates from the separate imputations to obtain an overall estimate.
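The imputation steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' SAS implementation: the data are synthetic, the single covariate stands in for $Z_{imp}$, and the per-imputation quantity (the case versus non-case difference in mean imputed ln estradiol) is only a stand-in for the coefficient of imputed estradiol in the risk model. The combining step uses the standard multiple-imputation rules (pooled mean, within-imputation plus inflated between-imputation variance). All function names are hypothetical.

```python
import numpy as np

def fit_imputation_model(x, y, z):
    """OLS of x on an intercept, case status y, and covariate(s) z."""
    design = np.column_stack([np.ones(len(x)), y, z])
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    resid = x - design @ coef
    sigma2 = resid @ resid / (len(x) - design.shape[1])  # residual variance
    return coef, sigma2

def impute(coef, sigma2, y, z, rng):
    """Predicted x plus a normal error draw with variance sigma2."""
    design = np.column_stack([np.ones(len(y)), y, z])
    return design @ coef + rng.normal(0.0, np.sqrt(sigma2), size=len(y))

def rubin_combine(estimates, variances):
    """Pool M estimates: mean, and within + (1 + 1/M) * between variance."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    total = variances.mean() + (1.0 + 1.0 / m) * estimates.var(ddof=1)
    return estimates.mean(), total

rng = np.random.default_rng(0)

# Synthetic "blood study": x = ln(estradiol), y = case status, z = BMI-like covariate.
n_blood = 510
y_blood = (rng.random(n_blood) < 0.32).astype(float)
z_blood = rng.normal(26.0, 4.0, n_blood)
x_blood = 1.0 + 0.2 * y_blood + 0.03 * z_blood + rng.normal(0.0, 0.5, n_blood)
coef, sigma2 = fit_imputation_model(x_blood, y_blood, z_blood)

# Synthetic "main study" in which x is unobserved.
n_main = 5000
y_main = (rng.random(n_main) < 0.03).astype(float)
z_main = rng.normal(26.0, 4.0, n_main)

estimates, variances = [], []
for _ in range(5):  # one imputation plus four additional ones
    x_hat = impute(coef, sigma2, y_main, z_main, rng)
    cases, noncases = x_hat[y_main == 1], x_hat[y_main == 0]
    estimates.append(cases.mean() - noncases.mean())
    variances.append(cases.var(ddof=1) / len(cases)
                     + noncases.var(ddof=1) / len(noncases))

pooled, total_var = rubin_combine(estimates, variances)
print(pooled, total_var ** 0.5)
```

Adding the error draw preserves the spread of estradiol within cases and non-cases; repeating the draw and pooling ensures that the reported uncertainty reflects the randomness introduced by imputation.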
To assess the additional predictive power of serum estradiol, we computed age-specific (5-year age groups) deciles of the risk function without estradiol (model A) as well as including imputed estradiol (model B). From the cross-classification of risk decile model A × risk decile model B, we then compared the observed number of cases in specific risk deciles of model B with the expected number of cases within strata defined by model A risk decile. Specifically, let $X_{ij}$ = the number of breast cancer cases, $N_{ij}$ = the number of person-years, and $p_{ij} = X_{ij}/N_{ij}$, which is the estimated incidence rate within the ith age-specific risk decile for model A and the jth age-specific risk decile for model B, and let $\ln(p_{ij}) = \alpha_i + \beta(j - 1)$. Then $100\% \times [\exp(\hat{\beta}) - 1]$ is an estimate of the percentage increase in breast cancer incidence for an increase of one model B risk decile, holding the model A risk decile constant [20]. We wish to test the hypothesis $H_0$: β = 0 versus $H_1$: β ≠ 0. This approach of cross-classifying individuals by two different risk prediction rules is similar to the reclassification table approach used to compare risk prediction rules in the Framingham Heart Study [21]. In addition, to assess the predictive ability of our risk prediction models, we used the area under the receiver operating characteristic curve (that is, the concordance or C statistic). This statistic ranges from 0.5 to 1.0 and represents the probability that, for a randomly selected pair of women, one with ER+ breast cancer and one without breast cancer, the woman with ER+ breast cancer has the higher estimated disease probability. Also, we compared the C statistic for different risk prediction rules [22]. In our primary analysis, we evaluated the addition of imputed estradiol levels to risk prediction models in the entire cohort.
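The pairwise definition of the C statistic given above can be written out directly. The sketch below is illustrative only: it ignores the age stratification used in this analysis, counts tied scores as one half (a common convention, assumed here), and uses made-up risk scores. The function name is hypothetical.

```python
def c_statistic(scores_cases, scores_noncases):
    """Probability that a randomly chosen case has a higher score than a
    randomly chosen non-case; ties contribute one half (assumed convention)."""
    concordant = 0.0
    for sc in scores_cases:
        for sn in scores_noncases:
            if sc > sn:
                concordant += 1.0
            elif sc == sn:
                concordant += 0.5
    return concordant / (len(scores_cases) * len(scores_noncases))

# Made-up estimated risks for three cases and four non-cases.
cases = [0.30, 0.20, 0.15]
noncases = [0.10, 0.20, 0.25, 0.05]
c = c_statistic(cases, noncases)
print(c)
```

The all-pairs loop is quadratic in the number of subjects, so for a cohort of this size a rank-based computation would be used in practice; the loop is kept here because it mirrors the verbal definition.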
As a secondary approach, we calculated Rosner and Colditz model risk scores in the entire cohort and then, in the nested case-control data set, assessed the impact of adding this score to the plasma estradiol and breast cancer model.

Results
Within the nested case-control data set, we observed a significant association between plasma estradiol and risk of breast cancer (P trend < 0.001), with an RR for the top (Q4) versus bottom (Q1) quartile category of 3.3 (95% confidence interval [CI] = 1.8 to 6.0) for ER+ breast cancer (Table 1). Each of the variables in the Rosner and Colditz risk prediction model was considered as a potential predictor of plasma estradiol. BMI was most strongly related to estradiol level; in addition, the birth index, case status, and duration of postmenopause each contributed modestly but significantly (Table 2). Other variables, including family history of breast cancer, alcohol intake, and history of BBD, did not contribute significantly to the model and thus were dropped from further consideration. The $r^2$ for the regression model was 0.219.
Most variables in the Rosner and Colditz model incorporate a time component (for example, postmenopausal BMI = average BMI postmenopause × duration postmenopause). Because exposure status at the time of blood draw might be most strongly correlated with estradiol, we also evaluated each of the variables at the time of blood draw or, for variables ascertained only on the main study questionnaire (for example, alcohol intake), within 2 years of blood draw. All results were similar.
In Table 3, we present the risk prediction model without (model A) and with (model B) adjustment for $\log_e$ estradiol. It appears that the effect of BMI, parity, and late menopause on the incidence of breast cancer is mediated in part by changes in $\log_e$ estradiol caused by obesity (increase), multiparity with first birth at an early age (decrease), and delayed menopause (increase), respectively.
We now present the cross-classification of model A × model B risk decile in Table 4. It is clear from Table 4 that, within most model A risk deciles, there are important differences in estimated incidence according to model B risk decile (often twofold). Overall, for a given model A risk decile, the observed number of cases was higher than expected when the model B decile was high and lower than expected when the model B decile was low. The overall slope was β = 0.511 ± 0.034 (P < 0.001), indicating that there is a significant estimated 67% increase in breast cancer incidence for an increase of one model B age-specific risk decile, holding the age-specific model A risk decile constant. This indicates that there is substantial increased predictive power upon adding $\log_e$ estradiol to the risk prediction model.
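The 67% figure follows from the slope through the relation 100% × [exp(β) − 1] given in the methods. A short Python check, with a hypothetical function name, is:

```python
import math

def percent_increase_per_decile(beta, deciles=1):
    """Percentage increase in incidence for a rise of `deciles` model B risk
    deciles under the log-linear model ln(p_ij) = alpha_i + beta * (j - 1)."""
    return 100.0 * (math.exp(beta * deciles) - 1.0)

beta_hat = 0.511  # slope reported in the text
pct_one = percent_increase_per_decile(beta_hat)
print(round(pct_one))  # 67
```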
In addition, we compared the age-specific C statistics between model A versus model B. We found C statistics of 0.635 ± 0.007 for model A and 0.645 ± 0.007 for model B (C statistic model A versus C statistic model B; P < 0.001). Using our secondary approach (applying the population-based risk scores to the subset of women in the nested case-control study), the RRs of breast cancer by plasma hormone level were similar, though slightly attenuated, compared with those in Table 1. For example, the RRs of ER+ breast cancer with increasing quartile of estradiol were 1.0, 1.5, 1.4, and 2.5 (95% CI = 1.5 to 4.2). Finally, in Table 5, we present the 5-year incidence of breast cancer by age and model B risk decile after adjusting for competing mortality risks [23]. The RR of breast cancer comparing women at the highest versus the lowest age-specific decile ranges from 5.0 to 8.5. For example, for 60- to 64-year-old women, the absolute 5-year risk of breast cancer is 436/10^5 (0.4%) for women in the first decile and 2,982/10^5 (3.0%) for women in the 10th decile (RR = 6.8), indicating substantial differences in absolute risk according to the model B risk equation.
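The percentages and the RR quoted for 60- to 64-year-old women can be reproduced from the two incidence figures. The helper names in this short Python check are hypothetical:

```python
def per_100k_to_percent(rate_per_100k):
    """Convert an incidence per 10^5 into a percentage."""
    return rate_per_100k / 100_000 * 100

def relative_risk(rate_high, rate_low):
    """Ratio of two incidence rates expressed on the same scale."""
    return rate_high / rate_low

first_decile, tenth_decile = 436, 2982  # 5-year incidence per 10^5, from the text

print(round(per_100k_to_percent(first_decile), 1))        # 0.4
print(round(per_100k_to_percent(tenth_decile), 1))        # 3.0
print(round(relative_risk(tenth_decile, first_decile), 1))  # 6.8
```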

Discussion
In the Nurses' Health Study, we found that estradiol levels, as imputed from a nested case-control study within the same cohort, added significantly to the Rosner and Colditz risk prediction model, which already includes most confirmed breast cancer risk factors. There was an increase of 67% in incidence per increase of one model B risk decile, holding model A risk decile constant. The increase in the C statistic was also statistically significant.
Strengths of this study include the large size of the cohort and the large number of available questionnaire-based breast cancer risk factors. Additionally, prospectively assessed estradiol levels were available in a subset of the same women. Through the use of both risk factors and case status in the linear regression, our imputed values as applied to the larger cohort accounted for both the association between hormone level and breast cancer and the correlation between hormone levels and other risk factors already in the risk prediction model.
One limitation of the study was that we did not have measured estradiol levels on all cohort members; however, this is a limitation of all large prospective studies because of the high cost of the assays. In addition, due to our desire to have consistent eligibility criteria throughout, only 164 cases and 346 controls in the nested case-control study met all criteria for the model (for example, known age at menopause). Thus, it was not possible within this small data set to provide a sufficiently precise evaluation of the Rosner and Colditz model (which contains 22 beta coefficients); in our initial attempt to evaluate the model, all beta coefficients had wide CIs. In our secondary analyses, in which we used the risk score within the nested case-control data set, plasma estradiol again contributed significantly to the model. However, the RR for estradiol in the secondary analysis for Q4 versus Q1 was 2.5 (95% CI = 1.5 to 4.2) when controlling for other risk factors using the Rosner-Colditz risk scores versus 3.3 (95% CI = 1.8 to 6.0) when individual risk factors were used within the case-control study. One would expect that the effects of other risk factors are more accurately measured by a single risk score derived from a large cohort study than by individual risk factors derived from a relatively small nested case-control data set. More generally, this may indicate that, in small case-control studies, confounding is better controlled with risk scores derived from large cohorts than with internal adjustment for individual risk factors whose regression coefficients are poorly estimated. With further follow-up and within a collaboration across several cohorts at other institutions, we plan to re-evaluate the case-control approach.
In imputing estradiol levels, only BMI, the birth index, and duration of postmenopause (in addition to case status) were significant predictors of $\log_e$(estradiol). The correlation between BMI and estradiol was expected given that aromatization of androgens to estrogens in postmenopausal women occurs in adipose tissue [24]. The association with the birth index, a summary variable representing the number and spacing of pregnancies, has been evaluated less frequently and data are not as consistent [25][26][27][28]. The association with duration of postmenopause may be due to declines in estrogen after menopause. Alcohol intake, which previously has been found to correlate with estrogen levels in several studies [29], did not contribute significantly here, which is consistent with our previous report from a subset of the current population showing no correlation with estradiol [17]. To our knowledge, no other lifestyle factors have consistently been shown to predict postmenopausal estradiol levels. The correlation between measured and imputed estradiol was 0.47.
With the inclusion of imputed estradiol, the C statistic increased from 0.635 to 0.645, which is a modest improvement but suggests reasonable discriminatory ability overall. However, the reclassification table approach (Table 4) indicated that a substantial difference in incidence is explained by including imputed estradiol, suggesting that the C statistic may be relatively insensitive to additions of single predictors to risk prediction models [21,30]. Nevertheless, the relationship of one or a combination of risk factors with disease must be very strong (RRs on the order of 100 to 200 between exposed and unexposed) to serve as a screening tool at the individual level [31][32][33]. Continued expansion of current models with other risk factors (for example, genetic factors, mammographic density, or cytology from nipple aspirate fluid [34]) may further improve the C statistic. In addition, if chemopreventive agents were developed with few risks (and an acceptable cost-benefit ratio), the need to minimize the false-positive rate would decrease, similar to the use of cholesterol-lowering agents for the prevention of heart disease.

Conclusion
In summary, our data indicate that circulating estradiol levels in postmenopausal women may contribute significantly to current risk prediction models. Further assessment of estradiol in other studies and of other biomarkers that predict risk is needed to continue to improve our ability to predict breast cancer risk and inform prevention strategies. Similar approaches can be used to incorporate other breast cancer biomarkers in overall risk prediction models.

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
BR, GAC, and SEH contributed to the design of the study and to the analysis and interpretation of the data. JDI contributed to the revision of the manuscript and added important clinical insight. All authors read and approved the final manuscript.

Table footnote (a): Five-year incidence per 10^5 person-years adjusting for competing mortality risks. RR, relative risk.