Study population
In the Netherlands, women aged 50–75 years have been invited for mammographic breast cancer screening every other year since 1989. Approximately 80% of the invited women attend the screening program [25]. From 2003 onwards, the transition from analog to digital mammography took place gradually, starting at one screening unit (the Preventicon screening unit, Utrecht, The Netherlands), and was completed in 2010. For this study, we included all women who had one or more digital mammographic screening examinations at the Preventicon screening unit between 2003 and 2011. There are five screening regions in the Netherlands, all of which follow identical procedures; the Preventicon screening unit is part of the Foundation of Population Screening Mid-West region. By participating in the Dutch breast cancer screening program, women consent to their data being used for evaluation and improvement of the screening, unless they have stated otherwise.
The research ethics committee of the Radboud University Nijmegen Medical Centre declared that this study does not fall within the remit of the Medical Research Involving Human Subjects Act. Therefore, this study could be carried out (in The Netherlands) without approval by an accredited research ethics committee.
Data collection
We selected each woman’s first unprocessed (raw) digital mammography examination. All mammograms were acquired on Lorad Selenia DM systems (Hologic, Danbury, CT, USA). At the first examination in the screening program, both craniocaudal (CC) and mediolateral oblique (MLO) views are always acquired. In subsequent rounds the MLO is the standard view, and an additional CC view is taken only when indicated (e.g., visible abnormality, high mammographic density). Follow-up information was obtained through the screening registration system and through linkage with the Netherlands Cancer Registry, which provided complete information on both screen-detected and interval breast cancers. Screen-detected breast cancers were defined as breast cancers diagnosed on the basis of diagnostic work-up of an “abnormal” screening examination. Interval breast cancers were defined as breast cancers diagnosed within 24 months after a screening examination that did not lead to recall (negative mammogram), and before the next scheduled screening examination. The median time between the first available digital screening mammogram and breast cancer diagnosis was 3.7 years (IQR 2.0–4.3, minimum 0.1 years, maximum 7.9 years) for screen-detected breast cancers and 2.2 years (IQR 1.1–3.9, minimum 0.1 years, maximum 9.6 years) for interval cancers. Both invasive breast cancers and ductal carcinoma in situ were included in the analyses.
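For illustration only, the following sketch shows how these two outcome definitions could be operationalized in code; the function and its inputs are hypothetical and do not represent the registry linkage used in the study.

```python
from datetime import date
from typing import Optional

def classify_cancer(diagnosis_date: date,
                    last_screen_date: date,
                    last_screen_recalled: bool,
                    next_scheduled_screen: Optional[date]) -> str:
    """Classify a diagnosed breast cancer as screen-detected or interval,
    following the definitions above (illustrative; inputs are hypothetical)."""
    if last_screen_recalled:
        # Diagnosed through work-up of an "abnormal" (recalled) screening examination
        return "screen-detected"
    months_since_screen = ((diagnosis_date.year - last_screen_date.year) * 12
                           + (diagnosis_date.month - last_screen_date.month))
    before_next_round = (next_scheduled_screen is None
                         or diagnosis_date < next_scheduled_screen)
    if months_since_screen <= 24 and before_next_round:
        # Diagnosed within 24 months of a negative screen and before the next round
        return "interval"
    return "not screening-related"

# Example: cancer found 14 months after a negative screen, before the next round
print(classify_cancer(date(2010, 3, 1), date(2009, 1, 15), False, date(2011, 1, 15)))
```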
We excluded all screen-detected breast cancer cases diagnosed on the basis of the first digital screening examination, to minimize the number of breast cancer cases in the study that were diagnosed from the same mammogram as was used for breast density and texture score assessment.
The texture measure used in this study was previously developed using a selection of women with and without breast cancer who had one or more digital mammographic screening examinations at the Preventicon screening unit between 2003 and 2011 [24]. Therefore, we also excluded all women whose mammograms were used to train the texture measure used in this study, to ensure an independent validation.
The data were obtained through the registry of a breast cancer screening program in which mammograms are routinely collected. Therefore, besides age, no additional information was available about the women.
Volumetric mammographic density assessment
Absolute dense volume (DV) and percentage dense volume (PDV) were automatically assessed from the unprocessed mammograms of the left and right breasts, using Volpara Density (version 1.5.0, Volpara Health Technologies, Wellington, New Zealand) [26]. We used the mean of the measurements from the left and right MLO views, since the MLO is the routinely acquired view and CC views were not available for all women. In this way, we ensured that mammographic density was assessed in the same way for all participants.
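As a minimal sketch of this averaging step (assuming a hypothetical per-view table of density measurements; this is not the Volpara software or its export format), the per-woman density measures could be derived as follows:

```python
import pandas as pd

# Hypothetical per-view density output: one row per mammographic view
views = pd.DataFrame({
    "woman_id":             [1, 1, 2, 2],
    "view":                 ["MLO", "MLO", "MLO", "MLO"],
    "laterality":           ["L", "R", "L", "R"],
    "dense_volume_cm3":     [55.2, 60.8, 30.1, 28.7],
    "percent_dense_volume": [12.5, 13.9, 6.2, 5.8],
})

# Keep the MLO views and average left and right per woman, mirroring the
# use of the mean of the two MLO views described above
mlo = views[views["view"] == "MLO"]
per_woman = (mlo.groupby("woman_id")[["dense_volume_cm3", "percent_dense_volume"]]
                .mean()
                .rename(columns={"dense_volume_cm3": "DV",
                                 "percent_dense_volume": "PDV"}))
print(per_woman)
```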
Mammographic texture assessment
The deep-learning-based mammographic texture risk score was calculated from unprocessed mammograms using prototype software by Biomediq A/S, as described by Kallenberg et al. [24]. The deep-learning framework was a five-layer convolutional neural network that, once trained as described below, maps mammographic patches to a cancer risk score. The first four layers consisted of three convolutional layers and one pooling layer; these layers learned mammographic features (mammographic structure/texture) of decreasing size and increasing level of abstraction. The initial three layers were trained in an unsupervised fashion: they learned features that describe mammographic structure independently of cancer risk. The final two layers (the last convolutional layer and the fifth, softmax classification layer) were trained in a supervised fashion, using the features encoded in the previous layers as the starting point. The weights of these final layers were optimized to distinguish between patches from breasts without a cancer diagnosis (at both baseline and follow up) and patches from breasts that were without diagnosis at baseline but were diagnosed with breast cancer at follow up. In effect, the network was trained to score cancer risk, expressed as the probability that a patch originates from a breast with cancer-prone mammographic texture/structure. Further technical and mathematical details of the texture methodology can be found in the article by Kallenberg et al. [24].

The training dataset described below corresponds to the dataset named “Dutch Breast Cancer Screening Dataset” in that same article [24]. For the purposes of this study, the deep-learning framework was trained on a subset of the Preventicon data consisting of 394 cancer cases and 1182 healthy controls (three controls per case, matched on age and acquisition date). The cancer cases comprised 285 screen-detected cancers and 109 interval cancers. Screen-detected cancer cases were represented by the contralateral view at the time of diagnosis; interval cancer cases were represented by the contralateral view from the screening visit immediately preceding diagnosis. The laterality distribution of the controls was sampled to match that of the cases.
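The prototype software itself is not publicly available, but the layer structure described above can be illustrated with a small PyTorch sketch. The patch size, channel counts, kernel sizes, and exact layer ordering below are assumptions made for illustration and do not reproduce the network of Kallenberg et al. [24].

```python
import torch
import torch.nn as nn

class TexturePatchScorer(nn.Module):
    """Illustrative five-layer patch scorer: three convolutional layers and one
    pooling layer that encode mammographic structure, followed by a softmax
    classification layer. All sizes are assumed, not those of the prototype."""
    def __init__(self):
        super().__init__()
        # Layers 1-3: feature extraction (trained unsupervised in the prototype)
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Layers 4-5: supervised part (last convolutional layer + softmax classifier)
        self.classifier = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 2),   # cancer-prone vs. cancer-free patch
        )

    def forward(self, patch: torch.Tensor) -> torch.Tensor:
        logits = self.classifier(self.features(patch))
        # Risk score: probability that the patch comes from a cancer-prone breast
        return torch.softmax(logits, dim=1)[:, 1]

# Example: score a batch of 500 randomly sampled 64 x 64 patches from one view
scores = TexturePatchScorer()(torch.randn(500, 1, 64, 64))
print(scores.shape)  # torch.Size([500])
```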
The left and right MLO views in the remaining, independent validation subset of the Preventicon cohort were scored for texture-based risk using the framework described above. The texture score for a single screening visit was obtained as the average of the left and right MLO texture risk scores. Both the software and the operator were fully blinded to cancer outcome during scoring. For each MLO view, the software extracted 500 randomly sampled patches within the fully compressed part of the breast tissue. To identify the fully compressed part of the breast, the geometry of the uncompressed breast was modelled as a semi-sphere, as proposed by Highnam and Brady [27]. According to this model, the boundary between the fully compressed and the uncompressed part of the breast lies at those locations within the breast where the distance to the skin edge equals half the height of the breast. Each patch was scored for cancer risk using the trained deep-learning framework described above, and the texture score for a single view was obtained as the average of the 500 patch-based risk scores.
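The aggregation from patch scores to view and visit scores, and the semi-sphere criterion for the fully compressed region, can be sketched as follows; the inputs are placeholders for quantities the software derives from the segmented breast, and the random patch scores serve only as an example.

```python
import numpy as np

def in_fully_compressed_region(dist_to_skin_mm: np.ndarray,
                               breast_height_mm: float) -> np.ndarray:
    """A location lies in the fully compressed part of the breast when its
    distance to the skin edge is at least half the breast height, following
    the semi-sphere model of the uncompressed breast [27]. The inputs are
    placeholders for quantities derived from the segmented mammogram."""
    return dist_to_skin_mm >= 0.5 * breast_height_mm

def view_texture_score(patch_scores: np.ndarray) -> float:
    """Texture score of one MLO view: the mean of its 500 patch-level risk scores."""
    return float(np.mean(patch_scores))

def visit_texture_score(left_mlo: float, right_mlo: float) -> float:
    """Texture score of one screening visit: the mean of the left and right MLO scores."""
    return (left_mlo + right_mlo) / 2.0

# Example with random patch scores standing in for the deep-learning output
rng = np.random.default_rng(0)
left_patches, right_patches = rng.uniform(0, 1, 500), rng.uniform(0, 1, 500)
print(visit_texture_score(view_texture_score(left_patches),
                          view_texture_score(right_patches)))
```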
An example of mammograms from each of the four combinations of high or low texture score with high or low percentage dense volume is given in Fig. 1. The stronger textural properties of mammograms with high texture scores are clear in both density categories.
Statistical analysis
Age and the breast measures (mammographic density and mammographic texture scores) were determined for the first available unprocessed digital mammography examination of each woman. In addition, the number of digital screening rounds and the number of follow-up years were determined. We described the study population by the median and interquartile range (IQR) for each of these characteristics and tested whether the characteristics differed significantly between breast cancer cases and non-cases, using the two-sample t test for normally distributed measures and the Mann-Whitney U test for non-normally distributed measures. Breast density measures were transformed using the natural logarithm (ln) to obtain approximately normal distributions. Pearson correlation coefficients were calculated to assess the correlations between the breast measures, and between age and the breast measures.
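For illustration, the sketch below reproduces this descriptive workflow on simulated data with hypothetical column names; it is not the analysis code used in the study, which was run in SPSS and R.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Small simulated dataset with hypothetical column names (not the study data)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "case":    rng.binomial(1, 0.1, n),                       # 1 = breast cancer case
    "age":     rng.uniform(50, 75, n),
    "dv":      rng.lognormal(mean=4.0, sigma=0.5, size=n),    # dense volume, cm3
    "texture": rng.normal(0, 1, n),                           # texture risk score
})
cases, controls = df[df["case"] == 1], df[df["case"] == 0]

# Medians and IQRs per group (here for age; repeated for the other characteristics)
print(df.groupby("case")["age"].quantile([0.25, 0.5, 0.75]))

# Mann-Whitney U test for non-normally distributed measures
print(stats.mannwhitneyu(cases["texture"], controls["texture"]))

# ln-transform the density measure to approximate normality, then two-sample t test
print(stats.ttest_ind(np.log(cases["dv"]), np.log(controls["dv"])))

# Pearson correlations between (ln-transformed) density and texture, and with age
print(stats.pearsonr(np.log(df["dv"]), df["texture"]))
print(stats.pearsonr(df["age"], np.log(df["dv"])))
```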
Associations of the continuous measures (per standard deviation (SD) increase, using the normally distributed measures) and of quartiles of the density and texture scores with breast cancer risk were determined using Cox proportional hazards analyses. We calculated hazard ratios (HR) and their 95% confidence intervals (95% CI). Age was used as the underlying time scale. Entry time was defined as the subject’s age at the time of the first available digital mammogram. Exit time was defined as the first of the following: (1) age at breast cancer diagnosis (event), (2) age at death (censoring), or (3) age 2 years after the last digital mammogram performed before 1 January 2012 (censoring).
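The structure of this age-as-time-scale Cox model with delayed entry can be sketched with the Python lifelines package and simulated data; the study analyses themselves were run in SPSS and R, so the code below illustrates only the model setup, with hypothetical column names.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Simulated analysis table with hypothetical column names (one row per woman)
rng = np.random.default_rng(1)
n = 1000
entry_age = rng.uniform(50, 74, n)                    # age at first digital mammogram
follow_up = rng.uniform(0.5, 8.0, n)                  # years until event or censoring
texture_sd = rng.normal(0, 1, n)                      # texture score, per SD increase
event = rng.binomial(1, 0.02 + 0.02 * (texture_sd > 0), n)

df = pd.DataFrame({
    "entry_age": entry_age,
    "exit_age": entry_age + follow_up,                # diagnosis, death, or censoring
    "event": event,                                   # 1 = breast cancer diagnosis
    "texture_sd": texture_sd,
})

# Cox model with age as the underlying time scale: women enter the risk set at
# entry_age (delayed entry / left truncation) and leave it at exit_age
cph = CoxPHFitter()
cph.fit(df, duration_col="exit_age", event_col="event",
        entry_col="entry_age", formula="texture_sd")
cph.print_summary()   # HR per SD increase with its 95% CI
```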
We aimed to determine whether the previously described texture score is associated with breast cancer risk and has additional value, beyond volumetric mammographic density measures, in distinguishing future breast cancer cases from non-cases. To study this, we constructed several models. First, three Cox proportional hazards models were fitted with dense volume, percentage dense volume, or texture as the determinant (models 1, 2, and 3, respectively). With these models we could determine the ability of a density or texture measure alone to separate breast cancer cases from non-cases. Thereafter, we constructed two additional Cox proportional hazards models: one containing both the dense volume and texture determinants (model 1a) and one containing both the percentage dense volume and texture determinants (model 2a). To determine the ability of the models to discriminate between breast cancer cases and non-cases, concordance indices (c-indices) were obtained for all models. The c-index can be interpreted as the fraction of case/non-case pairs for which the model correctly identifies the breast cancer case. Across 2000 bootstrap samples, the c-indices of the models containing only a breast density measure (model 1 or 2) were compared with those of the models containing both a density measure and the texture score (model 1a or 2a), to test whether the differences in c-indices were statistically significant.
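The bootstrap comparison of c-indices could, for example, be implemented along the following lines. For simplicity, this sketch resamples pre-computed model risk scores rather than refitting the models within each bootstrap sample, and all input names are hypothetical; only the number of bootstrap samples (2000) is taken from the text.

```python
import numpy as np
from lifelines.utils import concordance_index

def bootstrap_cindex_difference(exit_age, event, risk_density, risk_density_texture,
                                n_boot=2000, seed=0):
    """Percentile 95% CI for the difference in c-index between a density-only
    model and a density + texture model. Inputs are per-woman exit ages, event
    indicators, and the predicted risks from the two models (numpy arrays)."""
    rng = np.random.default_rng(seed)
    n = len(event)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)     # resample women with replacement
        # concordance_index expects higher scores for longer survival,
        # so the predicted risks enter with a negative sign
        c_density = concordance_index(exit_age[idx], -risk_density[idx], event[idx])
        c_both = concordance_index(exit_age[idx], -risk_density_texture[idx], event[idx])
        diffs[b] = c_both - c_density
    # The gain from adding texture is taken as significant if the CI excludes 0
    return np.percentile(diffs, [2.5, 97.5])
```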
As the density and texture scores were expected to be strongly correlated, we prevented multicollinearity in models 1a and 2a by including the residuals of the texture scores regressed on breast density, instead of the texture score itself. This “residual method” is often used in the field of nutritional epidemiology [28]. Residuals were obtained using linear regression analysis; as expected by construction, there was no correlation between the residuals and breast density.
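The residual method amounts to an ordinary least-squares regression of the texture score on breast density, after which the residuals, rather than the raw texture score, enter the Cox model alongside density. A minimal sketch on simulated data (variable names are hypothetical):

```python
import numpy as np

def texture_residuals(ln_density: np.ndarray, texture: np.ndarray) -> np.ndarray:
    """Residuals of the texture score regressed on ln-transformed density (OLS).
    By construction the residuals are uncorrelated with ln_density, so they can
    be entered into the Cox model alongside density without multicollinearity."""
    X = np.column_stack([np.ones_like(ln_density), ln_density])
    beta, *_ = np.linalg.lstsq(X, texture, rcond=None)
    return texture - X @ beta

# Example with simulated, positively correlated measures
rng = np.random.default_rng(2)
ln_dv = rng.normal(4, 0.5, 1000)
texture = 0.4 * ln_dv + rng.normal(0, 0.3, 1000)
resid = texture_residuals(ln_dv, texture)
print(np.corrcoef(ln_dv, resid)[0, 1])   # approximately 0, as expected
```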
Additionally, two extra Cox proportional hazards models were constructed in which the residuals of a breast density measure (dense volume for model 3a and percentage dense volume for model 3b) regressed on texture were combined with the texture score. With these models, we could determine whether the breast density measures added discriminatory ability beyond the texture score alone.
The proportional hazards assumption was evaluated using Schoenfeld residual plots and log-minus-log plots; no violations were observed. To examine the presence of a linear trend in HRs across the quartiles of the breast measures, the quartiles were added to the models as continuous variables.
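These two checks can be sketched as follows, assuming a data frame laid out as in the Cox model sketch above; the Schoenfeld-residual-based test in lifelines stands in for the residual and log-minus-log plots used in the study, and the quartile number (1–4) is entered as a continuous covariate for the trend test.

```python
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.statistics import proportional_hazard_test

def check_ph_and_trend(df: pd.DataFrame) -> None:
    """df is assumed to contain entry_age, exit_age, event and texture columns,
    as in the simulated table of the Cox model sketch above."""
    # Quartiles of the texture score, entered as a continuous 1-4 variable
    df = df.assign(texture_q=pd.qcut(df["texture"], 4, labels=False) + 1)

    cph = CoxPHFitter()
    cph.fit(df, duration_col="exit_age", event_col="event",
            entry_col="entry_age", formula="texture_q")

    # Schoenfeld-residual-based test of the proportional hazards assumption
    proportional_hazard_test(cph, df, time_transform="rank").print_summary()

    # HR per one-quartile increase: the test for a linear trend across quartiles
    cph.print_summary()
```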
Finally, in a secondary analysis, we determined the associations between the breast measures (dense volume, percentage dense volume, and texture) and breast cancer separately for screen-detected and interval breast cancers. Statistical analyses were performed using SPSS version 22 and R version 3.2.0.