Early detection of breast cancer based on gene-expression patterns in peripheral blood cells

Introduction: Existing methods to detect breast cancer in asymptomatic patients have limitations, and there is a need to develop more accurate and convenient methods. In this study, we investigated whether early detection of breast cancer is possible by analyzing gene-expression patterns in peripheral blood cells. Methods: Using macroarrays and the nearest-shrunken-centroid method, we analyzed the expression pattern of 1,368 genes in peripheral blood cells of 24 women with breast cancer and 32 women with no signs of this disease. The results were validated using a standard leave-one-out cross-validation approach. Results: We identified a set of 37 genes that correctly predicted the diagnostic class in at least 82% of the samples. The majority of these genes had decreased expression in samples from breast cancer patients, and predominantly encoded proteins implicated in ribosome production and translation control. In contrast, the expression of some defense-related genes was increased in samples from breast cancer patients. Conclusion: The results show that a blood-based gene-expression test can be developed to detect breast cancer early in asymptomatic patients. Additional studies with a large sample size, from women both with and without the disease, are warranted to confirm or refute this finding.


Introduction
Early detection of breast cancer can improve the chances of successful treatment and recovery. To date, mammographic screening is the most reliable method to detect breast cancer in asymptomatic patients. Although highly effective, it has significant limitations, so that the development of more accurate, convenient, and objective detection methods is needed. In the absence of microcalcification, mammography often fails to detect tumors that are less than 5 mm in size, and also mammograms of women with dense breast tissue are difficult to interpret. For example, in a study of over 11,000 women with no clinical symptoms of breast cancer, the sensitivity of mammography was only 48% for the subset of women with extremely dense breasts, compared with 78% sensitivity for the entire sample of women in the study [1]. In addition, when an abnormality has been detected, further tests involving invasive steps must complement mammography to establish whether the detected abnormality is a cancer.
A vast amount of literature is already available describing the potential use of large-scale gene expression analysis in disease diagnosis, including breast cancer [2][3][4][5][6][7][8]. However, most published work with implications in cancer diagnosis has involved clinical samples comprising either diseased tissues or cells. Obtaining such samples for clinical purposes requires a prior knowledge of both their presence and their location in the body. A gene-expression-based test to detect cancers that does not rely upon the availability of tissues or cells from the diseased area has not yet been described.
It has recently been suggested that circulating leukocytes can be viewed as scouts, continuously maintaining a vigilant and comprehensive surveillance of the body for signs of infection or other threats, including cancer [9]. In line with this view, we show that peripheral blood can be used to develop a gene-expression-based test for early detection of breast cancer. The rationale for using blood cells as monitors for a malignant disease elsewhere in the body is based on the hypothesis that a malignant growth will cause characteristic changes in the biochemical environment of blood. These changes will affect the expression pattern of certain genes in blood cells.
In this pilot study, we have analyzed gene-expression patterns in peripheral blood cells of women diagnosed with breast cancer and women with no signs of this disease. We have identified a panel of genes with distinct expression patterns in cancer versus noncancer samples. The results indicate that breast cancer causes characteristic changes in the biochemical environment of blood already during early stages of disease development. Blood cells sense and respond to the change by decreasing the expression of genes involved in protein synthesis and increasing the expression of defense-related genes. We show that the expression pattern of the identified genes can be used to discriminate and predict the class of breast cancer and non-breast-cancer samples with high accuracy. Our findings should pave the way for the development of a blood-based gene-expression test for early detection of breast cancer.

Blood samples
Blood samples were collected from donors with their informed consent, with approval from the Regional Ethical Committee of Norway (331-99-99138). All donors were treated anonymously during analysis. Blood was drawn from women with a suspect initial mammogram, prior to any knowledge of whether the abnormality observed during first screening was benign or malignant. In all cases, the blood samples were drawn between 8 a.m. and 4 p.m. From each woman, 10 ml of blood was drawn by skilled personnel either in vacutainer tubes containing ethylenediaminetetraacetic acid (EDTA) as anticoagulant (Becton Dickinson, Baltimore, MD, USA) or directly in PAXgene™ tubes (PreAnalytiX, Hombrechtikon, Switzerland). Blood collected in EDTA-containing tubes was immediately stored at -80°C, while PAX tubes were left overnight at room temperature and then stored at -80°C until use.

Preparation of cDNA arrays
One thousand four hundred thirty-five cDNA clones were randomly picked from a plasmid library constructed from whole blood of 550 healthy individuals (Clontech, Palo Alto, CA, USA). Based on the sequence analysis of more than 500 cDNAs, redundancy among the randomly picked clones was estimated to be about 20%. For amplification of inserts, bacterial clones were grown in microtiter plates containing 150 µl Luria Broth media with 50 µg/ml carbenicillin, and incubated overnight with agitation at 37°C. To lyse the cells, 5 µl of each culture was diluted with 50 µl dH2O and incubated for 12 min at 95°C. Of this mixture, 2 µl were subjected to a PCR reaction using 40 µmol of 5'- and 3'-sequencing primers in the presence of 1.5 mM MgCl2. PCR reactions were performed with the following cycling protocol: 4 min at 95°C, followed by 25 cycles of 1 min at 94°C, 1 min at 60°C, and 3 min at 72°C, either in a RoboCycler Temperature Cycler (Stratagene, La Jolla, CA, USA) or a DNA Engine Dyad Peltier Thermal Cycler (MJ Research Inc, Waltham, MA, USA). The amplified products were denatured with NaOH (0.2 M, final concentration) for 30 min and spotted onto Hybond-N+ membranes (Amersham Pharmacia Biotech, Little Chalfont, UK), using a Micro-Grid II workstation in accordance with the manufacturer's instructions (BioRobotics Ltd, Cambridge, UK). The immobilized cDNAs were fixed using a UV cross-linker (Hoefer Scientific Instruments, San Francisco, CA, USA).
The printed arrays also contained controls for assessing background level, consistency, and sensitivity of the assay. These were spotted at multiple positions in addition to the 1,435 cDNAs, and included controls such as PCR mix (without any insert), controls of the SpotReport™ 10-array validation system (Stratagene), and cDNAs corresponding to constitutively expressed genes such as β-actin, γ-actin, glyceraldehyde-3-phosphate dehydrogenase, human ornithine decarboxylase, and cyclophilin.

RNA extraction, probe synthesis, and hybridization
Blood collected in EDTA tubes was thawed at 37°C and transferred to PAX tubes, and total RNA was purified in accordance with the supplier's instructions (PreAnalytiX). From blood collected directly in PAX tubes, total RNA was extracted in the tubes as above without any transfer to new tubes. Contaminating DNA was removed from the isolated RNA by DNase I treatment using a DNA-free kit (Ambion Inc, Austin, TX, USA). RNA quality was determined visually by inspecting the integrity of 28S and 18S ribosomal bands after agarose-gel electrophoresis. Only samples from which good-quality RNA was extracted were used in this study. In our experience, blood collected in EDTA tubes often resulted in poor-quality RNA, whereas blood collected in PAX tubes almost always yielded good-quality RNA. The concentration and purity of extracted RNA were determined by measuring the absorbance at 260 nm and 280 nm. From the total RNA, mRNA was isolated using Dynabeads in accordance with the supplier's instructions (Dynal AS, Oslo, Norway).
Labeling and hybridization experiments were performed in 16 batches. The number of samples assayed in each batch varied from six to nine. To minimize the noise due to batch-to-batch variation in printing, only the arrays manufactured during the same print run were used in each batch. When samples were assayed more than once (replicates), aliquots from the same mRNA pool were used for probe synthesis. For probe synthesis, aliquots of mRNA corresponding to 4 to 5 µg of total RNA were mixed together with oligo(dT)25NV (0.5 µg/µl) and mRNA spikes of the SpotReport™ 10-array validation system (10 pg; Spike 2, 1 pg), heated to 70°C, and then chilled on ice. The membranes were equilibrated in 4 × standard saline citrate (SSC) (1 × SSC: 0.15 M NaCl, 0.015 M sodium citrate, pH 7.0) for 2 hours at 30°C and prehybridized overnight at 65°C in 10 ml prehybridization solution (4 × SSC, 0.1 M NaH2PO4, 1 mM EDTA, 8% dextran sulfate, 10 × Denhardt's solution, 1% SDS). Freshly prepared probes were added to 5 ml of the same prehybridization solution, and hybridization continued overnight at 65°C. The membranes were washed at 65°C with increasing stringency (2 × 30 min each in 2 × SSC, 0.1% SDS; 1 × SSC, 0.1% SDS; 0.1 × SSC, 0.1% SDS).

Quantification of hybridization signals
The hybridized membranes were exposed to Phosphoscreens (super resolution) and an image file was generated using a PhosphoImager (Cyclone, Packard, Meriden, CT, USA). The identification and quantification of the hybridization signals, as well as subtraction of local background values, were performed using Phoretix™ software (Nonlinear Dynamics, Newcastle upon Tyne, UK). For background subtraction, the median of the line of pixels around each spot outline was subtracted from the intensity of the signals assessed in each spot.

Data analysis
From the background-subtracted data for 1,435 genes, 1.25% of the lowest and 1.25% of the highest signals were trimmed from each membrane. Since the cDNAs with signals falling within this range varied between membranes, values of 67 cDNAs in total were removed from all membranes, and the expression data for only 1,368 genes were further analyzed. The data were normalized by dividing the value of each spot by the mean of signals in each array followed by a cube-root transformation. Supplementary Fig. 1 (left panel) (Additional file 1) shows a clear batch effect in the cube-root-normalized data (similar effects were also visible in the raw data). A simple one-way analysis of variance (ANOVA) was performed to adjust for the batch effects. Supplementary Fig. 1 (right panel) (Additional file 1) shows that the systematic batch effects were removed by the ANOVA adjustment. The batch-adjusted data were then analyzed using the nearest-shrunken-centroid method [10].
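The preprocessing steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the function names are hypothetical, the array mean is taken over the retained spots, and the one-way ANOVA adjustment is assumed to amount to replacing each gene's batch mean with its grand mean.

```python
import math
from statistics import mean

def trimmed_indices(signals, frac=0.0125):
    """Indices of the lowest and highest `frac` of spots on one membrane."""
    n = len(signals)
    k = int(n * frac)
    order = sorted(range(n), key=lambda i: signals[i])
    return set(order[:k]) | set(order[n - k:])

def cube_root(x):
    """Real cube root that also accepts negative background-subtracted values."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)

def preprocess(arrays, batches, frac=0.0125):
    """arrays: one list of background-subtracted signals per membrane.
    batches: batch label of each membrane, in the same order."""
    # A spot trimmed on any membrane is removed from all membranes.
    dropped = set()
    for signals in arrays:
        dropped |= trimmed_indices(signals, frac)
    kept = [g for g in range(len(arrays[0])) if g not in dropped]

    # Divide by the array mean (assumption: mean over retained spots),
    # then apply the cube-root transformation.
    norm = []
    for signals in arrays:
        m = mean(signals[g] for g in kept)
        norm.append([cube_root(signals[g] / m) for g in kept])

    # Assumed form of the batch adjustment: for each gene, subtract the
    # batch mean and add back the grand mean.
    adjusted = [row[:] for row in norm]
    for j in range(len(kept)):
        grand = mean(row[j] for row in norm)
        by_batch = {}
        for row, b in zip(norm, batches):
            by_batch.setdefault(b, []).append(row[j])
        for i, b in enumerate(batches):
            adjusted[i][j] = norm[i][j] - mean(by_batch[b]) + grand
    return kept, adjusted
```

After this adjustment every batch has the same mean for each gene, which is the property the right panel of Supplementary Fig. 1 is described as showing.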
In this method, standard 'external' cross-validation is used to determine the optimal shrinkage threshold. This optimal threshold is then used with the full training set to construct the centroid. As a result, for each value of the threshold, the estimate of cross-validation error obtained is approximately unbiased for the true test-error rate.
The leave-one-out cross-validation approach was used in this work. The data were divided into M nonoverlapping subsets (M = number of unique blood samples present). The model was then trained M times, each time on M-1 of the subsets combined, leaving out one subset (a unique blood sample) from the training data and using only the omitted subset to compute the prediction error. The errors obtained on all omitted subsets were added together and used to compute the overall misclassification error. It is well known that leave-one-out cross-validation provides an approximately unbiased and reliable estimate of the misclassification rate that would be obtained from an independent sample of patients [11,12]. In the terminology of Ambroise and McLachlan [12], we used external cross-validation (as they recommend).
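To make the procedure concrete, the following is a minimal Python sketch of a nearest-shrunken-centroid classifier evaluated by leaving out one blood sample (with all its replicates) at a time. It is a simplified illustration, not the software used in the study. The function names, the scaling factor `mk = sqrt(1/n_k - 1/n)`, and the use of the middle value of the per-gene standard deviations as the offset `s0` are assumptions of this sketch. The centroids are refitted inside every fold, which is what makes the cross-validation "external".

```python
import math

def fit_nsc(X, y, delta):
    """Fit shrunken class centroids. X: list of expression vectors; y: labels."""
    n, p = len(X), len(X[0])
    classes = sorted(set(y))
    idx = {k: [i for i in range(n) if y[i] == k] for k in classes}
    overall = [sum(x[j] for x in X) / n for j in range(p)]
    cent = {k: [sum(X[i][j] for i in idx[k]) / len(idx[k]) for j in range(p)]
            for k in classes}
    # Pooled within-class standard deviation of each gene.
    s = []
    for j in range(p):
        ss = sum((X[i][j] - cent[k][j]) ** 2 for k in classes for i in idx[k])
        s.append(math.sqrt(ss / (n - len(classes))))
    s0 = sorted(s)[p // 2]  # assumed offset: middle value of the s_j
    shrunk = {}
    for k in classes:
        mk = math.sqrt(1.0 / len(idx[k]) - 1.0 / n)  # assumed scaling factor
        row = []
        for j in range(p):
            d = (cent[k][j] - overall[j]) / (mk * (s[j] + s0))
            d = math.copysign(max(abs(d) - delta, 0.0), d)  # soft threshold
            row.append(overall[j] + mk * (s[j] + s0) * d)
        shrunk[k] = row
    priors = {k: len(idx[k]) / n for k in classes}
    return shrunk, [sj + s0 for sj in s], priors

def predict_nsc(model, x):
    """Assign x to the class with the smallest prior-corrected scaled distance."""
    shrunk, scale, priors = model
    def score(k):
        dist = sum(((xj - cj) / sc) ** 2 for xj, cj, sc in zip(x, shrunk[k], scale))
        return dist - 2.0 * math.log(priors[k])
    return min(shrunk, key=score)

def loso_error(X, y, sample_ids, delta):
    """Leave-one-blood-sample-out error; replicates are held out together."""
    unique_ids = sorted(set(sample_ids))
    errors = 0
    for held in unique_ids:
        train = [i for i, s in enumerate(sample_ids) if s != held]
        test = [i for i, s in enumerate(sample_ids) if s == held]
        model = fit_nsc([X[i] for i in train], [y[i] for i in train], delta)
        hits = sum(predict_nsc(model, X[i]) == y[i] for i in test)
        if 2 * hits <= len(test):  # no strict majority is counted as an error
            errors += 1
    return errors / len(unique_ids)

def choose_threshold(X, y, sample_ids, deltas):
    """Pick the shrinkage threshold with the lowest cross-validated error."""
    return min(deltas, key=lambda d: loso_error(X, y, sample_ids, d))
```

Genes whose shrunken difference is zero in every class no longer influence the prediction, so raising the threshold reduces the number of genes in the model.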
The raw and the batch-adjusted data for 1,368 genes in an Excel file is provided in Supplementary

Results
We analyzed gene-expression patterns in 60 blood samples obtained from 56 different women (Table 1). The experiments were performed in 16 batches. To investigate the reproducibility of results, 13 samples from women with breast cancer and 23 samples from women with no breast cancer were analyzed in different batches using aliquots from the same mRNA pool, giving a total of 102 experimental samples.
The generated expression data were preprocessed and then analyzed by the nearest-shrunken-centroid method [10]. A standard leave-one-out cross-validation approach was used to determine the optimal shrinkage threshold. Since we had 60 unique blood samples and for some of them experiments were replicated more than once, for cross-validation the data were divided into 60 nonoverlapping subsets, where each subset represented a unique blood sample and included all the replicates present in the data set. A sample was judged as correctly classified only when a majority of members in the corresponding cross-validation segment were correctly classified. The minimum overall misclassification error was observed at a threshold value of 2.28, yielding a subset of 37 genes (Fig. 1). At this threshold, 3 of the 60 samples were judged nondecisions, because there was no majority for either the breast-cancer or non-breast-cancer class, and 10 of the remaining 57 samples were misclassified (Table 2). A detailed prediction result is presented in Table 1.
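The majority rule used to turn replicate-level predictions into one call per blood sample can be illustrated with a short Python sketch. The function names and the toy sample identifiers are hypothetical; only the rule itself (majority wins, a tie is a nondecision) is taken from the text.

```python
def majority_call(predictions):
    """Return 'C', 'N', or None (nondecision) for one blood sample.
    predictions: list of 'C'/'N' calls, one per replicate."""
    c = predictions.count("C")
    n = predictions.count("N")
    if c == n:
        return None  # no majority for either class
    return "C" if c > n else "N"

def summarize(samples):
    """samples: {sample_id: (true_label, [replicate predictions])}."""
    counts = {"correct": 0, "misclassified": 0, "nondecision": 0}
    for true_label, predictions in samples.values():
        call = majority_call(predictions)
        if call is None:
            counts["nondecision"] += 1
        elif call == true_label:
            counts["correct"] += 1
        else:
            counts["misclassified"] += 1
    return counts

# Toy data with hypothetical sample identifiers.
toy = {
    "s1": ("C", ["C", "C", "N"]),  # majority C, correct
    "s2": ("N", ["C", "N"]),       # tie, nondecision
    "s3": ("N", ["C"]),            # single replicate, misclassified
}
print(summarize(toy))
```

A sample with an even number of replicates can end in a tie, which is how nondecisions arise under this rule.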
The prediction was highly accurate for samples from women with early stages of breast cancer, stage 0 and stage I. Among the 14 samples representing early stages, there was one nondecision and 11 of 13 samples were correctly predicted. Five of seven stage II and one of two stage III samples were correctly predicted.
Most of the cancer samples (22 of 24) analyzed in this study were obtained from women who had cancer of ductal origin. One woman, the origin of whose cancer was not known, had a previous history of breast cancer and at the time of blood collection the cancer had spread to supraclavicular and infraclavicular nodes. Another sample that did not belong to the ductal group was obtained from a woman who had invasive lobular carcinoma in one breast and a tubular adenocarcinoma in the other. Unlike ductal carcinoma, which originates from cells lining ducts, lobular carcinoma originates from cells lining lobules. Both samples were incorrectly predicted. It is possible that cancer of other than ductal origin affects the expression pattern of the selected 37 genes in blood cells differently than ductal carcinomas.
Seventeen of 19 samples obtained from women with a suspect first mammogram were correctly predicted (Table 1, subgroup A2), indicating the expression profile of the selected 37 genes to be highly efficient in discriminating between cancerous and noncancerous breast abnormalities. In two samples, we were not able to make any diagnostic decision.
Among the 17 samples from women with no reported breast abnormality, 13 were correctly predicted (Table 1, subgroup A3). These included samples from breast-feeding women as well as those drawn at different times in the menstrual cycle from one woman. However, the three samples from pregnant women and a sample from a woman with acute bacterial infection at the time of blood collection were all incorrectly predicted. The woman with acute bacterial infection was, in addition, chronically infected with Epstein-Barr virus. It is known that both pregnancy and chronic infection may elicit

Figure 1
Misclassification rate as a function of threshold value and the number of genes involved. The error was calculated using the majority rule. A nondecision was counted as an error. The upper graph shows that the minimum overall misclassification error was observed at a threshold value of 2.28. The lower graph shows the profile for misclassification error for breast-cancer (C) and non-breast-cancer (N) samples as a function of threshold value and the number of genes involved.

We also calculated the misclassification error, taking an average of the class probability for each sample in all 60 cross-validation segments as compared with our previous approach in which a sample was judged as correctly classified only when a majority of members in the corresponding cross-validation segment were correctly classified. Thus, each segment represented an average class probability for each sample, and we predicted each sample to the class with the highest average probability. The main purpose of adopting this approach was to be able to make a unanimous decision with respect to class membership. The minimum error rate using the average-class approach was obtained at a threshold value of 2.42 and involved a subset of only 25 genes, giving a further reduction of 12 genes (Supplementary Fig. 2) (Additional file 4).
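The difference between the majority rule and the average-class-probability approach can be seen in a small Python sketch. The function names and the example probabilities are hypothetical, and assigning an average of exactly 0.5 to the non-cancer class is an arbitrary choice made here for illustration.

```python
def majority_call(cancer_probs):
    """Majority rule over replicates; None means nondecision.
    cancer_probs: predicted probability of the cancer class per replicate."""
    votes_c = sum(p > 0.5 for p in cancer_probs)
    votes_n = len(cancer_probs) - votes_c
    if votes_c == votes_n:
        return None
    return "C" if votes_c > votes_n else "N"

def average_probability_call(cancer_probs):
    """Average the class probability over replicates and pick the larger class."""
    avg = sum(cancer_probs) / len(cancer_probs)
    return ("C" if avg > 0.5 else "N"), avg

# Two replicates of one sample that disagree (hypothetical values).
probs = [0.9, 0.45]
print(majority_call(probs))             # tie between the two replicates
print(average_probability_call(probs))  # averaging still yields a class
```

Averaging lets a confident replicate outweigh a borderline one, so every sample receives a class even when its replicates disagree.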
Also, 10 (7 breast-cancer and 3 non-breast-cancer samples) of the 60 samples were misclassified, which is a slightly better result than that obtained with 37 genes, where there were 3 nondecisions (Supplementary Fig. 3).
Table 3 shows the shrunken t-statistic scores of the selected 37 predictive genes for comparing breast-cancer class to non-breast-cancer class, the genes in the public databases to which they show sequence similarity, and their putative biological function. The relative expression of 12 predictive genes with highest scores is presented in Fig. 2. The majority of the predictive genes (29 of 37) had a decreased expression (positive score) in the samples from breast cancer patients. The identity of predictive genes was determined by partially sequencing the corresponding spotted cDNA clones and searching for gene similarities in public databases.

Gene-expression patterns in 60 blood samples obtained from 56 different women
Sequence analysis revealed that 8 of 35 predictive genes contained redundant information. Since the arrayed cDNAs were derived from randomly picked clones from a library constructed from whole blood from 550 healthy individuals, we had expected a redundancy of about 20% among the selected genes. Of the 35 genes, 18 (51%) encoded ribosomal proteins. In comparison, the frequency of cDNAs representing ribosomal proteins was estimated to be only about 8% among the arrayed cDNAs. All genes encoding ribosomal proteins had reduced expression in samples from breast cancer patients, indicating a decrease in ribosome production in the blood cells of these patients. Also, genes encoding a translation elongation factor, eEF1 and RACK1 (receptor for activated C kinase), were expressed at a lower level in samples from cancer patients, indicating reduced protein translation activity in these samples. RACK1 plays a key role in the joining of 60S and 40S subunits into a functionally active 80S ribosome complex [13].

Table footnotes: (a) When there was no majority for either the breast-cancer or non-breast-cancer class, the prediction was regarded as a nondecision. (b) Total error rate = 0.18; 3 nondecisions. C, breast-cancer samples; N, non-breast-cancer samples.

Figure 2
Relative expression of 13 predictive genes with the highest scores in breast-cancer and non-breast-cancer samples. Red circles represent samples from women with breast cancer and green circles represent samples from women with no signs of breast cancer. The number on the upper axis represents the position ID of predictive genes in the array (Table 3).
Among the eight predictive genes with increased expression in samples from breast cancer patients, two encoded histone replacement protein H3.3, which is thought to be involved in chromatin remodelling [14], and six encoded proteins that may play a role in defense-related functions. Four genes with increased expression encoded ferritin and calgranulin B. Ferritin is involved in intracellular storage and sequestration of iron. Increased expression of ferritin has been shown to reduce the accumulation of reactive oxygen species in response to oxidant challenge in HeLa cells [15]. Calgranulin B is expressed by blood cells both during infection and during inflammation and may play a role in host defense [16]. Interferon-induced transmembrane protein 2 has been implicated in the immune response, while human granule proteoglycan peptide core is assumed to form stable complexes with proteases and other granule-localized proteins to prevent their intragranular autolysis and facilitate their concerted action extracellularly [17]. Interestingly, most predictive genes identified in this study belonged to the family of genes that exhibited altered expression in neutrophils after stimulation by nonvirulent and virulent bacterial stimuli [18,19].

Discussion
This is the first report demonstrating that breast cancer affects gene-expression patterns in peripheral blood cells during early stages of disease development. The results presented represent an initial phase in the development of a blood-based gene-expression test for breast cancer detection. A larger number of samples, from both women with and women without the disease, should be further analyzed before the clinical efficacy of our finding can be evaluated. However, the results clearly show that by analyzing the expression pattern of selected genes in blood cells, a diagnostic test for breast cancer detection can be efficiently developed.
In the present study, we examined gene-expression patterns in peripheral blood cells as a whole, rather than specific cellular subsets. It has recently been shown that individual variations in gene-expression pattern in peripheral blood could be traced to altered relative proportions of the specific blood cell subsets [9]. If there were systematic differences in the relative proportions of peripheral blood cell types in women with breast cancer and those without this disease, such differences might explain the observed gene-expression patterns. Interestingly, Whitney and colleagues [9] found that transcripts involved in protein synthesis were over-represented in lymphocytes and monocytes as compared with granulocytes. The reduced expression of transcripts involved in protein synthesis and the increased expression of transcripts involved in defense responses in breast cancer patients may reflect a systematic shift in favor of granulocytes as compared with lymphoid cells in the peripheral blood of breast cancer patients. However, to our knowledge, no such systematic shift during breast cancer development has been reported, and the subject requires further investigation. Alternatively, changes in the expression pattern of genes involved in protein synthesis, chromatin remodelling, and defense-related genes in the blood samples of breast cancer patients may indicate systematic activation of certain blood cell subsets such as neutrophils in these patients.
Our ability to correctly assign the class of samples from women with Crohn's disease, rheumatic disease, or diabetes as non-breast-cancer suggests that breast cancer affects the expression pattern of identified predictive genes differently from some of the diseases associated with anemia and chronic inflammation. The correct prediction of two samples from a woman with ductal carcinoma in situ further suggested that malignant lesions, though confined within the breast duct, may induce similar changes in the expression pattern of these genes to the changes seen during the more advanced stages of breast cancer (stages I to III). However, incorrect prediction of a sample obtained from a woman with invasive lobular carcinoma and tubular adenocarcinoma and from a woman where the cancer had spread to supraclavicular and infraclavicular nodes indicates that malignancy in itself is not a prerequisite condition for the observed changes in the expression pattern of the identified predictive genes.

Table 3
Details of the identified 37 predictive genes. The position of genes in the array is shown as well as their scores, the accession number of sequences in public databases that match them, and their known or putative cellular function. (a) The score is a shrunken t-statistic for comparing breast-cancer class to non-breast-cancer class. A positive score means that expression was greater in the noncancer sample than the cancer sample; a negative score means that expression was greater in the cancer sample than the noncancer sample. -, no information available; ROS, reactive oxygen species.
The efficient prediction of samples derived from patients whose cancer had not yet spread to lymph nodes shows that a blood-based gene-expression test can be developed for breast cancer detection in asymptomatic patients. As compared with existing methods, an accurate method for breast cancer detection based on peripheral blood as a clinical sample will be highly desirable because of the easy accessibility and the less invasive procedure for obtaining samples. The test could be integrated as an adjunct to already established methods and be used to improve their efficacy. For example, a blood-based gene-expression test could assist mammography in discriminating between benign and malignant breast abnormalities. It could become a part of routine screening programs, especially when the patient has an increased risk for breast cancer.
It is important that any test intended for use in breast cancer diagnosis has a low rate of both false positives and false negatives. Based on the expression pattern of the 37 identified genes, the prediction achieved corresponded to a false-positive rate of 0.12 and a false-negative rate of 0.26. Since the main goal of this work was to see whether the information about breast cancer is present in peripheral blood samples in the form of changed gene-expression patterns, we analyzed only a limited number of gene candidates in this study. The genes analyzed corresponded to clones that were randomly picked from a plasmid library constructed from whole blood of 550 individuals. The motivation for this approach for selecting gene candidates was based on the assumption that if the expression pattern of certain genes in blood cells is affected during early stages of breast cancer, the genes affected would most likely include ones involved in cell maintenance and general metabolism. Since such genes are expressed at a high level in a cell, they would be frequently represented in a cDNA library and selected preferentially when randomly picked. It is our view that expression techniques such as microarrays, where the expression of thousands of genes can be monitored simultaneously, can further be used to screen for better predictive genes and develop more accurate diagnostic models.
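The reported error rates can be reproduced from the counts given in the Results with the short calculation below. The breakdown is our reading of those counts, not a table from the study: 24 cancer samples with 1 nondecision, 36 non-cancer samples with 2 nondecisions, 4 misclassified non-cancer samples, and therefore 6 misclassified cancer samples out of the 10 errors in total. Excluding nondecisions from the denominators is an assumption, chosen because it matches the published figures.

```python
# Counts read from the Results (see the assumptions stated above).
cancer_total, cancer_nondecision = 24, 1
noncancer_total, noncancer_nondecision = 36, 2
total_misclassified = 10
noncancer_misclassified = 4  # three pregnant women and one acute infection
cancer_misclassified = total_misclassified - noncancer_misclassified

# Rates computed over the samples for which a decision was made.
cancer_decided = cancer_total - cancer_nondecision            # 23
noncancer_decided = noncancer_total - noncancer_nondecision   # 34

false_negative_rate = cancer_misclassified / cancer_decided
false_positive_rate = noncancer_misclassified / noncancer_decided
overall_error = total_misclassified / (cancer_decided + noncancer_decided)

print(round(false_negative_rate, 2))  # 0.26
print(round(false_positive_rate, 2))  # 0.12
print(round(overall_error, 2))        # 0.18
```

If the three nondecisions were instead counted as errors, the overall error would be 13 of 60 rather than 10 of 57, so the choice of denominator should be stated whenever these rates are quoted.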
We envisage blood-based gene-expression tests to have the potential of becoming a versatile and powerful tool for detection of disease, including other forms of cancers. As with breast cancer, other diseases may also cause characteristic changes in the biochemical environment of blood and affect the gene-expression patterns in blood cells. Specific gene-expression-based models can then be developed and used for diagnostic purposes.

Conclusion
The results presented show that breast cancer, even during early stages of disease development, affects the expression pattern of certain genes in peripheral blood cells. By identifying these genes and analyzing their expression pattern, it is possible to develop a blood-based gene-expression test for early detection of breast cancer. Additional studies with a large sample size, from women both with and without the disease, are warranted to confirm or refute this finding.