Targeted mutation detection in breast cancer using MammaSeq™

Background: Breast cancer is the most common invasive cancer among women worldwide. Next-generation sequencing (NGS) has revolutionized the study of cancer in research labs around the globe; however, genomic testing in clinical settings remains limited. Advances in sequencing reliability, pipeline analysis, accumulation of relevant data, and the reduction of costs are rapidly increasing the feasibility of NGS-based clinical decision making.
Methods: We report the development of MammaSeq, a breast cancer specific NGS panel, targeting 79 genes and 1369 mutations, optimized for use in primary and metastatic breast cancer. To validate the panel, 46 solid tumor and 14 plasma circulating cell-free DNA (cfDNA) samples were sequenced to a mean depth of 2311X and 1820X, respectively. Variants were called using Ion Torrent Suite 4.0 and annotated with CRAVAT CHASM. CNVkit was used to call copy number variants in the solid tumor cohort. The OncoKB Precision Oncology Database was used to identify clinically actionable variants. ddPCR was used to validate select cfDNA mutations.
Results: In cohorts of 46 solid tumors and 14 cfDNA samples from patients with advanced breast cancer, we identified 592 and 43 protein coding mutations, respectively. Mutations per sample in the solid tumor cohort ranged from 1 to 128 (median 3) and in the cfDNA cohort ranged from 0 to 26 (median 2.5). Copy number analysis in the solid tumor cohort identified 46 amplifications and 35 deletions. We identified 26 clinically actionable variants (levels 1-3) annotated by OncoKB, distributed across 20 out of 46 cases (40%), in the solid tumor cohort. Allele frequencies of ESR1 and FOXA1 mutations correlated with CA.27.29 levels in patient matched blood draws.
Conclusions: In solid tumor biopsies and cfDNA, MammaSeq detects clinically actionable mutations (OncoKB levels 1-3) in 22/46 (48%) solid tumors and in 4/14 (29%) of cfDNA samples. MammaSeq is a targeted panel suitable for clinically actionable mutation detection in breast cancer.


Background
Advanced breast cancer is currently incurable. Selection of systemic therapies is primarily based on clinical and histological features and molecular subtype, as defined by clinical assays [1]. Large-scale genomic studies have shed light on the heterogeneity of breast cancer and its evolution to advanced disease [2,3], and coupled with the rapid advancement of targeted therapies, highlights the need for
use in ddPCR reaction. 1.5 µl of diluted preamplified DNA was used as input for the ddPCR reaction. ddPCR was performed for ESR1-D538G, FOXA1-Y175C, and PIK3CA-H1047R mutations. Custom ddPCR assays were developed for ESR1-D538G (Integrated DNA Technologies) and FOXA1-Y175C (ThermoFisher Scientific). Sequences are described in Supplementary Table 3. PIK3CA-H1047R was analyzed using the PrimePCR ddPCR assay (Bio-Rad Laboratories) dHsaCP2000078 (PIK3CA)/dHsaCP2000077 (H1047R). Nuclease-free water and buffy coat-derived wildtype genomic DNA were included in each run as negative controls, and oligonucleotides carrying the mutation of interest or DNA from a cell line with the mutation were included as positive controls, to eliminate potential false positive mutant signals. An allele frequency of 0.1% was used as a lower limit of detection.
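To make the 0.1% detection limit concrete, the following is a minimal Python sketch of how a mutant allele frequency could be derived from droplet counts and compared against that limit. It assumes the standard Poisson correction for droplet occupancy that ddPCR quantification generally relies on; the function names and the droplet counts in the usage note are hypothetical and are not taken from this study.

```python
import math

# 0.1% lower limit of detection, as stated in the text.
LIMIT_OF_DETECTION = 0.001


def copies_per_droplet(positive: int, total: int) -> float:
    """Poisson-corrected mean number of target copies per droplet."""
    if total <= 0 or not 0 <= positive <= total:
        raise ValueError("droplet counts are inconsistent")
    if positive == total:
        raise ValueError("all droplets positive: reaction is saturated")
    return -math.log(1.0 - positive / total)


def mutant_allele_frequency(mut_positive: int, wt_positive: int, total: int) -> float:
    """Mutant copies divided by (mutant + wildtype) copies."""
    mut = copies_per_droplet(mut_positive, total)
    wt = copies_per_droplet(wt_positive, total)
    if mut + wt == 0:
        return 0.0
    return mut / (mut + wt)


def is_detected(allele_frequency: float, lod: float = LIMIT_OF_DETECTION) -> bool:
    """Hypothetical rule: call the mutation only at or above the detection limit."""
    return allele_frequency >= lod
```

With hypothetical counts of 5 mutant-positive and 4000 wildtype-positive droplets out of 15000, `mutant_allele_frequency(5, 4000, 15000)` is slightly above 0.1% and would be called; with a single mutant-positive droplet it would not.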

Statistical Analysis
All statistical analysis was performed in R 3.4.2. To determine whether there was a significant correlation between mutational burden and copy number burden, we calculated the Pearson correlation coefficient between the number of somatic mutations and the number of significant copy number changes in each sample.
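As an illustration of the correlation test described above, here is a minimal Python sketch of the Pearson coefficient and its associated t statistic. The analysis in this study was run in R (where `cor.test(x, y, method = "pearson")` performs the equivalent test); the Python functions and their names below are illustrative assumptions, not the authors' code.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r: float, n: int) -> float:
    """t statistic for H0: no correlation, with n - 2 degrees of freedom."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and |r| < 1")
    return r * math.sqrt((n - 2) / (1.0 - r * r))
```

Here `x` would hold the per-sample mutation counts and `y` the per-sample counts of significant copy number changes. The p-value follows from comparing the t statistic with a t distribution on n - 2 degrees of freedom.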

Development of MammaSeq™ Panel
To build a comprehensive list of somatic mutations in breast cancer, we combined mutation calls from primary tumors in TCGA (curated list level 2.1.0.0) and limited studies focused on metastatic breast cancer [16][17][18]. The biological function and druggability of mutated genes were investigated via Gene Ontology (GO) [19] and DGIdb (v2.0) databases [20]. The information regarding FDA approved drugs was downloaded from "https://www.fda.gov/Drugs" and added to our list. We used the following criteria to prioritize the clinically important mutated genes:
• The mutated gene is among significantly mutated genes (SMGs) in primary and metastatic samples.
• The mutated gene is clinically actionable (e.g. there is available FDA-approved drug(s) against it).
• The mutated gene is of functional importance in cancer (e.g. kinase genes were scored higher in the list).
• The mutation has been found in more than 5 primary tumors OR 2 metastatic tumors.
• The mutation has been found in both primary and metastatic lesions.
The final mutation list was then curated and narrowed down to 80 genes and 1398 mutations. Additional amplicons were added to select genes to ensure sufficient coverage of genes known to harbor functional copy-number variants. Amplicon probe design was unsuccessful for 29 mutations, including all 3 mutations in the gene HLA-A, yielding a final panel consisting of 688 amplicons targeting 1369 mutations across 79 genes. (Selected genes described in Table 2
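The prioritization criteria listed above can be sketched as a simple per-mutation checklist. This Python example is only an illustration: the text does not state how the criteria were combined or weighted, so counting the criteria met is an assumption, as is reading "2 metastatic tumors" as "at least 2". All field and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CandidateMutation:
    gene_is_smg: bool          # gene is a significantly mutated gene (SMG)
    gene_is_actionable: bool   # e.g. an FDA-approved drug targets the gene
    gene_is_functional: bool   # e.g. the gene encodes a kinase
    n_primary: int             # primary tumors carrying the mutation
    n_metastatic: int          # metastatic tumors carrying the mutation


def criteria_met(c: CandidateMutation) -> dict:
    """Evaluate each listed criterion separately (thresholds are assumptions)."""
    return {
        "smg": c.gene_is_smg,
        "actionable": c.gene_is_actionable,
        "functional": c.gene_is_functional,
        # "more than 5 primary tumors OR 2 metastatic tumors";
        # assumed here to mean > 5 primary or >= 2 metastatic.
        "recurrent": c.n_primary > 5 or c.n_metastatic >= 2,
        "primary_and_metastatic": c.n_primary > 0 and c.n_metastatic > 0,
    }


def priority_score(c: CandidateMutation) -> int:
    """Hypothetical score: the number of criteria satisfied."""
    return sum(criteria_met(c).values())
```

The panel arithmetic reported above is consistent with the exclusions: 1398 - 29 = 1369 mutations, and dropping HLA-A leaves 80 - 1 = 79 genes.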

Characterization of Genetic Variants Detected by MammaSeq in a Solid Tumor Cohort
To evaluate performance in mutation detection by the MammaSeq™ panel, sequencing was carried out on a cohort of 46 solid tumor samples, with a mean read depth of 2311X (Supplemental Figure 3). 4970 total variants (mean: 106, median: 82) were called across all patient samples. We removed identical genomic variants that were present in more than 10 samples, as these were likely to be sequencing artifacts or common SNPs. Removing non-coding and synonymous variants yielded 1433 and 901 variants, respectively. To filter out less common polymorphisms, we removed variants annotated in ExAC [12] or the 1000 Genomes [13] databases in more than 1% of the population. We removed variants with an allele frequency above 90%, as these were likely germline. Finally, to focus on high confidence mutations, we removed variants with a strand bias outside of the range of 0.5-0.6, yielding a total of 592 protein coding mutations (mean 12.9, median 3, IQR 3) (Figure 1).
Interestingly, as indicated by the difference between the mean and median, the total number of mutations was skewed toward a subset of samples (Figure 1, top panel). 408 of the 592 mutations (69%) were found in just 4 of the 46 samples (Supplemental Figure 4). These 4 samples are by definition outliers, as each has a mutation count exceeding the median plus 1.5 times the IQR. 3 of these 4 samples with high mutational burden were of triple negative subtype, the fourth being ER+/HER2+. The most commonly mutated genes were TP53 (57%) and PIK3CA (43%). We also noted common mutations in ESR1 (21%), ATM (21%) and ERBB2 (17%).
To examine CNV changes, we established a baseline for pull down and amplification efficiency by performing MammaSeq™ on normal germline DNA from 14 samples (7 patients -6 additional). CNVkit [15] was used to pool the normal samples into a single reference and then call CNVs in the solid tumor cohort (Figure 1). CNVs were identified in many common oncogenes including CCND1, MYC, FGFR1 and others. 2 of the 3 ERBB2+ samples (via clinical assay) showed CNV by MammaSeq. FGF19 and CCND1 were co-amplified in 9 of the 46 (20%) solid tumors. Both genes are located on 11q13, a band identified in GWA studies as harboring variants, including amplifications, associated with ER+ breast cancers [22]. There was no significant correlation between mutational burden and copy number burden (Pearson correlation p-value = 0.7445).
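The variant filtering steps described in this section can be summarized as a sequence of exclusion rules. The Python sketch below applies them in the order given in the text; the dictionary field names (`key`, `sample`, `coding`, `synonymous`, `exac_af`, `kg_af`, `allele_freq`, `strand_bias`) are hypothetical stand-ins for annotated variant records and do not reflect the actual pipeline or file format used in this study.

```python
from collections import Counter


def filter_variants(variants, max_recurrence=10, max_pop_freq=0.01,
                    max_allele_freq=0.90, strand_bias_range=(0.5, 0.6)):
    """Apply the filters described in the text to a list of variant dicts."""
    # Count the number of distinct samples in which each identical variant occurs.
    samples_per_variant = Counter()
    seen = set()
    for v in variants:
        pair = (v["key"], v["sample"])
        if pair not in seen:
            seen.add(pair)
            samples_per_variant[v["key"]] += 1

    low, high = strand_bias_range
    kept = []
    for v in variants:
        # Likely sequencing artifacts or common SNPs.
        if samples_per_variant[v["key"]] > max_recurrence:
            continue
        # Non-coding and synonymous variants.
        if not v["coding"] or v["synonymous"]:
            continue
        # Polymorphisms seen in more than 1% of ExAC or 1000 Genomes.
        if max(v["exac_af"], v["kg_af"]) > max_pop_freq:
            continue
        # Likely germline variants.
        if v["allele_freq"] > max_allele_freq:
            continue
        # Strand bias outside the accepted range.
        if not low <= v["strand_bias"] <= high:
            continue
        kept.append(v)
    return kept
```

Passing `max_recurrence=4` would mirror the looser recurrence cutoff a smaller cohort might need; any minimum allele frequency cutoff would have to be added as a further rule.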

Clinical Utility of Genetic Variants Detected by MammaSeq
To determine how many of the mutations have putative clinical utility, we utilized the

Characterization of Genetic Variants Detected by MammaSeq in cfDNA
To examine the potential of MammaSeq™ to detect variants in cfDNA, we sequenced 14 cfDNA samples isolated from 7 patients with metastatic disease. cfDNA samples were sequenced to a mean depth of 1810X, while matched buffy coat gDNA was sequenced to a mean depth of 425X (Supplemental Figure 4).
We applied the same filtering pipeline to the cfDNA variants as to the solid tumor variants, except that in the smaller cohort we removed all identical variants found in more than 4 samples, and lowered the minimum allele frequency to 1.0%. We identified a total of 43 somatic mutations across the 14 cfDNA samples (mean: 3.1, median 1, IQR 1.75) (Figure 3A). Similar to the solid tumor cohort, a single draw from 1 patient (CF_28-Draw 1) harbored 25 of the 43 (58%) total mutations. Using the same definition, this sample is also an outlier. Similar to the solid tumor cohort, PIK3CA and ESR1 were among the most commonly mutated genes.
Two of the identified somatic mutations (each identified in 2 draws from 1 patient) are annotated at level 3 in the OncoKB database, ESR1-D538G and PIK3CA-H1047R (Figure 3A). The ESR1 mutation was identified in 2 separate blood draws from patient CF_28 taken 13 months apart. Interestingly, the FOXA1-Y175C mutation was also identified in the same draws from patient CF_28 (Figure 3B). The allele frequencies of these mutations strongly correlate with levels of cancer antigen 27-29 (CA-27.29), indicating that the mutation frequencies are likely an indicator of disease burden. Mutations identified in all three genes (ESR1, PIK3CA, and FOXA1) were independently validated using ddPCR (Supplemental Figure 5).
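The outlier definition used for both cohorts (a mutation count exceeding the median plus 1.5 times the IQR) can be checked directly against the summary statistics reported in the text. The short Python sketch below does this; the function names are illustrative. Note that this rule differs from the more common Tukey fence, which adds 1.5 times the IQR to the third quartile rather than to the median.

```python
def outlier_threshold(median: float, iqr: float, k: float = 1.5) -> float:
    """Threshold as defined in the text: median + k * IQR."""
    return median + k * iqr


def is_outlier(count: float, median: float, iqr: float, k: float = 1.5) -> bool:
    """True when a sample's mutation count exceeds the threshold."""
    return count > outlier_threshold(median, iqr, k)


# cfDNA cohort: median 1, IQR 1.75 (values reported in the text).
cfdna_cutoff = outlier_threshold(1, 1.75)      # 3.625
# Solid tumor cohort: median 3, IQR 3 (values reported in the text).
solid_cutoff = outlier_threshold(3, 3)         # 7.5

# CF_28 Draw 1 carried 25 mutations, well above the cfDNA cutoff.
cf28_draw1_is_outlier = is_outlier(25, 1, 1.75)  # True
```

Under this definition any cfDNA sample with 4 or more mutations, and any solid tumor sample with 8 or more, is classed as an outlier.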

Discussion
Advances in the accuracy, cost, and analysis of NGS make it an ideal platform to develop diagnostics that can be used to precisely identify treatment options. MammaSeq was developed to comprehensively cover known driver mutation hotspots specifically in primary and metastatic breast cancer that would identify mutations with potential prognostic value.
Liquid biopsies are beginning to be utilized clinically after numerous proof-of-principle studies have demonstrated the potential of circulating cell-free DNA (cfDNA) for prognostication, molecular profiling, and monitoring disease burden [11, 29-33]. We have demonstrated that the MammaSeq™ panel can be used to identify mutations in cfDNA. For one patient (CF_28), we have cfDNA data from 5 blood draws taken over the course of 13 months. The sharp drop-off in the number of somatic mutations identified between the first and second draws co-occurs with a decrease in CA.27.29 levels, suggesting that the patient may have responded well to treatment, leading to disappearance of sensitive clones. In the later blood draws, we did not observe an increase in the total number of somatic mutations. However, we did find an increase in the allele frequency of the ESR1-D538G and FOXA1-Y175C mutations, which may be caused by therapeutic selection of resistant clones.
High-throughput genotyping of solid tumors and continual monitoring of disease burden through sequencing of cfDNA represent potential clinical applications for NGS technologies. It should be noted that targeted DNA sequencing panels such as MammaSeq™ are far less comprehensive than whole exome sequencing and they do not allow for evaluation of structural variants, which can often lead to gene fusions that function as drivers [34]. Nevertheless, focused panels represent cost-effective and useful alternatives to whole exome sequencing for targeted mutation detection.

Conclusions
Here we report the development of MammaSeq™, a targeted sequencing panel designed based on current knowledge of the most common, impactful, and targetable drivers of metastatic breast cancer. These data provide further evidence for the use of NGS diagnostics in the management of advanced breast cancers.

Ethics approval and consent to participate
The research was performed under the University of Pittsburgh IRB approved protocol PRO16030066.

Consent for publication
Not applicable.

Availability of data and material
Annotated, unfiltered, mutation and CNV data, along with R code related to this study, are deposited on GitHub (https://github.com/smithng1215).