Breast cancer research output, 1945-2008: a bibliometric and density-equalizing analysis

Introduction Breast cancer is the most common form of cancer among women, with an estimated 194,280 new cases diagnosed in the United States in 2009 alone. The primary aim of this work was to provide an in-depth evaluation of research yield in breast cancer from 1945 to 2008, using large-scale data analysis, the employment of bibliometric indicators of production and quality, and density-equalizing mapping. Methods Data were retrieved from the Web of Science (WOS) Science Citation Index Expanded database; this was searched using the Boolean operator 'OR' with different terms related to breast cancer, including "breast cancer", "mammary ductal carcinoma" and "breast tumour". Data were then extracted from each file, transferred to Excel charts and visualised as diagrams. Mapping was performed as described by Groneberg-Kloft et al. in 2008. Results A total of 180,126 breast cancer-associated items were produced over the study period; these had been cited 4,136,224 times. The United States returned the greatest level of output (n = 77,101), followed by the UK (n = 18,357) and Germany (n = 12,529). International cooperation peaked in 2008, with 3,127 entries produced as a result; relationships between the United States and other countries formed the basis for the 10 most common forms of bilateral cooperation. Publications from nations with high levels of international cooperation were associated with greater average citation rates. A total of 4,096 journals published at least one item on breast cancer, although the top 50 most prolific titles together accounted for over 43% (77,517/180,126) of the total output. Conclusions Breast cancer-associated research output continues to increase annually. In an era when bibliometric indicators are increasingly being employed in performance assessment, these findings should provide useful information for those tasked with improving that performance.


Introduction
In 2009, an estimated 194,280 new cases of breast cancer were diagnosed in the United States; breast cancer was estimated to account for 27% of all new cancer cases and 15% of cancer-related mortality in women [1]. Similarly, in Europe in 2008, the disease was reckoned to account for some 28% and 17% of new cancer cases and cancer-related mortality in women, respectively [2].
The last 50 years have seen an exponential increase in scientific yield generally, and particularly in oncology; a recent report demonstrated that in January of 2009 alone there were 11,215 new cancer-related papers and 1,220 review articles indexed in PubMed [3]. The importance of quantitative and qualitative assessment of scientific output has increased in tandem with this information explosion, and these assessments now play an integral role in decisions regarding grant funding and prioritisation of resources, as exemplified by the Research Assessment Exercise in the UK [4]. Despite the disease burden of breast cancer, relatively little effort has previously been made to understand the trends emanating from the associated literature. While there has been some concentration on the bibliometrics of cancer research generally [5,6], just three publications have evaluated breast-related output specifically: Dalpe et al. focused on the identification of BRCA1 and BRCA2 in the 1990s [7], Donato et al. published an analysis of the Portuguese contribution [8], and Li and McCain focused specifically on the development of research themes in the radiological detection of breast cancer [9]. The primary aim of this present work was thus to provide an in-depth evaluation of research yield in breast cancer from 1945 to 2008, using large-scale data analysis, the employment of bibliometric indicators of production and quality, and density-equalizing mapping.

Data source
Data were retrieved from the Web of Science (WOS) Science Citation Index Expanded (SCI-Expanded) database produced by Thomson Reuters. In order to approximate the overall number of published items on breast cancer, the following search strategy was employed: TS = ((phyllodes tumo$r$) OR (Cystosarcoma Phyllo$des) OR (Malignant Cystosarcoma Phyllodes) OR (breast invasive ductal carcinoma) OR (infiltrating duct carcinoma$) OR (mammary ductal carcinoma$) OR (breast cancer) OR (breast neoplasm$) OR (breast tumo$r$) OR (human mammary neoplasm$) OR (human mammary carcinoma$)), where TS = topic search and $ = a wildcard matching zero or one character. Because this work was designed to assess overall activity in relation to breast cancer, we did not refine our search to include only certain document types, such as original articles or reviews, or to exclude others, such as letters and editorials. The time span analysed was 1945 to 2008 inclusive. The search was performed in November 2009, and thus 2009 was excluded, as database entries for that year would not have been complete at the time of the search.
Each item of information downloaded from the WOS was contained in a 'data block'. Each block was preceded by a tag which gave information about the content of the block (for example, AU = authors, TI = title, PY = publication year). Software developed at the Charité University in Berlin was then employed to parse the data. Each time it found a tag it read the associated data and saved it to an Access database; the information was later transferred to an Excel database for analysis. Published items were analysed using the citation report method as described previously [10,11]. The number of citations per year and the average number of citations per item were assessed; the latter is the sum of the times cited divided by the number of results found, and indicates the average number of citing articles for all items in the set.
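The tag-driven parsing and the citation average described above can be illustrated with a minimal Python sketch. It is not the software used in this study. It assumes a field-tagged plain-text export in which each block starts with a two-letter tag, continuation lines are indented, ER closes a record and EF closes the file; the TC (times cited) tag, the function names and the sample records are assumptions made for illustration only.

```python
from collections import defaultdict


def parse_tagged_export(text):
    """Parse a field-tagged export into a list of {tag: [values]} records."""
    records = []
    current = defaultdict(list)
    tag = None
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank separator lines
        head = line[:2]
        if head == "ER":                  # assumed end-of-record marker
            records.append(dict(current))
            current = defaultdict(list)
            tag = None
        elif head == "EF":                # assumed end-of-file marker
            break
        elif head.strip():                # a new two-letter tag opens a data block
            tag = head
            current[tag].append(line[3:].strip())
        elif tag is not None:             # indented line continues the open block
            current[tag].append(line.strip())
    return records


def average_citations_per_item(records):
    """Sum of times cited divided by the number of items found."""
    if not records:
        return 0.0
    total = sum(int(r["TC"][0]) for r in records if "TC" in r)
    return total / len(records)


# Hypothetical two-record sample, invented for this sketch.
sample = """PT J
AU Smith, J
   Doe, A
TI A hypothetical title
PY 2008
TC 12
ER

PT J
AU Roe, R
TI Another hypothetical title
PY 2007
TC 4
ER

EF
"""

records = parse_tagged_export(sample)
print(len(records), average_citations_per_item(records))
```

Storing each tag's values as a list keeps multi-line blocks, such as author lists, intact before they are written to a database table.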
Mapping was performed as described by Groneberg-Kloft et al. in 2008 [12]. Those nations which had contributed output were resized according to one of a number of different variables under study, for example the total number of items each country had published on breast cancer or the average number of citations per item from each country. As part of this resizing procedure, the area of each country was scaled in proportion to the chosen variable. Specific calculations were based on Gastner and Newman's algorithm [13], published in 2004. These calculations employ a diffusion equation in the Fourier domain borrowed from elementary physics, which allows variable resolution by tracking moving boundaries [13,14].
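The scaling step can be illustrated with a short Python sketch. It does not implement the diffusion-based algorithm of Gastner and Newman; it only computes the target areas that a density-equalizing map aims for, assuming each region's share of the total map area should equal its share of the chosen variable. The function name, region names and numbers are hypothetical.

```python
def target_areas(current_areas, values):
    """Rescale areas so that each region's share of the total area
    equals its share of the variable under study."""
    total_area = sum(current_areas.values())
    total_value = sum(values.values())
    return {
        name: total_area * values[name] / total_value
        for name in current_areas
    }


# Hypothetical regions: B is geographically larger, A publishes more.
areas = {"A": 10.0, "B": 30.0}          # original map areas (arbitrary units)
items = {"A": 3, "B": 1}                # variable under study, e.g. item counts

scaled = target_areas(areas, items)
print(scaled)
```

The total map area is preserved, so a region grows only at the expense of others with a smaller share of the variable.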
Cooperation analysis was employed to determine bilateral and multilateral cooperation between countries on breast cancer research. A cooperation network between countries was computed by checking all combinations of those countries which registered international cooperation on at least 25 items over the study period. These data were then saved to a "matrix" or two-dimensional table, and the software then read this matrix and produced a density-equalising map which graphically represented these data. The threshold of 25 articles was set to improve readability.
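A minimal Python sketch of such a pairwise cooperation count follows. It is not the software used in this study. It assumes each item is represented by the list of countries in its author addresses and that the threshold is applied to each country pair; the function names and the sample data are hypothetical, and the example uses a threshold of 2 rather than 25 so that the small sample yields a result.

```python
from collections import Counter
from itertools import combinations


def cooperation_counts(items):
    """Count, for every unordered country pair, the items they share."""
    pairs = Counter()
    for countries in items:
        unique = sorted(set(countries))   # ignore repeated addresses
        if len(unique) < 2:
            continue                      # single-country item: no cooperation
        for a, b in combinations(unique, 2):
            pairs[(a, b)] += 1
    return pairs


def apply_threshold(pairs, minimum=25):
    """Keep only pairs that cooperated on at least `minimum` items."""
    return {pair: n for pair, n in pairs.items() if n >= minimum}


# Hypothetical items, each listing the countries of its authors.
items = [
    ["USA", "Canada"],
    ["USA", "Canada", "UK"],
    ["USA"],
    ["UK", "USA"],
]

pairs = cooperation_counts(items)
network = apply_threshold(pairs, minimum=2)
print(network)
```

Sorting the country names before pairing means that each pair is stored under a single key, which corresponds to filling only one triangle of the symmetric cooperation matrix.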
Journals which had published on breast cancer were analysed relative to both the Journal Impact Factor (IF) and the recently developed Eigenfactor (EF). The former is based on two elements: the numerator, which is the number of citations in the current year to items published in the previous two years, and the denominator, which is the number of substantive articles and reviews published in the same two years [15]. The EF is calculated based on a complex algorithm that takes into account not only the quantity of citations but also their "quality" by assigning weights to the source of the citations. The full details of the algorithm can be found online [16].
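The verbal definition of the IF given above can be written as a formula. The symbols are introduced here for illustration and do not appear in the cited source: C_y(t) denotes the citations received in year y by items published in year t, and N_t the number of substantive articles and reviews published in year t.

```latex
\[
\mathrm{IF}_{y} \;=\; \frac{C_{y}(y-1) + C_{y}(y-2)}{N_{y-1} + N_{y-2}}
\]
```

For example, under this definition the 2008 IF of a journal divides the citations made in 2008 to its 2006 and 2007 items by the number of substantive articles and reviews it published in 2006 and 2007.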

Total number of published items
The number of published items on breast cancer was employed as an index of research productivity. During the period 1945 to 2008 (1974 excluded, n = 352), a total of 180,126 items were produced on this topic, as catalogued in the WOS. The earliest studies catalogued were published in 1945 (n = 17), although it was 1990 before activity began to increase considerably, year on year (Figure 1); output more than doubled from 1990 (n = 1,436) to 1992 (n = 3,342). The greatest output for any year was that for 2008 (n = 17,413).

Total number of citations
The 180,126 indexed items have been cited 4,136,224 times since 1945. Figure 1 demonstrates the parallel increase in the number of citations in conjunction with the increase in published items. Articles published in 2001 were responsible for more citations than those published in any other year (n = 274,601). The average number of citations per item was greatest in 1957, however, when 40 items were responsible for 2,767 citations, returning an average of 69.01 citations per item published. There has been a downward trend in the average number of citations per item since the millennium.

Country of origin
A total of 155 different countries contributed to the literature on breast cancer over the study period. The United States was responsible for the greatest output, returning 77,101 items. Other high-output countries included the United Kingdom (n = 18,357), Germany (n = 12,529), Italy (n = 10,828) and Japan (n = 10,109) (Table 1). Density-equalising mapping of this dataset demonstrates that a relatively small number of countries was responsible for the majority of the output (Figure 2). The Gambia had the highest average citation rate per item (67.67), followed by Kenya (40.69) and Costa Rica (39.53) (Table 1). When confined to those countries which had produced at least 30 items, however, those with the highest average citation rate per item were Iceland (56.62), Finland (35.48), Denmark (32.88) and Switzerland (31.85) (Figure 3).
Cooperation analysis was employed to assess bilateral and multilateral cooperation from 1973 to 2008; the first item in the dataset produced as a result of international cooperation was published in 1973. In total, 142 different countries had collaborated on at least one published item. International cooperation increased steadily through the study period, reaching a peak in 2008, with 3,127 entries produced as a result of cooperation. Bilateral cooperation was the most common form of cooperation (n = 19,437), followed by trilateral cooperation (n = 3,157) and quadrilateral cooperation (n = 836). Cooperation between the United States and Canada was the most common form of bilateral cooperation (n = 2,223), followed by that between the United States and the United Kingdom (n = 2,007) (Figure 4). Relationships between the United States and other countries formed the basis for the 10 most common forms of bilateral cooperation (Table 2).

Publishing journals
A total of 4,096 journals had published at least one item on breast cancer. The journals which have published most prolifically on breast cancer, led by Cancer Research (5,290 items), are listed in Table 3. The top 50 most prolific titles, representing just 1.2% of all contributing journals, together accounted for over 43% (77,517/180,126) of the total output. Thirty of these top 50 titles were in the category 'Oncology' of the Journal Citation Report; other represented subject categories included 'Surgery' (n = 5), 'Pathology' (n = 4) and 'Radiology, Nuclear Medicine and Medical Imaging' (n = 4) (Table 3).

Discussion
In his seminal work on the exponential growth of science, Little Science, Big Science, Price noted in 1963 that all of the scientific periodicals founded since the first, the Journal des Sçavans (first published in 1665), had together produced a world total of six million scientific papers over the course of the preceding 300 years [17]. [18]. The results of this present analysis have demonstrated this growth in breast cancer research specifically, with an average 15% increase in output annually since 1945, and a greater than 100% increase since the millennium alone. This compares with a recent analysis of total scientific output from PubMed, which estimated an average growth rate of 4% per year between 1957 and 2007 [4].
This analysis has employed the citation count as a proxy measure of research quality. Forming an essential component in the dialogue of medical research [19], citations are regarded as a key indicator of the relevance and importance of a published item. We have shown a parallel increase in citation count with the number of breast cancer-related articles, a not unexpected finding recently mirrored in analyses of scientific output on scoliosis [20] and asthma [10]. The average number of citations per item was highest in 1957, although this was thanks largely to the citation classic by Bloom and Richardson in which they outlined their system for the histological grading of breast cancer and its association with prognosis [21]; it has since been cited 2,259 times. To put this figure into perspective, Garfield noted in 2006 that of 38 million items cited from 1900 to 2005, only 0.5% were cited more than 200 times [15]. Although there has been a decreasing trend in the average number of citations per item since the mid-1990s, it is difficult to draw firm conclusions on the relevance of this finding; it may be explained by the sharp increase in the number of outputs in the intervening years, or indeed by the time-lag associated with citation analysis, which results in an inherent bias towards older publications.
This analysis has demonstrated the leading role which the United States plays in breast cancer research, a finding previously noted in other scientific disciplines [22,23]. This is not surprising given the enormous amount of money spent on the management of breast cancer there annually; it has been estimated that new cases of breast cancer diagnosed globally in 2009 alone will have cost $28 billion, of which $16 billion was spent in the United States [24].
In addition to being the single largest contributor to the literature on breast cancer, the United States has further played a key role in fostering international cooperation, in particular with its neighbour Canada, but also with many European nations, including Germany, the United Kingdom and Italy. The large number of nations involved in breast cancer research reflects its global burden. That said, the map of global production shown in Figure 2 clearly demonstrates the dramatic underrepresentation of South America, Africa, and to a lesser extent, Asia. Given that the majority of the predicted 26% increase in the incidence of breast cancer by 2020 will occur in the developing world [24], there needs to be a concerted effort to further involve these areas in future research initiatives, particularly focusing on how the cost-effective diagnosis and management of breast cancer can be delivered with levels of efficacy similar to those presently seen in Europe and the United States.
The quality of breast-related output from both the United States and the United Kingdom was high, as measured using the average citation rate per published item as a proxy for quality. In addition, the contribution of many smaller countries, including Iceland, Finland, Switzerland and Denmark, was of high quality, with all four associated with impressive average citation rates. Interestingly, all of these countries collaborated internationally in a high proportion of their output (Figure 4).
Our finding that breast cancer-associated research has been published across more than 4,000 journals reiterates the view that it is now impossible for those working in breast cancer to ensure that they appraise all of the relevant literature. Our work has, however, identified a core set of journals publishing on breast cancer, with the top 50 accounting for 43% of the total output. The median IF and EF of these titles compare particularly well with the median values for all 143 journals in the JCR category 'Oncology' in 2008 (2.66 and 0.01, respectively), and point to the quality of output in this subject area.
There are a number of limitations to this work. Output from 1974 (n = 352, 0.2% of total output) was accidentally excluded during data collection, and hence, was not included in the subsequent analysis. In addition, this study has focused on entries contained in the Web of Science only, and it should be noted that the employment of other databases including PubMed and Scopus may have yielded slightly different results. That said, Web of Science covers the oldest publications with archived records back to 1900 [25], and should provide an accurate overview of output over the entire study period. Finally, while we have provided an overview of geographic output on breast cancer, we have not related our findings to underlying socio-economic and demographic variables, and clearly this would be an interesting future avenue for investigation.

Conclusions
This work represents the first bibliometric assessment of research quantity and quality in breast cancer-associated literature. The results have demonstrated the ongoing expansion of that literature, while also identifying the key nations and journals involved in its production over the past half-century. In an era when bibliometric indicators are increasingly being employed in the assessment of individual, institutional and national performance, these findings should provide useful information for those tasked with improving that performance.