BMJ 2005;330:1128 (14 May), doi:10.1136/bmj.38422.611736.E0 (published 12 April 2005)
Jonathan D Wren, research scientist1
1 Advanced Center for Genome Technology, Department of Botany and Microbiology, University of Oklahoma, 101 David L. Boren Blvd, Rm. 2025, Norman, OK 73019, USA
Correspondence to: J D Wren Jonathan.Wren{at}OU.edu
Design The internet was searched using an application programming interface to Google, a popular and freely available search engine.
Main outcome measures The proportion of reprints of journal articles published between 1994 and 2004 from within 13 subscription based and four open access journals that could be located online at non-journal websites.
Results The probability that an article could be found online at a non-journal website correlated with the journal impact factor and the time since initial publication. Papers from higher impact journals and more recent articles were more likely to be located. On average, for the high impact journal articles published in 2003, over a third could be located at non-journal websites. Similar trends were observed for the delayed or full open access publications.
Conclusions Decentralised sharing of scientific reprints through the internet creates a degree of de facto open access that, although highly incomplete in its coverage, is none the less biased towards publications of higher popular demand.
While some scientific publications may not have been published in an open access journal, they may, none the less, be openly accessible to the public on non-journal websites.7 A better understanding is needed of how commonly scientific publications are shared online, what types of publications are shared, and whether or not this is changing.
I examined the extent of scientific file sharing, including how commonly scientific publications are shared online, whether journal readership level is a predictor, how the amount of file sharing changes with the age of the article, and to what degree open access publications are shared on non-journal websites.
Selection of journals
I chose 13 subscription based journals for analysis on the basis of their 2002 journal impact factor, which correlates with the level of readership (box).8 All had articles indexed in Medline dating back to at least 1994.
The query target
As my query target I chose PDF files rather than HTML files for several reasons. Firstly, because all necessary information (such as figures and tables) is in one file, it is easier to post a PDF than to recreate an HTML file with all its associated images. Secondly, journal reprints are typically distributed as PDF files, and readers prefer them because they can be printed without loss of formatting. Thirdly, PDFs enable specific page numbers to be used as part of the query.
Constructing Google queries to locate Medline articles online
Constructing queries with the digital object identifier (DOI) corresponding to each published article would be an ideal means of retrieving articles, as DOIs are unique. However, although DOIs are recorded by PubMed, they are not provided in the distributed version used to obtain article information, and journals still vary, both among and within themselves, in whether DOIs are included within reprints (PDFs).
I therefore had to design highly restrictive queries to send to Google. The first query term was the rarest of the authors' last names. The second was the rarest word found within the authors' affiliation field. I also used the article title in quotes so that only exact matches would be returned.
One of the most important narrowing criteria for keyword queries was the use of implicit page numbers, which are normally not present within HTML files but are present within reprints.
Queries submitted to Google were thus of the form: "< rarest author last name > < rarest affiliation word > < first implicit page number (if one exists) > < second implicit page number (if one exists) > < exact title (in quotes) of article being queried >." The end result was a list of journal articles indexed by Google and freely available online at non-journal websites (see bmj.com for details).
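The query form above can be sketched in a few lines of Python. This is only an illustration, not the code used in the study: the function name, the word-frequency table used to decide which word is "rarest," and the example inputs are all assumptions, and the implicit page numbers are taken as given rather than derived.

```python
def build_query(author_last_names, affiliation_words, implicit_pages, title, doc_freq):
    """Assemble one restrictive query string in the term order described above."""

    def rarest(words):
        # "Rarest" here means the lowest count in doc_freq, a hypothetical
        # word-frequency table; words missing from the table count as 0.
        return min(words, key=lambda w: doc_freq.get(w.lower(), 0))

    terms = [rarest(author_last_names), rarest(affiliation_words)]
    # Use at most two implicit page numbers, and none if the list is empty.
    terms += [str(page) for page in implicit_pages[:2]]
    # Quote the title so that only exact matches are returned.
    terms.append('"' + title + '"')
    return " ".join(terms)


# Hypothetical inputs, for illustration only.
example_freq = {"smith": 900, "wren": 12, "university": 5000, "oklahoma": 300, "botany": 40}
query = build_query(
    ["Smith", "Wren"],
    ["University", "Oklahoma", "Botany"],
    [1129, 1130],
    "An example article title",
    example_freq,
)
print(query)  # Wren Botany 1129 1130 "An example article title"
```

Placing the quoted title last mirrors the order given in the text; the order of terms would not normally change which documents match.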
Benchmarking query recall
I chose the Journal of Biological Chemistry (J Biol Chem) to assess how well the constructed Google queries located Medline articles online because J Biol Chem makes its articles open access at the end of the calendar year. I thought it preferable to benchmark using journals that have declared certain content freely available to the public. Additionally, J Biol Chem publishes more journal articles per year than most other journals (roughly twice that in the next highest journal in the 17 examined), offering a greater sample size.
Search engines are not comprehensive in their indexing of web accessible documents.9 Thus, before I could estimate query recall using J Biol Chem, I had to measure the number of J Biol Chem journal article PDFs indexed by Google. I downloaded the URLs corresponding to the location of full text PDF articles published between 1996 and 2003 from the J Biol Chem website and used them as the query string submitted to the Google API. I queried 45 282 PDF URLs from J Biol Chem on three separate occasions in 2004: 1 July, 2 August, and 13 September. The total number of article PDFs indexed by Google varied from 19 194 (42.4% of the total) on the July run to 25 084 (55.4%) on the August run to 16 442 (36.3%) in September. This suggested that overall statistics on query performance need to be gathered as close as possible to the time the index benchmarking took place.
To see if it was reasonable to use the rate of J Biol Chem article indexing by Google as a measure of overall recall, I ran a similar batch of queries on 9 August using 22 819 journal article PDF URLs corresponding to articles published during the same period (1996-2003) extracted directly from the Proceedings of the National Academy of Sciences (Proc Natl Acad Sci U S A) website, finding a total of 4022 (18%). Thus, while query performance versus indexed documents can be estimated, it is difficult to extrapolate these numbers to estimate the true recall of the queries (that is, what percentage of all web accessible journal articles is found).
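The indexing percentages reported above follow directly from the counts, as this short check shows. The counts are taken from the text; the function and variable names are mine.

```python
def indexed_share(indexed, total):
    """Fraction of queried PDF URLs that the search engine had indexed."""
    return indexed / total


JBC_TOTAL = 45282  # J Biol Chem PDF URLs queried on each run
jbc_runs = {"1 July": 19194, "2 August": 25084, "13 September": 16442}

for run, indexed in jbc_runs.items():
    print(f"J Biol Chem, {run}: {indexed_share(indexed, JBC_TOTAL):.1%}")

# Proc Natl Acad Sci U S A run of 9 August.
pnas_share = indexed_share(4022, 22819)
print(f"Proc Natl Acad Sci U S A, 9 August: {pnas_share:.1%}")
```

The spread between runs, from about 36% to about 55% for the same set of URLs, is what makes the date of the benchmark matter.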
I estimated precision by manually examining three sets of 50 PDFs identified as potential reprints of journal articles by the Google queries. Each set of PDFs was chosen randomly from within the entire list of queried article reprints and only PDFs found at non-journal websites were examined. A query was considered successful only if the first PDF it returned corresponded to the journal article being queried. Six documents returned either a blank page or a "404 not found" error. A total of 38/48 (79%), 37/48 (77%), and 34/48 (71%) top query results corresponded to the article being sought. The mean precision was 76% (SD 4%).
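The mean precision and its standard deviation can be reproduced from the three sets. The sketch below assumes the sample standard deviation was intended; the text does not say which form was used, and the population form also rounds to 4%.

```python
from statistics import mean, stdev

# Correct top results in each of the three sets, out of the 48 usable
# documents per set given in the text.
hits = [38, 37, 34]
USABLE_PER_SET = 48

precisions = [h / USABLE_PER_SET for h in hits]
mean_precision = mean(precisions)
sd_precision = stdev(precisions)  # sample SD (n - 1 denominator), an assumption

print(f"mean precision: {mean_precision:.0%}")  # mean precision: 76%
print(f"SD: {sd_precision:.0%}")                # SD: 4%
```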
Journal queries
I queried 48 516 journal articles indexed by Medline within the 13 subscription based journals with a publication date between January 1994 and July 2004 (fig 1). Several trends are apparent. Firstly, journals with higher impact have a larger fraction of papers that can be found online at non-journal sites. A two tailed t test comparing the areas under the curve for high, medium, and low impact journals yielded: high v medium (P < 0.02), medium v low (P < 0.07), and high v low (P < 0.0002). Secondly, for these journals, the probability a paper could be found correlates with how recently it was published. Thirdly, many of these journals showed a recent drop in online availability. This is probably artificial, however, as journal citations often appear in Medline after a paper is accepted for publication but before it appears in print (or PDF). It is also possible that online posting tends to lag publication date.
I also examined file sharing for four open access or delayed open access journals (fig 2). The free availability of these articles could obviate the need to share them on non-journal websites. On the other hand, the free availability of articles might encourage them to be copied and shared.10 I used a t test to compare the area under the curve of these four open access journals with their subscription based counterparts and found that their online availability trends were more similar to the mid-range impact factor group (P < 0.46) than the high (P < 0.003) or low (P < 0.24). As the impact factors of these open access journals are in this mid-range, this suggests that the probability that a journal article can be found on a non-journal website is less a function of copyright or ownership than it is of impact factor or journal readership levels.
One weakness of this study is that it is difficult to assess the true fraction of journal articles accessible at non-journal websites because of incomplete search engine indexing9; the reported numbers almost certainly underestimate the real numbers. This incomplete indexing is not specific to Google. I also found a similar performance with Yahoo and MetaCrawler. The relatively low proportion of indexed articles may be due partly to difficulties searching PDF content. New search engines specifically for academics, such as Google Scholar, should help researchers to locate these full text articles with greater precision, although incomplete web page indexing will probably remain an issue.
Finally, a straightforward interpretation of figure 1 suggests that publications are becoming increasingly available online as time goes by. It could be equally hypothesised, however, that most of the observed trend is due to a relatively constant rate of article posting in combination with a time dependent decay in URL availability, which has been well established not only as a general phenomenon but also in scientific publishing.13-15
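The alternative hypothesis can be made concrete with a toy model. Suppose every article has the same probability of being posted at a non-journal site, and each posted URL survives any given year with a fixed probability. Both parameter values below are hypothetical, chosen only to show the shape of the curve; they are not estimates from this study.

```python
def fraction_findable(post_prob, annual_survival, age_years):
    """Share of articles of a given age still findable under the toy model."""
    return post_prob * annual_survival ** age_years


POST_PROB = 0.4        # hypothetical: constant chance an article is posted
ANNUAL_SURVIVAL = 0.9  # hypothetical: chance a posted URL survives one year

curve = [fraction_findable(POST_PROB, ANNUAL_SURVIVAL, age) for age in range(11)]
for age, share in enumerate(curve):
    print(f"age {age:2d} years: {share:.1%} findable")
```

Under these assumptions older articles are less often findable even though the posting rate never changes, so a cross sectional snapshot alone cannot distinguish rising posting rates from URL decay.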
This is the abridged version of an article that was posted on bmj.com on 12 April 2005: http://bmj.com/cgi/doi/10.1136/bmj.38422.611736.E0
The National Library of Medicine provided electronic Medline records in XML format. I thank the API development team at Google for permitting use of their web search engine interface as well as Robert Dellavalle, Lisa Schilling, Peter Suber, and Tim Cole for helpful manuscript reviews.
Contributors: JW is the sole author.
Funding: This work was funded in part by a grant from NSF-EPSCoR (EPS-0132534).
Competing interests: None declared.
Ethical approval: Not required.