Publishing raw data and real time statistical analysis on e-journals

BMJ 2001; 322 doi: (Published 3 March 2001)
Cite this as: BMJ 2001;322:530

Recent rapid responses


Hutchon [1] articulates some excellent points about the benefits accruing to the medical profession from the availability of data sets in support of published studies. However, the assertion that publishing raw data in a paper journal would usually be impractical may be misconstrued by some readers to mean that raw data sets can never be easily presented in paper journals. In fact, this is not the case, because many important endpoints are categorical. Such data can be displayed in full quite economically, in terms of space. The frequent practice of assigning arbitrary numerical scores to the categories, and then presenting only summary statistics (e.g., means and standard deviations) of these scores, should be replaced with a presentation of the data in full, organized as a contingency table.
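The point about summary statistics can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the five outcome categories, the 0 to 4 scores assigned to them, and the counts in the two groups are invented and do not come from any study. Under those assumptions, the two groups have exactly the same mean and standard deviation of scores, yet their contingency table rows are quite different, so a reader given only the summary statistics could not tell them apart.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical ordered outcome categories with arbitrary scores 0..4.
CATEGORIES = ["none", "mild", "moderate", "severe", "very severe"]
SCORES = {name: score for score, name in enumerate(CATEGORIES)}


def expand(counts):
    """Turn per-category counts into a list of individual outcomes."""
    return [cat for cat, n in zip(CATEGORIES, counts) for _ in range(n)]


def summarize(outcomes):
    """Mean and sample standard deviation of the arbitrary scores."""
    scores = [SCORES[o] for o in outcomes]
    return mean(scores), stdev(scores)


def contingency_row(outcomes):
    """Counts per category: the full data for one group."""
    tally = Counter(outcomes)
    return [tally[c] for c in CATEGORIES]


# Invented data: ten patients per group, chosen so the summaries coincide.
group_a = expand([1, 4, 0, 4, 1])
group_b = expand([2, 0, 6, 0, 2])

if __name__ == "__main__":
    print("Group", *CATEGORIES, sep="\t")
    for label, group in (("A", group_a), ("B", group_b)):
        print(label, *contingency_row(group), sep="\t")
        print("  mean, SD of scores:", summarize(group))
```

The contingency table here takes two rows of five numbers, which is hardly more space than two means and two standard deviations, and it is the only one of the two presentations from which the original data can be recovered.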

[1]. Hutchon, DJR, "Publishing Raw Data and Real Time Statistical Analysis on E-Journals", British Medical Journal 322, 530, 3/3/01.

Competing interests: None declared

Vance W Berger, National Cancer Institute

Bethesda, MD, USA

Eysenbach provides an excellent summary of the problems of publishing raw data, an aspect that I did not attempt to address in my short article. The Journal of Medical Internet Research is a journal with vision but, to be fair, the articles it has published so far are not of the kind that rest on large amounts of original raw data.

I agree, however, that there is no tradition of making raw data available in the majority of medical research, for a number of reasons. I discuss the practical reasons in my paper. I entirely agree that there needs to be a clearer code of practice for the use of published raw data by other workers. I would reiterate, however, that the vast majority of data never sees the light of day following publication. Where researchers are aware of the potential for further research, this could be intimated in the first published article, and it would then be unacceptable for anyone else to use the data in this way, at least for a reasonable length of time after publication. The original group would in any case have a considerable time advantage in working on the data. As for the possibility that further analysis of the data by others could be used to support a different view, this would simply highlight the tenuous nature of the original work.

The Journal of Medical Internet Research encourages the submission of raw data and specifies the usual spreadsheet and database file formats for this. Presumably the data would be available as an attachment to the e-paper, and the reader would need the appropriate software, in the appropriate version, to open it. The method I show requires only an up-to-date browser. Not only does this increase the accessibility of the data, it also ensures that the data and the paper are never separated.

I look forward to the development of a clear code of practice for the use of published raw data as I am sure that a collaborative goodwill can be developed in general medical research with the explosion of internet publishing.

David J R Hutchon

Competing interests: None declared

D J R Hutchon, Consultant Obstetrician and Gynaecologist

Memorial Hospital, Darlington

The Journal of Medical Internet Research has, from the beginning of its existence, explicitly invited authors to submit raw data [1]. In the two years of its existence, no author has submitted a paper with raw data. What is the reason for this hesitation? Are authors perhaps afraid that other researchers will analyze their data too thoroughly, "cream off" and publish the interesting results, and thus preclude the publication of further papers?
Clearly, laying open raw data is a double-edged sword:

On the positive side there is the expectation that it may enhance the speed and quality of research. Opening up raw data has strong parallels to the "open source" movement in the software industry [2]. "Open source software" consists of computer programs whose developers freely distribute the source code and allow its use and modification. The Open Source Initiative explains the concept as follows: "The basic idea behind open source is very simple. When programmers on the Internet can read, redistribute, and modify the source for a piece of software, it evolves. (...) We in the open-source community have learned that this rapid evolutionary process produces better software than the traditional closed model, in which only a very few programmers can see source and everybody else must blindly use an opaque block of bits."

This is a perfect justification for publishing raw data - other researchers can not only verify the author's conclusions by reanalysing the data, but also explore the data and perhaps draw new conclusions. Pre-print servers, as well as innovative e-journals, offer possibilities to share "non-finished" preliminary data, and encourage other scholars to participate in the research process.

On the negative side there are concerns about priority and authorship, which may be the main reason many authors have difficulty with this idea. Sharing the "source" is already common practice in the field of genomics, where (according to the so-called Bermuda agreement) researchers place sequence data on public and freely accessible databases as the sequences are generated. Consequently, comparisons of genomics research to the open source software movement have been drawn [3]. However, in "open source" genomics research, debates over priority, authorship and credit for analyzing data in depth have already arisen [4]. If researcher A lays open the complete dataset, and researcher B accesses the dataset and discovers a new relationship or other "publishable" results, what rights of first publication does researcher A, who published the original dataset, have? Ideally, researcher A and researcher B would publish a follow-up paper together, but one could also argue that researcher B may go ahead and publish anything he wants, with a simple reference to and acknowledgement of the results of researcher A - which may be unsatisfactory for researcher A, especially if he planned to do further analyses with the dataset. The genome research community has not yet found a clear answer to this - other than recommending a "collaborative goodwill" [4].

My personal feeling is that we need a clearer code of practice on this issue. For example, in the open source software industry, everybody who amends open source code to produce more advanced software agrees that the new software must be open source again. In biomedical research, this may be translated into a practice where researchers using raw data from other researchers should make available in their publications not only the original raw data, but also any new raw or "intermediate" data related to them. Also, one may think about a practice where authors who made available the original raw data (and also other authors who generated more results with these data) must appear as co-authors in any subsequent publications. This prospect may enhance the willingness of researchers to share their raw data.



  1. Journal of Medical Internet Research - Instructions for authors. article components [accessed March 2, 2001]
  2. Eysenbach G. The impact of preprint servers and electronic publishing on biomedical research [Guest Editorial]. Current Opinion in Immunology Oct 2000; 12(5): 499-503
  3. Russ AP, Aparicio SA, Carlton MB. Open-source work even more vital to genome project than to software [letter]. Nature 2000; 404: 809
  4. Debates over credit for the annotation of genomes [editorial]. Nature 2000; 405: 719

Competing interests: None declared

G Eysenbach, Editor

J Med Internet Res
