Data sharing in research: benefits and risks for clinicians
BMJ 2014; 348 doi: http://dx.doi.org/10.1136/bmj.g237 (Published 23 January 2014) Cite this as: BMJ 2014;348:g237
- Elliott Antman, professor of medicine, cardiovascular division, Brigham and Women’s Hospital, and associate dean for clinical/translational research, Harvard Medical School, Boston, MA
The cycle of research begins with identifying an idea and proceeds through designing a study to answer the scientific question posed, conducting the study and analyzing the findings, and publishing the results. Figure 1 shows the position of the open source, open data, and open access components of the open science concept superimposed on the cycle of research. One can readily advocate, in principle, for open data to provide doctors and their patients with all the data needed for optimum decision-making (figure 2).1
To illustrate the issues involved, however, consider the flow of data when a clinical trial is completed (figure 3). A series of raw databases is created. They are populated from the case report forms and contain patient level information on such topics as baseline demographics, concomitant medications, study drug compliance, suspected endpoint events, clinical laboratory data, and adverse events. Other raw databases might include the final adjudication by the blinded endpoint committee as to whether an endpoint occurred, genetic data, quality of life survey results, and core laboratory data. A data dictionary is developed to link the information from the raw databases to several derived databases that cover items such as endpoints, time to event, and allocation to treatment arm. The derived databases are the source from which data tables are generated for manuscripts that ultimately appear in the medical literature (and are therefore searchable in PubMed) and for reports that are sent to regulatory authorities. Posting the raw databases together with the data dictionary offers the broad clinical and research community a rich array of information that, if used correctly, can contribute to the aspirational goals embodied in the phrases “personalized medicine” and “precision medicine.”2
We must understand the complex issues involved in posting raw databases on the internet in the spirit of open data and guard against unintended consequences.3 Figure 4 shows 11 hypothetical subjects in a trial where treatments for a chronic cardiovascular condition are being compared. A solid line designates the period during which the subject is taking the blinded study drug and a dashed line designates the period during which the subject is off the blinded study drug. Endpoint events E1-E11 are shown: some occur while on drug, some while off drug, and some (E10) after the final study visit, once the database is locked. In a time to event analysis, different cohorts of subjects are analyzed depending on the lens through which one looks at the data. The denominators for these cohorts might differ and, as shown in the table, the events counted during the overall and on treatment periods will differ. Regulatory authorities in various parts of the world might differ in how they want to see the analyses performed to test for non-inferiority or superiority. Also, as illustrated at the bottom of the figure, the counting procedures for time to events during the period the subject was on study treatment require a technique called interval censoring, where subjects move into the at risk group while they are on study drug (solid lines) and out of it while they are off drug (dashed lines).
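The difference between the overall and on treatment counts can be made concrete with a minimal sketch. The four subjects below are invented for illustration and are not the 11 subjects of figure 4; the `Subject` fields, the day values, and the function names are all assumptions. The at risk count is a simplified stand-in for the counting procedure described above, not a validated survival analysis.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Subject:
    """One hypothetical trial subject (illustrative fields, not from the article)."""
    subject_id: str
    on_drug: List[Tuple[int, int]]  # half-open [start, end) day intervals on study drug
    final_visit: int                # day of the final study visit
    event_day: Optional[int]        # day of first endpoint event, or None if no event


def is_on_drug(subject: Subject, day: int) -> bool:
    """True if the subject is taking blinded study drug on the given day."""
    return any(start <= day < end for start, end in subject.on_drug)


def count_events(subjects: List[Subject]) -> Tuple[int, int]:
    """Return (overall, on_treatment) event counts.

    Events after the final study visit are counted in neither cohort.
    """
    overall = 0
    on_treatment = 0
    for s in subjects:
        if s.event_day is None or s.event_day > s.final_visit:
            continue
        overall += 1
        if is_on_drug(s, s.event_day):
            on_treatment += 1
    return overall, on_treatment


def at_risk_on_treatment(subjects: List[Subject], day: int) -> int:
    """Number of subjects on study drug and still event free at the start of `day`."""
    return sum(
        1
        for s in subjects
        if is_on_drug(s, day) and (s.event_day is None or s.event_day >= day)
    )


# Invented example data (assumption): days are counted from randomization.
SUBJECTS = [
    Subject("A", [(0, 360)], 360, 200),            # event while on drug
    Subject("B", [(0, 100), (150, 360)], 360, 120),  # event during an off drug gap
    Subject("C", [(0, 360)], 360, 400),            # event after the final visit
    Subject("D", [(0, 90)], 360, None),            # stopped drug early, no event
]

if __name__ == "__main__":
    print(count_events(SUBJECTS))
    print([at_risk_on_treatment(SUBJECTS, d) for d in (50, 120, 250)])
```

With these invented subjects the overall analysis counts two events and the on treatment analysis counts one, and the on treatment at risk group shrinks from four subjects on day 50 to two on day 120 as subjects B and D come off drug. An analyst who picks the wrong lens would report a different numerator and denominator from the same raw data.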
Figure 5 depicts the concerns that arise when analyses performed using raw datasets posted in an open data repository fail to take into account the range of biostatistical considerations noted above or are the product of a biased approach to the science at hand. Publications from such analyses become searchable in PubMed and have the potential to confuse or mislead the clinical community. Additional complexity is imposed by the need to strip out 18 personal health information identifiers before posting the raw data in a de-identified raw database. Depending on the scientific question being analyzed, this might limit the utility of such de-identified raw databases. Additionally, reverse engineering of the data could result in re-identification of the subject. Moreover, most consent forms for clinical trials at present do not indicate to research subjects that their data may be posted in an open data repository.
What might the models for open data look like? Mello and colleagues discuss four possible models for expanded access to participant level data: open access, database query, sponsor review, and learned intermediary (figure 6).4 The decision maker with regard to release of the data varies among the models, as do the process for requesting access to the data and the criteria for releasing the data. In the open access model any researcher can download the data and the only requirement is self attestation that the data will be used in a responsible fashion. The database query model requires that a request be placed to the decision maker, who determines whether the proposed use of the data is grounded in sound science and whether the proposed public health benefit from its release outweighs the potential adverse consequences to the original sponsor or investigator.5 In the sponsor review model, an interested party places a request to the trial sponsor, who adjudicates the request on the basis of sound science and benefit-risk balance, with the added criterion of whether the requesting team has the expertise to carry out the proposed analyses. I prefer the learned intermediary model, in which an independent review board reviews requests and judges them on the basis of sound science, benefit-risk, and expertise.
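The four models, as summarized in the paragraph above, can be laid out side by side in a small lookup structure. This is only a sketch of that summary: the `AccessModel` class, the criterion labels, and the `unmet_criteria` helper are assumptions made for illustration, and where the text does not name a decision maker the entry says so instead of guessing.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class AccessModel:
    """One data access model as summarized in the text (illustrative structure)."""
    name: str
    decision_maker: str
    criteria: Tuple[str, ...]


# Criterion labels are shorthand chosen for this sketch (assumption).
MODELS: Dict[str, AccessModel] = {
    m.name: m
    for m in (
        AccessModel(
            "open access",
            "not specified in the text (any researcher can download)",
            ("self attestation",),
        ),
        AccessModel(
            "database query",
            "decision maker not named in the text",
            ("sound science", "benefit-risk balance"),
        ),
        AccessModel(
            "sponsor review",
            "trial sponsor",
            ("sound science", "benefit-risk balance", "requester expertise"),
        ),
        AccessModel(
            "learned intermediary",
            "independent review board",
            ("sound science", "benefit-risk balance", "requester expertise"),
        ),
    )
}


def unmet_criteria(model_name: str, satisfied: Set[str]) -> List[str]:
    """Return the criteria of the named model that a request has not yet satisfied."""
    return [c for c in MODELS[model_name].criteria if c not in satisfied]


if __name__ == "__main__":
    request = {"sound science", "benefit-risk balance"}
    for name in MODELS:
        print(name, "->", unmet_criteria(name, request))
```

Laid out this way, the sponsor review and learned intermediary models apply the same three criteria and differ only in who makes the decision, which is the point on which the preference stated above turns.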
In addition to the objectivity that the learned intermediary model offers, elements that I see as critical as we move into an era of open data include rigorously defined data use agreements and the necessity for those requesting access to the data to submit a prespecified statistical analysis plan, just as the original investigators did before unblinding occurred. We will need a pool of highly qualified analysts who understand the issues involved (figure 4) and we need to provide a window of opportunity for the primary investigators to publish their results.
To achieve the benefits of open data, we will need to work through the potential disincentives to industry, investors, patients, and investigators who understandably have a sense of ownership of the data they worked hard to generate—sometimes over many years. This latter point is especially important to early career investigators who count on their manuscripts as the currency of academe by which their career progress is measured. Finally, none of this will really take hold until we see a sustainable business model that supports the infrastructure for open data. We look forward to the report from an Institute of Medicine committee that will develop guiding principles and a framework for the responsible sharing of clinical trial data.6
Competing interests: I have read and understood the BMJ Group policy on declaration of interests. I am a member of the TIMI Study Group, an academic research organization that receives research grants from multiple industry and governmental sources to conduct clinical trials.
Provenance and peer review: commissioned; not externally peer reviewed.
These remarks and slides are based on a presentation made by the author at the Scientific Sessions of the American Heart Association on 18 November 2013 in Dallas, Texas.