Why researchers should share their analytic code
BMJ 2019; 367 doi: https://doi.org/10.1136/bmj.l6365 (Published 21 November 2019) Cite this as: BMJ 2019;367:l6365
- Ben Goldacre, director,
- Caroline E Morton, researcher,
- Nicholas J DeVito, researcher
- Correspondence to: B Goldacre
JAMA recently retracted and replaced an important clinical trial report from 2018 after a serious programming error was discovered.1 Quantitative medical research relies on analytic scripts: a sequence of commands issued to extract, reshape, manage, and then analyse data. In this case, there was a catastrophe. The “randomisation assignment” variable coded the control group “1” and the intervention group “2”; this had to be converted to “0” and “1” for the statistical analysis to run, but an incorrect conversion command resulted in the intervention and control groups being mislabelled. The results of the trial were almost completely reversed.
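To make the kind of recoding step described above concrete, here is a minimal sketch in Python. It is not the trial's actual code, which we have not reproduced; the function names, the example data, and the "flawed" conversion are all hypothetical illustrations. The idea is to state the mapping explicitly, reject unexpected codes, and then cross-check the result against the raw data with an independently written rule.

```python
# Hypothetical illustration only: raw data code the control group as 1 and
# the intervention group as 2; the analysis needs 0 (control) and 1 (intervention).
RAW_TO_ANALYSIS = {1: 0, 2: 1}


def recode_arm(raw_codes):
    """Map raw arm codes to analysis codes, failing loudly on unexpected values."""
    unexpected = set(raw_codes) - set(RAW_TO_ANALYSIS)
    if unexpected:
        raise ValueError(f"unexpected arm codes: {sorted(unexpected)}")
    return [RAW_TO_ANALYSIS[code] for code in raw_codes]


def labels_match(raw_codes, recoded):
    """Independent check: a participant is 'intervention' (1) after recoding
    if and only if their raw code was 2."""
    return all((raw == 2) == (new == 1) for raw, new in zip(raw_codes, recoded))


raw = [1, 2, 2, 1, 2]            # made-up example data
arm = recode_arm(raw)
print(arm)                       # [0, 1, 1, 0, 1]
print(labels_match(raw, arm))    # True

# An assumed example of a plausible mistake: "2 - code" silently swaps the groups.
flawed = [2 - code for code in raw]
print(labels_match(raw, flawed))  # False
```

A check of this kind costs a few lines, and when the script is shared a reviewer can see at a glance both the intended mapping and the safeguard around it.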
It is laudable that this single error was acknowledged and corrected with a retraction. However, neither the retraction notice nor the accompanying editorial acknowledged the systemic problems and opportunities exemplified by this case.12 Sharing analytic code is increasingly the norm across many fields.345 It provides an unambiguous record of the analytical methods used, aiding reproducibility.67 It also allows expert peer reviewers and the wider research community to audit the code, which increases the likelihood of errors being found and corrected.89
That benefit is exemplified by this retracted trial, and not only for the catastrophic central error leading to the retraction. While reviewing their code to correct their major error, the research team discovered at least two other areas of erroneous code (in the commands to impute missing values and to aggregate data into summary variables).1 However, error checking is only one of the benefits that come from sharing code; more broadly, sharing code under open licence for reuse by others generates an archive of clinically relevant code that can help avoid duplicated effort and accelerate innovation.
Some researchers object to this form of transparency. In our view these objections are either misplaced or fail to proportionately reflect the needs of patients and the scientific community. Sharing code, unlike sharing individual patient data, will typically present no privacy issues. We have been told that sharing code is difficult because the scripts are long, covering “many pages of information.”10 But there are numerous free, open platforms to share version controlled code,1112 and the most commonly used, GitHub,13 has a limit of 100 GB for each repository. For context, our group’s OpenPrescribing.net service is a substantial software project with 130 000 users a year: the whole project is over 30 000 lines of code, which is at least one order of magnitude bigger than any single epidemiological analysis script, but this equates to only 1.5 MB of storage.
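The storage figures above can be checked with simple arithmetic. The sketch below uses only the numbers quoted in the text and assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes); the variable names are our own.

```python
# Figures quoted in the text; decimal units are an assumption.
BYTES_PER_MB = 10**6
BYTES_PER_GB = 10**9

project_lines = 30_000                    # lines of code in the whole project
project_bytes = 1.5 * BYTES_PER_MB        # about 1.5 MB of storage
repo_limit_bytes = 100 * BYTES_PER_GB     # per-repository limit cited above

bytes_per_line = project_bytes / project_lines
projects_per_repo = repo_limit_bytes / project_bytes

print(bytes_per_line)            # 50.0 bytes per line on average
print(round(projects_per_repo))  # 66667 projects of this size per repository
```

On these figures, a single analysis script an order of magnitude smaller than the whole project would occupy only a tiny fraction of the available space.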
Another objection is the time needed to create perfectly curated code, but there is no need for code to be converted into generalisable “libraries”; simply sharing practical working code is a good start.14 Emerging best practice is to share full analyses using tools such as R Markdown and Jupyter Notebooks. These are easy to use and embed narrative text, analytic code, and the outputs of that code all in a single interactive notebook. Using these tools, our team aims to share analyses and code alongside every published quantitative study: we have shared over 100 notebooks to date (https://github.com/ebmdatalab).
Some researchers may feel they have earned a competitive advantage from software developed in-house to make data management and analysis more efficient. In our view such concerns do not legitimise any attempt to withhold code in a way that undermines transparency for reproducibility; they would be better addressed by recognising and supporting good open software contributions. For example, it is already common to cite code that is reused, but these norms could be expanded and reinforced, with compliance audited. Moreover, a strategic approach to funding shared open analytic resources would be likely to produce better software than the current code produced ad hoc by individual teams, often with duplicated effort.
Overall there is much to be done. Firstly, journals should ask all submitting authors to share adequately documented code as supplementary material on publication, and should audit compliance. Secondly, institutions should ensure researchers can access tools and training to support sharing and other important practices such as code review and version control. Thirdly, as well as sharing their code, researchers should give credit when reusing others’ work and endeavour to review code as critically as they do other aspects of a study’s methods. Finally, funders have an important role: they should require all grant recipients to share code, in the same way that many already mandate sharing of data and results15; they should audit compliance and review applicants’ previous sharing when assessing new applications; and they should explicitly support collaborative development of open analytic tools.
This is not an exhaustive list, and we are keen to hear further suggestions as well as objections. However, the prize is substantial. It is baffling that we are expected to rely on brief narrative text descriptions for complex technical data analysis. Medical research cannot progress at pace with its most foundational text—the code that analyses the data—withheld from view.
Competing interests: We have read and understood BMJ policy on declaration of interests and declare no relevant interests.
Provenance and peer review: Commissioned; not externally peer reviewed.