

What’s holding up the big data revolution in healthcare?

BMJ 2018; 363 doi: (Published 28 December 2018) Cite this as: BMJ 2018;363:k5357
  1. Kiret Dhindsa, postdoctoral fellow (1)
  2. Mohit Bhandari, professor (2)
  3. Ranil R Sonnadara, associate professor (2)

  (1) Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada
  (2) Department of Surgery, McMaster University, Hamilton, Ontario, Canada

  Correspondence to: K Dhindsa dhindsj{at}

Poor data quality, incompatible datasets, inadequate expertise, and hype

Big data refers to datasets that are too large or complex to analyse with traditional methods.1 Instead we rely on machine learning—self updating algorithms that build predictive models by finding patterns in data.2 In recent years, a so called “big data revolution” in healthcare has been promised345 so often that researchers are now asking why this supposed inevitability has not happened.6 Although some technical barriers have been correctly identified,7 there is a deeper issue: many of the data are of poor quality and in the form of small, incompatible datasets.

Current practices around collection, curation, and sharing of data make it difficult to apply machine learning to healthcare on a large scale. We need to develop, evaluate, and adopt modern health data standards that guarantee data quality, ensure that datasets from different institutions are compatible for pooling, and allow timely access to datasets by researchers and others. These prerequisites for machine learning have not yet been met.

Part of the problem is that the hype surrounding machine learning obscures the reality that it is just a tool for data science with its own requirements and limitations. The hype also fails to acknowledge that all healthcare tools must work within a wide range of human constraints, from the molecular to the social and political. Each of these will limit what can be achieved: even big technical advances may have only modest effects when integrated into the complex framework of clinical practice and healthcare delivery.

Although machine learning is the state of the art in predictive big data analytics, it is still susceptible to poor data quality,89 sometimes in uniquely problematic ways.2 Machine learning, including its more recent incarnation deep learning,10 performs tasks involving pattern recognition (generally a combination of classification, regression, dimensionality reduction, and clustering11). The ability to detect even the most subtle patterns in raw data is a double edged sword: machine learning algorithms, like humans, can easily be misdirected by spurious and irrelevant patterns.12

For example, medical imaging datasets are often riddled with annotations—made directly on the images—that carry information about specific diagnostic features found by clinicians. This is disastrous in the machine learning context, where an algorithm trained using datasets that include annotated images will seem to perform extremely well on standard tests but fail to work in a real world scenario where similar annotations are not available. Since these algorithms find patterns regardless of how meaningful those patterns are to humans, the rule “garbage in, garbage out” may apply even more than usual.13
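The failure mode described above can be illustrated with a small synthetic sketch. Nothing here comes from a real imaging dataset: each "image" is reduced to two hypothetical features, a weak genuine `signal` and a `marker` flag standing in for an annotation burned into positive cases. The learner is a deliberately simple one-feature threshold rule (a decision stump), chosen only to keep the example short.

```python
import random


def make_cases(n, annotated, rng):
    """Simulate n cases as ((signal, marker), label) tuples."""
    cases = []
    for _ in range(n):
        label = rng.random() < 0.5
        # Weak genuine signal: positive cases score higher on average.
        signal = rng.gauss(1.0 if label else 0.0, 1.0)
        # Hypothetical burned-in annotation: present only on positive cases,
        # and only in datasets that clinicians have annotated.
        marker = 1.0 if (annotated and label) else 0.0
        cases.append(((signal, marker), label))
    return cases


def accuracy(stump, cases):
    """Share of cases where 'feature >= threshold' matches the label."""
    feature, threshold = stump
    hits = sum((x[feature] >= threshold) == label for x, label in cases)
    return hits / len(cases)


def fit_stump(cases):
    """Choose the single feature and threshold with the best training accuracy."""
    best_acc, best_stump = -1.0, None
    for feature in range(2):
        for x, _ in cases:
            stump = (feature, x[feature])
            acc = accuracy(stump, cases)
            if acc > best_acc:
                best_acc, best_stump = acc, stump
    return best_stump


rng = random.Random(0)
train = make_cases(500, annotated=True, rng=rng)
test_annotated = make_cases(500, annotated=True, rng=rng)
deployed = make_cases(500, annotated=False, rng=rng)

stump = fit_stump(train)
acc_annotated = accuracy(stump, test_annotated)
acc_deployed = accuracy(stump, deployed)
print("feature chosen:", "marker" if stump[0] == 1 else "signal")
print("accuracy on annotated test set:", acc_annotated)
print("accuracy without annotations:", acc_deployed)
```

Under these assumptions the rule keys on the annotation rather than the genuine signal, so it scores perfectly on a held-out set that shares the annotations and falls to roughly chance once they are absent.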

Even if we had good data, would we have enough? Healthcare data are currently distributed across multiple institutions, collected using different procedures, and formatted in different ways. Machine learning algorithms recognise patterns by exploiting sources of variance in large datasets. But inconsistencies across institutions mean that combining datasets to achieve the required size easily introduces an insurmountable degree of non-predictive variability. This makes it all too easy for a machine learning algorithm to miss the truly important patterns and latch onto the more dominant patterns introduced by institutional differences.
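A toy simulation can make the point about institutional variance concrete. The two sites, the measurement offset of 5.0, and the effect size of 1.0 below are all invented for illustration. Per-site mean centring is used only as a crude stand-in for harmonisation; it is an assumption of this sketch, not a method proposed in the article.

```python
import random
import statistics


def simulate_site(name, n, offset, rng):
    """Return (site, value, label) rows for one hypothetical institution."""
    rows = []
    for _ in range(n):
        label = rng.random() < 0.5
        # Genuine effect of 1.0 plus a site-specific measurement offset.
        value = offset + (1.0 if label else 0.0) + rng.gauss(0.0, 1.0)
        rows.append((name, value, label))
    return rows


def variance_share(values, groups):
    """Fraction of total variance explained by group means (eta squared)."""
    grand = statistics.fmean(values)
    total = sum((v - grand) ** 2 for v in values)
    between = 0.0
    for g in set(groups):
        members = [v for v, k in zip(values, groups) if k == g]
        between += len(members) * (statistics.fmean(members) - grand) ** 2
    return between / total


rng = random.Random(0)
rows = simulate_site("A", 2000, 0.0, rng) + simulate_site("B", 2000, 5.0, rng)
sites = [r[0] for r in rows]
values = [r[1] for r in rows]
labels = [r[2] for r in rows]

# Crude harmonisation: subtract each site's own mean from its values.
site_mean = {
    s: statistics.fmean([v for t, v, _ in rows if t == s]) for s in set(sites)
}
centred = [v - site_mean[s] for s, v, _ in rows]

raw_site = variance_share(values, sites)
raw_label = variance_share(values, labels)
centred_site = variance_share(centred, sites)
centred_label = variance_share(centred, labels)

print("pooled raw data: share of variance from site   ", round(raw_site, 3))
print("pooled raw data: share of variance from outcome", round(raw_label, 3))
print("after centring:  share of variance from site   ", round(centred_site, 3))
print("after centring:  share of variance from outcome", round(centred_label, 3))
```

In the pooled raw data most of the variance tracks the institution rather than the outcome. Centring removes the site offset here only because the simulated sites have the same case mix; with real data it would also erase genuine differences between sites, which is one reason consistent standards at the point of collection are preferable to after-the-fact correction.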

Holistic solution

A holistic solution to problems of data quality and quantity would include adoption of consistent health data standards across institutions, complete with new data sharing policies that ensure ongoing protection of patient privacy. If healthcare leaders see an opportunity to advance patient care with big data and machine learning, they must take the initiative to establish new data policies in consultation with clinicians, data scientists, patients, and the public.

Improved data management is clearly necessary if machine learning algorithms are to generate models that can transition successfully from the laboratory to clinical practice. How should we go about it? Effective data management requires specialist training in data science and information technology, and detailed knowledge of the nuances associated with data types, applications, and domains, including how they relate to machine learning. This points to a growing role for data management specialists and knowledge engineers who can pool and curate datasets; such experts may become as essential to modern healthcare as imaging technicians are now.14 Clinicians will also need training as collectors of health data and users of machine learning tools.15

To truly realise the potential of big data in healthcare we need to bring together up-to-date data management practices, specialists who can maximise the usability and quality of health data, and a new policy framework that recognises the need for data sharing. Until then, the big data revolution (or at least a realistic version of it) remains on hold.


  • Competing interests: We have read and understood BMJ policy on declaration of interests and declare the following: RRS reports board membership for Compute Ontario, SHARCNET, and SOSCIP—all non-profit advanced research computing organisations. MB reports personal fees from Stryker, Sanofi, Ferring, and Pendopharm and grants from Acumed, DJO, and Sanofi outside the submitted work.

  • Provenance and peer review: Not commissioned; externally peer reviewed.

