The poor performance of apps assessing skin cancer risk
BMJ 2020; 368 doi: https://doi.org/10.1136/bmj.m428 (Published 10 February 2020) Cite this as: BMJ 2020;368:m428
- Jessica Morley, DataLab policy lead1,
- Luciano Floridi, professor of philosophy and ethics of information2,
- Ben Goldacre, DataLab director1
- 1Nuffield Department of Primary Care, University of Oxford, Oxford OX2 6GG, UK
- 2Oxford Internet Institute, University of Oxford, Oxford, UK
- Correspondence to: J Morley
Over the past year, technology companies have made headlines claiming that their artificially intelligent (AI) products can outperform clinicians at diagnosing breast cancer,1 brain tumours,2 and diabetic retinopathy.3 Claims such as these have influenced policy makers, and AI now forms a key component of the national health strategies in England, the United States, and China.
It is positive to see healthcare systems embracing data analytics and machine learning. However, there are reasonable concerns about the efficacy, ethics, and safety of some commercial AI health solutions.45 Trust in AI applications (or apps) relies heavily on the myth of the objective and omniscient algorithm, and our systems for generating and implementing evidence have not yet adapted to the specific new challenges of AI. They may even have failed on the basics. In a linked article, Freeman and colleagues6 (doi:10.1136/bmj.m127) throw these general concerns into stark relief with a close examination of the evidence on diagnostic apps for skin cancer.
The authors report results from a systematic review of studies evaluating the accuracy of smartphone apps that were offered directly to the public for risk stratification of skin lesions. Nine studies were included, evaluating a total of six apps. Even though methodological decisions made by the studies' authors probably led to overestimation of the apps' real world performance, Freeman and colleagues still found evidence for accuracy to be lacking.
Some apps gave conflicting management advice for the same lesions, and their recommendations were commonly inconsistent with clinical histopathological results. In short, little evidence indicates that current AI apps can beat clinicians when assessing skin lesion risk, at least not in a verifiable or reproducible form.
Currently, two apps from the review are available in the UK. Freeman and colleagues report that SkinScan was evaluated in a single study of 15 images containing five positive cases, in which the app achieved a sensitivity of 0%. The second, SkinVision, performed poorly when validated against expert recommendations. Yet both are approved and regulated as "class I medical devices," and both carry a CE mark.
This official approval will give consumers the impression that the apps have been assessed as effective and safe. But “class I” is the European classification for low risk devices, such as plasters and reading glasses. The implicit assumption is that apps are similarly low risk technology. But shortcomings in diagnostic apps can have serious implications: for patients and the public, risks include psychological harm from health anxiety or “cyberchondria,” and physical harm from misdiagnosis or overdiagnosis; for clinicians there is a risk of increased workload, and changes to ethical or legal responsibilities around triage, referral, diagnosis, and treatment; for the system, there is a risk of inappropriate resource use, and even loss of credibility for digital technology in general.
The current regulatory regime is clearly unsatisfactory. Collectively, as a society, we must decide what amounts to good evidence when evaluating health apps; who is responsible for generating, validating, and appraising this evidence; and how post-market monitoring of regularly updated software should be organised. These are complex questions.
Regulators clearly have a role. We must decide which activities they will regulate: risk stratification apps clearly perform a medical function; "wellness" apps for meditation and mindfulness are a grey area, but could nonetheless cause psychological harm. Regulators, most accustomed to managing medicines, will need new skills to evaluate digital technology. But wherever the perimeter is drawn, they must avoid false reassurance: when regulators are not evaluating a technology, they should clearly flag this to patients and policy makers.
Softer governance measures (eg, policies and standards) from governing bodies such as NHS England can facilitate the creation of rational and transparent markets. Clinicians, patients, and commissioners are all potential customers for health apps. Guidance, such as that recently produced by Public Health England on evaluating digital health products,7 can help ensure that each group knows enough to require, find, understand, critically evaluate, and apply good evidence, within reasonable limits. This is likely to drive better innovation, by rewarding only products that deliver tangible benefits. Clinicians should also be trained to evaluate the tools they recommend to patients, avoid the pitfalls of automation bias, and identify the clinical tasks that can be automated safely.
Lastly, we need a cultural shift. It must become the norm, or social expectation, that all those developing health apps and AI solutions support third party access to data, in a trustworthy manner; and code, within the parameters of technical feasibility, while respecting ownership of intellectual property. This would facilitate competition, reproducibility, audit, and error correction,8 driving up the overall quality of solutions available on the market. It would also enable more independent real world evaluations of market solutions to be conducted, provided that funders are willing to support this type of research.
Collectively, these actions will improve evidence and transparency across the whole algorithm lifecycle.910 Reliable evaluations must find the truth, purchasers must require and use those truths, and regulators and other governing bodies must support and enhance these processes. Without better information, patients, clinicians, and other stakeholders cannot be assured of an app's efficacy and safety.
Competing interests: We have read and understood BMJ policy on declaration of interests. JM is a recent employee of NHSX, the governing body for digital, data, and technology policy in the NHS, and has received a research grant from the Digital Catapult in the past 12 months. Neither organisation has been involved in the writing of this editorial.
Provenance and peer review: Commissioned; not peer reviewed.