Can we trust AI not to further embed racial bias and prejudice? BMJ 2020; 368 doi: https://doi.org/10.1136/bmj.m363 (Published 12 February 2020). Cite this as: BMJ 2020;368:m363
When Adewole Adamson received a desperate call at his Texas surgery one afternoon in January 2018, he knew something was up. The call was not from a patient, but from someone in Maryland who wanted to speak to the dermatologist and assistant professor in internal medicine at Dell Medical School at the University of Texas about black people and skin cancer.
Over the next few weeks, through a series of phone calls, Adamson would learn a lot about the caller. Avery Smith is a software developer in his 30s whose wife, LaToya, had died the year before from melanoma. For years, Smith had been researching how black people with melanoma were underserved in healthcare, so when he said that the algorithms driving new cancer software were racist, Adamson listened.
Melanoma is most common in white skin. Black people are less likely to get it, but they are more likely to die from it.1 Smith said that the images used to train algorithms to detect melanoma were predominantly, if not all, of white skin. He explained why this meant that artificial intelligence would struggle to detect cancerous moles in darker skin and, therefore, why the praise from some quarters for AI that has proved just as good as the best clinicians at detecting skin cancers in white people meant little for ethnic minority communities.
Of course, none of this is new. Mole checking guidance and the images used to train doctors to recognise skin cancers are also predominantly of white skin—possibly explaining the late detection of melanoma in black people. But the scale and speed at which AI could embed this problem is concerning. In a viewpoint article in 2018, Smith and Adamson concluded that AI wielded carelessly had the potential to embed and exacerbate racial health inequalities.2 Their findings are pertinent to the UK health setting, where the hype around AI is growing and pressures in the NHS mean that policy makers see automated diagnostic routes as an easy fix.
There has been much excitement about implementing AI in the NHS. In August 2019, health secretary Matt Hancock announced a £250m (€300m; $325m) investment in AI.3 At Moorfields Eye Hospital in London, DeepMind, which is owned by Google's parent company Alphabet, has shown that AI can be used to diagnose retinal diseases, and the Department for Business, Energy and Industrial Strategy has put £50m towards five medical centres across the UK, due to open in 2019, that will use AI to speed up disease diagnosis.
The technology is promising; it could reduce patient waiting times, improve diagnosis, and relieve an overburdened NHS—benefits that apply across the field, regardless of race. “We have to remember that there are global principles that apply to all people for many conditions. We now cure 50% of all cancers, which is an improvement for everybody,” says Bissan Al-Lazikani, head of data science at London’s Institute of Cancer Research, which has recently been able to detect new types of breast cancer using AI.
Al-Lazikani points out that AI has the potential to overcome some current systematic bias in the NHS. Historically, clinical research has not been representative, with most of its participants being white, older, wealthy males.4 With AI, the ability to analyse huge amounts of data on whole populations means potentially more balanced demographics.
Companies such as DeepMind are analysing data from hospital inpatients rather than clinical trials, overcoming common sampling problems in clinical research. In cases of rare diseases, researchers can use global data to understand more about diagnosis and treatment.
But there are problems. First, we do not know how ethnically representative many of the data being used to feed AI are. There is no requirement for such data to be representative, and there is no systematic means by which companies can upload such data publicly. NHS Digital, which collects and publishes information on health and social care in England, has a data quality checker that gives each NHS trust a score based on the quality of data that it submits for review. But trusts provide these data on a voluntary basis, the data are hard to read, and the person reading the data has no idea which are being used to train which AI system.
“The best bet that we have at understanding how inclusive different AI are is the companies being open and honest about it,” says Eleonora Harwich, from the think tank Reform, who recently authored a paper about the application of AI in the NHS.5
Commercial interests compound these problems. Scientific principle dictates that research findings should be replicable and representative to be valid, but some commentators have said that it would be commercially unviable for companies to provide AI datasets for reproducible research. In some cases this could mean that the drive to make a profit outweighs the need to provide a robust, reliable, and representative dataset.
Julian Huppert is a Cambridge academic who focuses on science and technology policy and who was chair of the panel of independent reviewers for DeepMind. He told The BMJ, "If you are backed by a venture capital firm who says, 'We have to have revenue within the next few months,' the temptation could be to provide a product even if it's not quite as good."
Huppert thinks that larger companies like Google have sufficient resources to be honest and open about their data. But even when they are, the level of transparency falls far short of real representation. In a paper published in Nature Medicine on the DeepMind pilot at Moorfields, for example, there is no demographic breakdown of participants, just a description that the hospital is based in "an urban, mixed socioeconomic and ethnicity population centred around London."6 Statistics from Moorfields indicate that the largest group in the inpatient population is white (39%), with 15% non-Chinese Asian, 9% black, 1% Chinese, 1% mixed, 14% other, and 20% unknown.7
A representative from Google accepted this in a statement, saying: “The Nature Medicine research paper in collaboration with Moorfields Eye Hospital represented an initial proof of concept to understand if AI could be applied to the interpretation of retinal OCT [optical coherence tomography] imaging. The dataset used in this research was representative of patient cases found in clinical practice at the world’s busiest eye hospital.”
But they added that the technology will need to be rigorously tested before being deployed elsewhere. “As with all deep learning models it would need development into a safe medical device before being used in clinical practice, including regulatory approvals that include careful evaluation of performance in minority subgroups to ensure fairness and equity and clinical trials involving relevant patient populations. We are working on studies to validate machine learning systems in different global settings as well as different ethnicities and look forward to sharing research in the years to come.”
In the UK, representation for ethnic groups means being in the minority, says Smith. As a result, ethnic communities get a raw deal compared with their white counterparts. They can’t have as much confidence in AI because it hasn’t been tested on people like them, he says.
Adamson explains that it is fundamental to AI’s success that the data are representative. “If data are collected in an unequal manner, if the data aren’t representative of society, if they include factors that have often unmeasurable biases (such as race, zip code, gender), there is a risk of perpetuating a biased outcome. Remember machine learning algorithms are tools unconcerned with fairness of outcomes,” he says.
The result could be an AI that works just as well as the best trained clinicians for a white person but doesn’t recognise cancerous moles on a black person. Smith and Adamson’s research showed that in the US, AI was created by computer scientists at Stanford, trained on images of entirely white skin, and then praised in a peer reviewed and published paper,8 without anyone flagging that the AI might not work for black people.
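The failure mode Adamson and Smith describe can be illustrated with a toy sketch: a classifier "trained" almost entirely on one group learns a decision boundary that suits that group, and its headline accuracy hides a collapse on the underrepresented group unless performance is broken out by subgroup. Everything here is invented for illustration (a single synthetic "contrast" feature standing in for an image, made-up group mixes), not the Stanford system or any clinical data.

```python
import random

random.seed(0)

def make_cases(n, baseline):
    # Hypothetical toy data: each "image" is reduced to one feature,
    # the measured contrast of a lesion. Malignant lesions add a fixed
    # offset; `baseline` shifts the whole distribution to mimic the
    # same lesion photographing differently on lighter vs darker skin.
    cases = []
    for _ in range(n):
        malignant = random.random() < 0.5
        x = baseline + (2.0 if malignant else 0.0) + random.gauss(0, 0.5)
        cases.append((x, malignant))
    return cases

# A training set skewed the way Smith describes: 95% lighter skin.
train = make_cases(950, baseline=5.0) + make_cases(50, baseline=2.0)

# "Train" the simplest possible classifier: a threshold halfway
# between the mean contrast of malignant and benign training cases.
pos = [x for x, m in train if m]
neg = [x for x, m in train if not m]
threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(cases):
    # Fraction of cases where "above threshold" matches the label.
    return sum((x > threshold) == m for x, m in cases) / len(cases)

acc_light = accuracy(make_cases(500, baseline=5.0))
acc_dark = accuracy(make_cases(500, baseline=2.0))
print(f"light-skin accuracy: {acc_light:.2f}")
print(f"dark-skin accuracy:  {acc_dark:.2f}")
```

Because the threshold is dominated by the majority group, it sits far above every dark-skin case, malignant or not, so nearly all of them are called benign: accuracy near chance for one group, near perfect for the other. Reporting only the pooled accuracy, as the criticised papers effectively did, would mask the gap entirely.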
The second problem is that if companies are not obliged to show their data, people might continue to think that a fancy new AI is relevant to everyone when it is not, says Smith. “My issue is if black people continue to go to the doctor thinking these products work for them when they don’t,” says Smith, adding: “They [companies] should be at least required to tell people.”
If there is one thing Adamson wants you to know, it’s that he is not a luddite. “I know that AI has huge potential to clear up health disparities. We just need to take the time to vet it properly like we would any other new health product,” he says. For him, that means making sure that the data used to train AI are representative, that products work on all regardless of skin colour, and that where they do not, a warning is provided.
Adamson recalls being challenged after writing about AI's potential inability to diagnose cancer in black people. One observer made the point that it would be unfair to let the fact that the data were not wholly representative of communities get in the way of a technology that could benefit a lot of people.
But although Smith and Adamson’s research specifically pointed to problems with skin cancer diagnosis, the potential problems are varied. In medical research, concerns have been raised about the accuracy of algorithms depending on ethnicity. Recent cases in the US have included an algorithm that pushed white people to the front of the line for special programmes for patients with complex, chronic conditions.9 In another case, a top US health provider was using an algorithm that allocated substantially more follow-up resources to white than to black patients with the same disease burden.10
“I don’t think companies should pretend it is anything other than a product that only works on white people if that is the case,” says Adamson.
Difficult to regulate
Harwich points out a key problem with regulating AI in healthcare, namely that bias and discrimination are very difficult things to regulate. The UK health space is full of regulators, she contends. There is a regulator that decides whether a product works as specified and works safely, but it does not decide whether the product systematically discriminates against a particular population. There is data protection law, such as the General Data Protection Regulation, but that governs only whether the information held about a person is accurate, not whether it is robust, representative, or biased. There is human rights law, but it would be difficult to apply to data covering an entire population.
“If you want to be uber radical, you would scrap all the regulators—the Medicines and Healthcare Products Regulatory Agency, the National Institute for Health and Care Excellence, the Information Commissioner’s Office, and so on—and have just one. Their different functions would work as departments within that organisation,” says Harwich.
Reform has recommended a paid-for data quality report service for private companies that want to access NHS data.5 The reports would detail when the data were collected, how they were collected, the potential reasons for some data missing, and the limitations of the data. This could nudge companies to think about ethical questions—such as bias—because there are market incentives to provide more representative data. “If you don’t have diversity in your dataset, it will genuinely limit the scalability of your product. If you train your algorithm only on data from trust A, and trust A is not overly representative, it won’t have any external validity, so if ever you need to apply your model to trust B it won’t work,” says Harwich.
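One element such a report could contain is a simple representativeness check: compare a trust's dataset against a reference population and flag any group that falls well short of its population share, before the data are used to train anything. This is a minimal sketch of that idea; the group categories, percentages, and the 50% tolerance are all illustrative assumptions, not Reform's proposal or real NHS figures.

```python
# Hypothetical demographic shares (fractions summing to ~1.0).
reference = {"white": 0.60, "asian": 0.15, "black": 0.13,
             "mixed": 0.05, "other": 0.07}
trust_a = {"white": 0.86, "asian": 0.06, "black": 0.04,
           "mixed": 0.02, "other": 0.02}

def underrepresented(dataset, reference, tolerance=0.5):
    # Flag any group whose share of the dataset is less than
    # `tolerance` times its share of the reference population.
    return sorted(g for g, ref_share in reference.items()
                  if dataset.get(g, 0.0) < tolerance * ref_share)

flags = underrepresented(trust_a, reference)
print(flags)  # → ['asian', 'black', 'mixed', 'other']
```

A report listing those flagged groups makes Harwich's external-validity point concrete: a model trained only on trust A's skewed mix carries no guarantee for the groups it barely saw, so it may simply not transfer to trust B.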
But for Smith, market incentives alone are not enough—they are part of the reason why inequality exists. Market incentives can’t overcome the hundreds of years of medical research missing for communities of colour.
“[We as black people] have been told for years that it is not profitable to make products for black people,” he says. “The problem of unrepresentative datasets goes back hundreds of years. It is based on largely white samples and it works for those people. It is going to cost a whole lot of money to address that problem, and if there is no moral drive to do that, it won’t happen.”
Adamson says that the cost of not fixing this could be huge. AI could misdiagnose, underdiagnose, and potentially increase healthcare disparities. For that reason, he says we need to keep in mind that AI might be an amazing tool, but it is no replacement for the complex human thinking required of ethical decision making.
“If you don’t power these machines with diverse input, you’re only going to automate bias and automate racism,” says Adamson. The worst problem is that it could do all of the above under the guise of objectivity. “AI is not objective. It is just a tool. We feed that tool. We decide what it learns, and it is possible that we might feed it bias or even racism—garbage in, garbage out, we have known that for years.”
He gives the example of an AI tasked with measuring the effectiveness of organ transplantations. AI could tell you which transplants failed and which were successful. But it couldn’t tell you whether they failed in black communities because of poverty, or addiction, or whether they were just ineffective.
Regardless, the algorithm would predict—perhaps accurately—that the transplant is less likely to work on black people. “It could not diagnose or understand the societal factors. Does that mean they shouldn’t get a transplant?” says Adamson.
In the UK health setting, Harwich says that expecting AI to solve all our problems could have dire consequences. "There is so much hype around AI and these snazzy algorithms that sometimes I feel like people think it will absolve them of the need to think," says Harwich. "It will never absolve you from having to think hard about big problems. Technology can't choose what outcomes you want to achieve, or what type of society you want to be in.
“Those are very deep human questions that no one is going to answer for us. If you let them be answered for you, then you’re in deep shit.”
Provenance and peer review: Commissioned, not peer reviewed.
Competing interests: I have read and understood BMJ policy on declaration of interests and have no relevant interests to declare.