AI has the worst superpower… medical racism.

By LUKE OAKDEN-RAYNER

In this piece, Luke discusses a preprint* titled Reading Race: AI Recognises Patient’s Racial Identity in Medical Images. Instead of covering it in detail here, he presents an explanation of why he and many of his co-authors think this issue is important. So in a way, this blog post can be considered a primer, a companion piece for the paper which explains the why. Sure, AI can detect a patient’s racial identity, but why does it matter?

This paper was a huge undertaking, with a big team from all over the world. Big shout out to Dr Judy Gichoya for gathering everyone together and leading the group!

This work is also, in my opinion, a huge deal. This is research that should challenge the status quo, and hopefully change medical AI practice.

We want feedback and criticism, so I hope everyone will read the paper. I’m not going to cover it in detail here. Instead, I wanted to write something else which I think will complement the paper: an explanation of why I and many of my co-authors think this issue is important.

One thing we noticed when we were working on this research was that there was a clear divide in our team. The more clinical and safety/bias related researchers were shocked, confused, and frankly horrified by the results we were getting. Some of the computer scientists and the more junior researchers on the other hand were surprised by our reaction. They didn’t really understand why we were concerned.

So in a way, this blog post can be considered a primer, a companion piece for the paper which explains the why. Sure, AI can detect a patient’s racial identity, but why does it matter?

Disclaimer: I’m white. I’m glad I got to contribute, and I am happy to write about this topic, but that does not mean I am somehow an authority on the lived experiences of minoritized racial groups. These are my opinions after discussion with my much more knowledgeable colleagues, several of whom have reviewed the blog post itself.

A brief summary

In extremely brief form, here is what the paper showed:

- AI can trivially learn to identify the self-reported racial identity of patients to an absurdly high degree of accuracy.
- AI does learn to do this when trained for clinical tasks.
- These results generalise, with successful external validation and replication in multiple x-ray and CT datasets.
- Despite many attempts, we couldn’t work out what it learns or how it does it. It didn’t seem to rely on obvious confounders, nor did it rely on a limited anatomical region or portion of the image spectrum.

A small portion of the results from the paper

So that is a basic summary. For all the gory details see the paper. Now for the important part: so what?

An argument in four steps

I’m going to try to lay out, as clearly as possible, that this AI behaviour is both surprising, and a very bad thing if we care about patient safety, equity, and generalisability.

The argument will have the following parts:

1. Medical practice is biased in favour of the privileged classes in any society, and worldwide towards a specific type of white men.
2. AI can trivially learn to recognise features in medical imaging studies that are strongly correlated with racial identity. This provides a powerful and direct mechanism for models to incorporate the biases in medical practice into their decisions.
3. Humans cannot identify the racial identity of a patient from medical images. In medical imaging we don’t routinely have access to racial identity information, so human oversight of this problem is extremely limited at the clinical level.
4. The features the AI makes use of appear to occur across the entire image spectrum and are not regionally localised, which will severely limit our ability to stop AI systems from doing this.

There are several other things I should point out before we get stuck in. First of all, a definition. We are talking about racial identity, not genetic ancestry or any other biological process that might come to mind when you hear the word “race”. Racial identity is a social, legal, and political construct that consists of our own perceptions of our race, and how other people see us. In the context of this work, we rely on self-reported race as our indicator of racial identity.

Before you jump in with questions about this approach and the definition, a quick reminder on what we are trying to research. Bias in medical practice is almost never about genetics or biology. No patient has genetic ancestry testing as part of their emergency department workup. We are interested in factors that may bias doctors in how they decide to investigate and treat patients, and in that setting the only information they get is visual (i.e., skin tone, facial features etc.) and sociocultural (clothing, accent and language use, and so on). What we care about is race as a social construct, even if some elements of that construct (such as skin tone) have a biological basis.

Secondly, whenever I am using the term bias in this piece, I am referring to the social definition, which is a subset of the strict technical definition; it is the biases that impact decisions made about humans on the basis of their race. These biases can in turn produce health disparities, which the NIH defines as “a health difference that adversely affects disadvantaged populations”.

Third, I want to take as given that racial bias in medical AI is bad. I feel like this shouldn’t need to be said, but the ability of AI to homogenise, institutionalise, and algorithm-wash health disparities across regions and populations is not a neutral thing.

AI can seriously make things much, much worse.

Part I – Medicine is biased

There has been a lot of discussion around structural racism in medicine recently, especially as the COVID pandemic has highlighted ongoing disparities in US healthcare. The existence of structural racism is no surprise to anyone who studies the topic, nor to anyone affected by it, but apparently it is still a surprise to some of the most powerful people in medicine.

The tweet and podcast that led to the resignation of the editor-in-chief of one of the biggest medical journals in the world. Look at that ratio!

Sometimes medical bias is because different patient groups will need different clinical approaches, but we don’t know that because the evidence that supports clinical practice is biased; most clinical trials populations are heavily skewed towards white men. In fact, in the 1990s the US Congress had to step in to demand that trials include women and racial/ethnic minorities**. In general, trials populations don’t include enough women, people of colour, people from outside the US or Europe, people who are poor, and so on***.

The geographic origin and race/ethnicity of participants of clinical trials, 1997 to 2014.

There is a long and storied history of the effects this has had. Examples include medicines for morning sickness that cause birth deformities but weren’t tested on pregnant women, imaging measurements that tend to overestimate the risk of Down’s syndrome in Asian foetuses, and genetic tests that fail for people of colour.

But medicine is biased at the clinical level too, where healthcare workers seem to make different decisions for patients from different groups when the correct choice would be to treat them the same. A famous example was reported in the New England Journal of Medicine, where blind testing of all pregnant women in Pinellas County FL for drug use revealed similar rates of use for Black and white women (~14%), but in the same period of time Black women were 10 times more likely to be reported to the authorities for substance abuse during pregnancy. The healthcare workers were, consciously or unconsciously, choosing who to test and report. This was true even for private obstetrics patients:

In the private obstetricians’ offices, black women made up less than 10 percent of the patient population but 55 percent of those reported for substance abuse during pregnancy.

Chasnoff et al, NEJM, 1990

There are innumerable other examples that can be described for any minoritised group. Women, people of colour, and Hispanic white men are less likely to receive adequate pain relief than non-Hispanic white men. Transgender patients commonly report being flat-out denied care, or receiving extremely substandard treatment. Black newborns are substantially more likely to survive if they are treated by a Black doctor.

In medical imaging we like to think of ourselves as above this problem, particularly with respect to race because we usually don’t know the identity of our patients. We report the scans without ever seeing the person, but that only protects us from direct bias. Biases still affect who gets referred for scans and who doesn’t, and they affect which scans are ordered. Bias affects what gets written on the request forms, which in turn influences our diagnoses. And let’s not pretend we aren’t influenced by patient names and scan appearances too. If we see a single earring or nipple piercing on a man, we are trained to think about HIV related diseases, because they are probably gay (are they though?) and therefore at risk (PrEP is a thing now!). In fact, around 20% of radiologists admit being influenced by patient demographic factors.

But it is true that, in general, we read the scan as it comes. The scan can’t tell us what colour a person’s skin is.

Can it?

Part II – AI can detect racial identity in x-rays and CT scans

I’ve already included some results up in the summary section, and there are more in the paper, but I’ll very briefly touch on my interpretation of them here.

Firstly, the performance of these models ranges from high to absurd. An AUC of 0.99 for recognising the self-reported race of a patient, which has no recognised medical imaging correlate? This is flat out nonsense.

Every radiologist I have told about these results is absolutely flabbergasted, because despite all of our expertise, none of us would have believed in a million years that x-rays and CT scans contain such strong information about racial identity. Honestly we are talking jaws dropped – we see these scans every day and we have never noticed.

Artist’s impression of people listening to me at work parties…

The second important aspect though is that, with such a strong correlation, it appears that AI models learn the features correlated with racial identity by default. For example, in our experiments we showed that the distribution of diseases in the population for several datasets was essentially non-predictive of racial identity (AUC = 0.5 to 0.6), but we also found that if you train a model to detect those diseases, the model learns to identify patient race almost as well as the models directly optimised for that purpose (AUC = 0.86). Whaaat?

Actual footage of Marzyeh^ and me from our Zoom meetings

Despite racial identity not being useful for the task (since the disease distribution does not differentiate racial groups), the model learns it anyway? My only hypothesis is that a) CNNs are primed to learn these features due to their inductive biases, and b) perhaps the known differences in true and false positive rates (TPR/FPR) in AI models trained on these datasets (Black patients tend to get under-diagnosed, white patients over-diagnosed) are responsible, where cases that are otherwise identical have racially biased differences in labelling?
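To make the TPR/FPR point concrete, here is a minimal sketch of how those rates can be compared across patient groups. It is purely illustrative and is not the analysis code from the paper: the function name `rates_by_group`, the group labels "A" and "B", and the toy data are all made up for this example.

```python
from collections import defaultdict


def rates_by_group(y_true, y_pred, groups):
    """Return {group: (TPR, FPR)} for binary labels and binary predictions.

    Hypothetical helper for illustration only.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            key = "tp" if pred == 1 else "fn"
        else:
            key = "fp" if pred == 1 else "tn"
        counts[group][key] += 1

    result = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]  # patients who truly have the disease
        negatives = c["fp"] + c["tn"]  # patients who truly do not
        tpr = c["tp"] / positives if positives else float("nan")
        fpr = c["fp"] / negatives if negatives else float("nan")
        result[group] = (tpr, fpr)
    return result


# Toy data (invented): group "A" is under-diagnosed, group "B" is over-diagnosed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
groups = ["A"] * 8 + ["B"] * 8

rates = rates_by_group(y_true, y_pred, groups)
for group, (tpr, fpr) in sorted(rates.items()):
    print(f"group {group}: TPR={tpr:.2f} FPR={fpr:.2f}")
```

A lower TPR for one group means more missed disease in that group (under-diagnosis), while a higher FPR for another means more false alarms (over-diagnosis).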

But no matter how it works, the take-home message is that it appears that models will tend to learn to recognise race, even when it seems irrelevant to the task. So the dozens upon dozens of FDA approved x-ray and CT scan AI models on the market now … probably do this^^? Yikes!

There is one more interpretation of these results that is worth mentioning, for the “but this is expected model behaviour” folks. Even from a purely technical perspective, ignoring the racial bias aspect, the fact models learn features of racial identity is bad. There is no causal pathway linking racial identity and the appearance of, for example, pneumonia on a chest x-ray. By definition these features are spurious. They are shortcuts. Unintended cues. The model is underspecified for the problem it is intended to solve.

Adapted from Geirhos et al, this diagram shows a hypothetical pneumonia detection model that, during optimisation, has learned to recognise racial identity.

However we want to frame this, the model has learned something that is wrong, and this means the model can behave in undesirable and unexpected ways.

I won’t be surprised if this becomes a canonical example of the biggest weakness of deep learning – the ability of deep learning to pick up unintended cues from the data. I’m certainly going to include it in all my talks.

Part III – Humans can’t identify racial identity in medical images

As much as many technologists seem to think otherwise, humans play a critical role in AI, particularly in high risk decision making: we are the last line of defence against silly algorithmic decisions. Humans, as the parties who are responsible for applying the decisions of AI systems to patients, have to determine if an error has been made, and whether the errors reflect unsafe AI behaviour.

Radiologists have a lot of practice at determining when imaging tests are acceptable or not. For example, image quality is known to impact diagnostic accuracy. If the images look bad enough that you might miss something (we call these images “non-diagnostic”) then the radiologist is responsible for recognising that and refusing to use them.

“What is this, Nuc Med?”

But what happens when the radiologist literally has no way to tell if the study is usable or not?

I’ve spoken about this risk before when I discussed medical imaging super-resolution models. If the AI changes the output in a way that is hidden from the radiologist, because a bad image looks like it is a good image, then the whole “radiologist as safety net” system breaks down.

AI-accelerated MRI imaging from the fastMRI challenge, with pretty “diagnostic quality” looking images, but the tear in the meniscus is no longer visible. How is a human meant to identify that the image study is flawed if it looks fine but the important part is missing?

The problem is much worse for racial bias. At least in MRI super-resolution, the radiologist is expected to review the original low quality image to ensure it is diagnostic quality (which seems like a contradiction to me, but whatever). In AI with racial bias though, humans literally cannot recognise racial identity from images^^^. Unless they are provided with access to additional data (which they don’t currently have easy access to in imaging workflows) they will be completely unable to appreciate the bias no matter how skilled they are and no matter how much effort they apply to the task.

This is a big deal. Medicine operates on what tends to be called the “Swiss cheese” model of risk management, where each layer of mitigation has some flaws, but combined they detect most problems.

The radiologist slice of cheese is absolutely critical in imaging AI safety, and in this setting it might be completely ineffective.

It is definitely true that we are moving towards race-aware risk management practices, and the recently published algorithmic bias playbook describes how governance bodies might implement such practices at a policy level, but it is also true that these practices are not currently widespread, despite the dozens of AI systems available on the open market.

Part IV – We don’t know how to stop it

This is probably the biggest problem here. We ran an extensive series of experiments to try and work out what was going on.

First, we tried obvious demographic confounders (for example, Black patients tend to have higher rates of obesity than white patients, so we checked whether the models were simply using body mass/shape as a proxy for racial identity). None of them appeared to be responsible, with very low predictive performance when tested alone.

Next we tried to pin down what sort of features were being used. There was no clear anatomical localisation, no specific region of the images that contributed to the predictions. Even more interesting, no part of the image spectrum was primarily responsible either. We could get rid of all the high-frequency information, and the AI could still recognise race in fairly blurry (non-diagnostic) images. Similarly, and I think this might be the most amazing figure I have ever seen, we could get rid of the low-frequency information to the point that a human can’t even tell the image is still an x-ray, and the model can still predict racial identity just as well as with the original image!
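For readers who haven’t met low-pass and high-pass filtering before, the rough idea can be sketched in a few lines of NumPy. This is an illustrative sketch under my own assumptions, not the filtering code from the paper: the function name `frequency_filter` and the `radius` parameter (measured in frequency bins from the centre of the shifted spectrum) are hypothetical, and the paper’s own filter settings may be parameterised differently.

```python
import numpy as np


def frequency_filter(image, radius, keep="low"):
    """Keep only the low or the high spatial frequencies of a 2-D greyscale image.

    `radius` is in frequency-bin units, measured from the centre of the
    shifted spectrum (an assumption made for this sketch).
    """
    # Move to the frequency domain, with the zero frequency at the centre.
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    rows, cols = image.shape
    y, x = np.ogrid[:rows, :cols]
    distance = np.sqrt((y - rows // 2) ** 2 + (x - cols // 2) ** 2)

    inside = distance <= radius            # low frequencies (coarse structure)
    mask = inside if keep == "low" else ~inside

    # Zero out the unwanted frequencies and transform back to pixel space.
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return filtered.real


# Random pixels standing in for an x-ray.
rng = np.random.default_rng(0)
image = rng.random((64, 64))

low = frequency_filter(image, radius=10, keep="low")    # blurry version
high = frequency_filter(image, radius=10, keep="high")  # edges and fine texture only
```

The two outputs are complementary: adding `low` and `high` back together recovers the original image. The high-pass output also has an average brightness of essentially zero, which helps explain why such images look like a near-uniform grey box to a human.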

Performance is maintained with the low pass filter to around the LPF25 level, which is quite blurry but still readable. But for the high-pass filter, the model can still recognise the racial identity of the patient well past the point that the image is just a grey box.

Actually, I’m going to zoom in on that image just because it is so ridiculous!

What even is this? This nonsense generalises to completely new datasets?!?!

This difficulty in isolating the features associated with racial identity is really important, because one suggestion people tend to have when they get shown evidence of racial bias is that we should make the algorithms “colorblind” – to remove the features that encode the protected attribute and thereby make it so the AI cannot “see” race but should still perform well on the clinical tasks we care about.

Here, it seems like there is no easy way to remove racial information from images. It is everywhere and it is in everything.

An urgent problem

AI seems to easily learn racial identity information from medical images, even when the task seems unrelated. We can’t isolate how it does this, and we humans can’t recognise when AI is doing it unless we collect demographic information (which is rarely readily available to clinical radiologists). That is bad.

There are around 30 AI systems using CXR and CT Chest imaging on the market currently, FDA cleared, many of which were trained on the exact same datasets we utilised in this research. That is worse.

The ACR-DSI website lists all FDA cleared and approved AI systems.

So how do we find out if this is a problem in clinical AI tools? The FDA recommends…

…you report all results by relevant clinical and demographic subgroups…

Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests, FDA 2007

which sounds nice, but thus far we have seen almost no evidence that this is being done^^. In fact, even when asked for more information by Tariq et al, no commercial groups provided even a demographic breakdown of their test sets by racial identity, let alone performance results in these populations. I said at the start that I hope this research changes medical AI practice, so here is how: we absolutely have to do more race-stratified testing of AI systems°, and we probably shouldn’t allow AI systems to be used outside of populations they have been tested in.
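As a rough illustration of what race-stratified testing means in practice, here is a minimal sketch that reports the test-set size and AUC for each subgroup alongside the overall figure. It is not taken from the paper or from any FDA guidance: the helper names `auc` and `stratified_report`, the group labels, and the toy scores are all invented for this example.

```python
def auc(labels, scores):
    """AUC as the chance that a random positive case scores higher than a
    random negative case (ties count as half). Returns None if a class is missing."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_report(labels, scores, groups):
    """Per-subgroup sample size, number of positive cases, and AUC."""
    report = {}
    for g in sorted(set(groups)):
        idx = [i for i, x in enumerate(groups) if x == g]
        g_labels = [labels[i] for i in idx]
        g_scores = [scores[i] for i in idx]
        report[g] = {
            "n": len(idx),
            "n_pos": sum(g_labels),
            "auc": auc(g_labels, g_scores),
        }
    return report


# Toy test set (invented): the model ranks group "A" perfectly but not group "B".
labels = [1, 1, 0, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.2, 0.1, 0.9, 0.3, 0.6, 0.1]
groups = ["A"] * 4 + ["B"] * 4

overall = auc(labels, scores)
report = stratified_report(labels, scores, groups)
print("overall AUC:", overall)
for g, row in report.items():
    print(g, row)
```

In this toy data the overall AUC looks respectable while one subgroup does noticeably worse, which is exactly the kind of gap a single headline number hides. Reporting the subgroup sizes next to the AUCs also makes it obvious when a subgroup is too small to support any conclusion.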

So is the FDA actually requiring this, or has this been overlooked in the rush to bring AI to market?

I don’t know about you, but I’m worried. AI might be superhuman, but not every superpower is a force for good.

The line between superheroism and supervillainy is a fine one.

FOOTNOTES
* I want to be very clear that this is not yet peer reviewed. It is submitted to a journal, but we really want y’all to help us improve it too. This is a living piece of work that will be built on over time.
** “Minority groups” as defined by the NIH apparently refer only to racial and ethnic minorities, not, for example, gender and sexual minorities. No idea why.
*** Interestingly, there is some evidence that LGB folks tend to be over-represented in clinical trials for cancer, which is unexpected for an underserved group. Of course, the absence of the T in that acronym is unsurprising given the history of medical shitbaggery towards trans folks.
^ I hear a Canadian voice in my head exclaiming “Luuuuuuuuke! Noooooooooooo! How is thiiiis possible? It can’t be right, can it?” whenever I close my eyes now. Send help.
^^ cough the FDA should look into this cough
^^^ I did actually try to teach myself racial identity detection during this project, but no set of rules I could come up with worked much better than chance.
° this is not at all trivial. At minimum it would require test sets that are adequately powered to demonstrate performance for each racial subgroup, and in many locations accessing enough racially diverse patients can be challenging.

Luke Oakden-Rayner is a radiologist in South Australia. This post originally appeared on his blog here.


