Spotted this on Twitter:

Luke Oakden-Rayner is a PhD candidate / radiologist, and he took a look at ChestXray14, a chest X-ray dataset used to train ML models for radiological image analysis. He believes the label quality in this particular dataset is not good enough for training ML systems for medical diagnosis.

Some key quotes below:

The main issue is that when the author looked at the data, he noticed that this particular data set is plagued by label noise:

This is the most important part of this review. There is a major issue with deep learning in radiology, and it can be a disaster if you aren’t looking at your images.

If these labels are so inaccurate, and if the meaning of the labels is so questionable, how are the papers built on this dataset reporting decent performance? What are the models actually learning?

There was a popular paper from Zhang et al. a while back which showed that deep learning can fit random labels in training data. I don’t think this really surprised anyone who trains deep learning models, but it was held up by many naysayers (incorrectly, in my opinion) as evidence that deep learning is bad.

That is not what is happening here. Instead, we are seeing models that can learn to correctly output the ground-truth in the test set, even though the ground truth is nearly visually meaningless.

This is probably due to idiosyncrasies in the radiologists' free-text reports on the images and in the NLP methods used to extract the labels from them.
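To see how that kind of noise can creep in, here is a toy, purely hypothetical sketch of keyword-based label extraction from free-text reports. It is not the actual pipeline behind ChestXray14; it just illustrates why negation and report phrasing matter:

```python
# Purely hypothetical sketch of keyword-based label extraction from
# free-text radiology reports; NOT the actual ChestXray14 pipeline.
NEGATIONS = ("no ", "without ", "negative for ")

def naive_label(report: str, finding: str = "opacity") -> bool:
    """Flags the finding if the keyword appears anywhere in the report."""
    return finding in report.lower()

def label_with_negation(report: str, finding: str = "opacity") -> bool:
    """Slightly better: ignores sentences that negate the finding."""
    for sentence in report.lower().split("."):
        if finding in sentence and not any(neg in sentence for neg in NEGATIONS):
            return True
    return False

report = "No focal airspace opacity. Stable cardiomegaly."
print(naive_label(report))          # True  -> a false positive label
print(label_with_negation(report))  # False
```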

The author trained a simple DL model that achieved reasonably good performance on the test set (at least, if the labels are to be trusted). What about the label noise? Did the model perhaps learn to ignore it? As a radiologist, he thinks the answer is no:

Despite an apparent AUC of 0.7 we get really bad classification performance, in line with the label inaccuracy. The model didn’t just ignore the incorrect labels and produce sensible predictions. It was not robust to label noise. Most importantly, the AUC value does not reflect the clinical performance.

This is a huge problem.

This AI system learned to reliably produce meaningless predictions. It managed to learn image features that create the above groups of “opacity” cases with almost no airspace opacities, and “no opacity” cases with big groups of severely abnormal lungs.

This is such a problem, because unless you look at the pictures, the results look great. Each team has progressively performed better, got higher AUC scores, so it looks like they are ‘solving’ a serious medical task.

[…]

The underlying problem here is different from that in Rolnick et al. and Zhang et al., because the structured noise isn’t only in the training data. The label errors are consistent across the test data too. This means that if you learn to make bad medical predictions, you get higher test performance!
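To make the AUC point concrete, here is a small synthetic sketch (made-up numbers, not the ChestXray14 setup): when the labelling process is flawed in the same way on both splits, a model can score a respectable AUC against the dataset's test labels while being only weakly related to the true clinical finding.

```python
# Toy illustration: structured label noise that is consistent across
# train and test makes a model look good on the noisy test labels
# while saying little about the "true" clinical finding.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 20_000

# x0 correlates with the true clinical finding; x1 is a spurious feature
# (think report phrasing or scanner type) that the flawed labelling
# process keys on instead.
x = rng.normal(size=(n, 2))
true_label = (x[:, 0] + 0.3 * rng.normal(size=n)) > 0                    # what a radiologist would call it
noisy_label = (0.3 * x[:, 0] + x[:, 1] + 0.3 * rng.normal(size=n)) > 0   # what the labeller produced

# The same flawed labelling process on both splits -> consistent noise.
train, test = slice(0, n // 2), slice(n // 2, n)

model = LogisticRegression().fit(x[train], noisy_label[train])
scores = model.predict_proba(x[test])[:, 1]

print("AUC vs noisy test labels:  ", roc_auc_score(noisy_label[test], scores))  # looks good
print("AUC vs true clinical labels:", roc_auc_score(true_label[test], scores))  # much lower
```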

And finally TL;DR:

  • Compared to human visual assessment, the labels in the ChestXray14 dataset are inaccurate, unclear, and often describe medically unimportant findings.
  • These label problems are internally consistent within the data, meaning models can show “good test-set performance”, while still producing predictions that don’t make medical sense.
  • The above combination of problems means the dataset, as currently defined, is not fit for training medical systems, and research on the dataset cannot generate valid medical claims without significant additional justification.
  • Looking at the images is the basic “sanity check” of image analysis. If you don’t have someone who can understand your data looking at the images when you build a dataset, expect things to go very wrong.
  • Medical image data is full of stratifying elements; features that can help learn pretty much anything. Check that your model is doing what you think it is, every step of the way.
  • I will be releasing some new labels with the next post, and will show that deep learning can work on this dataset, as long as the labels are good enough.

The Lesson

I believe the most important point here generalizes beyond radiology: in difficult domains such as medical image analysis, you need to collaborate with domain experts who can look at your data and your results and tell you whether they make sense. In ordinary image classification tasks, even a CS or math undergraduate can do basic sanity checks: look at the test-set images, look at the model predictions, and check whether clouds or dogs are actually present when the model and the attached labels claim they are. But faced with a dataset like this, the hapless ML professional who does not know anything about radiology is at the mercy of the provided test-set labels.
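A minimal sketch of what that sanity check can look like in practice; the names (model, test_images, class_names) are placeholders, and I assume a Keras-style model whose predict method returns per-class probabilities:

```python
# A hedged sketch of the "look at the images" check, not tied to any
# particular codebase: model, test_images and class_names are placeholders.
import numpy as np
import matplotlib.pyplot as plt

def show_most_confident(model, test_images, class_names, class_idx):
    """Show the nine test images the model scores highest for one class."""
    probs = model.predict(test_images)[:, class_idx]
    top = np.argsort(probs)[::-1][:9]
    fig, axes = plt.subplots(3, 3, figsize=(9, 9))
    for ax, i in zip(axes.ravel(), top):
        ax.imshow(test_images[i].squeeze(), cmap="gray")
        ax.set_title(f"{class_names[class_idx]}: p={probs[i]:.2f}")
        ax.axis("off")
    plt.tight_layout()
    plt.show()
    # If a domain expert looks at this grid and does not see the finding
    # in most of these images, either the labels or the model are off.
```

The point is not the code; it is that someone who can actually read the images needs to be in the loop when this grid is reviewed.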

Another thing about this story worries me. Let’s imagine I had to work on this kind of data, where I’m not able to judge prediction quality by visual inspection and can’t do much more than report test-set performance. An expert weighs in and claims that the labels are bad and the results I obtained are nonsense. The original team claims that the labels are good and that test-set performance is a useful measure. What am I supposed to do?

I’m tempted to conclude that I would at least need to work with trustworthy radiologists (preferably several), and also try to learn something about radiology myself in the process.