SIIM’s Artificial Intelligence Webinar Series recently held a talk on the reproducibility crisis in machine learning, offering a seldom-heard perspective on the pitfalls of artificial intelligence in medical imaging.

Although rapid advancements in artificial intelligence, machine learning, and deep learning have been lauded by numerous startups and news articles, very few tools have been clinically deployed, mainly due to a lack of model generalizability. Dr. Maleki and Dr. Azizi’s talk highlighted several reasons behind, and solutions to, this “generalizability crisis.” More broadly, this problem can be compared to the “replication crisis” in scientific research, which refers to the difficulty of reproducing results from published studies. Like the replication crisis, the generalizability crisis limits the credibility of technological discoveries and advancements.

Various perspectives have been offered on the replication crisis in diverse domains of science. A common thread, however, is the “fossilization” of incorrect or non-replicable findings. This occurs when unreproducible and therefore unreliable results are assumed to be true, which can misdirect theories and future investigations, ultimately hindering scientific progress. Similarly, poor replicability of machine learning models can limit their further development due to limited accuracy.

To characterize this issue, machine learning scientists developed the concept of model generalizability. This refers to a tool’s ability to perform well on data outside its original training dataset. Many factors can limit generalizability, such as variability in patient demographics, medical imaging tools, and data workflows. Dr. Maleki and Dr. Azizi’s talk outlined three methodological reasons for limited generalizability in machine learning: biased training datasets, poor performance measurements, and the aggregation of different datasets without appropriate demographic characterization.

The first methodological challenge makes sense given the pipeline used to develop these machine learning tools, whereby data is split into training, validation, and test sets. Training datasets are used to “educate” the model; validation datasets are used to tune the model and compare candidate versions during development; and test sets are used for estimating the final model’s error on data it has not seen. The effectiveness of this pipeline depends on the training, validation, and test datasets being independent of one another. However, issues arise when data from the validation or test sets also appear in the training set, a problem often called data leakage. Because the model is then evaluated on cases it has effectively already seen, its performance is overestimated. Although there are several complex reasons behind this contamination, they are generally unintentional and due to the impracticality of developing large-scale datasets.
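One way to picture the independence requirement above is a split made at the patient level rather than the image level, so that no patient contributes images to more than one set. The following is a minimal sketch only: the record layout, the `split_by_patient` helper, and the 15% fractions are illustrative assumptions, not the method presented in the talk.

```python
import random


def split_by_patient(records, val_frac=0.15, test_frac=0.15, seed=0):
    """Split (patient_id, image_id) records so each patient lands in exactly one set.

    The record layout and default fractions are hypothetical choices for illustration.
    """
    patient_ids = sorted({pid for pid, _ in records})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)

    n_test = max(1, round(len(patient_ids) * test_frac))
    n_val = max(1, round(len(patient_ids) * val_frac))
    test_ids = set(patient_ids[:n_test])
    val_ids = set(patient_ids[n_test:n_test + n_val])

    splits = {"train": [], "val": [], "test": []}
    for pid, image_id in records:
        if pid in test_ids:
            splits["test"].append((pid, image_id))
        elif pid in val_ids:
            splits["val"].append((pid, image_id))
        else:
            splits["train"].append((pid, image_id))
    return splits


def patients_in(split):
    """Return the set of patient IDs present in one split."""
    return {pid for pid, _ in split}


# Toy data: 20 hypothetical patients with 5 images each.
records = [(f"patient_{i % 20}", f"image_{i}") for i in range(100)]
splits = split_by_patient(records)

# Leakage check: no patient may appear in more than one set.
assert patients_in(splits["train"]).isdisjoint(patients_in(splits["val"]))
assert patients_in(splits["train"]).isdisjoint(patients_in(splits["test"]))
assert patients_in(splits["val"]).isdisjoint(patients_in(splits["test"]))
```

Shuffling individual images instead of patients would likely place near-identical studies from the same person on both sides of the split, which is one plausible route to the overestimated performance described above.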

The second methodological problem relates to the metrics used to evaluate a machine learning model’s performance, namely accuracy, precision, and recall. Accuracy describes how often a model is correct; precision describes how well a model minimizes false positives; and recall describes how well a model minimizes false negatives. Due to its simplicity, accuracy has historically been a favored performance metric. However, it is increasingly recognized as inadequate in imbalanced datasets (e.g., when rare radiologic findings are overrepresented in datasets or difficult-to-describe imaging features are underrepresented), because a model can score high accuracy simply by favoring the most common class. Moreover, due to the high cost of false positives and false negatives in healthcare, it is essential to include precision and recall when estimating a model’s usefulness in clinical settings. One reason these statistics have not been favored is the difficulty of calculating them: unlike accuracy, which only needs a count of correct predictions, precision and recall require each prediction to be categorized as a true positive, false positive, or false negative.
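To make the contrast between these metrics concrete, here is a small sketch that computes all three from binary labels. The test set (10 positive cases in 1,000) and both sets of predictions are made-up numbers for illustration, and the `evaluate` helper is hypothetical; precision and recall are reported as 0.0 when their denominator is zero, which is a convention chosen here rather than a standard.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = finding present)."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def evaluate(y_true, y_pred):
    """Return accuracy, precision, and recall as a dictionary."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Share of positive calls that were correct (penalizes false positives).
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of real positives that were found (penalizes false negatives).
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical imbalanced test set: 10 positive cases out of 1,000.
y_true = [1] * 10 + [0] * 990

# Model A never reports a finding.
always_negative = [0] * 1000
# Model B finds 8 of the 10 positives but raises 20 false alarms.
some_detection = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970

baseline = evaluate(y_true, always_negative)
detector = evaluate(y_true, some_detection)
print("always negative:", baseline)
print("detector:       ", detector)
```

In this toy case the model that never reports a finding has the higher accuracy, yet its recall is zero; only precision and recall reveal that the second model is the one actually detecting disease.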

The final methodological error discussed is commonly known as “batch effects.” This occurs when training data vary because of differences in collection or processing, rather than because of the underlying medical conditions being studied. In radiology machine learning models, the most common cause is the inappropriate aggregation and labeling of different datasets. One example discussed was a pneumonia detection model trained on three datasets: a healthy adult dataset, an adult pneumonia dataset, and a healthy pediatric population dataset. To demonstrate “batch effects,” their research team aggregated cases of adult pneumonias and healthy children during the training of the machine learning model. The effect was that the model labeled both adult pneumonias and anatomical features of children as pneumonias. Extrapolating from this example, machine learning models could create false positive diagnoses from benign anatomic variants, different imaging standards or protocols, patient demographics, and unrelated or incidental diagnoses not filtered out of the training dataset. To produce a machine learning tool that surmounts this issue, a large and accurate dataset would have to be developed, and separate machine learning models would have to be developed for each diagnosis or benign feature. Both of these solutions require large amounts of resources. Therefore, the “batch effect” phenomenon is difficult to eliminate entirely and may in part be responsible for the so-called “hallucinations” in radiology artificial intelligence tools.
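A simple sanity check in the spirit of the example above is to cross-tabulate a metadata field against the diagnostic label before training, and flag any group that only ever appears with one label. This is a rough sketch under stated assumptions: the `(group, label)` sample layout, the group names, and the `single_label_groups` helper are hypothetical, and passing this check does not rule out subtler batch effects.

```python
from collections import Counter, defaultdict


def label_counts_by_group(samples):
    """Cross-tabulate (group, label) pairs, e.g. age group or source dataset vs. diagnosis."""
    table = defaultdict(Counter)
    for group, label in samples:
        table[group][label] += 1
    return table


def single_label_groups(samples):
    """Return groups that occur with only one label.

    For such groups a model can learn to recognize the group itself
    (for instance pediatric anatomy) instead of the disease.
    """
    table = label_counts_by_group(samples)
    return sorted(group for group, counts in table.items() if len(counts) == 1)


# Hypothetical aggregation resembling the example: adult pneumonias plus healthy children.
confounded = [("adult", "pneumonia")] * 50 + [("pediatric", "healthy")] * 50

# Adding healthy adults means the adult group now appears with both labels.
with_adult_controls = confounded + [("adult", "healthy")] * 50

print(single_label_groups(confounded))
print(single_label_groups(with_adult_controls))
```

In the first dataset every group maps to a single label, so age alone separates the classes; in the second, the pediatric group is still flagged because it contains no pneumonia cases.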

Taken together, this talk raised awareness about the generalizability crisis in machine learning and offered a realistic portrait of its implications for radiology. The talk conveyed serious challenges that machine learning faces in clinical settings but also hinted at possible solutions. Ultimately, researchers and firms aspire to build artificial intelligence tools that can accurately and specifically diagnose disease based on medical imaging, but such a tool remains elusive. Despite increasing research and corporate investment in the field, the rise in artificial intelligence publications and startups has not produced a proportional number of clinically applicable tools. The reasons discussed in Dr. Maleki and Dr. Azizi’s talk may help explain why this has been the case. If communicated to a broader audience, their arguments could temper some of the hype around artificial intelligence – and hopefully inspire more solutions!

Written by

Rahman Ladak

Publish date

Jul 9, 2024


