SIIM’s Artificial Intelligence Webinar Series recently hosted a talk on the reproducibility crisis in machine learning, offering a rarely heard perspective on the pitfalls of artificial intelligence in medical imaging.
Although rapid advancements in artificial intelligence, machine learning, and deep learning have been lauded by numerous startups and news articles, very few tools have been clinically deployed, largely because models fail to generalize. Dr. Maleki and Dr. Azizi’s talk highlighted several reasons behind, and solutions to, this “generalizability crisis.” Viewed more broadly, the problem parallels the “replication crisis” in scientific research, the difficulty of reproducing results from published studies. Like the replication crisis, the generalizability crisis undermines the credibility of technological discoveries and advancements.
Various perspectives have been offered on the replication crisis across diverse domains of science. A common thread, however, is the “fossilization” of incorrect or non-replicable findings: unreproducible, and therefore unreliable, results are assumed to be true, misleading theories and future investigations and ultimately hindering scientific progress. Similarly, machine learning models whose reported accuracy cannot be reproduced on new data provide a shaky foundation for further development.
To characterize this issue, machine learning scientists developed the concept of model generalizability. This refers to a tool’s ability to perform well on data outside its original training dataset. Many factors contribute to this limitation, such as variability in patient demographics, medical imaging tools, and data workflows. Dr. Maleki and Dr. Azizi’s talk outlined three methodological reasons for limited generalizability in machine learning: biased training datasets, poor performance measurements, and the aggregation of different datasets without appropriate demographic characterization.
The first methodological challenge stems from the pipeline used to develop these machine learning tools, in which data are split into training, validation, and test sets. The training set is used to “educate” the model; the validation set is used to tune the model and compare candidate versions; and the test set is used to estimate the model’s error on unseen data. This pipeline works only if the three sets remain independent of one another. Problems arise when the training sample is biased, most notably when data from the validation or test sets also appear in the training set, a form of leakage that leads to overestimation of the model’s performance. Although the causes of such biased sampling are complex, they are generally unintentional and stem from the impracticality of curating large-scale datasets.
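As a concrete illustration (my own sketch, not code from the talk), one common safeguard against this kind of leakage is to split at the patient level, so that no patient contributes images to more than one of the three sets. The sketch below uses scikit-learn’s GroupShuffleSplit on synthetic stand-in data; the array names, sizes, and features are all illustrative assumptions.

```python
# Minimal sketch: patient-level splitting to prevent leakage across
# the training, validation, and test sets. Data are synthetic stand-ins.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_images = 1000
X = rng.normal(size=(n_images, 16))                 # stand-in image features
y = rng.integers(0, 2, size=n_images)               # stand-in labels
patient_id = rng.integers(0, 200, size=n_images)    # several images per patient

# First split off a held-out test set by patient, not by image.
outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_val_idx, test_idx = next(outer.split(X, y, groups=patient_id))

# Then split the remainder into training and validation, again by patient.
inner = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
tr, va = next(inner.split(X[train_val_idx], y[train_val_idx],
                          groups=patient_id[train_val_idx]))
train_idx, val_idx = train_val_idx[tr], train_val_idx[va]

# Sanity check: the three sets share no patients.
assert not set(patient_id[train_idx]) & set(patient_id[val_idx])
assert not set(patient_id[train_idx]) & set(patient_id[test_idx])
assert not set(patient_id[val_idx]) & set(patient_id[test_idx])
```

Grouped splitting of this kind closes only one leakage route; duplicated studies across aggregated source datasets, for instance, would still require separate deduplication.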
The second methodological problem concerns the metrics used to evaluate a machine learning model’s performance, namely accuracy, precision, and recall. Accuracy describes how often a model is correct overall; precision describes how well it minimizes false positives; and recall describes how well it minimizes false negatives. Owing to its simplicity, accuracy has historically been the favored performance metric. However, it is increasingly recognized as inadequate for imbalanced datasets, in which some findings are overrepresented and others, such as rare or difficult-to-describe imaging features, are underrepresented; when a finding is rare, a model can score highly simply by labeling every study as normal. Moreover, because false positives and false negatives both carry high costs in healthcare, precision and recall are essential when estimating a model’s usefulness in clinical settings. One reason these statistics have been less favored is that they are harder to compute: unlike accuracy, they require each prediction to be categorized as a true positive, false positive, true negative, or false negative.
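A toy calculation (illustrative numbers, not from the talk) shows why accuracy alone misleads on imbalanced data: with a 5% prevalence of positive findings, a degenerate model that labels every study as normal still scores 95% accuracy, while precision and recall expose it as clinically useless.

```python
# Accuracy vs. precision/recall on a hypothetical imbalanced dataset.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1] * 5 + [0] * 95   # assumption: 5% of studies have the finding
y_pred = [0] * 100            # degenerate model: always predicts "normal"

print(accuracy_score(y_true, y_pred))                    # 0.95, looks strong
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0, no true positives
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0, misses every case
```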
The final methodological error discussed is commonly known as “batch effects.” These occur when training data vary because of differences in how they were collected or processed, rather than because of the underlying medical conditions being studied. In radiology machine learning models, the most common cause is the aggregation of different datasets with inappropriate labeling. One example discussed was a pneumonia detection model trained on three datasets: healthy adults, adults with pneumonia, and a healthy pediatric population. To demonstrate batch effects, the research team grouped adult pneumonia cases and healthy pediatric scans together during training. As a result, the model labeled both adult pneumonias and normal pediatric anatomy as pneumonia. Extrapolating from this example, machine learning models could produce false positive diagnoses from benign anatomic variants, differing imaging standards or protocols, patient demographics, and unrelated or incidental findings not filtered out of the training dataset. Overcoming this issue would require developing a large, accurately labeled dataset and building separate machine learning models for each diagnosis or benign feature; both solutions demand substantial resources. The “batch effect” phenomenon is therefore difficult to eliminate entirely and may be partly responsible for the so-called “hallucinations” of radiology artificial intelligence tools.
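To make the mechanism concrete, the following minimal synthetic sketch (my own illustration under stated assumptions, not the speakers’ experiment) reproduces a batch effect: during training, the data source is perfectly confounded with the diagnosis, so a classifier can score almost perfectly by learning a cohort artifact, then fall apart on external data where the confound is broken.

```python
# Synthetic batch-effect demo: the model learns the cohort, not the disease.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_cohort(n, disease_rate, source_marker):
    """Hypothetical cohort: a weak true disease signal plus a strong
    cohort-specific acquisition artifact (source_marker)."""
    y = (rng.random(n) < disease_rate).astype(int)
    disease_signal = y + rng.normal(0.0, 2.0, n)              # weak real signal
    source_signal = source_marker + rng.normal(0.0, 0.1, n)   # strong artifact
    return np.column_stack([disease_signal, source_signal]), y

# Training mirrors the talk's example: every pneumonia case comes from one
# cohort (marker 1.0) and every healthy scan from another (marker 0.0),
# so the cohort artifact is perfectly confounded with the label.
X_pos, y_pos = make_cohort(1000, 1.0, source_marker=1.0)
X_neg, y_neg = make_cohort(1000, 0.0, source_marker=0.0)
X_train = np.vstack([X_pos, X_neg])
y_train = np.concatenate([y_pos, y_neg])

model = LogisticRegression().fit(X_train, y_train)
print("internal accuracy:", model.score(X_train, y_train))

# External validation: one cohort containing both sick and healthy patients,
# so the artifact no longer tracks the disease.
X_ext, y_ext = make_cohort(1000, 0.5, source_marker=1.0)
print("external accuracy:", model.score(X_ext, y_ext))
```

In this toy setup, internal accuracy is near-perfect while external accuracy falls sharply toward chance, mirroring how a model trained on confounded, aggregated datasets can appear excellent until it meets data from a new site or population.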
Taken together, this talk raised awareness of the generalizability crisis in machine learning and offered a realistic portrait of its implications for radiology. It conveyed the serious challenges machine learning faces in clinical settings while also hinting at possible solutions. Ultimately, researchers and firms aspire to build artificial intelligence tools that can accurately and specifically diagnose disease from medical imaging, but such tools remain elusive. Despite growing research and corporate investment in the field, the rise in artificial intelligence publications and startups has not produced a proportional number of clinically applicable tools. The reasons discussed in Dr. Maleki and Dr. Azizi’s talk may help explain why. If communicated to a broader audience, their arguments could temper some of the hype around artificial intelligence and, hopefully, inspire more solutions!
