SIIM’s Artificial Intelligence Webinar Series recently hosted a talk on the reproducibility crisis in machine learning, offering a rarely heard perspective on the pitfalls of artificial intelligence in medical imaging.
Although rapid advancements in artificial intelligence, machine learning, and deep learning have been lauded by numerous startups and news articles, very few tools have been clinically deployed, largely because models fail to generalize. Dr. Maleki and Dr. Azizi’s talk highlighted several reasons behind, and solutions to, this “generalizability crisis.” Viewed more broadly, the problem parallels the “replication crisis” in scientific research, the difficulty of reproducing results from published studies. Like the replication crisis, the generalizability crisis undermines the credibility of technological discoveries and advancements.
Various perspectives have been offered on the replication crisis across diverse domains of science. A common thread, however, is the “fossilization” of incorrect or non-replicable findings: unreproducible, and therefore unreliable, results are assumed to be true, misleading theories and future investigations and ultimately hindering scientific progress. Similarly, machine learning models whose reported accuracy cannot be reproduced on new data provide a shaky foundation for further development.
To characterize this issue, machine learning scientists developed the concept of model generalizability. This refers to a tool’s ability to perform well on data outside its original training dataset. Many factors contribute to this limitation, such as variability in patient demographics, medical imaging tools, and data workflows. Dr. Maleki and Dr. Azizi’s talk outlined three methodological reasons for limited generalizability in machine learning: biased training datasets, poor performance measurements, and the aggregation of different datasets without appropriate demographic characterization.
The first methodological challenge stems from the pipeline used to develop these machine learning tools, in which data are split into training, validation, and test sets. The training set is used to “educate” the model; the validation set is used to tune the model and compare candidate versions; and the test set is used to estimate the model’s error on unseen data. This pipeline works only if the three sets remain independent of one another. Problems arise when the training sample is biased, most notably when data from the validation or test sets also appear in the training set, a form of leakage that leads to overestimation of the model’s performance. Although the causes of such biased sampling are complex, they are generally unintentional and stem from the impracticality of curating large-scale datasets.
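As a concrete illustration (my own sketch, not code from the talk), one common safeguard against this kind of leakage is to split at the patient level, so that no patient contributes images to more than one of the three sets. The sketch below uses scikit-learn’s GroupShuffleSplit on synthetic stand-in data; the array names, sizes, and features are all illustrative assumptions.

```python
# Minimal sketch: patient-level splitting to prevent leakage across
# the training, validation, and test sets. Data are synthetic stand-ins.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_images = 1000
X = rng.normal(size=(n_images, 16))                 # stand-in image features
y = rng.integers(0, 2, size=n_images)               # stand-in labels
patient_id = rng.integers(0, 200, size=n_images)    # several images per patient

# First split off a held-out test set by patient, not by image.
outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_val_idx, test_idx = next(outer.split(X, y, groups=patient_id))

# Then split the remainder into training and validation, again by patient.
inner = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
tr, va = next(inner.split(X[train_val_idx], y[train_val_idx],
                          groups=patient_id[train_val_idx]))
train_idx, val_idx = train_val_idx[tr], train_val_idx[va]

# Sanity check: the three sets share no patients.
assert not set(patient_id[train_idx]) & set(patient_id[val_idx])
assert not set(patient_id[train_idx]) & set(patient_id[test_idx])
assert not set(patient_id[val_idx]) & set(patient_id[test_idx])
```

Grouped splitting of this kind closes only one leakage route; duplicated studies across aggregated source datasets, for instance, would still require separate deduplication.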
The second methodological problem concerns the metrics used to evaluate a machine learning model’s performance, namely accuracy, precision, and recall. Accuracy describes how often a model is correct overall; precision describes how well it minimizes false positives; and recall describes how well it minimizes false negatives. Owing to its simplicity, accuracy has historically been the favored performance metric. However, it is increasingly recognized as inadequate for imbalanced datasets, in which some findings are overrepresented and others, such as rare or difficult-to-describe imaging features, are underrepresented; when a finding is rare, a model can score highly simply by labeling every study as normal. Moreover, because false positives and false negatives both carry high costs in healthcare, precision and recall are essential when estimating a model’s usefulness in clinical settings. One reason these statistics have been less favored is that they are harder to compute: unlike accuracy, they require each prediction to be categorized as a true positive, false positive, true negative, or false negative.
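A toy calculation (illustrative numbers, not from the talk) shows why accuracy alone misleads on imbalanced data: with a 5% prevalence of positive findings, a degenerate model that labels every study as normal still scores 95% accuracy, while precision and recall expose it as clinically useless.

```python
# Accuracy vs. precision/recall on a hypothetical imbalanced dataset.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1] * 5 + [0] * 95   # assumption: 5% of studies have the finding
y_pred = [0] * 100            # degenerate model: always predicts "normal"

print(accuracy_score(y_true, y_pred))                    # 0.95, looks strong
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0, no true positives
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0, misses every case
```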
The final methodological error discussed is commonly known as “batch effects.” These occur when training data vary because of differences in how they were collected or processed, rather than because of the underlying medical conditions being studied. In radiology machine learning models, the most common cause is the aggregation of different datasets with inappropriate labeling. One example discussed was a pneumonia detection model trained on three datasets: healthy adults, adults with pneumonia, and a healthy pediatric population. To demonstrate batch effects, the research team grouped adult pneumonia cases and healthy pediatric scans together during training. As a result, the model labeled both adult pneumonias and normal pediatric anatomy as pneumonia. Extrapolating from this example, machine learning models could produce false positive diagnoses from benign anatomic variants, differing imaging standards or protocols, patient demographics, and unrelated or incidental findings not filtered out of the training dataset. Overcoming this issue would require developing a large, accurately labeled dataset and building separate machine learning models for each diagnosis or benign feature; both solutions demand substantial resources. The “batch effect” phenomenon is therefore difficult to eliminate entirely and may be partly responsible for the so-called “hallucinations” of radiology artificial intelligence tools.
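To make the mechanism concrete, the following minimal synthetic sketch (my own illustration under stated assumptions, not the speakers’ experiment) reproduces a batch effect: during training, the data source is perfectly confounded with the diagnosis, so a classifier can score almost perfectly by learning a cohort artifact, then fall apart on external data where the confound is broken.

```python
# Synthetic batch-effect demo: the model learns the cohort, not the disease.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_cohort(n, disease_rate, source_marker):
    """Hypothetical cohort: a weak true disease signal plus a strong
    cohort-specific acquisition artifact (source_marker)."""
    y = (rng.random(n) < disease_rate).astype(int)
    disease_signal = y + rng.normal(0.0, 2.0, n)              # weak real signal
    source_signal = source_marker + rng.normal(0.0, 0.1, n)   # strong artifact
    return np.column_stack([disease_signal, source_signal]), y

# Training mirrors the talk's example: every pneumonia case comes from one
# cohort (marker 1.0) and every healthy scan from another (marker 0.0),
# so the cohort artifact is perfectly confounded with the label.
X_pos, y_pos = make_cohort(1000, 1.0, source_marker=1.0)
X_neg, y_neg = make_cohort(1000, 0.0, source_marker=0.0)
X_train = np.vstack([X_pos, X_neg])
y_train = np.concatenate([y_pos, y_neg])

model = LogisticRegression().fit(X_train, y_train)
print("internal accuracy:", model.score(X_train, y_train))

# External validation: one cohort containing both sick and healthy patients,
# so the artifact no longer tracks the disease.
X_ext, y_ext = make_cohort(1000, 0.5, source_marker=1.0)
print("external accuracy:", model.score(X_ext, y_ext))
```

In this toy setup, internal accuracy is near-perfect while external accuracy falls sharply toward chance, mirroring how a model trained on confounded, aggregated datasets can appear excellent until it meets data from a new site or population.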
Taken together, this talk raised awareness of the generalizability crisis in machine learning and offered a realistic portrait of its implications for radiology. It conveyed the serious challenges machine learning faces in clinical settings while also hinting at possible solutions. Ultimately, researchers and firms aspire to build artificial intelligence tools that can accurately and specifically diagnose disease from medical imaging, but such tools remain elusive. Despite growing research and corporate investment in the field, the rise in artificial intelligence publications and startups has not produced a proportional number of clinically applicable tools. The reasons discussed in Dr. Maleki and Dr. Azizi’s talk may help explain why. If communicated to a broader audience, their arguments could temper some of the hype around artificial intelligence and, hopefully, inspire more solutions!
