Ethics code: IR.LARUMS.REC.1404.019
1- Department of Basic Sciences, School of Medicine, Larestan University of Medical Sciences, Lar, Iran
2- Department of Basic Sciences, School of Medicine, Larestan University of Medical Sciences, Lar, Iran, z.amirkhani1357@gmail.com
Abstract:
Background and Objective: The rapid integration of Artificial Intelligence (AI) into medical education has created a need for quantitative evidence synthesis. This study sought to benchmark AI performance on medical knowledge assessments and to evaluate the preliminary effectiveness of AI‑driven educational interventions.
Materials & Methods: A systematic review and dual meta‑analysis were conducted in accordance with PRISMA guidelines. Five databases (PubMed, Web of Science, Embase, Scopus, and Google Scholar) were searched from inception through February 28, 2024. AI performance was evaluated using 35 accuracy data points derived from seven benchmarking studies encompassing 2,341 examination questions. The effectiveness of educational interventions was assessed using data from eight Randomized Controlled Trials (RCTs) involving 574 medical, dental, and pharmacy students. Random‑effects models were applied, including a three‑level proportion meta‑analysis for AI accuracy to account for within‑study dependence and a Hedges’ g meta‑analysis for intervention outcomes. Heterogeneity was quantified using the I² statistic, alongside exploratory meta‑regression and sensitivity analyses.
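The intervention arm of the analysis described above (Hedges' g, random-effects pooling, I², and leave-one-out sensitivity analysis) can be illustrated with a minimal sketch. It assumes the DerSimonian-Laird estimator for the between-study variance, which the abstract does not specify, and it does not reproduce the three-level proportion model used for AI accuracy. All function names and the numbers in the usage lines are hypothetical and are not taken from the included trials.

```python
import math


def hedges_g(m1, s1, n1, m2, s2, n2):
    """Return (g, variance of g) for two independent groups."""
    df = n1 + n2 - 2
    s_pooled = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df)
    d = (m1 - m2) / s_pooled                      # Cohen's d
    j = 1 - 3 / (4 * df - 1)                      # small-sample correction
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    return j * d, j ** 2 * var_d


def random_effects(effects, variances):
    """DerSimonian-Laird pooling (assumed estimator).

    Returns (pooled effect, tau^2, I^2 in percent).
    """
    w = [1 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, tau2, i2


def leave_one_out(effects, variances):
    """Pooled effect recomputed with each study omitted in turn."""
    return [
        random_effects(effects[:i] + effects[i + 1:],
                       variances[:i] + variances[i + 1:])[0]
        for i in range(len(effects))
    ]


# Hypothetical effect sizes (illustrative only); the last one mimics an outlier.
example_effects = [0.4, 0.6, 0.5, 6.0]
example_variances = [0.05, 0.05, 0.05, 0.05]
pooled, tau2, i2 = random_effects(example_effects, example_variances)
loo = leave_one_out(example_effects, example_variances)
```

In this toy input, omitting the outlying study gives the smallest leave-one-out estimate, which is the pattern a sensitivity analysis is meant to expose.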
Results: The analyses yielded two primary findings. First, AI models demonstrated a pooled accuracy of 70.9% (95% CI: 65.1%–75.9%) on standardized medical examinations, with performance improving across successive model generations. Second, AI‑based educational interventions showed a large pooled effect size; however, this estimate was unstable due to a highly influential outlier (g = 6.72). Exclusion of this study altered the pooled effect from g = 1.40 to g = 1.95, with substantial heterogeneity observed (I² = 93.6%). This variability was largely attributable to small, early‑stage studies reporting disproportionately large effects. Leave‑one‑out sensitivity analyses produced effect sizes ranging from g = 1.26 to g = 1.45, while an inverse association between sample size and effect magnitude (p < 0.001) suggested systematic overestimation in smaller trials.
Conclusion: Advanced AI systems exhibit robust medical knowledge; however, evidence supporting their educational effectiveness remains preliminary and potentially biased. While the findings are encouraging, they highlight the need for larger, methodologically rigorous trials to establish reliable effect sizes and to inform the responsible integration of AI into medical education.
Article Type: Review | Subject: Medical Education
Received: 2025/12/27 | Accepted: 2026/05/10