Journal of Medical Education Development

Evaluation of AI support for medical training in resource-constrained settings: performance of GPT-5 Pro, Gemini 2.5 Pro, and DeepSeek V3 on real examination questions

Medical Education
Original Research

Background & Objective: Recent advances in Large Language Models (LLMs) have expanded their potential applications in medical education and assessment. This study compared the performance of GPT-5 Pro (OpenAI), Gemini 2.5 Pro (Google DeepMind), and DeepSeek V3 (DeepSeek AI) on authentic, faculty-validated Multiple-Choice Questions (MCQs) from an Algerian francophone Medical Faculty.

Materials & Methods: This parallel, cross-sectional comparative evaluation was carried out under standardized online conditions. A total of 480 faculty-validated, non-public MCQs from a private subscription repository, covering four pre-clinical modules and four clinical modules, were presented to each model in independent chat sessions. Accuracy was compared across models using Cochran's Q and pairwise McNemar tests with Holm correction. Intra-model subgroup analyses (module, study cycle, question type, response format, and temporal factors) used chi-square or Mann–Whitney tests, with p < 0.05 considered significant.

Results: Gemini 2.5 Pro achieved the highest accuracy (447/480, 93.1% [95% CI: 90.5–95.1]), followed by GPT-5 Pro (430/480, 89.6% [95% CI: 86.5–92.0]) and DeepSeek V3 (429/480, 89.4% [95% CI: 86.3–91.8]). The overall difference in accuracy was significant (Cochran's Q = 8.65, p = 0.013), with a small global effect size (Kendall's W = 0.009). Pairwise testing showed that Gemini 2.5 Pro outperformed both competitors (p = 0.049), whereas GPT-5 Pro and DeepSeek V3 did not differ (p = 1.000). Within-model accuracy was stable across subgroups; non-responses were rare (< 2%) and did not change the ranking.
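The arithmetic behind the reported intervals and effect size can be checked from the counts alone. The following is a minimal sketch, not the authors' analysis code: it assumes the 95% confidence intervals are Wilson score intervals (the abstract does not name the method) and that Kendall's W was derived from Cochran's Q with the Friedman-type conversion W = Q / (n(k − 1)). The function name `wilson_ci` and the variable names are illustrative. Under these assumptions the output agrees with the reported values to the stated precision.

```python
import math


def wilson_ci(correct, total, z=1.959964):
    """Two-sided Wilson score interval for a binomial proportion.

    Assumption: the abstract does not state which interval was used;
    Wilson is chosen here because it is a common default.
    """
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Correct answers out of 480 MCQs, as reported in the Results.
N_QUESTIONS = 480
reported = {"Gemini 2.5 Pro": 447, "GPT-5 Pro": 430, "DeepSeek V3": 429}

for model, correct in reported.items():
    low, high = wilson_ci(correct, N_QUESTIONS)
    print(f"{model}: {100 * correct / N_QUESTIONS:.1f}% "
          f"[{100 * low:.1f}, {100 * high:.1f}]")

# Cochran's Q is referred to a chi-square distribution with k - 1 degrees
# of freedom. For k = 3 models (df = 2) the upper tail is exactly exp(-Q / 2).
q_stat, k_models = 8.65, 3
p_value = math.exp(-q_stat / 2)

# Assumed conversion from Q to Kendall's W (Friedman-type formula).
kendall_w = q_stat / (N_QUESTIONS * (k_models - 1))
print(f"p = {p_value:.3f}, W = {kendall_w:.3f}")
```

The pairwise McNemar tests cannot be reproduced this way, because they depend on the per-question discordant counts between models, which the abstract does not report.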
Conclusion: All tested LLMs demonstrated strong competence on structured medical MCQs and may support supervised formative learning in resource-constrained settings. However, although between-model differences were statistically significant, their absolute educational impact was modest, and their effect on real learning outcomes remains uncertain. Key limitations include potential residual training overlap, single-source MCQ sampling, and absence of explanation-quality assessment; future multicenter longitudinal studies should evaluate open-ended clinical reasoning and learning outcomes.

Keywords: large language models, education, medical, artificial intelligence, formative assessment, resource-limited settings

4 11
http://edujournal.zums.ac.ir/browse.php?a_code=A-12-3487-1&slc_lang=en&sid=1

Redouene Sid Ahmed Benazzouz; r.benazzouz@lagh-univ.dz; ORCID 0009-0000-3614-5883; Yes; Faculty of Medicine, Laghouat University, Laghouat, Algeria
Massinissa Benyagoub; m.benyagoub@lagh-univ.dz; ORCID 0000-0001-9556-0331; No; Faculty of Medicine, Laghouat University, Laghouat, Algeria
Yacine Boufatah; taha.boufatah@gmail.com; ORCID 0009-0001-9605-5756; No; Faculty of Medicine, Laghouat University, Laghouat, Algeria
Fodhil Sadeki; sadekifodhil@gmail.com; ORCID 0009-0009-5383-175X; No; Faculty of Medicine, Laghouat University, Laghouat, Algeria
Mohamed Safouane Benazzouz; ms.benazzouz@univ-alger.dz; ORCID 0000-0003-1980-8977; No; Pasteur Institute of Algeria, Algiers, Algeria
Mounir Ould Setti; ouldsettimounir@gmail.com; ORCID 0000-0002-8298-110X; No; Alma Mater Europaea University, Vienna, Austria