Evaluation of AI support for medical training in resource-constrained settings: performance of GPT-5 Pro, Gemini 2.5 Pro, and DeepSeek V3 on real examination questions

Benazzouz, Redouene Sid Ahmed; Benyagoub, Massinissa; Boufatah, Yacine; Sadeki, Fodhil; Benazzouz, Mohamed Safouane; Ould Setti, Mounir

Volume 19, Issue 2 (2026) J Med Edu Dev 2026, 19(2): 4-11

Ethics code: Ethics Committee of the Faculty of Medicine of Laghouat University, Algeria (Protocol No. 19/2025).

Benazzouz R S A, Benyagoub M, Boufatah Y, Sadeki F, Benazzouz M S, Ould Setti M. Evaluation of AI support for medical training in resource-constrained settings: performance of GPT-5 Pro, Gemini 2.5 Pro, and DeepSeek V3 on real examination questions. J Med Edu Dev 2026; 19(2): 4-11
URL: http://edujournal.zums.ac.ir/article-1-2664-en.html

Redouene Sid Ahmed Benazzouz*¹, Massinissa Benyagoub², Yacine Boufatah², Fodhil Sadeki², Mohamed Safouane Benazzouz³, Mounir Ould Setti⁴

1- Faculty of Medicine, Laghouat University, Laghouat, Algeria, r.benazzouz@lagh-univ.dz
2- Faculty of Medicine, Laghouat University, Laghouat, Algeria
3- Pasteur Institute of Algeria, Algiers, Algeria & Faculty of Pharmacy, University of Algiers, Algiers, Algeria
4- Alma Mater Europaea University, Vienna, Austria & Observational Studies Germany, Real World Solutions, IQVIA, Frankfurt am Main, Germany

Abstract:

Background & Objective: Recent advances in Large Language Models (LLMs) have expanded their potential applications in medical education and assessment. This study compared the performance of GPT-5 Pro (OpenAI), Gemini 2.5 Pro (Google DeepMind), and DeepSeek V3 (DeepSeek AI) on authentic, faculty-validated Multiple-Choice Questions (MCQs) from an Algerian francophone Medical Faculty.
Materials & Methods: This parallel, cross-sectional comparative evaluation was carried out under standardized online conditions. A total of 480 faculty-validated, non-public MCQs from a private subscription repository, covering four pre-clinical modules and four clinical modules, were presented to each model in independent chat sessions. Accuracy was compared across models using Cochran’s Q and pairwise McNemar tests with Holm correction. Intra-model subgroup analyses (module, study cycle, question type, response format, and temporal factors) used chi-square or Mann–Whitney tests, with p < 0.05 considered significant.
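The pairwise step described above (McNemar tests with Holm correction) can be sketched as follows. This is a minimal illustration and not the authors' analysis code: the abstract does not say whether the exact or the chi-square form of McNemar's test was used, so the exact binomial form is assumed here, and the labels A, B, C and all discordant counts are hypothetical placeholders, not study data.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    b = items the first model answered correctly and the second did not,
    c = the reverse. Under the null hypothesis the discordant items split
    50/50, so the test reduces to a two-sided binomial test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Smallest p is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, (m - rank) * pvalues[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


# Hypothetical discordant counts for three model pairs (NOT study data).
discordant = {"A vs B": (25, 10), "A vs C": (8, 2), "B vs C": (12, 11)}
raw = [mcnemar_exact(b, c) for b, c in discordant.values()]
adjusted = holm(raw)

for name, p_raw, p_adj in zip(discordant, raw, adjusted):
    print(f"{name}: raw p = {p_raw:.3f}, Holm-adjusted p = {p_adj:.3f}")
```

Only the questions on which two models disagree enter McNemar's test, which is why paired comparison suits a design where every model answers the same 480 items.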
Results: Gemini 2.5 Pro achieved the highest accuracy (447/480, 93.1% [95% CI: 90.5 – 95.1]), followed by GPT-5 Pro (430/480, 89.6% [95% CI: 86.5 – 92.0]) and DeepSeek V3 (429/480, 89.4% [95% CI: 86.3 – 91.8]). The overall difference in accuracy was significant (Cochran’s Q = 8.65, p = 0.013), with a small global effect size (Kendall’s W = 0.009). Pairwise testing showed Gemini 2.5 Pro performed better than both competitors (p = 0.049), whereas GPT-5 Pro and DeepSeek V3 did not differ (p = 1.000). Within-model accuracy was stable across subgroups; non-responses were rare (< 2%) and did not change ranking.
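The summary statistics above can be rechecked from the counts given in the abstract. The sketch below rests on three assumptions that the abstract does not confirm: that the confidence intervals are Wilson score intervals, that the Cochran's Q p-value comes from a chi-square distribution with k - 1 = 2 degrees of freedom, and that Kendall's W was obtained with the usual conversion W = Q / (n(k - 1)). Function and variable names are illustrative.

```python
from math import exp, sqrt


def wilson_ci(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


N_ITEMS = 480  # questions answered by every model
correct = {"Gemini 2.5 Pro": 447, "GPT-5 Pro": 430, "DeepSeek V3": 429}

intervals = {}
for model, k in correct.items():
    lo, hi = wilson_ci(k, N_ITEMS)
    intervals[model] = (round(100 * lo, 1), round(100 * hi, 1))
    print(f"{model}: {100 * k / N_ITEMS:.1f}% "
          f"[95% CI: {100 * lo:.1f} to {100 * hi:.1f}]")

Q = 8.65       # Cochran's Q as reported
K_MODELS = 3   # number of models compared

# For 2 degrees of freedom the chi-square survival function is exp(-x / 2).
p_value = exp(-Q / 2)

# Assumed conversion from Cochran's Q to Kendall's W.
kendall_w = Q / (N_ITEMS * (K_MODELS - 1))

print(f"p = {p_value:.3f}, Kendall's W = {kendall_w:.3f}")
```

Under these assumptions the recomputed values agree with the reported figures to the precision given. The very small W shows why a statistically significant Q can still correspond to a modest practical difference between models.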
Conclusion: All tested LLMs demonstrated strong competence on structured medical MCQs and may support supervised formative learning in resource-constrained settings. However, although between-model differences were statistically significant, their absolute educational impact was modest, and their effect on real learning outcomes remains uncertain. Key limitations include potential residual training overlap, single-source MCQ sampling, and absence of explanation-quality assessment; future multicenter longitudinal studies should evaluate open-ended clinical reasoning and learning outcomes.

Keywords: large language models, education, medical, artificial intelligence, formative assessment, resource-limited settings