Volume 19, Issue 2 (2026)                   J Med Edu Dev 2026, 19(2): 4-11

Ethics code: Ethics Committee of the Faculty of Medicine of Laghouat University, Algeria (Protocol No. 19/2025).

Benazzouz R S A, Benyagoub M, Boufatah Y, Sadeki F, Benazzouz M S, Ould Setti M. Evaluation of AI support for medical training in resource-constrained settings: performance of GPT-5 Pro, Gemini 2.5 Pro, and DeepSeek V3 on real examination questions. J Med Edu Dev 2026; 19(2): 4-11
URL: http://edujournal.zums.ac.ir/article-1-2664-en.html
1- Faculty of Medicine, Laghouat University, Laghouat, Algeria, r.benazzouz@lagh-univ.dz
2- Faculty of Medicine, Laghouat University, Laghouat, Algeria
3- Pasteur Institute of Algeria, Algiers, Algeria & Faculty of Pharmacy, University of Algiers, Algiers, Algeria
4- Alma Mater Europaea University, Vienna, Austria & Observational Studies Germany, Real World Solutions, IQVIA, Frankfurt am Main, Germany
Abstract:
Background & Objective: Recent advances in Large Language Models (LLMs) have expanded their potential applications in medical education and assessment. This study compared the performance of GPT-5 Pro (OpenAI), Gemini 2.5 Pro (Google DeepMind), and DeepSeek V3 (DeepSeek AI) on authentic, faculty-validated Multiple-Choice Questions (MCQs) from an Algerian francophone Medical Faculty.
Materials & Methods: This parallel, cross-sectional comparative evaluation was carried out under standardized online conditions. A total of 480 faculty-validated, non-public MCQs from a private subscription repository, covering four pre-clinical modules and four clinical modules, were presented to each model in independent chat sessions. Accuracy was compared across models using Cochran’s Q and pairwise McNemar tests with Holm correction. Intra-model subgroup analyses (module, study cycle, question type, response format, and temporal factors) used chi-square or Mann–Whitney tests, with p < 0.05 considered significant.
Results: Gemini 2.5 Pro achieved the highest accuracy (447/480, 93.1% [95% CI: 90.5 – 95.1]), followed by GPT-5 Pro (430/480, 89.6% [95% CI: 86.5 – 92.0]) and DeepSeek V3 (429/480, 89.4% [95% CI: 86.3 – 91.8]). The overall difference in accuracy was significant (Cochran’s Q = 8.65, p = 0.013), with a small global effect size (Kendall’s W = 0.009). Pairwise testing showed Gemini 2.5 Pro performed better than both competitors (p = 0.049), whereas GPT-5 Pro and DeepSeek V3 did not differ (p = 1.000). Within-model accuracy was stable across subgroups; non-responses were rare (< 2%) and did not change the ranking.
Conclusion: All tested LLMs demonstrated strong competence on structured medical MCQs and may support supervised formative learning in resource-constrained settings. However, although between-model differences were statistically significant, their absolute educational impact was modest, and their effect on real learning outcomes remains uncertain. Key limitations include potential residual training overlap, single-source MCQ sampling, and absence of explanation-quality assessment; future multicenter longitudinal studies should evaluate open-ended clinical reasoning and learning outcomes.
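The accuracy intervals and the effect size reported in the Results can be checked from the published counts alone. The sketch below is not the authors' code. It assumes the 95% confidence intervals are Wilson score intervals (the abstract does not name the method) and that Kendall's W was derived from Cochran's Q through the usual relation W = Q / (n(k - 1)), with n = 480 questions and k = 3 models. The function names `wilson_interval` and `kendalls_w_from_q` are illustrative only.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (assumed CI method)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


def kendalls_w_from_q(q, n_items, k_conditions):
    """Kendall's W from Cochran's Q, assuming W = Q / (n * (k - 1))."""
    return q / (n_items * (k_conditions - 1))


# Correct-answer counts taken from the Results paragraph (480 MCQs per model).
N_QUESTIONS = 480
correct_counts = {
    "Gemini 2.5 Pro": 447,
    "GPT-5 Pro": 430,
    "DeepSeek V3": 429,
}

for model, correct in correct_counts.items():
    low, high = wilson_interval(correct, N_QUESTIONS)
    accuracy = 100 * correct / N_QUESTIONS
    print(f"{model}: {accuracy:.1f}% [95% CI: {100 * low:.1f} - {100 * high:.1f}]")

# Effect size from the reported Cochran's Q = 8.65 across the 3 models.
print(f"Kendall's W = {kendalls_w_from_q(8.65, N_QUESTIONS, 3):.3f}")
```

Under these assumptions the computed bounds and effect size should agree with the reported values to the stated precision. Cochran's Q and the McNemar tests themselves cannot be recomputed this way, because they require the per-question correctness of each model, which the abstract does not provide.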

 
Article Type: Original Research | Subject: Medical Education
Received: 2025/12/11 | Accepted: 2026/02/22 | Published: 2026/04/01



Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

© 2026 CC BY-NC 4.0 | Journal of Medical Education Development | All rights reserved.