Abstract
Background & Objective: The increasing complexity and volume of data in medical education highlight the importance of using advanced analytical techniques, such as data mining, to analyze educational data. This review aims to identify and assess the applications of educational data mining in medical education.
Materials & Methods: This research is a scoping review conducted based on the Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR). Data was collected on January 8, 2025, utilizing search strategies specifically tailored for the Scopus, Web of Science (WOS), and PubMed databases. The inclusion criteria consisted of research articles related to medical education and data mining. In contrast, the exclusion criteria included non-research articles, articles written in languages other than English, and articles that had been retracted. The screening of articles was performed in three stages: titles, abstracts, and full texts. Finally, the selected articles were reviewed and reported based on data mining tools, algorithms, software, and results.
Results: The data mining applications identified were categorized into four main themes: predicting students' performance, identifying at-risk students, analyzing student interactions in online learning, and evaluating the quality of exams. Algorithms that are commonly used include Artificial Neural Networks (ANN), Naive Bayes, and K-means clustering.
Conclusion: Data mining is a powerful tool for analyzing educational data, particularly for planners in medical sciences. It can help improve the quality of educational systems and enhance student academic success through its various techniques. The intentional use of data mining can also support strategic decision-making within educational systems, leading to improved teaching quality and a reduction in socio-economic disparities among students.
Introduction
Medical sciences education is critical because it is one of the fields that provides professional staff for health services. In contemporary times, modern tools such as artificial intelligence play a vital role in advancing and enhancing education in medical sciences, contributing significantly to improving learning quality and facilitating education [1]. Additionally, artificial intelligence has demonstrated its capabilities in the field of medical sciences [2]. An emerging and effective set of tools in the field is Educational Data Mining (EDM). Leveraging artificial intelligence and machine learning, EDM can effectively identify complex patterns and uncover hidden relationships in educational data [3].
EDM is an interdisciplinary field that combines education and computer science to extract meaningful patterns and actionable knowledge from large-scale educational data. Unlike traditional statistical techniques that test predefined hypotheses, EDM emphasizes a data-driven discovery approach, where hypotheses emerge from the data itself [4]. By employing various algorithms and tools, EDM enables researchers to examine student behavior, detect learning patterns, and optimize educational settings [5, 6]. One of the key theoretical foundations of EDM is the learner-centered analysis framework, which emphasizes analyzing students' interactions within their learning environments. Additional frameworks, including the learning cycle, cognitive models, and motivational theories, provide conceptual bases for interpreting data within the context of medical education. Identifying and applying these frameworks allows for a deeper understanding of educational dynamics and supports more meaningful analyses [7, 8]. EDM encompasses a range of analytical tasks, including description, prediction, estimation, classification, clustering, and association analysis. These tasks are implemented through numerous data mining techniques, including regression, Naive Bayes, decision trees, neural networks, K-means clustering, Apriori, and FP-Growth algorithms [9]. In practice, EDM has been applied to predict student dropout rates using regression models [10, 11], forecast academic performance using techniques such as random forests, k-nearest neighbors, and support vector machines [12], and discover hidden relationships in educational datasets through association rule mining [13]. Clustering algorithms have also been used to group students based on performance indicators [14]. Several studies have demonstrated the practical value of EDM. For example, Rueangket et al. identified key factors affecting medical students' learning outcomes [15], while other research highlighted the predictive power of academic history and demographic features in determining student success [16]. Additionally, Yağcı employed a machine learning-based model to predict students' final grades using midterm results, highlighting how data-driven insights can inform decision-making in higher education [12]. Similarly, the study by Feng et al. combined clustering, discriminant analysis, and Convolutional Neural Networks (CNNs) to predict students' academic performance, highlighting the potential of hybrid EDM models to enhance prediction accuracy and early intervention in education [17]. Overall, EDM has demonstrated significant potential in improving educational quality, identifying at-risk students, and facilitating targeted interventions.
Despite the growing use of EDM in medical education, a comprehensive and structured review of its applications, tools, techniques, and outcomes has yet to be conducted. Moreover, it remains unclear which educational aspects have received the most attention in existing studies and what gaps persist in the application of these techniques within medical education. Considering the increasing complexity of medical education and the vast amount of data generated in the medical field, there is a growing need for advanced tools and techniques to analyze this data and extract meaningful insights. Therefore, the objective of this study is to conduct a scoping review to (1) identify the applications of data mining in medical education, (2) categorize the tools and techniques employed, and (3) summarize the key outcomes reported in the literature. By systematically reviewing relevant publications, this study aims to provide a comprehensive overview of how data mining contributes to the advancement of medical education. The data items extracted in this review were aligned with the PCC (Population, Concept, and Context) framework, which served as the foundation for defining the study's scope. The "Population" element referred to any studies focusing on learners, educators, or educational environments within the field of medical education. This included, but was not limited to, medical students, instructors, and academic institutions involved in various forms of medical teaching and learning. The "Concept" centered on the application of data mining or text mining techniques. This included studies that utilized algorithmic approaches or computational models to extract patterns, predict outcomes, or derive insights from educational data. Both structured data mining methods and unstructured text mining approaches were considered within this category, providing a broader perspective on the analytical tools used in the studies. 
The "Context" referred to any educational setting related to medicine. This encompassed a wide range of academic environments, including undergraduate medical education, postgraduate training, continuing medical education, and professional development programs. By including diverse contexts, the review aimed to capture the full breadth of how data mining techniques are employed across different stages and settings of medical education.
Materials & Methods
Design and setting(s)
This study was a scoping review conducted using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) guidelines [18]. The objective was to identify and evaluate studies on the application of data mining in medical education.
Research framework
The research objectives were defined using the PCC (Population, Concept, and Context) framework. The population included studies related to learners, educators, or educational environments in the field of medical education. The concept was the application of data mining or text mining techniques. The context included any medical education setting, such as undergraduate, graduate, continuing medical education, or professional training programs.
Information sources and search strategy
The literature search was carried out on January 8, 2025, across three reputable databases: PubMed, Web of Science (WOS), and Scopus. The search was designed to retrieve all relevant articles related to data mining, education, and medical sciences. No time restrictions were applied, and the coverage extended to all articles published up to January 2025. The search strategy used the following queries:
Scopus: TITLE-ABS-KEY ((education* OR "literacy program*" OR "training program*" OR workshop) AND ("data mining" OR "text mining") AND (medic*))
Web of Science (WOS): TS = ((education* OR "literacy program*" OR "training program*" OR workshop) AND ("data mining" OR "text mining") AND (medic*))
PubMed: (education*[Title/Abstract] OR "literacy program*"[Title/Abstract] OR "training program*"[Title/Abstract] OR workshop[Title/Abstract]) AND ("data mining"[Title/Abstract] OR "text mining"[Title/Abstract]) AND (medic*[Title/Abstract])
The term "text mining" was included in the search strategy because it refers to the process of extracting patterns, concepts, or meaningful information from unstructured textual data such as narrative notes, open-ended responses, or written messages, whereas data mining deals with structured databases [19]. Its inclusion was intended to ensure comprehensive coverage of relevant literature.
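The three queries share the same Boolean core and differ only in their field syntax. The following is a minimal Python sketch of that structure, not part of the original search protocol. The function and constant names are illustrative assumptions, and the output should be checked against each database's own syntax rules before use.

```python
# Shared concept blocks taken from the search strategy described above.
EDUCATION_TERMS = ['education*', '"literacy program*"', '"training program*"', 'workshop']
MINING_TERMS = ['"data mining"', '"text mining"']
MEDICAL_TERMS = ['medic*']


def or_block(terms, field_tag=""):
    """Join terms with OR inside parentheses, optionally tagging each term."""
    return "(" + " OR ".join(term + field_tag for term in terms) + ")"


def boolean_core(field_tag=""):
    """AND together the three concept blocks (education, mining, medicine)."""
    blocks = (EDUCATION_TERMS, MINING_TERMS, MEDICAL_TERMS)
    return " AND ".join(or_block(block, field_tag) for block in blocks)


def build_queries():
    """Return one query string per database, keyed by database name."""
    return {
        "Scopus": "TITLE-ABS-KEY (" + boolean_core() + ")",
        "Web of Science": "TS = (" + boolean_core() + ")",
        "PubMed": boolean_core("[Title/Abstract]"),
    }


if __name__ == "__main__":
    for database, query in build_queries().items():
        print(database + ": " + query)
```

Keeping the term lists in one place makes it harder for the three database queries to drift apart when a term is added or removed.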
Eligibility criteria and study selection
All search results were limited to English-language articles. The results were imported into EndNote, and duplicate entries were removed. Screening was then carried out in three steps. First, titles were screened, and publications that did not pertain to the study goals were eliminated. Second, the abstracts of the remaining publications were assessed to select relevant papers for full-text review. Third, the full texts of the selected articles were reviewed to ensure that all inclusion criteria were satisfied. Two reviewers with expertise in information technology and medical education independently conducted the screening. Disagreements were resolved through discussion and consensus; if needed, a third reviewer was consulted to reach a final decision. The inclusion criteria comprised original research articles, publications related to medical education, and studies that employed text mining or data mining as a research methodology. The exclusion criteria were non-research articles (such as reviews and letters to the editor), studies not written in English, publications without full-text availability, and articles that had been retracted.
Data collection methods & analysis
Lastly, the articles were reviewed and analyzed to address the objectives and research questions identified in this scoping review. The outcomes of this review are presented in a table that includes the authors and year of the study, objective, dataset, algorithms, software used, results, and conclusions.
Results
Figure 1 illustrates the process of finding, screening, and selecting relevant papers for this scoping review, as depicted in the PRISMA diagram. The first step was a search of three databases: PubMed, Web of Science (WOS), and Scopus. This step yielded a total of 2,122 studies. After duplicates were removed and titles and abstracts were screened, a large number of articles were excluded, and 78 studies were selected for full-text review. Of these, 15 studies were excluded as irrelevant to the subject, 47 were excluded because they were not original articles, and two were excluded due to a lack of full-text access or being written in a language other than English. The remaining 14 studies were selected as eligible articles for analysis and conclusions in this study. Table 1 presents the results of the selected studies in the area of data mining in medical education. Educational data include students' grades, demographics, exam scores, online learning environment behavior, and academic and behavioral characteristics (Table 1). In EDM, these data are analyzed mainly to predict student performance, identify at-risk students, evaluate exam quality, discover association rules, and analyze student interaction. The software used in this analysis includes Python (with libraries such as Scikit-learn, TensorFlow, Keras, Pandas, and NumPy), R, SPSS, Weka, Orange, MATLAB, and Stata. These tools are among the leading options for processing educational data due to their advanced analytical capabilities [20-27, 29-33].
Figure 1. Flow chart of the study selection process according to the PRISMA guidelines
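The full-text stage of the selection flow can be checked arithmetically: the three exclusion counts reported for the 78 full-text articles should leave the 14 included studies. The short Python sketch below performs that check; the variable and function names are illustrative assumptions, and only the counts come from the text.

```python
# Counts reported in the study-selection narrative for the full-text stage.
full_text_reviewed = 78
exclusions = {
    "irrelevant to the subject": 15,
    "not an original article": 47,
    "no full text or not in English": 2,
}


def remaining_after_exclusions(start, excluded):
    """Subtract every exclusion count from the starting number of studies."""
    return start - sum(excluded.values())


included = remaining_after_exclusions(full_text_reviewed, exclusions)
print(included)  # 14
```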
In the realm of EDM, various applications have emerged, including predicting student performance, identifying at-risk students, assessing the quality of online exams, discovering association rules, and analyzing student interactions in online learning environments. For instance, student performance prediction often relies on algorithms such as Artificial Neural Networks (ANN) and Naive Bayes, which use admission criteria and demographic features to forecast the performance of medical students. These studies suggest that the prediction of students' performance can be improved through the optimal allocation of weights for admission factors [20, 23, 29]. For at-risk students, ANN and Naive Bayes have been used to identify first-year medical students who may be at risk. The results showed that students' prior knowledge is the most critical factor in predicting academic success [21, 33]. Additionally, EDM has successfully predicted the complexity level of exercises in online learning systems [30]. In the evaluation of exam quality, k-means clustering has been used to analyze the quality of online exams during the COVID-19 pandemic. It was found that the experiences from the first semester influenced the characteristics of second-semester exams. Consequently, it was suggested that appropriate guidelines should be established and that more advanced classification questions should be included [22]. In the domain of association rule discovery, the Apriori and Eclat algorithms were used to uncover association rules in the Iranian national medical entrance exam data. The results indicated that students with high scores are accepted for the exam, while those with low scores are rejected, regardless of other factors [24]. The use of social network analysis (SNA) to study student interactions in online environments has garnered increasing interest, particularly in the context of online Problem-Based Learning (PBL) environments.
The results showed that student-instructor interactions have a positive correlation with student performance, and that SNA indicators can predict low-performing students with an accuracy of 93.3% [27].
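To make the notion of an SNA indicator concrete, the following Python sketch computes normalized in-degree and out-degree from an interaction log. It is an illustration under stated assumptions, not the method of the cited study [27]: the interaction data, participant labels, and function name are hypothetical, and degree centrality is only one of many possible SNA indicators.

```python
from collections import defaultdict

# Hypothetical interaction log: (sender, receiver) pairs from an online forum.
INTERACTIONS = [
    ("instructor", "s1"),
    ("s1", "instructor"),
    ("s2", "s1"),
    ("s1", "s2"),
    ("instructor", "s2"),
    ("s3", "instructor"),
]


def degree_centrality(edges):
    """Normalized out- and in-degree per participant, counting distinct partners."""
    out_partners, in_partners = defaultdict(set), defaultdict(set)
    for sender, receiver in edges:
        out_partners[sender].add(receiver)
        in_partners[receiver].add(sender)
    nodes = set(out_partners) | set(in_partners)
    # Largest possible number of distinct partners; at least 1 to avoid dividing by zero.
    scale = max(len(nodes) - 1, 1)
    return {
        node: {
            "out": len(out_partners[node]) / scale,
            "in": len(in_partners[node]) / scale,
        }
        for node in sorted(nodes)
    }


if __name__ == "__main__":
    for participant, scores in degree_centrality(INTERACTIONS).items():
        print(participant, scores)
```

In this toy log, participant s3 receives no messages at all, which is the kind of low-engagement signal such indicators are meant to surface.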
ANN, Naive Bayes, decision trees, random forests, logistic regression, KNN, and clustering algorithms such as K-means and PAM are some of the algorithms commonly used in these studies. These algorithms are widely utilized due to their ability to perform prediction, classification, and pattern recognition within educational data [20-25, 27, 31-33]. According to the findings obtained from the reviewed studies, the applications of educational data mining in medical education can be categorized into five main domains, as presented in Table 2.
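As an illustration of how one of the clustering algorithms named above operates, the sketch below implements the two alternating steps of k-means (assignment and centroid update) on one-dimensional data. It is a simplified teaching example, not the configuration used in any reviewed study: the scores and starting centroids are hypothetical, and the reviewed studies would typically rely on library implementations with multidimensional features.

```python
def k_means_1d(values, centroids, max_iterations=100):
    """Lloyd's algorithm on scalar values with caller-supplied starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break  # Converged: centroids no longer move.
        centroids = updated
    return centroids, clusters


# Hypothetical exam scores, grouped into two clusters.
scores = [52, 55, 58, 81, 84, 90]
final_centroids, groups = k_means_1d(scores, centroids=[50, 100])
print(final_centroids)  # [55.0, 85.0]
print(groups)           # [[52, 55, 58], [81, 84, 90]]
```

The result depends on the starting centroids, which is why practical implementations usually run several initializations and keep the best one.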
Table 1. Comprehensive analysis of selected studies on educational data mining applications in medical education
Abbreviations: DT, decision tree; NN, neural network; RF, random forest; NB, naive Bayes; KNN, k-nearest neighbor; ANN, artificial neural network; LR, logistic regression; SVM, support vector machine; ADA, AdaBoost; XGB, XGBoost; SNA, social network analysis; CHAID, chi-square automatic interaction detection; ID3, iterative dichotomiser 3; ROCF, Rey-Osterrieth Complex Figure; BSR-ADVI, Bayesian Softmax Regression with Automatic Differentiation Variational Inference; PAM, partitioning around medoids; SOM, self-organizing map; MLP, multi-layer perceptron; CART, classification and regression trees; GPA, grade point average; CMBSE, Comprehensive Medical Basic Sciences Examination; EDM, educational data mining; MCQ, multiple choice question; DEA, data envelopment analysis; PBL, problem-based learning; BEDP, Bayesian inference-based exercise difficulty prediction; CEE, common entrance exam.
Table 2. Main thematic categories of educational data mining in medical education
Note: This table presents a qualitative categorization of educational data mining applications in medical education based on systematic literature review methodology.
Abbreviations: ANNs, artificial neural networks; GPA, grade point average; SNA, social network analysis.
However, a review of the existing literature reveals that many studies in the field of EDM suffer from methodological limitations that can compromise the accuracy and generalizability of their findings. One common limitation is the reliance on narrowly defined educational indicators, such as grades and standardized test scores, while overlooking critical variables, including students' cognitive abilities, motivational factors, social characteristics, the contextual features of their learning environments, and teacher attributes [20, 21, 32]. Some studies have employed only a limited range of cognitive assessments, such as the Rey-Osterrieth Complex Figure (ROCF) test, while neglecting other well-established psychometric instruments that could provide a more comprehensive view of learner characteristics [29]. Moreover, several investigations suffer from small sample sizes or biased data selection practices (for instance, excluding students with incomplete records), which may distort the accuracy of predictive models [21, 25]. Another recurring issue is the absence of cross-validation or sensitivity analysis, which are essential for assessing the robustness and reliability of machine learning models [25]. Additional limitations include the unavailability of detailed demographic data [22], a narrow focus on a specific subgroup of users within an educational platform [25], and an overdependence on historical or context-dependent datasets [23, 26, 27]. Collectively, these shortcomings highlight the need for future research to incorporate more diverse and multidimensional datasets, adopt advanced hybrid modelling approaches, and account for the cultural, social, and technological aspects of modern educational systems to enhance research quality and applicability.
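Because the absence of cross-validation is noted above as a recurring weakness, a brief sketch of k-fold cross-validation may help clarify what the procedure involves. The Python example below is a minimal illustration under stated assumptions: the labels are hypothetical, the "model" is a trivial majority-class baseline, and the records are assumed to be shuffled beforehand (the toy data here are deliberately left unshuffled to keep the arithmetic easy to follow).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(records, k, fit, score):
    """Average the held-out score of a model refit on each training split."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(records), k):
        model = fit([records[i] for i in train_idx])
        scores.append(score(model, [records[i] for i in test_idx]))
    return sum(scores) / len(scores)


def majority_label(train_labels):
    """Baseline 'model': always predict the most frequent training label."""
    return max(set(train_labels), key=train_labels.count)


def accuracy(predicted_label, test_labels):
    """Share of held-out labels that match the single predicted label."""
    return sum(1 for label in test_labels if label == predicted_label) / len(test_labels)


# Hypothetical outcome labels for ten students.
labels = ["pass"] * 8 + ["fail"] * 2
print(cross_validate(labels, k=5, fit=majority_label, score=accuracy))  # 0.8
```

Every record is held out exactly once, so the averaged score reflects performance on data the model did not see during fitting.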
Discussion
The findings of this study suggest that EDM plays a crucial role as an effective and flexible tool in analyzing educational data and enhancing learning systems. Educational institutions store various types of student data, ranging from academic records to personal details such as parental income and educational background [4]. However, one of the key limitations observed in the reviewed studies is their heavy reliance on localized and context-specific datasets, which limits the generalizability of findings to broader educational settings [34]. In this regard, Baker, Martin, and Rossi also pointed out the challenges of implementing EDM systems on a larger scale, particularly in resource-constrained environments or when dealing with incomplete datasets [34].
Furthermore, the findings from this review underscore the importance of various types of educational data, including student grades, demographic characteristics, and online interaction records. Utilizing advanced data mining techniques enables the identification of hidden patterns within the data. Admission criteria and demographic variables play a crucial role in predicting student success. In line with this, Waheed et al. achieved 85% prediction accuracy using demographic and geographic features [36]. Similarly, Costa-Mendes et al. predicted academic performance using variables such as income, age, employment status, cultural level, and place of residence [37]. Cruz et al. also utilized these socio-economic variables to predict student performance [38]. In another study, Alam enhanced educational outcomes by analyzing diverse data, including academic records, demographics, and external sources like social media platforms and online forums [39].
Nevertheless, the use of sensitive data, such as socio-economic and personal information, raises significant concerns regarding data privacy and ethics, which many of the reviewed studies failed to address entirely [40]. Romero and Ventura emphasized that these challenges can substantially hinder the real-world adoption of EDM in educational settings [41].
Furthermore, identifying at-risk students using data mining algorithms underscores the importance of prior data and its impact on academic success [21, 31]. For example, Ahmad and Shahzadi predicted poor academic performance with 85% accuracy by analyzing variables related to study habits, learning skills, and student-instructor interaction [42]. However, one major challenge in this area is the absence of a standardized definition for "at-risk" students across studies, which complicates cross-study comparisons [43]. Furthermore, implementing these predictive models in real-world scenarios necessitates a robust technological infrastructure and trained personnel [44].
In the evaluation of online exams, clustering-based techniques such as the k-means algorithm have demonstrated that multiple factors, including students' prior experiences, can influence assessment quality. These analyses suggest that incorporating higher-order questions and structured exam design can enhance the reliability of online evaluations [22]. However, many studies did not adequately consider external factors, such as internet accessibility or the quality of student devices, which can significantly impact assessment outcomes [45]. In the area of association rule mining, algorithms such as Apriori and Eclat have revealed functional patterns for determining student admission decisions based on test scores [24]. Moreover, analyses of student-instructor interactions in online learning environments suggest that positive engagement improves academic performance. The use of SNA indicators to identify low-performing students highlights the potential of EDM for designing interaction-focused learning models [27]. However, the reliance of SNA on dense interaction data makes it challenging to apply in low-resource educational environments [46]. Additionally, inconsistencies found in the literature, such as the varying effects of online interaction on performance across different contexts, indicate the need for further empirical studies, as emphasized by Siemens [47].
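The association rules mentioned above are usually judged by two measures, support and confidence. The sketch below computes both on a toy dataset using a brute-force search over item pairs. It is not the Apriori or Eclat implementation used in the cited study [24], which prune the search space far more efficiently; the records, attribute names, and threshold are hypothetical.

```python
from itertools import combinations

# Hypothetical toy records; each set lists attributes of one applicant.
RECORDS = [
    {"high_score", "accepted"},
    {"high_score", "accepted", "urban"},
    {"low_score", "rejected"},
    {"low_score", "rejected", "urban"},
    {"high_score", "accepted"},
]


def support(itemset, records):
    """Fraction of records that contain every item in itemset."""
    hits = sum(1 for record in records if itemset <= record)
    return hits / len(records)


def confidence(antecedent, consequent, records):
    """Estimate P(consequent | antecedent) as a ratio of supports."""
    return support(antecedent | consequent, records) / support(antecedent, records)


def frequent_pairs(records, min_support):
    """Brute-force listing of item pairs whose support meets the threshold."""
    items = sorted(set().union(*records))
    return [
        (a, b)
        for a, b in combinations(items, 2)
        if support({a, b}, records) >= min_support
    ]


if __name__ == "__main__":
    print(frequent_pairs(RECORDS, min_support=0.4))
    print(confidence({"high_score"}, {"accepted"}, RECORDS))
```

In this toy data the rule "high_score implies accepted" has a confidence of 1.0, mirroring the kind of score-dominated pattern described in the text.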
Overall, this review highlights the substantial potential of EDM to transform the field of medical education. Nonetheless, challenges such as the lack of high-quality data, ethical and privacy concerns, and infrastructural limitations still impede its widespread application [41]. Future research should prioritize the development of generalizable EDM frameworks, the integration of ethical safeguards, and the creation of scalable solutions for settings with limited resources.
One of the limitations of the present study is that, as a scoping review, its primary objective was to map and categorize the scope of existing research rather than conduct a systematic evaluation of study quality or generalizability. Moreover, several reviewed studies demonstrated weaknesses, including small sample sizes, reliance on outdated datasets, the absence of validation methods (e.g., cross-validation), and limited analytical depth. For instance, studies relying on data from over a decade ago may no longer align with current medical education practices due to structural and technological changes in the field.
Conclusion
The current work illustrates the capability of EDM as a broad yet robust tool for identifying and improving the quality of learning systems. Advanced algorithms like ANN, decision trees, and random forests not only allow educational institutes to predict students' performance but can also help them identify latent patterns in the underlying data and use those insights for strategic decision-making. Additionally, it aids in identifying at-risk students, improving exam quality, and fostering communication between students and instructors by leveraging educational, demographic, and interaction data in online learning environments. The educational process can be optimized by the design of data-driven learning systems and predictive analyses, which will generate early intervention opportunities for students. The future of the educational system will be primarily driven by data-centricity and the adoption of sophisticated techniques, as this study demonstrates.
The findings of this research will provide valuable insights for researchers and policymakers on the application of data mining and machine learning algorithms in medical education. This, in turn, can lead to enhancements in the quality of teaching and student performance. These techniques can be used efficiently to analyze students' learning behaviour and predict their academic performance.
Additionally, these can be utilized to develop student-specific learning frameworks and identify areas of weakness, while offering remedial sessions to enhance learning outcomes.
The results of this study may help policymakers formulate evidence-based policies that can enhance constructive interactions between students and instructors, optimize online learning environments, and construct better-quality examinations. These strategies ultimately contribute to the advancement of the educational system and may lead to the training of more proficient and effective medical science professionals.
Ethical considerations
This is a review study, and all ethical principles in research have been observed. The study was approved by the Ethics Committee of Sirjan School of Medical Sciences under the ethics code IR.SIRUMS.REC.1403.049.
Artificial intelligence utilization for article writing
Acknowledgment
Not applicable.
Conflict of interest statement
The authors declare that they have no conflicts of interest at any stage of the study.
Author contributions
MD contributed to the conception and design of the study, literature review, data extraction, writing the initial draft, and critical revision of the manuscript. MK supervised the research process, contributed to the methodological framework, provided critical feedback, and edited the final version of the manuscript.
Funding
The Sirjan School of Medical Sciences supported this study (Grant No: 403000047).
Data availability statement
All data related to this study are fully available within the article.