Sajjadi S S, Shomoossi N, Shabani E, Khazaei Feizabad A, Karimkhanlooei G. Dimensionality, discrimination power and difficulty of English test items: the case of graduate exam for healthcare applicants. JMED 2024; 17 (55) :108-119
URL: http://edujournal.zums.ac.ir/article-1-1869-en.html
Abstract
Background & Objective: National university entrance exams, administered nationwide by the Iranian Center for the Measurement of Medical Education, include English as a vital section. This study aimed to assess the dimensionality, discrimination power, and difficulty of the English test items in this graduate entrance exam.
Material & Methods: This quantitative study examined 160 English test items administered to 41,633 test takers applying for graduate studies in Iranian universities of medical sciences in 2021, and reported the characteristics of test takers during three successive years (2019, 2020, and 2021). NOHARM software (version 4.0) was used to examine the dimensionality of the tests and fit a two-parameter model.
Results: Generally, female participants outnumbered male participants, with a similar pattern among the admitted applicants (70% females vs. 30% males). A significant positive correlation was found between participants' Grade Point Average and English test scores (p < 0.05). In 2021, the results of four administration sessions with high reliability (i.e., 0.92, 0.88, 0.90, and 0.91) were analyzed separately. Two parameters (i.e., difficulty and discrimination) fitted the model, while the guessing parameter did not. The English tests proved to be "difficult", with either "high" or "very high" discrimination power. Neither "easy" nor "very easy" items were found, and no items were associated with "no" or "very low" discrimination power.
Conclusion: Overall, the tests functioned well; however, more research is required to rigorously evaluate the exams. Improvements concerning the social and long-term effects of these tests are suggested.

Introduction
A variety of English examinations are administered to screen out university applicants in healthcare majors across the world. For instance, the United States Medical Licensing Examination (USMLE) is regarded as one of the toughest exams in the world; it measures candidates’ clinical abilities, medical knowledge, and English language ability (1). Also, the Medical College Admission Test (MCAT) is a multiple-choice examination for admission to medical schools in the USA (2). Another well-known test is the Occupational English Test (OET), which evaluates the language communication abilities of healthcare professionals seeking to register and practice in an English-speaking workplace (3). In India, the Foreign Pre-Medical Entrance Test is administered (4, 5). In Iran, similar tests are designed by the Center for the Measurement of Medical Education in order to assess the language abilities of applicants in the healthcare majors.
National university entrance exams, publicly known as Konkour in Iran, are administered at the undergraduate, graduate, and postgraduate levels. While hundreds of thousands of high school graduates participate in the undergraduate nationwide exam (6), participants in the graduate and postgraduate exams amount to only tens of thousands each year; nevertheless, all the exams are quite competitive. The Center for the Measurement of Medical Education, subordinate to the Ministry of Health and Medical Education, administers the graduate- and postgraduate-level exams, which are vitally important for higher education applicants in healthcare and medicine.
Testing packages for each major (or set of similar majors) include a set of 40 English test items, together with tests of specialized courses. This Foreign Language Test consists of 20 vocabulary and 20 reading comprehension items (all multiple-choice), which must be completed in 40 minutes. It is a norm-referenced test designed to assess applicants' reading comprehension and vocabulary knowledge of academic English for healthcare and medical students; assessment of the four language skills is not in its scope. The participants are all educated in Farsi, Iran's official language. The items are normally developed with varying degrees of difficulty and administered once a year in four sessions over two consecutive days, normally on a weekend. Furthermore, another 40-item English test is developed for applicants in 'medical journalism'; it is considered more difficult than the other four tests because a higher proficiency level is expected of its applicants, who are mostly graduates of English and medicine. The tests are normally based on test takers' academic needs, and the test items reflect their undergraduate English courses.
To the best of our knowledge, English level requirements for entering MSc programs are not determined through a centralized test of English in other countries. In fact, even non-native-English-speaking countries do not administer an English test for medical MSc applicants as Iranian universities do. For instance, most universities in Indonesia, Brunei Darussalam, the Philippines, Malaysia, Singapore, and other ASEAN countries require an International English Language Testing System (IELTS) band score of 6.5 or a TOEFL score of 550 at entry. In Türkiye, international English language proficiency tests such as PTE Academic, TOEFL iBT (a score of at least 70), and IELTS (a minimum score of 5.0) are required for master's programs, while for PhD programs a minimum score of 6.0 is acceptable. In general, Turkish universities demand a prerequisite entry language score on international language tests ranging between the B1 and C1 levels of proficiency according to the CEFR. Therefore, this unique test of English requires special consideration.
Due to its exclusive emphasis on vocabulary and reading comprehension, this test is not regarded as a true test of English proficiency; rather, it is meant to measure applicants' language performance to some extent (7). Furthermore, owing to the restricted number of seats available in medical universities, these exams are highly competitive and serve two purposes: as a gatekeeper to weed out less qualified students and as a guarantee of the admitted applicants' future academic abilities (8, 9). Nevertheless, despite its high-stakes nature and its evident impact on a significant number of test takers' future academic and professional prospects, to our knowledge, no reliable reports have been published on its effectiveness, reliability, and validity; even technical reports are unavailable on the web. Therefore, the present study aimed to evaluate these characteristics and the dimensionality of the English test items in this nationwide medical graduate entrance exam during three successive years. It was carried out to evaluate the English exam items in light of statistical computational approaches in order to provide a technical evaluation of these test items. The findings should aid in revising the construction and administration procedures.

Material & Methods
Design and setting(s)
This quantitative study was designed to investigate the dimensionality, discrimination power, and difficulty of the English test items in the graduate entrance exam for healthcare applicants in Iran. The test includes a set of 40 English test items, administered on the same day as the tests of specialized courses. The items consist of 20 vocabulary and 20 reading comprehension items in multiple-choice format, which must be completed in 40 minutes.
Participants and sampling
The study was carried out on the English test items (n = 160) administered to 14,827 test takers applying for graduate studies at Iranian universities of medical sciences in 2021, and the test takers' characteristics were reported for three successive years (2019–2021). Applicants normally enroll and prepare for the exam months before it is administered. The applicants' gender, registration, absenteeism, and admission figures are reported in Table 1.
Data collection method
The study data were obtained under confidentiality requirements from the Center for the Measurement of Medical Education, directed by the Ministry of Health and Medical Education; the test takers' personal information (e.g., names or identity details) was not included. The data comprised the test takers' performance on the 160 English test items administered in 2021, together with their characteristics during three successive years (2019, 2020, and 2021). Using Microsoft Excel and Word, we tabulated the data and organized them into categories so that the analyses could be performed.
Table 1. Participants in the graduate entrance examination (2019-2021)
Candidates Gender 2019 2020 2021
Frequency Percentage Frequency Percentage Frequency Percentage
Registrants Male 21584 26.64 18233 26.01 21364 26.53
Female 59433 73.36 51861 73.99 59168 73.47
Total 81017 100 70094 100 80532 100
Participants Male 12619 26.95 12940 25.53 14718 26.1
Female 34206 73.05 37739 74.47 41633 73.9
Total 46825 100 50679 100 56381 100
Absentees Male 8965 26.22 5293 27.26 6646 27.52
Female 25227 73.78 14122 72.74 17505 72.48
Total 34192 100 19415 100 24151 100
Allowed to choose a major Male 11203 26.26 3522 26.58 8358 27.11
Female 31458 73.74 9730 73.42 22473 72.89
Total 42661 100 13252 100 30831 100
Admitted Male 1298 30.27 1376 30.5 1856 30.79
Female 2990 69.73 3136 69.5 4171 69.21
Total 4288 100 4512 100 6027 100
 
Data analysis
Statistical analyses were conducted using descriptive and inferential statistics. Correlations between test takers' English test scores and Grade Point Average (GPA) were examined, and NOHARM software (version 4.0) was used to examine the dimensionality of the tests and fit a two-parameter model.
Dimensionality
In order to apply item response theory to item analysis, it is essential that each test undergo a unidimensionality evaluation. Unidimensionality is one of the two assumptions of item response theory: it denotes that a single dominant factor, namely the ability being tested and measured, affects a test taker's performance. The other assumption is local independence, which means that responses to one item are independent of responses to other items once the dominant factor (i.e., ability) is controlled for (10).
Different models are suggested within item response theory, distinguished by the scoring model (e.g., two-parameter, multi-parameter, and nominal models) and the number of parameters (e.g., the difficulty and discrimination parameters and the guessing effect) (11). To determine the number of parameters for an item, all three model fits must be examined against the data and the most appropriate one selected. In the present analysis, likelihood indices were used to compare the models and choose the right one.
Dimensionality parameters
The difficulty parameter in IRT is similar to item difficulty in classical test theory, with the difference that in IRT, higher values indicate more difficult items, so test takers need a higher ability to get the item right. It ranges from -4 to +4, items becoming more difficult as values move from -4 towards +4; while classical difficulty fluctuates between 0 and 1, IRT values may even exceed 1. The guessing parameter estimates the extent to which a test taker with very low ability can correctly answer an item. Values below 0.1 are considered optimal, values above 0.1 are unsatisfactory, and items with a guessing parameter above 0.25 are regarded as poorly constructed because of the high likelihood of guessing. While references on IRT models do not present clear-cut classifications for the parameters in question, Baker (15) developed a scale for the difficulty and discrimination parameters, which is the basis of our analysis (Tables 2 and 3; a small helper mapping estimates to these labels is sketched after Table 3).
Item-Response Theory (IRT) Models
A variety of IRT models are available to accommodate different measurement situations. In the one-parameter (Rasch) model, the discrimination parameter is assumed to be the same for all items, while a difficulty parameter is specified for each item. An advantage of the Rasch model is its capacity to be used with smaller sample sizes. However, if equal discrimination cannot be assumed, the two-parameter model is applied, in which two parameters affecting an individual's response to a particular test item are considered (i.e., difficulty level and item discrimination). Therefore, a difficulty level and a discrimination power value are reported separately for each item, as reported below.
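For illustration only, the following minimal Python sketch evaluates the two-parameter logistic item response function for one item from Table 5; our analysis itself used NOHARM, not this code, and the logistic form without the 1.7 normal-ogive scaling constant is an assumption about the parameterization.

```python
import numpy as np

def p_correct_2pl(theta, a, b):
    """Two-parameter logistic (2PL) item response function:
    P(theta) = 1 / (1 + exp(-a * (theta - b))),
    where a is item discrimination and b is item difficulty.
    (The 1.7 scaling constant sometimes used to approximate
    the normal ogive is omitted here.)"""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Item q123 from Table 5: a = 2.051 ("very high"), b = 1.44 ("difficult").
for theta in (-1.0, 0.0, 1.44, 3.0):
    print(f"theta = {theta:+.2f}: P(correct) = {p_correct_2pl(theta, 2.051, 1.44):.3f}")
# At theta = b the probability is exactly 0.5; an average examinee
# (theta = 0) has a much lower chance of success on this difficult item.
```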
Table 2. Levels of the difficulty parameter
-3 to -2.001 -2 to -1.001 -1 to 0 0.001 to 1 1.001 to 2 2.001 to 3
Very easy Easy Almost easy Almost difficult Difficult Very difficult
Table 3. Levels of the discrimination parameter
0 to 0.009 0.01 to 0.34 0.35 to 0.64 0.65 to 1.34 1.35 to 1.69 1.70 and above
None Very low Low Medium High Very high
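As a convenience for readers replicating the classification, a small Python helper (ours, not part of the original analysis) maps parameter estimates to the verbal labels of Tables 2 and 3; extending the top difficulty bin slightly beyond 3 is our assumption, since some estimates in Tables 6–8 exceed the tabled range.

```python
def difficulty_label(b):
    """Verbal difficulty levels of Table 2 (2PL difficulty parameter).
    The top bin is extended above 3 because some observed estimates
    (e.g., 3.60 in Table 6) exceed the tabled range."""
    for low, high, label in [(-3, -2, "Very easy"), (-2, -1, "Easy"),
                             (-1, 0, "Almost easy"), (0, 1, "Almost difficult"),
                             (1, 2, "Difficult"), (2, 4, "Very difficult")]:
        if low < b <= high:
            return label
    return "Out of range"

def discrimination_label(a):
    """Verbal discrimination levels of Table 3 (Baker's scale)."""
    if a >= 1.70: return "Very high"
    if a >= 1.35: return "High"
    if a >= 0.65: return "Medium"
    if a >= 0.35: return "Low"
    if a >= 0.01: return "Very low"
    return "None"

# Item q123 from Table 5: b = 1.44, a = 2.051.
print(difficulty_label(1.44), "/", discrimination_label(2.051))  # Difficult / Very high
```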
Item difficulty
Item difficulty is the percentage of test takers who answer a given item correctly and is represented by P. As the following formula indicates, P is computed as the number of test takers who correctly answered the item (R), divided by the total number of test takers (T), multiplied by 100.
P = R/T × 100
Item discrimination
Represented by D, item discrimination power is an index that indicates how well an item distinguishes between high and low achievers. It is computed from equal-sized high- and low-scoring groups on a test by subtracting the number of successes of the low achievers on the item from the number of successes of the high achievers, and dividing this difference by the size of one group, using the formula D = (UG - LG)/n. It may range from -1 to +1. The higher the discrimination index, the better the item can discriminate between students with higher test scores and those with lower test scores. For instance, D = 0 means the item has no discriminatory power, while D = 1 means the item has perfect discrimination power.
Formula 2.   
D = (upper group right answers - lower group right answers) ÷ number of group members (upper or lower)
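A minimal Python sketch of both classical indices may be helpful; the 27% upper/lower group split is a common convention and our assumption, as the text does not specify the group size.

```python
import numpy as np

def item_difficulty(responses):
    """Classical item difficulty: P = R / T * 100, the percentage of
    test takers answering the item correctly (responses is a 0/1 array)."""
    return responses.mean() * 100

def item_discrimination(item_responses, total_scores, group_fraction=0.27):
    """Classical discrimination: D = (UG - LG) / n, where UG and LG are the
    numbers of correct answers in equal-sized high- and low-scoring groups
    of size n. The 27% split is a common convention (our assumption)."""
    n = max(1, int(len(total_scores) * group_fraction))
    order = np.argsort(total_scores)          # ascending by total score
    low = item_responses[order[:n]]           # lowest-scoring group
    high = item_responses[order[-n:]]         # highest-scoring group
    return (high.sum() - low.sum()) / n

# Toy data: total scores out of 40 and one item's 0/1 responses.
scores = np.array([5, 12, 33, 8, 27, 39, 15, 31, 22, 36])
item = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])
print(item_difficulty(item))                  # 50.0
print(item_discrimination(item, scores))      # 1.0: perfect discrimination here
```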
Finally, when guessing is plausible, the three-parameter logistic model applies, and three parameters affecting an individual's response to a particular test item are reported (difficulty level, discriminating power, and the guessing effect) (12). The decision to use one model over another depends on several factors, including the response format, whether the discrimination parameter can be kept constant across items, whether guessing is plausible, and whether different category response parameters must be estimated for each item on a scale (10).

Results
Analysis of the collected data showed that female registrants outnumbered male registrants throughout the three years; the proportion of males varied from about one-fourth of registrants to almost one-third of those admitted (Table 1). The ratio remained almost the same when the total number of test takers was considered by gender. As for the absentees, 26% were male and the rest (74%) were female. Admitted participants consisted of 70% females and 30% males. Considering the number of admitted candidates, 8.7% of the total female participants and 10.3% of the total male participants were admitted (Table 1).
Correlation between English test scores and GPA
In the present study, given the large sample size and the quantitative nature of the English language scores and GPA, Pearson's correlation coefficient was applied to investigate the possible correlation. Preliminary analyses were performed to ensure no violation of the assumptions of normality, linearity, and homoscedasticity. A small but significant positive correlation was observed (r = 0.260, p < 0.05, 95% confidence interval). In other words, the higher an applicant's GPA, the higher his or her English test score.
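The computation can be sketched as follows with simulated stand-in data, since the real score file is confidential; the 0–20 GPA scale and the noise parameters are assumptions chosen so that r comes out near the reported 0.260.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500
gpa = rng.normal(16, 1.5, n)                    # GPA on Iran's 0-20 scale (assumption)
english = 8 + 0.9 * gpa + rng.normal(0, 5, n)   # weak linear link plus noise

r, p = stats.pearsonr(gpa, english)             # expected r near 0.26 by construction
print(f"r = {r:.3f}, p = {p:.4g}")
```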
Evaluation of dimensionality
First, NOHARM software (version 4.0) was used to check the dimensionality of the test (four sessions in 2021, each containing 40 items). The Tanaka index values in the software output confirmed the unidimensionality of the test (e.g., for Session 1, Tanaka goodness-of-fit index = 0.9853312 and root mean square of residuals (RMSR) for the lower off-diagonals = 0.0090737; details of the other three sessions are available on request). If the Tanaka index is greater than 0.90, the fit is acceptable; if it is greater than 0.95, the fit is good. Since the indices obtained in all four sessions were above 0.95, the four tests were considered unidimensional. In addition, the very low RMSR value was further evidence of the suitability of the unidimensional model, enhancing the dependability of the tests; dependability is the extent to which test results reflect the level of the construct being measured (14). In other words, only one dominant factor affected the test takers' performance, and this dominant factor was the desired ability (i.e., language knowledge) of the individual.
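For readers without NOHARM output at hand, one common formulation of these fit statistics can be sketched as follows; the exact formulas NOHARM uses are documented in its manual, so treat this reconstruction as an assumption.

```python
import numpy as np

def tanaka_gfi(sample_cov, residual_cov):
    """Tanaka goodness-of-fit index as commonly computed from NOHARM's
    residual matrix: 1 - SS(residuals) / SS(sample covariances) over the
    lower off-diagonal elements. This reconstruction is an assumption;
    the NOHARM manual gives the authoritative definition."""
    i = np.tril_indices_from(sample_cov, k=-1)
    return 1.0 - np.sum(residual_cov[i] ** 2) / np.sum(sample_cov[i] ** 2)

def rmsr(residual_cov):
    """Root mean square of the lower off-diagonal residuals."""
    i = np.tril_indices_from(residual_cov, k=-1)
    return np.sqrt(np.mean(residual_cov[i] ** 2))

# Rule of thumb used in the text: GFI > 0.90 acceptable, > 0.95 good;
# the Session 1 values (0.9853, RMSR = 0.0091) clear both thresholds.
```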
Model selection
To choose the right model, the significance of the difference between the likelihood indices of two nested models should be examined. Here, the difference in likelihood indices between the one- and two-parameter models was greater than the critical Chi-square value; as a result, the null hypothesis of no difference between the one- and two-parameter models was rejected. In contrast, the difference between the two- and three-parameter models was smaller than the critical Chi-square value, which supported the null hypothesis of no difference between these two models; therefore, the two-parameter model was used for the analysis (Table 4).
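A sketch of this comparison, using the Session 1 log-likelihoods from Table 4, is given below; the degrees of freedom (40, one added parameter per item across 40 items) are our assumption about how the model parameters were counted.

```python
from scipy import stats

def lr_test(loglik_simple, loglik_complex, df_diff):
    """Likelihood-ratio test for nested models:
    G2 = 2 * (logL_complex - logL_simple) ~ chi-square(df_diff) under H0."""
    g2 = 2 * (loglik_complex - loglik_simple)
    return g2, stats.chi2.sf(g2, df_diff)

# Session 1 log-likelihoods from Table 4; df_diff = 40 is our assumption.
g2, p = lr_test(-244675.2, -242027.2, df_diff=40)
print(f"1PL vs 2PL: G2 = {g2:.1f}, p = {p:.3g}")   # huge G2 -> reject the 1PL

g2, _ = lr_test(-242027.2, -242029.9, df_diff=40)
print(f"2PL vs 3PL: G2 = {g2:.1f}")                # non-positive G2 -> keep the 2PL
```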
Because the data acquired for the present study came from four administration sessions, the test items were analyzed separately for each session and are reported below. Before presenting the analysis, it should be mentioned that in the graduate exam, five parallel sets of questions are given to candidates who take the exam at the same time; that is, candidates in a series of similar fields take the exam simultaneously (except for medical journalism, which has its own set of questions). Accordingly, the answer sheets of all the candidates were subject to analysis.
A) The 2021 graduate exam (Session 1)
The first session of the 2021 graduate exam was conducted with 40 questions, administered to 13,290 participants. The maximum and minimum scores were 38.31 and 0.11, respectively, and the reliability was calculated at 0.92 (Table 5).
B) The 2021 graduate exam (Session 2)
The second session of the 2021 graduate exam was conducted with 40 questions, administered to 15,422 participants. The maximum and minimum scores of the exam were 34.55 and 0.03, respectively. The reliability value was obtained at 0.88 (Table 6).
C) The 2021 graduate exam (Session 3)
The third session of the 2021 graduate exam was conducted with 40 questions, administered to 9,441 participants. The maximum and minimum scores of the exam were 37.38 and 0.09, respectively. The reliability value was obtained at 0.90 (Table 7).
D) The 2021 graduate exam (Session 4)
The fourth session of the 2021 graduate exam was conducted with 40 questions, administered to 9,262 participants. The maximum and minimum scores of the exam were 38.12 and 0.15, respectively. The reliability value was obtained at 0.91 (Tables 8–10).
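The paper does not name the reliability formula behind these coefficients; for dichotomous items, Kuder-Richardson 20 (KR-20) is a standard choice, sketched here on toy data as an assumption about the method.

```python
import numpy as np

def kr20(responses):
    """Kuder-Richardson 20 reliability for a 0/1 response matrix
    (rows = examinees, columns = items)."""
    k = responses.shape[1]
    p = responses.mean(axis=0)                     # proportion correct per item
    total_var = responses.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - np.sum(p * (1 - p)) / total_var)

# Toy usage; the real matrix for Session 1 would be 13,290 x 40.
rng = np.random.default_rng(1)
ability = rng.normal(0, 1, (300, 1))
X = (rng.random((300, 40)) < 1 / (1 + np.exp(-(ability - 1)))).astype(int)
print(round(kr20(X), 2))    # correlated items yield a non-trivial coefficient
```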
Table 4. Likelihood indices among the models (2021–Sessions 1, 2, 3, 4)
Year/Session One-parameter Two-parameter Three-parameter
2021 - Session 1 -244675.2 -242027.2 -242029.9
2021 - Session 2 -244675.2 -242027.2 -242029.9
2021 - Session 3 -161145.5 -159467.2 -159462.4
2021 - Session 4 -165435.3 -163290.5 -163276.1
Notes: Lower values of likelihood indices indicate better fit of the model to the data, aiding in model comparison and selection for the analysis of test items. The selection of the appropriate model is crucial for accurate analysis and interpretation of the test data.
Table 5. Item difficulty and discrimination (2021–Session 1)
Question Discrimination Power of discrimination Difficulty Level of difficulty
q121 1.999 Very high -0.32 Almost easy
q122 1.712 Very high 0.97 Almost difficult
q123 2.051 Very high 1.44 Difficult
q124 1.741 Very high 1.05 Difficult
q125 1.744 Very high 1.56 Difficult
q126 2.047 Very high 1.29 Difficult
q127 1.561 High 1.41 Difficult
q128 2.346 Very high 1.07 Difficult
q129 1.689 High 1.77 Difficult
q130 2.594 Very high 1.99 Difficult
q131 2.075 Very high 2.01 Very difficult
q132 1.693 High 0.91 Almost difficult
q133 2.247 Very high 1.63 Difficult
q134 1.422 High 2.78 Very difficult
q135 2.184 Very high 1.91 Difficult
q136 1.881 Very high 2.07 Very difficult
q137 1.285 Medium 1.13 Difficult
q138 2.527 Very high 0.97 Almost difficult
q139 1.932 Very high 1.55 Difficult
q140 2.425 Very high 0.33 Almost difficult
q141 1.477 High 2.02 Very difficult
q142 1.391 High -0.53 Almost easy
q143 1.422 High 0.19 Almost difficult
q144 2.285 Very high 1.39 Difficult
q145 1.676 High -0.17 Almost easy
q146 1.743 Very high 2.18 Very difficult
q147 1.995 Very high 1.70 Difficult
q148 1.424 High 1.10 Difficult
q149 2.424 Very high 1.35 Difficult
q150 2.264 Very high 1.23 Difficult
q151 2.2 Very high 2.09 Very difficult
q152 2.153 Very high 0.43 Almost difficult
q153 1.991 Very high 1.36 Difficult
q154 1.223 Medium 1.62 Difficult
q155 1.002 Medium 1.45 Difficult
q156 2.523 Very high 0.56 Almost difficult
q157 2.808 Very high 0.33 Almost difficult
q158 1.472 High 1.34 Difficult
q159 2.119 Very high 0.54 Almost difficult
q160 0.686 Medium 2.58 Very difficult
Average difficulty 1.26 Difficult
Average discrimination 1.89 Very high
Notes: Discrimination values indicate the ability of an item to differentiate between high and low performers, with higher values suggesting stronger discrimination. Difficulty values represent the level of difficulty for each item, with negative values indicating easier items and positive values indicating more difficult items.
Table 6. Item difficulty and discrimination (2021–Session 2)
Question Discrimination Power of Discrimination Difficulty Level of difficulty
q121 1.543 High 1.72 Difficult
q122 2.059 Very high 2.48 Very difficult
q123 1.62 High 1.52 Difficult
q124 2.118 Very high 1.86 Difficult
q125 1.853 Very high 2.54 Very difficult
q126 1.843 Very high 2.77 Very difficult
q127 1.664 High 2.49 Very difficult
q128 2.276 Very high 2.70 Very difficult
q129 1.496 High 2.70 Very difficult
q130 1.964 Very high 1.56 Difficult
q131 1.108 Medium 2.77 Very difficult
q132 1.944 Very high 2.31 Very difficult
q133 2.411 Very high 2.72 Very difficult
q134 1.624 High 2.48 Very difficult
q135 1.941 Very high 3.14 Very difficult
q136 2.237 Very high 2.49 Very difficult
q137 1.093 Medium 2.16 Very difficult
q138 2.048 Very high 2.92 Very difficult
q139 2.089 Very high 1.22 Difficult
q140 1.717 Very high 0.78 Almost difficult
q141 1.064 Medium 1.94 Difficult
q142 1.785 Very high 1.81 Difficult
q143 2.22 Very high 1.74 Difficult
q144 1.927 Very high 1.66 Difficult
q145 1.978 Very high 0.41 Almost difficult
q146 1.796 Very high 1.91 Difficult
q147 2.086 Very high 1.18 Difficult
q148 2.295 Very high 2.18 Very difficult
q149 2.344 Very high 3.60 Very difficult
q150 2.942 Very high 2.18 Very difficult
q151 1.846 Very high 0.88 Almost difficult
q152 2.076 Very high 1.03 Difficult
q153 2.198 Very high 0.22 Almost difficult
q154 0.785 Medium 1.94 Difficult
q155 2.489 Very high 2.79 Very difficult
q156 1.922 Very high 3.10 Very difficult
q157 1.571 High 3.03 Very difficult
q158 1.845 Very high 2.52 Very difficult
q159 1.556 High 0.58 Almost difficult
q160 2.204 Very high 3.42 Very difficult
Average difficulty 1.89 Difficult
Average discrimination 2.09 Very high
Note: Discrimination values reflect the ability of each item to discriminate between high and low performers, with higher values indicating stronger discrimination. Difficulty values represent the level of difficulty for each item, with higher values indicating more difficult items.
Table 7. Item difficulty and discrimination (2021–Session 3)
Question Discrimination Power of discrimination Difficulty Level of difficulty
q121 1.761 Very high 1.65 Difficult
q122 1.012 Medium 1.82 Difficult
q123 1.775 Very high 1.59 Difficult
q124 2.187 Very high 0.57 Almost difficult
q125 0.934 Medium 2.44 Very difficult
q126 1.697 High 1.73 Difficult
q127 2.048 Very high 2.19 Very difficult
q128 2.228 Very high 1.66 Difficult
q129 1.59 High 2.30 Very difficult
q130 1.868 Very high 0.63 Almost difficult
q131 2.204 Very high 2.82 Very difficult
q132 2.507 Very high 1.30 Difficult
q133 1.272 Medium 2.46 Very difficult
q134 2.226 Very high 1.59 Difficult
q135 1.716 Very high 1.98 Difficult
q136 1.678 High 2.55 Very difficult
q137 0.978 Medium 3.24 Very difficult
q138 1.82 Very high 2.30 Very difficult
q139 2.247 Very high 2.29 Very difficult
q140 1.935 Very high 1.29 Difficult
q141 1.324 Medium 1.53 Difficult
q142 1.471 High -0.49 Almost easy
q143 1.488 High 0.36 Almost difficult
q144 1.637 High 0.25 Almost difficult
q145 1.644 High 1.55 Difficult
q146 1.756 Very high 0.84 Almost difficult
q147 1.807 Very high 1.35 Difficult
q148 2.999 Very high 2.76 Very difficult
q149 1.972 Very high 1.98 Difficult
q150 2.344 Very high 1.42 Difficult
q151 2.368 Very high 1.83 Difficult
q152 2.236 Very high 1.02 Difficult
q153 2.468 Very high 1.62 Difficult
q154 3.205 Very high 2.31 Very difficult
q155 2.71 Very high 1.23 Difficult
q156 1.894 Very high -0.23 Almost easy
q157 1.714 Very high 2.43 Very difficult
q158 2.384 Very high 2.29 Very difficult
q159 1.792 Very high 2.16 Very difficult
q160 1.898 Very high 1.55 Difficult
Average difficulty 1.92 Difficult
Average discrimination 1.65 High
Notes: Discrimination values indicate the extent to which each item distinguishes between high and low performers, with higher values suggesting stronger discrimination. Difficulty values represent the level of difficulty for each item, with higher values indicating greater difficulty.
Table 8. Item difficulty and discrimination (2021–Session 4)
Question Discrimination Power of discrimination Difficulty Level of difficulty
q121 1.638 High -0.79 Almost easy
q122 1.541 High 0.83 Almost difficult
q123 1.418 High 1.99 Difficult
q124 1.452 High 1.61 Difficult
q125 1.366 High 0.91 Almost difficult
q126 1.811 Very high 0.06 Almost difficult
q127 1.652 High 2.10 Very difficult
q128 1.404 High 2.18 Very difficult
q129 1.586 High 1.30 Difficult
q130 1.071 Medium 2.26 Very difficult
q131 1.414 High 1.41 Difficult
q132 1.241 Medium 1.47 Difficult
q133 0.829 Medium 2.33 Very difficult
q134 2.124 Very high 0.71 Almost difficult
q135 0.813 Medium 2.07 Very difficult
q136 2.351 Very high 2.12 Very difficult
q137 0.574 Low 2.75 Very difficult
q138 1.321 Medium 2.47 Very difficult
q139 1.094 Medium 1.22 Difficult
q140 1.945 Very high 1.60 Difficult
q141 1.795 Very high 0.45 Almost difficult
q142 1.559 High 0.36 Almost difficult
q143 1.785 Very high 1.00 Difficult
q144 0.519 Low 2.26 Very difficult
q145 2.062 Very high 0.01 Almost difficult
q146 2.19 Very high 2.19 Very difficult
q147 1.78 Very high 2.54 Very difficult
q148 1.758 Very high 2.49 Very difficult
q149 2.351 Very high 2.30 Very difficult
q150 2.038 Very high 1.68 Difficult
q151 1.823 Very high -0.02 Almost easy
q152 1.807 Very high 2.26 Very difficult
q153 2.688 Very high 1.21 Difficult
q154 2.414 Very high 2.22 Very difficult
q155 1.951 Very high 1.48 Difficult
q156 1.69 High 0.22 Almost difficult
q157 2.002 Very high 1.46 Difficult
q158 1.265 Medium 1.08 Difficult
q159 1.484 High 0.85 Almost difficult
q160 2.314 Very high 1.12 Difficult
Average difficulty 1.44 Difficult
Average discrimination 1.65 High
Notes: Discrimination values indicate the extent to which each item distinguishes between high and low performers, with higher values suggesting stronger discrimination. Difficulty values represent the level of difficulty for each item, with higher values indicating greater difficulty.
Table 9. Reliability and summary of average item difficulty and discrimination (2021–Sessions 1, 2, 3, 4)
Year Sessions Parameter Values Interpretation Reliability
2021 Session 1 Average difficulty 1.26 Difficult 0.92
Average discrimination 1.89 Very high
Session 2 Average difficulty 1.89 Difficult 0.88
Average discrimination 2.09 Very high
Session 3 Average difficulty 1.92 Difficult 0.90
Average discrimination 1.65 High
Session 4 Average difficulty 1.44 Difficult 0.91
Average discrimination 1.65 High

Notes: Average difficulty values represent the average level of difficulty across all test items for each session, with higher values indicating greater difficulty. Discrimination values indicate the discriminatory power of test items, with higher values suggesting better discrimination between high and low performers. Reliability coefficients measure the consistency and stability of test scores, with values closer to 1.00 indicating higher reliability.
Table 10. Status summary of questions (2021–Sessions 1, 2, 3, 4)
Power of discrimination Session 1 Session 2 Session 3 Session 4
No 0 0 0 0
Very low 0 0 0 0
Low 0 0 0 2 (5%)
Medium 4 (10%) 4 (10%) 5 (12.5%) 7 (17.5%)
High 10 (25%) 7 (17.5%) 7 (17.5%) 12 (30%)
Very high 26 (65%) 29 (72.5%) 28 (70%) 19 (47.5%)
Negative 0 0 0 0
Level of difficulty Session 1 Session 2 Session 3 Session 4
Very easy 0 0 0 0
Easy 0 0 0 0
Almost easy 3 (7.5%) 0 2 (5%) 2 (5%)
Almost difficult 9 (22.5%) 5 (12.5%) 5 (12.5%) 9 (22.5%)
Difficult 21 (52.5%) 13 (32.5%) 19 (47.5%) 14 (35%)
Very difficult 7 (17.5%) 22 (55%) 14 (35%) 15 (37.5%)
Notes: Power of discrimination indicates the ability of questions to distinguish between high and low performers, with higher values representing better discriminatory power. Frequency counts show the number of questions falling into each category of difficulty and discrimination level for each session. Interpretations of difficulty levels, ranging from "Very easy" to "Very difficult," aid in understanding the distribution of questions based on their perceived difficulty. The absence of questions in certain difficulty or discrimination categories is denoted by "0" frequency counts.
Discussion
This study aimed to examine the dimensionality of English test items on the nationwide graduate entrance exam for healthcare applicants and to report test-takers’ characteristics. The characteristics of participants in three test packages belonging to three successive years (2019–2021) were described; four sessions in 2021 were analyzed for fitting a dimensional model. The results showed that female participants outnumbered male participants throughout the three years, and the proportion of admitted participants was similar (70% females vs. 30% males).
Additionally, a significant positive correlation between participants' GPA and English test scores was observed (p < 0.05); in fact, the higher the participants' GPA, the higher their English test scores at the master's entrance examination. While these findings highlight the importance of English language teaching in healthcare education, complementary views stress the significance of entrance test results as a predictor of test takers' success and excellence in their majors (18). In simpler terms, this relationship underscores the pressing necessity of incorporating English proficiency assessments into master's entrance exams: doing so acts as a gateway, granting an edge to individuals with advanced English skills, and serves as a predictor of their prospective success in their chosen fields of study.
In addition, the results of four administrations in 2021 were analyzed separately and reported in detail as a sample. In fact, all four tests had high reliability indices (i.e., 0.92, 0.88, 0.90, and 0.91). In other words, 92%, 88%, 90% and 91% of the variation among test measures was reliable, and only 8%, 12%, 10%, and 9% (applicable to the four tests) of the variance was attributed to measurement errors (19).
An important finding was the suitability, to some extent, of a unidimensional model, leading to the enhanced dependability of the tests. In fact, the dimensionality analysis revealed that only one dominant factor affected the test takers' performance (i.e., language knowledge) (17, 20). Similar findings from Oman have been reported in favor of psychometrically sound test items that attain satisfactory levels of unidimensionality, bridging the difficulty level of a test and participants' ability (21). Further evidence comes from the reliability coefficients (i.e., the four test reliability indices of 0.92, 0.88, 0.90, and 0.91). Similarly, the MHLE was reported to have a reliability of 0.862 (10), which is considered an acceptable reliability index (22). These two tests are both designed and administered by the Center for the Measurement of Medical Education to assess the language abilities of healthcare applicants.
Another important aspect explored in the present study was the examination of dimensionality parameters, where only two parameters (i.e., the difficulty parameter and the discrimination parameter) fit the model and the third (i.e., the guessing parameter) did not apply. The analysis revealed that the English test items in all four sessions were "difficult," with either "high" or "very high" discrimination power. In fact, neither "easy" nor "very easy" items were found in the tests, and none of the items showed "no" or "very low" discrimination power. While we found no studies on graduate exams for healthcare applicants, a similar study was conducted on the Ministry of Science, Research and Technology (MSRT) test, a high-stakes English language proficiency test (23). Analysis of the difficulty and discrimination indices of that test revealed that 14% of the items were either easy or very easy, 38% were medium, and 48% were either difficult or very difficult. This finding is not in line with ours, because that study examined the whole test, including other sections (listening comprehension, structure, and written expression, along with reading comprehension), whereas the present study examined only vocabulary and reading comprehension items. They classified 14% of the total items as nonfunctioning, discriminating negatively or not at all, which was not the case in the present study. In their study, 38% of the items displayed satisfactory difficulty, but low discriminating power was reported because the items were too easy (14%) or too difficult (48%) (23).
While concerns about jeopardizing validity due to the difficulty of such tests have been raised (24), Table 10 shows a balance between degrees of difficulty and discrimination power. For instance, session 1 results indicated a 75% degree of either “almost difficult” or “difficult” items, while in the same session, “high” or “very high” discrimination power was observed for 80% of the items. For session 2, all the items were difficult, while 90% of the items were discerning. Similar patterns can be seen in the other two sessions.

Conclusion
In brief, the present study showed that language knowledge was indeed tested during the entrance examination, since unidimensionality was observed. Additionally, clear difficulty and discrimination indices were in evidence, with no trace of a guessing effect. We found that the four sessions functioned well, with high reliability indices and good-quality test items in terms of difficulty and discrimination. Overall, a good balance was observed between the two parameters (i.e., difficulty and discrimination) (see Table 9 for details). Additionally, acceptably high reliability indices (i.e., 0.92, 0.88, 0.90, and 0.91) were observed in all four administration sessions. As for the dimensionality parameters, the four tests showed acceptable levels of difficulty for a nationwide exam, with either "high" or "very high" discrimination power. No "easy" or "very easy" items were found, and no items were associated with "no" or "very low" discrimination power.
With respect to the study limitations, the main concern was the confidentiality of the test takers’ performance. Additionally, obtaining the study data from the examination board required special arrangements that took a long time. In addition, we recommend that other researchers test the validity of the scale in future studies. We further feel that the concept of academic English was not fully operationalized as a construct due to practical limitations; for developing valid high-stakes tests, the inclusion of listening, writing, and speaking sections is suggested for future administrations. Another serious challenge may concern the consequential validity and occupational requirements for healthcare students, given the current trends (1, 25). A stronger emphasis should be placed on washback to bring about positive changes in teaching English to students in healthcare domains at the undergraduate level (11, 26), as well as on the revision of instructional systems at the graduate level. Future studies may focus on interviews with test developers and test takers to explore unheard voices.

Ethical considerations
The required data were obtained under confidentiality requirements from the Center for the Measurement of Medical Education directed by the Ministry of Health and Medical Education, without test takers’ personal information (e.g., name or identity information). The study was approved (Ethics code: IR.SBMU.RETECH.REC.1399.1222) by the Ethics Committee of Shahid Beheshti University of Medical Sciences, Tehran, Iran.
Artificial intelligence utilization for article writing
No AI was employed in writing the present article; the manuscript was written entirely by the authors.
Acknowledgments
We are grateful to the research committee of Shahid Beheshti University of Medical Sciences, Tehran, Iran, for approving the research proposal. Also, we would like to thank the Center for the Measurement of Medical Education for provision of the relevant data.
Conflict of interest statement
None.
Author contributions
Seyyed Samad Sajjadi has supervised the study; Nematullah Shomoossi has designed and drafted the proposal and manuscripts; Enayat Shabani has obtained the data and assisted in the data analysis; Abdurrashid Khazaei Feizabadi has designed and drafted the proposal; all authors have read, revised and approved the final version of the article.
Supporting resources
The study was funded and approved (Ethics Code: IR.SBMU.RETECH.REC.1399.1222) by the Ethics Committee of Shahid Beheshti University of Medical Sciences, Tehran, Iran.
Data availability statement
Relevant data are reported in the present article. Further data will be accessible for researchers only with the permission of the Iranian Center for the Measurement of Medical Education.

 
Article Type: Original Research | Subject: Education
Received: 2023/02/25 | Accepted: 2024/02/20 | Published: 2024/09/10

References
1. Dillon G, Boulet J, Hawkins R, Swanson D. Simulations in the United States medical licensing examination™(USMLE™). BMJ Quality & Safety. 2004;13(suppl1):i41-i5. [DOI]
2. Schwartzstein RM, Rosenfeld GC, Hilborn R, Oyewole SH, Mitchell K. Redesigning the MCAT exam: Balancing multiple perspectives. Academic Medicine. 2013;88(5):560-7. [DOI]
3. Séguis B, McElwee S. Assessing clinical communication on the Occupational English Test®. Global perspectives on language assessment: Research, theory, and practice. 2019:63-79. [DOI]
4. Anjali S, Sanjay Z, and Bipin B. India’s foreign medical graduates: an opportunity to correct India’s physician shortage. Educ Health (Abingdon). 2016;29(1):42-6. [DOI]
5. Thappa DM. Jawaharlal Institute of postgraduate medical education and research, pondicherry, India. Journal of Postgraduate Medicine. 2001;47(2):147.
6. Khodi A, Alavi SM, Karami H. Test review of Iranian university entrance exam: English Konkur examination. Language Testing in Asia. 2021;11:1-10. [DOI]
7. Lotfie MM. Language policy and practices in Indonesian higher education institutions. Intellectual Discourse. 2018;26(2):683-704.
8. Karakas A. Turkish lecturers' and students' perceptions of English in English-medium instruction universities. 2016.
9. Özdemir-Yılmazer M. Direct access to English-medium higher education in Turkey: variations in entry language scores. Dil Eğitimi ve Araştırmaları Dergisi. 2022;8(2):325-345. [DOI]
10. Marandi SS, Tajik L, Zohali L. On the construct validity of the Iranian Ministry of Health Language Exam (MHLE). Journal of Language Horizons. 2020;4(2):9-36. [DOI]
11. Hekmati N, Davoudi M, Zareian G, Elyasi M. English for medical purposes: An investigation into medical students’ English language needs. Iranian Journal of Applied Language Studies. 2020;12(1):151-76. [DOI]
12. ShayesteFar P. A model of interplay between student English achievement and the joint affective factors in a high-stakes test change context: Model construction and validity. Educational Assessment, Evaluation and Accountability. 2020;32(3):335-71. [DOI]
13. Nguyen T, Han H, Kim M, Chan K. An introduction to item response theory for patient-reported outcome measurement. Patient. 2014;7(1):23–35. [DOI]
14. Deng S, Bolt DM. A sequential IRT model for multiple-choice items and a multidimensional extension. Applied Psychological Measurement. 2016;40(4):243-57. [DOI]
15. Baker FB. The basics of item response theory. 2nd ed. ERIC Clearinghouse on Assessment and Evaluation; College Park, MD, USA: 2001.
16. Kim S-H, Kwak M, Bian M, et al. Item response models in Psychometrika and psychometric textbooks. Frontiers in Education. 2020;5:63. [DOI]
17. Sheybani E, Zeraatpishe M. On the dimensionality of reading comprehension tests composed of text comprehension items and cloze test items. International Journal of Language Testing. 2018;8(1):12-26.
18. Ocbian MM, Gamba MP, Ricafort JD. Admission test as predictor of performance of students in the English subject. JPAIR Institutional Research. 2015;6(1):34-45.
19. Abdellatif H, Al-Shahrani AM. Effect of blueprinting methods on test difficulty, discrimination, and reliability indices: cross-sectional study in an integrated learning program. Advances in Medical Education and Practice. 2019:23-30. [DOI]
20. Baghaei P, Aryadoust V. Modeling local item dependence due to common test format with a multidimensional Rasch model. International Journal of Testing. 2015;15(1):71-87. [DOI]
21. Abdellatif H. Test results with and without blueprinting: psychometric analysis using the Rasch model. Educación Médica. 2023;24(3):100802. [DOI]
22. Hughes A. Testing for Language Teachers. Cambridge: Cambridge University Press; 2020. [DOI]
23. Ghahraki S, Tavakoli M, Ketabi S. Applying a two-parameter item response model to explore the psychometric properties: The case of the Ministry Of Science, Research And Technology (MSRT) high-stakes English language proficiency test. Two Quarterly Journal of English Language Teaching and Learning University of Tabriz. 2022;14(29):1-26. [DOI]
24. Bazvand AD, Kheirzade S, Ahmadi A. On the statistical and heuristic difficulty estimates of a high stakes test in Iran. International Journal of Assessment Tools in Education. 2019;6(3):330-43. [DOI]
25. Shomoossi N, Rad M, Fiezabadi M, Vaziri E, Amiri M. Understanding the research process and historical trends in English for medical purposes using scientometrics and co-occurrence analysis. Acta Facultatis Medicae Naissensis. 2019;36(3):235-47. [DOI]
26. Shomoossi N, Rad M, Rakhshani MH. Efficacy of English language programs as judged by nurses and students of nursing: Do nurses in Iran need to know English? Acta Facultatis Medicae Naissensis. 2013;30(3):137. [DOI]

Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.