Using Rasch Analysis to Develop an Extended Matching Question (EMQ) Item Bank for Undergraduate Medical Education
Mike Horton, Bipin Bhakta, Alan Tennant
Medical student training
Miller (1990) identified that no single assessment method can provide all the data required for judging anything as complex as the delivery of professional services by a competent physician:
• Knowledge
• Skills
• Attitudes
Miller’s pyramid of competence
• Does: case presentations, log books, direct observation of clinical activity
• Shows how: OSCE
• Knows how: EMQ, FRQ, essays
• Knows: MCQ
What are assessments used for?
Primary aim • To identify the student who is deemed to be safe and has achieved the minimal acceptable standard of competence
Secondary aim • To identify students who excel
Written Assessment
• “True or False” questions
• “Single, best option” multiple choice questions
• Multiple true or false questions
• “Short answer” open ended questions
• Essays
• “Key feature” questions
• Extended matching questions
Free response questions
Broadly, free-response questions are commonly believed to test important higher-order reasoning skills. Validity is high because examinees have to generate their own responses, rather than selecting from a list of options.
However…
• Only a narrow range of subject matter can be assessed in a given amount of time
• They are administratively resource-intensive
• Due to their nature, their reliability is limited
Multiple choice questions
Multiple choice questions have been widely used and are popular because:
• They generally have high reliability
• They can test a wide range of themes in a relatively short period of time
However…
• They only assess knowledge of isolated facts
• By giving an option list, examinees are cued to respond and the active generation of knowledge is avoided
What are Extended Matching Questions (EMQs)?
EMQs are used as part of the undergraduate medical course assessment programme
EMQs are used to assess factual knowledge, clinical decision making and problem solving
They are a variant of multiple choice questions (MCQs)
EMQs are made up of 4 components: a theme, an option list, a lead-in instruction, and a set of item stems (clinical vignettes)
Example of Extended Matching Question (EMQ) format (taken from Schuwirth & van der Vleuten, 2003)

Theme: Micro-organisms

Answer options:
A. Campylobacter jejuni
B. Candida albicans
C. Clostridium difficile
D. Clostridium perfringens
E. Escherichia coli
F. Giardia lamblia
G. Helicobacter pylori
H. Mycobacterium tuberculosis
I. Proteus mirabilis
J. Pseudomonas aeruginosa
K. Rotavirus
L. Salmonella typhi
M. Shigella flexneri
N. Tropheryma whippelii
O. Vibrio cholerae
P. Yersinia enterocolitica

Instructions: For each of the following cases, select (from the list above) the micro-organism most likely to be responsible. Each option may be used once, more than once or not at all.

1. A 48 year old man with a chronic complaint of dyspepsia suddenly develops severe abdominal pain. On physical examination there is general tenderness to palpation with rigidity and rebound tenderness. Abdominal radiography shows free air under the diaphragm.
2. A 45 year old woman is treated with antibiotics for recurring respiratory tract infections. She develops severe abdominal pain with haemorrhagic diarrhoea. Endoscopically, a pseudomembranous colitis is seen.
Item Pools
Currently, many institutions formulate their tests from year to year by selecting items from a pre-existing pool of questions.
• Questions are pre-existing
• Time and resources are saved by employing this method
However…
It has been widely recognised that if tests are made up of items from a pre-existing item pool, then the relative difficulty of the exam paper will vary from year to year. [McHarg et al (2005), Muijtjens et al (1998), McManus et al (2005)]
Item Pools
If the questions have been set, used and assessed using traditional approaches, this will provide a certain amount of information about each individual question. However, there are also drawbacks to the traditional approach.
It has been recognised [Downing (2003)] that Classical Test Theory (CTT) has a certain limitation: its estimates are sample dependent.
Thus, the comparability of examination results from year to year will be confounded by the overall difficulty of the exam and the relative ability of the examinees, rendering a comparison invalid.
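To make the sample-dependence point concrete, here is a minimal sketch with made-up response data (not from the study): in classical test theory, an item's difficulty statistic is simply the proportion of examinees answering it correctly, so the same item looks easier or harder depending on which cohort sat it.

```python
# Illustrative sketch with invented data: under CTT, item "difficulty"
# is the proportion correct, which shifts with the cohort tested.

def ctt_difficulty(responses):
    """CTT item facility: proportion of examinees answering correctly."""
    return sum(responses) / len(responses)

# The same hypothetical item administered to two cohorts of differing ability:
strong_cohort = [1, 1, 1, 1, 0, 1, 1, 1]
weak_cohort = [1, 0, 0, 1, 0, 0, 1, 0]

print(ctt_difficulty(strong_cohort))  # 0.875 -- the item looks "easy"
print(ctt_difficulty(weak_cohort))    # 0.375 -- the same item looks "hard"
```

The item has not changed between the two runs; only the sample has, which is why CTT difficulty estimates cannot anchor year-to-year comparisons.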
This is particularly troublesome when we wish to • compare different cohorts of students • maintain a consistent level of assessment difficulty over subsequent administrations.
The problem
What is the best way to ensure that all passing students have reached the required level of expertise?
2 forms of pass mark selection • Criterion referenced • Norm referenced
Criterion referenced refers to when a specific pass mark has been designated prior to the exam as a pass/fail point.
Norm referenced refers to when a specific proportion of the sample is designated to pass the exam, e.g. the highest scoring 75% of students will pass.
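The difference between the two rules can be sketched in a few lines of Python; the scores below are invented for illustration only.

```python
# Hypothetical exam scores (percentages) for ten students.
scores = [42, 55, 61, 58, 70, 49, 66, 73, 52, 60]

# Criterion referenced: a pass mark fixed before the exam (here 60%).
criterion_passes = [s for s in scores if s >= 60]

# Norm referenced: a fixed proportion passes (here the top 75%),
# whatever standard that turns out to require.
cut = int(len(scores) * 0.75)
norm_passes = sorted(scores, reverse=True)[:cut]

print(len(criterion_passes))  # 5 students reach the fixed standard
print(len(norm_passes))       # 7 students pass by rank alone
```

Note that under the norm-referenced rule, the students scoring 58 and 55 pass despite falling below the 60% standard, which is exactly the risk described on the following slide.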
Norm referenced or Criterion Testing? Norm-referenced testing
• Whatever the ability of the students taking the test, a fixed proportion of them will pass/fail
• The standard needed to pass the test is not known in advance
• The validity of the norm-referencing method relies on a homogeneous sample, which may not necessarily be the case
• There is also the risk that, with a less able group of students, a student could pass the exam without reaching the acceptable level of clinical knowledge
Norm referencing is therefore not appropriate
Norm referenced or Criterion Testing? Criterion testing
The relative difficulty of the exam can change depending on the items that are in it, so a pre-determined pass mark could be easier or harder to attain depending upon the items in the test.
Although criterion referenced tests have their own disadvantages, it has been recognised [Wass et al (2001)] that they are the only acceptable means of verifying that a pre-defined clinical competency has been reached.
Solution?
It has been identified [Muijtjens (1998)] that a criterion referenced test could be utilised if a test was constructed by selecting items from a bank of items of known difficulty, which would then enable measurement and control of the test difficulty.
Difficulty estimates, as defined by classical test theory, are sample dependent!
Item Banking
Item banking
Item Banking is a process whereby all EMQ Items that have been used over the past few years are ‘banked’ and calibrated onto the same metric scale
Previously used EMQ Items
Psychometrically calibrated using Rasch Analysis
Data is linked by common items that have been used between the exams • “Common Item Equating”
Rasch Analysis
When data fit the model, the EMQ difficulties generalise beyond the specific conditions under which they were observed (specific objectivity). In other words, item difficulties are not sample dependent as they are in Classical Test Theory.
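The slides do not show the model itself, so as background: the dichotomous Rasch model gives the probability of a correct response as a logistic function of person ability minus item difficulty (both in logits). A small sketch with hypothetical difficulties shows specific objectivity at work: the log-odds gap between two items is identical for every person, so item comparisons do not depend on the sample.

```python
import math

def p_correct(theta, b):
    """Dichotomous Rasch model: P(correct) for a person of ability theta
    on an item of difficulty b (both in logits)."""
    return math.exp(theta - b) / (1 + math.exp(theta - b))

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))

b1, b2 = -0.5, 1.2  # two hypothetical item difficulties

# For ANY ability level, the log-odds gap between the items equals b2 - b1:
for theta in (-1.0, 0.0, 2.0):
    gap = logit(p_correct(theta, b1)) - logit(p_correct(theta, b2))
    print(round(gap, 6))  # 1.7 every time
```

Because logit(P) = theta - b under the model, the person term cancels in the comparison; this is the property that lets item difficulties be banked on one metric scale.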
Item banking
[Figure: Terms 1 to 4 each contain a different item (Items 1 to 4), with no item shared between terms]
These items cannot be directly compared as there are no common links
Item banking
[Figure: Terms 1 to 4 each contain a different item (Items 1 to 4), plus one common item (Item 5) administered in every term]
These items can be directly compared via the common link item across all terms
Item banking
Following calibration, the set of items within the bank will define the common domain that they are measuring (in this case, medical knowledge).
These items, therefore, will provide an operational definition of this unobservable (latent) trait.
When all EMQ Items are calibrated onto the same scale, then it will be possible to compare the performance of students across terms, despite the fact that the EMQ exam content was made up of different items across each term.
Sufficient Linkage?
What is classed as sufficient linkage between item sets? There has been some variation in the literature regarding this. Three differing viewpoints suggest that:
• linking items should number the larger of 20 items or 20% of the total number of items [Angoff (1971)]
• 5 to 10 items are sufficient to form the common link [Wright & Bell (1984)]
• one single common item could provide a valid link in co-calibrating datasets [Smith (1992)]
However, it has also been suggested [Smith (1992)] that the larger the number of common items across datasets, the greater the degree of precision and stability for the item bank.
Potential Problems?
Limited Linkage • Data overlap reduced
Potential Misfit or DIF on Link Items
Sparse Data Matrix
[Figure: sparse data matrix with Terms 1 to 8 as rows and items Q1 to Q25 as columns, showing which items each test form contained and the partial overlap between forms]
Sample
Data was collected from 550 4th year medical students over 8 terms
All EMQ data was read in to a single analysis to establish the Item Bank
RUMM2020 Rasch Analysis software was used
Over the 8 terms, 6 different test forms were used (the test remained the same over the first 3 terms)
EMQ Item Bank
[Table: the full item bank listing, pairing theme codes (1 to 35) with their question codes (1 to 343) across the 6 exam forms]
• Approximately 25% of the items were changed from exam form to exam form
• This provides good linkage
EMQ Item Bank
Overall fit statistics were good.
Low Person Separation Index
The Person Separation Index was fairly low, but we would expect this to a certain degree due to the highly focussed sample.
Problematic Items
Approximately 12.5% (26/205) of Individual Items were found to display some sort of misfit or DIF
Misfit
What does misfit tell us?
Misfitting items are flagged up for non-inclusion and will either be amended or removed
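As a rough illustration of what a fit statistic responds to, here is a standardized-residual sketch using generic Rasch methodology with invented values; RUMM2020's own fit statistics are summaries built from residuals like these.

```python
import math

def p_correct(theta, b):
    """Dichotomous Rasch model probability of a correct response."""
    return math.exp(theta - b) / (1 + math.exp(theta - b))

def std_residual(x, theta, b):
    """Standardized residual: (observed - expected) / model standard deviation."""
    p = p_correct(theta, b)
    return (x - p) / math.sqrt(p * (1 - p))

# An able student (theta = 2.0 logits) unexpectedly failing an easy item
# (b = -1.0 logits) yields a large negative residual; many such surprises
# across students flag the item as misfitting.
print(round(std_residual(0, 2.0, -1.0), 2))  # about -4.48

# An on-model response barely registers:
print(round(std_residual(1, 0.0, 0.0), 2))   # 1.0
```

Aggregating squared residuals over persons gives an item-level fit index; items whose index falls outside acceptable bounds are the ones flagged for amendment or removal.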
Differential Item Functioning
Misfit & DIF
DIF could be due to:
• Curriculum changes • Teaching Rotations • Testwiseness
Does Exam Difficulty remain equal over different exam forms?
Example across item set using Exam Form 1 as the Baseline
Measurement and the pass mark
[Figure: the latent trait shown as a line of increasing student ability from 0 to 100, with a minimum standard marked on it; the "real" assessment is a 100-item EMQ exam with a 60% pass mark]
Equivalent standard?
Exam Form 1 was based on 98 items.
The pass mark was set at 60%.
60% of 98 = raw score pass mark of 58.8.
Equivalent pass mark?
Raw score of 58.8 on Exam Form 1 = 0.561 logits
Equating Problems
We had to remove 2 extreme items; Exam Form 1 was therefore out of 98.
Exam Form | Maximum obtainable score
1 | 98
2 | 99
3 | 100
4 | 100
5 | 100
6 | 99
Equivalent standard?
Exam Form 1 had 98 items; 60% of 98 = 58.8; a raw score of 58.8 on Exam Form 1 = 0.561 logits.
Exam Form | 0.561 logit equated score | Out of | Equated pass mark (%)
1 | 58.7 | 98 | 59.9%
2 | 57.7 | 99 | 58.3%
3 | 58.2 | 100 | 58.2%
4 | 59.6 | 100 | 59.6%
5 | 62.4 | 100 | 62.4%
6 | 59.9 | 99 | 60.5%

The equivalent pass mark varies across forms by 62.4% - 58.2% = 4.2%.
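The equated raw-score pass marks above follow from the test characteristic curve: for each form, the expected raw score of a student at the pass ability of 0.561 logits. A sketch under assumed, randomly generated item difficulties, since the bank's actual calibrations are not reproduced here:

```python
import math
import random

def expected_score(theta, difficulties):
    """Test characteristic curve: expected raw score at ability theta (logits)
    is the sum of the Rasch success probabilities over the form's items."""
    return sum(math.exp(theta - b) / (1 + math.exp(theta - b))
               for b in difficulties)

# Hypothetical item difficulty calibrations for two forms of unequal difficulty.
random.seed(0)
form_1 = [random.gauss(0.0, 1.0) for _ in range(98)]  # 98-item form
form_2 = [random.gauss(0.3, 1.0) for _ in range(99)]  # slightly harder 99-item form

theta_pass = 0.561  # the pass standard in logits, as derived from Form 1

# The same ability standard maps to a different raw-score pass mark per form:
print(round(expected_score(theta_pass, form_1), 1))
print(round(expected_score(theta_pass, form_2), 1))
```

Because the ability standard, rather than the raw percentage, is held constant, the raw pass mark moves with each form's item mix; that is the source of the 4.2% spread shown above.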
Conclusion
Item Banking is a way of assessing the psychometric properties of EMQs that have been administered over different test forms
Can identify and adapt poor questions
Can perform a comparative analysis of relative test form difficulty
Should the pass mark be amended every term?
References
1. Miller GE. The assessment of clinical skills/competence/performance. Academic Medicine 1990; 65: S63-7.
2. McHarg J, Bradley P, Chamberlain S, Ricketts C, Searle J, McLachlan JC. Assessment of progress tests. Medical Education 2005; 39: 221-227.
3. Muijtjens AMM, Hoogenboom RJI, Verwijnen GM, Van der Vleuten CPM. Relative or absolute standards in assessing medical knowledge using progress tests. Advances in Health Sciences Education 1998; 3: 81-87.
4. McManus IC, Mollon J, Duke OL, Vale JA. Changes in standard of candidates taking the MRCP(UK) Part 1 examination, 1985 to 2002: Analysis of marker questions. BMC Medicine 2005; 3 (13).
5. Downing SM. Item response theory: applications of modern test theory in medical education. Medical Education 2003; 37: 739-745.
6. Wass V, Van der Vleuten C, Shatzer J, Jones R. Assessment of clinical competence. Lancet 2001; 357: 945-94.
7. Angoff WH. Scales, norming, and equivalent scores. In: Thorndike RL, editor. Educational Measurement. 2nd ed. Washington (DC): American Council on Education; 1971. p508-600.
8. Wright BD, Bell SR. Item banks: what, why, how. Journal of Educational Measurement 1984; 21(4): 331-345.
9. Smith RM. Applications of Rasch Measurement. Chicago: Mesa Press; 1992.
New Book Smith EV Jr. & Stone GE (Eds.). Criterion Referenced Testing: Practice Analysis to Score Reporting Using Rasch Measurement Models. Maple Grove, Minnesota. JAM Press; 2009
Contact Details Mike Horton:
[email protected]
Matt Homer:
[email protected]
Alan Tennant:
[email protected]
Website:
http://www.leeds.ac.uk/medicine/rehabmed/psychometric/
Course | Dates
Introductory | March 10-12 2010, May 12-14 2010, Sept 15-17 2010, Dec 1-3 2010; March 23-25 2011, May 18-20 2011, Sept 14-16 2011, Nov 30-Dec 2 2011
Intermediate | May 17-19 2010, Sept 20-22 2010, Dec 6-8 2010; May 23-25 2011, Sept 19-21 2011, Dec 5-7 2011
Advanced | Sept 23-24 2010; Sept 22-23 2011