

The long case versus objective structured clinical examinations

BMJ 2002;324:748 (Published 30 March 2002)

The long case is a bit better, if time is equal

  Geoff Norman, professor
  Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Canada L8N 3Z5

    The examination of graduates of medicine to ensure competence has a long tradition predicated on the historical right of self regulation bestowed on the professions. While many may wish to replace such summative and frequently punitive assessment with softer assessment to facilitate learning, this amounts to a shirking of social responsibility. A consequence of the importance attached to such examinations is that considerable research has been devoted to establishing the reliability and validity of these examinations.

    One truism in educational research is that few self evident truths are true. Historically, it has seemed self evidently true that an experienced physician could, by active questioning around a case, determine whether a candidate was or was not competent—the long case. Unfortunately this assertion was challenged by evidence showing that the reliability of the long case was insufficient to justify decisions about competence to practice.1 The replacement of the long case by objective structured clinical examinations was predicated on a second self evident truth—the promise of truly objective clinical assessment using checklists, which should self evidently lead to more objective or reliable assessment.

    Now, horror of horrors, along comes a study showing that maybe the long case was not so bad after all.2 This study, published earlier this year in Medical Education, involved assessment of 214 final year undergraduates, each of whom did two long cases using patients, and 20 stations for objective structured clinical examinations. After the numbers settled out, the reliability of the long case was 0.84 and of the objective examinations 0.72. The authors conclude that the reliability of long cases is no worse or no better than objective structured clinical examinations in assessing clinical competence.

    But perhaps all is not quite what it appears. The long case was observed and evaluated against a “previously used check list which itemised key features of history taking …” as well as global ratings. The quoted reliabilities are for 200 minutes of testing involving 10 long cases or 30 stations for objective examinations. The problem is that no one I know was ever observed doing the long case, no examination ever prepared a detailed scoring sheet in advance, and no candidate ever had an examination with 10 cases. Typically the examination consists of one or two long cases lasting an hour or two where examiners ask their pet questions then give an overall score.

    These differences are critical. Performance on one problem is a poor predictor of performance on the next one.3 So the only solution is to sample many problems. That is the real strength of the objective structured clinical examinations and the real weakness of the traditional long case exam. Still, for equal testing time, the long case turns out a bit better than the objective structured clinical examinations.
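
    The effect of sampling more problems can be illustrated with the Spearman-Brown prophecy formula, which predicts the reliability of a test lengthened by a factor k from the reliability of a single unit. The sketch below is only an illustration: it assumes the cases behave as parallel measurements, the function names are hypothetical, and it is not the analysis used in the study under discussion. The only figure taken from the text is the reliability of 0.84 quoted for 10 long cases.

```python
def spearman_brown(r_single: float, k: float) -> float:
    """Predicted reliability of a test made of k units,
    given the reliability of a single unit (Spearman-Brown)."""
    return k * r_single / (1 + (k - 1) * r_single)


def single_unit_reliability(r_total: float, k: float) -> float:
    """Inverse of Spearman-Brown: reliability of one unit,
    given the reliability of a test made of k such units."""
    return r_total / (k - (k - 1) * r_total)


# Figure quoted in the editorial: 0.84 for 10 long cases (200 minutes).
r_ten_cases = 0.84

# Assumption: cases are interchangeable, so the formula applies.
r_one_case = single_unit_reliability(r_ten_cases, 10)
r_two_cases = spearman_brown(r_one_case, 2)

print(f"one long case:  {r_one_case:.2f}")
print(f"two long cases: {r_two_cases:.2f}")
print(f"ten long cases: {spearman_brown(r_one_case, 10):.2f}")
```

    Under these assumptions a single long case would have a reliability of roughly 0.34 and two long cases roughly 0.51, which is why an examination built on one or two cases falls well short of the figure quoted for 10.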

    What happened to all the gains from the standardisation in the objective structured clinical examinations? Well, maybe they are an illusion after all. Since the variability of performance across problems is the major cause of poor reliability in assessment, efforts to standardise what happens within a case are likely to lead to only small gains.4 Further, in the long case, examiners used global ratings; in the objective examinations, they exclusively used checklists. And detailed objective checklists turn out to be less reliable than ratings.5 Perhaps the superiority of the long case in this study reflects the use of rating scales rather than the use of cases that were not standardised.

    A companion paper to this study is a further evaluation of the entire examination.6 This included four written examinations—a multiple choice, true or false test of basic factual knowledge; an extended matching test of problem solving skills; a short answer test of problem solving and data interpretation skills; and an essay to assess ability to present written debate and communicate with professional colleagues. The question was how best to add up the subscores. The answer turned out to accord with best statistical principles. The optimum approach, in terms of maximising reliability, was to weight according to the number of items. But this obscures an underlying philosophical dilemma. Combining scores on subtests which are supposed to measure different things amounts to an admission that they are not so different after all. The fine print in the paper actually confirms this—all the correlations between tests except two or three are in the mid range.

    The explanation for this finding is twofold, and is a second example of self evident truths that are not. Firstly, despite the claims of their inventors, different testing formats do not necessarily measure different things.7 Secondly, the notion that problem solving skills or communication skills can ever be separated from the content of the problem and assessed separately is simply wrong.8 9 So differences in the correlations between tests probably reflect content differences more than different skills.

    Current discussions about best evidence medical education are an indication that, just as in clinical medicine, intuitions will frequently be at variance with evidence.10 And since we will continue to be engaged in activities to ensure that our graduates are competent, these procedures should be based on evidence of their effectiveness. Ignorance of relevant evidence is no more pardonable in education than in clinical medicine.

