Intended for healthcare professionals

Learning In Practice

The death of the long case?

BMJ 2002; 324 doi: (Published 16 February 2002) Cite this as: BMJ 2002;324:408
  1. John J Norcini (jnorcini{at}, senior vice president
  1. American Board of Internal Medicine, Philadelphia, PA 19006, USA
  • Accepted 23 January 2002

The long case has tested the clinical skills of students for many years, yet inherent problems preclude it from being the sole source of decision making

For most of the past century the long case has been an essential tool in the evaluation of clinical skills, being used to make important judgments about the education and practice of doctors throughout the world. Despite its widespread use and rich tradition, the long case has major flaws, making it unwise to use it as the sole basis for making decisions of consequence. In this article I describe the strengths and weaknesses of the long case and report on some of the modifications that have been introduced to improve its performance.

Summary points

The long case has been an essential tool in the assessment of clinical skills

It evaluates performance with real patients and enables students to gather information and develop treatment plans under realistic conditions

Problems have been found with the reproducibility of scores generated by the long case

The long case can be improved by increasing the number of encounters, examiners, or aspects of a competence assessed

Organisations responsible for high stakes testing are increasingly abandoning the long case or using it only in combination with other forms of assessment

The long case

Although there are numerous variations on the long case, traditionally a student is given unobserved time with a real patient in a clinical setting. During that time the student conducts an interview and performs a physical examination as appropriate. The student then presents his or her findings and plans to the examiners, who ask about the patient and related topics enabling them to judge the quality of the student's performance.


The primary strength of the long case is that it evaluates the student's performance with a real patient. In the early stages of training, the objective structured clinical examination is often used, which assesses clinical skills well.1 However, the standardised or simulated patients that are part of the examination are limited in the number and complexity of the medical problems they can portray. Consequently, as students approach entry to practice, assessment also needs to be based on performance with real patients who can exhibit the range of conditions seen in the clinical setting.

The long case also presents students with a complete and realistic clinical challenge. They are required to obtain all relevant information from the patient, structure the problem, synthesise their findings, and formulate a management plan. This contrasts with the typical objective structured clinical examination, with each station focusing on one aspect of the doctor-patient encounter.


Implicit in the use of the long case is the assumption that if the student was examined again with another patient and different examiners, the results would be the same. Otherwise the scores could not be trusted to predict performance in practice, and it would make no sense to use them for assessment. The concept that test results should be able to be generalised or be repeatable is referred to as reliability or reproducibility.2

Over the past 30 years it has become increasingly apparent that the long case does not yield results that achieve reasonable levels of reproducibility. For example, in the early 1970s the American Board of Internal Medicine's oral examination for cardiovascular disease consisted of two long cases, each with two examiners.3 Putting both cases together yielded a score that had a reproducibility coefficient of 0.39, meaning that 39% of the variability in scores was due to students' ability (signal) and 61% to errors of measurement (noise). When adjusted to prophesise what would happen with only one long case, the coefficient drops to 0.24, indicating that scores are composed of more than three times as much noise as signal. Other studies have obtained similar results, in stark contrast to clinical examinations such as the objective structured clinical examination, which often achieve reproducibility coefficients of 0.80 or better. 4 5 The table presents some typical findings for reproducibility of various formats.

Three major factors explain why the long case has problems with reproducibility.8 In decreasing order of importance they are the case specificity of problem solving, differences between examiners, and variability in the aspects of an encounter evaluated.

Case specificity

For the scores from the long case to be reproducible, students must perform at the same level regardless of the patient they examine, yet physician performance varies from case to case. The case specificity of problem solving was identified by Elstein and colleagues.9 It has been replicated in many studies.

These findings should not be surprising. Physicians know that they do not perform uniformly across all patient problems or even across different patients with the same problem. They have areas of relative strength and weakness, they respond differently to patients depending on their personal and professional experiences, and patients respond differently to them depending on a variety of factors. Therefore an assessment device must sample broadly across patients to generate scores that will generalise to typical performance.

Examiner effects

For scores to be reproducible, examiners must apply the same standards. Research shows that even experienced educators differ when assessing the same event.10 This should not be surprising either. Physicians have legitimate and desirable differences in knowledge, standards, emphasis, and values. Likewise, they occasionally respond out of their own deficits of knowledge or the inappropriate influence of the patient's or student's style, appearance, race, sex, ethnicity, and so on. Further, patients' conditions naturally vary in difficulty, and the examiners must precisely compensate for the differences among them. These issues, and others, ensure that examiners differ when evaluating the same thing, undermining the reproducibility of the scores for the long case. An assessment device must sample across examiners to generate reproducible results.

Reproducibility of assessment formats studied by the American Board of Internal Medicine (estimates, based on three hours' testing time, will vary in other settings depending on quality of test material and heterogeneity of examinees)

View this table:

Aspects of a competence assessed

For scores to be reproducible, it is important to assess several aspects of the competence being elicited by the student-patient encounter. Specifically, a variety of studies show that the information obtained from measurements is increased when examiners are instructed to evaluate a standardised list of different features of a competence, or when they observe the student-patient encounter rather than make a single global assessment or base their judgments on interrogation alone.1113 Again, these empirical results should not be surprising. Without specific instruction examiners will naturally attend to different aspects of an encounter, and this will be reflected in their evaluations. An assessment device must sample systematically across aspects of a competence to generate reproducible results.

Modifications to the long case

Without modification the traditional long case should not be used by itself to make critical decisions about the competence of a student. However, at least three strategies have been applied to improve its performance.


Increasing the number of student-patient encounters is the single most important step in rehabilitating the long case. The addition of each encounter produces noticeable gains in the reproducibility of scores, even if there are limits on the number of examiners and aspects of competence involved. With enough encounters it is possible to achieve a high level of reproducibility and, without enough cases, all other changes will have only minimal impact. Where it is impractical to add long cases, one alternative is to add short cases, stations in the objective structured clinical examination, or other forms of assessment to broaden the sample of student-patient encounters.


Examiners can be manipulated in at least three ways to improve the reproducibility of scores: by employing a statistical model that removes differences among them, by training them, or by increasing their numbers. By themselves these strategies produce only modest gains in reproducibility. Of them, increasing the number of examiners will have the largest effect, although gains will be minimal beyond a certain point (say four or five examiners).

Aspects of a competence

Increasing the number of aspects of a competence assessed and standardising them across examiners has a modest positive effect on the reproducibility of scores. For the long case this has included having the examiners observe the student-patient interaction or providing them with a list of competencies to evaluate. Again, the impact of this will be minimal beyond a certain point (say 5 or 10 aspects of a competence).


  • Competing interests None declared.


  1. 1.
  2. 2.
  3. 3.
  4. 4.
  5. 5.
  6. 6.
  7. 7.
  8. 8.
  9. 9.
  10. 10.
  11. 11.
  12. 12.
  13. 13.
View Abstract