Education And Debate

The need for caution in interpreting high quality systematic reviews

BMJ 2001; 323 doi: (Published 22 September 2001) Cite this as: BMJ 2001;323:681
  1. Kevork Hopayian, general practitioner (k.hopayian{at}
  1. Seahills, Leiston Road, Aldeburgh IP15 5PL
  • Accepted 25 June 2001

The emergence of systematic reviews raised hopes of a new era for the objective appraisal of evidence available on a given topic. Such reviews promised a synthesis of trial results, which could be conflicting, and an escape from the personal bias inherent in traditional reviews and expert opinion.1 As the discipline of systematic reviews has evolved, however, two new problems have arisen: the quality of reviews is variable 2 3; and two or more systematic reviews on the same topic may arrive at different conclusions, raising questions on the validity4-7 or the relevance8 of the conclusions. Moreover, adherence to a “checklist” system when appraising trials may overlook important clinical details in the original trials and so reduce the validity of the review. I uncovered this last shortcoming when I recently conducted a study of three systematic reviews; the study is reported here.

Summary points

The discipline of systematic reviews has given clinicians a valuable tool with which to synthesise evidence

As the methodology of systematic reviews has evolved, the quality of reviews has improved

Nevertheless, high quality systematic reviews may overlook important clinical details in the papers reviewed, thereby diminishing their validity

This shortcoming might be avoided if trials were assessed from a clinician's viewpoint as well as from a reviewer's viewpoint


Guidelines have been drawn up to improve the quality of reviews.9 Differences in the quality of reviews, however, do not always explain discordance. Jadad and McQuay4 identified six sets of reviews covering six topics in pain research; despite similar quality scores for reviews in each set, four of the sets contained discordant reviews. Jadad et al8 identified six generic differences between reviews that might lead to discordance: the clinical question asked; the selection and inclusion of studies; data extraction; assessment of study quality; assessment of the ability to combine studies; and statistical methods for data analysis.

The case of epidural steroid injection therapy for sciatica is a good illustration of the evolution of reviews. The results of randomised controlled trials of this treatment were inconsistent. Two traditional reviews of these trials appeared—in 198510 and 1986.11 They reached discordant conclusions. A decade later, two systematic reviews—by Watts and Silagy12 and Koes et al13—also reached discordant conclusions. A comparison of these reviews concluded that the difference in their methods—namely, vote counting versus pooling—explained the discordance.14 A further systematic review (of all types of injection therapies, including epidural) was published by Nelemans et al for the Cochrane Collaboration in 1999.15 The three systematic reviews overlap in their nature (qualitative versus quantitative), method for assessing the quality of randomised controlled trials (following that of ter Riet et al16 or Chalmers et al17), and conclusions (table 1). I therefore used them to conduct a general study of the validity of systematic reviews.
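How vote counting and pooling can disagree on the same set of trials can be shown with a small sketch. The figures below are invented for illustration and are not taken from any trial or review discussed here. The sketch assumes a simple form of vote counting (counting trials that are individually significant at the 5% level) and a fixed-effect, inverse-variance pooling of log odds ratios; the reviews themselves may have used different rules.

```python
import math

# Hypothetical 2x2 data, NOT from any trial in the reviews:
# (improved_treated, n_treated, improved_control, n_control)
TRIALS = [
    (12, 20, 8, 20),
    (15, 25, 10, 25),
    (18, 30, 12, 30),
    (14, 24, 9, 24),
]


def log_odds_ratio(a, n1, c, n2):
    """Return the log odds ratio and its variance for one trial."""
    b, d = n1 - a, n2 - c
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def is_significant(log_or, var, z=1.96):
    """True when the 95% confidence interval excludes no effect (log OR = 0)."""
    return abs(log_or) > z * math.sqrt(var)


def vote_count(trials):
    """Count trials that individually show a significant benefit."""
    votes = 0
    for trial in trials:
        log_or, var = log_odds_ratio(*trial)
        if log_or > 0 and is_significant(log_or, var):
            votes += 1
    return votes


def pooled_estimate(trials):
    """Fixed-effect inverse-variance pooled log odds ratio and its variance."""
    stats = [log_odds_ratio(*trial) for trial in trials]
    weights = [1 / var for _, var in stats]
    pooled = sum(w * lor for w, (lor, _) in zip(weights, stats)) / sum(weights)
    return pooled, 1 / sum(weights)


if __name__ == "__main__":
    pooled, var = pooled_estimate(TRIALS)
    print("significant trials:", vote_count(TRIALS), "of", len(TRIALS))
    print("pooled result significant:", is_significant(pooled, var))
```

With these invented numbers no single trial reaches significance, yet the pooled estimate does, so a reviewer counting votes and a reviewer pooling data would report opposite conclusions.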


Table 1 : Summary of systematic reviews assessed for validity

Assessing the validity of the three reviews

Background and method

My interest in epidural steroid injection as a treatment for sciatica stems from a question that arose in general practice and at a general practice commissioning board. It was framed as a three part, focused question (box 1).18 I retrieved the relevant trials that were included in all three reviews and critically appraised each individual paper for validity and relevance to this question. 19 20

Box 1 : Three part focused question

Population—Patients with sciatica

Intervention—Injection of corticosteroid into the epidural space compared with placebo or injection of local anaesthetic

Outcome—Which intervention leads to quicker pain relief?


Box 2 : Quality of systematic reviews

I tried to assess the quality of each systematic review using a validated rating scale, the Oxman and Guyatt index.21 This tool consists of questions about how the review is designed and reported; it does not require knowledge about the trials themselves. For two reasons, however, it was inappropriate to give scores. Firstly, the scale favours reviews that combine data and therefore would have discriminated against Koes et al. Secondly, two of the items on the scale relate to aspects of systematic reviews that I am disputing in this article (see box 2 for comments on the criteria used in each review). The final step was to evaluate the reviews' treatment of the randomised controlled trials against my own appraisals.


All three reviews were of high quality according to the Oxman and Guyatt index (box 2). Three problems, however, compromised their validity: the relevance of the study population (inclusion of atypical populations); the appropriateness of the intervention (inclusion of one study with a serious problem in its design); and the adequacy of the outcome measures (inclusion of studies with inappropriate outcome assessments).

Atypical populations

Both the Koes and the Nelemans reviews included atypical populations—notably patients with pain despite or because of spinal surgery. 22 23 One trial had a high proportion of patients with arachnoiditis,24 which can be a complication of surgery and of epidural injections when the steroid used is methylprednisolone. These populations are clinically and pathologically distinct from patients with back pain or sciatica who are treated by most clinicians and included in all the other trials.

Although the value of “lumping” (that is, the pooling of results from studies with heterogeneous populations) has been cogently defended,25 guidelines warn against combining studies that are too heterogeneous.9 The fundamental differences between most of the randomised controlled trials and the atypical ones mean that lumping in this case makes no clinical sense.

Flawed design

Koes contended that a “fatally” flawed design could slip through a checklist system used to score randomised controlled trials: “One of the drawbacks of using this list of methodological criteria might be that trials showing a fatal mistake … might end up with a high score because of other criteria.”13

In the trial by Cuckler et al,26 for example, this did happen. Patients were assessed 24 hours after receiving either epidural steroid or placebo injections; those who had not improved were given active treatment. This led to contamination of the placebo group, so the analysis by intention to treat 13 months later was not really comparing treatment against placebo. Despite this flaw, the trial was included in all three reviews and received a comparatively high rating in all three, and its results were used in pooling by the two quantitative reviews.
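The way a summed checklist can mask a fatal flaw, and one possible safeguard, can be sketched as follows. The item names, point values, and the threshold of 50 are hypothetical and are not the criteria of ter Riet et al or Chalmers et al; the separate `fatal_flaws` field is an assumption about how a clinician's reading might be recorded alongside the score.

```python
from dataclasses import dataclass, field


@dataclass
class TrialAppraisal:
    name: str
    item_scores: dict  # hypothetical checklist item -> points awarded
    fatal_flaws: list = field(default_factory=list)  # free-text clinical objections


def checklist_total(appraisal):
    """Sum the checklist points, as a purely additive scoring system would."""
    return sum(appraisal.item_scores.values())


def eligible_for_pooling(appraisal, threshold=50):
    """Require an adequate score AND the absence of any recorded fatal flaw."""
    return checklist_total(appraisal) >= threshold and not appraisal.fatal_flaws


# Illustrative trial: scores well on reporting items but has a contaminated
# control group. All numbers are invented for this sketch.
example = TrialAppraisal(
    name="hypothetical trial",
    item_scores={"randomisation": 20, "blinding": 20, "follow up": 15, "outcomes": 5},
    fatal_flaws=["non-responders in the placebo group were given active treatment"],
)

if __name__ == "__main__":
    print(checklist_total(example))       # additive score alone looks acceptable
    print(eligible_for_pooling(example))  # the recorded flaw vetoes pooling
```

The design choice here is that a fatal flaw acts as a veto rather than as a deduction of points, so no accumulation of marks on other criteria can offset it.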

That such papers came to be included suggests that problems exist with systems for scoring the quality of the methods used in trials. Application of the score depends on identifying features of the design and conduct of the trial from a checklist but apparently without the substance of the trial being scrutinised. Numbers are bewitching, and it is tempting to see those scores as objective even though they are the product of human judgment. Comparing the scores given by Nelemans and by Koes to the same papers is illuminating. Despite using the same scoring system, Nelemans et al and Koes et al arrived at different scores for the same papers. They came close to agreement (within 10 points) in only four out of seven papers (table 2).
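The comparison of two reviewers' scores for the same papers amounts to a simple count, sketched below. The trial labels and scores are invented placeholders and do not reproduce table 2; treating "within 10 points" as an inclusive difference of at most 10 is also an assumption.

```python
def count_close_agreement(scores_a, scores_b, tolerance=10):
    """Return (papers scored within `tolerance` points, papers scored by both)."""
    shared = scores_a.keys() & scores_b.keys()
    close = sum(1 for trial in shared
                if abs(scores_a[trial] - scores_b[trial]) <= tolerance)
    return close, len(shared)


# Hypothetical scores on a 0-100 scale from two review teams.
reviewer_a = {"trial_A": 72, "trial_B": 55, "trial_C": 40, "trial_D": 63, "trial_E": 48}
reviewer_b = {"trial_A": 67, "trial_B": 38, "trial_C": 45, "trial_D": 41, "trial_E": 50}

if __name__ == "__main__":
    close, total = count_close_agreement(reviewer_a, reviewer_b)
    print(f"close agreement in {close} of {total} papers")
```

A count of this kind says nothing about which reviewer is right; it only shows how far two applications of the same checklist can diverge.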

Box 3 : Outcome assessments

Inadequate outcome measures

Several validated tools for assessing outcome for musculoskeletal and back pain research are available, measuring pain, disability, or both.27 Some of the early primary studies used idiosyncratic tools that fell short of the standards we now expect of modern research. There are two consequences for modern reviews: the results of the older trials are less reliable, and their format means they are not comparable with modern studies. The trials by Beliveau et al (1971)28 and by Snoek et al (1977)29 (box 3) used idiosyncratic outcome assessments but were included in the reviews by Watts and by Koes. Both Nelemans and Watts included Beliveau (and Cuckler26) in their pooling, which casts doubt on their results. As Messerli said in another context: “A meta-analysis is like a Mediterranean bouillabaisse—in concert, all ingredients will enhance its delightful flavour but, no matter how much fresh fish is added, one rotten fish will make it stink.”30


Table 2 : Validity scores (on scale of 0-100, following ter Riet et al16) awarded by Nelemans et al and Koes et al for included trials

That such papers were included shows that little weight is given to the measurement of outcomes, something in which clinicians are especially interested; the system used by Nelemans and by Koes et al allots only five out of 100 marks to assessments of outcome.


Does this mean that no conclusions can be drawn from the original randomised controlled trials? Certainly not. Analysis shows that most trials in this field were conducted at a time when trial methodology was less rigorous than it is now. The poor quality of some trials means that we must disregard their findings, or at least resist the temptation to pool them in a meta-analysis. One trial stands out: the trial by Carette et al31 was, at the time of the Nelemans review, the most recent, largest, and most rigorous. Nelemans awarded it a quality score of 76%. This trial was the best evidence available at the time, and therefore we should use its results to inform our decisions. To pool it with others of inferior quality is to accept uncritically that a meta-analysis must be better than a single trial. A large, rigorous trial provides better evidence than a non-credible meta-analysis.

Smith et al32 drew a distinction between the quality and the validity of randomised controlled trials. Quality relates to the conduct of the trial; the scoring systems mentioned above are among several that aim to measure quality. Validity relates to the ability of the trial to answer the question. We can draw a similar distinction in systematic reviews. The quality of the three systematic reviews is high, but their validity is compromised because they overlooked important details in the trials themselves. The fact that these oversights occurred in not just one but all three reviews of the same topic suggests that the problem may be general rather than isolated. Clinicians were involved in all three reviews, so the oversights did not arise from a lack of involvement by clinicians. Perhaps the cause was the type of involvement.

This analysis suggests that reading a paper from a clinician's viewpoint is different from reading a paper from the viewpoint of a reviewer, who has a duty to apply a set of criteria from a checklist. Clinicians, whose usefulness up to now has been seen as “content experts” in systematic review teams, may be able to contribute to the future evolution of systematic reviews by exploring these different viewpoints.


  • Funding KH holds a primary care enterprise award from the research and development division of the Eastern regional office of the NHS Executive and has been awarded a grant from the Claire Wand Fund.

  • Competing interests None declared.

