Assessing the quality of controlled clinical trials
BMJ 2001; 323 doi: https://doi.org/10.1136/bmj.323.7303.42 (Published 07 July 2001) Cite this as: BMJ 2001;323:42
All rapid responses
In a recent e-letter to the BMJ pertaining to a trial on electronic
fetal monitoring, we focused on how dramatically intra- and
inter-observer variation can affect the efficacy and the rate of
unnecessary interventions of a clinical procedure, and provided a web-site
with examples for discussion 1. For example, we demonstrated that for a
proportion of observer agreement of 0.30 and 0.71, for intervention and no
action, respectively, and an assumed efficacy and rate of unnecessary
interventions of 50% and 17%, respectively, these two quantities may vary
from 0% to 100% and from 0% to 33%, respectively, leading to discrepancies
in relative risks and odds ratios between studies and to poor average
results in large multi-observer studies and meta-analyses 1. Systematic
reviews including large studies and/or small studies with many observers
will tend to be biased towards average results, whereas systematic reviews
including a large study with a single observer will tend to be biased
towards the high, average or low results of that particular study, making
generalization inadequate and potentially dangerous.
We agree that if an "agreement study shows poor inter-observer agreement
for a new method, the technology must either be improved or abandoned" 3,
and should not be assessed in controlled clinical trials. However, where
agreement is not poor, shouldn't the largely random effect arising from
intra-observer variation and the random and systematic effect arising from
inter-observer variation 2 be obligatorily discussed in systematic reviews
of controlled
1- Bernardes J, Costa-Pereira A. How should we interpret RCTs based on
2- Grant A. Principles for clinical evaluation of methods of perinatal
monitoring. J Perinat Med 1984;12:227-31.
3- Grant JM. The fetal heart rate is normal, isn't it? Observer agreement
of categorical assessments. Lancet 1991;337:215-218.
Competing interests: Bernardes and Costa-Pereira are involved in the
development and validation of reproducible computerized diagnostic tests
Competing interests: No competing interests
Klassen argues in favour of the quality scale developed by Jadad et
al. This scale focuses exclusively on three dimensions of internal
validity: randomisation, blinding and withdrawals, but gives more weight
to the quality of reporting than to actual methodological quality. A
statement on patient attrition, for example, will earn the point allocated
to this domain, independently of how many patients were excluded or
whether or not the data were analysed according to the intention to treat
principle. The scale addresses the generation of allocation sequences, a
domain not consistently related to bias, but it does not assess
concealment of allocation, which has clearly been shown to be associated
with exaggerated treatment effects. Therefore, the use of an open
random number table is considered equivalent to concealed randomisation
using a telephone or computer system.
A summary score of 3 or more points is generally considered to
indicate ‘high quality’. In our experience, the majority of trials
reach this threshold because the words ‘randomised’ and ‘double-blind’
appear in the report and there is either a description of the method used
to ensure double-blinding or of the patients who dropped out (see for
example the recent review by Tramèr et al). However, there is no
evidence that concealment of allocation was adequate or that the analysis
was according to intention to treat for most of these ‘high quality’
trials. Indeed, even a quasi-randomised trial which used alternation to
allocate patients and excluded a large proportion of participants from the
analysis will earn 3 points if it was reported as double-blind, the
experimental and control treatments were described as indistinguishable,
and the reasons for dropping out of the trial were tabulated.
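The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `TrialReport`, its field names, and the helper functions are hypothetical, and the scoring rules (one point each for a report described as randomised, described as double-blind, and describing withdrawals, with one point added or deducted according to whether the described method is appropriate) follow our reading of the scale rather than an official implementation.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class TrialReport:
    """Hypothetical summary of what a trial report states (names are assumptions)."""
    described_as_randomised: bool
    # None = method not described, True = appropriate, False = inappropriate
    sequence_method_appropriate: Optional[bool]
    described_as_double_blind: bool
    blinding_method_appropriate: Optional[bool]
    withdrawals_described: bool
    # The two fields below are recorded but never used in the score.
    allocation_concealed: bool = False
    intention_to_treat: bool = False


def _item_points(mentioned: bool, method_appropriate: Optional[bool]) -> int:
    """Points for one two-part item: 1 for the mention, +1 or -1 for the method."""
    if not mentioned:
        return 0
    if method_appropriate is True:
        return 2
    if method_appropriate is False:
        return 0
    return 1


def jadad_score(report: TrialReport) -> int:
    """Summary score from 0 to 5 under the assumed scoring rules."""
    return (
        _item_points(report.described_as_randomised,
                     report.sequence_method_appropriate)
        + _item_points(report.described_as_double_blind,
                       report.blinding_method_appropriate)
        + (1 if report.withdrawals_described else 0)
    )


def is_high_quality(report: TrialReport) -> bool:
    """Threshold of 3 or more points, as commonly applied."""
    return jadad_score(report) >= 3


# Trial using alternation, reported as double-blind with indistinguishable
# treatments, and with reasons for dropping out tabulated.
alternation_trial = TrialReport(
    described_as_randomised=True,
    sequence_method_appropriate=False,   # alternation is not an appropriate method
    described_as_double_blind=True,
    blinding_method_appropriate=True,
    withdrawals_described=True,
)

# Same report, but with concealed allocation and intention to treat analysis.
better_trial = replace(alternation_trial,
                       allocation_concealed=True,
                       intention_to_treat=True)

print(jadad_score(alternation_trial), is_high_quality(alternation_trial))
print(jadad_score(better_trial) == jadad_score(alternation_trial))
```

Under these assumptions the alternation trial reaches the ‘high quality’ threshold, and adding concealed allocation and an intention to treat analysis leaves the score unchanged, because neither item enters the sum.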
The scale makes no allowance for the fact that some trials cannot be
blinded. For example, a large multicentre trial on a surgical intervention
which included all patients in the analysis irrespective of whether they
dropped out of the trial, and used central randomisation to allocate
patients would earn only 2 points (‘low quality’) because double-blinding
was impossible and the reasons for dropping out of the trial were not
As Klassen notes, the scale of Jadad et al was evaluated for
discrimination, reliability, and construct validity and is the only
published instrument that has been constructed according to psychometric
principles. While such careful development is commendable, it does
not follow that this particular scale is therefore necessarily better than
others. Quality scales allocate fixed weights to a standard set of items,
ignoring the fact that the importance of individual items and, possibly,
the direction of potential biases associated with these items may vary
according to the context, and that there are often specific issues that
apply in particular clinical situations. The mechanistic application
of scales may therefore dilute or entirely miss potential associations.
Not surprisingly, there is little evidence that quality scales can
detect bias. By contrast, our review demonstrates that there is strong
evidence that individual aspects such as inadequate or unclear concealment
of allocation, and lack of double blinding are often associated with bias.
Our dislike of scales is based on these arguments and does not represent
‘bias’. Our particular concerns about the scale of Jadad et al are
that it omits essential elements of clinical research, concealment of
allocation and intention to treat analysis, and gives too much weight to
the quality of reporting as opposed to the quality of methods. Its wide
use within systematic reviews is regrettable.
Peter Jüni, Douglas G. Altman, Matthias Egger
1. Jadad AR, Moore RA, Carroll D, Jenkinson C, Reynolds DJM, Gavaghan
DJ et al. Assessing the quality of reports of randomized clinical trials:
is blinding necessary? Control Clin Trials 1996;17:1-12.
2. Jüni P, Altman DG, Egger M. Systematic reviews in health care:
assessing the quality of controlled clinical trials. BMJ 2001;323:42-6.
3. Moher D, Pham B, Jones A, Cook DJ, Jadad AR, Moher M et al. Does
quality of reports of randomised trials affect estimates of intervention
efficacy reported in meta-analyses? Lancet 1998;352:609-13.
4. Tramèr MR, Carroll D, Campbell FA, Reynolds DJ, Moore RA, McQuay
HJ. Cannabinoids for control of chemotherapy induced nausea and vomiting:
quantitative systematic review. BMJ 2001;323:16-21.
5. Jüni P, Witschi A, Bloch R, Egger M. The hazards of scoring the
quality of clinical trials for meta-analysis. JAMA 1999;282:1054-60.
6. Greenland S. Quality scores are useless and potentially
misleading. Am J Epidemiol 1994;140:300-2.
Competing interests: No competing interests
This recent article by Jüni and colleagues nicely summarizes many
aspects of assessing the quality of randomized controlled trials (RCTs)
included in meta-analyses. However, one aspect of this article is
concerning, as it appears to represent the biases of the authors against
the use of scales for assessing the quality of RCTs.
The argument against the use of scales is that the results of
different scales vary when assessing the same RCTs. This should not be a
surprising finding given the rather arbitrary way many of these scales
were developed. The authors should have focused more on how the scales
were developed, i.e. were sound measurement principles used in their
development? One scale that has been developed according to sound
measurement principles is the one by Jadad and colleagues <1>. It
was developed to detect bias in RCTs and it appears to be quite consistent
in this detection. "Low" (≤ 2 points) quality trials have estimates that
are on average 34% higher than those of "high" (> 2 points) quality
The other criticism that is offered is the lack of transparency as to
which component of the scale is contributing to bias. The Jadad scale,
because of its simplicity, can easily be analyzed by its included
components of randomization, double-blinding, and withdrawals and dropouts.
In summary, it is disturbing that there is a continuing bias against
the use of scales rather than a discussion of the evidence as it relates
to the performance of these scales in the detection of bias. I firmly
believe the discussion should move away from a component versus scale
discussion to an evidence-based framework as to what the evidence is
for any tool that is used for bias detection in RCTs.
1. Jadad AR. Moore RA. Carroll D. Jenkinson C. Reynolds DJM. Gavaghan
DJ. McQuay HJ. Assessing the Quality of Reports of Randomized Clinical
Trials: Is Blinding Necessary? Controlled Clinical Trials. 1996; 17:1-12.
2. Moher D, Pham B, Jones A, Cook DJ, Jadad AR, Moher M, Tugwell P,
Klassen TP. Does the quality of reports of randomized trials affect
estimates of intervention efficacy reported in meta-analyses? Lancet.
Competing interests: No competing interests