Statistics notes: Measurement error
BMJ 1996; 312 doi: https://doi.org/10.1136/bmj.312.7047.1654 (Published 29 June 1996) Cite this as: BMJ 1996;312:1654
All rapid responses
Though the previous paper did a reasonable job of describing "common
within-subject" variation (akin to a coefficient of variation), it left
the discussion incomplete. Accordingly, this response attempts to supply
the missing information in an easy-to-understand format, without
simulations, tables, or complex mathematical proofs.
First, the authors failed to mention that the variance estimated for
1 child over its repeated tests follows (after scaling by the true variance)
a chi-squared distribution with 4 - 1 = 3 degrees of freedom (dof).
Second, the authors failed to mention that when the per-child
variances are averaged to obtain 460.52, that pooled estimate follows
(after scaling) a chi-squared distribution with 20*(4-1) = 60 dof. The
value 460.52 should be referred to this distribution and NOT to the Z
distribution for normally distributed data. The same mistake carries over
to the case of 2 repeats per child, since that average variance would be
compared against the chi-squared distribution with 20*(2-1) = 20 dof.
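The point above can be checked with a short simulation sketch (only the 20 x 4 design comes from the paper; the true SD of 20.0 and the number of replications are assumed for illustration). If the scaled average of per-child variances really follows a chi-squared distribution with 60 dof, its mean should be close to 60 and its variance close to 120:

```python
import numpy as np

rng = np.random.default_rng(0)
n_children, n_repeats, sigma = 20, 4, 20.0
dof = n_children * (n_repeats - 1)   # 20 * (4 - 1) = 60

# Simulate many replications of the 20 x 4 design and check that the
# scaled average of per-child variances behaves like chi-squared(60),
# i.e. mean near 60 and variance near 120 -- not like a Z statistic.
stats = []
for _ in range(20000):
    data = rng.normal(0.0, sigma, size=(n_children, n_repeats))
    s2_pooled = data.var(axis=1, ddof=1).mean()  # average per-child variance
    stats.append(dof * s2_pooled / sigma**2)
stats = np.asarray(stats)

print(round(stats.mean(), 1), round(stats.var(), 0))
```

This is exact under the usual normality assumption, since each child contributes an independent chi-squared(3) component and the twenty components add to chi-squared(60).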
Third, the above paper and others by the same authors fail to discuss
the obvious need to compare average variances for 2 different devices over
a range of correlated samples. For instance, in the previous paper, 20
children were measured for PEFR 4 times each on only 1 device; but, what
if the same testing was done on a second device? Several references below
discuss the device comparison topic whereby a perfect, low error master
does not exist. If such a master existed, the two test devices could be
checked against that master using standard calibration theory for
linearity, repeatability, hysteresis and more advanced normal bivariate
error methods. When the master does not exist, however, the device or
method comparison must proceed without it. Most references only discuss
regression of A vs B for slope, intercept, R^2 and sdiff; a paired t test
of A and B to test the mean difference; or a Bland-Altman plot, which only
subjectively shows (rather than tests) bias and hints at variation.
When performing agreement or comparison tests, the goal is to
determine interchangeability. In order to do that, the 2 devices or
methods should be compared for mean performance, bias AND variation. The
Bland-Altman plot does a good job of showing constant (DC) or changing
bias, and its Y axis (A - B) reflects dispersion or variance; but it fails
to show how much of that variance comes from device A and how much comes
from device B. Also, when more than one repeat exists, the averaging
required for the plots minimizes the very variation one is trying to
measure.
Note: child 1 measured 190, 220, and 200 in the previous paper. If a
second flow device measured 103.33, 203.33, and 303.33, for an average of
203.33, then although the average bias is zero, the extreme variation
would indicate a lack of interchangeability and hence a lack of
"agreement" (projecting this trend across all children, of course).
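The note above can be verified in a few lines (device A's readings come from the previous paper; device B's are the hypothetical ones stated above):

```python
import statistics as st

# Child 1's readings from the previous paper (device A) and the
# hypothetical second device (device B) described above.
a = [190, 220, 200]
b = [103.33, 203.33, 303.33]

bias = st.mean(a) - st.mean(b)          # average difference is ~0
sd_a, sd_b = st.stdev(a), st.stdev(b)   # ~15.3 vs ~100.0
print(round(bias, 2), round(sd_a, 1), round(sd_b, 1))
```

The means agree almost exactly, yet device B's within-subject standard deviation is more than six times device A's, which is precisely the disagreement a mean-only comparison would miss.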
Several methods exist for making statistical tests or confidence
intervals comparing two repeatability (within-treatment) variances for
correlated data when only 1 repeat exists, using t tests of A - B plotted
against A + B, such as Maloney & Rastogi (1970). Many other related tests
exist, as referenced in E.J.G. Pitman (1939); F.E. Grubbs (1942, 1973,
1982); J.L. Jaech (1973, 1979, 1981, 1985); G.W. Snedecor & Cochran
(1967, 1980, 1989); J.H. Hahn & W. Nelson (1970); and P. Armitage & G.
Berry (1971, 1994).
It is surprising that none of these references appear in any of the
authors' BMJ papers on the topic of comparison or agreement.
A simpler approach (not well published, for some reason) is simply to
take the ratio of the two average variances and compare it to the F
distribution. This method can easily be amended if each child's "true"
value is assumed to have variation of its own.
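A sketch of this F-ratio approach follows; all data here are simulated (only the 20 x 4 design matches the paper, and the two within-subject SDs are assumed purely for illustration), and a simulation stands in for an F-table lookup:

```python
import numpy as np

rng = np.random.default_rng(1)
n_children, n_repeats = 20, 4
dof = n_children * (n_repeats - 1)   # 60 for each device

def pooled_within_var(data):
    # Average of the per-child variances; the child means drop out, so
    # the pooled estimates for two devices act like independent samples.
    return data.var(axis=1, ddof=1).mean()

# Hypothetical within-subject SDs (assumed): device A is noisier than B.
a = rng.normal(0, 30.0, size=(n_children, n_repeats))
b = rng.normal(0, 15.0, size=(n_children, n_repeats))
f_ratio = pooled_within_var(a) / pooled_within_var(b)

# Under equal true variances the ratio follows F(60, 60); approximate
# its upper 2.5% point by simulation rather than a table lookup.
sims = [pooled_within_var(rng.normal(0, 1, (n_children, n_repeats))) /
        pooled_within_var(rng.normal(0, 1, (n_children, n_repeats)))
        for _ in range(20000)]
crit = float(np.quantile(sims, 0.975))
print(round(f_ratio, 2), round(crit, 2), f_ratio > crit)
```

With a fourfold true variance ratio, the observed ratio comfortably exceeds the F(60, 60) critical value, so the two devices would be declared unequal in repeatability.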
It should be noted that the pooled 'within-treatment' variation has the
child means removed from consideration; hence, the two variances can be
treated as coming from independent samples. This can be proven
mathematically, as well as demonstrated by simulation in Excel, Visual
Basic, or MathCad.
This author commends the authors of the previous paper for their
attempt to create a simple, subjective plot to replace an over-emphasis
on regression R^2 values and slopes; but in the same spirit, it is
important to point out the need to replace subjective methods with
quantitative ones where possible, and not to forget to compare replicate
variances, which must be done to infer interchangeability.
Competing interests: No competing interests
The methods discussed in the paper, I feel, need some clarification as
to the practicalities of implementing them with respect to quoting
measurement error. Firstly, there are some logic errors that need to be
addressed.
The authors state that, in the case of the children's peak flow
measurements, there is a "true" average that can be discovered by
repeated measurements, all of which will vary around the true average as a
result of measurement error. This is only true if a device can
overestimate a "true" value as readily as it can underestimate it. Height
measures, for instance, can only underestimate; mercury thermometers can
only underestimate. By underestimate I mean in reference to their in-vitro
calibration error. Peak flow meters are an example of a simple device
that, in the main, underestimates. Error can be intra-subject,
procedural, or due to the device. In peak flow measurements the amount of
error attributable to the device is very small compared with these other
types. Hence it cannot be said that error varies about the mean; in the
case of peak flow it tends to vary beneath the maximum.
Such a case invalidates the effort to quote standard deviation as a
measure of error.
Even if such a measurement can be quoted, in the case of a device that can
vary about the true mean (as would be the case with, for instance, an
oscillometric, algorithm-driven blood pressure monitor), the method
proposed in this study yields only the "population" within-subject
standard deviation, as it utilises the data from many subjects. In essence
it enables a single peak flow measurement to be quoted with an error
relative to the expected peak flow measurement for the population studied,
not the error in the individual subject; it is quite clear from the data
that some subjects have a different natural "measurement error" (their
individual standard deviations) than the population averaged out.
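This distinction is easy to demonstrate with a small sketch (the subject count, repeat count, and per-subject SDs below are assumed purely for illustration): pooling across subjects yields one "population" within-subject SD that can sit far from any individual subject's own SD.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed for illustration: four subjects with different true
# within-subject SDs, ten repeated measurements each.
true_sds = np.array([5.0, 10.0, 20.0, 40.0])
data = rng.normal(400.0, true_sds[:, None], size=(4, 10))

per_subject_sd = data.std(axis=1, ddof=1)
pooled_sd = float(np.sqrt(data.var(axis=1, ddof=1).mean()))
print(np.round(per_subject_sd, 1), round(pooled_sd, 1))
```

The pooled SD necessarily falls between the smallest and largest individual SDs, so quoting it as "the" measurement error overstates the error for the steadiest subjects and understates it for the most variable ones.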
Competing interests: No competing interests