Are confidence intervals better termed “uncertainty intervals”?
BMJ 2019; 366 doi: https://doi.org/10.1136/bmj.l5381 (Published 10 September 2019) Cite this as: BMJ 2019;366:l5381
All rapid responses
I support Andrew Gelman in having no confidence in ‘confidence intervals’; indeed, in 1988 I gave a talk to PSI (Statisticians in the Pharmaceutical Industry) entitled ‘Confidence in Intervals is Misplaced’. In 1991 (1) I questioned whether “…persuading (clinicians) to use CI’s rather than p-values is to replace the unthinking use of one technique with that of another.” I also argued that the confidence we could have in ‘confidence intervals’ came from the process of calculating the intervals. To Gelman’s four concerns I would add one more.
As Quenouille pointed out over 60 years ago, “… confidence intervals are concerned with the average in repeated sampling and should not be used as limits for estimates.”(2) Quenouille was concerned with interval estimates for ‘restricted parameters’, but the argument is more general. A well-known instance is that of two random samples from a uniform distribution on (U-½, U+½): the interval between them is a 50% confidence interval for U, despite the fact that if the difference between the maximum and the minimum of the two values is greater than ½ we know with certainty that the interval contains U. Similar problems arise with samples larger than 2.
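Quenouille’s point can be checked by simulation. The sketch below (plain Python; the true centre U = 3.7, the seed, and the number of replications are all arbitrary illustrative choices) takes the interval between two uniform draws as the 50% confidence interval, and confirms that whenever its width exceeds ½ the interval is certain to contain U:

```python
import random

random.seed(1)

N = 200_000
U = 3.7                         # arbitrary true centre (illustrative)
covered = 0
wide = wide_and_covered = 0

for _ in range(N):
    x1 = U + random.uniform(-0.5, 0.5)
    x2 = U + random.uniform(-0.5, 0.5)
    lo, hi = min(x1, x2), max(x1, x2)
    inside = lo <= U <= hi
    covered += inside
    if hi - lo > 0.5:           # range exceeds ½: both points straddle U
        wide += 1
        wide_and_covered += inside

print(covered / N)              # close to 0.50: the nominal coverage
print(wide_and_covered == wide) # conditional coverage is 100%, not 50%
```

The unconditional coverage is ½ (the interval misses U only when both draws fall on the same side of it), yet conditioning on an observable feature of the data changes the coverage to certainty, which is precisely the conflict Quenouille identified.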
A second example arises in the construction of a confidence interval for the mean of a normal distribution with unknown variance, based on a single observation.(3,4) To illustrate, suppose you are a member of a group of n individuals whose weights are Gaussian with average W*. If your own weight is W, then W ± 4.84W is a 90% confidence interval for W*, irrespective of the value of the population standard deviation.
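A short simulation illustrates the single-observation interval X ± 4.84|X|. The setting μ = σ = 1 is an assumption made here for illustration, on the understanding that the cited result’s coverage is lowest when the mean and standard deviation are comparable; other ratios give coverage above the nominal 90%:

```python
import random

random.seed(2)

N = 200_000
c = 4.84                        # multiplier from the cited single-observation result
mu, sigma = 1.0, 1.0            # assumed ratio near the worst case (illustrative)

hits = 0
for _ in range(N):
    x = random.gauss(mu, sigma)
    # interval x ± c|x|: its width depends only on the single observation
    if x - c * abs(x) <= mu <= x + c * abs(x):
        hits += 1

print(hits / N)                 # roughly 0.90 at this mu/sigma ratio
```

That a single observation, with no estimate of the variance at all, yields a fixed-level interval is exactly the kind of result that makes the word “confidence” carry more weight than the construction deserves.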
In my view we would be no better off moving from ‘confidence intervals’ to ‘uncertainty intervals’; in fact we might be worse off, because the latter term could include not only ‘confidence intervals’ but also ‘posterior’ or ‘credible intervals’, ‘fiducial intervals’, ‘predictive intervals’ and ‘tolerance intervals’.
Turning to ‘compatibility intervals’, as proposed by Sander Greenland, I would first point to a further extract from Quenouille.(2) In the sentence following the above quote, Quenouille opined: “… (confidence intervals) indicate the hypotheses for which the observations are ‘typical’ rather than the hypotheses most compatible with the observations.” This may be a subtle distinction, but it goes back to a “confusion in interpretation”.
I also wonder whether there is a connection between ‘compatibility intervals’ and ‘consonance intervals’.(5) In the author’s words, “…’consonance intervals’ didn’t exactly sweep the field”(6); nonetheless it is worth considering what they are. As far as I can tell, both are meant to provide a measure of the support the data give to a particular model, which again sits more comfortably with a Bayesian, or fiducial, interpretation than with a frequentist ‘confidence interval’.
My main point is that renaming the ‘confidence interval’, whether as an ‘uncertainty interval’ or a ‘compatibility interval’, will do little to improve its use. Whilst I have no particular love for the p-value, I tend to agree that if we are to use a given statistical approach, researchers need to be taught correctly what it is and how to use it.(7) This goes back to my point about replacing one unthinking approach with another. Renaming a concept does not in and of itself lead to improved use. Education may.
1. Grieve AP. Letter to the editor (confidence intervals). Royal Statistical Society News and Notes 1992;18(7):3-4.
2. Quenouille MH. Fundamentals of Statistical Reasoning. Griffin, 1958.
3. Abbott JH, Rosenblatt JI. Two-stage estimation with one observation on the first stage. Annals of the Institute of Statistical Mathematics 1963;14:229-235.
4. Machol RE, Rosenblatt J. Confidence interval based on single observation. Proceedings of the Institute of Electrical and Electronics Engineers 1966;54:1087-1088.
5. Kempthorne O, Folks JL. Probability, Statistics and Data Analysis. Iowa State University Press, 1971.
6. Folks JL. A conversation with Oscar Kempthorne. Statistical Science 1995;10(4):321-36.
7. Lakens D. The practical alternative to the p-value is the correctly used p-value. PsyArXiv Preprints 2019. doi:10.31234/osf.io/shm8v.
Competing interests: No competing interests
Thank you both for explaining the issues. Might it also help if the actual (often bell-shaped) likelihood distribution were always displayed diagrammatically? The baseline could be marked with the 95% and 99% ‘limits’ (labelled ‘uncertainty’ or ‘compatibility’). The position of the null hypothesis could also be marked on the baseline. If this first null hypothesis is placed −z SEMs from the observed mean x̅, a mirror null hypothesis can be placed +z SEMs from the observed mean x̅. This pair of null hypotheses would form the basis of a two-sided P value.
The bell’s height at each point represents the likelihood of observing the study mean conditional on each possible hypothetical value of the ‘true’ mean, IF (1) the study methods had been carried out impeccably as described, AND (2) the distribution model (e.g. Gaussian) was appropriate, AND (3) there was no prior information available about the distribution of likelihoods. How the diagram is interpreted would depend on the reader’s approach (e.g. from a frequentist point of view). However, based on the above three assumptions, the provisional Bayesian probability that the true result is less extreme than the null hypothesis in the direction of the observed mean would be equal to the one-sided P value [1].
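Under those three assumptions the claimed equality can be checked numerically: with a flat prior the posterior for the true mean is normal about the observed mean with the SEM as its standard deviation, and its tail beyond the null coincides with the one-sided P value. The numbers below (observed mean 2.0, SEM 1.0, null at 0) are purely illustrative:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

xbar, sem, null = 2.0, 1.0, 0.0   # illustrative study mean, SEM, and null value

# Frequentist one-sided P value: P(observing a mean >= xbar | true mean = null)
p_one_sided = 1.0 - phi((xbar - null) / sem)

# Flat-prior posterior tail: P(true mean <= null | data), posterior N(xbar, sem^2)
posterior_tail = phi((null - xbar) / sem)

print(round(p_one_sided, 5), round(posterior_tail, 5))  # the two agree
```

The agreement holds for any choice of observed mean, SEM and null under the stated assumptions, because both quantities are the same normal tail area read from opposite directions.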
If other prior information were available, then a Bayesian calculation would combine the prior distribution with the observed likelihood distribution. The mean of the updated combined distribution would be shifted up or down, and its ‘spread’ or variance would be reduced.
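A minimal sketch of that combination, assuming the standard normal-normal conjugate update with illustrative prior and data values: precisions (inverse variances) add, so the combined variance is smaller than either component’s, and the combined mean lies between the prior mean and the observed mean.

```python
prior_mean, prior_var = 0.0, 4.0   # illustrative prior for the true mean
xbar, sem = 2.0, 1.0               # illustrative observed mean and its SEM

# Precisions add; the posterior mean is a precision-weighted average.
post_var = 1.0 / (1.0 / prior_var + 1.0 / sem**2)
post_mean = post_var * (prior_mean / prior_var + xbar / sem**2)

print(post_mean, post_var)   # mean pulled between prior and data; variance reduced
```

With these numbers the combined mean is 1.6 (shifted from the observed 2.0 towards the prior 0.0) and the combined variance is 0.8, smaller than both the prior variance and the squared SEM.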
Reference
1. Llewelyn H (2019) Replacing P-values with frequentist posterior probabilities of replication—When possible parameter values must have uniform marginal prior probabilities. PLoS ONE 14(2): e0212302. https://doi.org/10.1371/journal.pone.0212302
Competing interests: No competing interests
P-values, rather than confidence intervals, should be criticized and downgraded
This head-to-head debate falls down on various points, above all because, in criticizing so heavily the name (and the concept?) of the confidence interval, room is unfortunately left for the p-value to make a most unwelcome return.
A very low p-value may not mean a more effective treatment; instead, it can simply reflect a larger sample size. The bigger the sample, the more precise the effect will seem, but the actual difference, in medical rather than statistical terms, could be useless. P-values without confidence intervals verge, in my opinion, on the meaningless.
Confidence intervals can also show the power of a trial: the smaller (and hence less precise) the trial, the wider the confidence interval.
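Both points, the P value shrinking with sample size for a fixed and clinically trivial difference, and the confidence interval widening as the trial gets smaller, can be illustrated with a simple two-sided z-test sketch (all numbers illustrative):

```python
from math import erf, sqrt

def two_sided_p(diff, sd, n):
    """Two-sided z-test P value for a fixed mean difference."""
    z = diff / (sd / sqrt(n))
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0))))

diff, sd = 0.1, 1.0   # a clinically trivial difference (illustrative)
for n in (25, 100, 400, 10_000):
    half_width = 1.96 * sd / sqrt(n)   # 95% CI half-width narrows with n
    print(n, round(two_sided_p(diff, sd, n), 4), round(half_width, 3))
```

The same 0.1-unit difference is nowhere near “significant” at n = 25 yet overwhelmingly so at n = 10 000, while the interval makes it plain throughout that the effect itself has barely changed.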
P-values are about Type I error only, but confidence intervals embrace Type I and Type II errors.
P-values, by obscuring the role of sample size, are a dead end. They came about early in the 20th century through Fisher; it was later, in the 1930s, that Neyman and Pearson, using ideas such as maximum likelihood and n-dimensional sample space, developed confidence intervals.
P-values are solely parametric, whereas confidence intervals, as the various possible names for them imply, can be parametric, Bayesian, computer-generated, whatever.
We might also debate the arbitrary character of certain statistical conventions: for example, 5% for the Type I error p-value threshold, 80% for statistical power against Type II error, and so on.
So let's bury the p-value, and praise the confidence interval, whatever we may call it!
Competing interests: No competing interests