# Is a subgroup effect believable? Updating criteria to evaluate the credibility of subgroup analyses

BMJ 2010; 340 doi: https://doi.org/10.1136/bmj.c117 (Published 30 March 2010) Cite this as: BMJ 2010;340:c117

## All rapid responses

*The BMJ* reserves the right to remove responses which are being wilfully misrepresented as published articles.

I am troubled by the suggestion by Sun et al that a subgroup effect consistent with a pre-specified direction will increase the credibility of a subgroup analysis, and that getting the direction wrong weakens the case for a real underlying subgroup effect.[1]

Freedman has suggested that the concept of clinical equipoise in the conduct of a study requires that there is uncertainty, not necessarily on the part of the individual investigator, but within the clinical community.[2]

Greenland argues that the design and analysis of studies may be biased towards results desired by an investigator. Investigator bias may arise from many sources and can be a major source of uncertainty about study effects.[3]

While it may be appropriate for an investigator to reveal a bias in one direction or another, it should not be necessary to accurately predict what is going to be discovered in any subgroup analysis. If a bias is revealed or a prediction made, it is important that neither influences the conduct of the study, or the validity or credibility of the results.

1. Sun X, Briel M, Walter SD, Guyatt GH. Is a subgroup effect believable? Updating criteria to evaluate the credibility of subgroup analyses. BMJ 2010;340:c117.

2. Freedman B. Equipoise and the ethics of clinical research. N Engl J Med 1987;317:141-145.

3. Greenland S. Accounting for uncertainty about investigator bias: disclosure is informative: How could disclosure of interests work better in medicine, epidemiology and public health? J Epidemiol Community Health 2009;63:593-598.

Competing interests:

None declared


A good example of an unbelievable overall result but a believable subgroup result is given in the latest Royal College of General Practitioners oral contraception mortality study.[1,2]

Oral contraceptives are mostly used by young women. The subgroup analysis for deaths under age 30 found three times more deaths in ever pill takers than in controls.

The widely publicised overall result of a mortality reduction 40 years after the start of the study has to be wrong. Apart from not recruiting only new takers, large losses, and switching controls to be takers, the main fault in the study is the failure to record HRT use in the last 10 years of the study, when 3 out of 4 deaths were recorded. Both combined HRT and combined oral contraceptives contain progestogens and oestrogens, which increase the main causes of death from cancers, vascular diseases and mental illness.

1 Sun X, Briel M, Walter SD, Guyatt GH. Is a subgroup effect believable? Updating criteria to evaluate the credibility of subgroup analyses. BMJ 2010;340:c117.

2 Hannaford PC, Iversen L, Macfarlane TV, et al. Mortality among contraceptive pill users: cohort evidence from Royal College of General Practitioners' Oral Contraception Study. BMJ 2010;340:c927.

Competing interests:

None declared


We read with great interest Sun et al’s paper about evaluating the credibility of subgroup effects. We agree with the third point, “is the significant subgroup effect independent”, although we would raise one concern regarding the authors’ final suggestion that, with a small sample size, it is acceptable to pre-specify a limited number of important interactions to be considered.

In analyses of non-randomized interventions it is important to adjust for all potential confounders, since otherwise the estimated intervention effect may be biased. In an RCT with randomized allocation to interventions there should be no confounders and it is not typically necessary to adjust for other factors (although doing so may increase precision in the estimated intervention effect). However, in an RCT, an analysis concerning differences in intervention effects in a subgroup should initially use a full model including all interactions between all possible confounders and that subgroup, because otherwise the interaction of interest may be confounded. This is well noted by the authors in their smoking and fracture-type example.
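How an unmodelled factor can produce a spurious subgroup interaction can be sketched with a few lines of arithmetic. The figures below are hypothetical and are not taken from Sun et al's example; the sketch assumes that the treatment effect depends only on fracture type, and that fracture type is unevenly distributed between smokers and non-smokers. The names `effect`, `fracture_mix` and the labels `type_a` and `type_b` are illustrative only.

```python
# Hypothetical absolute risk reductions in each (smoking status, fracture type) cell.
# Assumption: the effect depends on fracture type only, never on smoking status.
effect = {
    ("smoker", "type_a"): 0.10,
    ("smoker", "type_b"): 0.00,
    ("non_smoker", "type_a"): 0.10,
    ("non_smoker", "type_b"): 0.00,
}

# Assumption: fracture type is unevenly distributed across smoking status.
fracture_mix = {
    "smoker": {"type_a": 0.8, "type_b": 0.2},
    "non_smoker": {"type_a": 0.2, "type_b": 0.8},
}

def marginal_effect(smoking):
    """Average treatment effect in a smoking subgroup, ignoring fracture type."""
    return sum(share * effect[(smoking, fracture)]
               for fracture, share in fracture_mix[smoking].items())

# Apparent smoking-by-treatment interaction when fracture type is left out.
apparent_interaction = marginal_effect("smoker") - marginal_effect("non_smoker")

# The same contrast within each fracture type: zero by construction.
within_fracture_type = {
    fracture: effect[("smoker", fracture)] - effect[("non_smoker", fracture)]
    for fracture in ("type_a", "type_b")
}

print(f"apparent interaction: {apparent_interaction:.2f}")
print("within fracture type:", within_fracture_type)
```

Under these assumptions smokers appear to benefit more than non-smokers, yet the difference disappears within each fracture type, which is the sense in which the interaction of interest is confounded.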

The authors’ suggestion that it is acceptable to pre-specify a limited number of important interactions to be considered is perhaps analogous to, and as valid as, only including a pre-defined subset of possible confounders in an analysis of observational data. It would be better to create a full model and use a standard model selection process, such as change in estimate or model fit, to create a parsimonious model. If the sample size is too small to permit detailed consideration of all possible confounders to a subgroup effect, it may be more appropriate not to attempt to identify subgroup effects.

In our own work we have found a simple rule of thumb to be helpful in judging whether subgroup effects might be confounded. If the coefficient for the subgroup interaction of interest is very similar to the difference between the coefficients from subgroup-specific (i.e. stratified) analyses, then the model may be adequate and the subgroup difference inferred credible. Conversely, a discrepancy suggests that the model is mis-specified, that additional interaction terms are needed, and that the subgroup difference inferred may not be credible. Sun et al’s paper provides a very helpful way forward for assessing the credibility of subgroup analyses, which need not be confined to randomized controlled trials and could be further improved by being both more rigorous and more pragmatic.

Competing interests:

None declared


**03 May 2010**

Sun et al point up the distinction between the mathematical process by which statistically-significant interactions are identified and the interpretational process by which these interactions are placed in the context of our wider understanding. As Sun et al make clear, the seeming unambiguity implied by the mathematical process is not tenable: interpretation must take into account a number of additional, sometimes ill-defined, issues. Below, I suggest some supplementary comments to include with their admirable overview:

(a) Choice of significance level is a very real issue (section 3). The size of F in ANOVA, along with other statistical values such as t, is heavily dependent on the size of the sample: these values are effectively computed from standard errors rather than standard deviations, so the bigger the sample, the lower the P values will tend to be and the greater the number of significant Fs, everything else being equal. The choice of significance level should therefore also address this issue.

(b) The number of interaction Fs is dependent on the number of independent variables. Interactions can become increasingly difficult to interpret as the number of independent variables is increased. The interpretation of the single interaction emanating from two independent variables is not normally problematic. However, four independent variables yield eleven interactions, in addition to four main effects: interpretatively, a statistically-significant interaction concerning all four independent variables may be virtually impenetrable.

(c) The reference to untoward associations regarding independent variables, also in section 3, suggests that analysis of covariance (ANCOVA) may be a useful alternative technique in some circumstances, such as that described in the second paragraph. ANCOVA is capable of teasing out the contributions of two associated independent variables, along with assessing significant difference. Preferably, independent variables should be orthogonal, but real-world data may not conform to this ideal.
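Points (a) and (b) can be illustrated with a short numerical sketch. It assumes a two-group comparison with equal group sizes and a common standard deviation (for two groups, F is the square of t), and the mean difference of 0.2 and the factor labels A to D are hypothetical values chosen only for illustration.

```python
import math
from itertools import combinations

def two_sample_t(mean_diff, sd, n_per_group):
    """t statistic for two equal-sized groups sharing a common standard deviation."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    return mean_diff / standard_error

# (a) Same mean difference and SD, larger samples: t grows with the square root of n.
t_by_n = {n: two_sample_t(mean_diff=0.2, sd=1.0, n_per_group=n)
          for n in (10, 100, 1000)}

def interaction_terms(factors):
    """All interaction terms (order two and above) in a full factorial design."""
    return [combo
            for order in range(2, len(factors) + 1)
            for combo in combinations(factors, order)]

# (b) Four independent variables: 6 two-way, 4 three-way and 1 four-way interaction.
terms = interaction_terms(["A", "B", "C", "D"])

for n, t in t_by_n.items():
    print(f"n per group = {n:4d}: t = {t:.2f}")
print(f"{len(terms)} interactions:", [":".join(term) for term in terms])
```

With the effect size held fixed, a hundredfold increase in sample size multiplies t by ten, and the list of interaction terms for four variables has eleven entries, matching the count given in (b).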

Competing interests:

None declared


Interesting paper - thank you for attempting to clarify subgroup analyses and effect modification in general. I do feel, however, that some of your suggestions are not entirely convincing.

1) You note, correctly, that the presence of an effect modification depends on the model that is used for combining risks, i.e., additive or multiplicative. But then you affirm that the multiplicative model is the correct one, on the basis of an example that merely shows that an additive interaction may correspond to the absence of a multiplicative interaction. How does that make the multiplicative model better? In real life, results often fall in between models - e.g. the risk of disease may be 10% in female nonsmokers, 30% in female smokers, 20% in male nonsmokers, and 50% in male smokers. So the effect of smoking is STRONGER in men on the additive scale (+20% in women but +30% in men), but WEAKER on the multiplicative scale (x3 in women but x2.5 in men). How can that be (or rather: what can that mean in biological terms)? In fact the "effect modification" is an artefact caused by the choice of the statistical model. I would suggest that interpreting statistical interactions in biological terms is hazardous unless the baseline risks are nearly identical (e.g. the risks of disease in male and female nonsmokers). There is more discussion of this problem in Rothman and Greenland's Modern Epidemiology.

2) You submit that consistency (of the presence of an interaction) across closely related outcomes lends credibility to the effect. That is circular logic. If two outcomes are closely correlated within a sample, their analysis MUST yield similar risk models. There is no corroborating evidence in that observation.
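The arithmetic in point 1 can be checked directly. The sketch below uses only the four risks quoted there; the names `risk` and `smoking_effect` are illustrative.

```python
# Risks quoted in point 1, keyed by (sex, smoking status).
risk = {
    ("female", "nonsmoker"): 0.10,
    ("female", "smoker"): 0.30,
    ("male", "nonsmoker"): 0.20,
    ("male", "smoker"): 0.50,
}

def smoking_effect(sex):
    """Effect of smoking within one sex on the additive and multiplicative scales."""
    baseline = risk[(sex, "nonsmoker")]
    exposed = risk[(sex, "smoker")]
    return {
        "risk_difference": exposed - baseline,  # additive scale
        "risk_ratio": exposed / baseline,       # multiplicative scale
    }

women = smoking_effect("female")
men = smoking_effect("male")

for label, result in (("women", women), ("men", men)):
    print(f"{label}: difference = {result['risk_difference']:+.2f}, "
          f"ratio = x{result['risk_ratio']:.1f}")
```

The same four numbers give a larger risk difference in men and a larger risk ratio in women, so the direction of the apparent effect modification depends on the scale chosen.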

Thank you for your thoughts on this, and apologies for taking up so much space...

Competing interests:

None declared


**06 April 2010**

Congratulations to the authors on an excellent and very useful article. I wanted to make one minor point that could avoid some misinterpretation. The first criterion is written as: "Is the subgroup variable a characteristic measured at baseline or after randomisation?" This could be taken to mean that both of these are acceptable ways of specifying subgroups, when of course only the first is. The other criteria are not written in this way, and only mention the acceptable choice (or it is clear which is better, as with "within rather than between studies"). It would be clearer for the criterion simply to say "Is the subgroup variable a characteristic measured at baseline?"

Competing interests:

None declared


**31 March 2010**

## Rethinking the premises of subgroup analyses

In updating criteria for evaluating subgroup analyses, Sun et al.[1] emphasize that subgroup effects must be evaluated in terms of relative effects on each subgroup. In doing so, they explain that the same relative reduction of different base rates would yield different absolute reductions.

I will not defend measuring subgroup effects in terms of absolute reductions. But a fundamental flaw in subgroup analyses to date is the premise that it is somehow normal that a factor will cause equal proportionate changes in different base rates and that, when that does not occur, one has identified some meaningful difference in the way the factor affects the groups with the different base rates. Such a premise overlooks the pattern, inherent in features of normal risk distributions, whereby the rarer an outcome, the greater tends to be the relative difference in experiencing it and the smaller tends to be the relative difference in avoiding it.[2-7] A corollary to such a pattern is one whereby a factor that reduces an outcome will tend to reduce it to a larger proportionate degree in the group with the lower base rate while increasing the opposite outcome to a larger proportionate degree in the other group.[3,4,6] It is only with a recognition of these patterns that one may begin to identify a meaningful subgroup effect (i.e., one that is not a function of the differing base rates and one that would not yield an opposite interpretation as to comparative effect size if one examined the opposite outcome).
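The pattern described above can be illustrated with a minimal sketch. It assumes two groups whose risk scores are normally distributed with equal variance and means half a standard deviation apart, and it treats falling below a cutoff as the adverse outcome; these particular values are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

# Assumed underlying distributions: equal spread, means 0.5 SD apart.
advantaged = NormalDist(mu=0.5, sigma=1.0)
disadvantaged = NormalDist(mu=0.0, sigma=1.0)

def relative_differences(cutoff):
    """Rate ratios for experiencing and for avoiding the adverse outcome."""
    adverse_adv = advantaged.cdf(cutoff)      # share of advantaged group below cutoff
    adverse_dis = disadvantaged.cdf(cutoff)   # share of disadvantaged group below cutoff
    return {
        "adverse_rate_disadvantaged": adverse_dis,
        "ratio_experiencing": adverse_dis / adverse_adv,
        "ratio_avoiding": (1 - adverse_adv) / (1 - adverse_dis),
    }

# Lowering the cutoff makes the adverse outcome rarer in both groups.
rows = [relative_differences(cutoff) for cutoff in (0.0, -1.0, -2.0)]

for row in rows:
    print(f"adverse rate {row['adverse_rate_disadvantaged']:.3f}: "
          f"ratio experiencing = {row['ratio_experiencing']:.2f}, "
          f"ratio avoiding = {row['ratio_avoiding']:.2f}")
```

Under these assumptions, as the adverse outcome becomes rarer the ratio of rates of experiencing it rises while the ratio of rates of avoiding it falls, even though the distance between the two distributions never changes.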

Even apart from the above patterns, it is illogical to regard it as somehow normal that a factor will cause equal proportionate changes to different base rates, for the simple reason that it is mathematically impossible for a factor to cause equal proportionate changes in different base rates of an outcome while at the same time causing equal proportionate changes in the opposite outcome. That is, for example, if one group has a base rate of 5% and another has a base rate of 10%, a factor that reduces both rates by 20% (to 4% and 8%) would necessarily increase the rates of experiencing the opposite outcome by different proportionate amounts (95% increased to 96%, a 1.1% increase; 90% increased to 92%, a 2.2% increase). And since there is no more reason to regard it as normal for there to be equal proportionate decreases in one outcome than to regard it as normal for there to be equal proportionate increases in the opposite outcome, there is no reason to regard it as normal for there to be equal proportionate changes in either outcome.
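The worked figures in this paragraph can be reproduced with a few lines of arithmetic. The function name `proportionate_changes` is illustrative; the inputs are the 5% and 10% base rates and the 20% relative reduction given above.

```python
def proportionate_changes(base_rate, relative_reduction):
    """Relative change in an outcome and in its opposite after a relative reduction."""
    treated_rate = base_rate * (1 - relative_reduction)
    opposite_before = 1 - base_rate
    opposite_after = 1 - treated_rate
    return {
        "treated_rate": treated_rate,
        "outcome_change": (treated_rate - base_rate) / base_rate,
        "opposite_change": (opposite_after - opposite_before) / opposite_before,
    }

low = proportionate_changes(base_rate=0.05, relative_reduction=0.20)
high = proportionate_changes(base_rate=0.10, relative_reduction=0.20)

for label, result in (("5% base rate", low), ("10% base rate", high)):
    print(f"{label}: outcome {result['outcome_change']:+.1%}, "
          f"opposite outcome {result['opposite_change']:+.1%}")
```

Both groups show the same 20% reduction in the outcome, but the opposite outcome rises by roughly 1.1% in one group and 2.2% in the other, so the proportionate changes cannot be equal for both outcomes at once.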

In any case, I suggest that the only effective way to identify a true subgroup effect is to derive from the base and treated rates for each group the difference between the means of the hypothetical underlying distributions, as discussed in reference 6 and the Subgroup Effects page of the Scanlan’s Rule page of jpscanlan.com.[8]
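One plausible reading of this suggestion can be sketched as follows. It assumes, as an illustration only, that the hypothetical underlying distributions are normal with equal variance, so that each rate corresponds to a point on the standard normal scale and the shift between base and treated rates is the difference between those points. The method set out in reference 6 may differ in detail, and the name `implied_shift` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # assumed shape of the underlying risk distribution

def implied_shift(base_rate, treated_rate):
    """Shift between underlying means, in SD units, implied by two outcome rates."""
    return std_normal.inv_cdf(base_rate) - std_normal.inv_cdf(treated_rate)

# The 20% relative reductions from the 5% and 10% base rates discussed above.
shift_low_base = implied_shift(base_rate=0.05, treated_rate=0.04)
shift_high_base = implied_shift(base_rate=0.10, treated_rate=0.08)

print(f"5% -> 4%:  shift of {shift_low_base:.3f} SD")
print(f"10% -> 8%: shift of {shift_high_base:.3f} SD")
```

Under these assumptions the two equal relative reductions correspond to different shifts, and the same shift is obtained whether the outcome or its opposite is analysed, since reversing the outcome only reflects both points about zero on the normal scale.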

References:

1. Sun X, Briel M, Walter SD, Guyatt GH. Is a subgroup effect believable? Updating criteria to evaluate the credibility of subgroup analyses. BMJ 2010;340:850-854.

2. Scanlan JP. Can we actually measure health disparities? Chance 2006;19(2):47-51: http://www.jpscanlan.com/images/Can_We_Actually_Measure_Health_Dispariti... (Accessed 6 June 2010)

3. Scanlan JP. Race and mortality. Society 2000;37(2):19-35: http://www.jpscanlan.com/images/Race_and_Mortality.pdf (Accessed 6 June 2010)

4. Scanlan JP. Divining difference. Chance 1994;7(4):38-9,48: http://jpscanlan.com/images/Divining_Difference.pdf (Accessed 6 June 2010)

5. Scanlan JP. The Misinterpretation of Health Inequalities in the United Kingdom, presented at the British Society for Population Studies Conference 2006, Southampton, England, Sept. 18-20, 2006: http://www.jpscanlan.com/images/BSPS_2006_Complete_Paper.pdf (Accessed 6 June 2010)

6. Scanlan JP. Interpreting Differential Effects in Light of Fundamental Statistical Tendencies, presented at 2009 Joint Statistical Meetings of the American Statistical Association, International Biometric Society, Institute for Mathematical Statistics, and Canadian Statistical Society, Washington, DC, 1-6 Aug. 2009. PowerPoint Presentation: http://www.jpscanlan.com/images/Scanlan_JSM_2009.ppt (Accessed 6 June 2010). Oral Presentation: http://www.jpscanlan.com/images/JSM_2009_ORAL.pdf (Accessed 6 June 2010)

7. Scanlan JP. Measuring Health Inequalities by an Approach Unaffected by the Overall Prevalence of the Outcomes at Issue, presented at the Royal Statistical Society Conference 2009, Edinburgh, Scotland, 7-11 Sept. 2009. PowerPoint Presentation: http://www.jpscanlan.com/images/Scanlan_RSS_2009_Presentation.ppt (Accessed 6 June 2010)

8. Subgroup Effects page of Scanlan’s Rule page of jpscanlan.com (Accessed 6 June 2010)

Competing interests:

None declared

**07 June 2010**