Comparison of treatment effects between animal experiments and clinical trials: systematic review

BMJ 2007; 334 doi: (Published 25 January 2007) Cite this as: BMJ 2007;334:197

On “concordance” and “discordance”

In their valuable paper Perel et al. [1] present an outline of a
systematic review [2] concerning the degree of agreement between animal
experimentation (or “vivisection” [6, p. 30, n. 1]) and clinical trials in
six fields of medical inquiry:

(a) thrombolysis in acute ischaemic stroke (AIS),

(b) bisphosphonates to prevent and treat osteoporosis,

(c) antenatal corticosteroids to prevent neonatal respiratory
distress syndrome,

(d) corticosteroids in traumatic head injury,

(e) tirilazad in acute ischaemic stroke,

(f) antifibrinolytics in haemorrhage.

The authors conclude that in cases (a), (b), (c) there was
«concordance» or «similar outcomes», while in (d), (e), (f) there was not.
Most disquietingly -- but not unexpectedly for anyone who knew [3] and [4],
or bothered to examine with some care a sample of the vivisectionist
literature -- in all six cases the quality of the animal experiments
examined turned out to be «poor». This was indicated by violation of such
basic requirements as randomization, adequate allocation concealment, and
blinded assessment of outcome (cf. table 1 in [1]).

Although this dire balance sheet is in itself a sufficient refutation
of the standard case for animal experiments as a scientific (rather than
psychological) foundation for clinical research, a closer examination of
the meaning of “concordance” and “discordance” is rewarding.

1 “Concordance”

In case (a) the authors found evidence of publication bias (i.e.
articles with positive results were more likely to have been published),
which implies that the available data are statistically hard to make sense
of. Indeed, a general difficulty facing systematic reviewers is that «The
proportion of the work that gets published in a form that is available to
the public (rather than just being available to industry and the
regulatory authorities) is unknown» [13]; a good idea would be to contact
the regulatory agencies but, to cite the authors’ own example, the Home
Office in the United Kingdom has no accessible register of animal
experiments licensed under the Animals (Scientific Procedures) Act 1986 [1].

In case (c) there was concordance as to reduction of respiratory
distress syndrome in animals and in neonates, but not as to mortality,
which was reduced in neonates and yet was increased in animal models, and
significantly so in ewe models (the pooled odds ratio was 12.5, with 95%
confidence interval 1.9 to 79.2, «with no evidence of significant
heterogeneity [...]»). Clearly mortality is such a crucial parameter
that, for those who take animal experiments seriously, results in even one
animal model indicating that a treatment is associated with increased
mortality should lead to immediate withdrawal of that treatment from
clinical use and experimentation. For antenatal corticosteroids we know,
from the clinical trials, that this decision would have resulted in the
loss of human lives.
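
The width of the interval quoted above for the ewe models can be made concrete with a little arithmetic. The following is only a minimal sketch, under the assumption (mine, not stated in [1]) that the interval is symmetric on the log odds scale and corresponds to a normal quantile of 1.96; the function and variable names are hypothetical and chosen for illustration.

```python
import math

def se_from_ci(lower, upper, z=1.96):
    """Standard error of log(odds ratio) implied by a confidence interval.

    Assumes the interval was built as exp(log(OR) +/- z * SE), which is the
    usual construction but is an assumption here, not something stated in [1].
    """
    return (math.log(upper) - math.log(lower)) / (2 * z)

# Figures quoted in the text for mortality in ewe models.
odds_ratio, ci_low, ci_high = 12.5, 1.9, 79.2

se_log_or = se_from_ci(ci_low, ci_high)        # implied SE on the log scale
z_score = math.log(odds_ratio) / se_log_or     # distance of log(OR) from 0 in SE units
width_ratio = ci_high / ci_low                 # how many times larger the upper bound is

print(f"implied SE of log(OR): {se_log_or:.2f}")
print(f"implied z statistic:   {z_score:.2f}")
print(f"upper/lower bound:     {width_ratio:.1f}")
```

Under these assumptions the estimate is statistically distinguishable from an odds ratio of 1, yet the upper bound is roughly forty times the lower one, so the size of the effect in the animal model remains very imprecisely known.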

The authors in their full report cite, as evidence of «lack of
communication» between the animal experimenters and clinical researchers,
the «conclusion of one animal head injury study that “timely high-dose
dexamethasone treatment may help clinicians to manage head injury cases”»,
which was published six months after the publication of the CRASH trial.
The CRASH trial had shown that this class of treatments actually increased
mortality in head injury patients [2, p. 45].

While this example is enlightening as further evidence that animal
experimentation, unaided by previous knowledge of the correct result that
“must” be found, is no better -- and no less dangerous -- than
divination, one should nonetheless keep track of the extent to which
animal experimenters who “got it right” had read into their (perhaps
suitably “elaborated”) data what they knew clinical investigators had
already shown. For example, in the “concordant” cases (b) and (c), as the
authors inform us in the full report, «The animal research continued after
effectiveness in humans was established» [2]. This raises further doubts
about the true degree and quality of the purported «concordance».

Let us clear the ground of a possible misunderstanding, which often
lingers in the debate between supporters and critics of animal
experimentation. That some degree of “concordance” between animal
experiments and clinical trials will be found is no surprise at all. To
say that animal experiments are unreliable, as indeed [1] confirms they
are, does not mean that they always give the wrong indication. Were it
so, they would be exceedingly useful, since by systematically denying what
they indicate we should always get a correct answer to our medical
questions. Unfortunately, life is not so easy, and here lies precisely the
danger of all pseudoscientific practices [8].

2 “Discordance”

The discordance in cases (d) and (e) is particularly worrying, as
animal experiments failed to show the increase in mortality in case (d)
and the risk of death and dependency in case (e). This means that, by
relying on the “indications” provided by animal experiments, lives have
been lost.

In case (e), however, the authors try to show that the discordance
can be explained by using the results of “concordant” case (a). If
successful, this attempt might vindicate, at least in one example, the
image of animal experimentation as a potentially reasonable, self-
correcting endeavor. But is it so? Let us see.

According to the authors, the discordance in the tirilazad case
between animal experimentation on AIS and clinical trials may have arisen
from the difference in the time delays, between onset of AIS and
treatment, in humans (median 5 hours) and in animal models (median 10
minutes). This suggestion (also advanced in [5]) needs to be carefully
examined.

At first sight, the remark that excessive delay to treatment,
particularly in the case of AIS, might reduce the efficacy of a treatment
seems quite plausible. However, can the thrombolysis animal experiments,
where there was “concordance”, be used as a sufficient ground to think
that “similar” benefits (or lack of them) in animals and, respectively,
humans are obtained with quantitatively “similar” delays to treatment? In
fact, «there is some evidence that it might take many hours for damage to
develop in human brains» ([5]). Moreover, as can be read in the complete
report [2, p. 26], in the tirilazad experiments «there was no significant
relationship between delay to treatment and efficacy», although «maximum
efficacy was seen when treatment was given before the onset of ischaemia,
with efficacy appearing to fall thereafter with time». Thus even after the
systematic review ([1], [2]), the evidence for prescribing similar delays
to treatment in animal experiments and in human trials in order to get
clinically relevant animal data is far from unambiguous. It is a prudent
bet that whatever lessons will be learned from the tirilazad failure (and
from the failure of more than 37 supposedly neuroprotective drugs tested
in more than 114 clinical trials after encouraging results on animals
[5]), they will come from careful study of the human, rather than animal,
data.

In general, if one wishes to proceed in a truly scientific way, one
cannot but refer to the mechanism of the pathological process and to the
mode of action of the drug, if known. More precisely, if one can assume
that

(1) in both animal models and humans an essential component of the
pathological process acts by the same mechanism and

(2) with the same time scale, or otherwise with a known scale factor;
and that

(3) the mode of action of the drug in both animal models and humans
is to interfere in the same way with that component of the pathological
process, and to affect in the same way any other organs involved;

then one might correctly draw from animal experiments likely
conclusions about the efficacy and adverse reactions of the drug on
humans.

Once the argument is made explicit it is easy to see what the trouble
is. Even if we knew that (1) and (2) hold -- and unfortunately this is
hardly ever the case, not least because of the well-known sensitivity of
living organisms to small differences, both intraspecific and
interspecific ([6, pp. 15-32]) -- still (3) can be assessed only after
both the animal experiments and the human trials have been performed. In
short, we can judge the correctness of the inference from animals to
humans only with hindsight: “similarity” between animal data and clinical
results is normally the effect of an a posteriori and largely arbitrary
reconstruction. This means that animal experiments cannot be
used to (scientifically) predict results on humans. And yet the assumption
of predictive power is about the only justification of animal experiments
which can boast some degree of public acceptance.

3 Combining data from different species

A fundamental question which all systematic reviewers must face is
how to combine results from different animal species in a way that is
relevant to still another species. Two of the authors stated in 2002:
«Consistent results across species and models would provide some
reassurance that human beings might respond in the same way» ([7]; see
also [3]). What is the meaning of «across species»? In how many, or in
which, species and strains should one get consistent results in order to
be rationally (not merely psychologically) reassured? It is useful to
remember that, among mammals alone, about 4,237 different species are
known (I take the figure from a book published in 1994), and as already
mentioned there are many cases (discussed for instance in [6] and [8])
where inconsistency occurs even inside one species, with outcomes
depending on strains, sex of animals, food, cages, laboratory environment
etc. Taking into account comorbidities, although prima facie reasonable,
would only exponentially magnify the intractability of the problem.

I am sure this point need not be belabored, at a time when huge
resources are being invested in the development of drugs tailored to the
genetic profile of individual (human) patients. The main reason for
such a gigantic research effort is, of course, the recognition that most
drugs are effective only on a minority -- sometimes a very small one -- of
patients ([9], [10]).

It is on this very basic issue that science and vivisection part
ways. There is simply no scientifically validated procedure for combining
data from experiments on different non-human species so as to obtain
information relevant to humans. The burden of proof is on those who
claim that such a validation has ever been provided. For example, as ECVAM
(European Centre for the Validation of Alternative Methods) director
Thomas Hartung recently stated, «The [animal] toxicity tests that have
been used for decades are “simply bad science”» [11]. There is no reason
to think that systematic reviews, in themselves, can remedy this
situation. For all we know, to combine results from several species might
be like the procedure used to solve the legendary problem of the length of
the nose of the Emperor of China, as described by physicist Richard P.
Feynman [12, pp. 295-6]: one asked Chinese people what they thought this
length was, and then averaged the answers. All very well -- except that no one
in China had ever been permitted to see the Emperor.

In conclusion, it is not surprising that animal experimentation does
not inform human health care [3]: there is no known scientific way in
which it could do that. The usefulness of systematic reviews of animal
studies lies in bringing home this important truth, while the suggestion
to adopt «an iterative approach to improving the relevance of animal
models to clinical trial design» [1] seems, in the light of all the
available evidence, unwarranted.


[1] Perel P., Roberts I., Sena E., Wheble P., Briscoe C., Sandercock
P., Macleod M., Mignini L. E., Jayaram P., Khan K. S., “Comparison of
treatment effects between animal experiments and clinical trials:
systematic review”, BMJ, 15 December 2006


[2] --, “Testing treatment on animals: relevance on humans”,


[3] Roberts I., Kwan I., Evans P., Haig S., “Does animal
experimentation inform human healthcare? Observations from a systematic
review of international animal experiments on fluid resuscitation”, BMJ,
vol. 324, 23 February 2002, pp. 474-6.


[4] Pound P., Ebrahim S., Sandercock P., Bracken M. B., Roberts I.,
“Where is the evidence that animal research benefits humans?”, BMJ, vol.
328, 28 February 2004, pp. 514-7.


[5] Green A. R., Odergren T., Ashwood T., “Animal models of stroke:
do they have value for discovering neuroprotective agents?”, TRENDS in
Pharmacological Sciences, vol. 24 (8), August 2003, pp. 402-8.

[6] Croce P., “Vivisection or science? An investigation into testing
drugs and safeguarding health”, London: Zed Books, 1999.

[7] Sandercock P., Roberts I., “Systematic reviews of animal
experiments”, The Lancet, vol. 360, August 24, 2002, p. 586.

[8] Mamone-Capria M., “Pseudoscienza nella scienza biomedica
contemporanea: il caso della vivisezione”, Biologi Italiani, June 2003,
33(6), pp. 10-27.


[9] Connor S., “Glaxo chief: Our drugs do not work on most patients”,
The Independent, 8 December 2003.

[10] Smith R., “The drugs don’t work”, BMJ, 13 December 2003, vol.


[11] Abbott A., “More than a cosmetic change”, Nature, 10 November
2005, vol. 438, pp. 144-6.

[12] Feynman R.P., “Surely You’re Joking, Mr. Feynman!”, Unwin
Paperbacks 1986.

[13] Mignini L. E., Khan K. S., “Methodological quality of
systematic reviews of animal studies: a survey of reviews of basic
research”, BMC Medical Research Methodology, 13 March 2006, 6:10.


Competing interests:
None declared


10 January 2007
Marco Mamone-Capria
University of Perugia, 06123 Perugia