Are these data real? Statistical methods for the detection of data fabrication in clinical trials
BMJ 2005;331 doi: https://doi.org/10.1136/bmj.331.7511.267 (Published 28 July 2005). Cite this as: BMJ 2005;331:267

All rapid responses
There is no official definition of many terms used in randomised
trials, including double blind, single blind, intention to treat, and so
on. The term randomised does have a precise technical meaning, but it is
often misused. Labels are valuable only if they have a unique meaning and
are used only in the correct way.
John Williams queries the definition of a single blind trial. One
publication in the BMJ states that in a single blind trial “either only
the investigator or only the patient is blind to the allocation”.[1] The
term is thus unhelpful without clarification. Double blind trials are
just as confusing as single blind trials. A survey of physicians and a
review of textbooks and reports revealed numerous interpretations of the
designation “double-blind.”[2] Of key importance in both single and
double blind trials is whether the outcome assessor is blinded.
Hence the CONSORT Statement avoids labels and asks for specific
information: “Whether or not participants, those administering the
interventions, and those assessing the outcomes were blinded to group
assignment. If done, how the success of blinding was evaluated.”[3]
Likewise, in an article in which we tried to clarify the various
“terminological tangles” associated with blinding, we wrote: “we urge that
authors explicitly state what steps were taken to keep whom blinded. If
they choose to use terminology such as single-, double-, or triple-
blinding in reporting randomized controlled trials, they should explicitly
define those terms.”[4]
Arguments about the correct meaning of “single-blind” are pointless.
1 Day SJ, Altman DG. Blinding in clinical trials and other studies.
BMJ 2000;321:504.
2 Devereaux PJ, Manns BJ, Ghali WA, Quan H, Lacchetti C, Montori VM,
et al. Physician interpretations and textbook definitions of blinding
terminology in randomized controlled trials. JAMA 2001;285:2000-3.
3 Moher D, Schulz KF, Altman D for the CONSORT Group. The CONSORT
statement: revised recommendations for improving the quality of reports of
parallel-group randomized trials. JAMA 2001;285:1987-91. [see also
www.consort-statement.com]
4 Schulz KF, Chalmers I, Altman DG. The landscape and lexicon of
blinding in randomized trials. Ann Intern Med 2002;136:254-9.
Competing interests: None declared
The methodologies suggested in this paper have further applications
in the reviewing process: they can check for data errors that arise
not only through fraud but also through faulty data entry. In a recent
paper under review I found a very large s.d. in the BMI field for one arm
of a clinical trial, and this was shown to be due to a misplaced decimal
point: the BMI should have been ~30 but was ~300, because the decimal point
had been omitted in the weight field.
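As an illustration, here is a minimal sketch of the kind of plausibility screen that would have caught this error; the function name, thresholds, and data are all invented:

```python
# Hypothetical plausibility screen for a BMI column: flags values far
# outside the physiologically credible range, such as a BMI of ~300
# produced by a misplaced decimal point in the weight field.
def flag_implausible(values, low=10.0, high=80.0):
    """Return (index, value) pairs falling outside [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not low <= v <= high]

bmi = [28.4, 31.2, 301.0, 26.7]  # made-up data; 301.0 is the decimal error
print(flag_implausible(bmi))     # -> [(2, 301.0)]
```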
I am not sure that comparing the s.d. is always the correct
comparison. In the example in this paper there is a shift in location between
the two trials, as one trial had hypertensive patients only. In this
context the coefficient of variation (c.v.) may be a better measure of
dispersion.
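A small sketch with invented blood pressure values shows how the c.v. adjusts for the shift in location between groups:

```python
import statistics

# With a location shift between groups (e.g. one trial recruited only
# hypertensive patients), the coefficient of variation (c.v. = s.d./mean)
# compares relative rather than absolute dispersion.
def cv(values):
    return statistics.stdev(values) / statistics.mean(values)

normotensive = [118, 124, 131, 127, 122]  # invented systolic pressures
hypertensive = [152, 161, 170, 158, 166]  # invented, shifted upwards
print(statistics.stdev(normotensive), statistics.stdev(hypertensive))  # ~4.9 vs ~7.0
print(cv(normotensive), cv(hypertensive))                              # ~0.040 vs ~0.043
```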
Data from other papers about the same disease group might serve as
better comparators than data of the same type from a general population.
Consequently, as I review, it may prove useful to maintain a database
of the means and s.d.s of key variables; for me this would be in populations
of subjects with diabetes.
Finding outliers when comparing s.d.s and means might also mean that
the data (as in my example above) were singly punched late at night
by a junior researcher with no validation checks. Double entry data
systems with validation should be required by review and ethics
committees.
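A double entry check can be as simple as comparing two independent keying passes field by field; here is a minimal sketch with invented records:

```python
# Records keyed twice by different operators are compared field by field;
# any mismatch is sent back for checking against the source documents.
first_pass  = {"patient_07": {"weight_kg": 86.2, "bmi": 30.1}}
second_pass = {"patient_07": {"weight_kg": 862,  "bmi": 30.1}}  # keying error

def mismatches(a, b):
    return [(pid, field, a[pid][field], b[pid][field])
            for pid in a for field in a[pid]
            if a[pid][field] != b[pid][field]]

print(mismatches(first_pass, second_pass))
# -> [('patient_07', 'weight_kg', 86.2, 862)]
```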
Competing interests: None declared
Mr. Schwarz brings up the idea of a "single-blind" study as one in
which EITHER the human subject(s) or the experimenter(s) making the
clinical measurements are unaware of the trial conditions at the time of
measurement.
If indeed this was the meaning intended both in the criticized diet
study and the criticism of it, then the term may not have been misused by
Al-Marzouki et al, depending on which meaning was intended in the diet
study.
But this raises other questions: if the subjects know the
conditions, why blind the experimenters? What use could this serve? The
experimenters could merely ask the subjects during the trials some
question revealing the assigned group: "How have you been feeling since
going on the fruit diet? Step onto the scale." If Mr. Schwarz's
definition was intended, then unblinded bias would again be as likely as
fraud as an explanation of some of the bias found in the statistical
averages.
Measurements of weight or heart rate are objective, so the Schwarz
use of "single-blind" would seem intended primarily to prevent intentional
falsification rather than unconscious bias. But, as just pointed out, an
INTENTION to fabricate would sidestep this kind of "single-blind" easily.
If this is a real difference in usage, it might be worthwhile for BMJ
or its interested readers to try to come to agreement on terminology on
this issue. Mr. Schwarz cites no reference for his definition of "single
-blind"; but, it would have been useful to be able to examine this
question in more detail.
It would appear that the usage proposed by Mr. Schwarz is ambiguous
and thus liable to be misused. What about a diet study in which the food
looks the same to the subjects, or an exercise study in which the subjects
can tell what they are doing, but do not know otherwise the purpose of the
study? This reverses the meaning of "single-blind" (experimenter vs.
subject) with no hint of this reversal to the reader.
I tried a search for "Single Blind" on Yahoo and on Google. Only one
of the first few dozen pages returned agreed with Mr. Schwarz's definition
(http://www.medterms.com). The NCI and other sites, which could be
considered authoritative, all unequivocally define "single-blind" the way
it is used in the reference I cited previously. Unfortunately, my old
copy of the Merck Manual does not address experimentation of this kind.
It is unclear to me whether Mr. Schwarz should have referred to his
definition as "scientific". It would be useful to know the context in
which Mr. Schwarz's definition has been used in science, and by whom
(other than the apparently erroneous usage by Al-Marzouki et al pointed
out in my previous posting).
Competing interests: None declared
Let us not allow this debate to become lost in statistical complexities; let us keep our focus on results.
The authors compared 2 trials. One was wonderfully well done, the MRC trial [MEDLINE 2861880], in which, over 85,572 years of observation with beta-blocker or diuretic in mildly hypertensive patients [diastolic 90-109 mm Hg], the difference in deaths at trial end vs. placebo was 5, one death per roughly 8,500 years of drug use. Everything, even in hindsight, was perfect, and that should have been the end of 2 drugs if mortality is an endpoint in such patients. But was it?
The later trial of 1992 [MEDLINE 1586782] lacked statistical rigour, but it was an intriguing study in only 406 patients with suspected acute myocardial infarction, in which 17 (44%) fewer patients died on the intervention diet.
There are questions, but when many drugs don't save lives even in the best run mega-trials, one should repeat the trials that do find a mortality benefit. Paraphrasing Harvard's Dr Alexander Leaf, writing in Circulation in 1999 about an equally surprising diet trial with similar benefits, the Lyon Diet Heart Study: "first let's find an effect and then figure out what caused it".
When a prevention approach actually produces results within the time frame of a single human being, others should replicate such trials, especially if it takes patient numbers only in the hundreds. This should have been the value of the BMJ publishing this kind of study, preferably with an editorial, in the first place. vos{at}health-heart.org
Competing interests: None declared
This criticism of the Al-Marzouki et al analysis stems from a
misunderstanding of the definition of a single-blind study.
Mr. Williams' "standard definitions" are taken from a textbook of
pharmacology, which gives the definitions for pharmacological studies.
These definitions are correct for
an open (unblinded) study (the experimenters and the subjects are aware
of the conditions) and for a double-blind study (both the experimenters
and the subjects are unaware of the conditions). However, the definition
given for a single-blind study is not the full scientific definition.
A single-blind study is one in which EITHER the experimenter OR the
subjects are unaware of the conditions!
In pharmacological single-blind trials (such as clinical trials of
new drugs) it is always the subjects who are "blind", since some receive a
placebo. However, in diet studies (such as the one Al-Marzouki et al
analysed), the subjects always know what they are eating, so in this case
a single-blind study means that the experimenters doing the various
measurements do not know to which group (intervention diet or control) the
person they are checking belongs. Thus, no bias of the data should have
been expected.
In any case, even if the trial was known to be unblinded, this would
still not explain the results of the two other statistical tests. It is the
combination of the differences in means, variances, and digit preference
that strengthens the conclusion that data fabrication took place in the diet
trial.
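As a rough sketch (not the authors' actual code), the three checks might look like this on simulated data: a t-test for means, an F-test for variances, and a chi-square test of final-digit uniformity:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(120, 10, 200)  # simulated baseline values, group A
b = rng.normal(120, 10, 200)  # simulated baseline values, group B

# Difference in means (two-sample t-test).
t_stat, t_p = stats.ttest_ind(a, b)

# Ratio of variances (two-sided F-test).
f_stat = np.var(a, ddof=1) / np.var(b, ddof=1)
f_p = 2 * min(stats.f.cdf(f_stat, len(a) - 1, len(b) - 1),
              stats.f.sf(f_stat, len(a) - 1, len(b) - 1))

# Digit preference: are the final digits of the pooled values uniform?
digits = np.concatenate([a, b]).round().astype(int) % 10
chi_stat, chi_p = stats.chisquare(np.bincount(digits, minlength=10))

print(f"means p={t_p:.3f}, variances p={f_p:.3f}, digits p={chi_p:.3f}")
```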
Competing interests: None declared
I read with interest the above discussion of Benford's Law as a route
to assessing fraud in clinical data. Whilst most data from clinical
trials may not be suitable for assessment by Benford's Law (as Prof Evans
notes), readers may be interested to know that a paper has recently been
published showing that Benford's Law can be used to screen other types of
analytical data:
http://www.rsc.org/publishing/journals/AN/article.asp?doi=b504462f .
Benford's Law has the great advantage over other statistical techniques
involving the mean and standard deviation that one has prior knowledge of
what answer to expect from the test, i.e. the Benford distribution is a
property of data in general, whilst other statistical distributions are
properties of the particular data set. However, for Benford's Law to be
useful as a screening technique the data being examined must, as a rule of
thumb, span at least four orders of magnitude, a criterion which I suspect
would not be met for most sets of clinical data.
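As a minimal sketch of such a first-digit screen (the sample values are invented merely to span several orders of magnitude):

```python
import math
from collections import Counter

def first_digit(x):
    """Leading non-zero digit of a positive number."""
    s = f"{abs(x):.15g}".lstrip("0.")
    return int(s[0])

def benford_chisq(values):
    """Chi-square statistic against Benford's law (8 degrees of freedom)."""
    counts = Counter(first_digit(v) for v in values if v != 0)
    n = sum(counts.values())
    return sum((counts.get(d, 0) - n * math.log10(1 + 1 / d)) ** 2
               / (n * math.log10(1 + 1 / d)) for d in range(1, 10))

data = [3.2e-3, 0.17, 1.9, 24.0, 310.0, 4800.0]  # spans >4 orders of magnitude
print(benford_chisq(data))
```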
Competing interests: None declared
It is my experience that Benford's law is of very limited
use in the detection of fabrication or falsification in medical research data.
It is of unquestioned value in financial fraud, including health claims data,
but research data are of a different nature. For example, systolic blood
pressures in a very large number of patients in a typical randomised trial may
all have 1 as the first digit. Their cholesterol values may have no first
digits of 1 and almost certainly none of 2. These are clear departures from
Benford's law but are definitely not examples of fabrication or falsification.
Even if all the variables are taken together, the
pattern does not conform to Benford's law. In neither of the trials studied in
our paper do any of the variables taken singly show even a remote fit to the
distribution suggested by Benford. Taken together, the 5 variables in common
between the trials do not show such a fit, so lack of fit to Benford's law
is no evidence of fabrication or falsification in this situation.
For example, from the MRC trial, the pattern for the five
variables considered is shown in the table:

First digit    Frequency    %        Benford's %
1              1,774        42.34    30.1
2                131         3.13    17.6
3                  4         0.10    12.5
4                 87         2.08     9.7
5                333         7.95     7.9
6                537        12.82     6.7
7                547        13.05     5.8
8                431        10.29     5.1
9                346         8.26     4.6
The requirement is that the data must have a range that
covers at least two orders of magnitude. This often applies in financial data
but only rarely, if ever, in medical research data. If a very large number of
variables were taken together then the fit would be rather better, but the
problem is that the selection of variables can affect the distribution of first
digits even when many are chosen. The argument can always be made that not enough
variables have been considered for Benford's law to be applicable.
Incidentally, as with many such things, Newcomb, not Benford,
was the original discoverer of the law.
Competing interests: None declared
The problems raised by the BMJ (1,2) are very important for the medical
research community. Unfortunately, there is no simple solution to the
problem of fraudulent or tampered data. The paper by Al-Marzouki, Evans,
Marshall and Roberts (1) uses statistical tests to compare variances for
baseline variables in two groups of randomised controlled trials (RCTs).
Their implicit hypothesis (which seemed to be true) is that a researcher
who tampers with data does not know that the variances should be approximately
the same and generates all his data by hand.
Unfortunately, only a statistically illiterate researcher will do that,
especially for publication in a high-impact journal. If a researcher lacks
ethical constraints he will simply decide on the necessary mean values, take
the variance from published sources or from a small sample of patients, and then
run the function that exists in all statistical packages to generate, for example,
normally distributed data with a given mean and variance. He can then
project what result he would like to get and repeat the process for made-up
follow-up data. He can even use non-normally distributed data, mixed
distributions, etc. There is almost no way to uncover such fraud, except
by collecting on-site evidence that the investigation was not performed. Such
investigation is legally difficult and very expensive, especially in the
computer era, as most data are accumulated in electronic (easy to
tamper with) form. All solutions (like registration of all controlled
trials with the possibility of sudden on-site inspection, collection of data
with time-stamps in a third place, etc.) will be very costly and will eventually
harm mostly innocent researchers through a rise in research
expenditures.
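To make the point concrete, here is a minimal sketch of how trivially such data can be generated; the target mean and s.d. below are invented:

```python
import numpy as np

# Fabricated "baseline systolic BP" drawn to match a chosen mean and s.d.;
# data generated this way would pass comparisons of means and variances.
rng = np.random.default_rng(42)
fabricated_sbp = rng.normal(loc=150, scale=12, size=200)
print(fabricated_sbp.mean(), fabricated_sbp.std(ddof=1))  # ~150, ~12
```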
With the future of statistical fraud control bleak, more discussion
should be directed to possibilities for decreasing the incentives connected
with fraud. The impact of research fraud would be lessened if consumers of
scientific information remembered about, and demanded, reproducibility.
This, in turn, should influence decisions of Institutional Review Boards
(IRBs) on repeating RCTs. It is now considered unethical to repeat an RCT if
a previous one showed one treatment superior. An IRB in a different institution
should allow repetition of an RCT if a controversial treatment was used in a
previous RCT or if the body of other knowledge does not support the results of
the previous RCT.
Unfortunately, the possibility of scientific fraud is relatively high. In
a recent survey (3), 0.3% of US scientists funded by the NIH confessed that
they had falsified or 'cooked' research data, and almost every seventh
(15.3%) indicated that they had dropped observations or data points from an
analysis based on a gut feeling.
It is now probably time to consider the researcher as a source of
possible bias that is not controlled by the use of randomization, and to use
for decision-making regarding RCTs the same causation criteria of strength,
consistency, specificity, relationship in time, biological gradient,
biological plausibility, coherence of evidence, experiment and analogy
that were put forward by Sir Austin Bradford Hill and are standard for
assessing causation in non-randomized studies.
1. Al-Marzouki S, Evans S, Marshall T, Roberts I. Are these data
real? Statistical methods for the detection of data fabrication in
clinical trials. BMJ 2005;331:267-270.
2. White C. Suspected research fraud: difficulties of getting at the
truth. BMJ 2005;331:281-288.
3. Martinson BC, Anderson MS, de Vries R. Scientists behaving badly.
Nature 2005;435:737-738.
Competing interests: None declared
There are apparent problems with the Al-Marzouki et al analysis as
well as with the study being criticized.
For one thing, Al-Marzouki et al adopt the criticized study's term,
"single-blind", in their final-digit analysis.
However, the criticism uses a DOUBLE-BLIND definition to criticize
what was reported as a "single-blind" experiment.
By standard definition, an unblinded study is one in which the
experimenter(s) and the subjects (clinical subjects) are aware of the
conditions as they are administered.
In a single-blind study, the subjects are unaware of the conditions
as they are administered, but the persons administering the trials are
aware of them.
In a double-blind study, both the persons administering the trials
and the subjects are unaware of the conditions during the trials. For
example, see Bowman, et al, "Textbook of Pharmacology" (6th ed.), Chapter
20, "Clinical trial of new drugs".
Thus, apparently, Al-Marzouki et al applied a double-blind criterion
incorrectly. If the criticized study indeed was single-blind, then the
experimenters were aware of the conditions during trials, and some bias of
the data should be expected. Thus, at least some of the statistical
inference drawn by Al-Marzouki et al was not meaningful. A Bayesian
analysis might have thrown more light on the data than the methods
actually reported.
This doesn't mean that knowing falsification did not occur. But it
is to say that there are serious risks in applying statistics to draw
conclusions about specific occurrences; it's the same problem as with
circumstantial evidence in court, or with epidemiology in general. One
should always err on the side of assuming improbable statistics rather
than dishonesty. The statistics should raise suspicion, but dishonesty
should be inferred only on the basis of more direct evidence.
Competing interests: None declared
The authors use "conventional statistical significance tests" to
compare the baseline characteristics of the two randomised groups,
treating the two trials separately. The use of tests such as the t-test or
F-test is questionable here.
There is no indication that the initial diet-trial group of subjects
from which the two subgroups were randomly selected was itself selected
from a well-defined population.
Therefore, it is reasonable to assume that the implied model is what
Lehmann [1, page 5] calls a "Randomization Model." In this model, there
are no populations, and, consequently, there are no population means or
variances and tests about such parameters are meaningless.
Further, since the only source of randomness in the diet and other
similar trials is through the purported randomization, the only purpose
of testing for baseline covariate balance is to check whether there has
truly been random allocation to the two groups. As the randomization is
actually in question in the diet trial, the use of a formal test is
justified here.
When there is no underlying population (distribution), one is forced
to use distribution-free procedures with vague hypotheses [see Lehmann
[1], pages 22-24 and 31-32]. The issue is not one of robustness against
departures from Normality but one of the nonexistence of an underlying
distribution.
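Under the randomization model the natural formal procedure is a permutation (re-randomization) test, in which the reference distribution comes from the allocation itself rather than from an assumed population; here is a minimal sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(1)
group_a = np.array([142.0, 150.0, 147.0, 139.0, 155.0])  # invented baselines
group_b = np.array([148.0, 151.0, 160.0, 149.0, 157.0])

observed = group_a.mean() - group_b.mean()
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)

# Re-randomize the group labels many times; the reference distribution
# comes from the allocation itself, not from any assumed population.
diffs = []
for _ in range(10_000):
    perm = rng.permutation(pooled)
    diffs.append(perm[:n_a].mean() - perm[n_a:].mean())

p = np.mean(np.abs(diffs) >= abs(observed))  # two-sided p-value
print(p)
```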
References:
[1] Lehmann EL, D'Abrera HJM. Nonparametrics: Statistical Methods
Based on Ranks. Holden-Day, 1975.
Competing interests: None declared