Are these data real? Statistical methods for the detection of data fabrication in clinical trials
BMJ 2005; 331 doi: https://doi.org/10.1136/bmj.331.7511.267 (Published 28 July 2005) Cite this as: BMJ 2005;331:267
- Sanaa Al-Marzouki1, research student,
- Stephen Evans (stephen.evans{at}Lshtm.ac.uk), professor of pharmacoepidemiology, Medical Statistics Unit1,
- Tom Marshall, senior lecturer in medical statistics1,
- Ian Roberts, professor of epidemiology and public health1
- 1 Department of Epidemiology and Population Health, London School of Hygiene and Tropical Medicine, London WC1E 7HT
- Correspondence to: S Evans
- Accepted 15 July 2005
Abstract
Objectives To test the application of statistical methods to detect data fabrication in a clinical trial.
Setting Data from two clinical trials: a trial of a dietary intervention for cardiovascular disease and a trial of a drug intervention for the same problem.
Outcome measures Baseline comparisons of means and variances of cardiovascular risk factors; digit preference overall and its pattern by group.
Results In the dietary intervention trial, variances for 16 of the 22 variables available at baseline were significantly different, and 10 significant differences were seen in means for these variables. Some of these P values were extraordinarily small. Distributions of the final recorded digit were significantly different between the intervention and the control group at baseline for 14/22 variables in the dietary trial. In the drug trial, only five variables were available, and no significant differences between the groups for baseline values in means or variances or digit preference were seen.
Conclusions Several statistical features of the data from the dietary trial are so strongly suggestive of data fabrication that no other explanation is likely.
Introduction
Most statistical analyses of clinical trials are undertaken on the presumption that the data are genuine. Large accidental errors can be detected during data analysis,1 2 but if people are trying to “make up” data they are likely to do it in such a way that it is not immediately obvious, avoiding any large discrepancies. Nevertheless, fraudulent data have particular statistical features that are not evident in data containing accidental errors, and several analytical methods have been developed to detect fraud in clinical trials.3 4 The BMJ has taken a general interest in this field and has published a book on fraud and misconduct, now in its third edition, which has a chapter on statistical methods of detection of fraud.5
In this paper we use statistical techniques to examine data from two randomised controlled trials. In one trial, the possibility of scientific misconduct had been raised by BMJ referees, based on inconsistencies in calculated P values compared with the means, standard deviations, and sample sizes presented (see p 281). For comparison, we used the same methods to analyse a second trial for which there were no such concerns. We were not involved in either trial.
Methods
The trial about which doubts were raised (the diet trial) was a single blind, randomised controlled trial of the effects of a fruit and vegetable enriched diet in 831 patients with coronary heart disease, including patients with angina pectoris, myocardial infarction, or surrogate risk factors. Study participants were stated to be randomly allocated to the intervention diet (Group I, n = 415) or to the control group, which was the patient's usual diet (Group C, n = 416). The aim was to examine the effect of the intervention diet on risk factors for coronary artery disease after two years. We do not present data from the two year follow-up, because differences between groups could arise as a result of the interventions. After the reviewers had expressed suspicions about the integrity of the data, the BMJ requested the original trial data. These were provided by the trial's first author on handwritten sheets, which we entered on to computer, making appropriate checks to avoid transcription errors. The data are considered in the two randomised groups at baseline, Group I and Group C.
The second (“drug”) trial was a randomised controlled trial of the effects of drug treatment in 21 750 patients with mild hypertension from 31 centres, from which we randomly selected five centres with 838 patients who had complete data for the selected variables. Study participants were randomly allocated to receive the drug (Group I, N = 403) or a placebo (Group C, N = 435). The aim was to determine whether drug treatment reduced the occurrence of stroke, death due to hypertension and coronary events in men and women aged 35-64 years, when followed for two years (again we do not present data from the follow-up). The drug trial data were provided by the trial investigators as computer files. The data are presented by treatment group (I or C) at baseline, using the same notation as for the diet trial. The variables in this study in common with the diet study are weight, diastolic blood pressure, systolic blood pressure, cholesterol measurements, and height. Further details of the methods and results from that trial have been published.6
Statistical methods
We conducted various tests on the baseline data of the randomised groups in both trials, looking for patterns that might indicate that the data in the diet trial were not generated by the normal process of making and recording individual measurements on a series of patients. We used the data from the drug trial for comparison, since we expected them to show patterns typical of data collected normally during a trial.
Using basic descriptive statistics and conventional statistical significance tests we compared the baseline data in the randomised groups in both trials. In a randomised trial, the data at baseline should be similar in the randomised groups. (The mean, the variability, the shape of the distribution of the data, and the pattern of data resulting from the methods of measurement must be similar since the groups can differ from one another only by chance factors.) This is why, in general, tests for statistical significance are not conducted at baseline in genuine trials. If such tests are carried out, about one in 20 of them will be significant at the 5% level purely by chance. We used t tests to compare the means of the randomised groups and F tests to compare the variances (standard deviations).
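The baseline comparison of means and variances can be sketched from summary statistics alone. The Python below is an illustration, not the SPSS analysis used in this paper: the function name `compare_baseline` and the example figures are hypothetical, and the exact t and F reference distributions are replaced by large-sample normal approximations, which is an assumption that is reasonable only for groups of several hundred patients with roughly normal data.

```python
import math

def compare_baseline(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, F, p_means, p_variances) for two randomised groups.

    P values are two-sided and use large-sample normal approximations
    (an assumption of this sketch, not the exact t and F distributions).
    """
    # Pooled-variance two-sample t statistic for the difference in means.
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    t = (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

    # Variance ratio F for the difference in variability.
    f = sd1 ** 2 / sd2 ** 2

    # For large groups, t is approximately standard normal.
    # erfc keeps precision in the far tail, where P values are very small.
    p_means = math.erfc(abs(t) / math.sqrt(2))

    # For normal data, ln(F) is approximately normal with mean 0 and
    # variance 2/(n1 - 1) + 2/(n2 - 1).
    se_log_f = math.sqrt(2 / (n1 - 1) + 2 / (n2 - 1))
    p_variances = math.erfc(abs(math.log(f)) / se_log_f / math.sqrt(2))

    return t, f, p_means, p_variances

# Hypothetical summary values; only the group sizes (415, 416) are from the diet trial.
t, f, p_means, p_variances = compare_baseline(165.0, 6.0, 415, 165.2, 9.0, 416)
print(f"t = {t:.2f}, F = {f:.2f}, P(means) = {p_means:.3g}, P(variances) = {p_variances:.3g}")
```

With real data an exact t and F calculation from a statistics package would be preferable; the approximation is used here only so that the sketch needs nothing beyond the standard library.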
Data that are recorded (or invented) by people (as opposed to machines) tend to show preferences for certain numbers, such as rounding to the nearest 5 or 10. This is seen in the last recorded digit of numbers, and is called “digit preference.” Digit preference can occur in all legitimate data based on human recording, but any pattern of this preference should be similar between groups formed just by a chance process, namely randomisation. We used χ² tests to examine whether there was any tendency for the last digit to take on particular values and whether any observed digit preference was the same in the two groups created by randomisation. We used SPSS, version 12.0.1 (Chicago, USA), for our data analysis.
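The first of these digit tests, whether last digits depart from a uniform distribution, can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the helper names `chi2_sf` and `last_digit_test` are hypothetical, the values are assumed to be supplied exactly as recorded (as strings), and the upper-tail probability is computed from a closed-form series valid for whole-number degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df."""
    y = x / 2.0
    if df % 2 == 0:
        # Even df: finite Poisson-type sum.
        return math.exp(-y) * sum(y ** k / math.factorial(k) for k in range(df // 2))
    # Odd df: start from erfc (the df = 1 case) and add the recurrence terms.
    m = (df - 1) // 2
    tail = sum(y ** (k + 0.5) / math.gamma(k + 1.5) for k in range(m))
    return math.erfc(math.sqrt(y)) + math.exp(-y) * tail

def last_digit_test(recorded):
    """Chi-squared goodness-of-fit of final digits against a uniform distribution."""
    digits = [int(str(v).strip()[-1]) for v in recorded]  # final recorded digit
    expected = len(digits) / 10                           # equal share per digit
    chi2 = sum((digits.count(d) - expected) ** 2 / expected for d in range(10))
    return chi2, chi2_sf(chi2, 9)                         # 10 digits give 9 df

# Hypothetical readings heaped on 0 and 5, as rounding to the nearest 5 would produce.
heaped = ["120", "85", "130", "90", "125", "80", "140", "95", "110", "75"] * 10
chi2, p = last_digit_test(heaped)
print(f"chi-squared = {chi2:.1f}, P = {p:.3g}")
```

A small P value here indicates digit preference only; as the text notes, that is expected in honestly recorded data and is not in itself a sign of misconduct.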
Results
Table 1 shows descriptive summaries of variables common to both trials for both groups in each trial. The drug trial values show what might be expected in a randomised trial, but the diet trial shows notable differences in standard deviations for height and cholesterol measurements.
Table 2 shows for each trial the results of t and F tests, for differences in means and also in variances between the intervention and control groups at baseline for all available variables. In a genuine trial, correctly randomised, any such differences would be due to chance. Usually P values should not be quoted to greater precision than P < 0.001, but because of the extreme nature of these P values, their exact value is given. In the diet trial, differences in variances were significant for 16 of the 22 variables that were available, as were 10 differences in means for these variables. Several of the P values were extraordinarily small. The expectation is that about 5% of such comparisons would have P < 0.05, and extremely small P values should not occur. In the drug trial, none of the baseline means and none of the baseline variances showed statistically significant differences between the two groups, though only five variables were compared.
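The "about 5%" expectation can be turned into rough arithmetic. The sketch below assumes, purely for illustration, that the 22 baseline comparisons are independent; in practice baseline variables are correlated, so the result is only a guide to the order of magnitude. The function name `binom_tail` is hypothetical.

```python
from math import comb

def binom_tail(n, p, k):
    """Probability of k or more 'significant' results among n independent tests."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_tests, alpha = 22, 0.05

# Expected number of significant results by chance alone (22 x 0.05, about 1.1).
expected_significant = n_tests * alpha

# Chance of seeing at least 16 significant results, as for the variances
# in the diet trial, if every test were independent with a 5% false positive rate.
p_at_least_16 = binom_tail(n_tests, alpha, 16)

print(f"expected by chance: {expected_significant:.1f}")
print(f"P(16 or more of 22): {p_at_least_16:.2e}")
```

Under these assumptions roughly one significant result would be expected, and 16 or more would be vanishingly improbable.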
Table 3 shows the analysis of digit preference, assuming a uniform distribution of last digits. In the diet trial, all of the χ² values were highly significant, indicating that all the variables showed strong digit preference, although some preference is not unexpected. Digit preference was also evident for the results of a laboratory cholesterol test, which is unexpected since human estimation of the results is not usual. Measurements of height were not supplied for the diet trial (they were derivable from body mass index and weight for means, but this is not relevant for digit preference). In the drug trial, the χ² value was highly significant for height (indicating strong digit preference as might be expected) but not for any of the other measures. Blood pressure measurement used a random zero machine, intended to remove digit preference.
Table 4 shows the results of χ² testing for a difference in the pattern of digit preference between the two groups created by randomisation. This allows for the fact that digit preference can occur, but this should show a similar pattern in each of the randomised groups. In the diet trial, the final digit distributions are significantly different between the intervention group and the control group at baseline for all variables apart from cholesterol, fasting blood glucose, caffeine, carotene, and vitamin A. In the drug trial, the two randomised groups are far from being significantly different in terms of the final digit.
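The between-group comparison of digit patterns can be illustrated with a 2 × 10 table of final digits. The sketch below computes the usual chi-squared statistic for homogeneity but, as a deliberate substitution, obtains the P value by a permutation test (repeatedly re-allocating patients to groups at random) rather than from the chi-squared distribution used in the paper. The function names, the seed, and the example digits are hypothetical.

```python
import random
from collections import Counter

def homogeneity_chi2(digits_a, digits_b):
    """Chi-squared statistic comparing final-digit distributions of two groups."""
    count_a, count_b = Counter(digits_a), Counter(digits_b)
    n_a, n_b = len(digits_a), len(digits_b)
    chi2 = 0.0
    for d in range(10):
        pooled = count_a[d] + count_b[d]
        if pooled == 0:
            continue  # digit never recorded: contributes nothing
        exp_a = pooled * n_a / (n_a + n_b)
        exp_b = pooled * n_b / (n_a + n_b)
        chi2 += (count_a[d] - exp_a) ** 2 / exp_a + (count_b[d] - exp_b) ** 2 / exp_b
    return chi2

def permutation_p(digits_a, digits_b, n_perm=999, seed=0):
    """P value from re-randomising group labels (mimics random allocation)."""
    rng = random.Random(seed)
    observed = homogeneity_chi2(digits_a, digits_b)
    pooled = list(digits_a) + list(digits_b)
    n_a = len(digits_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if homogeneity_chi2(pooled[:n_a], pooled[n_a:]) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical final digits: one group heaped on 0 and 5, the other spread evenly.
group_i = [0] * 50 + [5] * 50
group_c = list(range(10)) * 10
chi2, p = permutation_p(group_i, group_c)
print(f"chi-squared = {chi2:.1f}, permutation P = {p:.3f}")
```

The permutation approach is chosen here because it mirrors the argument in the text directly: if allocation were truly random, shuffled group labels should reproduce the observed difference in digit pattern reasonably often.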
Discussion
The data from the diet trial have various anomalous statistical features that are not present in the data from the drug trial. These features are differences in means, and, even more noticeable, in variances at baseline and in differences in pattern of digit preference between randomised groups.
Magnitude of P values
These differences in the means and variances between baseline variables in the diet trial indicate that the two groups simply cannot have been formed as a result of random allocation as the authors claim. The magnitude of the P values derived from t tests of these differences for several variables is not compatible with a chance effect. One or two variables might show a small effect, but several of these P values are extreme. Similarly, the significant difference in the pattern of digit preference between the randomised groups provides additional evidence that this is not a truly randomised trial.
Randomisation process
If this is not a randomised trial then how did these data arise? One possibility is that the data themselves are genuine but that the randomisation process has been subverted. This might explain, for example, some of the differences between the means of the variables at baseline. However, had the randomisation process been subverted (for example, to create differences between the groups at baseline), the differences would have been smaller and also more consistent between variables that are medically related, such as the different measures of cholesterol, which show entirely different patterns between the groups. As it is, some differences are extreme and other variables are no different between the groups. What is more difficult to explain on the basis of subversion of the randomisation is the difference in the variability at baseline. Here we have highly significant differences in some variables both for the variances and the means, whereas for height, complex cholesterol, and triglyceride, there are highly significant differences in the variances but not in the means. Had there been a tendency to put patients with, say, higher blood pressures into one group, then we might have found significant differences in the mean values but with no difference in variance. However, we did not find this. Furthermore, no clear differences were apparent in the means for variables that would be readily available to a physician or health professional at the time of recruitment.
What is already known on this topic
Data fabrication is a rare form of scientific misconduct in clinical trials, but when it does occur it has serious consequences
Most papers are published without their data being independently verified, and there have been calls for data to be made available for scrutiny
Statistical methods for the detection of misconduct have been described, but few examples of their application have been published
It has been stated that statistical methods alone cannot prove data fabrication
What this study adds
Statistical methods can be applied to detect large scale fabrication of data in a randomised trial where data are available
Certain patterns of data are incompatible with randomisation, especially when a trial is “blind”
This paper shows the fabrication or falsification of data in a particular trial
Digit preference
Digit preference in itself is not evidence of misconduct. It is conceivable that the different patterns of digit preference between the two randomised groups may have arisen had one person recorded data for the treatment group and another recorded data for the control group. However, it is claimed that the trial was single blind, meaning that those recording data should not know to which group patients had been allocated. We would not expect differences therefore in digit preference between the randomised groups. But perhaps the trial was not single blind as described, and those recording the data were separated into groups according to whether they were dealing with patients allocated to either treatment or control. This could lead to differences in digit preference between randomised groups for variables where a human element of judgment was required. This would still not explain the differences in means and variances between the two groups since the effect of digit preference on the means and variances would only be slight. The combination of the differences in means, variances, and digit preference between the randomised groups is strong evidence that data fabrication took place in the diet trial.
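The claim that digit preference has only a slight effect on means and variances can be checked with a small simulation. The sketch below uses hypothetical values (a blood-pressure-like measurement with mean 80 and standard deviation 10, rounded to the nearest 5) and a hypothetical function name `heaping_effect`; it is an illustration of the general point, not an analysis of either trial.

```python
import random
import statistics

def heaping_effect(n=10000, mean=80.0, sd=10.0, unit=5, seed=1):
    """Compare mean and SD of simulated values before and after rounding."""
    rng = random.Random(seed)
    raw = [rng.gauss(mean, sd) for _ in range(n)]
    # Strong digit preference: every value rounded to the nearest multiple of `unit`.
    heaped = [unit * round(v / unit) for v in raw]
    return (statistics.mean(raw), statistics.stdev(raw),
            statistics.mean(heaped), statistics.stdev(heaped))

mean_raw, sd_raw, mean_heaped, sd_heaped = heaping_effect()
print(f"raw:    mean = {mean_raw:.2f}, SD = {sd_raw:.2f}")
print(f"heaped: mean = {mean_heaped:.2f}, SD = {sd_heaped:.2f}")
```

Rounding to a unit of width h adds roughly h²/12 to the variance (about 2 for h = 5, against a variance of 100 here), so even complete rounding to the nearest 5 changes the standard deviation by about 1% and leaves the mean essentially unchanged.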
Conclusion
We conclude that the data from the diet trial were either fabricated or falsified and that the strength of the evidence is such that appropriate steps should be taken to deal with this matter.
Footnotes
See also p 281, and Editorial by Smith and Godlee
We thank Tom Meade who, on behalf of the Medical Research Council, provided the data for the drug trial and Richard Smith for his encouragement to examine further the data from the diet trial. The BMJ provided the data from the diet trial, which were supplied by the original author for further investigation of these data.
Contributors SE and SAM had the ideas for the analysis, and SAM, SE, TM, and IR all contributed to the planning, conduct, and writing of the paper. SAM planned and carried out the statistical analyses. SAM and SE are jointly responsible for the overall content as guarantors. There are no other contributors.
Competing interests None declared.
Funding None.