BMJ 2005;330:929 (23 April), doi:10.1136/bmj.38377.675440.8F (published 15 April 2005)
Mike Harley, director1, Mohammed A Mohammed, senior research fellow2, Shakir Hussain, statistician3, John Yates, professor1, Abdullah Almasri, visiting statistician3
1 Inter-Authority Comparisons and Consultancy, Health Services Management Centre, University of Birmingham, Birmingham B15 2RT, 2 Department of Public Health and Epidemiology, University of Birmingham, Birmingham B15 2TT, 3 Department of Primary Care and General Practice, University of Birmingham
Correspondence to: M Harley M.J.Harley{at}bham.ac.uk
Design A mixed scanning approach was used to identify seven variables from hospital episode statistics that were likely to be associated with potentially poor performance. A blinded multivariate analysis was undertaken to determine the distance (known as the Mahalanobis distance) in the seven indicator multidimensional space that each consultant was from the average consultant in each year. The change in Mahalanobis distance over time was also investigated by using a mixed effects model.
Setting NHS hospital trusts in two English regions, in the five years from 1991-2 to 1995-6.
Population Gynaecology consultants (n = 143) and their hospital episode statistics data.
Main outcome measure Whether Ledward was a statistical outlier at the 95% level.
Results The proportion of consultants who were outliers in any one year (at the 95% significance level) ranged from 9% to 20%. Ledward appeared as an outlier in three of the five years. Our mixed effects (multi-year) model identified nine high outlier consultants, including Ledward.
Conclusion It was possible to identify Ledward as an outlier by using hospital episode statistics data. Although our method found other outlier consultants, we strongly caution that these outliers should not be overinterpreted as indicative of "poor" performance. Instead, a scientific search for a credible explanation should be undertaken, but this was outside the remit of our study. The set of indicators used means that cancer specialists, for example, are likely to have high values for several indicators, and the approach needs to be refined to deal with case mix variation. Even after allowing for that, the interpretation of outlier status is still unclear. Further prospective evaluation of our method is warranted, but our overall approach may be useful in other settings, especially where performance entails several indicator variables.
In common with many other external and internal inquiries, little use was made of comparative data regarding the performance of individual consultants or surgical teams. For over 20 years, routine data sources such as the hospital episode statistics have been widely perceived as being of little value because of problems with completeness and accuracy, and it has been assumed that the type of information required to identify poor performance would necessitate a new data collection system. The Department of Health proposed the introduction of a "near miss" reporting system and dismissed the use of hospital episode statistics for identifying poor clinical quality, observing that historically, the uses of these data have concentrated on recording and assessing activity levels and on performance, including technical efficiency.2 Much is of variable quality and equally variable relevance to the quality and outcomes of the care that the NHS provides.2
Despite these concerns, hospital episode statistics data were used in the Bristol inquiry,3 albeit not to study the work of individual surgeons or teams. The conclusion of the subsequent Kennedy report regarding hospital episode statistics was unequivocal; hospital episode statistics "was [sic] not recognised as a valuable tool for analysing the performance of hospitals. It is now, belatedly." This paper explores this theme, by comparing the performance of 142 gynaecology consultants with the performance of Ledward over a period of five years, to determine if Ledward was a statistical outlier according to hospital episode statistics data.
We obtained complications by scanning all seven diagnostic fields of hospital episode statistics for International Classification of Diseases, 9th edition (ICD-9) codes 996-999 and ICD-10 codes T80-T88: "Complications of surgical and medical care not elsewhere classified."
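The scanning step described above can be sketched in a few lines. This is a minimal illustration, assuming each episode is held as a list of up to seven diagnosis code strings; the function names (`has_complication`, `complication_rate`) and the idea of summarising the result as a per-consultant proportion are hypothetical and are not taken from the study.

```python
import re

# Code ranges named in the text: ICD-9 996-999 and ICD-10 T80-T88.
ICD9_COMPLICATION = re.compile(r"^99[6-9]")
ICD10_COMPLICATION = re.compile(r"^T8[0-8]")


def has_complication(diagnosis_fields):
    """Return True if any diagnostic field carries a complication code."""
    for code in diagnosis_fields:
        if not code:  # skip empty or missing fields
            continue
        code = code.strip().upper()
        if ICD9_COMPLICATION.match(code) or ICD10_COMPLICATION.match(code):
            return True
    return False


def complication_rate(episodes):
    """Proportion of episodes with at least one complication code.

    Hypothetical summary; the study's exact indicator definition may differ.
    """
    if not episodes:
        return 0.0
    flagged = sum(1 for fields in episodes if has_complication(fields))
    return flagged / len(episodes)


if __name__ == "__main__":
    episodes = [["6262", "", "9983"], ["N920"], ["T810", "N920"], ["6262"]]
    print(complication_rate(episodes))
```

Matching on the three-character prefix means that both dotted and undotted forms of a code (for example `998.3` and `9983`) are caught, which may or may not suit a given extract.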
We then calculated each indicator for each of the years from 1991-2 to 1995-6 for Ledward, his three colleagues in the same hospital, and all the gynaecologists in one other region, the West Midlands. The West Midlands data contained only anonymised consultant codes. At the time of our study, reliable data were not readily available for the whole of the region in which Ledward practised, so we were able to use the data only for Ledward's own hospital.
We undertook a retrospective desktop statistical analysis to determine whether Ledward could be identified as a statistical outlier. We assigned a study code to all consultants. Throughout the analysis, the analysts (SH and MAM) were blinded to the code of Ledward. The analysis proceeded in three stages.
Stage 1
Exploratory data analysis. In all, 143 consultants (coded 1-143) were in our data set, of whom 68 appeared in all five years. Table 2 shows the number of consultants in each year and the numbers excluded because of any missing data item. According to Little's D² statistic for missing data in multivariate data sets,8 the pattern of missing data was consistent with data missing at random (P < 0.0005).
Stage 2
We carried out a multivariate analysis to detect outliers, based on the computation of a robust Mahalanobis distance9 for each consultant in each year. The statistical details are provided in the appendix on bmj.com. For each year we computed, from the variable space of the seven indicators, a Mahalanobis distance for each consultant. The Mahalanobis distance is in essence a measure of the "distance" between the origin in the seven indicator variable space and a given data point. So a consultant with average values for each variable will have a Mahalanobis distance of zero, and this represents the origin. Consultants who are furthest away from the origin will have relatively larger distances. For each Mahalanobis distance we also derived an approximate 95% confidence interval, using computer simulation techniques. We randomly simulated each variable, for each consultant, 1000 times from an underlying binomial or normal distribution (the parameters of which were based on the observed data and the sample size). We used this simulated data set to derive 1000 simulated Mahalanobis distances for each consultant, which in turn were used to determine the approximate 95% confidence intervals for each consultant's distance.
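To make the distance calculation concrete, the following is a minimal sketch of a classical Mahalanobis distance using only the Python standard library. It is not the robust estimator used in the study (that relied on the S-PLUS Robust Library); the ordinary sample mean and covariance are swapped in here purely for illustration, and all function names are hypothetical.

```python
import math


def mean_vector(rows):
    """Column means of a list of equal-length numeric rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows, mean):
    """Ordinary (non-robust) sample covariance matrix."""
    n, k = len(rows), len(mean)
    cov = [[0.0] * k for _ in range(k)]
    for row in rows:
        d = [x - m for x, m in zip(row, mean)]
        for i in range(k):
            for j in range(k):
                cov[i][j] += d[i] * d[j]
    return [[v / (n - 1) for v in r] for r in cov]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            f = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * k
    for i in reversed(range(k)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, k))
        x[i] = (m[i][k] - s) / m[i][i]
    return x


def mahalanobis_distances(rows):
    """Distance of each row (one consultant's indicators) from the mean."""
    mean = mean_vector(rows)
    cov = covariance_matrix(rows, mean)
    distances = []
    for row in rows:
        d = [x - m for x, m in zip(row, mean)]
        z = solve(cov, d)  # z = cov^-1 d, without forming the inverse
        q = sum(di * zi for di, zi in zip(d, z))
        distances.append(math.sqrt(max(0.0, q)))
    return distances
```

A consultant whose indicators all equal the mean gets a distance of zero, matching the "origin" described above. With real data, a robust estimate of location and scatter would be preferable, because a few extreme consultants can distort the ordinary mean and covariance.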
The square root of the Mahalanobis distance (√MD) is known to follow approximately a χ² distribution with k degrees of freedom (k being equal to the number of indicator variables, seven in our case),9 and so we used the mean of the χ², which is given by the √k degrees of freedom (√7 = 2.66) to define outliers.9 Consultants with 95% intervals above the 2.66 threshold were deemed to be outliers. We report the number of outlier consultants for each year.
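The simulation and threshold rule can be sketched as follows. This is an illustration under stated assumptions rather than the study's procedure: only binomial indicators are simulated (the study also used normal distributions), the interval is a simple percentile interval, and `distance_fn` is a hypothetical callable that turns one simulated set of indicator values into a distance. The 2.66 threshold is the value reported in the text.

```python
import math
import random


def simulate_proportion(successes, n, rng):
    """Draw one binomial proportion, with p estimated from the observed data."""
    p = successes / n
    return sum(rng.random() < p for _ in range(n)) / n


def percentile_interval(values, level=0.95):
    """Approximate central interval from simulated values (percentile method)."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    last = len(ordered) - 1
    lower = ordered[int(math.floor(tail * last))]
    upper = ordered[int(math.ceil((1.0 - tail) * last))]
    return lower, upper


def simulated_distance_interval(observed, distance_fn, n_sims=1000, seed=0):
    """Interval for one consultant's distance.

    observed: list of (successes, n) pairs, one per binomial indicator.
    distance_fn: hypothetical function mapping simulated proportions
        to a distance from the average consultant.
    """
    rng = random.Random(seed)
    sims = []
    for _ in range(n_sims):
        draw = [simulate_proportion(s, n, rng) for s, n in observed]
        sims.append(distance_fn(draw))
    return percentile_interval(sims)


def is_outlier(interval, threshold=2.66):
    """Flag a consultant only when the whole interval lies above the threshold."""
    return interval[0] > threshold
```

Requiring the lower limit of the interval to clear the threshold is the conservative choice: a consultant with a large point estimate but few cases, and therefore a wide interval, is not flagged.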
Stage 3
We also investigated the change in MD over the five years, using hierarchical analyses for repeated measurements. We constructed a two level hierarchical model, with consultant at level 1 (highest level) and their respective Mahalanobis distances at level 2 (lowest level). We used the standardised residual output from this model (see figure 2) to identify outliers beyond 2 standard deviations.
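For orientation, a two level repeated measurements model of this general kind can be written as below. This is a generic random intercept form, given as an assumption; the model actually fitted is specified in the appendix on bmj.com and may include other terms. The symbols are illustrative: d_ij is the distance for consultant j in year t_ij, u_j is the consultant level effect, and e_ij is the year to year residual.

```latex
\begin{align*}
d_{ij} &= \beta_0 + \beta_1 t_{ij} + u_j + e_{ij}, \\
u_j &\sim N(0, \sigma_u^2), \qquad e_{ij} \sim N(0, \sigma_e^2), \\
r_j &= \frac{\hat{u}_j}{\operatorname{se}(\hat{u}_j)}, \qquad
\text{consultant } j \text{ flagged if } |r_j| > 2 .
\end{align*}
```

Under this form, a consultant is flagged on the basis of a persistent departure across years (the estimated u_j) rather than a single unusual year.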
We used S-PLUS, version 6.1 (Insightful Corporation, Seattle, USA), with the Robust Library, version 1 (Beta II),10 and MLwiN, version 2.1c (University of London, London), for our analyses.
Figure 1 shows the √MD for each consultant for each year, and table 2 summarises the number of outlier consultants.
We also constructed a model to investigate the variation in √MD over time (see bmj.com for further details), which reached significance (P = 0.0043). Figure 2 shows standardised residuals from the model. From this figure, we identified nine high outlier consultants and three low outlier consultants.
After these two analyses, MH revealed the consultant code and confirmed that Ledward was a statistical outlier (in three of the five years of figure 1 and in figure 2). Figure 3 shows the variable values for Ledward. Several other consultants were outliers. Two consultants were outliers in all five years, two consultants were outliers in four years, and seven consultants (including Ledward) were outliers in three years. Exploratory visual examination of the variable values for all these outlier consultants, also using figure 3 (results not shown) did not show any consultant as having consistently low values in all seven indicators.
Potential limitations of the study
The measurement of poor clinical performance in the NHS has no gold standard with which to compare this or any other statistical method,14 because in reality we are unable to calculate sensitivity and specificity of the "test" since we do not know the true underlying state of each subject. Recognising the limitations of statistics in this type of work is therefore important.14 Furthermore, the degree of statistical refinement applied to such problems must be weighed against the more fundamental limitations of the datasets available, their quality, and the role of human judgment in selecting the indicators.
Although we were not unduly hampered by the amount or pattern of missing data, the issue of what to do with subjects who have missing data is important. We excluded these subjects, but this creates the inappropriate impression that consultants with missing data may not be subject to a monitoring process. Although missing or poor quality data (an often cited criticism of hospital episode statistics data12) can hamper all analyses, they may not, as shown in the Bristol analysis,13 radically alter the ability to detect outliers. One statistical strategy to deal with missing data is imputation, although a more fundamental solution is to focus on the reasons for missing data or data of poor quality and deal with this through improved data collection methods as part of the overall monitoring system.14
The use of routine data sets such as hospital episode statistics places an important design constraint on analyses of this kind. Hospital episode statistics contain a limited number of variables, of which only some are potentially useful indicators of quality of care or of factors relating to the case mix of patients. However, this does not imply that analysis of routine data sets is without merit14; in recent years data from hospital episode statistics have been used increasingly.3 15 16
Furthermore, one can easily reduce or increase the number of statistical outlier signals by shortening or widening the intervals of uncertainty, or by using non-robust statistical methods, but it is important to emphasise that this is not a purely statistical question. We must also consider the costs and benefits (including findings) of subsequent investigations. For example, after simulation to determine individual intervals of uncertainty, Ledward was an outlier in three of the five years, but his √MD was above the 95th centile (3.75) in four out of five years (fig 1), indicating that it may be prudent to review consultants with large Mahalanobis distance (say, above the 95th centile) even though the individual interval of uncertainty crosses (only just) the expected mean. So, although the setting of the threshold may be informed by statistical theory, we will ultimately require longer term empirical evidence to determine its utility.
Proposed framework for investigation
One proposed framework for investigation is the pyramid model of investigation.17 The model is based on the premise that the bulk of failure is attributable to the system and not the individual, and so the pyramid prescribes a check of the following variables in the order listed: check the data (recognising that some of the variation between consultants could simply be due to data quality or completeness13 14), check the patient case mix, check the structure, check the process of care, and, finally, carefully check the carers involved. The pyramid model of investigation was applied recently in the case of two general practitioners who were identified via the Shipman inquiry as having "unacceptably" high death rates.17 These general practitioners were found to have large numbers of patients in nursing homes (a factor that was not taken into account in the underlying statistical model), and this credibly explained their high death rates.
Careful handling is essential
In responding to a signal of potentially poor performance we must be alert to some real dangers. For example, the presence of substantial criticism in the media, and even appearance before the General Medical Council, does not guarantee that those so accused are actually guilty of poor performance,18 19 nor does it mean that all the remainder who have not been criticised are performing in an entirely acceptable manner. Once an individual has been publicly identified, the stigma remains,20 and we cannot undo what has been done. These issues are especially important if the explanation for the poor performance is outside the gift of the individual carer.11
Useful methods for monitoring performance
Although scanning methods14 such as ours will never have complete diagnostic certainty, they could be used to reliably identify signals from noise,13 which need to be systematically and sensitively examined, perhaps confidentially, by peers.21 Although our methods urgently need to be evaluated prospectively, organisations engaged in this type of performance monitoring, including the National Patient Safety Agency, the Healthcare Commission, the General Medical Council, the NHS Litigation Authority, and the National Clinical Assessment Authority, may find our methods of interest. Nevertheless, although the ability to identify poorly performing clinicians after the event has its uses, prevention is preferable; but this presents an altogether different challenge, one that seeks to engineer the safety of patients into the process of care by design.
We thank J Duffy for his statistical advice at initial stages of this project; R Penketh, consultant gynaecologist, for his advice on indicators; and R Holder for his advice regarding the limits of uncertainty. We are grateful to S Evans and R Lilford for their critical comments on earlier drafts of the manuscript. Thanks are also due to the Kings Fund for funding the initial part of this work. AA is supported by the Swedish Foundation for International Cooperation in Research and Higher Education.
Contributions: The project team was headed by MH, who also carried out the preliminary analysis and wrote the first draft of the paper. JY secured funding, undertook literature reviews, and was instrumental in the initial design. SH and MAM undertook the statistical analyses. MAM produced the final draft of the paper. AA, with guidance and support from SH and MAM, undertook the simulation work. All authors contributed to the writing of the final paper. MH is guarantor.
Funding: The Kings Fund funded the initial stages of this project.
Competing interests: None declared.