Mortality control charts for comparing performance of surgical units: validation study using hospital mortality data. BMJ 2003;326 doi: https://doi.org/10.1136/bmj.326.7393.786 (Published 12 April 2003). Cite this as: BMJ 2003;326:786
- Paris P Tekkis, research fellow of the Royal College of Surgeons of Englanda,
- Peter McCulloch, senior lecturer in surgeryb,
- Adrian C Steger, consultant surgeonc,
- Irving S Benjamin, professor of surgerya,
- Jan D Poloniecki, senior lecturer in biostatisticsd
- a Academic Department of Surgery, King's College Hospital, London SE5 9RS
- b Academic Unit of Surgery, University of Liverpool, University Hospital Aintree, Liverpool L9 7AL
- c Department of Surgery, University Hospital Lewisham, London SE13 6LH
- d Department of Public Health Sciences, St George's Hospital, London SW17 0QT
- Correspondence to: P P Tekkis
- Accepted 6 February 2002
Objective: To design and validate a statistical method for evaluating the performance of surgical units that adjusts for case volume and case mix.
Design: Validation study using routinely collected data on in-hospital mortality.
Data sources: Two UK databases, the ASCOT prospective database and the risk scoring collaborative (RISC) database, covering 1042 patients undergoing surgery in 29 hospitals for gastro-oesophageal cancer between 1995 and 2000.
Statistical analysis: A two level hierarchical logistic regression model was used to adjust each unit's operative mortality for case mix. Crude or adjusted operative mortality was plotted on mortality control charts (a graphical representation of surgical performance) as a function of number of operations. Control limits defined as 90%, 95%, and 99% confidence intervals identified units whose performance diverged significantly from the mean.
Results: The mean in-hospital mortality was 12% (range 0% to 50%). The case volume of the units ranged from one to 55 cases a year. When crude figures were plotted on the mortality control chart, four units lay outside the 90% control limit, including two outside the 95% limit. When operative mortality was adjusted for risk, three units lay outside the 90% limit and one outside the 95% limit. The model fitted the data well and had adequate discrimination (area under the receiver operating characteristics curve 0.78).
Conclusions: The mortality control chart is an accurate, risk adjusted means of identifying units whose surgical performance, in terms of operative mortality, diverges significantly from the population mean. It gives an early warning of divergent performance. It could be adapted to monitor performance across various specialties.
What is already known on this topic
What is already known on this topic League tables are an established technique for ranking the performance of organisations such as healthcare providers
Mortality control charts are another way to compare the performance of healthcare providers, particularly for outcomes of surgery
What this study adds
What this study adds Mortality control charts can be adjusted for case mix and case volume and are better than league tables for monitoring surgical performance
Mortality control charts have a “buffer zone” for indicating divergence from the mean mortality and are particularly useful for specialties with a low volume of surgery
Public concern in the United Kingdom after the Bristol inquiry into cardiac surgery is reflected in mounting pressure for open scrutiny of surgical outcomes.1 For some major types of surgery, operative mortality is an important measure of performance. To reflect performance accurately, however, mortality must be adjusted for the effect of pre-existing comorbid disease. Existing models of risk stratification have several problems. Increasing specialisation of surgery means that regression models developed from “general surgical” cohorts are inappropriate. Existing models are also poor at interpreting large fluctuations in crude mortality caused by a few deaths in units with a small volume of surgery. Lastly, the assumption that relations between predictive variables and mortality are identical across units may obscure factors affecting mortality that are specific to particular units.
Gastrectomy and oesophagectomy have the highest mortality among elective operations in Britain. Patients with gastro-oesophageal cancer often have other serious conditions that increase the risks of surgery. The provision of surgery for upper gastrointestinal cancer is undergoing major reorganisation in Britain, favouring subspecialisation and centralisation and causing major changes in the case mix of surgery units. Directly comparing operative mortality in specialist units with a high volume of elective surgery with mortality in district hospitals with a low volume of high risk gastrointestinal emergencies can be misleading. Evidence about the relation between case volume and outcome conflicts.2–4 The subspecialty of upper gastrointestinal cancer surgery exemplifies the general problem of quantifying surgical risk with adjustment for case mix and volume. We developed statistical techniques for evaluating surgical performance on a continuous scale and applied the techniques to data on upper gastrointestinal cancer surgery.
Data and methods
We took data on outcomes of gastro-oesophageal cancer surgery from two databases on upper gastrointestinal surgery: the stomach and oesophageal cancer outcome and techniques (ASCOT) prospective database and the risk scoring collaborative (RISC) database. There was no population overlap between the databases. Both databases provided comprehensive POSSUM (physiological and operative severity score for the enumeration of mortality and morbidity) data on large cohorts of gastro-oesophageal surgery patients.5
The ASCOT prospective database— This database on gastro-oesophageal cancer surgery, which was developed by the British Oesophago-Gastric Cancer Group, collects a comprehensive dataset on cases of gastro-oesophageal cancer referred to surgeons, whether or not an operation actually took place.6 The data include patients' demographic details, preoperative assessment, tumour staging, type of surgery, postoperative course, and pathology. For this study the database's coordinator used an independent source (hospital episode statistics) to validate a sample of 157 cases. From January 1999 to December 2000 the 31 hospitals across the United Kingdom that joined this voluntary collaboration submitted data on 1036 cases.
The RISC database— This database recorded data on 601 patients undergoing oesophageal and gastric surgery in five hospitals in the South East and Thames Region, which included cases from general and thoracic surgical units. Of the cases, 351 were recorded retrospectively from pre-existing databases, case notes, theatre books, and operating lists, and 250 were prospectively collected from January 1999 to January 2001. The data were independently validated against other hospital data sources (medical records or mortuary registers).
Inclusion and exclusion criteria
We included data on oesophageal and gastric operations for malignant and benign disease with palliative or curative intent. We excluded cases where patients were treated medically or by endoscopic techniques (n=572) and cases with missing notes (n=23).
End point and risk factors
The primary end point was in-hospital mortality (any death during the same hospital admission as the operation), which can be more reliably quantified than 30 day mortality and includes patients with complications who remained in hospital beyond 30 days. Risk factors studied were age; sex; POSSUM score; surgical procedure (as classified by the Office of Population Censuses and Surveys' list of surgical operations and procedures, fourth revision (OPCS4))7; mode of surgery (emergency or elective); tumour staging (according to the International Union Against Cancer (UICC) system, fifth edition)8; and malignancy (according to POSSUM category).
We used univariate analysis to identify risk factors for mortality. Continuous variables were grouped into subcategories, and unifactorial logistic regression was used to compare these with a reference level. We used the χ2 test to analyse categorical variables. To maximise information extracted by the model, we used the multiple imputation technique to substitute for incomplete data. 9 10
We used a multifactorial logistic regression model to adjust for different hospitals' case mix. We constructed a two level hierarchical regression model to allow for clustering of outcomes among patients from the same hospital. Risk factors, including their interaction terms relating to individual patients, were entered into the first level of the model, while hospitals constituted the second level of the model, whose coefficients were allowed to vary randomly between units. We calculated expected mortality for each unit by excluding each unit in turn and modelling the remaining centres (a cross validatory approach).11 The ratio of observed to expected mortality for each unit was multiplied by the mean mortality from the pooled data to derive each unit's risk adjusted operative mortality. We used a non-parametric bootstrap resampling technique with 10 000 iterations to calculate standard errors and to correct parameter estimation bias. We calculated exact binomial 95% confidence intervals for the observed mortality and risk adjusted operative mortality for each unit.
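The adjustment and bootstrap steps can be illustrated in outline. The Python sketch below is for illustration only (the analysis itself used Stata, NORM, and MLwiN): it scales the observed to expected mortality ratio by the pooled mean and estimates a standard error by non-parametric bootstrap resampling. The name `predicted_risks` is hypothetical, standing in for the per-patient mortality probabilities from a model fitted with the unit in question excluded.

```python
import random

def risk_adjusted_mortality(deaths, predicted_risks, pooled_mean):
    """Unit's risk adjusted operative mortality: the ratio of observed to
    expected deaths, multiplied by the pooled mean mortality."""
    observed = sum(deaths)                 # 0/1 outcome per patient
    expected = sum(predicted_risks)        # model-based risk per patient
    return observed / expected * pooled_mean

def bootstrap_se(deaths, predicted_risks, pooled_mean,
                 iterations=10_000, seed=1):
    """Standard error of the risk adjusted mortality, by resampling
    patients with replacement (non-parametric bootstrap)."""
    rng = random.Random(seed)
    n = len(deaths)
    estimates = []
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(risk_adjusted_mortality(
            [deaths[i] for i in idx],
            [predicted_risks[i] for i in idx],
            pooled_mean))
    mean = sum(estimates) / iterations
    return (sum((e - mean) ** 2 for e in estimates) / (iterations - 1)) ** 0.5
```

When observed and expected deaths agree, the adjusted mortality simply equals the pooled mean; units with more deaths than their case mix predicts are scaled above it.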
Validation of the model— To evaluate the performance of the model we used the Hosmer-Lemeshow ĉ statistic to assess calibration or goodness of fit (the ability of the model to assign correct outcome probabilities to individual patients) and the area under the receiver operating characteristics (ROC) curve to assess discrimination (the ability of the model to assign higher risks to patients who die than to patients who live). 12 13 Values for the area under the ROC curve from 0.7 to 0.8 indicate reasonable discrimination and values exceeding 0.8 indicate good discrimination.
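The discrimination measure has a simple rank interpretation: the area under the ROC curve is the probability that a randomly chosen patient who died was assigned a higher predicted risk than a randomly chosen survivor, with ties counting one half. A minimal pure-Python rendering of that definition (illustrative only, not the software used in the study) is:

```python
def roc_auc(risks, outcomes):
    """Area under the ROC curve: the probability that a randomly chosen
    death received a higher predicted risk than a randomly chosen
    survivor (ties count half)."""
    dead = [r for r, y in zip(risks, outcomes) if y == 1]
    alive = [r for r, y in zip(risks, outcomes) if y == 0]
    wins = sum((d > s) + 0.5 * (d == s) for d in dead for s in alive)
    return wins / (len(dead) * len(alive))
```

By this reading, the study's value of 0.78 means that in 78% of death-survivor pairs the model assigned the higher risk to the patient who died.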
Mortality control chart— This graphical method for monitoring surgical performance plots units' mortality as a function of number of operations. The exact binomial distribution is used to construct control limits (90%, 95%, and 99% confidence intervals) around the mean operative mortality for the group. These control limits indicate whether a particular unit's operative mortality differs significantly from the mean at 10%, 5%, and 1% significance levels. Each unit's operative mortality (unadjusted or adjusted for case mix) can be plotted as a single point representing the total mortality or as a running mean as a function of the number of operations done. Underperforming units will lie above the upper control limits, while units with unusually good results will lie below the lower control limits. Units lying within the 95% control limits have an operative mortality that is statistically consistent with the group mean.
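The construction of the control limits can be sketched directly from the exact binomial distribution. The following Python sketch (illustrative, assuming only a group mean mortality `p` and a unit's case volume `n`) finds, for each volume, the smallest mortality proportion whose upper tail probability falls below α/2 and the largest whose lower tail probability falls below α/2; plotting these against n traces the funnel-shaped limits:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), by direct summation."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def upper_limit(n, p, alpha=0.05):
    """Smallest mortality proportion d/n whose upper tail probability
    P(X >= d) under the group mean p falls below alpha/2."""
    for d in range(n + 1):
        if 1 - binom_cdf(d - 1, n, p) < alpha / 2:
            return d / n
    return 1.0

def lower_limit(n, p, alpha=0.05):
    """Largest mortality proportion d/n whose lower tail probability
    P(X <= d) under the group mean p falls below alpha/2."""
    for d in range(n, -1, -1):
        if binom_cdf(d, n, p) < alpha / 2:
            return d / n
    return 0.0
```

Because the binomial tails widen sharply at small n, the limits for a unit doing ten operations a year are far wider than for one doing a hundred, which is what protects low volume units from being flagged on a handful of deaths.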
Statistical software— We used Intercooled STATA 6.0 for Windows (StataCorp, College Station, TX), NORM Version 2.03 for Windows (Pennsylvania State University, PA), and MLwiN Version 2.1c (University of London, London).
Of 1637 cases, 1042 (63.7%) satisfied the inclusion criteria: 497 of 1036 cases (47.9%) in the ASCOT database and 545 of 601 cases (90.7%) in the RISC database. Although 36 hospitals contributed data to the study, the analysis was based on data from 29 centres, as seven units did not contribute operated cases and were therefore excluded. The cases comprised 538 oesophagectomies (51.6%), 443 gastrectomies (42.5%), and 61 palliative bypass procedures (5.9%) (table 1). Of the operations, 828 (79.5%) were elective and 78 (8.6%) were emergencies; in 136 cases (13.1%) the mode of surgery was not recorded. Nine hundred and nineteen operations (93.7%) were for cancer. The overall in-hospital operative mortality was 12% (9.4% in patients having an elective procedure and 26.9% in patients having an emergency procedure). No evidence of systematic under-reporting of risk factors was shown, and missing data were distributed evenly among the hospitals.
We used the two level hierarchical logistic model, together with the overall median regression line, to calculate the relations between age of patients and operative mortality (figure 1) and between preoperative POSSUM physiological score and operative mortality for each of the 29 hospitals (figure 2). Case mix (based on POSSUM scores) varied significantly across units, as shown in figure 2 by the different ranges in POSSUM score (Kruskal-Wallis test: χ2=62.159, df=28, P<0.0001).
The final multifactorial model used age, POSSUM score, POSSUM malignancy category, and mode of surgery as risk factors (table 2). Mode of surgery was retained in the model as it is clinically highly relevant and has been reported as an important predictor of outcome.2 The model fitted the data well (Hosmer-Lemeshow ĉ statistic: χ2=10.139, df=8, P=0.255) and had adequate discrimination (area under the ROC curve 0.78 (standard error 0.02)).
Units reported between one and 55 operations a year, with mortality ranging from 0% to 50%. The mortality control chart for unadjusted operative mortality shows that four units lay outside the 90% control limit (figure 3). When operative mortality was adjusted for case mix, however, no unit was shown to underperform at the 95% control limit, and the individual values regressed towards the mean (figure 4). Two units had better results than the group average, with risk adjusted operative mortalities of 4.2% and 3.8%. Figure 5 shows the running means of the risk adjusted operative mortality for two of the units (31 and 33), representing two consecutive series of 102 and 166 cases. Despite fluctuations, unit 31 remained within the central part of the graph, whereas unit 33 repeatedly crossed the lower 99% control limit and thus could be said to be a truly outlying unit and a consistently good performer.
The mortality control chart improves on current methods of evaluating surgical units' performance. It is an accurate, risk adjusted means of identifying outlying units while giving an early warning of units approaching divergence from the mean.
Validity of the data
The information in the study was a combination of prospective data and medical records. Centres contributed data voluntarily, and at present there is no formal system for externally validating the completeness of the database. Internal validity was established by comparing the operative mortality for a random sample of five participating hospitals (157 patients) with hospital episode statistics obtained independently from the hospitals' information departments. The two sources reported similar overall mortality (14% in the ASCOT data and 13.8% in the hospital episode statistics), but they differed in the individual hospitals' volumes of operations and in the variability of mortality. Although overall operative mortality in the units in our study was consistent with recently published data from the West Midlands region, our units were not randomly selected, and we cannot be sure how representative they are of all UK hospitals. However, although the quality of our data is limited, implementation of such a monitoring system in hospitals should lead to increased awareness of the data that need to be collected, with subsequent improvement in data quality.
Quality of the statistical analysis
Hierarchical regression models are particularly useful in modelling observations with a hierarchical or clustered structure, such as patients in different hospitals or pupils in different schools.14 Such models avoid the penalty for ignoring the clustered nature of data on patients in hospitals—namely, an erroneously low standard error of regression coefficients.15 Hierarchical models acknowledge heterogeneity among units and assume that the variability between hospitals approximates a normal distribution.16 Such techniques have been adopted to rank the performance of organisations. 1 17 18 We used confidence intervals around the providers' performances to compare each unit's performance with the average, with wider confidence intervals for low volume units. If these wider limits are not allowed for, low volume providers are more likely to be ranked misleadingly at the top or bottom of the group. Confidence intervals can be placed around a unit's rank, thus emphasising “the caution with which any league tables must be treated.”19
Control limits in the mortality chart define outlying units and give an early warning when a unit's performance starts to diverge from the population mean. Mortality charts can express performance either as a point estimate over a period or as sequential monitoring using running means of operative mortality. Crossing the upper or lower control limit indicates, respectively, mortality that is higher or lower than can be attributed to normal variation; in each case efforts should be made to identify the special causes. The less extreme control limits delineate an early warning “buffer zone” to trigger examination of practice. Because the control limits are much wider for low volumes, a high (risk adjusted) operative mortality in these hospitals should be interpreted carefully and may require longer monitoring to establish a meaningful estimate of mortality.
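Sequential monitoring against the control limits can be sketched as follows. This illustrative Python fragment is a deliberately simplified, one-sided version of the running-mean chart: after each operation it asks whether the unit's cumulative death count is already improbably high under the group mean, using the exact binomial upper tail. As noted above, repeated testing of an accumulating series inflates the type I error rate, hence the stringent default α.

```python
from math import comb

def upper_tail(d, n, p):
    """P(X >= d) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(d, n + 1))

def sequential_flags(outcomes, pooled_mean, alpha=0.01):
    """One-sided sequential check: after each operation (outcome 1=death,
    0=survival), flag the unit if its cumulative mortality lies beyond
    the exact binomial upper control limit for the group mean."""
    flags = []
    deaths = 0
    for n, died in enumerate(outcomes, start=1):
        deaths += died
        flags.append(upper_tail(deaths, n, pooled_mean) < alpha)
    return flags
```

A symmetric check on the lower tail would identify consistently good performers of the kind shown by unit 33 in figure 5.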
Usefulness of mortality control charts
Mortality control charts can be extended to any surgical specialty that uses risk adjusted outcomes. Similar graphical methods have been used to investigate the effect of case volume on unadjusted operative mortality in paediatric cardiac surgery.20 Control charts based on the approach of Walter Shewhart—the pioneer of the economic control of variation in manufacturing—have been described for monitoring surgical performance. 21 22 Other studies have described alternative techniques based on the cumulative sum (CUSUM) technique for longitudinal analysis of surgical performance. 23 24 In the sequential mortality control chart (figure 5), type I errors will occur more often but can be reduced by using the more extreme control limits and by interpreting divergences with proper caution. The mean operative mortality and corresponding control limits for any population will need to be reviewed periodically to reflect changes over time. The mortality control chart is intended to add to existing statistical methods for monitoring surgical performance rather than replace them.
We thank all the consultants who contributed data to the study, the data collection officers for their help, and the research staff at the Centre for Multilevel Modelling, University of London, for their invaluable help in developing the hierarchical models. Hospitals and trusts that contributed data were Addenbrooke's NHS Trust, Aintree University Hospital, Airedale NHS Trust, Barnet General Hospital, Bishop Auckland General Hospital, Broomfield Hospital, Chorley General Hospital, Colchester General Hospital, Furness General Hospital, Glenfield Hospital, Harefield Hospital, Harrogate District Hospital, Huddersfield Royal Infirmary, Ipswich Hospital, Kingston Hospital, Leicester Royal Infirmary, Leighton Hospital, Macclesfield District General Hospital, Maidstone General Hospital, Newham General Hospital, Norfolk and Norwich NHS Trust, North Staffordshire City General Hospital, Queen Alexandra Hospital, Queen Elizabeth Hospital, Queen Mary's Hospital, Royal Bolton Hospital, Royal Bournemouth Hospital, Royal Free Hospital, Royal Hull Hospitals NHS Trust, Royal Lancaster Infirmary, University Hospital Lewisham, Watford General Hospital, West Wales General Hospital.
Contributors: PPT, ISB, and JDP devised the original research and obtained funding. PPT, ACS, and PMcC were responsible for the completion and validation of the RISC and ASCOT datasets respectively. PPT and JDP analysed the data. PPT drafted and edited the paper and JDP and PMcC revised it. All authors contributed comments and corrections on the final draft. PPT and PMcC are the guarantors for the study.
Funding: The Hue Falwasser Fellowship of the Royal College of Surgeons of England. The guarantors accept full responsibility for the conduct of the study, had access to the data, and controlled the decision to publish.
Competing interests: None declared.
Ethical approval: The multicentre research ethics committee for Wales.