Mortality control charts for comparing performance of surgical units: validation study using hospital mortality data
BMJ 2003;326 doi: https://doi.org/10.1136/bmj.326.7393.786 (Published 12 April 2003) Cite this as: BMJ 2003;326:786
All rapid responses
Mortality Control Charts for Comparing Performance of
Surgical Units: validation study using hospital mortality data
Tekkis PP et al, Br Med J 2003;326:786-788
The paper reports a mean mortality of almost 10% and
30% for elective and emergency gastro-oesophageal
surgery respectively in 29 units, with some approaching
50%. It concludes that none had under-performed, and it
sets these high mortality figures as the benchmark
against which performance is judged. A multifactorial
model is created that best fits this dataset, and it
artificially creates a high minimum risk by using the
original POSSUM (1) instead of P-POSSUM (2), which
greatly over-predicts mortality in low-risk patients.
This is compounded, it appears, by the inclusion of
identical risk factors twice within the multifactorial
model, i.e. malignancy category, mode of surgery, and
age, which are already incorporated within POSSUM.
Whilst confirming that accurate and reliable data are
essential, it is acknowledged that the data in the study
are limited. It is unclear whether physiological scores
were derived from data collated on admission or, as
intended (1), after resuscitation. It is also unclear
whether the original definitions for operative
parameters were consistently followed. Whilst an
emergency operation is defined as one within 24 hours
of admission, many surgeons disregard this and
include cases beyond 24 hours and returns to theatre
with complications. In our institution, 70% of
emergency patients did not meet the criteria, but their
inclusion resulted in an increase in the overall
POSSUM score.
Ambiguity also exists as to the number of procedures
performed during an operation. Non-clinical audit
officers are unable to validate POSSUM data without
clear guidelines. The occurrence of peritoneal soiling
or intra-operative blood loss serves only to raise the
expected mortality, condoning an otherwise high
observed mortality. The institutions in the study were
not randomly selected; this is a significant flaw in the
study and may account for the high mortality. The
control charts excuse high mortality rates in units with
low workloads, justifying a high mortality where
guidelines should preclude such small numbers being
undertaken.
The POSSUM (1) formula was derived from a general
surgical population, and its application to gastro-
oesophageal surgery is suspect even though adjustments
were made. A new formula may be appropriate, with
fine-tuning of specific operative parameters based on
prospectively validated data from recognised centres
with adequate case numbers. Bias is inevitable if
curative and palliative, or oesophageal and gastric,
surgery are combined when numbers are few and
surgeons have widely differing operability rates, which
POSSUM ignores. Even within these subgroups it is
unwise to compare outcomes for trans-hiatal
oesophagectomy with those of a three-phase McKeown
procedure with radical lymphadenectomy (3). The
former benefits from lower operative mortality but is
likely to result in worse survival. The current POSSUM
operative parameters do not cater for true operative
severity, as they incorrectly equate the severity of a
partial gastrectomy with that of an oesophagectomy.
POSSUM also ignores obesity and diabetes, and it
condones poor judgement when surgery is undertaken
in advanced cases: the score would reflect the risk but
not question the wisdom of such surgery.
The purpose of a professionally led system of quality
control is the objective assessment of outcome, taking
into account case mix, co-morbidity and operative
severity, but such systems should be sound. The
POSSUM system is neither simple nor practical, and
validation of data, internal or external, is fraught with
difficulties. The observed-to-expected mortality ratio
generated by POSSUM or its derivatives can be
misleading, since both components require critical
analysis even when the overall ratio is favourable. Only
then can a judgement be made on whether performance
truly meets standards. The Surgical Risk Scale (SRS)
system is preferred (4), since it is independent of the
surgeon and its data can easily be examined.
Alternatively, the pre-operative physiological scores
alone could be employed, in a manner similar to that of
the ruptured aortic aneurysm (RAAA-POSSUM)
equation, reported (5) as being effective in
sub-specialty subgroup analysis.
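The pitfall with the observed-to-expected ratio can be sketched with a small, invented example (all figures below are hypothetical, not taken from the paper): when a model inflates the expected mortality, the ratio looks favourable even though the same observed deaths would signal excess mortality against a better-calibrated expectation.

```python
# Hypothetical illustration of why a favourable O/E ratio can mislead.
# All numbers are invented; "inflated" stands in for a model that
# over-predicts risk, as argued above for original POSSUM in
# low-risk patients.
observed_deaths = 15
patients = 100

expected_inflated = 20.0   # deaths predicted by an over-predicting model
expected_realistic = 8.0   # deaths predicted by a better-calibrated model

oe_inflated = observed_deaths / expected_inflated    # 0.75: looks favourable
oe_realistic = observed_deaths / expected_realistic  # 1.875: excess deaths

print(f"O/E against inflated expectation:  {oe_inflated:.2f}")
print(f"O/E against realistic expectation: {oe_realistic:.2f}")
```

Both components of the ratio, not just the ratio itself, therefore need critical analysis before performance is judged acceptable.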
George A Khoury MS, FRCS
Consultant General & Colorectal Surgeon
East Sussex NHS Trust,
St Leonards on Sea
(1) Copeland GP, Jones D, Walters M. POSSUM: a
scoring system for surgical audit. Br J Surg 1991.
(2) Whiteley MS, Prytherch DR, Higgins B, Weaver PC,
Prout WG. An evaluation of the POSSUM surgical
scoring system. Br J Surg 1996;83:812-815.
(3) Khoury GA. Oesophageal surgery under Akiyama.
Lancet 1989;1:91-92.
(4) Sutton R, Bann S, Brooks M, Sarin S. The Surgical
Risk Scale as an improved tool for risk-adjusted
analysis in comparative surgical audit. Br J Surg
2002;89:763-768.
(5) Neary WD, Crow P, Foy C, Prytherch D, Heather BP,
Earnshaw JJ. Comparison of POSSUM scoring and
the Hardman Index in selection of patients for repair of
ruptured abdominal aortic aneurysm. Br J Surg.
Competing interests: No competing interests
The paper by Tekkis et al illustrates the difficulties of trying to
compare outcomes for several units when case mix is variable. In their
analysis, mortality rates are adjusted as an attempt to level the playing
field but my concern is whether this can give rise to misleading results.
I shall illustrate by a hypothetical example. Consider a Unit that has
undertaken 50 procedures with a mean risk rather below average at 8% but
for which 8 deaths were recorded (i.e. a mortality rate of 16%). Before
any adjustment, each unit will have its own set of confidence limits
depending on its case mix. So, in this case, calculating exact
confidence limits based on the binomial distribution, the upper 90% limit
(two-tailed) is at 17%. With this limit as an 'alert' boundary, the
observed mortality rate should not give cause for concern.
The paper's adjusted mortality rate, however, is derived from the ratio of the observed to expected
mortality multiplied by the pooled mean. (This is stated in the full text
of the paper on the BMJ web site.) For the hypothetical Unit the adjusted
rate is (16/8) times 12%, which equals 24%. After adjustment it is
desirable that the relationship a Unit has with respect to its own
confidence limits is preserved vis-a-vis the new limits that are used for
the control charts in figures 3 to 5. So, this hypothetical Unit that lies
within its own 90% limit should, after adjustment, lie within the 90%
limit on the control chart. Where does it actually lie? On the control
chart, the upper 90% limit after 50 procedures corresponds to a rate of
22% and the upper 95% limit to a rate of 24%. So the Unit lies on the
upper 95% limit and hence could signal a warning unnecessarily. Admittedly
this example is somewhat engineered, and mean risks of 8% after 50
procedures may be unrealistic, but it does highlight the dangers of using
this kind of plot for control charting. In an environment in which a unit
is to be judged by whether or not it lies outside a particular limit, the
errors it generates could be critical.
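The arithmetic of this hypothetical unit can be reproduced directly (the 8% expected risk, 16% observed rate and 12% pooled mean are the figures used above; the exact binomial tail probability is one way of expressing the unit's own limit):

```python
from math import comb

def binom_upper_tail(n: int, k: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, deaths = 50, 8
expected_risk = 0.08   # unit's own mean predicted risk
pooled_mean = 0.12     # pooled mean mortality across all units

observed_rate = deaths / n                                     # 0.16
tail = binom_upper_tail(n, deaths, expected_risk)              # P(>=8 deaths at 8% risk)
adjusted_rate = (observed_rate / expected_risk) * pooled_mean  # (16/8) x 12% = 24%

print(f"observed {observed_rate:.0%}, adjusted {adjusted_rate:.0%}, "
      f"P(>=8 deaths | risk 8%) = {tail:.3f}")
```

The adjustment thus moves the unit from 16% to 24%, even though, against its own exact binomial distribution, eight deaths in fifty at an 8% expected risk is only a borderline event.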
Competing interests: No competing interests
The paper by Tekkis, McCulloch, Steger, Benjamin and Polonecki (2003)
raises two important issues - first, the role of statistical process
control (SPC) in hospitals and, secondly, the mechanism and purpose of
risk adjustment.
SPC was developed in the 1920s by Walter Shewhart, and its employment
has been honed by a number of quality experts, the most prominent of whom
has perhaps been Edwards Deming (Salsburg 2001). SPC was developed so that
people manning industrial production lines could learn about the quality
of their product and take early corrective action if that quality began to
deteriorate. Deming (1993) emphasises again and again that the crucial
element is the system and its constituent processes, and that SPC is
useful only in a secondary role as part of the Deming cycle of fixing the
system, followed by monitoring, analysis and feedback - analysing and
optimising the system is the primary function in quality improvement, not
SPC. Nowhere is it advocated that SPC should be used to judge and compare
institutions; Deming in fact repeatedly warns against this behaviour.
When we see SPC methods employed on hospital data, there is often
total disregard for the principles of quality improvement enunciated by
such people as Edwards Deming. SPC is used to judge performance and
compare institutions. Usually there is nothing said about analysing and
optimising the underlying system and its constituent processes. This is a
terrible misuse of SPC. Sanai (2003) describes in graphic detail what
happens when we ignore the system.
When used to judge, SPC significance levels must be set relatively
high to avoid false positive signals. This means that sensitivity suffers
so that there can be considerable delay in detecting genuine signals, and
during this time unnecessary patient injury can occur. In hospitals, the
“cost” of false negative states is of particular importance - it is the
problem that no one knows about or that is ignored that can be the most
dangerous. Thus, employing SPC in a judgmental way can destroy its ability
to give needed early warning. However, in a learning environment where a
department has first carefully analysed and optimised its systems and then
employed SPC to monitor, significance levels can be set to give high
sensitivity so that genuine changes are detected much more promptly. This
means that occasional false positives need to be tolerated. However,
provided there is sufficient specificity to prevent tampering, this is of
little consequence. It is usually easy for a suitably qualified medical
specialist to detect the occasional false positive. Viewed in this way, an
SPC signal does not indicate that there is a problem - it indicates that
there is sufficient evidence to search for a possible or probable problem.
Judgment is only appropriate if, after a signal, such a search is not
performed or, if having identified a problem, it is not addressed.
It is important also to consider other consequences of employing SPC
in a judgmental manner. First, because the primacy of the system is so
often ignored, there is an emphasis on blame rather than finding and
correcting causes of problems. Since these usually arise in systems and
staff in the wards may have little or no say in many of these systems,
morale is damaged leading to increased likelihood of medical error.
Furthermore, staff begin to game data in subtle ways that make them
useless for improving quality. Finally, staff use data to justify what
they do instead of using them to learn how to improve (Jacobson, Mindell
and McKee 2003).
When used appropriately to learn, SPC is an extraordinarily useful
adjunct to systems analysis and optimisation for quality improvement. When
misused to judge, the ability of SPC to improve quality is destroyed and
its potential to do harm becomes very real.
Substandard performance always has systems problems at its basis.
Dealing with substandard performance requires management that is
knowledgeable in understanding systems and that has the courage and
resources to correct systems problems. For example, a surgeon who does one
or two complex surgical procedures per year is likely to have inferior
results. The system problem is the performance of small amounts of complex
surgery and a hospital administration that allows this to occur is failing
to analyse and optimise its systems. As mentioned above, Sanai (2003)
gives an eloquent and moving account of the consequences of ignoring the
health of the system and Rothwell, Warlow, and the European Carotid
Surgery Trialists’ Collaborative Group (1999) and Carter (2003) describe
how difficult it is to use statistical methods to judge performance.
It is important to consider the mechanism and role of risk adjustment
for which the authors have performed a careful statistical analysis.
However, the POSSUM method they use figures prominently in another recent
publication by this group (Tekkis, Kessaris, Kocher, Poloniecki, Lyttle
and Windsor 2003); they found that for colorectal surgery it can be
poorly calibrated. In addition, it requires a good deal of patient
information, some of which may not be available for all patients. It is
worthwhile to reflect that Iezzoni, Ash, Shwartz, Daley, Hughes and
Mackiernan (1995) show that “predicting who dies depends on how severity
is measured” and Thomas and Hofer (1999), after a careful analysis,
conclude that “reports that measure quality using risk adjusted mortality
rates misinform the public”. Perhaps the most comprehensive risk
adjustment method currently in use is the APACHE system yet anomalies
occur not infrequently when it is applied to differing populations. One
problem is that random variation can be very large compared with variation
due to substandard performance. In addition, the myriad of patient
characteristics that may exist can mean that precise and reliable risk
adjustment is a mirage.
However, when used appropriately by individual institutions on their
own data sequentially, for example in control charts, complex risk
adjustment is almost certainly unnecessary even if it were reliable enough
to be advocated. Lawrance, Dorch, Sapsford, Mackintosh, Greenwood,
Jackson, Morrell, Robinson and Hall (2001) have described a simple risk
adjustment tool for myocardial infarction patients that works well when
used in this manner in control charts. In addition, Sutton, Bann, Brooks
and Sarin (2002) have described a simple surgical risk adjustment method
that should be adequate for use by institutions monitoring their own
systems and processes sequentially. Simple tools that work are always
superior to complex ones with spurious precision and reliability that have
the capacity to mislead.
Tekkis P, McCulloch P, Steger A, Benjamin I and Poloniecki J
“Mortality Control Charts for Comparing Performance of Surgical Units:
Validation Study Using Hospital Mortality Data” BMJ 2003;326:786-791.
Salsburg D “The Lady Tasting Tea. How Statistics Revolutionised Science in
the Twentieth Century” New York W.H.Freeman and Co. 2001.
Deming W.E. “The New Economics for Industry, Government and Education”
Cambridge, Massachusetts Institute of Technology 1993.
Sanai L The Sunday Times January 26th 2003.
Jacobson B, Mindell J and McKee M “Hospital Mortality League Tables” BMJ
2003;326:777-778.
Rothwell P, Warlow C, and the European Carotid Surgery Trialists’
Collaborative Group “Interpretation of Operative Risks of Individual
Surgeons” Lancet 1999;353:1325.
Carter D “The Surgeon as a Risk Factor” BMJ 2003;326:832-833.
Tekkis P, Kessaris N, Kocher H, Poloniecki J, Lyttle J and Windsor A
“Evaluation of POSSUM and P-POSSUM Scoring Systems in Patients Undergoing
Colorectal Surgery” British Journal of Surgery 2003;90:340-345.
Iezzoni L, Ash A, Shwartz M, Daley J, Hughes J and Mackiernan Y
“Predicting Who Dies Depends on How Severity is Measured: Implications for
Evaluating Patient Outcomes” Annals of Internal Medicine 1995;123:763-770.
Thomas W and Hofer T “Accuracy of Risk-Adjusted Mortality Rate as a
Measure of Hospital Quality of Care” Medical Care 1999;37:83-92.
Lawrance R, Dorsch M, Sapsford R, Mackintosh A, Greenwood D, Jackson B,
Morrell C, Robinson M and Hall A “Use of Cumulative Mortality Data in
Patients with Acute Myocardial Infarction for Early Detection of Variation
in Clinical Practice” British Medical Journal 2001;323:324-327.
Sutton R, Bann S, Brooks M and Sarin S “The Surgical Risk Scale as an
Improved Tool for Risk-adjusted Analysis in Comparative Surgical Audit”
British Journal of Surgery 2002;89:763-768.
Competing interests: No competing interests
We read with interest the analysis of two databases by these authors,
which appears to be an improvement on existing models of risk
stratification and which is especially topical given the recently
published annual assessments by the Dr Foster group (1).
The authors highlight the difficulties in data collection for this
type of analysis. The analysed databases contain patient information
given on a voluntary basis, approximately one quarter of the data is
retrospective, and a significant amount of data was not available for
analysis, which will hamper direct comparison with other UK units.
In other fields of surgery, for example colorectal surgery,
subspecialists attract not only a more elective practice, which might
lower the operative mortality figures, but also referrals of a more
complex nature, which could be expected to increase these figures (2-5).
Case mix influences the results of surgery. Surgeons who specialise in
colorectal surgery undertake a disproportionate number of elective (low-
risk) cases, and as such their results may appear superficially better.
Murray et al 1995 (2) have shown that adjustment for case mix can lead to a
substantial change in the relative performance of surgeons. Sagar et al
1994 (3) have shown that, by adjusting for patient differences, the initial
appearances of the data may in fact be reversed. These referral practices
are hard to “control for” by examining pre-operative risks alone.
We agree with the comment in the accompanying editorial - “hospitals
are complex systems that are part of larger systems and also contain
subsystems”. Variability in outcome has been previously attributed to the
interplay of multiple factors including surgical ability, surgical
technique, case mix, case volume, institutional influences, peri-operative
care and anaesthetic care (4,5). Units that have more experience in their
particular field may have a wider range of operative and non-operative
approaches available than less experienced units and may also have more
subspecialist resources available within the unit for improved decision
making pre- and post-operatively. Units with the benefit of better support
from auxiliary surgical and medical services may also show improvements in
their figures, which reflects the multi-disciplinary nature of modern
surgical care.
Ideally, surgical performance should be monitored prospectively and
examined not only by operative mortality but also by post-operative
morbidity and quality-of-life measurements, allowing for case mix in any
comparison. Until then, this paper does appear to improve on the current
methods of evaluating surgical units’ performance.
1. Tekkis P, McCulloch P, Steger AC, Benjamin IS, Poloniecki JD.
Mortality control charts for comparing performance of surgical units:
validation study using hospital mortality data bmj.com 2003;326:786
2. Murray GD, Hayes C, Fowler S, Dunn DC. Presentation of comparative
audit data. Br J Surg 1995, 82, 329-332.
3. Sagar PM, Hartley MN, Mancey-Jones B, Sedman PC, May J, Macfie J.
Comparative audit of colorectal resection with POSSUM scoring system. Br
J Surg. 1994, 81, 1492-1494.
4. Houghton A. Variation in outcome of surgical procedures. Br J Surg.
1994, 81, 653-660
5. Unhi SS, Kent SJS. Which surgeons in a district general hospital should
treat patients with carcinoma of the rectum? J R Coll Surg Edinb. 1995, 40, 52-
Competing interests: No competing interests
There can be only one standard: 0% morbidity and 0% mortality.
Anything short of that should be interpreted as suboptimal performance.
That is not the least bit unrealistic, even in patients with "co-
morbidities". Dr Foster's report conceals the degree to which the standard
of surgery in the NHS is, by this definition, suboptimal (1).
Applying Dr Foster's methodology to the report on oesophageal surgery
in this issue of the BMJ, for example, the median (or mode) of 100
translates into an operative mortality of 12%, range 0% to 50%, in the 29
hospitals which included 1042 patients (2,3). Surgeons in other countries,
including myself and some of those with whom I have worked, have performed
large numbers of equally major operations, including oesophagectomies,
splenectomies, hepatic lobectomies and pancreatico-duodenectomies, without
mortality (1,4,5,6). What is more, the 0% mortality has for the most part
been obtained by surgeons who assist their residents doing most of the
operations. All have been general surgeons.
1. Hospital mortality league tables Bobbie Jacobson, Jenny Mindell,
and Martin McKee BMJ 2003; 326: 777-778.
2. Mortality control charts for comparing performance of surgical units:
validation study using hospital mortality data Paris P Tekkis, Peter
McCulloch, Adrian C Steger, Irving S Benjamin, and Jan D Poloniecki BMJ
2003; 326: 786-788.
3. Playing Russian Roulette Richard G Fiddian-Green
bmj.com, 11 Apr 2003 Rapid response to: Interactive case report: J Bligh,
R Farrow, Ruth, Richard Farrow, Linda Hands, Natasha Kapur, Malcolm H A
Rustin, John Benson, and Ed Peile BMJ 2003; 326: 804-807.
4. Coon WW. Splenectomy in the treatment of hemolytic anemia. Arch Surg.
5. Jarnagin WR, Gonen M, Fong Y, DeMatteo RP, Ben-Porat L, Little S,
Corvera C, Weber S, Blumgart LH. Improvement in perioperative outcome
after hepatic resection: analysis of 1,803 consecutive cases over the past
decade. Ann Surg. 2002 Oct;236(4):397-406; discussion 406-7.
6. Yeo CJ, Cameron JL, Sohn TA, Lillemoe KD, Pitt HA, Talamini MA,
Hruban RH, Ord SE, Sauter PK, Coleman J, Zahurak ML, Grochow LB, Abrams
RA. Six hundred fifty consecutive pancreaticoduodenectomies in the 1990s:
pathology, complications, and outcomes. Ann Surg. 1997 Sep;226(3):248-57.
Competing interests: No competing interests
Paris Tekkis and colleagues give a further example of the use of
adjusted operative mortality rates to compare institutional performance
using binomial limits to judge deviance from the mean. The interpretation
of these data depends on the level at which the bar is set. In this case
the mean of all institutions is the standard against which comparison is
made. This is fine if the question is ‘do we deviate substantially from
the mean of all our peers?’. This is a question about clinical governance
and was asked of paediatric surgical units to identify possible deviant
performance.
But if the issue is to encourage better performance, then the
question could also be, ‘how far do we deviate from units with better or
best practice?’. In that case an external reference value could be chosen
if one were available. Alternatively, the mean of the top 50%, or the top
20%, of the distribution could be chosen as the standard of comparison.
Raising the bar in this way would mean that the exceptionally good practice
of units 3 and 33 is likely to fall within the limits and more of the units
with the highest mortality are likely to fall outside them. Is there any
objection to changing the standard in this way, and has it been used in
other settings?
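The effect of raising the bar can be sketched with invented figures (purely illustrative, not data from the paper): benchmarking against the mean of the better-performing half yields a stricter standard than the mean of all peers.

```python
# Hypothetical unit mortality rates, invented for illustration only.
rates = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.30]

# Current standard: mean of all peers.
overall_mean = sum(rates) / len(rates)

# Alternative standard: mean of the best-performing 50% of units.
best_half = sorted(rates)[: len(rates) // 2]
top50_mean = sum(best_half) / len(best_half)

print(f"all-peer benchmark: {overall_mean:.1%}")
print(f"top-50% benchmark:  {top50_mean:.1%}")
```

With the stricter benchmark, limits drawn around it would sit lower, so more high-mortality units fall outside them, exactly the shift the question above envisages.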
Competing interests: No competing interests