### Criteria Pilot Tests

**David C. Hadorn,** M.D. Manager, Special Projects

**Andrew Holmes,** M.D. Senior Medical Advisor

National Advisory Committee on Health and Disability

Wellington, New Zealand

2 December 1996

**Introduction**

As described in the article, five sets of priority criteria were developed in New Zealand under the auspices of the National Advisory Committee on Health and Disability and the four Regional Health Authorities (RHAs). These criteria incorporate the factors used by experienced clinicians to arrive at judgements of severity of illness and expected benefit from treatment. Each criterion is divided into different levels (e.g., no pain, mild, moderate, or severe pain), with each level assigned points, or weights, based on the results of pilot studies such as those described in this article. Points obtained by patients on each criterion are then added together to form a total criteria score or "priority score," which is supposed to reflect clinicians' overall judgement of urgency and expected benefit. The priority score is designed to assist in determining patients' relative priority for surgery or renal dialysis.

In this paper we describe the pilot tests conducted on the first three sets of criteria: those for cataract surgery, coronary artery bypass graft surgery (CABG), and hip/knee replacement.

These tests were conducted to ascertain the extent to which the priority score calculated for each patient corresponded to clinicians' global judgement of clinical urgency.

**Methods**

The approach used to test the criteria was as follows:

1. Tentative weights were assigned to each level of the criteria in advance of the pilot test, based on available evidence and consensus among members of the respective Professional Advisory Groups.

2. Surgeons participating in the pilot studies were asked to assess a series of consecutive patients as they normally would in order to form an opinion about the relative urgency or expected benefit from surgery.

3. Patients were scored based on which level was judged closest to describing their situation. Scoring was usually done by the surgeon, but in some cases an office nurse took primary responsibility for filling out the criteria form. The applicable criteria weights were added together to produce a total score for each patient. In the case of the CABG criteria, a prognostic adjustment formula was applied to the criteria that concerned life expectancy, i.e., all except Class I-II angina and the "social factor". These two exceptions were sometimes referred to as "quality of life" criteria; all other criteria were deemed to be indicative of the extent to which CABG prolongs life. For patients over age 70, the scores obtained on these prognostic criteria were multiplied by a factor of (100 - age) / 30, which had the effect of progressively reducing the score on prognostic factors as patients' ages increased. At age 80, for example, prognostic scores were multiplied by 2/3, reducing the points on these factors by 1/3. As discussed in the original article, this formula was applied because project participants felt that the relative importance of life prolongation versus quality of life decreases as patients age, and that clinicians in fact took age into account in making judgements of relative urgency and priority.
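The age adjustment in step 3 can be illustrated with a minimal Python sketch. The function names and the point values in the usage line are hypothetical and chosen only for illustration; only the multiplier (100 - age) / 30 for patients over 70 comes from the text. Flooring the multiplier at zero for ages above 100 is an added assumption.

```python
def age_factor(age):
    """Multiplier applied to prognostic criteria points.

    Patients aged 70 or under are unadjusted; above 70 the text gives
    (100 - age) / 30. The floor at 0 is an assumption for ages over 100.
    """
    if age <= 70:
        return 1.0
    return max(0.0, (100 - age) / 30)


def cabg_priority_score(prognostic_points, quality_of_life_points, age):
    """Add up criterion points, scaling only the prognostic ones by age."""
    return age_factor(age) * sum(prognostic_points) + sum(quality_of_life_points)


# Hypothetical patient: 30 prognostic points, 10 quality-of-life points, age 80.
# The prognostic points are scaled by 2/3, so the total is about 20 + 10 = 30.
example_score = cabg_priority_score([30], [10], age=80)
```

Keeping the quality-of-life points outside the multiplier mirrors the text's statement that only the life-expectancy criteria were adjusted.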

For use as dependent variables, clinicians were asked to provide a quantitative measure of the urgency of surgery for each patient based on global clinical judgement. In the case of cataract surgery and hip/knee replacement, surgeons rated patients on a numerical scale from 0 to 100, where 0 meant that the patient had no reasonable expectation of substantial benefit from surgery and 100 meant that the patient's level of expected benefit was as high as possible given that they had no emergency indications (e.g., acute glaucoma produced by cataract).

Cardiologists and cardiac surgeons preferred an alternative measure of grading overall clinical urgency, namely an estimate of what a "reasonable waiting time" would be for each patient (e.g., 3 weeks, 6 months). These estimates were made in the context of an "adequately, not infinitely funded service" and "keeping in mind competing claims for resources both within and outside the health sector."

Where a substantial discrepancy existed between the priority score and a clinical judgement rating, clinicians were asked to provide reasons for this discrepancy. In this way, missing or inappropriate factors could be identified.

Regression analysis was performed in order to adjust the criteria weights to better accord with overall clinical judgement, using the 0-100 score and reasonable waiting time (in days) as the dependent variables. Coefficients of determination, or R-squared statistics, were calculated to determine the extent to which the priority score "explained" the variation observed in the ratings of overall clinical judgement. In the CABG pilot test, regression weights were calculated on patients with initial scores below 65 (N = 173). This value appeared to be a reasonable threshold between emergency and non-emergency patients based on inspection of the scatterplot of all scores. The rationale for developing weights based on elective cases is that the criteria were designed primarily for use in waiting list management, and factors characteristic of emergency patients might not be relevant in this setting.
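As an illustration of how an R-squared statistic of this kind can be computed, the following Python sketch fits a least-squares line of clinical rating on priority score and returns the share of variance explained. The function name is hypothetical and this is not the authors' analysis code.

```python
def r_squared(priority_scores, clinical_ratings):
    """R-squared of a simple least-squares line of rating on priority score."""
    n = len(priority_scores)
    mean_x = sum(priority_scores) / n
    mean_y = sum(clinical_ratings) / n
    sxx = sum((x - mean_x) ** 2 for x in priority_scores)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(priority_scores, clinical_ratings))
    syy = sum((y - mean_y) ** 2 for y in clinical_ratings)

    # Fitted line: rating = intercept + slope * priority score.
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual sum of squares relative to total sum of squares.
    ss_res = sum((y - (intercept + slope * x)) ** 2
                 for x, y in zip(priority_scores, clinical_ratings))
    return 1.0 - ss_res / syy
```

For a single predictor this value equals the square of the correlation coefficient r, which is why the scatterplot captions can report both.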

Due to the relatively low number of data points and the large number of factors and levels within each factor, it was not possible to apply classical regression techniques. An alternative statistical technique known as simulated annealing(1) (2) was therefore used to estimate the priority criteria weights, based on an objective function that minimised the squared error of the fitted model. This technique is capable of generating stable weights on parameters using fewer observations than classical regression techniques.
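The general idea of fitting additive weights by simulated annealing can be sketched in Python as follows. The paper does not report its cooling schedule, step size, or starting values, so every such setting here is an assumption, and the tiny data set is invented purely to exercise the code.

```python
import math
import random


def sum_squared_error(weights, rows, targets):
    """Squared error of the additive model: prediction = sum of weight * indicator."""
    total = 0.0
    for row, target in zip(rows, targets):
        predicted = sum(w * x for w, x in zip(weights, row))
        total += (predicted - target) ** 2
    return total


def anneal_weights(rows, targets, n_steps=20000, start_temp=10.0,
                   step_size=0.5, seed=0):
    """Search for weights that minimise squared error (all settings assumed)."""
    rng = random.Random(seed)
    current = [0.0] * len(rows[0])
    current_cost = sum_squared_error(current, rows, targets)
    best, best_cost = list(current), current_cost
    for step in range(n_steps):
        # Linear cooling; the small constant keeps the temperature positive.
        temp = start_temp * (1 - step / n_steps) + 1e-9
        candidate = list(current)
        i = rng.randrange(len(candidate))
        candidate[i] += rng.uniform(-step_size, step_size)
        cost = sum_squared_error(candidate, rows, targets)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature falls.
        if cost <= current_cost or rng.random() < math.exp((current_cost - cost) / temp):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = list(candidate), cost
    return best, best_cost


# Invented example: two criterion levels whose "true" weights are 10 and 20.
rows = [[1, 0], [0, 1], [1, 1], [1, 0], [0, 1]]
targets = [10, 20, 30, 10, 20]
best_weights, best_cost = anneal_weights(rows, targets)
```

Accepting occasional worse moves early in the search is what distinguishes annealing from simple hill climbing.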

Clinicians were not blinded with respect to patients' priority score at the time they developed their overall clinical judgements. It is likely, therefore, that these judgements were influenced by the priority score, potentially resulting in "biased" assessment. We address this issue in the Discussion section.

**Results**

Figures 1-3 depict the draft criteria tested in these pilot studies.

**Figure 1: Priority Criteria for Cataract Surgery**

**Figure 2: Priority Criteria for Coronary Revascularisation**

**Figure 3: Priority Criteria for Major Joint Replacement**

The draft cataract criteria weights added to a maximum greater than 100; scores reported here are normalised to a maximum score of 100. Forms were completed for 97 cataract patients, 260 CABG patients, and 69 hip/knee replacement patients.
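The normalisation mentioned above amounts to a simple rescaling, sketched here in Python. The function name and the raw maximum of 120 used in the example are hypothetical, since the actual maximum of the draft cataract weights is not stated in the text.

```python
def normalise_score(raw_score, max_raw_score, target_max=100.0):
    """Rescale a raw criteria score so that the largest possible score is target_max."""
    return raw_score * target_max / max_raw_score


# Hypothetical example: a raw score of 60 when the draft weights sum to 120.
normalised = normalise_score(60, 120)
```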

Of the 260 CABG patients, 133 were evaluated at Green Lane Hospital, 119 at Dunedin Hospital, and 8 at Waikato Hospital. Of the 260 criteria forms received, 17 were missing data needed to perform the analyses, leaving 243 completed forms. Because, as discussed above, regression weights were calculated only on non-emergency patients, the final 31 patients contributed by Green Lane Hospital were all non-emergency patients, enrolled in the latter part of the pilot study with the express purpose of increasing the number of such patients. The initial 102 consecutive patients from Green Lane included both emergency and non-emergency cases.

In this study we did not collect information on patients' ages except with respect to the prognostic adjustment factor for CABG described above. This factor was used 49 times during the pilot study, indicating that approximately 49/260 = 19% of patients evaluated during this study were over age 70. Entering age into the regression did not significantly improve explanatory power (R-squared unchanged). Thus, most of the effects of age on judgements of urgency or expected benefit from surgery appear to have been captured by the clinical criteria.

The distribution of priority scores is shown in Figures 4-6.

**Figure 4: Histogram of Cataract Priority Scores**

**Figure 5: Comparison of Distributions for Coronary Revascularisation Recalibrated Priority Scores**

**Figure 6: Histogram of Hip and Knee Priority Scores**

An approximately normal distribution is discernible for the cataract and hip/knee replacement patients, whereas the CABG scores formed a bimodal distribution, representing the separate populations of elective and emergency cases. As noted above, emergency cases tended to have scores above 65 points. The cataract scores in Figure 4 are based on the final weights, as discussed in the next paragraph.

Agreement between priority scores (using the initial weights) and overall clinical judgement was good in the case of the draft cataract criteria (Figure 7), with an R-squared statistic of 0.49, and excellent in the case of the hip and knee criteria (Figure 8), with an R-squared statistic of 0.94. In the latter case, no further improvement in R-squared was achievable through manipulation of the criteria weights. However, by adjusting the weights for the cataract criteria, based on the regression analysis, R-squared was increased from 0.49 to 0.70.

**Figure 7: Clinical Score vs Priority Score, Cataract Criteria (n=97, R-squared=0.49)**

**Figure 8: Clinical Score vs Priority Score, Hip and Knee Criteria (n=69, R-squared=0.94)**

The new cataract weights therefore reflect clinical judgement of urgency more closely. A scatterplot using the revised cataract weights is shown in Figure 9. These new weights were incorporated into the final cataract criteria.

**Figure 9: Clinical Score vs Revised Priority Score, Cataract Criteria (n=97, R-squared=0.70)**

Regarding the CABG criteria, scatterplots of the 173 patients with initial scores below 65 are presented in Figure 10, with scores calculated using the original criteria weights. Figure 11 shows the same patients with scores calculated using the regression-based weights. The improved degree of fit to the regression line is evident, as quantified by R-squared statistics of 0.52 and 0.65, respectively, for the original and regression-based weights.

**Figure 10: Reasonable Waiting Time vs Original Priority Score (a=4.4, b=-0.1, r=0.721, r-squared=0.521, n=173)**

**Figure 11: Reasonable Waiting Time vs Regression Weights Priority Score (a=4.5, b=-0.2, r=0.806, r-squared=0.649, n=173)**

Figure 12 is a scatterplot of patients' scores, with the original scores along the X axis and the regression-weighted scores along the Y axis. The degree of correspondence was high (R-squared = 0.84), although in many cases patients' scores changed substantially between the original weights and the new weights.

**Figure 12: Original Priority Score vs Regression Weights (r=0.918, r-squared=0.843, n=255)**

With respect to factors responsible for discrepancies between priority scores and clinical judgement, only one missing factor was identified, namely co-existing ocular pathology in the case of cataract surgery. This factor substantially reduced the rating on overall clinical judgement relative to the priority score in eight of the study patients, as reported by three different ophthalmologists. As a result, this factor was added to the criteria set, while another was eliminated, namely "reduced visual field in other eye." This latter factor did not account for a substantial amount of variance in the clinical judgement rating.

Weights were assigned to the new ocular comorbidity factor based on clinical judgement. A final change made to the draft cataract criteria was to replace an "activities of daily living" criterion with a "visual function" criterion, which also incorporated and replaced a "need to drive" factor in the draft criteria. The final cataract criteria and weights are depicted in Figure 13.

**Figure 13: Priority Criteria for Cataract Surgery**

**Figure 14: Distribution and Priority Criteria for Coronary Revascularisation**

Regarding CABG, the distribution of patients across the variable levels is depicted in Figure 14, along with the initial weights and the weights derived from the regression analysis. It can be seen that substantially greater weight was given in the regression to the presence of (1) left main coronary artery disease, (2) Class IV angina, and (3) immediate threat to work, caring, or independence, as compared to the initial weights. On the other hand, ejection fraction and surgical risk made little or no independent contribution to clinicians' judgements of urgency.

As a result, these factors were eliminated from the revised criteria. Changes were made in the weights assigned to the remaining variables based on the results of the analysis and taking into account the medical literature. The final weights are shown in the original BMJ article and in Figure 15.

**Figure 15: Priority Criteria for Coronary Revascularisation**

In the case of the hip and knee replacement criteria, no consistently mentioned factors were identified as a cause for discrepancies between priority score and clinical judgement. Thus, no changes were made in these criteria or in their weights based on this pilot test.

**Discussion**

We observed good to excellent correlation between the priority criteria scores and overall clinical judgement in these pilot tests. In the case of the cataract and CABG criteria, improved correspondence was achieved by modifying the criteria weights and some of the variables. We conclude, therefore, that the priority criteria score faithfully reflects clinical judgement concerning the degree of benefit expected from cataract extraction and hip or knee replacement.

Our study has several limitations. First, the number of patients included in this test is relatively small, perhaps rendering the criteria weights derived from this exercise unstable.

However, we believe the number of cases was sufficient to establish the face validity of the criteria for purposes of estimating the expected benefit of surgery. The exact values of the weights are far less important than the validity of the criteria, as discussed below.

Second, as they were not randomly selected, participating surgeons may have differed in some significant way from remaining specialists in New Zealand (or elsewhere), particularly, perhaps, in their degree of support or enthusiasm for the criteria approach. However, we have no particular reason to believe that such differences, if they existed, would lead to less valid assessments of the criteria.

Third, the final versions of the cataract and CABG criteria were not subjected to further testing to determine whether they outperform the draft versions in the degree of correspondence with clinical judgement. Nonetheless, we believe (and most participants agreed) that the face validity of the revised criteria is superior to that of the draft versions.

Fourth, surgeons were aware of the criteria score at the time they formed their judgements about relative priority and expected benefit. The clinical score, as expressed on the 0-100 rating scale, was therefore almost certainly influenced by knowledge of the criteria score, raising the concern of biased assessment. This concern may be allayed by careful consideration of the role and function of the criteria, as discussed below.

**The Criteria as Linear Models**

From a statistical or epistemological perspective, the priority criteria, as constructed, constitute linear, or additive, models. In such models, weights are assigned to each level of two or more factors; scores on each factor are ascertained for individual cases and the corresponding weights added to arrive at a total score. Numerous such models have been developed in a variety of settings, including studies assessing and comparing the relative performance of linear models versus unaided human judgement. Such studies have been conducted in many medical settings, including diagnosis of appendicitis and prediction of six-month survival in cancer patients, and in such non-medical settings as prediction of faculty rating of graduate students, business bankruptcy, and divorce.(3) (4) (5) (6) The performance of linear models is consistently and substantially superior to unaided human judgement.

In his classic paper, "The Robust Beauty of Improper Linear Models,"(3) Dawes cites what is now generally accepted to be the reason for this consistent superiority, namely "the distinction between knowing what to look for and the ability to integrate information":

"In summary, proper linear models work for a very simple reason. People are good at picking out the right predictor variables and at coding them in such a way that they have a conditionally monotone relationship with the criterion. People are bad at integrating information from diverse and incomparable sources" (pp. 573-574).

The bases for the difficulty people have in integrating information from diverse sources have been studied extensively by cognitive psychologists.(7)

As Dawes points out, even models with unit weights on the criteria (or other "improperly" derived weights, i.e., weights not derived using formal regression analysis) consistently outperform unaided human judgement. Thus, in a very real sense the pilot studies discussed in this article could be seen as providing a test of clinical judgement, not of the criteria.
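A unit-weight model of the kind Dawes describes can be sketched in Python as below. The sketch assumes each predictor has already been coded so that higher values point towards a higher criterion value and that every predictor varies across cases; the function name and the example data are hypothetical.

```python
from statistics import mean, pstdev


def unit_weight_scores(rows):
    """Score each case as the plain sum of its standardised predictors.

    Every predictor gets weight 1 after standardisation, so no regression
    is needed. Assumes each column has a non-zero standard deviation.
    """
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    sds = [pstdev(column) for column in columns]
    return [
        sum((x - m) / s for x, m, s in zip(row, means, sds))
        for row in rows
    ]


# Hypothetical cases described by two predictors each.
scores = unit_weight_scores([[0, 0], [1, 2], [2, 4]])
```

The resulting scores give a ranking of cases only; they are not on the 0-100 scale used for the priority criteria.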

Indeed, one ophthalmologist member of the cataract Professional Advisory Group (a director of a hospital-based ophthalmological centre) declined to participate in the pilot study on the grounds that he trusted the priority score more than clinical judgement.

Accordingly, this centre began implementing the criteria to assess priority straight away.

Thus, we believe the principal conclusion to be drawn from the results of our pilot studies is that the face validity of the criteria was confirmed. As such, the pilot study enhances the plausibility that application of the criteria is a valid approach, although the stability of the numerical weights may be called into question because of low sample sizes and other factors.

As noted in our discussion of improper linear models, however, the actual weights assigned to the various factors and levels are less important than their face validity.

We anticipate further opportunities to test and revise the priority criteria in conjunction with longitudinal outcome studies, which are currently in the planning phase in New Zealand. In the meantime, we believe it is reasonable to consider the use of the priority criteria to be a valid method for estimating the expected medical benefit and the relative priority patients should receive for treatment under publicly funded health care systems.

**References**

1 Eglese R W. Simulated annealing: A tool for operational research. European Journal of Operational Research 46 (1990) 271-281.

2 Marius A J. Combining robust and traditional least squares methods: A critical evaluation. Journal of British Educational Statistics 6 (1988) 415-427.

3 Dawes R M. The robust beauty of improper linear models. American Psychologist 1979; 34: 571-582.

4 Dowie J J, Elstein A. Professional Judgment: A Reader in Clinical Decision Making. Cambridge: Cambridge University Press, 1988.

5 Arkes H R, Hammond K R. Judgment and Decision Making: An Interdisciplinary Reader. Cambridge: Cambridge University Press, 1986.

6 Wasson J H, Sox H C, Neff R K, et al. Clinical prediction rules: applications and methodological standards. New England Journal of Medicine 1985; 313: 793-799.

7 Kahneman D, Slovic P, Tversky A. Judgment Under Uncertainty: Heuristics and Biases. Cambridge: Cambridge University Press, 1982.