Research Methods & Reporting

Evaluating policy and service interventions: framework to guide selection and interpretation of study end points

BMJ 2010; 341 doi: (Published 27 August 2010) Cite this as: BMJ 2010;341:c4413
  1. Richard J Lilford, professor of clinical epidemiology1,
  2. Peter J Chilton, research associate1,
  3. Karla Hemming, senior research fellow1,
  4. Alan J Girling, senior research fellow1,
  5. Celia A Taylor, senior lecturer2,
  6. Paul Barach, visiting professor3
  1. 1Public Health, Epidemiology and Biostatistics, University of Birmingham, Edgbaston, West Midlands B15 2TT
  2. 2Department of Clinical and Experimental Medicine, University of Birmingham
  3. 3Patient Safety Centre, University Medical Centre Utrecht, PO Box 85500, 3508 GA Utrecht, Netherlands
  1. Correspondence to: R J Lilford r.j.lilford{at}
  • Accepted 9 June 2010

The effect of many cost effective policy and service interventions cannot be detected at the level of the patient. This new framework could help improve the design (especially choice of primary end point) and interpretation of evaluative studies

There is broad consensus that clinical interventions should be compared in randomised trials measuring patient outcomes. However, methods for evaluation of policy and service interventions remain contested. This article considers one aspect of this complex issue—the selection of the primary end point (the end point used to determine sample size and given most weight in the interpretation of results). Other methodological issues affecting the design and interpretation of evaluations of policy and service interventions (including attributing effect to cause) have been discussed elsewhere,1 and we will consider them only in so far as they may affect selection of the primary end point. Our analysis begins with a classification of policy and service interventions based on an extended version of Donabedian’s causal chain.

Avedis Donabedian conceptualised a chain linking structure, process, and outcome.2 The classification we propose is based on a model in which the process level is divided into three further categories or sublevels as shown in fig 1.3 4 5 Starting closest to the patient these are: clinical processes (encompassing treatments such as drugs, devices, procedures, “talking” therapy, complementary therapy, and so on); targeted processes (those aimed at improving particular clinical processes, such as training in the use of a device, or a decision rule built into a computer system); and generic processes (for example, the human resource policy adopted by an organisation).


Fig 1 Modified Donabedian causal chain. Interventions at structural (policy) and generic service level can achieve effects through intervening variables (such as motivation and staff-patient contact time) further down the chain. For example, an intervention at (x) produces effects (good or bad) downstream at (a), (b), (c), and (d)

When an intervention is designed, the level at which it first affects this chain should be clarified along with its plausible effects.6 There are four levels in the extended Donabedian chain at which it is possible to intervene. Starting closest to the patient these levels are:

  • Clinical interventions—for example, use of clot busting drugs for thrombotic stroke

  • Targeted (near patient) service interventions—for example, establishing a service to expedite administration of clot busting drugs for thrombotic stroke

  • Generic (far patient) service interventions—for example, providing yearly appraisal for all staff

  • Structural (policy) interventions—for example, improving the nurse to patient ratio.

Evaluation of targeted and generic service interventions tends to be lumped together under portmanteau terms such as management research, service delivery and organisational research, or health services research. We shall show that, from a methodological point of view, generic service interventions have more in common with policy interventions than with targeted service interventions.

Assessing targeted service interventions

Clinical interventions have only one downstream level at which evidence of effectiveness may be observed—patient outcomes. However, the effect of targeted service interventions can be assessed by using either clinical processes (for example, the proportion of eligible patients who receive timely thrombolysis) or outcomes (proportion of patients who recover from stroke). Selecting a sample of sufficient size to measure changes in end points at both levels risks wasteful redundancy. If there is an established link between a clinical process and its corresponding outcome, then the least expensive option should be chosen. Costs are a function of sample size (number of participating centres and the number of patients sampled in each centre) and the cost of making each observation.

Changes in clinical outcome (such as mortality or infection rates) can never be bigger than changes in the clinical error rates on which they depend and are usually much smaller; it is rare for the risk of an adverse outcome to be wholly attributable to clinical error. Thus detection of changes in outcome requires larger, often much larger, samples than those needed to detect changes in the corresponding clinical process. Figure 2 compares the sample sizes for a standard simple before and after study designed to measure the effect of an intervention on compliance with a clinical standard (process study) and the risk of an associated adverse clinical outcome (outcome study). The sample size for the outcome study is about four times that of the corresponding process study even when the adverse outcome is 100% attributable to clinical error (that is, can arise only if the corresponding error has occurred, as in reaction to incompatible blood transfusion). The outcome study must be more than 200 times larger than the process study if the attributable risk is 25% (as in failure to carry out timely thrombolysis therapy after thrombotic stroke).


Fig 2 Specimen sample sizes for a simple before and after study or randomised controlled trial to detect an improvement in process compliance using process and outcome measures with conventional 80% power and 5% significance levels. At baseline, compliance with the targeted clinical process is 50% and the rate of adverse outcomes is 20%. The numbers needed for the outcome study increase as the percentage of outcome risk attributable to non-compliance with the process (attributable risk) decreases
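The comparison in figure 2 can be illustrated with a minimal sketch that uses the usual normal approximation for comparing two proportions. It is not a reconstruction of the authors' calculation. The size of the improvement in compliance (10 percentage points) and the reading of attributable risk as the share of adverse outcomes among non-compliant patients caused by the non-compliance are assumptions made here for illustration. Only the baseline rates, power, and significance level come from the figure caption.

```python
from statistics import NormalDist


def n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate patients per period (or arm) needed to detect a change from p1 to p2.

    Normal-approximation formula for two proportions, two-sided alpha.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return z ** 2 * variance / (p1 - p2) ** 2


# Baseline values taken from the figure caption.
BASE_COMPLIANCE = 0.50
BASE_OUTCOME = 0.20
# Assumption: the intervention raises compliance by 10 percentage points.
NEW_COMPLIANCE = 0.60


def size_ratio(attributable_fraction: float) -> float:
    """Outcome-study sample size divided by process-study sample size.

    Assumption: attributable_fraction is the share of adverse outcomes among
    non-compliant patients that is caused by the non-compliance.
    """
    # Choose the two risks so that the baseline outcome rate is reproduced.
    risk_if_not = BASE_OUTCOME / (1 - BASE_COMPLIANCE * attributable_fraction)
    risk_if_compliant = risk_if_not * (1 - attributable_fraction)
    new_outcome = (NEW_COMPLIANCE * risk_if_compliant
                   + (1 - NEW_COMPLIANCE) * risk_if_not)
    n_process = n_per_group(BASE_COMPLIANCE, NEW_COMPLIANCE)
    n_outcome = n_per_group(BASE_OUTCOME, new_outcome)
    return n_outcome / n_process


for fraction in (1.0, 0.5, 0.25):
    print(f"attributable fraction {fraction:.0%}: outcome study about "
          f"{size_ratio(fraction):.0f} times larger than process study")
```

Under these assumptions the ratio is roughly 4 when the outcome is wholly attributable to the error and roughly 200 when the attributable fraction is 25%. Other assumptions about the size of the improvement would shift the figures.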

The actual numbers will depend on baseline rates of compliance and adverse outcome and the study design—cluster studies with contemporaneous controls will require even larger samples than simple randomised controlled trials.7 8 9 However, study costs are a function not only of the number of observations, but also of the costs of making each observation. Although outcomes such as mortality and rates of infection are often collected routinely, health service systems seldom carry the numerator (process failure) and denominator (opportunity for failure) data required to calculate the rates of process failure.10 This information usually has to be obtained from case notes, bespoke data collection forms, or direct observation.11 The cost of reliably measuring failures in clinical process may therefore be considerable, depending on the process concerned.12

There are thus competing forces at work when evaluating a targeted service intervention; the generally higher cost of measuring clinical processes is in tension with the greater number of cases that must be sampled to measure outcomes with commensurate precision. The greater the size (in absolute terms) of the hypothesised effect on outcome and the more expensive the collection of data, the stronger the argument to rely on outcome measures. For example, an influential study to assess the effect of targeted processes to reduce infection associated with central venous lines used bloodstream infections as the outcome measure.13 Real time observations to assess the clinical processes that reduce infection risk would have been very expensive and substantial effects on the outcome (infection rates) were expected (and observed).13 However, as the signal (change in outcome due to intervention) diminishes in relation to the noise (changes in outcome due to uncontrolled sources of variation), a study based on process measurement will become more cost effective. Such was the case in Landrigan’s study of the effects of fatigue on the quality of care delivered by medical interns in the intensive care unit, which used direct observation of clinical processes.14
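The tension between the cost of each observation and the number of observations can be put in a small sketch. All the numbers used below are hypothetical and serve only to show the direction of the trade-off. The function names are illustrative and are not drawn from the studies cited.

```python
def total_cost(n_observations: float, cost_per_observation: float) -> float:
    """Study measurement cost: number of observations times the unit cost."""
    return n_observations * cost_per_observation


def cheaper_end_point(n_process: float, cost_process: float,
                      n_outcome: float, cost_outcome: float) -> str:
    """Return which end point ('process' or 'outcome') gives the cheaper study."""
    process_total = total_cost(n_process, cost_process)
    outcome_total = total_cost(n_outcome, cost_outcome)
    return "process" if process_total < outcome_total else "outcome"


def break_even_unit_cost_ratio(n_process: float, n_outcome: float) -> float:
    """Largest ratio of process to outcome unit cost at which process is not dearer."""
    return n_outcome / n_process


# Hypothetical case 1: large expected effect on outcome, costly process observation.
print(cheaper_end_point(n_process=400, cost_process=50.0,
                        n_outcome=1500, cost_outcome=5.0))
# Hypothetical case 2: small expected effect on outcome, so many more patients needed.
print(cheaper_end_point(n_process=400, cost_process=50.0,
                        n_outcome=80000, cost_outcome=5.0))
```

In the first hypothetical case the outcome study is cheaper. In the second the process study is cheaper, because the required outcome sample has grown faster than the difference in unit cost.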

The above argument is predicated on circumstances where the clinical process of interest is a valid surrogate (proxy) for the relevant patient outcome. If this is not the case, the link between clinical process and patient outcome should first be confirmed—for example, with a double masked randomised controlled trial. However, the link between process and outcome cannot always be established robustly, particularly when the outcome in question is the egregious consequences of a rare clinical process failure—for example, transfusion of incompatible blood, oesophageal intubation, or intrathecal injection of vincristine.

Policy and generic service interventions

Generic service interventions have the potential to affect targeted processes, clinical processes, and outcomes. They may affect clinical processes directly through targeted processes or indirectly through intervening variables (such as morale, sickness absence, culture, knowledge, time spent with each patient).15 Figure 1 shows that there are four downstream levels at which effects may be observed. Policy interventions (such as building a new hospital, increasing reimbursement rates, or conferring ‘teaching’ status) can exert effects through five levels.

Narrow versus diffuse effects

The further to the left an intervention is applied in the causal chain, the greater the number of downstream processes that may be affected. For example, a targeted service intervention to prevent misconnection of oxygen delivery pipes in the operating theatre would affect only one clinical process—gas delivery. This is a narrow or tightly coupled effect. However, a generic service intervention (such as applying a system of appraisal for all staff) or a policy level intervention (such as increasing resources to improve the nurse to patient ratio) has the potential to affect myriad clinical processes across an institution—a diffuse effect. Nevertheless, these clinical processes converge on outcomes that can be placed in a limited number of discrete, identifiable groups (fig 3). For example, mortality, quality of life scores, patient satisfaction, and numbers treated are the final common pathway for hundreds, if not thousands, of individual clinical processes.


Fig 3 Interventions applied towards the far left of an extended causal chain can have diffuse effects on clinical processes but show convergence on outcomes

Selecting end points

It may be impractical to measure the effectiveness of an intervention with diffuse effects by observing hundreds or thousands of downstream clinical processes. For example, Donchin and colleagues estimated that patients in intensive care units experience a mean of 178 clinical processes every day.16 The hospital as a whole would provide many thousands of actions that could be affected by a change such as the ratio of doctors to patients. The effect on each clinical process might be so small that impracticably large samples would be required to avoid high probabilities (or the near certainty) of false null results. It would be logistically taxing to enumerate compliance with all (or even a meaningful proportion) of the clinical processes that might be affected downstream.
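The risk of a false null result when the effect on any single clinical process is small can be illustrated with a short power calculation. This is a sketch using the normal approximation for two proportions. The compliance rates and the sample size are hypothetical values chosen for illustration, not data from the studies cited.

```python
from statistics import NormalDist


def power_two_proportions(p1: float, p2: float, n_per_group: float,
                          alpha: float = 0.05) -> float:
    """Approximate power to detect a difference between proportions p1 and p2.

    Normal approximation, two-sided alpha, equal numbers in each group.
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    std_error = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group) ** 0.5
    return nd.cdf(abs(p1 - p2) / std_error - z_alpha)


# Hypothetical: compliance with one clinical process rises from 90% to 91%,
# with 500 opportunities observed before and 500 after the intervention.
power = power_two_proportions(0.90, 0.91, n_per_group=500)
print(f"power: {power:.2f}, probability of a false null result: {1 - power:.2f}")
```

With these hypothetical values the power is well below 20%, so a null result would be the expected finding even if the intervention worked as hoped.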

Given limited resources, it makes more sense to study the effects of such interventions by using outcomes on which large numbers of processes converge and which often can be measured at low cost. Patient outcomes also encapsulate the net effect of generic interventions on many individual processes, some of which may be negatively affected; various positive and negative effects are consolidated among a limited number of outcomes. However, this still leaves the question of the sample size required to investigate such outcomes.

Cost effectiveness of studies using patient level end points

Rare problems and small effect sizes

Sometimes it is not cost effective, or logistically possible, to measure the effectiveness of policy or service delivery interventions at either the clinical process or patient outcome level. In the case of targeted service delivery this situation arises in the context of rare incidents, such as transfusion of incompatible blood. For policy and generic service interventions the problem arises when the cost of the intervention is small relative to the magnitude of the plausible effect size.

In England and Wales, the National Institute for Health and Clinical Excellence (NICE) uses a heuristic maximum of between £20 000 (€24 000; $31 000) and £30 000 for a healthy life year.17 An intervention such as a clinical computing system costing £10m a year might sound expensive, but would average £200 per patient in a hospital with 50 000 admissions a year. It would have to save only around two lives (of five years mean duration in good health) per 1000 patients admitted to be cost effective. In such a case the cost per life year saved is calculated as:

(£200 × 1000 patients) / (2 lives × 5 years) = £20 000 per life year

(Discounting at 3.5% a year increases the cost slightly to £21 500, which is still below the NICE threshold.)
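The arithmetic behind these figures can be sketched as follows. The undiscounted calculation uses only numbers given in the text. The discounting convention (first year undiscounted, whole years thereafter) is an assumption, so the discounted result is close to, but not necessarily identical with, the figure quoted above.

```python
def discounted_years(years: int, rate: float) -> float:
    """Present value of one healthy year received in each of `years` years.

    Assumption: the first year is not discounted (t = 0, 1, ..., years - 1).
    """
    return sum(1 / (1 + rate) ** t for t in range(years))


def cost_per_life_year(cost_per_patient: float, lives_saved_per_1000: float,
                       years_per_life: int, rate: float = 0.0) -> float:
    """Cost per healthy life year gained, worked out per 1000 patients admitted."""
    cost_for_1000_patients = cost_per_patient * 1000
    life_years = lives_saved_per_1000 * discounted_years(years_per_life, rate)
    return cost_for_1000_patients / life_years


# Figures from the text: £10m a year spread over 50 000 admissions.
cost_per_patient = 10_000_000 / 50_000
undiscounted = cost_per_life_year(cost_per_patient, 2, 5)
discounted = cost_per_life_year(cost_per_patient, 2, 5, rate=0.035)
print(f"cost per patient: £{cost_per_patient:,.0f}")
print(f"undiscounted cost per life year: £{undiscounted:,.0f}")
print(f"discounted cost per life year (assumed convention): £{discounted:,.0f}")
```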

If we assume a baseline mortality of 10%, as in fig 4, 700 000 patients would be required to detect a change of two lives per 1000 patients in a simple before and after study. Furthermore it would be risky to make a causal inference on so small a difference (0.2 percentage points) from a study with no contemporaneous controls—moderate biases are more important when measured differences are small. Even more patients would be needed to conduct a more valid cluster study incorporating a sample of control hospitals that were not exposed to the intervention.


Fig 4 Sample size needed in a simple before and after study to detect reductions in mortality from a baseline of 10% using conventional 80% power and 5% significance levels. Each death avoided is assumed to result in a patient benefit of five years of healthy life, which is used to generate the cost ceiling for the intervention using a threshold of £20 000 per quality adjusted life year
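The relation shown in figure 4 can be approximated with a short sketch based on the normal approximation for two proportions. Treating the quoted sample size as the total across equal "before" and "after" periods is an assumption made here. The baseline mortality, power, significance level, years gained per death avoided, and threshold are taken from the caption.

```python
from statistics import NormalDist


def total_sample_size(baseline: float, reduction: float,
                      alpha: float = 0.05, power: float = 0.80) -> float:
    """Total patients (before plus after) to detect an absolute fall in mortality."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    new_rate = baseline - reduction
    variance = baseline * (1 - baseline) + new_rate * (1 - new_rate)
    n_per_period = z ** 2 * variance / reduction ** 2
    # Assumption: equal numbers are observed before and after the intervention.
    return 2 * n_per_period


def cost_ceiling_per_patient(reduction: float, years_per_death_avoided: float = 5,
                             threshold: float = 20_000) -> float:
    """Highest cost per patient at which the intervention meets the threshold."""
    return reduction * years_per_death_avoided * threshold


# Absolute reductions in mortality, expressed as deaths avoided per 1000 patients.
for per_1000 in (10, 5, 2, 1):
    reduction = per_1000 / 1000
    print(f"{per_1000:>2} per 1000: about {total_sample_size(0.10, reduction):,.0f} "
          f"patients, cost ceiling £{cost_ceiling_per_patient(reduction):,.0f} per patient")
```

Because the sample size varies with the inverse square of the effect, halving the detectable reduction roughly quadruples the number of patients needed.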

The above estimate of effect (two lives saved per 1000 admissions) is not unduly pessimistic. Many people were shocked to hear that one in 400 inpatients died as a result of deficiencies in their care in the famous Harvard malpractice study.18 If this could be halved (arguably an ambitious target), hospital mortality would decline by 0.125 percentage points (that is, by less than 2 in 1000). Figure 4 shows that the rate at which sample size increases as a function of diminishing effect size is such that detecting plausible effects of an intervention on death rates may be not only expensive but logistically impossible.19

Modelling cost effectiveness

Sometimes it is difficult to decide whether it would be cost effective to carry out a study that is powered on the basis of patient level outcomes (clinical processes or patient outcomes). In many cases, the decision can be informed by modelling effectiveness and cost effectiveness.

Effectiveness is modelled by mapping the pathway through which the intervention is hypothesised to work. For instance, the plausible effectiveness of a programme of rotating ward closures was based on observed rates of bacterial recontamination of cleaned surfaces and expert opinion on plausible consequences for hospital acquired infection.20

Cost effectiveness can then be modelled by offsetting putative benefits against costs. A simple “back of the envelope” calculation may be informative.21 In the above example, it turned out that the costs (particularly opportunity costs of ward closure) were not commensurate with even the most optimistic expert estimates of benefit.20 More often simple models will reveal an “inconvenient truth” that cost effective effects on patient level outcomes are plausible but too small to be easily detectable, as in the example of the hospital computer system above. However, in some cases, particularly in developing country contexts or when outcomes other than mortality are salient, the effect sizes may allow cost effective measurement. In these cases we advocate the use of bayesian value of information modelling, which has been used in health technology assessment,22 23 24 to investigate the cost effectiveness of proposed studies and to select the sample size that offers best value for money.
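As a rough illustration of the value of information idea, the sketch below computes the expected value of perfect information (EVPI) when uncertainty about incremental net benefit per patient is described by a normal distribution. This is the simplest variant of such modelling, not the method used in the cited assessments. It gives an upper bound on what any study could be worth, whereas choosing a sample size would call for the expected value of sample information, which is not shown. All numbers are hypothetical.

```python
from statistics import NormalDist


def evpi_per_patient(mean_inb: float, sd_inb: float) -> float:
    """Expected value of perfect information per patient.

    Assumption: incremental net benefit (in pounds) is normally distributed with
    the given mean and standard deviation; the intervention is adopted only if
    the mean is positive.
    """
    nd = NormalDist()
    z = abs(mean_inb) / sd_inb
    # Unit normal linear loss function evaluated at z.
    return sd_inb * (nd.pdf(z) - z * (1 - nd.cdf(z)))


def population_evpi(mean_inb: float, sd_inb: float, patients_affected: float) -> float:
    """EVPI scaled to the number of patients whose care the decision affects."""
    return evpi_per_patient(mean_inb, sd_inb) * patients_affected


# Hypothetical: expected net benefit of £20 per patient, with a standard
# deviation of £100, for a decision affecting 250 000 future patients.
ceiling = population_evpi(20.0, 100.0, 250_000)
proposed_study_cost = 2_000_000  # hypothetical
print(f"upper bound on the value of further research: £{ceiling:,.0f}")
print("study may be worth considering" if ceiling > proposed_study_cost
      else "study cannot be cost effective")
```

A proposed study that costs more than this upper bound cannot offer value for money, however large its sample.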

Alternatives to measuring effects at patient level

Situations where interventions may be cost effective but are unlikely to produce measurable effects on patient level end points raise the question of what is to be done. We take our cue from Walter Charleton, who said in the 17th century that “The ‘reasonable man’ will not require demonstrations or proofs that ‘exclude all dubiosity, and compel assent,’ but will accept moral and physical proofs that are the best that may be gained.”25 When end points at the patient level are unlikely to be sensitive to the intervention, evaluations must turn on theory and on other types of observation. These observations will be components of a general framework for all evaluations1 3 4 6 15 and include the outcomes of previous studies of similar interventions, results of preimplementation evaluations (alpha testing),3 and observations upstream in the extended Donabedian chain. These upstream observations include the fidelity of uptake of the intervention15 and effects on intervening variables,15 and could require the synthesis of quantitative and qualitative data.4

Clinical processes and outcome can still be measured even though they are not the primary end point. This will determine whether the observed effect is larger than expected and make data available for possible future systematic reviews. However, these patient level end points will not be used to determine sample size. Care must also be taken not to misinterpret a null result (no evidence of effect) as evidence of no effect since studies that examine upstream effects are not powered to detect changes in patient level end points.

Consider, for example, the effect of an online intervention package to improve the general knowledge and attitudes of all clinical staff towards patient safety. Here it might be asking too much to expect to observe improvements in clinical processes or outcomes. The finding that staff attended the educational events, reported positively on the experience, and had improved scores on a reliable patient safety culture tool may provide sufficient encouragement to continue the intervention, especially against a theoretical backdrop linking culture to safety built up from studies in many healthcare and non-healthcare settings.26


The classification we propose is based on a deconstructed version of Donabedian’s process level and does not readily map on to other classifications such as safety versus quality. The insight derived from the distinction between targeted and generic service interventions relates to the downstream effects of the intervention—targeted interventions with narrow effects versus policy and generic service interventions with diffuse effects.

We have described the causal chain as operating from left to right. However, bidirectional flow is plausible in some circumstances—a specific targeted intervention may produce upstream (feedback) effects. These in turn could bring about downstream changes (feed-forward) in a related activity. For example, introduction of clinical guidelines for asthma care in general practice may sensitise clinicians to the use of guidelines in general and thereby produce improvements in diabetes care.27 This phenomenon is sometimes referred to as the “halo effect,”28 although such spillover effects can also be harmful—for example, incentives to reduce waiting times for investigation of possible cancer may deflect attention away from other important diseases. The corollary of potential spillover effects when targeted specific interventions are implemented is that the end points observed may need to be widened to take account of plausible positive and negative effects in related practices.

Multicomponent service interventions may comprise both generic and targeted elements, such as the Health Foundation’s Safer Patients Initiative, which seeks to promote leadership and safety culture while strengthening specific practices by, for instance, promulgating evidence based guidelines.29 An evaluation in this case should consist of observations relevant to both generic elements (such as measurements of effects on intervening variables and perhaps outcomes) and the specific components (where targeted clinical processes are relevant).

Surrogate outcomes and publication bias

The observation that it may not be possible to detect worthwhile effects at the patient level inevitably places greater weight on upstream end points, which become surrogates for patient outcomes. It is therefore important to increase our knowledge of the construct validity of intervening variables such as culture, leadership, and morale. The literature correlating these upstream variables with patient outcomes is likely to be distorted by publication bias—an endemic problem in clinical epidemiology.30 Thus authors should consider using statistical methods that provide evidence of publication bias, as in a recent study of service interventions to improve acute paediatric care in developing countries.31 Readers should be aware that when they encounter strongly positive results, they may be sampling the most optimistic tail of a distribution of results, most of which is hidden from view. Suspicion that one may be dealing with an example of publication bias must be heightened if the results exceed the most optimistic of prior expectations.

Bayesian methods and decision analysis

Our analysis has been couched, for the most part, in terms of primary end points and statistical methods for hypothesis testing. This is partly because these conform to contemporary methodological models in quantitative research and partly because they provide convenient “handles” to help describe the underlying ideas. These ideas would still be relevant under alternative models where, for example, multiple end points were weighted on a sliding scale according to their contribution to a decision analysis model.32 Likewise, the relation between interventions and effect size would be relevant when considering cost effective sample sizes in a bayesian model.24 Here changes in credible limits and the centre of updated probability distributions would be the relevant considerations, but it would still be necessary to think carefully about sample size, cost effectiveness, and the distinction between interventions with diffuse and narrow effects. Bayesian methods can also be used to integrate multiple observations (including qualitative data).33
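How credible limits and the centre of a probability distribution shift as data accrue can be shown with a minimal conjugate normal update. This is a sketch under a simplifying assumption that both the prior belief about an effect and the study estimate are normally distributed. It does not represent the models in the cited work, and the numbers are hypothetical.

```python
from statistics import NormalDist


def update_normal(prior_mean: float, prior_sd: float,
                  estimate: float, std_error: float) -> tuple[float, float]:
    """Posterior mean and standard deviation from a normal prior and a normal estimate."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = 1 / std_error ** 2
    post_variance = 1 / (prior_precision + data_precision)
    post_mean = post_variance * (prior_precision * prior_mean
                                 + data_precision * estimate)
    return post_mean, post_variance ** 0.5


def credible_interval(mean: float, sd: float, level: float = 0.95) -> tuple[float, float]:
    """Central credible interval for a normal distribution."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * sd, mean + z * sd


# Hypothetical: prior belief of a 0.2 percentage point fall in mortality
# (standard deviation 0.3), updated by a study estimate of 0.5 (standard error 0.4).
post_mean, post_sd = update_normal(0.2, 0.3, 0.5, 0.4)
low, high = credible_interval(post_mean, post_sd)
print(f"posterior mean {post_mean:.2f}, 95% credible interval {low:.2f} to {high:.2f}")
```

The posterior standard deviation is always smaller than the prior one, and the larger the study (the smaller its standard error), the more the credible limits narrow, which is what ties sample size to the value of the study.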

Representational nature of the model

The model we present is more fine grained than Donabedian’s original framework, but even so the process level could have been divided into more than three sublevels; the underlying construct is, in all likelihood, a continuum. The nub of the argument is that the further to the left an intervention is applied, the greater the number of downstream end points that might be affected until a point is reached where there are too many to capture at the clinical processes level. However, the effect on patient outcomes might be too small to detect, yet worth while, given an intervention that is inexpensive on a per patient basis. Our framework, like all representational models, is a “simplified view of the world to help us think about complex issues, but is not a true representation of the complexity itself.”34 Just as the map of the London underground does not need to represent the geography of track and stations literally to be helpful, so we hope that our model will be useful for those who navigate the complex intellectual terrain of policy and service evaluation.

Summary points

  • Management interventions may be divided into two categories: targeted service interventions with narrow effects, and generic service interventions that (like policy interventions) have diffuse effects

  • Measurement of clinical processes rather than patient outcomes may be more cost effective in evaluations of targeted service interventions

  • Clinical processes are not usually suitable primary end points for policy and generic service interventions because the effects at this level are too diffuse

  • Multiple clinical processes are consolidated on a small number of outcomes, which are the default primary end point for policy and generic service interventions

  • When the policy or generic service intervention is inexpensive, cost effective and plausible outcomes may be undetectable at the patient level

  • In such cases the effects of the intervention can still be studied at process levels further to the left (upstream) in an extended version of Donabedian’s causal chain




  • We thank Peter Lilford, Tim Hofer, Mary Dixon-Woods, Mohammed Mohammed, Cor Kalkman, Jon Nicholl, William Runciman, and Tim Cole for helpful comments.

  • Contributors RJL conceived the idea for the paper and drafted the initial core manuscript; KH, CAT, and AJG performed statistical analysis; PJC, KH, CAT, AJG, and PB contributed sections and critically reviewed and commented on the document. RJL is the guarantor.

  • Funding: National Institute for Health Research Collaborations for Leadership in Applied Health Research and Care for Birmingham and Black Country; European Union FP7 handover project: improving the continuity of patient care through identification and implementation of novel patient handoff processes in Europe; and the MATCH programme (EPSRC Grant GR/S29874/01). The views expressed in this work do not necessarily reflect those of the funders.

  • Competing interest: All authors have completed the unified competing interest form at (available on request from the corresponding author) and declare financial support for the submitted work from the National Institute for Health Research Collaborations for Leadership in Applied Health Research and Care for Birmingham and Black Country, European Union FP7 handover project: improving the continuity of patient care through identification and implementation of novel patient handoff processes in Europe, and the MATCH programme; no financial relationships with commercial entities that might have an interest in the submitted work; no spouses, partners, or children with relationships with commercial entities that might have an interest in the submitted work; and no non-financial interests that may be relevant to the submitted work.

  • Provenance and peer review: Not commissioned; externally peer reviewed.