
Research

Evaluating the credibility of anchor based estimates of minimal important differences for patient reported outcomes: instrument development and reliability study

BMJ 2020; 369 doi: https://doi.org/10.1136/bmj.m1714 (Published 04 June 2020) Cite this as: BMJ 2020;369:m1714
  1. Tahira Devji, methodologist1,
  2. Alonso Carrasco-Labra, methodologist1,
  3. Anila Qasim, methodologist1,
  4. Mark Phillips, methodologist1,
  5. Bradley C Johnston, associate professor1 2,
  6. Niveditha Devasenapathy, associate professor3,
  7. Dena Zeraatkar, methodologist1,
  8. Meha Bhatt, methodologist1,
  9. Xuejing Jin, methodologist4,
  10. Romina Brignardello-Petersen, methodologist1,
  11. Olivia Urquhart, methodologist5,
  12. Farid Foroutan, methodologist1,
  13. Stefan Schandelmaier, methodologist1,
  14. Hector Pardo-Hernandez, methodologist6 7,
  15. Robin WM Vernooij, methodologist8,
  16. Hsiaomin Huang, methodologist9,
  17. Yamna Rizwan, methodologist10,
  18. Reed Siemieniuk, methodologist1,
  19. Lyubov Lytvyn, methodologist1,
  20. Donald L Patrick, professor11,
  21. Shanil Ebrahim, assistant professor1,
  22. Toshi Furukawa, professor12,
  23. Gihad Nesrallah, nephrologist13 14 15,
  24. Holger J Schünemann, professor1 16,
  25. Mohit Bhandari, professor1 17,
  26. Lehana Thabane, professor1,
  27. Gordon H Guyatt, distinguished professor1 16
  1. Department of Health Research Methods, Evidence and Impact, McMaster University, 1280 Main Street West, Hamilton, ON L8S 4L8, Canada
  2. Department of Community Health and Epidemiology, Dalhousie University, Halifax, NS, Canada
  3. Indian Institute of Public Health, Public Health Foundation of India, Gujarat, India
  4. School of Public Health, University of Alberta, Edmonton, AB, Canada
  5. Center for Evidence Based Dentistry, American Dental Association, Chicago, IL, USA
  6. Iberoamerican Cochrane Centre, Sant Pau Biomedical Research Institute (IIB Sant Pau), Barcelona, Spain
  7. CIBER de Epidemiología y Salud Pública (CIBERESP), Barcelona, Spain
  8. Department of Research, Comprehensive Cancer Organisation, Utrecht, Netherlands
  9. Department of Orthopedic Surgery, University of Michigan, Ann Arbor, MI, USA
  10. Department of Molecular and Cellular Biology, University of Guelph, Guelph, ON, Canada
  11. Department of Health Services, University of Washington, Seattle, WA, USA
  12. Department of Health Promotion and Human Behaviour, School of Public Health, Kyoto University Graduate School of Medicine, Kyoto, Japan
  13. Nephrology Program, Humber River Regional Hospital, Toronto, ON, Canada
  14. Division of Nephrology, University of Western Ontario, London, ON, Canada
  15. Li Ka Shing Knowledge Institute, St Michael's Hospital, Toronto, ON, Canada
  16. Department of Medicine, McMaster University, Hamilton, ON, Canada
  17. Department of Surgery, McMaster University, Hamilton, ON, Canada
  Correspondence to: T Devji devjits@mcmaster.ca (or @TahiraDevji on Twitter)
  • Accepted 31 March 2020

Abstract

Objective To develop an instrument to evaluate the credibility of anchor based minimal important differences (MIDs) for outcome measures reported by patients, and to assess the reliability of the instrument.

Design Instrument development and reliability study.

Data sources Initial criteria were developed for evaluating the credibility of anchor based MIDs based on a literature review (Medline, Embase, CINAHL, and PsycInfo databases) and the experience of the authors in the methodology for estimation of MIDs. Iterative discussions by the team and pilot testing with experts and potential users facilitated the development of the final instrument.

Participants With the newly developed instrument, pairs of masters, doctoral, or postdoctoral students with a background in health research methodology independently evaluated the credibility of a sample of MID estimates.

Main outcome measures Core credibility criteria applicable to all anchor types, additional criteria for transition rating anchors, and inter-rater reliability coefficients were determined.

Results The credibility instrument has five core criteria: the anchor is rated by the patient; the anchor is interpretable and relevant to the patient; the MID estimate is precise; the correlation between the anchor and the outcome measure reported by the patient is satisfactory; and the authors select a threshold on the anchor that reflects a small but important difference. The additional criteria for transition rating anchors are: the time elapsed between baseline and follow-up measurement for estimation of the MID is optimal; and the correlations of the transition rating with the baseline, follow-up, and change score in the patient reported outcome measures are satisfactory. Inter-rater reliability coefficients (κ) for the core criteria and for one item from the additional criteria ranged from 0.70 to 0.94. Reporting issues prevented the evaluation of the reliability of the three other additional criteria for the transition rating anchors.

Conclusions Researchers, clinicians, and healthcare policy decision makers can consider using this instrument to evaluate the design, conduct, and analysis of studies estimating anchor based minimal important differences.

Introduction

The role of the patient’s perspective in clinical research has increased over the past 30 years. The use of questionnaires that assess health status from the patient’s perspective—patient reported outcome measures—is an important strategy for determining the effect of interventions. Despite improvements in establishing their validity, reliability, and responsiveness, interpretation of outcome measures reported by patients remains challenging.

Interpretability concerns determining which differences in scores for patient reported outcome measures constitute trivial, small but important, moderate, or large differences.12 To help in the design and interpretation of trials evaluating the effect of an intervention on patient reported outcomes, researchers developed a concept called the minimal important difference (MID).34 The MID provides a measure of the smallest change in an outcome measure that patients perceive as an important improvement or deterioration,34 and can be used as a reference point for judging the magnitude of treatment effects in clinical trials and systematic reviews.

The widely accepted optimal approach to establishing an MID for a patient reported outcome measure relates a score on the instrument to an independent measure—an external criterion or anchor—that is understandable and relevant to patients.5 The most widely used anchor is the patient’s global rating of change, also referred to as a transition rating. An example of a typical transition rating question would be “Since last month when we started the new treatment, are you feeling better or worse and, if so, to what extent?”, with responses of no change, small but important, moderate or large improvement, or worsening. Other anchors include measures of satisfaction, occurrence of an event, or other patient reported outcome measures assessing health status.

A second, but much less effective, approach to estimating the MID involves distribution based methods. These methods rely solely on the statistical characteristics of the study sample (eg, 0.5 standard deviation of scores for patient reported outcome measures) and fail to incorporate the patient’s perspective.67

The methodology behind an anchor based MID relies on two key components: choice of anchor and statistical method to estimate the MID. Some of these choices are more satisfactory than others; poor choices can lead to MIDs that mislead, and misleading MIDs will result in seriously flawed interpretation of results for patient reported outcome measures in clinical trials and systematic reviews. For MIDs to help inform patient care, investigators and decision makers (including those performing clinical trials, authors of systematic reviews, developers of clinical practice guidelines, and regulatory authorities, and their audiences of clinicians and patients) must be able to distinguish between unreliable and credible or trustworthy MIDs.

How likely are the design and conduct of studies estimating MIDs to have provided robust estimates? Currently, no accepted standards exist for evaluating the credibility of an anchor based MID. Here, we describe the development of an instrument to evaluate the credibility of anchor based MIDs and report the inter-rater reliability of the instrument.

Methods

Development of a credibility instrument

A steering committee was set up that included clinicians, health research methodologists, clinical epidemiologists, and a psychiatrist (TD, AC-L, TF, BCJ, GN, DLP, and GHG), with substantial experience in measuring health status. The steering committee coordinated the development of the credibility instrument, recruited collaborators, prepared and revised documents, and provided administrative support.

Selection and development of candidate credibility criteria

Our research group conducted a systematic review to develop an inventory of anchor based MIDs for patient reported outcome measures (A Carrasco-Labra, personal communication, 2020).8 To develop criteria for assessing the credibility of anchor based MIDs, during study selection for the MID inventory, we simultaneously screened for articles reporting on key issues or considerations about anchor based methods. Specifically, we selected articles with theoretical descriptions, summaries, commentaries, and critiques suggesting one or more criteria for the credibility of any aspect of anchor based methodology for estimation of MIDs. We searched Medline, Embase, CINAHL, and PsycInfo from 1989 to April 2015, to identify relevant articles for both projects. The search strategy, adapted to each database, included terms representing the MID concept and terms relating to patient reported outcome measures (appendix 1).

We used a standardised data extraction form to abstract candidate criteria for establishing the credibility of an anchor based MID from the methods articles selected (appendix 4). We also extracted excerpts for any rationale or explanation provided by authors for why a specific criterion would increase or decrease credibility. After data extraction, through qualitative analysis, we developed a taxonomy with a deductive approach and categorised criteria according to themes.910

The steering committee reviewed and discussed the results of the coded data extraction, and evaluated the themes that emerged from the qualitative analysis. Findings from the survey of the literature, coupled with our group’s experience with methods of establishing MIDs,1256111213141516171819202122232425262728 allowed for full discussion of key credibility concepts. Issues that arose based on our experience were the effect of varying correlations between anchor and target instrument, the effect of duration of time required for recall, the relation between sample size and precision of MID estimates, and the relative merits of alternative statistical approaches for estimation of MIDs. The steering committee reviewed the candidate criteria and evaluated the importance of each. Criteria were eliminated when redundancy or overlap existed, and when they were not optimally relevant. The steering committee drafted an initial version of the instrument, including clearly worded items, associated response options for each item, and instructions for completing each item.

Piloting and user feedback

We conducted an iterative process of pilot testing and user feedback. We presented the initial instrument to a convenience sample of experts (about seven health research methodologists and clinical epidemiologists with expertise in instrument development, MID estimation, and patient reported outcomes) and target users (about two clinicians, 13 authors of systematic reviews, and three guideline developers). These individuals reviewed the clarity, wording, comprehensiveness, and relevance of the items of the instrument, and provided suggestions for the instrument. We incorporated this feedback. Based on subsequent work, including application of the draft instrument to anchor based MID estimation studies in our MID inventory8 and more applications of the instrument to inform the development of a clinical practice guideline,29 we modified the instrument and reduced the number of items. This process continued until the steering group reached consensus that the instrument would prove optimal for use.

Reliability study of the credibility instrument

Sample of MID estimates and raters

In our previously mentioned inventory of anchor based MIDs, we summarised more than 3000 estimates and their associated credibility, including MIDs for patient reported outcome measures across different populations, conditions, and interventions, obtained with different anchors and statistical methods.8 We enlisted help from masters, doctoral, and postdoctoral students with a background in health research methodology to conduct study screening, data extraction, and the credibility assessment. Before starting the review process, the reviewers received extensive training on the methodology of MIDs, including background reading of key methods articles on MIDs, web teleconferences to review screening and data extraction materials, and pilot and calibration exercises. Teams of two reviewers independently extracted relevant data from the studies selected for each MID estimate, collecting information on study design, characteristics of the patient reported outcome measure, anchor and analytical method, sample size, the MID estimate and associated measure of precision, and time elapsed between administration and follow-up assessments of the patient reported outcome measure and anchor (for longitudinal designs). The reviewers applied the newly developed instrument to evaluate the credibility of the MID estimates.

Sampling method

For a random sample of 200 MID estimates from our inventory, we retrieved the credibility assessments performed by each pair of reviewers with the newly developed instrument. We sampled in excess (see sample size below) to account for potential discrepancies in the MIDs extracted between reviewers and incomplete data (eg, where one reviewer might have missed an MID reported in the study, we would only have one credibility assessment). Because the questions in the extension of the credibility instrument apply only to MIDs estimated with transition rating anchors, and only 50% of the initial sample of 200 MIDs used transition anchors, we randomly sampled an additional 50 MID estimates to meet the required sample size for the relevant reliability analyses. To ensure observations in our sample were independent of each other, when one study reported multiple MIDs, we included only the first MID estimate extracted for that study.

Sample size

We tested the reliability of our credibility instrument with classical test theory.30 Because assessments of credibility involve subjective judgments and different individuals collecting data might experience and interpret phenomena of interest differently, we measured inter-rater reliability. According to Shoukri,31 considering two raters per MID estimate, an expected reliability of 0.7, with a desired 95% confidence interval width of 0.2, and an α of 0.05, would require a minimum of 101 MIDs assessed per rater.
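To illustrate the logic of such a sample size calculation, the following toy simulation sketch shows how the expected width of the 95% confidence interval around κ shrinks as the number of rated MIDs grows. It uses a simple binary-agreement model with 50/50 marginals, not Shoukri's closed-form method or the configuration assumed in our study, so its output will not reproduce the n=101 figure; the required n depends on the number of categories and marginal distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_ci_width(n, target_kappa=0.7, n_sims=2000):
    # Binary ratings with 50/50 marginals: chance agreement p_e = 0.5, so
    # kappa = 2 * p_agree - 1, giving p_agree = (1 + target_kappa) / 2.
    p_agree = (1 + target_kappa) / 2
    kappas = np.empty(n_sims)
    for s in range(n_sims):
        r1 = rng.integers(0, 2, size=n)
        agree = rng.random(n) < p_agree       # rater 2 agrees with this probability
        r2 = np.where(agree, r1, 1 - r1)
        p_o = np.mean(r1 == r2)
        kappas[s] = 2 * p_o - 1               # kappa estimate with p_e fixed at 0.5
    return 3.92 * kappas.std()                # approximate 95% CI width

for n in (100, 150, 200, 250):
    print(f"n = {n}: expected 95% CI width ~ {expected_ci_width(n):.3f}")
```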

Analysis

For each item of the instrument, we calculated inter-rater reliability and the associated 95% confidence interval, measured by a weighted kappa, κ, with quadratic weights assigned by the formula: w_i = 1 − i²/(k − 1)², where i is the difference between categories (response options) and k is the total number of categories. The use of quadratic weights implies that response options for the credibility criteria are ordinal and equidistant. In the absence of information in the primary study to make an informed judgment, the “impossible to tell” response option can be used (see credibility instrument in the results section below). We consider that this rating reflects low certainty in terms of credibility and thus we combined these responses with ratings of “definitely no” in the reliability analysis. We considered a reliability coefficient of at least 0.7 to represent good inter-rater reliability.323334
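As an illustration, here is a minimal Python sketch of the weighted kappa computation described above, with "impossible to tell" collapsed into "definitely no" before analysis as stated. The label set and example ratings are hypothetical, and counting k over the four collapsed categories is our assumption about how k was taken.

```python
import numpy as np

LEVELS = ["definitely no", "not so much", "to a great extent", "definitely yes"]

def collapse(rating):
    # Low-certainty ratings are treated as "definitely no" (see Analysis above)
    return "definitely no" if rating == "impossible to tell" else rating

def quadratic_weighted_kappa(ratings1, ratings2, levels=LEVELS):
    k = len(levels)
    idx = {label: i for i, label in enumerate(levels)}
    a = np.array([idx[collapse(x)] for x in ratings1])
    b = np.array([idx[collapse(x)] for x in ratings2])
    # Observed joint distribution and chance-expected distribution
    obs = np.zeros((k, k))
    for i, j in zip(a, b):
        obs[i, j] += 1
    obs /= len(a)
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    # Quadratic agreement weights: w = 1 - (i - j)^2 / (k - 1)^2
    d = np.abs(np.subtract.outer(np.arange(k), np.arange(k)))
    w = 1 - d ** 2 / (k - 1) ** 2
    po = (w * obs).sum()
    pe = (w * expected).sum()
    return (po - pe) / (1 - pe)

ratings_a = ["definitely yes", "to a great extent", "impossible to tell",
             "definitely no", "not so much", "definitely yes"]
ratings_b = ["definitely yes", "not so much", "definitely no",
             "definitely no", "to a great extent", "definitely yes"]
print(f"weighted kappa = {quadratic_weighted_kappa(ratings_a, ratings_b):.2f}")
```

As a cross-check, sklearn.metrics.cohen_kappa_score(codes_a, codes_b, weights="quadratic") computes the same statistic on the collapsed integer codes.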

Patient and public involvement

Patients and the public were not involved in the design, conduct, or reporting of this methodological research, as our instrument is a critical appraisal tool that is intended for researchers and decision makers who require MIDs for interpretation of patient reported outcome measures, including clinical trial investigators, authors of systematic reviews, guideline developers, clinicians, funders, and policy makers.

Results

We identified 41 relevant articles on MID methods4561422272835363738394041424344454647484950515253545556575859606162636465666768 that informed the item generation stage of the development of the instrument (fig 1). There were two major modifications from the first draft69 to the final instrument. In the first, we removed three items (items 2, 4, and 6 of the first draft) because of issues of redundancy and relevance; rephrased one item dealing with the extent to which the anchor and the patient reported outcome measure are measuring the same construct (item 5 of the first draft); and added one new item looking at the precision around the point estimate of the MID. In the second modification, we added a new item evaluating whether the anchor threshold selected for estimation of the MID reflected a small but important difference, and developed more criteria for assessing the credibility of a transition rating anchor (described below).

Fig 1

Selection of studies for the development of the minimal important difference (MID) inventory and the credibility instrument

Credibility instrument

The instrument has five criteria essential for determining the credibility of any anchor based MID (table 1). In our inventory of anchor based MIDs8 and a separate systematic review to identify MIDs for knee specific patient reported outcome measures,29 we found that MIDs were most often derived with transition rating anchors. Transition rating anchors require patients to recall a previous health state and compare it with how they are feeling now. This reliance on retrospection requires criteria to ensure that transition ratings accurately reflect the change in health status and are not unduly influenced by baseline or endpoint status; thus, for this situation, we developed a four item extension of the core credibility instrument (table 1). Below, we describe each question in the instrument with an explanation of the relevance of the item for evaluating credibility (the full version of the instrument is in appendix 2, with three worked examples in appendix 3, where we have applied our instrument to assess the credibility of three MID estimates, each from a published study).

Table 1

Credibility instrument for judging the trustworthiness of minimal important differences


Except for the first item, which has a yes or no response, each item has a five point adjectival scale. The response options for items in the instrument are: definitely yes; to a great extent; not so much; definitely no; and impossible to tell. A response of definitely yes indicates no concern about the credibility of the MID estimate. Responses of definitely yes and definitely no imply that information provided in the MID report under evaluation allows an unequivocal judgment in relation to the item; the responses “to a great extent” and “not so much” indicate less certainty. In the absence of information or sufficient detail to make an informed judgment about credibility, the response option “impossible to tell” can be used.

Explanation of the core credibility items

Item 1: Is the patient or necessary proxy responding directly to both the patient reported outcome measure and the anchor?

An anchor based method for estimating an MID involves linking a specific patient reported outcome measure (eg, short form 36, Beck depression inventory, chronic respiratory questionnaire) to an external criterion, such as a patient or physician transition rating, another patient reported outcome measure, or a clinical endpoint (eg, concentration of haemoglobin, Eastern Cooperative Oncology Group performance status). Patient reported anchors are more desirable than clinical measures or those that are assessed by a clinician. Situations where the patient cannot directly provide information to inform the outcome (eg, elderly individuals with dementia, infants, and pre-verbal toddlers) require a proxy respondent. We suggest using the same standards recommended for a patient directly responding to the outcome measure when evaluating the credibility of MIDs for a necessary proxy reported outcome measure on behalf of the patient.

Item 2: Is the anchor easily understandable and relevant for patients or necessary proxy?

A suitable anchor is one that is easily understandable and is highly relevant to patients. Typical appropriate anchors are global ratings of change in health status,19707172 status on an important and easily understood measure of function,73 the presence of symptoms,74 disease severity,75 response to treatment,7576 or the prognosis for future events, such as death,747778 use of healthcare facilities,79 or job loss.748081

Item 3: Has the anchor shown good correlation with the patient reported outcome measure?

The usefulness of anchor based approaches is critically dependent on the relation between the patient reported outcome measure and the anchor. When determining the credibility of the MID, we consider how closely the anchor is related to the target patient reported outcome measure and give greater importance to MIDs generated from closely linked concepts; the anchor and patient reported outcome measure should be measuring the same or similar underlying constructs, and therefore should be appreciably correlated. A moderate to high correlation (at least 0.5) suggests the validity of the anchor.148283 An anchor that has low or no correlation with the patient reported outcome measure will likely give inaccurate MID estimates. The instrument has a guide for judging the correlation coefficient.
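As a small illustration of applying this guide, the following sketch computes the correlation between hypothetical anchor and patient reported outcome measure scores and applies the 0.5 threshold noted above; all data and names are invented.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
anchor = rng.normal(size=200)                                   # hypothetical anchor scores
prom_change = 0.6 * anchor + rng.normal(scale=0.8, size=200)    # hypothetical PROM change

r, _ = pearsonr(anchor, prom_change)
verdict = "satisfactory (>= 0.5)" if abs(r) >= 0.5 else "low; MID estimate suspect"
print(f"anchor-PROM correlation r = {r:.2f}: {verdict}")
```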

Item 4: Is the MID precise?

To judge precision, we focus on the 95% confidence interval around the point estimate of the MID. We provide a guide for judging precision when the investigators report the 95% confidence interval around the MID estimate based on the likelihood that inferences about the magnitude of a treatment effect would differ at the extremes of the confidence interval. When authors do not provide a measure of precision, the number of patients included in the estimation of the MID gives an alternative criterion for judging precision. We provide guidance on appropriate sample size based on the relation between sample size and precision in studies in the inventory that reported 95% confidence intervals.
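A minimal sketch of judging precision follows, assuming the common design in which the MID is estimated as the mean change score among minimally improved patients; the data are hypothetical, and the t based confidence interval is one standard choice rather than the instrument's prescribed method.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical PROM change scores of patients who rated themselves "a little better"
change_scores = rng.normal(7.5, 10, size=60)

n = len(change_scores)
mid = change_scores.mean()
se = change_scores.std(ddof=1) / np.sqrt(n)      # standard error of the mean
lo, hi = stats.t.interval(0.95, df=n - 1, loc=mid, scale=se)
print(f"MID = {mid:.1f} (95% CI {lo:.1f} to {hi:.1f}, n = {n})")
```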

Item 5: Does the threshold or difference between groups on the anchor used to estimate the MID reflect a small but important difference?

To respond to this credibility question, a judgment is needed on whether the selected threshold or the groups compared on the anchor reflect a small (rather than moderate or large) but important difference. Even after the threshold is set, many analytical methods can be used to compute the MID, and whether the chosen method of analysis truly yields a minimal important difference needs to be determined. Box 1 provides a framework for making these judgments, and box 2 has examples of high and low credibility MIDs estimated with different types of anchors.

Box 1

Considerations for judging whether the minimal important difference represents a small but important difference

  1. What is the original scale of the anchor and is it transformed in any way?

  2. Does the scale (or transformed scale) of the anchor capture variability in the underlying construct?

  3. What is the threshold used or comparison being made on the anchor? Does this threshold or comparison represent a difference that is minimally important?

  4. Does the analytical method ensure that the minimal important difference represents a small but important difference? The last example in box 2 shows how a poorly chosen analytical method could give misguided minimal important difference estimates.

Box 2

Examples of high and low credibility ratings for item 5 of the credibility instrument

High credibility

  • Investigators calculated the minimal important difference (MID) for the pain domain of the Western Ontario and McMaster University Osteoarthritis Index (WOMAC) as the mean change in the WOMAC pain score in patients who reported themselves as “a little better” to the question “how was the pain in your operated hip during the past week, compared with before the operation,” offering response options of extremely better, very much better, much better, better, a little better, a very little better, almost the same or hardly any better, or no change (with parallel responses for worsening).62

  • To estimate the MID for the National Comprehensive Cancer Network-Functional Assessment of Cancer Therapy (NCCN-FACT) Colorectal Cancer Symptom Index (FCSI-9), investigators compared Eastern Cooperative Oncology Group (ECOG) performance status (score 0-4, higher scores signify worse performance status) at follow-up with baseline performance status. The MID for the FCSI-9 was calculated with the beta coefficients from an analysis of variance model where the dependent variable was the FCSI-9 change score from baseline to week 8 and the independent variable was the ECOG performance status.84 The investigators decided on a half unit change in the ECOG performance status as a small but important difference—which is assumed to be reasonable—and this threshold was used to derive the MID for the FCSI-9.

Low credibility

  • Patients responded to: “Compared to before treatment my back problem is a) much better, b) better, c) unchanged, d) worse.” Investigators defined the MID for deterioration for the Oswestry Disability Index by calculating the difference in score between patients who rated themselves as worse and patients who rated themselves as unchanged.85 This rating has low credibility because worse could mean a little worse or much worse (box 1, framework steps 2 and 3).

  • Investigators estimated the MID for the Ability to Perform Physical Activities of Daily Living Questionnaire (APPADL) by taking the difference in mean APPADL change scores between those who achieved 5% or more weight loss from baseline to six months and those who achieved less than 5% weight loss.86 This rating is problematic because how patients whose weight fell by 6% reacted is not clear—we do not know whether the patients were pleased they had made a substantial weight reduction, had considered the change small but important, or had regarded it as trivial. Also, the researchers used a misguided analytical method. In the group of patients classified as having a small but important improvement, they included patients with a 5% reduction in weight together with patients with 10%, 30%, or 50% reductions. Subtracting the mean APPADL change score of the group achieving less than 5% weight loss from that of the group achieving 5% or more could give an estimate of the MID that constitutes a small, moderate, or large difference, depending on the proportion of patients who achieved large percentage weight losses (box 1, framework step 4). Receiver operating characteristic curve analysis would have been a more appropriate choice (a sketch of such an analysis follows box 2).

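For the last example in box 2, a receiver operating characteristic (ROC) analysis could look like the following sketch: patients are dichotomised on the anchor, and the change score cut point that maximises Youden's J is taken as the MID. The data, group definitions, and effect sizes are hypothetical; in practice the "improved" group should reflect a small but important change (box 1).

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)

# Hypothetical PROM change scores: anchor-improved patients shift by ~8 points
improved = rng.normal(8, 6, size=120)        # anchor: minimally improved
not_improved = rng.normal(1, 6, size=120)    # anchor: unchanged

change = np.concatenate([improved, not_improved])
label = np.concatenate([np.ones(120), np.zeros(120)])

fpr, tpr, thresholds = roc_curve(label, change)
mid = thresholds[np.argmax(tpr - fpr)]       # cut point maximising Youden's J
print(f"ROC based MID estimate: {mid:.1f} points")
```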

Explanation of additional items for transition rating anchors

Item 1: Is the amount of elapsed time between baseline and follow-up measurement for MID estimation optimal?

Despite the intuitive appeal of transition questions, patients have considerable difficulty recalling previous health states,144987 and the longer the time patients have to remember, the greater the difficulty.1449 Patients can often remember previous states for up to four weeks14; as time extends into months, patients are more likely to confuse change over time with current status.49

Judgments for items 2-4 of the extension for transition rating anchors require knowledge of the directional characteristics of the patient reported outcome measure and transition scale. In the instrument, we provide guidance for dealing with situations where higher scores on both the patient reported outcome measure and anchor represent the same direction (that is, both represent a worse or better condition) and situations where they represent different directions.

Item 2: Does the transition item have a satisfactory correlation with the score for the patient reported outcome measure at follow-up?

Ideally, the correlations of the transition rating with the score at baseline and with the score at follow-up would be equal and opposite, an ideal that seldom occurs. To the extent that the score at follow-up shows at least some correlation with the transition, the MID estimate is more credible than if there were no correlation.1468

Item 3: Does the transition item correlate with the patient reported outcome measure score at baseline?

If the score at baseline correlates with the transition rating, we are more confident that patients are taking their baseline status into account when scoring the transition rating.1468

Item 4: Is the correlation of the transition item with the patient reported outcome measure change score appreciably greater than the correlation of the transition item with the patient reported outcome measure score at follow-up?

A correlation of at least 0.5 between the transition rating and the change in the patient reported outcome measure is necessary but insufficient to confirm that the transition rating is measuring change, as opposed to current health status. A correlation of the score at follow-up with the transition that is similar to or greater than the correlation of the change with the transition indicates that the rating likely reflects only current status, and thus confidence in the MID estimate decreases.1468 The instrument provides a guide for judging the correlation coefficients described in items 2-4.
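To make the three correlations in items 2-4 concrete, the following sketch computes them on simulated longitudinal data in which the transition rating genuinely tracks change; the variable names and data generating assumptions are ours, not the instrument's.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(4)
z = rng.normal(size=150)
baseline = 50 + 10 * z                                          # hypothetical baseline PROM
follow_up = 55 + 5 * z + rng.normal(scale=75 ** 0.5, size=150)  # correlated follow-up PROM
change = follow_up - baseline
transition = change + rng.normal(scale=5, size=150)             # rating tracks true change

r_follow, _ = spearmanr(transition, follow_up)   # item 2: at least some correlation
r_base, _ = spearmanr(transition, baseline)      # item 3: ideally opposite in sign
r_change, _ = spearmanr(transition, change)      # item 4: should clearly exceed item 2's

print(f"transition vs follow-up: {r_follow:.2f}")
print(f"transition vs baseline:  {r_base:.2f}")
print(f"transition vs change:    {r_change:.2f}")
```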

Overall judgment of credibility

Responses to individual items provide the basis for determining an overall judgment of credibility for the MID estimate. We have deliberately avoided a prescriptive approach for reaching an overall judgment and have not scored items, because the relative weights of individual items within the instrument are uncertain and depend on context. Thus the overall credibility judgment for a given MID estimate requires consideration of the severity of the credibility issue for a particular item and the consequence of this issue.

Reliability analyses

The analysis for the assessment of inter-rater reliability included 135 MIDs assessed by two raters for the core credibility criteria and 137 for the first item in the extension criteria. For the remaining items in the extension for transition rating anchors, only 12 studies reported the correlation between the score at follow-up and transition rating described in items 2 and 4, and 10 studies provided the correlation between the score at baseline and transition rating required for item 3. Because of the limited sample sizes, we could not conduct an evaluation of the inter-rater reliability for these items.

Overall, the inter-rater reliability for all items ranged from good (Cohen’s κ ≥0.7) to very good (≥0.8) agreement (table 2). The item from the extension criteria looking at duration of follow-up had the highest value for Cohen’s κ, and the item on whether the anchor is understandable and relevant, the lowest.

Table 2

Inter-rater reliability coefficients


Discussion

Principal findings

We have developed a credibility instrument to evaluate the design, conduct, and analysis of studies measuring anchor based MIDs. Our instrument is a critical appraisal tool that provides a systematic step-by-step approach to deciding whether a study claiming to establish an MID has trustworthy results. The five criteria in the core credibility instrument proved reliable, with good to excellent agreement between reviewers. The items on whether the anchor is understandable and relevant, and whether the threshold on the anchor represents a small but important difference, had lower, but still satisfactory, inter-rater reliability estimates.

Strengths and limitations of the study

Strengths of the study include the use of the literature and the expertise of the study team in the development of our criteria, and modifications based on expert feedback and extensive experience in applying the instrument. Similar methods have proved successful for developing methodological quality appraisal standards across a wide range of topics.8889909192 We undertook a rigorous assessment that showed the high reliability of the instrument.

Our study has limitations. Although a multidisciplinary team with a broad range of content and methodological expertise led the development of the credibility instrument, these individuals represent only a fraction of the experts in patient reported outcome and MID methodology worldwide. Researchers have not reached consensus on optimal anchor based approaches, types of anchors, and analytical methods, and methodological issues might subsequently emerge that require modification of the instrument.

Reviewers who participated in our reliability study had graduate level methodology training, received extensive instruction on MID methodology, extracted data from at least 30 studies reporting MID estimates, and participated in pilot testing with different iterations of the instrument. Thus reliability might be lower in less well trained and instructed individuals. We have, however, developed detailed instructions and examples (included here and in the appendix) that are likely to enhance reliability in those with less experience than the raters in this study.

We did not conduct a formal evaluation to collect feedback on the usability of our instrument or satisfaction with its use. The instrument did, however, undergo numerous iterations of pretesting, which resolved several issues related to the understanding, comprehensiveness, and overall structure of the instrument.

We could not assess inter-rater reliability for three items in the extension for transition rating anchors, as only 3% of the studies in our inventory of MID estimation studies evaluated the correlations necessary to judge the validity of transition rating anchors. In the future, we anticipate that the availability of this credibility instrument will spur improvements in the conduct of MID studies. If so, correlations will be regularly reported, and investigators will be able to evaluate the reliability of these items.

We have not established the validity of our instrument by formal testing. In other work, however, we have shown that the current criteria for credibility succeed in partially explaining the variability in the magnitude of the MID.2993 Our instrument does not deal with the underlying measurement properties of the patient reported outcome measures (that is, validity and responsiveness) and assumes that users will only move forward in evaluating the credibility of MIDs if the outcome measure has met at least minimal standards of validity and responsiveness.

Implications and future research

Knowing the MID facilitates the interpretation of treatment effects in clinical research, allowing decision makers to determine if patients have had important benefits,469495 and informing the balance between desirable and undesirable outcomes of interventions. The recent CONSORT PRO Extension (Consolidated Standards of Reporting Trials patient reported outcomes) encourages authors to include discussion of an MID or a responder definition in reports of clinical trials.14 The demand for increased use of MIDs in trials requires the availability of trustworthy estimates. Since the MID was first introduced over three decades ago,312 methods for calculating the MID have evolved. In our linked inventory of published anchor based MIDs, we identified many statistical methods, each with its own merits and limitations. We also found that the quality of the anchors varied and that the threshold selected for defining the MID was not always optimal. Different methodological and statistical approaches to calculate MIDs will give different estimates for the same patient reported outcome measure.6296 Given the multiplicity of MID estimates often available for a given patient reported outcome measure and non-standardised methodology, researchers and decision makers in search of MIDs need to critically evaluate the quality of the available estimates.

Flaws in the design and conduct (aspects of credibility) of the studies empirically estimating MIDs can lead to overestimates or underestimates of the true MID. Lack of trustworthy MIDs to guide interpretation of estimates of treatment effects measured by patient reported outcome measures—or worse, availability of misleading MIDs—might result in serious misinterpretations of the results of otherwise well designed clinical trials and meta-analyses. Our credibility instrument provides a comprehensive approach to assessing the credibility of anchor based MIDs. Widespread adoption and implementation of our credibility instrument will facilitate improved appraisal of MIDs by users such as those conducting clinical trials, authors of systematic reviews, guideline developers, clinicians, funders, and policy makers, and also guide the development of trustworthy MIDs.

In developing our inventory of anchor based MIDs, and in other related work,2993 we found that the literature often includes a number of candidate MIDs for the same patient reported outcome measure. Moreover, the magnitude of these estimates sometimes varies widely. Several other research groups have made similar observations, stressing the importance of improved understanding of factors influencing the magnitude of MIDs.4662979899 Future research should, therefore, focus on understanding how different methodological and statistical approaches contribute to the variability in MIDs.

Our instrument focuses on the methodological issues that could potentially lead to flawed and thus misleading MIDs, which might in part explain why different methods can give variable estimates. Variability in MIDs, however, can also be related to many other factors, including the clinical setting, patient characteristics (eg, age, sex, disease severity, diagnosis), intervention, and duration of follow-up. Findings from subsequent investigations might provide insights into the appropriate use, in terms of context and trustworthiness, of MIDs for interpretation of patient reported outcome measures in clinical research and practice. For updates to the instrument and associated instructions that may arise from these insights, see www.promid.org.

Conclusions

To better inform management choices, patients, clinicians, and researchers need to know about MIDs to be able to interpret the effects of treatment on patient reported outcome measures. Consideration of the credibility of an MID involves complex judgments. We have developed a reliable instrument that will allow users to distinguish between unreliable and credible MID estimates. This work provides guidance for dealing with the credibility of MIDs to optimise the presentation and interpretation of results from patient reported outcome measures in clinical trials, systematic reviews, health technology assessments, and clinical practice guidelines, and also has important implications for how investigators should conduct future studies on estimating anchor based MIDs.

What is already known on this topic

  • Interpreting results from patient reported outcome measures is critical for healthcare decision making

  • The minimal important difference, a measure of the smallest change in a measure that patients consider important, can greatly facilitate judgments of the magnitude of effect on patient reported outcomes

  • The credibility of minimal important difference estimates varies, and guidance on determining credibility is limited

What this study adds

  • An instrument to evaluate the design, conduct, and analysis of studies measuring minimal important differences has been developed

  • This instrument aims to allow users to distinguish between unreliable and credible minimal important differences to optimise the presentation and interpretation of results from patient reported outcome measures in clinical trials, systematic reviews, health technology assessments, and clinical practice guidelines

  • This instrument will also aim to promote higher standards in methodology for robust anchor based estimation of minimal important differences

Acknowledgments

The Credibility instrument for judging the trustworthiness of minimal important difference estimates, authored by Devji et al, is the copyright of McMaster University (copyright 2018, McMaster University). The Credibility instrument for judging the trustworthiness of minimal important difference estimates has been provided under license from McMaster University and must not be copied, distributed, or used in any way without the prior written consent of McMaster University. Contact the McMaster Industry Liaison Office at McMaster University (milo@mcmaster.ca) for licensing details.

Footnotes

  • Contributors: TD and AC-L are joint first authors. TD, AC-L, GHG, BCJ, GN, and SE conceived the study idea; TD, AC-L, AQ, MP, and GHG led the development of the credibility instrument; TD, AC-L, AQ, MP, ND, DZ, MB, XJ, RB-P, OU, FF, SS, HP-H, RWMV, HH, YR, RS, and LL extracted data and assessed the credibility of MIDs in our inventory for the reliability analyses; TD and AC-L wrote the first draft of the manuscript; TD, AC-L, GHG, AQ, MP, ND, DZ, RB-P, OU, SS, HP-H, RWMV, LL, BCJ, DLP, SE, TF, GN, HJS, MB, and LT interpreted the data analysis and critically revised the manuscript. TD and AC-L are the guarantors. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.

  • Funding: This project was funded by the Canadian Institutes of Health Research (CIHR), Knowledge Synthesis (grant No KRS138214). The views expressed in this work are those of the authors and do not necessarily represent the views of the CIHR or the Canadian government.

  • Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf and declare: support from the Canadian Institutes of Health Research (CIHR) for the submitted work; TD, AC-L, and GHG have a patent issued for the Credibility instrument for judging the trustworthiness of minimal important difference estimates, and a patent pending for the Patient Reported Outcome Minimal Important Difference (PROMID) Database; GHG has received other grants outside the submitted work; MB reports personal fees from AgNovos Healthcare, Sanofi Aventis, Stryker, and Pendopharm, and grants from DJ Orthopaedics and Acumed outside the submitted work; no other relationships or activities that could appear to have influenced the submitted work.

  • Ethical approval: Not required.

  • Data sharing: No additional data available.

  • TD, AC-L, and GHG affirm that the manuscript is an honest, accurate, and transparent account of the recommendation being reported; that no important aspects of the recommendation have been omitted; and that any discrepancies from the recommendation as planned (and, if relevant, registered) have been explained.

  • Dissemination to participants and related patient and public communities: We have planned dissemination of the existence of the instrument and its use to relevant patient communities through health and consumer advocacy organisations, such as the Cochrane Task Exchange, Cochrane Consumer Network, the National Patient-Centred Clinical Research Network, the Society for Participatory Medicine, and Consumers United for Evidence-based Health Care.


This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
