- Penny Whiting, research fellow1,
- Roger Harbord, research associate1,
- Caroline Main, research fellow2,
- Jonathan J Deeks, senior medical statistician3,
- Graziella Filippini, head4,
- Matthias Egger, professor of epidemiology and public health5,
- Jonathan A C Sterne, reader in medical statistics and epidemiology1
- 1 MRC Health Services Research Collaboration, Department of Social Medicine, Bristol BS8 2PR
- 2 Centre for Reviews and Dissemination, University of York
- 3 Centre for Statistics in Medicine, Wolfson College, Oxford
- 4 Unit of Neuroepidemiology, Istituto Nazionale Neurologico “Carlo Besta,” Milan, Italy
- 5 Department of Social and Preventive Medicine, University of Bern, Switzerland
- Correspondence to: P Whiting
- Accepted 2 February 2006
Objective To determine the accuracy of magnetic resonance imaging criteria for the early diagnosis of multiple sclerosis in patients with suspected disease.
Design Systematic review.
Data sources 12 electronic databases, citation searches, and reference lists of included studies.
Review methods Studies on accuracy of diagnosis that compared magnetic resonance imaging, or diagnostic criteria incorporating such imaging, to a reference standard for the diagnosis of multiple sclerosis.
Results 29 studies (18 cohort studies, 11 other designs) were included. On average, studies of other designs (mainly diagnostic case-control studies) produced higher estimated diagnostic odds ratios than did cohort studies. Among 15 studies of higher methodological quality (cohort design, clinical follow-up as reference standard), those with longer follow-up produced higher estimates of specificity and lower estimates of sensitivity. Only two such studies followed patients for more than 10 years. Even in the presence of many lesions (> 10 or > 8), magnetic resonance imaging could not accurately rule in multiple sclerosis (likelihood ratio of a positive test result 3.0 and 2.0, respectively). Similarly, the absence of lesions was of limited utility in ruling out a diagnosis of multiple sclerosis (likelihood ratio of a negative test result 0.1 and 0.5).
Conclusions Many evaluations of the accuracy of magnetic resonance imaging for the early detection of multiple sclerosis have produced inflated estimates of test performance owing to methodological weaknesses. Use of magnetic resonance imaging to confirm multiple sclerosis on the basis of a single attack of neurological dysfunction may lead to over-diagnosis and over-treatment.
Diagnosis of multiple sclerosis is based on the principle of dissemination in both time and space. Recent criteria state that patients should experience two attacks of neurological dysfunction, such as optic neuritis, transverse myelitis, double vision, or numbness and tingling of the leg, occurring at different points in time and affecting different parts of the central nervous system—that is, signs or symptoms that cannot be attributable to a single lesion.1 Many years may elapse between first and second attacks, and not all patients who experience a first attack develop multiple sclerosis. In a study of patients with optic neuritis, a common presenting symptom of multiple sclerosis, 38% developed the disease by 10 years; of these, 50% received their diagnosis more than three years after presentation and 28% more than five years after presentation.2 In a study of patients presenting with clinically isolated syndromes (optic, spinal cord, or brain symptoms) 68% of patients had developed multiple sclerosis by 14 years, the proportions being similar for the different presenting symptoms.3
Magnetic resonance imaging may assist in earlier diagnosis of multiple sclerosis by enabling visualisation of lesions in the brain that are clinically silent. The McDonald 2001 criteria for the diagnosis of multiple sclerosis4 allow an early diagnosis of multiple sclerosis to be made after one clinical attack if the patient also meets criteria for a positive result on a magnetic resonance imaging scan. The McDonald criteria have been adopted in England and Wales by the National Institute for Health and Clinical Excellence (NICE),5 but they are not universally accepted.1 Evidence shows that patients' wellbeing is affected by early diagnosis,6–9 usually in a beneficial way but also occasionally in a negative way—for example, through increased insurance premiums and discrimination in the workplace.10 Earlier diagnosis of multiple sclerosis could mean the availability of earlier treatment, such as the disease modifying therapies interferon beta and glatiramer acetate, provided under the “risk sharing scheme” in the United Kingdom (www.dh.gov.uk/assetRoot/04/01/22/14/04012214.pdf).
We carried out a systematic review to estimate the accuracy of different magnetic resonance imaging criteria for the early diagnosis of multiple sclerosis in patients presenting with suspected disease, to investigate whether magnetic resonance imaging has the potential to alter diagnoses and patient management.
We identified studies, published and unpublished, by searching 12 databases from inception until September or November 2004. Search terms were “multiple sclerosis” combined with “magnetic resonance imaging” or “MRI”. No language restrictions were applied. We undertook a citation search on the article reporting the McDonald 2001 criteria,4 screened reference lists of included studies, and assessed studies included in the NICE multiple sclerosis guidelines.5
Studies were eligible if they compared magnetic resonance imaging (or diagnostic criteria incorporating such imaging) to a reference standard for the diagnosis of multiple sclerosis and reported sufficient data to enable a 2 × 2 table of test performance to be constructed. If studies were reported more than once, we included the publication that provided data for the longest follow-up. We also included separate publications that reported on different criteria for magnetic resonance imaging or separate results for relevant patient subgroups.
Two reviewers independently screened titles and abstracts for relevance. Screening for inclusion, data extraction, and quality assessment were carried out by one reviewer and checked by a second. Studies were assessed for methodological quality against the QUADAS (quality assessment of diagnostic accuracy studies) criteria.11 (See bmj.com for a summary of how items were scored.) One item, the avoidance of disease progression bias, was omitted as it was not relevant to this topic. We grouped studies according to patient spectrum: prospective cohort studies that enrolled patients with suspected multiple sclerosis, and studies of other designs.
From each 2×2 table we computed sensitivity, specificity, and likelihood ratios, which combine data on sensitivity and specificity to give an indication of a test's ability to rule in or rule out a condition.12
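The measures named above follow directly from the four cells of a 2 × 2 table. The following is a minimal sketch rather than the analysis code used in the review (which was run in Stata); the class and function names and the example counts are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Index test result cross-classified against the reference standard."""
    tp: int  # test positive, disease present
    fp: int  # test positive, disease absent
    fn: int  # test negative, disease present
    tn: int  # test negative, disease absent


def accuracy_measures(t: TwoByTwo) -> dict:
    """Return sensitivity, specificity, and both likelihood ratios.

    Assumes at least one person with and one without the disease,
    so that the sensitivity and specificity denominators are non-zero.
    """
    sensitivity = t.tp / (t.tp + t.fn)
    specificity = t.tn / (t.tn + t.fp)
    false_positive_rate = t.fp / (t.fp + t.tn)
    # LR+ = P(test positive | disease) / P(test positive | no disease)
    lr_positive = (sensitivity / false_positive_rate
                   if false_positive_rate > 0 else float("inf"))
    # LR- = P(test negative | disease) / P(test negative | no disease)
    lr_negative = ((1 - sensitivity) / specificity
                   if specificity > 0 else float("inf"))
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_positive": lr_positive,
        "lr_negative": lr_negative,
    }


# Hypothetical counts, not taken from any included study.
example = accuracy_measures(TwoByTwo(tp=40, fp=20, fn=10, tn=30))
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

A positive likelihood ratio well above 1 helps to rule a condition in, and a negative likelihood ratio well below 1 helps to rule it out.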
We plotted all results from all included studies on a receiver operating characteristic plot of sensitivity against specificity, with the specificity axis reversed. To compare accuracy of cohort and other studies we selected the result with the median diagnostic odds ratio (defined as the odds of positivity among people with the disease, divided by the odds of positivity among people without the disease) for each study. We used random effects meta-analysis to obtain summary diagnostic odds ratios in each group, and we carried out a permutation test13 to obtain a P value for their comparison. We restricted all further analyses to cohort studies that used a reference standard diagnosis of clinically definite multiple sclerosis, arrived at solely by clinical data.
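The diagnostic odds ratio and its pooling can be sketched as follows. The text specifies only "random effects meta-analysis"; the DerSimonian-Laird estimator and the 0.5 continuity correction used here are assumptions made for illustration, as are the function names and the example counts. The permutation test is not shown.

```python
import math


def log_dor(tp, fp, fn, tn, correction=0.5):
    """Log diagnostic odds ratio and its standard error for one 2 x 2 table.

    DOR = (tp / fn) / (fp / tn), the odds of a positive test among people
    with the disease divided by the odds among people without it.
    Assumption: `correction` is added to every cell if any cell is zero.
    """
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (x + correction for x in (tp, fp, fn, tn))
    estimate = math.log((tp * tn) / (fp * fn))
    std_error = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    return estimate, std_error


def pool_random_effects(estimates):
    """DerSimonian-Laird pooling of (log DOR, standard error) pairs.

    Returns the pooled DOR with a 95% confidence interval on the
    odds ratio scale.
    """
    y = [est for est, _ in estimates]
    w = [1 / se ** 2 for _, se in estimates]
    k = len(y)
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    # Cochran's Q measures heterogeneity around the fixed effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1 / (se ** 2 + tau2) for _, se in estimates]
    pooled = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    se_pooled = math.sqrt(1 / sum(w_star))
    return (math.exp(pooled),
            math.exp(pooled - 1.96 * se_pooled),
            math.exp(pooled + 1.96 * se_pooled))


# Hypothetical tables (tp, fp, fn, tn), not data from the review.
tables = [(40, 20, 10, 30), (25, 12, 15, 28), (18, 5, 7, 20)]
dor, lower, upper = pool_random_effects([log_dor(*t) for t in tables])
print(f"pooled DOR {dor:.1f} (95% CI {lower:.1f} to {upper:.1f})")
```

Pooling on the log scale keeps the confidence interval symmetric around the log odds ratio before it is transformed back.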
As a final diagnosis of multiple sclerosis may be reached many years after a patient first presents with possible disease, we investigated the effect of duration of follow-up on estimates of diagnostic accuracy. We used the hierarchical summary receiver operating characteristic method proposed by Rutter and Gatsonis14 to assess the effect of duration of follow-up on overall accuracy and threshold. An association with threshold would indicate that sensitivity increased as specificity decreased, or vice versa. We drew separate receiver operating characteristic plots for studies that evaluated commonly reported magnetic resonance imaging criteria, the Barkhof, Paty, and Fazekas criteria, and the McDonald 2001 criteria, which combine clinical information with findings on magnetic resonance imaging.
Further analysis was restricted to cohort studies with at least 10 years' clinical follow-up. We produced separate receiver operating characteristic plots for each of these studies and compared areas under the curves. The statistical software package Stata release 9 was used for all analyses, except the hierarchical summary receiver operating characteristic model, which was fitted in SAS.15
Figure 1 shows the flow of studies through the review. Sixty one publications met the inclusion criteria, 21 of which were earlier reports of included studies and were not extracted.w1-w43 Forty publications reporting the results of 29 studies (some reported results for different magnetic resonance imaging criteria, for imaging of the spine rather than the brain, or for patient subgroups) were included. Sample sizes were generally small (median 70), ranging from 15 to 1500 patients. The proportions of dropouts ranged from 0 to 58% (median 4%), increasing with length of follow-up. Table 1 provides details of the 29 publications reporting the results of 18 cohort studies. Most of these studies used clinical follow-up as the reference standard. Most used the Poser criteria,16 although some used the McDonald 1977 criteria.17 The McDonald 1977 criteria, based on clinical information alone, are not the same as the McDonald 2001 criteria, which incorporate magnetic resonance imaging.4 Table 2 provides details of the 11 studies of other designs. The studies differed according to population, quality, magnetic resonance imaging protocol, and criteria used to define a positive test result. Cohort studies varied in their inclusion criteria; some included only patients presenting with a particular clinically isolated syndrome (for example, optic neuritis or a spinal cord syndrome), whereas others included all patients being evaluated for possible multiple sclerosis. Publication dates ranged from 1986 to 2003. Over this time improvements occurred in magnetic resonance imaging technology; this is reflected in differences in scanning protocols (see table A on bmj.com).
Figure 2 summarises the results of the quality assessment (see table B on bmj.com for results of individual studies). Study quality was generally poor: only four QUADAS items were met by over 70% of studies (avoidance of partial and differential verification bias and reporting of uninterpretable results and withdrawals). Studies scored badly on three items: blinding, the use of an appropriate reference standard, and the availability of clinical information. Four publications, reporting results from three cohort studies, were susceptible to incorporation bias as magnetic resonance imaging contributed to the final diagnosis.18–21 Three of these used a combination of clinical follow-up and paraclinical tests as the reference standard,19–21 the other relied on paraclinical tests alone.18 All other cohort studies used clinical follow-up alone as the reference standard.
Figure 3 shows that cohort studies produced lower estimated sensitivity and specificity than studies of other designs. The pooled diagnostic odds ratio was 9 (95% confidence interval 5 to 16) for cohort studies and 213 (85 to 535) for studies of other designs (P < 0.001, permutation test). Further analysis was restricted to the 15 cohort studies that used a diagnosis of clinically definite multiple sclerosis, arrived at by clinical information alone, as the reference standard.
The average duration of follow-up ranged from seven months to 14 years. The only criteria for which sufficient data were available to investigate the effects of duration of follow-up were presence of one or more lesions and presence of one or more non-clinical lesions. Figure 4 is a receiver operating characteristic plot for these criteria, with numbers showing the duration of follow-up in years. There was some evidence (P = 0.074 from hierarchical summary receiver operating characteristic analysis) that studies with longer follow-up produced higher estimated specificity and lower estimated sensitivity.
The longest average duration of follow-up was three years in studies that assessed the Barkhof, Fazekas, and McDonald 2001 criteria, and six years for studies that assessed the Paty criteria. It is therefore possible to draw conclusions regarding the ability of these criteria to predict the development of multiple sclerosis only over these relatively short periods. Figure 5 shows the receiver operating characteristic plots for these criteria. The study that developed the Barkhof criteria22 showed higher estimated sensitivity and specificity than did the other studies of this criterion. The negative likelihood ratios for the Barkhof, Fazekas, and Paty criteria ranged from 0.2 to 0.5, suggesting that a negative result on magnetic resonance imaging on the basis of these criteria is of limited utility for ruling out the development of multiple sclerosis within three to six years. Positive likelihood ratios were < 5: thus these criteria are also of limited utility in predicting the development of multiple sclerosis within three to six years. Positive likelihood ratios for the McDonald 2001 criteria ranged from 2.7 to 8.7, suggesting that they have more potential for predicting the development of multiple sclerosis within three years than any of the criteria based on magnetic resonance imaging alone.23–26 Negative likelihood ratios were 0.1 in one study and 0.2 to 0.5 in three studies, suggesting that the McDonald 2001 criteria are of limited utility for ruling out the development of multiple sclerosis within three years.
Only two studies, one from the United States2 and one from England,3 followed patients for more than 10 years, long enough to be reasonably confident that almost all patients who would ever be diagnosed as having multiple sclerosis had received that diagnosis. Both studies fulfilled all but one QUADAS criterion (the availability of clinical information), and in the US study it was unclear whether review bias had been avoided (see bmj.com). The US study included 351 patients with optic neuritis; follow-up of more than 10 years was available for 302 (86%) of these. The study used survival analysis to estimate the cumulative proportions of patients diagnosed, with patients who did not receive a diagnosis of multiple sclerosis censored at the time of their last clinical follow-up. The English study included 135 patients with a range of presenting symptoms, of whom 71 (53%) were included in the final evaluation. Both studies evaluated thresholds based on the number of non-clinical T2 lesions present on magnetic resonance imaging of the brain.
Figure 6 shows the estimates of sensitivity and specificity, with confidence intervals, for each of the thresholds evaluated in these two studies. Sensitivity and specificity varied according to the number of lesions used to define a positive result on magnetic resonance imaging: sensitivity was higher with fewer lesions but specificity was lower. Estimates of specificity were similar for the two studies, but the English study tended to produce higher estimates of sensitivity. Comparison of areas under the curves suggested better accuracy in the English study than in the US study (P = 0.045). Estimates of the positive likelihood ratios for the presence of various numbers of lesions ranged from 2.0 to 3.4. Assuming a pretest probability of multiple sclerosis of 60% this is equivalent to a post-test probability of 75%-84%, suggesting that magnetic resonance imaging is of limited utility for ruling in multiple sclerosis at any threshold.
Estimates of the negative likelihood ratio ranged from 0.1 to 0.9 but were greater than 0.5 for all but one of the thresholds in the English study. This is equivalent to modifying a pretest probability of 60% to give a post-test probability of multiple sclerosis of 43%-57%, suggesting that magnetic resonance imaging is also of limited utility in ruling out a diagnosis of multiple sclerosis.
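The post-test probabilities quoted for the two long term studies can be reproduced by converting the pretest probability to odds, multiplying by the likelihood ratio, and converting back. This is a minimal sketch of that arithmetic; the function name is illustrative and not taken from the review.

```python
def post_test_probability(pretest: float, likelihood_ratio: float) -> float:
    """Update a pretest probability with a likelihood ratio on the odds scale."""
    pretest_odds = pretest / (1 - pretest)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# Pretest probability of 60%, as assumed in the text.
for lr in (2.0, 3.4):  # range of positive likelihood ratios
    print(f"LR+ {lr}: {post_test_probability(0.60, lr):.0%}")  # 75%, 84%
for lr in (0.5, 0.9):  # negative likelihood ratios above 0.5
    print(f"LR- {lr}: {post_test_probability(0.60, lr):.0%}")  # 43%, 57%
```

With a pretest probability of 60%, none of these likelihood ratios moves the probability of multiple sclerosis close to either 0% or 100%, which is the sense in which the test is of limited utility.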
Use of magnetic resonance imaging to confirm multiple sclerosis on the basis of a single attack of neurological dysfunction may lead to over-diagnosis and over-treatment. Many studies in our systematic review produced inflated estimates of test performance owing to methodological weaknesses.
Only two cohort studies on the accuracy of magnetic resonance imaging for the diagnosis of multiple sclerosis included at least 10 years' follow-up. These suggested that the role of magnetic resonance imaging either in ruling in or ruling out a diagnosis of multiple sclerosis is limited. Studies that did not include an appropriate patient spectrum tended to overestimate both sensitivity and specificity. Studies that included shorter clinical follow-up tended to overestimate sensitivity and underestimate specificity. Specific criteria developed for the interpretation of magnetic resonance imaging scans as indicating multiple sclerosis, the Fazekas, Barkhof, and Paty criteria, have poor accuracy for predicting the development of multiple sclerosis within three to six years. The limited data on the McDonald 2001 criteria suggest that these have some potential to rule in the development of multiple sclerosis within three years. Neither the specific magnetic resonance imaging criteria nor McDonald 2001 were evaluated in studies with long term follow-up. It is therefore not possible to determine their accuracy for the diagnosis of multiple sclerosis.
Strengths and weaknesses of the study
We carried out extensive literature searches, assessed study quality, and used recently developed statistical methods. Considerable weaknesses existed in the primary studies included in the review. The only reference standard for the diagnosis of multiple sclerosis is long term clinical follow-up. Most studies followed patients for relatively short periods and so will have classified some patients as not having multiple sclerosis who had a second clinical attack after follow-up ended. Most studies included an inappropriate patient spectrum, which we found to be associated with considerably higher estimated diagnostic accuracy. Most of such studies used a case-control design—they selected people with clinically definite multiple sclerosis and a control group of people known not to have the disease, either healthy controls or patients with conditions that may present with similar symptoms to multiple sclerosis. That such studies tend to exaggerate the accuracy27 of magnetic resonance imaging in the diagnosis of multiple sclerosis is to be expected; people with more advanced multiple sclerosis are more likely to have lesions on their magnetic resonance imaging scans than those presenting in the early stages of multiple sclerosis.
Strengths and weaknesses in relation to other studies
Although several reviews have assessed the accuracy of magnetic resonance imaging in the diagnosis of multiple sclerosis,4 28 29 we are unaware of any systematic reviews. The McDonald 2001 criteria incorporate the Barkhof criteria to define a positive magnetic resonance imaging scan.4 The article reporting the McDonald 2001 criteria4 refers to a small number of studies to justify its selection of the Barkhof criteria for this purpose. All these had methodological weaknesses: they either used a case-control design or had an average of less than three years' clinical follow-up. This paper was published before the two long term cohort studies from England and the United States.2 3
A recently published detailed (but not systematic) review is the report of the Therapeutics and Technology Assessment Subcommittee of the American Academy of Neurology.28 This was limited to cohort studies and discussed in detail the problems associated with the lack of a true reference standard for the diagnosis of the disease (that is, an accurate method of determining whether or not a patient has multiple sclerosis that can be applied at the same time as the index test). It did not carry out any statistical synthesis and instead presented a narrative overview of the results of the English study and several other studies, also included in our review, which had relatively short clinical follow-up; it was published before the US study. It concluded, in contrast with our findings, that the presence of at least three lesions on a magnetic resonance imaging scan is a sensitive predictor of the development of multiple sclerosis in the next 7-10 years, and that normal results suggest that future development of multiple sclerosis is less likely. A more recent review article focused on the McDonald 2001 criteria but also drew on the results of the report of the Therapeutics and Technology Assessment Subcommittee of the American Academy of Neurology.29 It highlighted the limitations of the evidence base for the McDonald 2001 criteria and drew on the results of the US2 and English studies3 to conclude, consistent with the results presented here, that presence of brain lesions does not guarantee development of multiple sclerosis over 10-14 years.
Unanswered questions and future research
The main clinical question is whether magnetic resonance imaging should be included in the work-up of patients with suspected multiple sclerosis. Several factors need to be considered, in particular the reasons why magnetic resonance imaging is ordered. This is not simply to increase the certainty of the diagnosis: other possible reasons include ruling out differential diagnoses such as brain tumours, providing a baseline for monitoring disease progression, patient request, and patient reassurance. If magnetic resonance imaging scans are ordered to inform the diagnosis of multiple sclerosis, and if the McDonald 2001 criteria that incorporate such imaging are to be used in practice, then further research, based on long term cohort studies, is required to evaluate these criteria. A limitation consequent on the need for long term clinical follow-up in studies that evaluate the accuracy of magnetic resonance imaging is that such studies inevitably use older technology. Studies with more advanced, and hence recent, technology inevitably had much shorter periods of follow-up. Differences in estimates of sensitivity and specificity according to magnetic resonance imaging technology were therefore confounded by differences in duration of follow-up.
The two studies that included follow-up of longer than 10 years produced differing results, with the US study reporting lower estimates of sensitivity than the English study for similar thresholds for magnetic resonance imaging. It is possible that these differences reflect the smaller sample size of the English study or that the large proportion of dropouts from this study biased results. An alternative explanation is that magnetic resonance imaging may be more accurate in patients presenting with brainstem or spinal cord symptoms than in patients with optic neuritis. Future studies should assess whether the accuracy of magnetic resonance imaging varies according to presenting symptoms.
Rather than the accuracy of magnetic resonance imaging alone in diagnosing multiple sclerosis, the issue of clinical relevance is, arguably, the added value of such imaging in diagnosing the disease compared with the patient's history and clinical examination alone.30 None of the identified studies addressed this issue. A further limitation of published studies is that they tend to dichotomise the results of magnetic resonance imaging into positive or negative scans. The use of a scale based on features present on a scan, ranging from no lesions (in which case the probability of disease is low), to specific lesions (which may imply a greatly increased probability of disease), should be considered as an alternative to dichotomisation. This is probably consistent with how the results of magnetic resonance imaging are interpreted in practice.
In patients with clinically suspected multiple sclerosis, magnetic resonance imaging currently allows a diagnosis of the disease according to the McDonald 2001 criteria. Our results suggest that magnetic resonance imaging is a relatively poor test for both ruling in and ruling out multiple sclerosis. In clinical practice a false positive diagnosis of multiple sclerosis is potentially more dangerous than a false negative one because it implies unnecessary successive tests and treatments, or needless anxiety and psychological distress for the patient. Wrongly ruling out a diagnosis of multiple sclerosis after a first attack seems less dangerous: not all patients who experience a first attack will develop the disease and currently no treatment has been shown to delay conversion to clinically definite multiple sclerosis or to affect long term disability. Neurologists should discuss with their patients the potential diagnosis, treatment, and ultimate effect of potential errors of false positive and false negative magnetic resonance imaging results. High quality clinical research, based on improved magnetic resonance imaging techniques and measures in combination with a complete description of participants and long term clinical follow-up, is needed for quantitative assessment of the clinical efficacy of magnetic resonance imaging in the diagnosis of multiple sclerosis. The disease remains a predominantly clinical diagnosis.
What is already known on this topic
Magnetic resonance imaging has been recommended in the diagnosis of multiple sclerosis
The diagnostic accuracy of such imaging has been assessed but a systematic review has not previously been carried out
What this study adds
Magnetic resonance imaging is of limited utility for both ruling in and ruling out multiple sclerosis
Studies with shorter follow-up tended to produce higher estimates of sensitivity and lower estimates of specificity compared with longer term studies
Additional information and references w1-w43 are on bmj.com
Contributors PW, JACS, JJD, CM, and RH designed the study. PW carried out the literature searches. PW and CM screened the results of the searches for relevance, assessed inclusion, and carried out the data extraction and quality assessment. GF provided expert advice on magnetic resonance imaging and multiple sclerosis. RH, JJD, JACS, and PW developed the plan of analysis and RH carried out the analysis. PW, JACS, RH, JJD, and GF drafted the paper. All authors commented on drafts of the paper and approved the final manuscript. PW is guarantor for the paper.
Funding This work was supported by the Medical Research Council Health Services Research Collaboration. The authors' work was independent of the funders. JJD is funded by a senior research fellowship in evidence synthesis from the UK Department of Health.
Competing interests None declared
Ethical approval Not required.