

Optimal search strategies for retrieving scientifically strong studies of diagnosis from Medline: analytical survey

BMJ 2004; 328 doi: (Published 29 April 2004) Cite this as: BMJ 2004;328:1040
  1. R Brian Haynes, professor (bhaynes{at},
  2. Nancy L Wilczynski, research associate

    Hedges Team

  1. Health Information Research Unit, Department of Clinical Epidemiology and Biostatistics, McMaster University Faculty of Health Sciences, 1200 Main Street West, Hamilton, ON, L8N 3Z5, Canada
  1. Correspondence to: R B Haynes
  • Accepted 18 March 2004


Objective To develop optimal search strategies in Medline for retrieving sound clinical studies on the diagnosis of health disorders.

Design Analytical survey.

Setting Medline, 2000.

Participants 170 journals for 2000 of which 161 were indexed in Medline.

Main outcome measures The sensitivity, specificity, precision (“positive predictive value”), and accuracy of 4862 unique terms in 17 287 combinations were determined by comparison with a hand search of all articles (the “gold standard”) in 161 journals published during 2000 (49 028 articles).

Results Only 147 (18.9%) of 778 articles about diagnostic tests met basic criteria for scientific merit. Combinations of search terms reached peak sensitivities of 98.6% at a specificity of 74.3%. Compared with best single terms, best multiple terms increased sensitivity for sound studies by 6.8% (absolute increase), while also increasing specificity (absolute increase 6.0%) when sensitivity was maximised. When terms were combined to maximise specificity, the best single term (specificity 98.4%) outperformed combinations of terms. The strategies newly reported in this paper outperformed other validated search strategies except for one strategy that had slightly higher sensitivity (99.3% v 98.6%) but lower specificity (54.7% v 74.3%).

Conclusion New empirical search strategies in Medline can optimise retrieval of articles reporting high quality clinical studies of diagnosis.


Accurate diagnosis is the cornerstone of decision making for clinical intervention and is increasingly important as the number of validated treatments for specific conditions increases. Clinical research, usually widely accessible first in the biomedical journal literature, provides quantitative information about the sensitivity, specificity, and predictive value of many diagnostic tests. This information, however, is buried in a much larger biomedical literature. A recent survey showed that clinicians are highly interested in using evidence based information and frequently use Medline.1 Information pertaining to diagnosis is the second most commonly sought by clinicians, after treatment.23

Finding the current best evidence in Medline for a diagnostic process is daunting, given that Medline has over 11 million articles from over 4500 journals, covering all aspects of biomedical and health research.4 A recent qualitative study found that two of the six obstacles to answering clinical questions with evidence were the time required to find information and the difficulty in selecting an optimal search strategy.5 Even clinicians who in principle support the use of evidence for patient care often do not have time to find and apply it in practice.6 When they do try, searches are not performed effectively.7

Search filters (“hedges”) can improve the retrieval of clinically relevant and scientifically sound studies from Medline and similar databases.812 For instance, when we searched Medline for studies on the diagnosis of arthritis from 1996 to the present, the term “arthritis” alone retrieved 7083 articles; “arthritis and diagnosis” yielded 3451 articles. Although this filtered out over half the articles, many articles remained to be sorted through, with no guarantee that the most rigorous studies would be retrieved. More sophisticated search filters can be created by combining disease content terms with medical subject headings, explosions, publication types, subheadings, and textwords (see box). These detect design features indicating methodological rigour for applied healthcare research, for example by using a term such as “gold standard” as a filter to seek studies in which a test of uncertain value is compared with one of known high accuracy.

In the early 1990s our group at McMaster University developed search filters on a small subset of 10 journals and for four types of article (therapy, diagnosis, prognosis, and causation (aetiology)).1314 These strategies have been adapted for use in the Clinical Queries interface of Medline. This research is being updated and expanded with data from 161 journals indexed in Medline from 2000. The robustness of empirical search strategies developed in 1991 for detecting clinical content in Medline in 2000 has already been reported.15 We report on the information retrieval properties of single terms and combinations of terms in Medline for identifying methodologically sound studies on the diagnosis of health disorders.


We developed search strategies by using methodological search terms and phrases in a subset of Medline records matched with a handsearch of the contents of 161 journal titles for 2000. The search strategies were treated as diagnostic tests for sound studies, and the manual review of the literature was treated as the gold standard. It is potentially confusing to use the terminology of diagnostic testing for assessing strategies for retrieving articles about diagnostic tests, especially when some of the search terms are the same. Nevertheless, the principles for retrieval are the same as those for diagnosis. Thus we determined the sensitivity, specificity, accuracy, and precision (a library science term equivalent to the diagnostic test term “positive predictive value”) of single term and multiple term Medline search strategies (table 1 and box). Sensitivity and specificity are not affected by the proportion of high quality articles in the database; precision depends on this proportion, and so does accuracy, but to a lesser extent.

Table 1

Formula for calculating sensitivity, specificity, precision, and accuracy of Medline searches for detecting sound studies of diagnosis by manual review


After extensive attempts only 2% (n = 968) of the handsearch items did not match citations in Medline. Unmatched citations that were detected by a search strategy were included in cell b of the analysis table (table 1), leading to slight underestimates of the precision, specificity, and accuracy of the search strategy. Similarly, unmatched citations that were not detected by a search strategy were included in cell d of the table, leading to slight overestimates of specificity and accuracy.
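
The body of table 1 is not reproduced here, but the cell labels used above follow the usual 2×2 layout. As a minimal sketch, assuming a = sound articles retrieved, b = other articles retrieved, c = sound articles missed, and d = other articles not retrieved (the class, function, and example counts are illustrative and not taken from the study), the four measures could be computed as follows:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Cells of the 2x2 table (assumed layout; search result v hand search)."""
    a: int  # retrieved, meets criteria in the hand search (true positive)
    b: int  # retrieved, does not meet criteria (false positive)
    c: int  # not retrieved, meets criteria (false negative)
    d: int  # not retrieved, does not meet criteria (true negative)


def operating_characteristics(n: Counts) -> dict[str, float]:
    """Return sensitivity, specificity, precision, and accuracy as proportions."""
    total = n.a + n.b + n.c + n.d
    return {
        "sensitivity": n.a / (n.a + n.c),
        "specificity": n.d / (n.b + n.d),
        "precision": n.a / (n.a + n.b),
        "accuracy": (n.a + n.d) / total,
    }


# Hypothetical counts for illustration only (not data from the study).
example = Counts(a=90, b=400, c=10, d=9500)
metrics = operating_characteristics(example)
```

Under this layout, moving an article into cell b lowers precision, specificity, and accuracy, whereas moving one into cell d raises specificity and accuracy, which matches the direction of the biases described above.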

Manual review

Six research assistants reviewed all issues of 170 journals for 2000 of which 161 were indexed in Medline. The journal titles were regularly reviewed for content for four evidence based journals prepared by our group, Evidence-Based Medicine, Evidence-Based Nursing, Evidence-Based Mental Health, and ACP Journal Club, according to an explicit process that assesses the scientific merit and clinical relevance of original and review articles for health care. The journal list has been chosen over several years in an iterative process based on handsearch review of over 400 journals recommended by clinicians and librarians, science citation index impact factors, recommendations by editors and publishers, and ongoing assessment of their yield of studies and reviews of scientific merit and clinical relevance. These journals (examples bracketed) include content for the disciplines of internal medicine (Annals of Internal Medicine), general medical practice (BMJ, JAMA, and Lancet), mental health (Archives of General Psychiatry, British Journal of Psychiatry), and general nursing practice (Nursing Research).

Terms and definitions for search strategies

  • Sensitivity—proportion of high quality articles retrieved

  • Specificity—proportion of low quality diagnosis studies or non-diagnosis studies not retrieved

  • Precision—proportion of retrieved articles of high quality

  • Accuracy—proportion of all articles correctly categorised

  • “ANDed”—combined with

  • di—diagnosis subheading

  • du—diagnostic use subheading

  • exp—explosion

  • fs—floating subheading

  • MeSH—medical subject heading

  • mp—multiple posting (term in title, abstract, or MeSH heading)

  • pt—publication type

  • sh—MeSH subject heading

  • tw—textword

  • xs—exploded subheading

  • :—truncation

Methodological criteria for evaluating studies of diagnosis were: inclusion of a range of participants; use of an objective diagnostic (“gold”) standard or current clinical standard for diagnosis; participants receiving the new test and some form of the diagnostic standard; interpretation of diagnostic standard without knowledge of test result, and vice versa; and analysis consistent with study design. These criteria were developed for critical appraisal of the healthcare literature, and the second to fourth criteria have been empirically validated.1617 The research assistants were rigorously calibrated and periodically checked for application of criteria to determine if each article was methodologically sound for any of six categories of purpose (diagnosis and screening, treatment and prevention, prognosis, aetiology and harm, clinical prediction guides, and economics).18 Inter-rater agreement for identifying the purpose of articles was 81% beyond chance (κ 0.81, 95% confidence interval 0.79 to 0.84). Inter-rater agreement for which articles met all scientific criteria was 89% beyond chance (κ 0.89, 0.78 to 0.99).18 Articles that seemed to pass the criteria were reviewed by at least the lead author (RBH).
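
For readers unfamiliar with “agreement beyond chance”, the sketch below shows how a κ value is obtained for two raters making a yes/no judgment. It assumes the reported statistic is Cohen's κ, which the text does not state explicitly, and the counts are hypothetical rather than the study's calibration data.

```python
def cohens_kappa(both_yes: int, only_first: int, only_second: int, both_no: int) -> float:
    """Cohen's kappa for two raters and a binary judgment.

    both_yes / both_no: articles on which the raters agree.
    only_first / only_second: articles that only one rater judged "yes".
    """
    n = both_yes + only_first + only_second + both_no
    observed = (both_yes + both_no) / n
    # Chance agreement from each rater's marginal "yes" rate.
    first_yes = (both_yes + only_first) / n
    second_yes = (both_yes + only_second) / n
    expected = first_yes * second_yes + (1 - first_yes) * (1 - second_yes)
    return (observed - expected) / (1 - expected)


# Hypothetical counts for illustration only.
kappa = cohens_kappa(both_yes=40, only_first=5, only_second=5, both_no=50)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance.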

Collecting search terms

To construct a comprehensive set of possible search terms, we listed MeSH terms and textwords related to study criteria and then sought input from clinicians and librarians through interviews, requests by email and at meetings and conferences, review of published and unpublished searching strategies from other groups, and requests to Medline experts. Individuals were asked what terms or phrases they used when searching for each category. Terms could be subject headings, publication types, check tags, and subheadings, or could be single words or phrases as textwords, denoting their presence in titles and abstracts of articles. Various truncations were also applied to the textwords, phrases, and MeSH terms. We compiled a list of 5395 terms of which 4862 were unique. All terms were tested in all purpose categories using the Ovid Technologies searching system. Optimised strategies for aetiology and studies of clinical prediction guides have been published elsewhere.1920

Data collection

Data collection forms were used to record handsearched data for each article found in each issue of the 161 journal titles. These data were scanned using Teleform software (Cardiff Software; Vista, CA). After verification of the data online, the handsearch data were written to an Access database (Microsoft). Each journal title was searched in Medline for 2000, and the full Medline records were captured for all articles in the journals. Medline data were then linked with the handsearch data.

Testing strategies

We calculated the sensitivity, specificity, precision, and accuracy for each term for each category of article. For some categories of articles, such as therapy, we were able to split the database into 60% and 40% components to provide a development and validation database. For diagnosis, however, this was not possible as there was an insufficient number of diagnosis articles that were considered methodologically rigorous. Individual search terms with a sensitivity of more than 25% and a specificity of more than 75% for the diagnosis category were incorporated into the development of search strategies that included a combination of two or more terms. All combinations of terms used the Boolean OR, for example “sensitivity OR specificity”.
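
The screening and combination step can be sketched as set operations on article identifiers. This is an illustration under stated assumptions, not the study's software: the article IDs, the hit sets, and the term labelled broad_term are hypothetical, and only the thresholds (sensitivity above 25%, specificity above 75%) and the use of Boolean OR come from the description above.

```python
def or_combine(*hit_sets: set[str]) -> set[str]:
    """Boolean OR: retrieve an article if any of the terms retrieves it."""
    return set().union(*hit_sets)


def sens_spec(retrieved: set[str], sound: set[str], everything: set[str]) -> tuple[float, float]:
    """Sensitivity and specificity of a retrieved set against the hand search."""
    not_sound = everything - sound
    sensitivity = len(retrieved & sound) / len(sound)
    specificity = len(not_sound - retrieved) / len(not_sound)
    return sensitivity, specificity


def candidate_terms(term_hits, sound, everything, min_sens=0.25, min_spec=0.75):
    """Keep single terms whose sensitivity and specificity both exceed the floors."""
    keep = []
    for term, hits in term_hits.items():
        sens, spec = sens_spec(hits, sound, everything)
        if sens > min_sens and spec > min_spec:
            keep.append(term)
    return sorted(keep)


# Hypothetical toy database: ten articles, four judged sound by hand search.
everything = {f"A{i}" for i in range(1, 11)}
sound = {"A1", "A2", "A3", "A4"}
term_hits = {
    "sensitivity": {"A1", "A2", "A5"},
    "specificity": {"A2", "A3"},
    "broad_term": {f"A{i}" for i in range(1, 9)},  # sensitive but unspecific
}

candidates = candidate_terms(term_hits, sound, everything)
combined = or_combine(*(term_hits[t] for t in candidates))
combined_sens, combined_spec = sens_spec(combined, sound, everything)
```

In this toy example the OR combination retrieves more sound articles than either term alone without retrieving any additional unsound ones. In general, ORing terms can only hold or raise sensitivity and can only hold or lower specificity.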

For the development of multiple term search strategies to optimise either sensitivity or specificity, we tested the combination of individual terms with all two term search strategies with sensitivity at least 75% and specificity at least 50%. For optimising accuracy, two term search strategies with accuracy of more than 75% were considered for multiple term development. Overall, we tested 17 287 multiple term search strategies. Search strategies were also developed that optimised combined sensitivity and specificity (equivalent to the optimal point on a receiver operating characteristic curve, minimising the total number of errors).
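
Once each strategy has a sensitivity and a specificity, choosing the “best” one depends on the goal. The sketch below is illustrative only: the 50% floors and the absolute-difference rule are taken from the captions of tables 3 to 5, while the strategy names and values are hypothetical.

```python
from typing import NamedTuple


class Strategy(NamedTuple):
    name: str
    sensitivity: float
    specificity: float


def most_sensitive(strategies, min_specificity=0.50):
    """Highest sensitivity among strategies that keep specificity at or above the floor."""
    eligible = [s for s in strategies if s.specificity >= min_specificity]
    return max(eligible, key=lambda s: s.sensitivity)


def most_specific(strategies, min_sensitivity=0.50):
    """Highest specificity among strategies that keep sensitivity at or above the floor."""
    eligible = [s for s in strategies if s.sensitivity >= min_sensitivity]
    return max(eligible, key=lambda s: s.specificity)


def best_balance(strategies):
    """Smallest absolute difference between sensitivity and specificity."""
    return min(strategies, key=lambda s: abs(s.sensitivity - s.specificity))


# Hypothetical strategies for illustration only.
pool = [
    Strategy("A", 0.97, 0.60),
    Strategy("B", 0.99, 0.45),  # excluded from the sensitive search: specificity < 50%
    Strategy("C", 0.55, 0.98),
    Strategy("D", 0.90, 0.91),
]

top_sensitive = most_sensitive(pool)
top_specific = most_specific(pool)
top_balanced = best_balance(pool)
```

The floors matter: without them, a strategy that retrieves everything would always win on sensitivity and one that retrieves nothing would always win on specificity.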


Overall, 49 028 articles were included in the analysis. Of these, 778 (1.6% of original studies and review articles, case reports, or general interest papers) were classified as original studies evaluating a diagnosis question, of which 147 (18.9%) met the methodological criteria.

Table 2 shows the operating characteristics for the single terms with the highest sensitivity and specificity. The best accuracy when keeping sensitivity to 50% or more was seen with the term “” (.tw. is the Ovid search system's syntax for searching all words in the title and abstract of an article).

Table 2

Best single terms for high sensitivity searches, high specificity searches, and searches that optimise the balance between sensitivity and specificity for retrieving studies of diagnosis. Values are percentages (95% confidence intervals)


Tables 3 and 4 show the strategies yielding the highest sensitivity and specificity based on testing of all strategies for combinations up to three terms. Some one term and two term strategies outperformed multiple term strategies (table 4). Because of the low prevalence of diagnosis articles, the accuracy of search terms is driven by their specificity, and thus the three search strategies yielding the highest accuracy are the same as those yielding the highest specificity (table 4). Table 5 shows the three search strategies best optimising the trade off between sensitivity and specificity.
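
The link between low prevalence and accuracy can be checked with a short calculation. Accuracy is a prevalence-weighted average of sensitivity and specificity, so when sound articles are rare it sits close to specificity. The sketch below uses figures reported in this paper (147 sound articles among 49 028, and the most sensitive strategy at 98.6% sensitivity and 74.3% specificity); the function names are illustrative, and the results are derived values rather than figures copied from the tables.

```python
def accuracy(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Proportion of all articles correctly categorised."""
    return prevalence * sensitivity + (1 - prevalence) * specificity


def precision(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Proportion of retrieved articles that are sound."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


prevalence = 147 / 49028  # sound diagnosis studies among all articles
acc = accuracy(0.986, 0.743, prevalence)
prec = precision(0.986, 0.743, prevalence)
```

The same arithmetic shows why precision stays low even for a highly sensitive strategy: with so few sound articles, false positives from the large pool of other articles dominate what is retrieved.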

Table 3

Top three search strategies yielding highest sensitivity (keeping specificity ≥50%) with combinations of terms. Values are percentages (95% confidence intervals)

Table 4

Top three search strategies yielding highest specificity (and highest accuracy) (keeping sensitivity ≥50%) with combinations of terms. Values are percentages (95% confidence intervals)

Table 5

Top three search strategies for optimising sensitivity and specificity (based on minimising absolute difference between sensitivity and specificity).


Logistic regression modelling did not lead to the development of search strategies that outperformed those already developed using the Boolean approach.

We used our data to test 10 published strategies and one previously unpublished strategy for retrieving diagnostic test studies from Medline.911 Two strategies were modified slightly to eliminate the content words in the search strategies. When we used our handsearch data, the published and unpublished strategies containing only methodological terms had a sensitivity range of 85.0% to 99.3%. One strategy had slightly higher sensitivity (99.3%) than our most sensitive strategy (98.6%), but it came with a large trade off for specificity (54.7%, compared with our strategy's specificity of 74.3%; see table 3). The specificities for these strategies in our database ranged from 54.7% to 94.5%, all lower than our best specificity of 98.4% (see table 4).


Our study documents search terms with best sensitivity, specificity, accuracy, and balance of sensitivity and specificity for retrieving high quality studies of diagnostic tests from Medline. This research updates our previous study published in 1994, calibrated using 10 internal and general medicine journals.19 When the 1991 strategies for diagnosis articles were tested in the 2000 database, the performance of the 2000 strategies was consistently better (table 6). We did not have enough data to do an independent validation of our diagnostic test strategies and thus risked overestimating their performance. We did independent validations for studies of therapy, however, with the greatest statistically significant difference being 1.1% for one set of specificities (data not shown). Furthermore, by double checking only articles that initially seemed to pass criteria, we may have underestimated performance: a few articles that met our criteria may have been missed in the handsearch.

Table 6

Comparison of performance of strategies from 1991 and 2000, compiled using 2000 dataset. Values are percentages


Searchers who want retrieval with little non-relevant material can choose strategies with high specificity. For those interested in comprehensive retrievals or in searching for clinical topics with few citations, strategies with higher sensitivity may be more appropriate. The strategies that optimised the balance of sensitivity and specificity provided the best separation of eligible studies from others but did so without regard for whether sensitivity or specificity was affected. Regardless of the strategy used, we foresee that the most effective way to harness these strategies is to have them embedded within searching systems, either as clinical queries in PubMed or as stored searches that can be invoked at the user's request. The US National Library of Medicine has updated its Clinical Queries site for searching Medline for studies of diagnostic tests and other clinical topics, and the strategies are available free. Further, the new strategies have been incorporated into Ovid's main search engine for Medline, with the high specificity strategies being incorporated into Skolar.

What is already known on this topic

Information on the accuracy of diagnostic tests abounds in the medical literature but is often unknown to, or forgotten by, clinicians

The medical literature is accessible through large internet databases such as Medline, but few clinicians know how to search them well

What this study adds

Special Medline search strategies were developed and tested that retrieved up to 99% of scientifically strong studies of diagnostic tests

These strategies have been automated for use in PubMed Medline at a special screen, Clinical Queries, and in Ovid Technologies' Medline and Skolar services

Our search strategies were designed to retrieve diagnostic test studies that meet criteria for validity, just 18.9% of all diagnosis studies in our database. We did not test the performance of these strategies for all diagnosis studies, but in a similar project for studies of health services research, we found that the highest sensitivity strategies for the better designed studies had 5-10% lower sensitivity for all articles on the same topic, with no important differences in specificity (unpublished data).

Other investigators have attempted to find strategies that outperform those we previously published, with some success.91214 Our new strategies have set the bar higher, but there is still considerable room for improvement, particularly for the precision of searches.

Full titles of journals indexed in Medline and a specific diagnostic search strategy are on

The Hedges Team includes Angela Eady, Brian Haynes, Susan Marks, Ann McKibbon, Doug Morgan, Cindy Walker-Dilks, Stephen Walter, Stephen Werre, Nancy Wilczynski, and Sharon Wong, all at McMaster University Faculty of Health Sciences.


  • Contributors RBH planned the study, designed the protocol, and interpreted the data; he will act as guarantor. NLW supervised the research staff, and collected, analysed, and interpreted the data. The Hedges Team conducted the study: AE, SM, AM, CW-D, S Werre, and S Wong collected the data. DM programmed the data set and analysed the data. S Walter and S Werre provided statistical advice, and S Werre did supplementary analyses. The manuscript was prepared by NLW and RBH.

  • Funding This study was funded by the US National Institutes of Health (grant No 1 RO1 LM06866).

  • Competing interests None declared.

  • Ethical approval: Not required.

