BMJ 2005;330:1179 (21 May), doi:10.1136/bmj.38446.498542.8F (published 13 May 2005)
R Brian Haynes, chief1, K Ann McKibbon, doctoral candidate4, Nancy L Wilczynski, doctoral candidate2, Stephen D Walter, professor3, Stephen R Werre, research associate1, for the Hedges Team
1 Health Information Research Unit, McMaster University, Hamilton, ON, Canada L8N 3Z5, 2 School of Graduate Studies, McMaster University, 3 Department of Clinical Epidemiology and Biostatistics, McMaster University, 4 Center for Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
Correspondence to: R B Haynes bhaynes{at}mcmaster.ca
Design Analytical survey.
Data sources 161 clinical journals indexed in Medline for the year 2000.
Main outcome measures Sensitivity, specificity, precision, and accuracy of 4862 unique terms in 18 404 combinations.
Results Only 1587 (24.2%) of 6568 articles on treatment met criteria for testing clinical interventions. Combinations of search terms reached peak sensitivities of 99.3% (95% confidence interval 98.7% to 99.8%) at a specificity of 70.4% (69.8% to 70.9%). Compared with best single terms, best multiple terms increased sensitivity for sound studies by 4.1% (absolute increase), but with substantial loss of specificity (absolute difference 23.7%) when sensitivity was maximised. When terms were combined to maximise specificity, 97.4% (97.3% to 97.6%) was achieved, about the same as that achieved by the best single term (97.6%, 97.4% to 97.7%). The strategies newly reported in this paper outperformed other validated search strategies except for two strategies that had slightly higher specificity (98.1% and 97.6% v 97.4%) but lower sensitivity (42.0% and 92.8% v 93.1%).
Conclusion New empirical search strategies have been validated to optimise retrieval from Medline of articles reporting high quality clinical studies on prevention or treatment of health disorders.
If large electronic bibliographic databases such as Medline are to be helpful to clinical users, clinicians must be able to retrieve articles that are scientifically sound and directly relevant to the health problem they are trying to solve, without missing key studies or retrieving excessive numbers of preliminary, irrelevant, outdated, or misleading reports. Few clinicians, however, are trained in search techniques. One approach to enhance the effectiveness of searches by clinical users is to develop search filters ("hedges") to improve the retrieval of clinically relevant and scientifically sound reports of studies from Medline and similar bibliographic databases.2-7 Hedges can be created with appropriate disease content terms combined ("ANDed") with medical subject headings (MeSH), explosions (px), publication types (pt), subheadings (sh), and textwords (tw) that detect research design features indicating methodological rigour for applied healthcare research. For instance, combining clinical trial (pt) AND myocardial infarction in PubMed brings the retrieval for myocardial infarction down by a factor of 13 (from 116 199 to 8956 articles) and effectively removes case reports, laboratory and animal studies, and other less rigorous and extraneous reports.
In the early 1990s, our group developed Medline search filters for studies of the cause, course, diagnosis, or treatment of health problems, based on a small subset of 10 clinical journals.8 These strategies were adapted for use in the Clinical Queries feature in PubMed and other services. In this paper we report improved hedges for retrieving studies on prevention and treatment, developed on a larger number of journals (n = 161) in a more current era (2000) than previously reported.9
Table 1 shows the sensitivity, specificity, precision, and accuracy of single term and multiple term Medline search strategies that we determined. The sensitivity for a given strategy is defined as the proportion of scientifically sound and clinically relevant articles (high quality articles) that are retrieved; specificity is the proportion of lower quality articles (did not meet criteria) that are not retrieved; precision is the proportion of retrieved articles that meet criteria (equivalent to positive predictive value in diagnostic test terminology); and accuracy is the proportion of all articles that are correctly dealt with by the strategy (articles that met criteria and were retrieved plus articles that did not meet criteria and were not retrieved, divided by all articles in the database).
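As an illustration only, the four measures can be computed from the cells of a two by two table. The sketch below assumes the usual diagnostic test layout (a = met criteria and retrieved, b = did not meet criteria and retrieved, c = met criteria and not retrieved, d = did not meet criteria and not retrieved); the class and function names and the example counts are hypothetical and are not taken from the study data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts of articles cross-classified by manual review and retrieval."""
    a: int  # met criteria, retrieved by the strategy
    b: int  # did not meet criteria, retrieved by the strategy
    c: int  # met criteria, not retrieved
    d: int  # did not meet criteria, not retrieved


def operating_characteristics(t: TwoByTwo) -> dict:
    """Return sensitivity, specificity, precision, and accuracy as fractions."""
    total = t.a + t.b + t.c + t.d
    return {
        "sensitivity": t.a / (t.a + t.c),  # high quality articles that are retrieved
        "specificity": t.d / (t.b + t.d),  # lower quality articles that are not retrieved
        "precision": t.a / (t.a + t.b),    # retrieved articles that meet criteria
        "accuracy": (t.a + t.d) / total,   # articles correctly dealt with
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
example = TwoByTwo(a=90, b=30, c=10, d=870)
stats = operating_characteristics(example)
for name, value in stats.items():
    print(f"{name}: {value:.1%}")
```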
After extensive attempts, a small fraction (n = 968, 2%) of citations downloaded from Medline could not be matched to the handsearched data. As a conservative approach, unmatched citations that were detected by a given search strategy were included in cell b of the analysis in table 1 (leading to slight underestimates of the precision, specificity, and accuracy of the search strategy). Similarly, unmatched citations that were not detected by a search strategy were included in cell d of the table (leading to slight overestimates of the specificity and accuracy of the strategy).
Collecting search terms
To construct a comprehensive set of possible search terms, we listed MeSH terms and textwords related to study criteria and then sought input from clinicians and librarians through interviews and requests at meetings and conferences and through electronic mail, review of published and unpublished search strategies from other groups, and requests to the National Library of Medicine. We compiled a list of 4862 unique terms (data not shown). All terms were tested using the Ovid Technologies searching system. Search strategies developed using Ovid were subsequently translated by the National Library of Medicine for use in the Clinical Queries interface of PubMed and reviewed by RBH.
Data collection
Manual ratings of articles were recorded on data collection forms along with bibliographic information and database specific unique identifiers. Each journal title was searched in Medline for 2000, and the full Medline records (including citation, abstract, MeSH terms, and publication types) were captured for all articles. Medline data were then linked with the manual review data.
Testing strategies
We randomly divided treatment and prevention articles that met criteria in the manual review database into development and validation datasets (60% and 40%). Sensitivity, specificity, precision, and accuracy were calculated for each term in the development subset and then validated in the rest of the database. For a given purpose category, we incorporated individual search terms with sensitivity greater than 25% and specificity greater than 75% into the development of search strategies that included a combination of two or more terms. All combinations of terms used the boolean OR, for example, "random OR controlled". (The boolean AND was not used because this strategy invariably compromised sensitivity.)
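The selection and combination step can be pictured with sets of article identifiers, where ORing two terms is a set union. The following is a minimal sketch under that assumption, not the software used in the study; the function names, the toy identifiers, and the term "study" are hypothetical, and only the two thresholds come from the text.

```python
from itertools import combinations


def sens_spec(retrieved, sound, all_ids):
    """Sensitivity and specificity of a set of retrieved article ids."""
    not_sound = all_ids - sound
    sensitivity = len(retrieved & sound) / len(sound)
    specificity = len(not_sound - retrieved) / len(not_sound)
    return sensitivity, specificity


def candidate_terms(term_hits, sound, all_ids, min_sens=0.25, min_spec=0.75):
    """Keep single terms exceeding both thresholds described in the text."""
    kept_terms = []
    for term, hits in term_hits.items():
        sensitivity, specificity = sens_spec(hits, sound, all_ids)
        if sensitivity > min_sens and specificity > min_spec:
            kept_terms.append(term)
    return kept_terms


def or_pairs(terms, term_hits):
    """Combine every pair of terms with boolean OR (set union)."""
    return {
        f"{t1} OR {t2}": term_hits[t1] | term_hits[t2]
        for t1, t2 in combinations(terms, 2)
    }


# Toy data: 20 articles, of which 5 met criteria on manual review.
all_ids = set(range(1, 21))
sound = {1, 2, 3, 4, 5}
term_hits = {
    "random": {1, 2, 3, 6},
    "controlled": {3, 4, 5, 7},
    "study": set(range(1, 16)),  # sensitive but too unspecific to keep
}

kept = candidate_terms(term_hits, sound, all_ids)
pairs = or_pairs(kept, term_hits)
for query, hits in pairs.items():
    print(query, sens_spec(hits, sound, all_ids))
```

In this toy example the union retrieves more of the sound articles than either term alone, at the cost of retrieving more articles that did not meet criteria, which mirrors the trade-off described above.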
For the development of multiple term search strategies to optimise either sensitivity or specificity, we tested all two term search strategies with sensitivity at least 75% and specificity at least 50%. For optimising accuracy, two term search strategies with accuracy greater than 75% were considered for multiple term development. Overall, we tested 18 404 multiple term search strategies. Search strategies were also developed that optimised combined sensitivity and specificity (by keeping the absolute difference between sensitivity and specificity less than 1%, if possible).
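One way to picture the balanced optimisation is as a filter on the absolute difference between sensitivity and specificity. The sketch below is an assumption about how such a filter might look: the text gives only the 1% criterion, so the ranking by the sum of sensitivity and specificity, the names, and all the figures in the example are hypothetical.

```python
from typing import NamedTuple


class Strategy(NamedTuple):
    name: str
    sensitivity: float
    specificity: float


def balanced(strategies, max_gap=0.01):
    """Strategies whose sensitivity and specificity differ by less than max_gap.

    Ranking by the sum of the two measures is an assumption made for this
    sketch; the text does not state how balanced strategies were ordered.
    """
    close = [s for s in strategies if abs(s.sensitivity - s.specificity) < max_gap]
    return sorted(close, key=lambda s: s.sensitivity + s.specificity, reverse=True)


# Hypothetical operating characteristics, not results from the study.
strategies = [
    Strategy("A", 0.99, 0.70),
    Strategy("B", 0.955, 0.950),
    Strategy("C", 0.93, 0.97),
    Strategy("D", 0.90, 0.905),
]

result = balanced(strategies)
print([s.name for s in result])
```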
To attempt to increase specificity without compromising sensitivity, we used terms with low sensitivity but appreciable specificity to NOT out citations (for example, randomized controlled trial.pt. OR randomized.mp. OR placebo.mp. NOT retrospective studies.mp., where pt = publication type and mp = multiple posting, meaning that the term appears in the title, abstract, or MeSH heading). We also used logistic regression analysis models that included terms in a stepwise manner and also NOTed out terms with a regression coefficient less than -2.0.
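In set terms, NOTing out a term is a set difference applied to the citations retrieved by the ORed terms. The sketch below uses the example terms from the text with made-up article identifiers, and it assumes the NOT applies to the whole ORed group; the helper name and the data are hypothetical.

```python
def sens_spec(retrieved, sound, all_ids):
    """Sensitivity and specificity of a set of retrieved article ids."""
    not_sound = all_ids - sound
    sensitivity = len(retrieved & sound) / len(sound)
    specificity = len(not_sound - retrieved) / len(not_sound)
    return sensitivity, specificity


# Toy data: 20 articles, of which 5 met criteria on manual review.
all_ids = set(range(1, 21))
sound = {1, 2, 3, 4, 5}
hits = {
    "randomized controlled trial.pt.": {1, 2, 3, 7},
    "randomized.mp.": {2, 3, 4, 8},
    "placebo.mp.": {5, 6, 9},
    "retrospective studies.mp.": {7, 8, 9, 10},
}

# (A OR B OR C): union of the citations retrieved by each term.
base = (
    hits["randomized controlled trial.pt."]
    | hits["randomized.mp."]
    | hits["placebo.mp."]
)

# ... NOT D: drop every citation that the excluded term retrieves.
with_not = base - hits["retrospective studies.mp."]

before = sens_spec(base, sound, all_ids)
after = sens_spec(with_not, sound, all_ids)
print("before NOT:", before)
print("after NOT: ", after)
```

Here the excluded term retrieves none of the sound articles, so sensitivity is unchanged while specificity rises. If an excluded term also retrieved sound articles, the same operation would lower sensitivity.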
We compared strategies that maximised each of sensitivity, specificity, precision, and accuracy for both development and validation datasets with 19 previously published strategies. We chose strategies that had been tested against an ideal method such as a hand search of the published literature and for which most Medline records were from 1990 forward, to reflect major changes in the classification of clinical trials by the National Library of Medicine. These changes included new MeSH definitions (for example, "cohort studies" was introduced in 1989 and "single-blind method" in 1990) and publication types (for example, "clinical trial (pt)" and "randomized controlled trial (pt)", which were instituted in 1991). Six papers2-7 and one library website13 provided a total of 19 strategies to test, including the strategy advocated by the Cochrane Collaboration in their handbook (www.cochrane.dk/cochrane/handbook/hbookAPPENDIX_5C_OPTIMAL_SEARCH_STRAT.htm).2
Table 2 shows the operating characteristics for the single terms with the highest sensitivity and the highest specificity. The accuracy is driven by the specificity and thus the term with the best accuracy when keeping sensitivity more than 50% was "randomized controlled trial.pt.". The single term that yielded the best precision while keeping sensitivity more than 50% was also "randomized controlled trial.pt.", and this strategy also gave the optimal balance of sensitivity and specificity.
For strategies combining up to three terms, those yielding the highest sensitivity, specificity, and accuracy are shown in tables 3, 4, 5. Some two term strategies outperformed one term and multiple term strategies (table 5). Table 6 shows the top three search strategies optimising the trade-off between sensitivity and specificity.
Table 7 shows the best combination of terms for optimising the trade-off between sensitivity and specificity when using the boolean NOT to eliminate terms with the lowest sensitivity. Nonsignificant differences were shown when citations retrieved by the three terms "review tutorial.pt.", "review academic.pt.", and "selection criteri:.tw." were removed from the strategy that optimised sensitivity and specificity.
After the two term and three term computations, search strategies with sensitivity more than 50% and specificity more than 95% were further evaluated by adding search terms selected using logistic regression modelling. Initially, candidate terms for addition to the base strategy were ordered with the most significant first, using stepwise logistic regression, and then added to the model sequentially. The resulting logistic function (data not shown) determined the association between the predicted probabilities and observed responses. We selected the best one term, two term, three term, and four term strategies. Two were already evaluated ("randomized controlled trial.mp." OR "randomized controlled trial.pt." in table 4 and "randomized controlled trial.mp." OR "randomized controlled trial.pt." OR "double-blind:.tw." in table 5). The other two strategies are listed in table 8: both had high performance. We next took the 13 terms that had regression coefficients less than -2.0 ("predict.tw.", "predict.mp.", "economic.tw.", "economic.mp.", "survey.tw.", "survey.mp.", "hospital mortality.mp,tw.", "hospital mortalit:.mp.", "accuracy:.tw.", "accuracy.tw.", "accuracy.mp.", "explode bias (epidemiology)", and "longitudinal.tw.") and NOTed these terms out of the four term search strategy to determine if these terms would improve the operating characteristic values (table 8, last row). We found a small but insignificant decrease in sensitivity and increases in specificity, precision, and accuracy.
We compared our best strategies for maximising sensitivity (sensitivity > 99% and specificity > 70%) and for maximising specificity while maintaining a high sensitivity (sensitivity > 94% and specificity > 97%). To ascertain if the less sensitive strategy (which had a much greater specificity) would miss important articles, we assessed the methodologically sound articles that had not been retrieved by the less sensitive strategy, using studies from the four major medical journals (BMJ, JAMA, Lancet, and New England Journal of Medicine). In total, 32 articles were missed by the less sensitive search, of which four were from these four journals. A practising clinician with training in methods for health research found only one of the four articles to be of substantial clinical importance.14 The indexing terms for this randomised controlled trial did not include "randomized controlled trial(pt)". When we contacted the National Library of Medicine about indexing for this article, the article was reindexed and now the "missing" article would be retrieved.
We used our data to test 19 published strategies2-7 13 and we compared these with the best strategies for optimising sensitivity and specificity. The published strategies had a sensitivity range of 1.3% to 98.8% on the basis of our handsearched data. All of these were lower than our best sensitivity of 99.3%. The specificities for the published strategies ranged from 63.3% to 96.6%. Two strategies from Dumbrique6 outperformed our most specific strategy (specificity of 98.1% and 97.6% versus our 97.4%). Both of these strategies had a lower sensitivity than did our search strategy with the best specificity (42.0% and 92.8% v 93.1%).
No one search strategy will perform perfectly, for several reasons. Indexing inconsistencies affect retrievals, as shown by the need to reindex in the study by Julien et al.14 Indexing terms and methods are modified over time and few changes are implemented retrospectively. Indexers also choose only a small number of terms for each item they index, and many of these terms have similar meanings: for example, "randomized controlled trials" and "clinical trials" as MeSH and "randomized controlled trial" and "clinical trial" as publication types. Methods and their naming also change over time, and authors may also be imprecise in their description of methods and results, affecting retrievals that are based on textwords in the titles and abstracts. The model we used for testing search strategies defines the constant features of these strategies (their sensitivity and specificity), and these strategies can be expected to perform the same way in the entire Medline database, as shown by their performance in the validation database in our study and by the robustness of the 1991 strategies when retested in our much larger 2000 database.9 The precision of searches, however, depends on the concentration of relevant articles in the database. We selected clinical journals to calibrate the search strategies, but Medline contains many non-clinical journals. Thus, the concentration of high quality treatment studies will be less in the full Medline database, and the precision of searches will be less accordingly. This is a problem that warrants further attention.
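The dependence of precision on the concentration of relevant articles follows from the standard relation between positive predictive value, sensitivity, specificity, and prevalence. The sketch below applies that relation to the sensitivity and specificity reported for our most specific strategy (93.1% and 97.4%); the concentrations used are hypothetical and chosen only to show the direction of the effect.

```python
def precision(sensitivity, specificity, prevalence):
    """Expected precision (positive predictive value) of a search strategy.

    prevalence is the proportion of articles in the database that meet criteria.
    """
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Sensitivity and specificity as reported for the most specific strategy.
SENSITIVITY = 0.931
SPECIFICITY = 0.974

# Hypothetical concentrations of high quality treatment studies.
for prevalence in (0.05, 0.03, 0.01, 0.005):
    p = precision(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence {prevalence:.1%}: precision {p:.1%}")
```

With sensitivity and specificity held constant, the computed precision falls as the concentration of qualifying articles falls, which is the effect expected when moving from the clinical journal subset to the full database.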
Searchers who want retrieval with little non-relevant material (for instance, practising clinicians with little time to sort through many irrelevant articles) can choose strategies with high specificity. For those interested in comprehensive retrievals, for instance researchers conducting systematic reviews or those searching for clinical topics with few citations, strategies with higher sensitivity will be more appropriate. Regardless of the strategy used, the most effective way to harness these strategies is to have them embedded within searching systems. The most sensitive and most specific search strategies reported here have been implemented in the Clinical Queries search screen (www.ncbi.nlm.nih.gov:80/entrez/query/static/clinical.html) and by Ovid Technologies (www.ovid.com), and the optimal strategy has been added to Skolar (www.skolar.com).
Contributors: RBH and NLW prepared grant submissions for this project. All authors drafted, commented on, and approved the final manuscript, and supplied intellectual content to the collection and analysis of the data. KAM and NLW participated in the data collection and, with the addition of RBH, were involved in data analysis and staff supervision. RBH is guarantor for the paper.
Funding: National Institutes of Health (grant RO1 LM06866-01).
Competing interests: None declared.
Ethical approval: Not required.