Evaluation of symptom checkers for self diagnosis and triage: audit study
BMJ 2015; 351 doi: https://doi.org/10.1136/bmj.h3480 (Published 08 July 2015) Cite this as: BMJ 2015;351:h3480
- Hannah L Semigran, research assistant1,
- Jeffrey A Linder, associate professor2,
- Courtney Gidengil, instructor3, natural scientist4,
- Ateev Mehrotra, associate professor1 5
- 1Department of Health Care Policy, Harvard Medical School, Boston, MA 02115, USA
- 2Division of General Medicine and Primary Care, Brigham and Women’s Hospital & Harvard Medical School, Boston, MA, USA
- 3Division of Infectious Diseases, Boston Children’s Hospital, Boston, MA, USA
- 4RAND Corporation, Boston, MA, USA
- 5Division of General Internal Medicine and Primary Care, Beth Israel Deaconess Medical Center, Boston, MA, USA
- Correspondence to: A Mehrotra mehrotra@hcp.med.harvard.edu
- Accepted 15 June 2015
Abstract
Objective To determine the diagnostic and triage accuracy of online symptom checkers (tools that use computer algorithms to help patients with self diagnosis or self triage).
Design Audit study.
Setting Publicly available, free symptom checkers.
Participants 23 symptom checkers that were in English and provided advice across a range of conditions. 45 standardized patient vignettes were compiled and equally divided into three categories of triage urgency: emergent care required (for example, pulmonary embolism), non-emergent care reasonable (for example, otitis media), and self care reasonable (for example, viral upper respiratory tract infection).
Main outcome measures For symptom checkers that provided a diagnosis, our main outcomes were whether the symptom checker listed the correct diagnosis first or within the first 20 potential diagnoses (n=770 standardized patient evaluations). For symptom checkers that provided a triage recommendation, our main outcomes were whether the symptom checker correctly recommended emergent care, non-emergent care, or self care (n=532 standardized patient evaluations).
Results The 23 symptom checkers provided the correct diagnosis first in 34% (95% confidence interval 31% to 37%) of standardized patient evaluations, listed the correct diagnosis within the top 20 diagnoses given in 58% (55% to 62%) of standardized patient evaluations, and provided the appropriate triage advice in 57% (52% to 61%) of standardized patient evaluations. Triage performance varied by urgency of condition, with appropriate triage advice provided in 80% (95% confidence interval 75% to 86%) of emergent cases, 55% (47% to 63%) of non-emergent cases, and 33% (26% to 40%) of self care cases (P<0.001). Performance on appropriate triage advice across the 23 individual symptom checkers ranged from 33% (95% confidence interval 19% to 48%) to 78% (64% to 91%) of standardized patient evaluations.
Conclusions Symptom checkers had deficits in both triage and diagnosis. Triage advice from symptom checkers is generally risk averse, encouraging users to seek care for conditions where self care is reasonable.
Introduction
Members of the public are increasingly using the internet to research their health concerns. For example, the United Kingdom’s online patient portal for national health information, NHS Choices, reports over 15 million visits per month.1 More than a third of adults in the United States regularly use the internet to self diagnose their ailments, using it both for non-urgent symptoms and for urgent symptoms such as chest pain.2 3 While there is a wealth of online resources to learn about specific conditions, self diagnosis usually starts with search engines like Google, Bing, or Yahoo.2 However, internet search engines can lead users to confusing and sometimes unsubstantiated information, and people with urgent symptoms may not be directed to seek emergent care.3 4 5 6 Recently there has been a proliferation of more sophisticated programs called symptom checkers that attempt to more effectively provide a potential diagnosis for patients and direct them to the appropriate care setting.3 6 7 8 9 10 11 12 13
Using computerized algorithms, symptom checkers ask users a series of questions about their symptoms or require users to input details about their symptoms themselves. The algorithms vary and may use branching logic, bayesian inference, or other methods. Private companies and other organizations, including the National Health Service, the American Academy of Pediatrics, and the Mayo Clinic, have launched their own symptom checkers. One symptom checker, iTriage, reports 50 million uses each year.14 Typically, symptom checkers are accessed through websites, but some are also available as apps for smart phones or tablets.
Symptom checkers serve two main functions: to facilitate self diagnosis and to assist with triage. The self diagnosis function provides a list of diagnoses, usually rank ordered by likelihood. The diagnosis function is typically framed as helping educate patients on the range of diagnoses that might fit their symptoms. The triage function informs patients whether they should seek care at all and, if so, where (that is, emergency department, general practitioner’s clinic) and with what urgency (that is, emergently or within a few days). Symptom checkers may supplement or replace telephone triage lines, which are common in primary care.15 16 17 18 To ensure the safety of medical mobile apps, the US Congress is considering the regulation of apps that “provide a list of possible medical conditions and advice on when to consult a health care provider.”19 20
Symptom checkers have several potential benefits. They can encourage patients with a life threatening problem such as stroke or heart attack to seek emergency care.21 For patients with a non-emergent problem that does not require a medical visit, these programs can reassure people and recommend they stay home. For approximately a quarter of visits for acute respiratory illness such as viral upper respiratory tract infection, patients do not receive any intervention beyond over the counter treatment,22 and over half of patients receive unnecessary antibiotics.23 24 25 Reducing the number of visits saves patients’ time and money, deters overprescribing of antibiotics, and may decrease demand on primary care providers—a critical problem given that the workload for general practitioners in the United Kingdom increased by 62% from 1995 to 2008.17 However, there are several key concerns. If patients with a life threatening problem are misdiagnosed and not told to seek care, their health could worsen, increasing morbidity and mortality. Alternatively, if patients with minor illnesses are told to seek care, in particular in an emergency department, such programs could increase unnecessary visits and therefore result in increased time and costs for patients and society.
The impact of symptom checkers will depend to a large degree on their clinical performance. To measure the accuracy of diagnosis and triage advice provided by symptom checkers, we used 45 standardized patient vignettes to audit 23 symptom checkers. The vignettes reflected a range of conditions from common to less common and low acuity to life threatening.
Methods
Search strategy for symptom checkers
Between June 2014 and November 2014 we searched for symptom checkers that were in English, were free, were publicly available, were for humans (as opposed to veterinary use), and did not focus on a single type of condition (for example, only orthopedic problems). To find symptom checkers that were available as apps in the Apple app store and Google Play, we used two search phrases (“symptom checker”, “medical diagnosis”) used in a recent study on symptom checkers and examined the first 240 search results by hand.12 We chose 240 because this cut-off has been used in previous studies that have searched smartphone app stores.26 To find online symptom checkers, we entered the same two search phrases in Google and Google Scholar and examined the first 300 results. In previous research, the probability of relevant search results identified using Google declines substantially after the first 300 results.27 We supplemented our searches by asking the developers of two symptom checkers if they knew of other competing products.
In total we identified 143 symptom checkers. We excluded 102 that used the same medical content and logic as another tool (and therefore would have identical performance) (see list in supplementary appendix). We excluded a further 25 that only focused on a single class of illness (for example, orthopedic problems), 14 that only provided medical advice (for example, what symptoms are typically associated with a certain condition) and did not provide diagnosis or triage advice, and two that were not working. After these exclusions, we evaluated 23 symptom checkers.
Symptom checkers’ characteristics
We categorized symptom checkers by whether they facilitated self diagnosis, self triage, or both; the type of organization that operated the symptom checker; the maximum number of diagnoses provided; and whether they were based on Schmitt or Thompson nurse triage guidelines, which are decision support protocols commonly used in telephone triage for pediatric and adult consultations, respectively.28 29 We grouped government and health plans together because both may have a financial incentive to deter unnecessary visits. In the supplementary appendix we provide data when available about estimated total visitors to select symptom checkers.
Clinical vignettes
To evaluate the diagnosis and triage performance of the symptom checkers, we used 45 standardized patient vignettes. We used clinical vignettes to assess performance because they are a common method to test physicians and other clinicians on their diagnostic ability and management decisions. We purposefully selected standardized patient vignettes from three categories of triage urgency: 15 vignettes for which emergent care is required, 15 vignettes for which non-emergent care is reasonable, and 15 vignettes for which a medical visit is generally unnecessary and self care is sufficient. We chose vignettes across the severity spectrum because patients use symptom checkers for symptoms that require both urgent and non-urgent care.3 We included vignettes for both common and uncommon conditions because we believe that the clinical community would be particularly interested in performance for less common but potentially life threatening problems.
The standardized patient vignettes were identified from various clinical sources, including materials used to educate health professionals and a medical resource website with content provided by a panel of physicians.30 The source for each vignette also provided the associated correct diagnosis. Symptom checkers generally require users to enter a list of symptoms or ask a series of questions about their symptoms. Each vignette was simplified into a core set of symptoms for easy entry, and in some situations we supplemented the data provided by the vignette because a symptom checker asked about a symptom not addressed in the vignette (see the supplementary appendix for details on source, core symptoms, and supplemental symptoms for each vignette).
We categorized the 45 vignettes as either “common” or “uncommon” diagnoses based on the prevalence of the diagnosis among ambulatory visits in the United States (for full details see the supplementary appendix).31
Assessing diagnosis and triage results
Each standardized patient vignette was entered into each website or app, and we recorded the resulting diagnoses and triage advice. An author (HS) with no clinical training entered all the vignettes. A random sample of 25 vignettes was entered into symptom checkers by another person without clinical training and the inter-rater reliability between the two in capturing the symptom checker’s recommendations for diagnosis and triage was high (Cohen’s κ 0.90). In some cases we could not evaluate a vignette because some symptom checkers focus only on children or on adults or the symptom checker did not list or ask for the key symptom in the vignette. To avoid penalizing these symptom checkers, we referred to standardized patient vignettes that successfully yielded an output as “standardized patient evaluations.”
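The agreement statistic reported above can be illustrated with a minimal sketch of Cohen's κ for two raters. The function name `cohens_kappa` and the six example recordings are hypothetical and are not the study's data; the sketch only shows how observed agreement is corrected for chance agreement.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical triage recordings for six entries (illustrative only).
first = ["emergent", "emergent", "self care", "non-emergent", "self care", "emergent"]
second = ["emergent", "emergent", "self care", "non-emergent", "non-emergent", "emergent"]
kappa = cohens_kappa(first, second)
```

In this made-up example the raters agree on five of six entries, and κ is lower than the raw agreement because some of that agreement would be expected by chance.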
To assess diagnostic accuracy, we noted whether the correct diagnosis was listed first or listed at all. For several vignettes, two symptom checkers presented a large number of diagnoses (as many as 99). Because such a long list of potential diagnoses is unlikely to be useful for patients, we considered a diagnosis to be listed at all only if it was within the first 20 diagnoses provided by a symptom checker. It is possible that many patients only focus on the top diagnoses listed. Therefore we also looked at whether the correct diagnosis was listed in the first three diagnoses given. We judged the diagnosis incorrect if the symptom checker indicated that the condition could not be identified.
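The scoring rule described above can be sketched as follows. This is an illustration under stated assumptions: the function name `score_diagnosis` is hypothetical, and exact case-insensitive string matching stands in for the judgment of whether a listed diagnosis was correct.

```python
def score_diagnosis(ranked_diagnoses, correct, max_listed=20):
    """Score one evaluation: correct diagnosis first, in the top 3, or listed at all."""
    # Only the first `max_listed` entries count as "listed at all".
    considered = [d.strip().lower() for d in ranked_diagnoses[:max_listed]]
    target = correct.strip().lower()
    # An empty list stands for "condition could not be identified",
    # which is scored as incorrect on every measure.
    return {
        "first": bool(considered) and considered[0] == target,
        "top3": target in considered[:3],
        "listed": target in considered,
    }

# Hypothetical symptom checker output for an otitis media vignette.
result = score_diagnosis(["Viral pharyngitis", "Otitis media", "Sinusitis"], "otitis media")
```

Here the correct diagnosis is second in the list, so it counts toward the top three and top 20 measures but not toward "listed first".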
We categorized the triage advice into three groups: emergent, which included advice to call an ambulance, go to the emergency department, or see a general practitioner immediately; non-emergent, which included advice to call a general practitioner or primary care provider, see a general practitioner or primary care provider, go to an urgent care facility, go to a specialist, go to a retail clinic, or have an e-visit; and self care, which included advice to stay at home or go to a pharmacy. If multiple triage locations were suggested (for example, emergency department or specialist), we used the most urgent suggestion. We chose to do so because in almost all of the cases the most urgent triage suggestion was listed first. If a symptom checker was unable to reach a decision on diagnosis for a given standardized patient vignette but provided triage advice, we still assessed the appropriateness of this triage advice. Symptom checkers that required users to select the correct diagnosis before giving triage advice were not included in assessing the accuracy of triage with the exception of iTriage, which always suggested emergent triage advice.
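The categorization above, including the rule of keeping the most urgent suggestion, can be sketched as a small lookup. The advice strings, the dictionary names, and the functions `categorize_triage` and `triage_is_appropriate` are hypothetical labels chosen for illustration; the grouping itself follows the three categories described in the text.

```python
# Higher number means more urgent.
URGENCY = {"self care": 0, "non-emergent": 1, "emergent": 2}

# Hypothetical advice labels mapped to the three groups described in the text.
ADVICE_TO_CATEGORY = {
    "call an ambulance": "emergent",
    "emergency department": "emergent",
    "see a general practitioner immediately": "emergent",
    "call a general practitioner": "non-emergent",
    "see a general practitioner": "non-emergent",
    "urgent care facility": "non-emergent",
    "specialist": "non-emergent",
    "retail clinic": "non-emergent",
    "e-visit": "non-emergent",
    "stay at home": "self care",
    "pharmacy": "self care",
}

def categorize_triage(suggestions):
    """Return the most urgent category among the suggested care sites."""
    if not suggestions:
        raise ValueError("no triage advice was given")
    categories = [ADVICE_TO_CATEGORY[s] for s in suggestions]
    return max(categories, key=URGENCY.__getitem__)

def triage_is_appropriate(suggestions, expected_category):
    """Compare the assigned category with the vignette's expected urgency."""
    return categorize_triage(suggestions) == expected_category

# "Emergency department or specialist" is scored as emergent.
category = categorize_triage(["specialist", "emergency department"])
```

Scoring on the most urgent suggestion means that any output mentioning an emergency option is treated as emergent advice, even when a less urgent option is also offered.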
Patient involvement
There was no patient involvement in this study.
Analysis
We calculated summary statistics for diagnostic accuracy and triage advice with 95% confidence intervals based on binomial distribution using Stata/MP 13.0. Given our focus on symptom checkers as a whole, we did not make statistical comparisons of accuracy between individual symptom checkers. We used χ² tests to compare the diagnosis and triage accuracy by level of urgency and by type of symptom checker. We conducted a sensitivity analysis of triage advice, excluding several symptom checkers that always or usually recommended emergent care.
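As a rough illustration of a binomial 95% confidence interval, the sketch below uses the normal (Wald) approximation. This is an assumption: the text does not state which binomial interval Stata was asked to compute, and the function name `wald_ci` is hypothetical. The inputs are the rounded proportion and sample size reported for "correct diagnosis listed first" (34% of 770 evaluations).

```python
import math

def wald_ci(proportion, n, z=1.96):
    """Normal-approximation confidence interval for a binomial proportion."""
    half_width = z * math.sqrt(proportion * (1 - proportion) / n)
    # Clamp to the valid range for a proportion.
    return max(0.0, proportion - half_width), min(1.0, proportion + half_width)

# Rounded figures from the abstract: 34% of 770 standardized patient evaluations.
low, high = wald_ci(0.34, 770)
print(f"{low:.1%} to {high:.1%}")  # prints "30.7% to 37.3%"
```

Rounded to whole percentages this gives 31% to 37%. Because the inputs are already rounded, the same calculation applied to other reported proportions may differ from the published interval by a percentage point.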
Results
Study sample
The 23 identified symptom checkers were based in the United Kingdom, United States, the Netherlands, and Poland (table 1): 11 symptom checkers provided both diagnoses and triage advice, eight only provided diagnoses, and four only provided triage advice. The 45 standardized patient vignettes included 26 common and 19 uncommon diagnoses. Performance was assessed on a total of 770 standardized patient evaluations for diagnosis and 532 standardized patient evaluations for triage. Across the symptom checkers, 10 did not ask for demographics (age and sex).
Accuracy of diagnosis
Overall, the correct diagnosis was listed first in 34% (95% confidence interval 31% to 37%; table 2) of standardized patient evaluations. Performance varied by urgency of condition. The correct diagnosis was listed first for 24% (19% to 30%) of emergent standardized patient evaluations, 38% (32% to 44%) of non-emergent standardized patient evaluations, and 40% (34% to 47%) of self care standardized patient evaluations (P<0.001 for comparison, table 2). There was no difference between symptom checkers that asked for and did not ask for demographic information (34%, 95% confidence interval 30% to 39% and 34%, 28% to 39%, P=0.88; table 3). The correct diagnosis was, however, listed first more often in standardized patient evaluations for common diagnoses than for uncommon diagnoses (38%, 34% to 43% and 28%, 23% to 33%, P=0.004; table 2).
Performance varied across symptom checkers. Listing the correct diagnosis first in standardized patient evaluations ranged from 5% for MEDoctor (95% confidence interval 0% to 13%) to 50% for DocResponse (33% to 67%; table 4). Few differences were observed by the symptom checkers’ characteristics (table 3).
Across all symptom checkers the correct diagnosis was listed in the first three diagnoses in 51% (95% confidence interval 47% to 54%) of standardized patient evaluations and in the first 20 diagnoses in 58% (55% to 62%) of standardized patient evaluations (table 2). Diagnostic accuracy for listing the correct diagnosis in the top three and top 20 was higher for self care conditions than for emergent conditions and was also higher for common conditions than for uncommon conditions. There was no significant difference in listing the correct diagnosis in the top 20 between symptom checkers that listed more than 11 diagnoses compared with those that only listed 1-3 diagnoses (59%, 53% to 65% v 53%, 46% to 59%, P=0.12; table 3). The accuracy of listing the correct diagnosis in the top 20 across the 23 individual symptom checkers ranged from 34% (95% confidence interval 17% to 52%) to 84% (73% to 95%; table 4).
Accuracy of triage advice
Appropriate triage advice was given in 57% (95% confidence interval 52% to 61%) of standardized patient evaluations (table 2). Performance on triage advice was higher for emergent care standardized patient evaluations than for non-emergent and self care standardized patient evaluations: 80% (75% to 86%) v 55% (47% to 63%) v 33% (26% to 40%), P<0.001. Appropriate triage advice was given more often for uncommon diagnoses than for common diagnoses: 63% (57% to 70%) v 52% (46% to 57%), P=0.01.
iTriage, Symcat, Symptomate, and Isabel always suggested users seek care and therefore never advised self care (table 4). After excluding these four symptom checkers, appropriate triage advice was given in 61% (95% confidence interval 56% to 66%) of standardized patient evaluations (see supplementary table 5).
Symptom checkers that used the Schmitt or Thompson nurse triage protocols were more likely to provide appropriate triage decisions than those that did not: 72% (95% confidence interval 60% to 84%) v 55% (50% to 59%), P=0.01 (table 3). Accurate triage advice varied by operator of symptom checker (provider groups and physician associations 68% (58% to 77%), private companies 59% (53% to 65%), health plans or governments 43% (34% to 51%), P<0.001).
Discussion
Using standardized patient vignettes we measured the diagnostic and triage accuracy of symptom checkers. Although there was a range of performance across symptom checkers, overall they had deficits in both diagnosis and triage accuracy. On average, symptom checkers provided the correct diagnosis within the first 20 listed in 58% of standardized patient evaluations, with the best performing symptom checker listing the correct diagnosis in 84% of standardized patient evaluations. Symptom checkers advised the appropriate level of care about half the time, but this varied by clinical severity. The rate of correct triage decisions was much higher for standardized patient evaluations requiring emergent care (80%) than for those for which self care was appropriate (33%).
Comparisons with other studies
Our results on diagnostic accuracy and appropriate triage are roughly similar to previous work on the performance of single symptom checkers for a limited set of diagnoses.6 7 8 32 An orthopedic symptom checker listed the correct diagnosis for knee pain 89% of the time, and Boots WebMD listed the correct diagnosis 70% of the time for ear, nose, and throat symptoms.7 8 One study that also used two common acute standardized patient vignettes to evaluate WebMD reported a diagnostic accuracy rate of 50%.6
Whether this level of performance for diagnosis and triage we observed is acceptable depends on the standard for comparison. If symptom checkers are seen as a replacement for seeing a physician, they are likely an inferior alternative. It is believed that physicians have a diagnostic accuracy rate of 85-90%, though in some studies using clinical vignettes, performance was lower.33 34 However, in-person physician visits might be the wrong comparison because patients are likely not using symptom checkers to obtain a definitive diagnosis but for quick and accessible guidance. Also, instead of diagnostic accuracy the key assessment of symptom checkers may be appropriate triage. Distinguishing between Rocky Mountain spotted fever and meningitis may be less important than ensuring patients seek emergent care.
If symptom checkers are seen as an alternative for simply entering symptoms into an online search engine such as Google, then symptom checkers are likely a superior alternative. A recent study found that when typing acute symptoms that would require urgent medical attention into search engines to identify symptom-related web sites, advice to seek emergent care was present only 64% of the time.3
Perhaps the most appropriate comparison to symptom checkers is telephone triage lines, which are widely used in developed nations.15 16 17 18 In general patients use symptom checkers and telephone triage for similar complaints.13 Also, many nurse phone triage lines use the same underlying clinical logic as the symptom checkers evaluated in this study. For example, some health plan nurse triage lines use the Healthwise symptom checker, and the Schmitt and Thompson protocols were originally developed for phone triage and now provide the underlying logic for several symptom checkers that we evaluated. The accuracy of telephone triage recommendations, as compared to in-person physician recommendations, ranged from 61% in a study of pediatric abdominal pain to 69% in a multicenter observational study.35 36 A recent study of NHS Symptom Checkers and NHS Direct’s telephone triage line found triage advice from both to be comparable.9 Given their similar clinical logic, triage performance, and their negligible operation costs, symptom checkers could potentially be a more cost effective way of providing triage advice than nurse-staffed phone lines.17
Implications for using symptom checkers
Both symptom checkers and telephone triage have been promoted as a means of reducing unnecessary office visits.15 16 17 18 37 The impact of symptom checkers on how people seek care depends on how patients respond to advice, and this is unknown. In one study, users expressed skepticism about the diagnosis ultimately suggested by a symptom checker.6 The risk averse nature of symptom checkers’ triage advice is a concern. In two thirds of standardized patient evaluations where medical attention was not necessary, we found symptom checkers encouraged care. Overly risk averse advice is not limited to symptom checkers. Telephone triage advice can also encourage unnecessary care seeking.32 35 For instance, the NHS’s telephone triage line, which is not staffed by health professionals, has been implicated in increasing visits to emergency departments in the UK.38 Some patients researching health conditions online are motivated by fear, and the listing of concerning diagnoses by symptom checkers could contribute to hypochondriasis and “cyberchondria,” which describes the escalated anxiety associated with self diagnosis on the internet.39 40 41 42 43 Together, confusion, risk averse triage advice, and cyberchondria could mean that symptom checkers encourage patients to receive care unnecessarily and thus increase healthcare spending. Understanding how patients interpret and use the advice from symptom checkers and the impact of symptom checkers on care seeking should be a key focus for future research.
The symptom checkers in this study represent the first generation of such tools, and there are several potential advances that may improve their performance in future versions. Incorporating local epidemiological data may help inform diagnoses. For instance, addition of real time information about the local incidence of illness in the community greatly improved the performance of a diagnostic tool for group A streptococcal pharyngitis.10 Diagnosis and triage rates could also be improved if symptom checkers incorporated individual clinical data from medical claims or the electronic medical record. Demographic information is critical for both diagnostic and triage decisions for programs such as symptom checkers.11 One surprising finding in our study was that symptom checkers that asked for demographic background information did not perform better. However, it is possible that this demographic information was not effectively incorporated into the symptom checkers’ algorithms.
Strengths and limitations of this study
Despite the growing use of symptom checkers, we believe our study is the first to assess the clinical performance across a large number of symptom checkers and wide range of conditions.
There were key limitations to this study. We cannot be sure we identified all publicly available symptom checkers, despite a thorough search of relevant databases and consultation with experts in this discipline. We used clinical vignettes in which the symptoms and diagnoses were typically clear, and few vignettes included comorbid conditions, resulting in a possible overestimation of the true clinical accuracy of symptom checkers.33 Some standardized patient vignettes contained specific clinical language (for example, mouth ulcers, tonsils with exudate), and actual patients with the same condition might struggle with the words to use to describe their symptoms or use different terms. Therefore, our analysis represents an indirect assessment of how well symptom checkers would perform with actual patients. We do not know how well physicians or other providers would diagnose or triage when presented with these standardized patient vignettes, preventing a direct comparison between symptom checkers and physicians. When symptom checkers suggested several care sites (for example, emergency department or general practice), our triage assessment was based only on the highest acuity site of care listed, and this may contribute to our finding that triage advice is risk averse.
Symptom checkers are part of a larger trend of both patients and physicians using the internet for many healthcare tasks and therefore it seems likely that the use of symptom checkers will only increase. Patients are chatting online with physicians,44 emailing their doctors for medical advice,45 receiving care through e-visits,46 47 and downloading health apps to smartphones.48 In addition to the public, physicians and other practitioners are also using conceptually similar tools to aid in the diagnosis and triage of their patients.49 50
Physicians should be aware that an increasing number of their patients are using new internet based tools such as symptom checkers and that the diagnosis and triage advice patients receive may often be inaccurate. For patients, our results imply that in many cases symptom checkers can give the user a sense of possible diagnoses but also provide a note of caution, as the tools are frequently wrong and the triage advice overly cautious. Symptom checkers may, however, be of value if the alternative is not seeking any advice or simply using an internet search engine. Further evaluations and monitoring of symptom checkers will be important to assess whether they help people learn more and make better decisions about their health.
What is already known on this topic
The public is increasingly using the internet for self diagnosis and triage advice, and there has been a proliferation of computerized algorithms called symptom checkers that attempt to streamline this process
Despite the growth in use of these tools, their clinical performance has not been thoroughly assessed
What this study adds
Our study suggests that symptom checkers have deficits in both diagnosis and triage, and their triage advice is generally risk averse
Notes
Footnotes
Contributors: All authors conceived and designed the study. HLS acquired the data and drafted the manuscript. HLS and AM analysed and interpreted the data. CG, JAL, and AM critically revised the manuscript for important intellectual content. HLS and AM carried out the statistical analysis. AM provided administrative, technical, and material support and supervised the study. AM acts as guarantor.
Funding: This study was funded by the US National Institutes of Health (National Institute of Allergy and Infectious Diseases grant No R21 AI097759-01).
Competing interests: All authors have completed the ICMJE uniform disclosure form at http://www.icmje.org/coi_disclosure.pdf and declare: all authors are affiliated with Harvard Medical School. Harvard Medical School’s Family Health Guide is used as the basis for one of the symptom checkers evaluated. This symptom checker is available both in print and online (www.health.harvard.edu/family_health_guide/symptoms). None of the authors have been or plan to be involved in the development, evaluation, promotion, or any other facet of a Harvard Medical School related symptom checker; the authors have no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.
Ethical approval: Not required
Data sharing: No additional data available.
Transparency: The guarantor (AM) affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained.
This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.