Development of phenotype algorithms using electronic medical records and incorporating natural language processing
BMJ 2015; 350 doi: https://doi.org/10.1136/bmj.h1885 (Published 24 April 2015) Cite this as: BMJ 2015;350:h1885
- Katherine P Liao, assistant professor12,
- Tianxi Cai, professor3,
- Guergana K Savova, associate professor4,
- Shawn N Murphy, associate professor5,
- Elizabeth W Karlson, associate professor12,
- Ashwin N Ananthakrishnan, assistant professor6,
- Vivian S Gainer, senior analyst7,
- Stanley Y Shaw, assistant professor28,
- Zongqi Xia, assistant professor29,
- Peter Szolovits, professor10,
- Susanne Churchill, executive director2,
- Isaac Kohane, professor25
- 1Division of Rheumatology, Immunology and Allergy, Brigham and Women’s Hospital, Boston, MA 02115, USA
- 2Harvard Medical School, Boston
- 3Department of Biostatistics, Harvard School of Public Health, Boston
- 4Department of Pediatrics, Children’s Hospital of Boston, Boston
- 5Department of Neurology, Massachusetts General Hospital, Boston
- 6Department of Gastroenterology, Massachusetts General Hospital, MGH Crohn’s and Colitis Center, Boston
- 7Partners Research Computing, Partners HealthCare System, Boston
- 8Center for Systems Biology, Massachusetts General Hospital, Boston
- 9Department of Neurology, Harvard Medical School, Boston
- 10Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA
- Correspondence to: K P Liao kliao{at}partners.org
- Accepted 2 February 2015
The increasing use of electronic medical records (EMR), driven mainly by efforts to improve the quality of patient care, has also launched a discipline of research using EMR data. In the past decade, methods and tools specifically used to conduct EMR research have allowed for sophisticated analyses including pharmacovigilance,1 genetic association,2 and pharmacogenetic studies.3 Phenotype algorithms that use EMR data to classify patients with specific diseases and outcomes are a foundation of EMR research. Diagnosis or billing codes are typically used in these algorithms, and are examples of structured EMR data. These data are readily available and searchable (fig 1), but vary in accuracy. Recent work has focused on incorporating other informative EMR data to develop robust phenotype algorithms.
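As a rough illustration of a phenotype algorithm built only on structured data, the sketch below flags patients who have at least a minimum number of occurrences of a given billing code. The data rows, the function name `classify_by_code_count`, and the threshold of two codes are assumptions made for illustration; they are not the method of any study cited here.

```python
from collections import Counter

# Hypothetical structured EMR rows: (patient_id, billing_code).
# "714.0" is the ICD-9 code for rheumatoid arthritis; the rows are made up.
BILLING_ROWS = [
    ("p1", "714.0"),
    ("p1", "714.0"),
    ("p1", "401.9"),
    ("p2", "714.0"),
    ("p3", "250.00"),
]


def classify_by_code_count(rows, target_code, min_count=2):
    """Return the set of patient ids with at least min_count rows of target_code."""
    counts = Counter(pid for pid, code in rows if code == target_code)
    return {pid for pid, n in counts.items() if n >= min_count}


# Patients classified as cases under the assumed threshold of two codes.
cases = classify_by_code_count(BILLING_ROWS, "714.0")
print(sorted(cases))
```

Raising `min_count` trades sensitivity for specificity. A rule of this kind can only be as accurate as the codes it counts.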
Beyond billing and diagnosis codes, advanced EMRs contain a variety of structured data such as electronic prescriptions and laboratory values. A substantial portion of clinical data is also embedded in unstructured data in the form of narrative text notes, either typed or dictated by physicians (fig 1). Extracting accurate information from narrative notes is a well known challenge for clinical researchers, and such information is typically obtained through laborious medical record review. Natural language processing (NLP),4 …