Research Methods & Reporting

Development of phenotype algorithms using electronic medical records and incorporating natural language processing

BMJ 2015; 350 doi: https://doi.org/10.1136/bmj.h1885 (Published 24 April 2015) Cite this as: BMJ 2015;350:h1885
  1. Katherine P Liao, assistant professor (1, 2),
  2. Tianxi Cai, professor (3),
  3. Guergana K Savova, associate professor (4),
  4. Shawn N Murphy, associate professor (5),
  5. Elizabeth W Karlson, associate professor (1, 2),
  6. Ashwin N Ananthakrishnan, assistant professor (6),
  7. Vivian S Gainer, senior analyst (7),
  8. Stanley Y Shaw, assistant professor (2, 8),
  9. Zongqi Xia, assistant professor (2, 9),
  10. Peter Szolovits, professor (10),
  11. Susanne Churchill, executive director (2),
  12. Isaac Kohane, professor (2, 5)
  1. Division of Rheumatology, Immunology and Allergy, Brigham and Women’s Hospital, Boston, MA 02115, USA
  2. Harvard Medical School, Boston
  3. Department of Biostatistics, Harvard School of Public Health, Boston
  4. Department of Pediatrics, Children’s Hospital of Boston, Boston
  5. Department of Neurology, Massachusetts General Hospital, Boston
  6. Department of Gastroenterology, Massachusetts General Hospital, MGH Crohn’s and Colitis Center, Boston
  7. Partners Research Computing, Partners HealthCare System, Boston
  8. Center for Systems Biology, Massachusetts General Hospital, Boston
  9. Department of Neurology, Harvard Medical School, Boston
  10. Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA
  Correspondence to: K P Liao kliao{at}partners.org
  • Accepted 2 February 2015

Electronic medical records are emerging as a major source of data for clinical and translational research studies, although phenotypes of interest need to be accurately defined first. This article provides an overview of how to develop a phenotype algorithm from electronic medical records, incorporating modern informatics and biostatistics methods.

The increasing use of electronic medical records (EMR), driven mainly by efforts to improve the quality of patient care, has also launched a discipline of research using EMR data. In the past decade, methods and tools developed specifically for EMR research have allowed for sophisticated analyses, including pharmacovigilance,1 genetic association,2 and pharmacogenetic studies.3 Phenotype algorithms, which use EMR data to classify patients with specific diseases and outcomes, are a foundation of EMR research. Diagnosis or billing codes are typically used in these algorithms, and are examples of structured EMR data. These data are readily available and searchable (fig 1), but vary in accuracy. Recent work has focused on incorporating other informative EMR data to develop robust phenotype algorithms.

Fig 1 Overview of the two main types of EMR data, structured and unstructured, and how these data can be integrated for research studies. In this instance, the figure illustrates the development of a phenotype algorithm for rheumatoid arthritis. *Including ICD-9 (international classification of diseases, 9th revision) codes and CPT (current procedural terminology) codes
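
As a rough illustration of how structured data such as billing codes can feed a phenotype algorithm, the sketch below flags patients who have a minimum number of ICD-9 codes for rheumatoid arthritis (714.0). It is only a minimal sketch, not the authors' method: the record layout, the patient identifiers, the function name `classify_by_code_count`, and the threshold of two codes are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical structured EMR extract: one (patient_id, icd9_code) pair
# per billed encounter. The layout and values are invented for illustration.
billing_records = [
    ("p1", "714.0"),
    ("p1", "714.0"),
    ("p1", "401.9"),
    ("p2", "714.0"),
    ("p3", "250.00"),
]

RA_CODE = "714.0"    # ICD-9 code for rheumatoid arthritis
MIN_CODE_COUNT = 2   # assumed threshold, chosen only for this example


def classify_by_code_count(records, target_code, min_count):
    """Return the set of patient IDs with at least `min_count`
    occurrences of `target_code` in `records`."""
    counts = Counter(pid for pid, code in records if code == target_code)
    return {pid for pid, n in counts.items() if n >= min_count}


ra_patients = classify_by_code_count(billing_records, RA_CODE, MIN_CODE_COUNT)
print(sorted(ra_patients))
```

A rule of this kind is easy to run across a whole EMR, but because billing codes vary in accuracy, its output would still need to be checked against reviewed medical records before being used in a study.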

Beyond billing and diagnosis codes, advanced EMRs contain a variety of structured data such as electronic prescriptions and laboratory values. A substantial portion of clinical data is also embedded in unstructured data in the form of narrative text notes, either typed or dictated by physicians (fig 1). Extracting accurate information from narrative notes is a well known challenge for clinical researchers, and this information is typically obtained through laborious medical record review. Natural language processing (NLP),4 …
