A joint project of the Graduate School, Peabody College, and the Jean & Alexander Heard Library

Title page for ETD etd-08262015-232710

Type of Document Dissertation
Author Teixeira, Pedro Luis, Jr.
URN etd-08262015-232710
Title Computational Phenotyping and Phenome-wide Association Studies: Leveraging Machine Learning and Natural Language Processing to Understand Electronic Health Record Data
Degree PhD
Department Biomedical Informatics
Advisory Committee
  Advisor Name                       Title
  Joshua C. Denny, M.D., M.S.        Committee Chair
  Dan M. Roden, M.D.                 Committee Member
  S. Trent Rosenbloom, M.D., MPH     Committee Member
  Thomas A. Lasko, M.D., Ph.D.       Committee Member
  Todd L. Edwards, M.S., Ph.D.       Committee Member
Keywords
  • biomedical informatics
  • phenome-wide association studies
  • hypertension
  • random forests
  • machine learning
  • natural language processing
Date of Defense 2015-07-24
Availability unrestricted
Abstract
The aims of this project are 1) to evaluate various data sources and algorithms for identifying hypertensive individuals within the electronic health record, and 2) to develop and evaluate a novel method for identifying associations between genotypes and natural language processing-based phenotypes extracted from the electronic health record.

The author evaluated data sources and hypertension phenotyping algorithms using a set of 631 individuals manually reviewed for hypertension status based on their electronic health record data. Combinations of data sources outperformed methods that leveraged any category individually. Random forest models trained with billing codes, medications, vital signs, and hypertension concept counts achieved a median AUC of 0.976. The best algorithms performed similarly at a second site.
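
As an illustration of the evaluation metric reported above, the following is a minimal sketch of how an AUC can be computed from chart-review labels and model scores, and how a median could be taken over several evaluation runs. It is not the dissertation's code: the function names, the toy data, and the idea that the median is taken over repeated evaluation runs are all assumptions, and the random forest training itself is not shown.

```python
from statistics import median

def auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def median_auc(runs):
    """Median AUC over several (labels, scores) evaluation runs."""
    return median(auc(labels, scores) for labels, scores in runs)

# Hypothetical toy data: label 1 = hypertensive on manual review,
# scores = predicted probabilities from some classifier.
runs = [
    ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]),  # perfectly separated
    ([1, 0, 1, 0], [0.7, 0.6, 0.4, 0.2]),  # one misordered pair
    ([1, 1, 0, 0], [0.6, 0.6, 0.6, 0.2]),  # two tied pairs
]
print(median_auc(runs))
```

The pairwise form is quadratic in the number of individuals, which is acceptable for a review set of a few hundred records; a rank-based implementation would be preferable at larger scale.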

The author also developed a novel method for phenome-wide association studies using natural language processing-based phenotypes (NLP-PheWAS). Using 29,722 individuals with Exome data, the author extracted 11,553 unique concepts from narrative text after negation, note section, and semantic type filtering. The method replicated 43.7% of known, statistically powered associations from the National Human Genome Research Institute’s genome-wide association catalog. NLP-PheWAS also identified two potentially novel associations among the SNPs studied. They included an association between optic disc neovascularization and rs1497546 and between Langerhans-Cell Histiocytosis and rs7193343. NLP-PheWAS is a promising method for enabling rapid discovery, interpretation of novel associations, and increased understanding of genetic influences within the rapidly expanding narrative text of electronic health records.
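
The general shape of a phenome-wide scan over NLP-derived concepts can be sketched as follows: for one variant, test each concept for association with carrier status and apply a multiple-testing correction. The abstract does not state which statistical test or correction was used, so this sketch substitutes Fisher's exact test and a Bonferroni threshold as stand-ins; the function names, concept names, and toy cohort are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def phewas_scan(carrier, concepts, alpha=0.05):
    """carrier: id -> bool; concepts: concept name -> set of ids with a
    non-negated mention. Returns (concept, p, significant), sorted by p."""
    threshold = alpha / len(concepts)  # Bonferroni correction (an assumption)
    results = []
    for name, cases in concepts.items():
        a = sum(1 for i, g in carrier.items() if g and i in cases)
        b = sum(1 for i, g in carrier.items() if g and i not in cases)
        c = sum(1 for i, g in carrier.items() if not g and i in cases)
        d = sum(1 for i, g in carrier.items() if not g and i not in cases)
        p = fisher_exact_two_sided(a, b, c, d)
        results.append((name, p, p < threshold))
    return sorted(results, key=lambda r: r[1])

# Hypothetical toy cohort: ids 0-9 carry the variant, ids 10-19 do not.
carrier = {i: i < 10 for i in range(20)}
concepts = {
    "concept_A": set(range(9)),   # mentioned almost only among carriers
    "concept_B": {0, 1, 10, 11},  # evenly spread across both groups
}
for name, p, significant in phewas_scan(carrier, concepts):
    print(name, p, significant)
```

A real analysis would typically adjust for covariates such as age and sex, which a 2x2 test cannot do; the sketch only shows the loop over concepts and the correction step.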

  Filename       Size       Approximate Download Time (Hours:Minutes:Seconds)
                            28.8 Modem   56K Modem   ISDN (64 Kb)   ISDN (128 Kb)   Higher-speed Access
  Teixeira.pdf   4.52 Mb    00:20:55     00:10:45    00:09:24       00:04:42        00:00:24
