A joint project of the Graduate School, Peabody College, and the Jean & Alexander Heard Library

Title page for ETD etd-07142018-114155

Type of Document Master's Thesis
Author Parr, Sharidan Kristen
URN etd-07142018-114155
Title Automated Mapping of Laboratory Tests to LOINC Codes using Noisy Labels in a National Electronic Health Record System Database
Degree Master of Science
Department Biomedical Informatics
Advisory Committee
Advisor Name       Title
Michael Matheny    Committee Chair
Matthew Shotwell   Committee Member
Thomas Lasko       Committee Member
Keywords
  • Laboratory
  • Data Quality
  • Machine Learning
Date of Defense 2018-06-01
Availability restrictone
Abstract
Standards such as the Logical Observation Identifiers Names and Codes (LOINC®) are critical for interoperability and for integrating data into common data models, but they are used inconsistently. Without consistent mapping to standards, clinical data cannot be harmonized, shared, or interpreted in a meaningful context. We sought to develop an automated machine learning pipeline that leverages noisy labels to map laboratory data to LOINC codes. Across 130 sites in the Department of Veterans Affairs Corporate Data Warehouse, we selected the 150 most commonly used laboratory tests with numeric results per site from 2000 through 2016. Using source data text and numeric fields, we developed a machine learning model and manually validated random samples from both the labeled and unlabeled datasets. The raw laboratory data consisted of more than 6.5 billion test results, with 2,215 distinct LOINC codes. The model predicted the correct LOINC code in 85% of the unlabeled data and 96% of the labeled data by test frequency. In the subset of labeled data where the original and model-predicted LOINC codes disagreed, the model-predicted LOINC code was correct in 83% of the data by test frequency. Using a completely automated process, we were able to assign LOINC codes to unlabeled data with high accuracy. This scalable, automated algorithm may improve data quality and interoperability while substantially reducing the manual effort currently needed to accurately map laboratory data.
Files (approximate download times in Hours:Minutes:Seconds)
Filename            Size      28.8 Modem   56K Modem   ISDN (64 Kb)   ISDN (128 Kb)   Higher-speed Access
[campus] Parr.pdf   3.35 Mb   00:15:29     00:07:58    00:06:58       00:03:29        00:00:17
[campus] indicates that a file or directory is accessible from the campus network only.
