A joint project of the Graduate School, Peabody College, and the Jean & Alexander Heard Library

Title page for ETD etd-06122015-162419

Type of Document Dissertation
Author Chen, Yukun
Author's Email Address yukun.chen@vanderbilt.edu
URN etd-06122015-162419
Title Active Learning for Named Entity Recognition in Clinical Text
Degree PhD
Department Biomedical Informatics
Advisory Committee
  Advisor Name        Title
  Joshua C. Denny     Committee Chair
  Hua Xu              Committee Co-Chair
  Qiaozhu Mei         Committee Member
  Qingxia Chen        Committee Member
  Thomas A. Lasko     Committee Member
Keywords
  • natural language processing
  • named entity recognition
  • machine learning
  • active learning
  • clinical NLP
Date of Defense 2015-05-27
Availability unrestricted
Abstract
Named entity recognition (NER) is one of the fundamental tasks for building clinical natural language processing (NLP) systems. Machine learning (ML) based approaches can achieve good performance, but they often require large numbers of annotated samples, which are expensive to build because annotation requires domain experts. Active learning (AL), a sample selection approach that can be integrated with supervised ML, has shown promising potential to minimize annotation cost while maximizing the performance of ML-based models in various NLP tasks. However, very few studies have investigated AL for clinical NER in a real-life setting.

In this dissertation research, I systematically studied AL in a clinical NER task to identify medical problems, treatments, and lab tests in clinical notes. Novel AL algorithms were developed to query the most informative and least costly sentences based on three properties: uncertainty, representativeness, and annotation time. I also developed the first AL-enabled annotation system for clinical NER. Using this system, I further conducted user studies to assess the performance of AL in real-world annotation processes for building clinical NER systems.
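The idea of querying sentences by uncertainty, representativeness, and annotation time can be illustrated with a minimal sketch. This is not the dissertation's algorithm, which this page does not specify. The `Candidate` fields, the way the three properties are combined (uncertainty times representativeness, divided by estimated time), and the linear cost model with its `base` and `per_token` constants are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One unlabeled sentence in the pool (hypothetical structure)."""
    sentence_id: str
    # Per-token label distributions from the current NER model (assumed available).
    token_marginals: List[List[float]]
    # Mean similarity to the rest of the unlabeled pool, in [0, 1] (assumed precomputed).
    similarity_to_pool: float


def token_entropy(dist: List[float]) -> float:
    """Shannon entropy (in nats) of one token's label distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def uncertainty(c: Candidate) -> float:
    """Sum of token entropies: higher means the model is less sure."""
    return sum(token_entropy(d) for d in c.token_marginals)


def estimated_seconds(c: Candidate, base: float = 2.0, per_token: float = 0.5) -> float:
    """Assumed linear cost model: fixed overhead plus time per token."""
    return base + per_token * len(c.token_marginals)


def score(c: Candidate) -> float:
    """Informativeness per estimated second of annotation (illustrative combination)."""
    return uncertainty(c) * c.similarity_to_pool / estimated_seconds(c)


def select_batch(pool: List[Candidate], k: int) -> List[Candidate]:
    """Return the k highest-scoring sentences to send to the annotator."""
    return sorted(pool, key=score, reverse=True)[:k]
```

Under this scoring, a long sentence with a few uncertain tokens can rank below a short, uniformly uncertain one, because its estimated annotation time is higher.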

The initial user study showed that conventional AL methods, which do not consider annotation time, did not outperform random sampling for every user. However, our newly developed AL algorithms with cost models for estimating annotation time were more promising in practice. To achieve an NER model with an F-measure of 0.70, simulated results showed that the new AL method saved ~33.3% of estimated annotation time compared to random sampling. In the user study, the new AL algorithm achieved better performance than random sampling and saved up to ~26.5% of real annotation time for one of the users.
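A time-saving figure of this kind is typically read off learning curves: find the annotation time at which each method first reaches the target F-measure, then compare. The sketch below shows one way to do that with linear interpolation between measured points. The function names and the curve representation are assumptions, and the page does not state how the reported percentages were actually computed.

```python
from typing import List, Optional, Tuple

# A learning curve: (annotation_time, f_measure) points sorted by time.
Curve = List[Tuple[float, float]]


def time_to_reach(curve: Curve, target: float) -> Optional[float]:
    """Earliest annotation time at which the curve reaches `target`.

    Interpolates linearly between adjacent points; returns None if the
    target is never reached.
    """
    prev_t, prev_f = curve[0]
    if prev_f >= target:
        return prev_t
    for t, f in curve[1:]:
        if f >= target:
            # prev_f < target <= f here, so the denominator is positive.
            return prev_t + (target - prev_f) * (t - prev_t) / (f - prev_f)
        prev_t, prev_f = t, f
    return None


def percent_time_saved(al_curve: Curve, random_curve: Curve, target: float) -> Optional[float]:
    """Percent of annotation time AL saves over random sampling at `target`."""
    t_al = time_to_reach(al_curve, target)
    t_rand = time_to_reach(random_curve, target)
    if t_al is None or t_rand is None or t_rand == 0:
        return None
    return 100.0 * (t_rand - t_al) / t_rand
```

Calling `percent_time_saved(al_curve, random_curve, 0.70)` on two measured curves gives the relative saving at the 0.70 target; a `None` result means at least one method never reached it.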

To the best of our knowledge, this is the first study to examine practical AL systems for clinical NER. Our study demonstrates that AL has the potential to save annotation time and improve model quality when building ML-based NER systems, provided that novel querying algorithms are implemented. Our future work includes developing better querying algorithms and evaluating the system with a larger number of users.

  Filename   Size      Approximate Download Time (Hours:Minutes:Seconds)
                       28.8 Modem   56K Modem   ISDN (64 Kb)   ISDN (128 Kb)   Higher-speed Access
  Chen.pdf   3.57 Mb   00:16:32     00:08:30    00:07:26       00:03:43        00:00:19
