Type of Document: Master's Thesis
Author: Chen, Yukun
Author's Email Address: email@example.com
URN: etd-07122013-162658
Title: Applying Active Learning to Biomedical Text Processing
Degree: Master of Science
Department: Biomedical Informatics
Advisory Committee:
- Hua Xu (Committee Co-Chair)
- Joshua C. Denny (Committee Co-Chair)
- Qiaozhu Mei (Committee Member)
- Thomas Lasko (Committee Member)
Keywords:
- Active Learning
- Natural Language Processing
- Biomedical Text Processing
- Machine Learning
Date of Defense: 2013-05-23
Availability: unrestricted
Abstract
Objective: Supervised machine learning methods have shown good performance in text classification tasks in the biomedical domain, but they often require large annotated corpora, which are costly to develop. Our goal is to assess whether active learning strategies can be integrated with supervised machine learning methods to reduce annotation cost while maintaining or improving the quality of classification models for biomedical text.
Methods: We applied active learning to two biomedical natural language processing (NLP) tasks: 1) the assertion classification task in the 2010 i2b2/VA Clinical NLP Challenge, which required determining the assertion status of clinical concepts; and 2) a supervised word sense disambiguation (WSD) task, which required disambiguating 197 ambiguous words and abbreviations in MEDLINE abstracts. We developed support vector machine (SVM) classifiers for both tasks. We then implemented several existing and newly developed active learning algorithms, integrated them with the SVM classifiers, and evaluated their performance on both tasks.
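As an illustration of how an active learner can be coupled with an SVM, the following is a minimal sketch of pool-based, margin-based uncertainty sampling. It is not the thesis's implementation: the abstract does not name the specific algorithms used, so the query strategy here is one common choice, the linear SVM trained by hinge-loss subgradient descent is a simplified stand-in for a full SVM package, and the names `train_linear_svm`, `decision`, `active_learning`, `oracle`, `seed_ids`, and `budget` are all hypothetical.

```python
def train_linear_svm(X, y, epochs=200, lam=0.01, lr=0.1):
    """Fit a linear SVM by subgradient descent on the L2-regularized hinge loss.

    X is a list of feature vectors, y a list of labels in {+1, -1}.
    This is a simplified stand-in, not a production SVM solver.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # L2 regularization step: shrink the weights slightly.
            w = [wj * (1 - lr * lam) for wj in w]
            # Hinge-loss step: only samples inside the margin move the model.
            if margin < 1:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def decision(model, x):
    """Signed score of x; its absolute value grows with distance from the boundary."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def active_learning(pool, oracle, seed_ids, budget):
    """Pool-based active learning with margin-based uncertainty sampling.

    pool     : list of feature vectors (the unlabeled pool)
    oracle   : function index -> label in {+1, -1}, standing in for the annotator
    seed_ids : indices labeled up front (assumed to cover both classes)
    budget   : total number of labels the annotator is willing to provide
    """
    labeled = list(seed_ids)
    labels = {i: oracle(i) for i in labeled}
    while True:
        model = train_linear_svm([pool[i] for i in labeled],
                                 [labels[i] for i in labeled])
        unlabeled = [i for i in range(len(pool)) if i not in labels]
        if len(labeled) >= budget or not unlabeled:
            return model, labeled
        # Query the sample the current model is least certain about,
        # i.e. the one closest to the decision boundary.
        query = min(unlabeled, key=lambda i: abs(decision(model, pool[i])))
        labels[query] = oracle(query)
        labeled.append(query)
```

A passive (random sampling) baseline would replace the `min(...)` query with a random choice from `unlabeled`; comparing the two learning curves at equal label counts is the kind of evaluation the Results paragraph reports.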
Results: In the assertion classification task, active learning strategies required far fewer samples than random sampling to achieve the same classification performance. For example, to achieve an AUC of 0.79, the random sampling method used 32 samples, while our best active learning algorithm required only 12 samples, a reduction of 62.5% in manual annotation effort. In the WSD task, active learners also significantly outperformed the passive learner, showing better performance for 177 out of 197 (89.8%) ambiguous terms. Further analysis showed that to achieve an average accuracy of 90%, the passive learner needed 38 annotated samples, while the active learners needed only 24, a 37% reduction in annotation effort. We also analyzed cases where active learning algorithms did not achieve superior performance and identified three causes: (1) a poor model in the early learning stage; (2) easy WSD cases; and (3) difficult WSD cases. These findings provide useful insight for future improvements.
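The reported percentages follow directly from the sample counts above. The short check below is only a sketch of that arithmetic; the helper name `annotation_reduction` is hypothetical.

```python
def annotation_reduction(passive_samples, active_samples):
    """Fraction of annotation effort saved relative to the passive learner."""
    return (passive_samples - active_samples) / passive_samples


# Assertion classification: 32 samples (random) vs. 12 (best active learner).
print(f"{annotation_reduction(32, 12):.1%}")  # 62.5%

# WSD: 38 samples (passive) vs. 24 (active); reported rounded to 37%.
print(f"{annotation_reduction(38, 24):.1%}")  # 36.8%

# Share of ambiguous terms where active learning performed better.
print(f"{177 / 197:.1%}")  # 89.8%
```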
Conclusion: Both studies demonstrated that integrating active learning strategies with supervised learning methods could effectively reduce annotation cost and improve the classification models in biomedical text processing.
Filename: Chen.pdf
Size: 1.36 Mb
Approximate Download Time (Hours:Minutes:Seconds):
- 28.8 Modem: 00:06:18
- 56K Modem: 00:03:14
- ISDN (64 Kb): 00:02:50
- ISDN (128 Kb): 00:01:25
- Higher-speed Access: 00:00:07