A joint project of the Graduate School, Peabody College, and the Jean & Alexander Heard Library

Title page for ETD etd-12072007-141136

Type of Document Dissertation
Author Aphinyanaphongs, Yindalon
Author's Email Address ping.pong@vanderbilt.edu
URN etd-12072007-141136
Title Identifying high quality MEDLINE articles and web sites using machine learning
Degree PhD
Department Biomedical Informatics
Advisory Committee
Advisor Name          Title
Constantin Aliferis   Committee Chair
Dan Masys             Committee Member
Douglas Hardin        Committee Member
Ioannis Tsamardinos   Committee Member
Steven Brown          Committee Member
Keywords
  • information retrieval
Date of Defense 2007-07-31
Availability unrestricted
In this dissertation, I explore the applicability of text categorization machine learning methods to identifying clinically pertinent, evidence-based articles in the literature and web pages on the Internet. In the first series of experiments, I found that text categorization techniques identify high-quality internal medicine articles in the content categories of prognosis, diagnosis, etiology, and treatment better than the Clinical Query Filters of PubMed. In a second set of experiments, I established that the text categorization models generalize both to time periods outside the training set and to areas outside internal medicine, including pediatrics, oncology, and surgery. My third set of experiments revealed that text categorization models built for a specific purpose identify articles better than both bibliometric measures (number of citations and impact factor) and web-based measures (Google PageRank, Yahoo WebRanks, and total web page hit count). In the fourth set of experiments, I built models with high discriminatory power for purpose, format, and additional content categories from a labeled gold standard. Furthermore, we built a system called EBMSearch that applies these models to all of MEDLINE. Finally, I extended these methods to the web and built the first validated models that identify websites making false cancer treatment claims; these models outperform previous unvalidated models and PageRank by 30% in area under the receiver operating characteristic curve. In conclusion, machine learning-based text categorization methods provide a powerful framework for identifying clinically applicable articles in the medical literature and on the Internet.
Filename yaphinyanaphongs_phd_dissertation.pdf
Size 1.33 Mb
Approximate Download Time (Hours:Minutes:Seconds)
  • 28.8 Modem 00:06:10
  • 56K Modem 00:03:10
  • ISDN (64 Kb) 00:02:46
  • ISDN (128 Kb) 00:01:23
  • Higher-speed Access 00:00:07
