Traditional methods for collecting data in support of clinical research include prospectively collected surveys, retrospective analyses of existing medical records, and a combination of the two. Yet these approaches tend to focus on a medical-centric worldview and, as a result, provide only a partial view of a patient's life. As distributed systems, cloud services, and mobile devices grow in sophistication and market penetration, individuals generate large amounts of personal data every day, particularly in online environments, where they disclose a range of aspects of their lives, including information related to their health. This situation provides an opportunity for healthcare providers and biomedical researchers to learn about patients in their own voice and beyond traditional data sources. However, collecting, processing, and acting upon self-authored natural language text poses challenges for automatically extracting health-related information, including, but not limited to, ambiguity in communication, noisy data, long exposition that contains many different types of health information, and high dimensionality that hinders predictive model interpretability.
This dissertation applies a data-driven approach to investigate how self-authored information in three different online environments can be relied upon to learn about health-related behaviors. Specifically, this dissertation investigates three foundational questions. First, how do individuals disclose health status through a general social media platform (e.g., Twitter)? Second, can patients' long-term treatment adherence be inferred through online health communities (e.g., forums on breastcancer.org)? Third, how can we learn patients' needs based on the messages they send to healthcare providers over a patient portal that is connected to an electronic medical record (EMR) system that is ingrained in the everyday functions of a large academic medical center? To process consumer-authored natural language text, this dissertation illustrates how to combine text mining, machine learning, and statistical inference to 1) extract health-related events (e.g., adherence status), 2) create interpretable factors (e.g., semantic groups), 3) build efficient predictive models (e.g., predicting medication interruption events), and 4) learn meaningful health-related associations (e.g., semantics and health status disclosure, emotions and the portrayal of adherence status, topics and medication adherence). It is shown that many factors communicated through self-authored text (e.g., emotions, personalities, and other factors that are not captured in structured EMRs) can be applied to explain an individual's health-related behavior. This research provides evidence that self-generated information can be applied to supplement traditional data sources to facilitate healthcare research.