Dr. Kathleen Durant
My career centers on data: storing data, efficiently accessing data, analyzing data, and generating knowledge from data. Early in my career, I focused on ways to efficiently access and store data on secondary devices. I have built many different types of database systems: hash-based, B+ tree, hierarchical, and relational systems, as well as a SIMD relational database system. For all of these projects I was responsible for designing, developing, testing, documenting, and supporting the database system and for ensuring its quality. The data domains associated with these databases varied from the results of automatic testing equipment measuring the quality of printed circuit boards to point-of-sale data for several large credit card companies. Being exposed to different data domains helped pique my interest in the knowledge that can be gleaned from data.
Between 1990 and 2000, the disk capacity of computer systems increased, on average, 60% per year; bandwidth increased 40% per year, and microprocessor performance improved 35% per year. These increases allowed computer users to harvest information from their data in ways that were not possible on earlier computer architectures. This technological change allowed me to shift my research area from data storage to data analysis techniques.
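To give a sense of scale for these annual rates, the short sketch below compounds each one over the decade. It assumes, for illustration only, that each rate held constant and compounded once per year for ten years; the function name `cumulative_growth` is hypothetical.

```python
def cumulative_growth(annual_rate, years):
    """Return the total growth factor after compounding annual_rate for the given years."""
    return (1.0 + annual_rate) ** years


# Annual rates quoted in the text; constant yearly compounding is an assumption.
rates = {"disk capacity": 0.60, "bandwidth": 0.40, "microprocessor": 0.35}

decade = {name: cumulative_growth(rate, 10) for name, rate in rates.items()}

for name, factor in decade.items():
    print(f"{name}: roughly {factor:.0f}x over 10 years")
```

Under that assumption, disk capacity grows by roughly two orders of magnitude over the decade, several times faster than either bandwidth or processor performance.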
My initial data analysis techniques were simple statistical methods applied to tabular data found within databases. My techniques have since evolved to well-known machine learning methods such as Naïve Bayes, Support Vector Machines, and Decision Tables, as well as my own ensemble techniques that use a time metric as a weight. The data representation has also evolved from tables to vector space models for text.
My current data domain is online medical forums (ODGs). To study the nature of ODGs, we consider the topics discussed in them. We seek to identify the prominent topics discussed within the ODGs as well as the specific topics associated with user-specified classifications. Some topics may be common themes shared across all user classifications within an ODG, while others may be specific to a single user classification. Typical user classifications found at ODGs are patient, caregiver, doctor, or member.
We also consider the effect time has on the nature of ODGs. Time can affect the active users within an ODG, the amount of activity within an ODG, the topics being discussed within an ODG, and the sentiment toward specific topics being discussed in the ODG. One question we seek to answer is: are there prevalent questions that continue to be asked in an ODG? Can we identify these patterns? If so, can we automatically identify these topics and provide a method to merge the threads associated with the same topic? For many of the ODGs, there are periods of dormancy followed by periods of intense activity. What are the factors defining these different periods? Are these factors represented in the text or in some metric that can be derived from the text? Can we automatically determine the cycles of an ODG?
We also study and identify the social network among the ODG members. The initial research is to visualize the communication patterns among the members of the community. These communication patterns can be considered a network in which individuals are the nodes and communications between people are the arcs, or edges, connecting the nodes. Understanding the interconnections among the nodes (the topology of the network) may provide insight into the nature of these ODGs. More specific social network questions can also be investigated.
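As a minimal sketch of this network representation, the example below builds a weighted, directed graph from a list of reply pairs. The input format (an author paired with the member being replied to), the member names, and the helper names `build_network` and `in_degree` are all assumptions made for illustration, not a description of the actual research pipeline.

```python
from collections import defaultdict


def build_network(replies):
    """Build a weighted directed graph from (author, replied_to) pairs.

    Nodes are forum members; an edge author -> replied_to carries the
    number of times that author replied to that member.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for author, replied_to in replies:
        if author == replied_to:
            continue  # ignore self-replies
        graph[author][replied_to] += 1
        graph[replied_to]  # touching the key registers the recipient as a node
    return {node: dict(neighbors) for node, neighbors in graph.items()}


def in_degree(graph):
    """Count how many distinct members have replied to each member."""
    counts = {node: 0 for node in graph}
    for neighbors in graph.values():
        for target in neighbors:
            counts[target] += 1
    return counts


# Hypothetical reply data: (author, member being replied to).
replies = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("bob", "alice"),
    ("alice", "bob"),
]

network = build_network(replies)
popularity = in_degree(network)
```

Simple measures such as in-degree give a first view of the topology, for example which members attract replies from many different participants.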