Type of Document: Dissertation
Author: Bell, Lindsey Renee
URN: etd-07142011-121540
Title: A statistical approach for information extraction of biological relationships
Degree: Doctor of Philosophy
Department: Statistics, Department of
Advisory Committee (Advisor Name, Title)
- Jinfeng Zhang, Committee Co-Chair
- Xufeng Niu, Committee Co-Chair
- Fred Huffer, Committee Member
- Gary Tyson, University Representative
Keywords
- protein protein interaction
- information extraction
Date of Defense: 2011-06-09
Availability: unrestricted
Abstract
Vast amounts of biomedical information are stored in scientific literature, easily accessed through publicly available databases. Relationships among biomedical terms constitute
a major part of our biological knowledge. Acquiring such structured information from
unstructured literature can be done through human annotation, but is time and resource
consuming. As this content continues to grow rapidly, the popularity and importance of text
mining for obtaining information from unstructured text become increasingly evident. Text
mining has four major components. First, relevant articles are identified through information
retrieval (IR); next, important concepts and terms are flagged using entity recognition (ER);
and then relationships between these entities are extracted from the literature in a process
called information extraction (IE). Finally, text mining takes these elements and seeks to
synthesize new information from the literature.
Our goal is information extraction from unstructured literature concerning biological
entities. To do this, we use the structure of triplets where each triplet contains two biological entities and one interaction word. The biological entities may include terms such as
protein names, disease names, genes, and small molecules. Interaction words describe the
relationship between the biological terms. Under this framework we aim to combine the
strengths of three classifiers in an ensemble approach. The three classifiers we consider are
Bayesian Networks, Support Vector Machines, and a mixture of logistic models defined by
interaction word.
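
As an illustration only, the triplet structure and the idea of combining several classifiers can be sketched as follows. The abstract does not say how the ensemble combines its members, so the majority vote here is an assumption, and the names `Triplet`, `majority_vote`, and the three stand-in classifiers are hypothetical placeholders rather than the dissertation's Bayesian Network, Support Vector Machine, or logistic mixture models.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Triplet:
    """Two biological entities plus the interaction word relating them."""
    entity_a: str
    entity_b: str
    interaction_word: str


# Assumption: a classifier is any function that labels a triplet as a
# true relationship (True) or not (False).
Classifier = Callable[[Triplet], bool]


def majority_vote(classifiers: Sequence[Classifier], triplet: Triplet) -> bool:
    """Combine classifier outputs by simple majority (an assumed rule)."""
    votes = Counter(clf(triplet) for clf in classifiers)
    return votes[True] > votes[False]


# Hypothetical stand-ins for three trained classifiers.
def always_true(t: Triplet) -> bool:
    return True


def knows_word(t: Triplet) -> bool:
    return t.interaction_word in {"binds", "activates"}


def distinct_entities(t: Triplet) -> bool:
    return t.entity_a != t.entity_b


ENSEMBLE = [always_true, knows_word, distinct_entities]
```

Calling `majority_vote(ENSEMBLE, Triplet("BRCA1", "BARD1", "binds"))` asks each stand-in for a label and returns the label that at least two of the three agree on.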
The three classifiers and ensemble approach are evaluated on three benchmark corpora
and one corpus that is introduced in this study. The evaluation includes cross validation
and cross-corpus validation to replicate an application scenario. The three classifiers are
unique and we find that performance of individual classifiers varies depending on the corpus.
Therefore, an ensemble of classifiers removes the need to choose one classifier and provides
optimal performance.
Files
Filename: Bell_L_Dissertation_2011.pdf
Size: 5.51 Mb
Approximate Download Time (Hours:Minutes:Seconds)
- 28.8 Modem: 00:25:29
- 56K Modem: 00:13:06
- ISDN (64 Kb): 00:11:28
- ISDN (128 Kb): 00:05:44
- Higher-speed Access: 00:00:29