Title page for ETD etd-07142011-121540


Type of Document Dissertation
Author Bell, Lindsey Renee
URN etd-07142011-121540
Title A statistical approach for information extraction of biological relationships
Degree Doctor of Philosophy
Department Statistics, Department of
Advisory Committee
Advisor Name Title
Jinfeng Zhang Committee Co-Chair
Xufeng Niu Committee Co-Chair
Fred Huffer Committee Member
Gary Tyson University Representative
Keywords
  • protein protein interaction
  • information extraction
Date of Defense 2011-06-09
Availability unrestricted
Abstract
Vast amounts of biomedical information are stored in scientific literature, easily accessed through publicly available databases. Relationships among biomedical terms constitute a major part of our biological knowledge. Acquiring such structured information from unstructured literature can be done through human annotation, but is time- and resource-consuming. As this content continues to grow rapidly, the popularity and importance of text mining for obtaining information from unstructured text become increasingly evident. Text mining has four major components. First, relevant articles are identified through information retrieval (IR); next, important concepts and terms are flagged using entity recognition (ER); and then relationships between these entities are extracted from the literature in a process called information extraction (IE). Finally, text mining takes these elements and seeks to synthesize new information from the literature.

Our goal is information extraction from unstructured literature concerning biological entities. To do this, we use the structure of triplets, where each triplet contains two biological entities and one interaction word. The biological entities may include terms such as protein names, disease names, genes, and small molecules. Interaction words describe the relationship between the biological terms. Under this framework we aim to combine the strengths of three classifiers in an ensemble approach. The three classifiers we consider are Bayesian Networks, Support Vector Machines, and a mixture of logistic models defined by interaction word.
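
As an illustration of the triplet structure described above, the following minimal sketch enumerates candidate triplets from a single tokenized sentence. It is not taken from the dissertation: the `Triplet` class, the `candidate_triplets` function, the toy sentence, and the small dictionaries of entity names and interaction words are all hypothetical, and real entity recognition is assumed to have happened already.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Triplet:
    """Two biological entities plus one interaction word (hypothetical container)."""
    entity_a: str
    interaction_word: str
    entity_b: str


def candidate_triplets(tokens, entities, interaction_words):
    """Pair every two recognized entities with every interaction word in the sentence."""
    found_entities = [t for t in tokens if t in entities]
    found_words = [t for t in tokens if t in interaction_words]
    return [
        Triplet(a, word, b)
        for a, b in combinations(found_entities, 2)
        for word in found_words
    ]


# Toy sentence with placeholder protein names (assumed, not real data).
tokens = "ProtA binds ProtB and inhibits ProtC".split()
triplets = candidate_triplets(
    tokens,
    entities={"ProtA", "ProtB", "ProtC"},
    interaction_words={"binds", "inhibits"},
)
```

Each candidate produced this way would then be passed to a classifier that decides whether the triplet expresses a true relationship; most candidates, such as ProtA / inhibits / ProtB here, do not.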

The three classifiers and the ensemble approach are evaluated on three benchmark corpora and one corpus that is introduced in this study. The evaluation includes cross-validation and cross-corpus validation to replicate an application scenario. The three classifiers are unique, and we find that the performance of individual classifiers varies depending on the corpus. Therefore, an ensemble of classifiers removes the need to choose one classifier and provides optimal performance.
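
The abstract does not state how the three classifiers are combined. One simple possibility, shown here only as an assumption, is a majority vote over their binary decisions for each candidate triplet. The function name `majority_vote` and the prediction lists below are hypothetical and do not come from the dissertation.

```python
def majority_vote(votes):
    """Return 1 if more than half of the 0/1 votes are 1, else 0."""
    return int(sum(votes) * 2 > len(votes))


# Hypothetical decisions for four candidate triplets (1 = true relationship).
bayes_net = [1, 0, 1, 0]
svm = [1, 1, 0, 0]
logistic_mixture = [0, 1, 1, 0]

# Combine the three classifiers triplet by triplet.
ensemble = [
    majority_vote(votes)
    for votes in zip(bayes_net, svm, logistic_mixture)
]
```

With three voters a tie cannot occur, so no tie-breaking rule is needed in this sketch.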

Files
  Approximate Download Time (Hours:Minutes:Seconds)

  Filename                       Size      28.8 Modem   56K Modem   ISDN (64 Kb)   ISDN (128 Kb)   Higher-speed Access
  Bell_L_Dissertation_2011.pdf   5.51 Mb   00:25:29     00:13:06    00:11:28       00:05:44        00:00:29
