Queen's University - Utility Bar

QSpace at Queen's University >
Graduate Theses, Dissertations and Projects >
Queen's Graduate Theses and Dissertations >

Please use this identifier to cite or link to this item: http://hdl.handle.net/1974/5415

Title: Automatic Identification of Protein Characterization Articles in support of Database Curation
Authors: Denroche, Robert

Files in This Item:

File Description SizeFormat
Denroche_Robert_E_201001_MSc.pdf1.12 MBAdobe PDFView/Open
Keywords: machine learning
database curation
biomedical text mining
Issue Date: 2010
Series/Report no.: Canadian theses
Abstract: Experimentally determining the biological function of a protein is a process known as protein characterization. Establishing the role a specific protein plays is a vital step toward fully understanding the biochemical processes that drive life in all its forms. In order for researchers to efficiently locate and benefit from the results of protein characterization experiments, the relevant information is compiled into public databases. To populate such databases, curators, who are experts in the biomedical domain, must search the literature to obtain the relevant information, as the experiment results are typically published in scientific journals. The database curators identify relevant journal articles, read them, and then extract the required information into the database. In recent years the rate of biomedical research has greatly increased, and database curators are unable to keep pace with the number of articles being published. Consequently, maintaining an up-to-date database of characterized proteins, let alone populating a new database, has become a daunting task. In this thesis, we report our work to reduce the effort required from database curators in order to create and maintain a database of characterized proteins. We describe a system we have designed for automatically identifying relevant articles that discuss the results of protein characterization experiments. Classifiers are trained and tested using a large dataset of abstracts, which we collected from articles referenced in public databases, as well as small datasets of manually labeled abstracts. We evaluate both a standard and a modified na├»ve Bayes classifier and examine several different feature sets for representing articles. Our findings indicate that the resulting classifier performs well enough to be considered useful by the curators of a characterized protein database.
Description: Thesis (Master, Computing) -- Queen's University, 2010-01-28 18:45:17.249
URI: http://hdl.handle.net/1974/5415
Appears in Collections:Queen's Graduate Theses and Dissertations
School of Computing Graduate Theses

Items in QSpace are protected by copyright, with all rights reserved, unless otherwise indicated.


  DSpace Software Copyright © 2002-2008  The DSpace Foundation - TOP