Automatic Identification of Protein Characterization Articles in support of Database Curation

dc.contributor.author: Denroche, Robert [en]
dc.contributor.supervisor: Shatkay, Hagit [en]
dc.date: 2010-01-28 18:45:17.249
…'s University at Kingston [en]
dc.description: Thesis (Master, Computing) -- Queen's University, 2010-01-28 18:45:17.249 [en]
dc.description.abstract: Experimentally determining the biological function of a protein is a process known as protein characterization. Establishing the role a specific protein plays is a vital step toward fully understanding the biochemical processes that drive life in all its forms. In order for researchers to efficiently locate and benefit from the results of protein characterization experiments, the relevant information is compiled into public databases. To populate such databases, curators, who are experts in the biomedical domain, must search the literature to obtain the relevant information, as the experiment results are typically published in scientific journals. The database curators identify relevant journal articles, read them, and then extract the required information into the database. In recent years the rate of biomedical research has greatly increased, and database curators are unable to keep pace with the number of articles being published. Consequently, maintaining an up-to-date database of characterized proteins, let alone populating a new database, has become a daunting task. In this thesis, we report our work to reduce the effort required from database curators in order to create and maintain a database of characterized proteins. We describe a system we have designed for automatically identifying relevant articles that discuss the results of protein characterization experiments. Classifiers are trained and tested using a large dataset of abstracts, which we collected from articles referenced in public databases, as well as small datasets of manually labeled abstracts. We evaluate both a standard and a modified naïve Bayes classifier and examine several different feature sets for representing articles. Our findings indicate that the resulting classifier performs well enough to be considered useful by the curators of a characterized protein database. [en]
dc.format.extent: 1143268 bytes
dc.relation.ispartofseries: Canadian theses [en]
dc.rights: This publication is made available by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner. [en]
dc.subject: machine learning [en]
dc.subject: database curation [en]
dc.subject: biomedical text mining [en]
dc.title: Automatic Identification of Protein Characterization Articles in support of Database Curation [en]
Original bundle
1.09 MB
Adobe Portable Document Format