Automatic Identification of Protein Characterization Articles in support of Database Curation
machine learning , database curation , bioinformatics , classification , biomedical text mining
Experimentally determining the biological function of a protein is a process known as protein characterization. Establishing the role a specific protein plays is a vital step toward fully understanding the biochemical processes that drive life in all its forms. In order for researchers to efficiently locate and benefit from the results of protein characterization experiments, the relevant information is compiled into public databases. To populate such databases, curators, who are experts in the biomedical domain, must search the literature to obtain the relevant information, as the experiment results are typically published in scientific journals. The database curators identify relevant journal articles, read them, and then extract the required information into the database. In recent years the rate of biomedical research has greatly increased, and database curators are unable to keep pace with the number of articles being published. Consequently, maintaining an up-to-date database of characterized proteins, let alone populating a new database, has become a daunting task. In this thesis, we report our work to reduce the effort required from database curators in order to create and maintain a database of characterized proteins. We describe a system we have designed for automatically identifying relevant articles that discuss the results of protein characterization experiments. Classifiers are trained and tested using a large dataset of abstracts, which we collected from articles referenced in public databases, as well as small datasets of manually labeled abstracts. We evaluate both a standard and a modified naïve Bayes classifier and examine several different feature sets for representing articles. Our findings indicate that the resulting classifier performs well enough to be considered useful by the curators of a characterized protein database.