Abstract:
The number of newly discovered proteins has increased drastically during the last two
decades. Curators can no longer annotate them all manually, so there is
a great need to automate this process. Rule generation for protein annotation in databases
such as UniProt, PROSITE, and InterPro has been tackled by many researchers and
has proven to be a reliable method for accurately annotating
specific fields of protein entries (for example, the keywords field). Our study of the
organism "Newcastle disease virus" showed that data coming from Swiss-Prot was
accurate (checked by human experts), while data coming from TrEMBL was unreliable
and incomplete. We propose to automate the annotation of the keywords field for proteins
related to Newcastle disease virus in both the Swiss-Prot and TrEMBL databases.
The generated rules were applied to most of the proteins from the
Swiss-Prot database, and the results were promising: 95% of the
proteins were annotated with exactly the correct keyword(s). For the TrEMBL database,
our rules annotated proteins that were originally unannotated and improved or
completed the annotation of proteins whose annotation was incomplete. These
results were in turn checked against the data in Swiss-Prot and were found to be
between 90% and 100% correct.