What are Complete Proteome Sets?

Last modified April 23, 2007

Not all organisms whose genomes have been sequenced are included in the complete proteome sets of UniProtKB. We consider a genome "complete" when it has been fully closed and good gene prediction models are available for it.

For bacterial and archaeal genomes, whole-genome shotgun (WGS) and draft sequences are not included in the UniProtKB complete proteome sets and are not considered for manual annotation.

Initially, some completely sequenced strains were merged into one proteome set. This was done mainly at a time when few strains of the same species had been completely sequenced: when similarity analyses showed that the proteins of the sequenced strains shared more than 90% similarity, their proteomes were merged under one organism code. Identical strains were merged in the same way.
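The merging rule described above can be illustrated with a minimal sketch. It is not the procedure UniProtKB actually used: the strain labels, the `merge_strains` function, the precomputed pairwise similarity values, and the choice to merge transitively (if A matches B and B matches C, all three are grouped) are assumptions made for illustration. Only the 90% threshold comes from the text.

```python
from itertools import combinations

MERGE_THRESHOLD = 0.90  # "more than 90% similarity", taken from the text


def merge_strains(strains, similarity):
    """Group strains whose pairwise similarity exceeds the threshold.

    strains: list of strain labels (hypothetical names).
    similarity: dict mapping frozenset({a, b}) to a fraction in [0, 1].
        How this value is computed is an assumption left to the caller.
    Returns a sorted list of sorted groups; in this sketch each group
    stands for the strains that would share one organism code.
    """
    # Union-find: every strain starts as its own group.
    parent = {s: s for s in strains}

    def find(s):
        # Follow parent links to the group representative.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(strains, 2):
        # Strictly greater than the threshold; missing pairs count as 0.
        if similarity.get(frozenset((a, b)), 0.0) > MERGE_THRESHOLD:
            parent[find(a)] = find(b)

    groups = {}
    for s in strains:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    example = {
        frozenset(("strain_A", "strain_B")): 0.95,
        frozenset(("strain_A", "strain_C")): 0.80,
        frozenset(("strain_B", "strain_C")): 0.85,
    }
    print(merge_strains(["strain_A", "strain_B", "strain_C"], example))
```

With the example values, strain_A and strain_B fall into one group and strain_C stays on its own. A pair at exactly 90% is not merged, because the text says "more than 90%".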

At present, the tendency is to assign a separate organism code (also known as a taxonomic identifier) for each completely sequenced strain to facilitate the download of specific sets and to improve similarity searches.

For eukaryotic genomes, several criteria determine whether a proteome is considered "complete". Some sequenced genomes have submission or annotation problems that prevent the production of a non-redundant protein set; others have problems with their gene model predictions. We provide a link to download Integr8 proteome sets for higher eukaryotes that are not yet considered complete in UniProtKB. Information on how those sets are produced is available on the Integr8 help pages.