Abstract

The accelerating growth in the number of protein sequences taxes both the computational and manual resources needed to analyze them. One approach to dealing with this problem is to minimize the number of proteins subjected to such analysis in a way that minimizes loss of information. To this end we have developed a set of Representative Proteomes (RPs), each selected from a Representative Proteome Group (RPG) of similar proteomes, where similarity is calculated from co-membership in UniRef50 clusters. A Representative Proteome is the proteome that best represents all the proteomes in its group in terms of the majority of the sequence space and information. RPs at co-membership thresholds (CMTs) of 75%, 55%, 35% and 15% are provided to allow users to decrease or increase the granularity of the sequence space based on their requirements. We find that a CMT of 55% (RP55) most closely follows standard taxonomic classifications. Further analysis of this set reveals that sequence space is reduced by more than 80% relative to UniProtKB, while retaining both sequence diversity (over 95% of InterPro domains) and annotation information (93% of experimentally characterized proteins). All sets can be browsed and are available for sequence similarity searches and download at http://www.proteininformationresource.org/rps, while the set of 637 RPs determined using a 55% CMT is also available for text searches. Potential applications include sequence similarity searches, protein classification, and targeted protein annotation and characterization.
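
The grouping idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the co-membership formula or the grouping procedure, so everything here is an assumption made for illustration: co-membership is taken as the percentage of the smaller proteome's UniRef50 clusters that also occur in the other proteome, and proteomes are merged by simple single-linkage grouping at a chosen CMT. The function names and the toy cluster IDs are hypothetical and are not the authors' method.

```python
from typing import Dict, List, Set


def co_membership(clusters_a: Set[str], clusters_b: Set[str]) -> float:
    """Percent of the smaller proteome's clusters shared with the other.

    ASSUMPTION: this definition is illustrative only; the abstract does
    not state how co-membership is actually computed.
    """
    if not clusters_a or not clusters_b:
        return 0.0
    shared = len(clusters_a & clusters_b)
    return 100.0 * shared / min(len(clusters_a), len(clusters_b))


def group_proteomes(proteomes: Dict[str, Set[str]], cmt: float) -> List[List[str]]:
    """Single-linkage grouping of proteomes whose co-membership >= cmt.

    ASSUMPTION: single linkage is a stand-in for whatever grouping
    procedure the paper uses.
    """
    names = sorted(proteomes)
    parent = {name: name for name in names}

    def find(name: str) -> str:
        # Follow parent links to the group root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if co_membership(proteomes[a], proteomes[b]) >= cmt:
                parent[find(b)] = find(a)

    groups: Dict[str, List[str]] = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# Toy data: each proteome is the set of (hypothetical) cluster IDs it touches.
toy = {
    "A": {"c1", "c2", "c3", "c4"},
    "B": {"c1", "c2", "c3", "c5"},
    "C": {"c6", "c7", "c8", "c9"},
}
groups_55 = group_proteomes(toy, 55.0)  # A and B share 3 of 4 clusters
groups_80 = group_proteomes(toy, 80.0)  # a stricter threshold splits them
```

In this toy case, raising the threshold from 55 to 80 splits the A/B group, which mirrors how a higher CMT yields a finer-grained set of representative proteomes.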

Keywords

Proteome; UniProt; Computational biology; Sequence (biology); Annotation; Set (abstract data type); Sequence database; Protein sequencing; Scalability; Computer science; Biology; Similarity (geometry); Human proteome project; Bioinformatics; Sequence alignment; Sequence analysis; Data mining; Peptide sequence; Genetics; Proteomics; Artificial intelligence; Database

MeSH Terms

Proteome; Sequence Analysis, Protein

Publication Info

Year: 2011
Type: article
Volume: 6
Issue: 4
Pages: e18910
Citations: 115
Access: Closed

Citation Metrics

OpenAlex: 115
Influential: 5
CrossRef: 95

Cite This

Chuming Chen, Darren A. Natale, Robert Finn et al. (2011). Representative Proteomes: A Stable, Scalable and Unbiased Proteome Set for Sequence Analysis and Functional Annotation. PLoS ONE, 6(4), e18910. https://doi.org/10.1371/journal.pone.0018910

Identifiers

DOI: 10.1371/journal.pone.0018910
PMID: 21556138
PMCID: PMC3083393
