Abstract

Motivation: The study of sequence space, and the deciphering of the structure of protein families and subfamilies, have until now been required for work in comparative genomics and for the prediction of protein function. With the emergence of structural proteomics projects, it is becoming increasingly important to be able to select protein targets for structural studies that appropriately cover the space of protein sequences, functions and genomic distribution. These problems motivate the development of methods for clustering protein sequences and building families of potentially orthologous sequences, such as those proposed here.

Results: First, we developed a clustering strategy (the Ncut algorithm) capable of forming groups of related sequences by assessing their pairwise relationships. The results presented for the ras super-family of proteins are similar to those produced by other clustering methods, but are obtained without the need to cluster the full sequence space. The Ncut clusters are then used as the input to a process that reconstructs groups of closely related sequences with equilibrated genomic composition. The results of applying this technique to the data set used in the construction of the COG database are very similar to those derived by the human experts responsible for that database.

Availability: The analysis of different systems, including the 21-genome data set equivalent to COG, is available at http://www.pdg.cnb.uam.es/GenoClustering.html

Contact: valencia@cnb.uam.es
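The abstract does not spell out the criterion behind the Ncut step, so the following is only a minimal sketch, assuming the standard normalized-cut score for a two-way split of a weighted similarity graph: the weight crossing the split, divided by each group's total connection weight. The names `ncut_value`, `best_bipartition` and `SIM`, the similarity values, and the brute-force search are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
from itertools import combinations


def ncut_value(sim, group_a, group_b):
    """Normalized-cut score of a two-way partition (lower is better).

    sim is a symmetric matrix of pairwise similarities (assumed form).
    """
    nodes = group_a | group_b
    # Total similarity crossing the partition.
    cut = sum(sim[i][j] for i in group_a for j in group_b)
    # Total similarity from each group to every node in the graph.
    assoc_a = sum(sim[i][j] for i in group_a for j in nodes)
    assoc_b = sum(sim[i][j] for i in group_b for j in nodes)
    if assoc_a == 0 or assoc_b == 0:
        return float("inf")
    return cut / assoc_a + cut / assoc_b


def best_bipartition(sim):
    """Exhaustively search all two-way splits; only feasible for toy inputs."""
    n = len(sim)
    nodes = set(range(n))
    best = (float("inf"), None, None)
    for size in range(1, n // 2 + 1):
        for subset in combinations(range(n), size):
            a = set(subset)
            b = nodes - a
            score = ncut_value(sim, a, b)
            if score < best[0]:
                best = (score, a, b)
    return best


# Invented similarities for five sequences: 0-2 are alike, 3-4 are alike.
SIM = [
    [0.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 0.0, 0.7, 0.0, 0.1],
    [0.8, 0.7, 0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1, 0.0, 0.9],
    [0.0, 0.1, 0.0, 0.9, 0.0],
]

score, group_a, group_b = best_bipartition(SIM)
print(sorted(group_a), sorted(group_b), round(score, 3))
```

Exhaustive search is used here only to keep the sketch short and easy to check by hand; practical normalized-cut implementations usually rely on a spectral approximation and apply the split recursively.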

Keywords

Cluster analysis; Sequence (biology); Structural genomics; Computational biology; Protein family; Pairwise comparison; Identification (biology); Computer science; Cog; Biology; Genomics; Sequence alignment; Protein sequencing; Genome; Genetics; Peptide sequence; Protein structure; Artificial intelligence; Gene

Publication Info

Year: 2002
Type: article
Volume: 18
Issue: 7
Pages: 908-921
Citations: 61
Access: Closed

Citation Metrics

61 citations (OpenAlex)

Cite This

Federico Abascal, Alfonso Valencia (2002). Clustering of proximal sequence space for the identification of protein families. Bioinformatics, 18(7), 908-921. https://doi.org/10.1093/bioinformatics/18.7.908

Identifiers

DOI: 10.1093/bioinformatics/18.7.908