Abstract
Microbial community metagenomes and individual microbial genomes are becoming increasingly accessible by means of high-throughput sequencing. Assessing organismal membership within a community is typically performed using one or a few taxonomic marker genes such as the 16S rDNA, and these same genes are also employed to reconstruct molecular phylogenies. There is thus a growing need to bioinformatically catalog strongly conserved core genes that can serve as effective taxonomic markers, to assess the agreement among phylogenies generated from different core gene, and to characterize the biological functions enriched within core genes and thus conserved throughout large microbial clades. We present a method to recursively identify core genes (i.e. genes ubiquitous within a microbial clade) in high-throughput from a large number of complete input genomes. We analyzed over 1,100 genomes to produce core gene sets spanning 2,861 bacterial and archaeal clades, ranging in size from one to >2,000 genes in inverse correlation with the α-diversity (total phylogenetic branch length) spanned by each clade. These cores are enriched as expected for housekeeping functions including translation, transcription, and replication, in addition to significant representations of regulatory, chaperone, and conserved uncharacterized proteins. In agreement with previous manually curated core gene sets, phylogenies constructed from one or more of these core genes agree with those built using 16S rDNA sequence similarity, suggesting that systematic core gene selection can be used to optimize both comparative genomics and determination of microbial community structure. Finally, we examine functional phylogenies constructed by clustering genomes by the presence or absence of orthologous gene families and show that they provide an informative complement to standard sequence-based molecular phylogenies.
Keywords
MeSH Terms
Affiliated Institutions
Related Publications
KEGG for taxonomy-based analysis of pathways and genomes
Abstract KEGG (https://www.kegg.jp) is a manually curated database resource integrating various biological objects categorized into systems, genomic, chemical and health informa...
eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses
eggNOG is a public database of orthology relationships, gene evolutionary histories and functional annotations. Here, we present version 5.0, featuring a major update of the und...
NCBI GEO: archive for functional genomics data sets--10 years on
A decade ago, the Gene Expression Omnibus (GEO) database was established at the National Center for Biotechnology Information (NCBI). The original objective of GEO was to serve ...
Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB
ABSTRACT A 16S rRNA gene database ( http://greengenes.lbl.gov ) addresses limitations of public repositories by providing chimera screening, standard alignment, and taxonomic cl...
The 16s/23s ribosomal spacer region as a target for DNA probes to identify eubacteria.
Variable regions of the 16s ribosomal RNA have been frequently used as the target for DNA probes to identify microorganisms. In some situations, however, there is very little se...
Publication Info
- Year
- 2011
- Type
- article
- Volume
- 6
- Issue
- 9
- Pages
- e24704-e24704
- Citations
- 115
- Access
- Closed
External Links
Social Impact
Social media, news, blog, policy document mentions
Citation Metrics
Cite This
Identifiers
- DOI
- 10.1371/journal.pone.0024704
- PMID
- 21931822
- PMCID
- PMC3171473