Institut Pasteur | DBC | CRBIP | GIPhy


PhyloM: bacteria






Description

PhyloM: bacteria is a selection of 74 universal single-copy genes (USCG) that are well suited as markers for inferring phylogenetic trees of bacterial taxa. These markers are recommended for phylogenetic reconstruction because they have been shown to correspond to persistent genes within bacterial phyla, i.e. genes with a close to universal distribution. The collection is derived from a meta-analysis of ten previously published USCG sets, which are also available from this webpage.

For each phylogenetic marker, a standard gene name is provided, together with three different files:
 • multiple amino acid sequence alignments (MSA),
 • hidden Markov models (HMM),
 • position-specific scoring matrices (PSSM).
These files were gathered from reference databanks (when available):
 • COG (Tatusov et al. 1997, 2003; Galperin et al. 2015, 2021),
 • Pfam (Sonnhammer et al. 1997, 1998; Finn et al. 2016),
 • TIGRFAMs (Haft et al. 2001, 2003, 2013).
For each marker, selected MSA, HMM and PSSM files (among the COG, Pfam and TIGRFAMs ones) are also provided (labelled PMB).

[last update: 24.10.31]


Marker sets

For each compiled marker set, this section provides the list of the corresponding gene names, together with three selected (PMB) data files (packed into tar.gz archives):
 • multiple amino acid sequence alignments (MSA),
 • hidden Markov models (HMM),
 • position-specific scoring matrices (PSSM).

Cic31 (Ciccarelli et al. 2006)

This small set of 31 genes was elaborated to infer a global phylogenetic tree of archaea, bacteria and eukaryotes. This dataset was also used by Sorek et al. (2007).

marker set no. loci gene names  MSA   HMM   PSSM 
Cic31 31

Cre40 (Creevey et al. 2011)

These 40 genes were originally compiled to study horizontal gene transfers within bacteria. The same marker set was also used by Sunagawa et al. (2013), and by Mende et al. (2013) in their tool specI.

marker set no. loci gene names  MSA   HMM   PSSM 
Cre40 40

Dup107 (Dupont et al. 2012)

Originally designed to assess the completeness of different metagenomes, these 107 phylogenetic markers were also used by Ankenbrand and Keller (2016) in their phylogenetic tree reconstruction tool bcgTree.

marker set no. loci gene names  MSA   HMM   PSSM 
Dup107 107


PhyEco (Wu et al. 2013)

These 114 phylogenetic markers comprise 40 genes spanning the archaeal and bacterial domains, together with 74 bacteria-specific ones. Before developing the PhyEco set, the same research group had described a smaller set of 31 USCG for the phylogenetic reconstruction tool AMPHORA (Wu and Eisen 2008; see also Wu and Scott 2012).

marker set no. loci gene names  MSA   HMM   PSSM 
PhyEco 114


Aln36 (Alneberg et al. 2014)

This limited set of 36 genes was designed for the metagenomic data binning tool CONCOCT. The same USCG set was used by Quince et al. (2017) for the alternative tool DESMAN.

marker set no. loci gene names  MSA   HMM   PSSM 
Aln36 36


Lan73 (Lan et al. 2016)

These 73 phylogenetic markers were gathered to quantify the level of overall relatedness between prokaryote genomes.

marker set no. loci gene names  MSA   HMM   PSSM 
Lan73 73


bac120 (Parks et al. 2017)

First compiled to classify genome assemblies derived from metagenomic data, this set of 120 markers, named bac120 by Parks et al. (2017), was next used to infer phylogenetic trees in order to assess and revise bacterial species classifications (Parks et al. 2018, 2020, 2022).

marker set no. loci gene names  MSA   HMM   PSSM 
bac120 120


Col61 (Coleman et al. 2021)

This USCG set was developed to infer a bacterial phylogenetic tree and discuss its putative rooting. Of note, Coleman et al. (2021) described 62 loci, but the published list contains a duplicated one (i.e. radA: K04485; see Supplementary Table S1); consequently, only 61 loci are compiled in the present marker set.

marker set no. loci gene names  MSA   HMM   PSSM 
Col61 61


UBCG2 (Kim et al. 2021)

This revision of a first USCG set (UBCG: 92 genes; Na et al. 2018) was computed using two other marker sets (i.e. Dup107 and bac120), leading to an updated list of 81 genes.

marker set no. loci gene names  MSA   HMM   PSSM 
UBCG2 81


Tia85 (Tian and Imanian 2023)

These 85 genes are derived from a reanalysis of the three marker sets Dup107, bac120 and UBCG2. Tian and Imanian (2023) also described a smaller subset of 20 genes (i.e. VBCG) that was not considered here.

marker set no. loci gene names  MSA   HMM   PSSM 
Tia85 85


PhyloM

The ten marker sets above correspond to a total of 194 putative USCG, each occurring in a varying number of these sets (see the zoomable upset plot below). This discrepancy is caused by the number and the diversity of the genomes considered in each analysis, as well as by the disparate criteria used to define a USCG in the different works. The PhyloM set contains the 74 genes that were assessed as a USCG in at least 50% of the previous studies.

marker set no. loci gene names  MSA   HMM   PSSM 
PhyloM 74
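
The 50% rule described above can be sketched with standard shell tools. This is only an illustration, not the procedure actually used to build PhyloM: it assumes one hypothetical plain-text file per marker set (one gene name per line, e.g. Cic31.txt), and the threshold is given as a minimum number of sets (5 out of 10 for the 50% rule).

```shell
#!/bin/sh
# select_majority MIN FILE...
# Prints the gene names that occur in at least MIN of the given list files.
# Each FILE is assumed (hypothetical layout) to hold one gene name per line.
select_majority() {
  min=$1; shift
  # sort -u per file, so that a gene duplicated within one set counts once
  for f in "$@"; do sort -u "$f"; done |
    sort | uniq -c |
    # uniq -c emits "COUNT NAME"; NF == 2 skips blank lines
    awk -v min="$min" 'NF == 2 && $1 >= min { print $2 }'
}
```

With ten such list files, `select_majority 5 Cic31.txt Cre40.txt Dup107.txt PhyEco.txt Aln36.txt Lan73.txt bac120.txt Col61.txt UBCG2.txt Tia85.txt` would print the genes shared by at least half of the sets, provided the gene names have first been standardized across sets.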
[Figure: zoomable upset plot of the 194 putative USCG across the ten marker sets]

Marker list

The following table itemizes the overall 194 phylogenetic markers.
Each marker is labelled by a common gene name. The corresponding coding sequences (CDS) from the Escherichia coli strain K12 substr. MG1655 genome (GenBank accession: NC_000913) are also indicated (if any).
For each gene, the corresponding COG, Pfam and TIGR identifiers are given (if any), together with the associated MSA, HMM and PSSM files. The column PMB lists a selection of recommended MSA, HMM and PSSM files for each gene.
Presence/absence of each phylogenetic marker is ticked in columns Cic31, Cre40, Dup107, PhyEco, Aln36, Lan73, bac120, Col61, UBCG2, Tia85 and PhyloM.

name E. coli CDS COG Pfam TIGRFAMs PMB Cic31 Cre40 Dup107 PhyEco Aln36 Lan73 bac120 Col61 UBCG2 Tia85 PhyloM
alaS NP_417177  COG0013 pfam01411 TIGR00344 PMB051
arfB NP_414733  COG1186 pfam00472 TIGR00020 PMB086
argS NP_416390  COG0018 pfam00750 TIGR00456 PMB075
aspS NP_416380  COG0173 pfam00152 TIGR00459 PMB106
atpD NP_418188  COG0055 pfam00006 TIGR01039 PMB107
atpG NP_418189  COG0224 pfam00231 TIGR01146 PMB135
cdsA NP_414717  COG0575 pfam01148 PMB136
cgtA NP_417650  COG0536 pfam01018 TIGR02729 PMB052
clpP NP_414971  COG0740 pfam00574 TIGR00493 PMB137
clpX NP_414972  COG1219 pfam07724 TIGR00382 PMB108
coaD NP_418091  COG0669 pfam01467 TIGR01510 PMB138
coaE NP_414645  COG0237 pfam01121 TIGR00152 PMB076
cysS NP_415059  COG0215 pfam01406 TIGR00435 PMB087
der NP_417006  COG1160 TIGR03594 PMB053
dnaA NP_418157  COG0593 pfam00308 TIGR00362 PMB077
dnaB NP_418476  COG0305 pfam03796 TIGR00665 PMB139
dnaE NP_414726  COG0587 pfam07733 TIGR00594 PMB140
dnaG NP_417538  COG0358 pfam08275 TIGR01391 PMB054
dnaK NP_414555  COG0443 pfam00012 TIGR02350 PMB109
dnaN NP_418156  COG0592 pfam02767 TIGR00663 PMB088
dnaX NP_415003  COG2812 pfam13177 TIGR02397 PMB089
dxr NP_414715  COG0743 pfam02670 TIGR00243 PMB141
efp NP_418571  COG0231 pfam09285 TIGR00038 PMB142
era NP_417061  COG1159 TIGR00436 PMB090
exoIX -  COG0258 pfam02739 PMB143
ffh NP_417101  COG0541 pfam00448 TIGR00959 PMB021
fmt NP_417746  COG0223 pfam00551 TIGR00460 PMB078
frr NP_414714  COG0233 pfam01765 TIGR00496 PMB041
ftsA NP_414636  COG0849 pfam14450 TIGR01174 PMB110
ftsY NP_417921  COG0552 pfam00448 TIGR00064 PMB034
ftsZ NP_414637  COG0206 pfam12327 TIGR00065 PMB144
fusA NP_417799  COG0480 TIGR00484 PMB111
gatA -  COG0154 pfam01425 TIGR00132 PMB145
gidA NP_418197  COG0445 pfam01134 TIGR00136 PMB146
glnS NP_416899  COG0008 pfam00749 TIGR00464 PMB147
glyA NP_417046  COG0112 pfam00464 PMB148
glyS NP_418016  COG0751 pfam02092 TIGR00211 PMB112
gmk NP_418105  COG0194 pfam00625 TIGR03263 PMB079
groEL NP_418567  COG0459 pfam00118 TIGR02348 PMB113
grpE NP_417104  COG0576 pfam01025 PMB114
guaB NP_417003  COG0516 pfam00478 TIGR01302 PMB149
gyrA NP_416734  COG0188 pfam00521 TIGR01063 PMB091
gyrB YP_026241  COG0187 pfam00204 TIGR01059 PMB092
hemC YP_026260  COG0181 pfam01379 TIGR00212 PMB150
hemE NP_418425  COG0407 pfam01208 TIGR01464 PMB151
hemN NP_418303  COG0635 pfam04055 TIGR00538 PMB152
hisS NP_417009  COG0124 pfam13393 TIGR00442 PMB042
holA NP_415173  COG1466 pfam06144 TIGR01128 PMB153
ileS NP_414567  COG0060 pfam00133 TIGR00392 PMB043
infA NP_415404  COG0361 pfam01176 TIGR00008 PMB115
infB NP_417637  COG0532 pfam11987 TIGR00487 PMB035
infC NP_416233  COG0290 pfam00707 TIGR00168 PMB055
ispF NP_417226  COG0245 pfam02542 TIGR00151 PMB154
lepA NP_417064  COG0481 pfam06421 TIGR01393 PMB056
leuS NP_415175  COG0495 pfam13603 TIGR00396 PMB022
ligA NP_416906  COG0272 pfam01653 TIGR00575 PMB093
manB -  COG1109 pfam02878 PMB155
map NP_414710  COG0024 pfam00557 TIGR00500 PMB156
metG NP_416617  COG0143 pfam09334 TIGR00398 PMB116
mfd NP_415632  COG1197 TIGR00580 PMB117
miaA NP_418592  COG0324 pfam01715 TIGR00174 PMB157
mraY NP_414629  COG0472 pfam00953 TIGR00445 PMB158
mreC NP_417716  COG1792 pfam04085 TIGR00219 PMB159
murB NP_418403  COG0812 pfam02873 TIGR00179 PMB160
murC NP_414633  COG0773 TIGR01082 PMB161
murD NP_414630  COG0771 TIGR01087 PMB118
murE NP_414627  COG0769 TIGR01085 PMB162
murG NP_414632  COG0707 pfam03033 TIGR01133 PMB163
mutL NP_418591  COG0323 pfam08676 TIGR00585 PMB164
mutS NP_417213  COG0249 pfam00488 TIGR01070 PMB165
nrdA NP_416737  COG0209 pfam02867 TIGR02506 PMB166
nrdR NP_414947  COG1327 pfam03477 TIGR00244 PMB167
nusA NP_417638  COG0195 pfam08529 TIGR01953 PMB057
nusB NP_414950  COG0781 pfam01029 TIGR01951 PMB168
nusG NP_418409  COG0250 pfam02357 TIGR00922 PMB044
pepP -  COG0006 pfam01321 PMB169
pgk NP_417401  COG0126 pfam00162 PMB094
pheS NP_416229  COG0016 pfam01409 TIGR00468 PMB001
pheT NP_416228  COG0072 pfam17759 TIGR00472 PMB036
plsX NP_415608  COG0416 pfam02504 TIGR00182 PMB170
pnp NP_417633  COG1185 pfam01138 TIGR03591 PMB095
prfA NP_415729  COG0216 pfam03462 TIGR00019 PMB045
prfB NP_418300  COG0749 pfam00476 TIGR00593 PMB096
priA NP_418370  COG1198 pfam17764 TIGR00595 PMB171
proS NP_414736  COG0442 pfam04073 TIGR00409 PMB119
pth NP_415722  COG0193 pfam01195 TIGR00447 PMB172
purB NP_415649  COG0015 pfam00206 TIGR00928 PMB173
purM NP_416994  COG0150 pfam02769 TIGR00878 PMB174
pyrG NP_417260  COG0504 pfam06418 TIGR00337 PMB058
pyrH NP_414713  COG0528 pfam00696 TIGR02075 PMB120
radA NP_418806  COG1066 TIGR00416 PMB097
rbfA NP_417636  COG0858 pfam02033 TIGR00082 PMB098
recA NP_417179  COG0468 pfam00154 TIGR02012 PMB059
recF NP_418155  COG1195 TIGR00611 PMB175
recG NP_418109  COG1200 pfam00270 TIGR00643 PMB176
recN YP_026172  COG0497 pfam13476 TIGR00634 PMB121
recR NP_415005  COG0353 pfam13662 TIGR00615 PMB099
ribF NP_414566  COG0196 pfam06574 TIGR00083 PMB122
rimM NP_417099  COG0806 pfam01782 TIGR02273 PMB123
rimP NP_417639  COG0779 pfam02576 PMB177
rlmB NP_418601  COG0566 pfam00588 TIGR00186 PMB178
rnc NP_417062  COG0571 pfam14622 TIGR02191 PMB100
rnhB NP_414725  COG0164 pfam01351 PMB124
rpL1 NP_418411  COG0081 pfam00687 TIGR01169 PMB002
rpL2 NP_417776  COG0090 pfam03947 TIGR01171 PMB014
rpL3 NP_417779  COG0087 pfam00297 TIGR03625 PMB003
rpL4 NP_417778  COG0088 pfam00573 TIGR03953 PMB023
rpL5 NP_417767  COG0094 pfam00673 PMB015
rpL6 NP_417764  COG0097 pfam00347 TIGR03654 PMB004
rpL7L12 NP_418413  COG0222 pfam00542 TIGR00855 PMB060
rpL9 NP_418624  COG0359 pfam03948 TIGR00158 PMB061
rpL10 NP_418412  COG0244 pfam00466 PMB037
rpL11 NP_418410  COG0080 pfam00298 TIGR01632 PMB005
rpL13 NP_417698  COG0102 pfam00572 TIGR01066 PMB006
rpL14 NP_417769  COG0093 pfam00238 TIGR01067 PMB016
rpL15 NP_417760  COG0200 pfam00828 TIGR01071 PMB024
rpL16 NP_417772  COG0197 pfam00252 TIGR01164 PMB007
rpL17 NP_417753  COG0203 pfam01196 TIGR00059 PMB062
rpL18 NP_417763  COG0256 pfam00861 TIGR00060 PMB025
rpL19 NP_417097  COG0335 pfam01245 TIGR01024 PMB080
rpL20 NP_416231  COG0292 pfam00453 TIGR01032 PMB046
rpL21 NP_417653  COG0261 pfam00829 TIGR00061 PMB063
rpL22 NP_417774  COG0091 pfam00237 TIGR01044 PMB017
rpL23 NP_417777  COG0089 pfam00276 TIGR03636 PMB064
rpL24 NP_417768  COG0198 pfam17136 TIGR01079 PMB038
rpL25 NP_416690  COG1825 pfam01386 TIGR00731 PMB179
rpL27 NP_417652  COG0211 pfam01016 TIGR00062 PMB081
rpL28 NP_418094  COG0227 pfam00830 TIGR00009 PMB180
rpL29 NP_417771  COG0255 pfam00831 TIGR00012 PMB082
rpL32 NP_415607  COG0333 pfam01783 TIGR01031 PMB125
rpL34 NP_418158  COG0230 pfam00468 TIGR01030 PMB126
rpL35 NP_416232  COG0291 pfam01632 TIGR00001 PMB083
rpoA NP_417754  COG0202 pfam01193 TIGR02027 PMB026
rpoB NP_418414  COG0085 pfam00562 TIGR02013 PMB027
rpoC NP_418415  COG0086 pfam04997 TIGR02386 PMB065
rpS1 NP_415431  COG0539 pfam00575 TIGR00717 PMB181
rpS2 NP_414711  COG0052 pfam00318 TIGR01011 PMB008
rpS3 NP_417773  COG0092 pfam00189 TIGR01009 PMB009
rpS4 NP_417755  COG0522 pfam00163 TIGR01017 PMB039
rpS5 NP_417762  COG0098 pfam03719 TIGR01021 PMB018
rpS6 NP_418621  COG0360 pfam01250 TIGR00166 PMB066
rpS7 NP_417800  COG0049 pfam00177 TIGR01029 PMB010
rpS8 NP_417765  COG0096 pfam00410 PMB011
rpS9 NP_417697  COG0103 pfam00380 TIGR03627 PMB012
rpS10 NP_417780  COG0051 pfam00338 TIGR01049 PMB040
rpS11 NP_417756  COG0100 pfam00411 TIGR03632 PMB019
rpS12 NP_417801  COG0048 pfam00164 TIGR00981 PMB028
rpS13 NP_417757  COG0099 pfam00416 TIGR03631 PMB029
rpS14 NP_417766  COG0199 pfam00253 PMB182
rpS15 NP_417634  COG0184 pfam00312 TIGR00952 PMB020
rpS16 NP_417100  COG0228 pfam00886 TIGR00002 PMB084
rpS17 NP_417770  COG0186 pfam00366 TIGR03635 PMB030
rpS18 NP_418623  COG0238 pfam01084 TIGR00165 PMB101
rpS19 NP_417775  COG0185 pfam00203 TIGR01050 PMB031
rpS20 NP_414564  COG0268 pfam01649 TIGR00029 PMB067
rseP NP_414718  COG0750 pfam02163 TIGR00054 PMB183
rsfS NP_415170  COG0799 pfam02410 TIGR00090 PMB127
rsmA NP_414593  COG0030 pfam00398 TIGR00755 PMB068
rsmD NP_417922  COG0742 pfam03602 TIGR00095 PMB184
rsmG NP_418196  COG0357 pfam02527 TIGR00138 PMB185
rsmH NP_414624  COG0275 pfam01795 TIGR00006 PMB047
ruvA NP_416375  COG0632 pfam01330 TIGR00084 PMB128
ruvB NP_416374  COG2255 pfam05496 TIGR00635 PMB069
ruvC NP_416377  COG0817 pfam02075 TIGR00228 PMB186
secA NP_414640  COG0653 pfam07517 TIGR00963 PMB085
secE NP_418408  COG0690 pfam00584 TIGR00964 PMB102
secG NP_417642  COG1314 pfam03840 TIGR00810 PMB103
secY NP_417759  COG0201 pfam00344 TIGR00967 PMB013
serS NP_415413  COG0172 pfam00587 TIGR00414 PMB032
smpB NP_417110  COG0691 pfam01668 TIGR00086 PMB048
thrS NP_416234  COG0441 pfam00587 TIGR00418 PMB129
tig NP_414970  COG0544 pfam05697 TIGR00115 PMB130
tilS NP_414730  COG0037 pfam01171 TIGR02432 PMB070
tmk NP_415616  COG0125 pfam02223 TIGR00041 PMB187
topA NP_415790  COG0550 pfam01131 TIGR01051 PMB188
tpiA NP_418354  COG0149 pfam00121 TIGR00419 PMB189
trmD NP_417098  COG0336 pfam01746 TIGR00088 PMB071
trmU NP_415651  COG0482 pfam03054 TIGR00420 PMB104
trpS NP_417843  COG0180 pfam00579 TIGR00233 PMB190
truB NP_417635  COG0130 pfam01509 TIGR00431 PMB049
trxA NP_416699  COG0526 pfam08534 TIGR00385 PMB191
trxB NP_415408  COG0492 pfam07992 TIGR01292 PMB192
tsaD NP_417536  COG0533 pfam00814 TIGR03723 PMB050
tsf NP_414712  COG0264 pfam00889 TIGR00116 PMB072
tufA NP_417798  COG0050 TIGR00485 PMB131
typA YP_026274  COG1217 TIGR01394 PMB132
tyrS NP_416154  COG0162 pfam00579 TIGR00234 PMB133
uvrB NP_415300  COG0556 pfam17757 TIGR00631 PMB105
uvrC NP_416423  COG0322 pfam08459 TIGR00194 PMB193
valS NP_418679  COG0525 pfam00133 TIGR00422 PMB073
ybeY NP_415192  COG0319 pfam02130 TIGR00043 PMB074
ychF NP_415721  COG0012 pfam06071 TIGR00092 PMB033
yeaZ NP_416321  COG1214 pfam00814 TIGR03725 PMB194
yqgF NP_417424  COG0816 pfam03652 TIGR00250 PMB134
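
To reuse the table in scripts, the gene names can be paired with their recommended PMB identifiers. The following is a minimal sketch, assuming the rows above have been saved to a hypothetical whitespace-separated text file (gene name in the first column, as displayed here); the function name and file layout are illustrative only.

```shell
#!/bin/sh
# pmb_pairs FILE
# Prints "GENE PMBxxx" for every row that contains a PMB identifier.
# FILE is assumed to follow the whitespace-separated layout of the table above.
pmb_pairs() {
  awk '{
    # scan the columns after the gene name for the first PMB identifier
    for (i = 2; i <= NF; i++)
      if ($i ~ /^PMB[0-9]+$/) { print $1, $i; break }
  }' "$1"
}
```

The header row is skipped automatically, because its PMB column label carries no digits.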


Download

Downloading files

For each gene name GENE and each accession identifier ACCN available in the above table, the reference multiple amino acid sequence alignment (MSA), hidden Markov model (HMM) and position-specific scoring matrix (PSSM) files can be accessed via the following URL models, respectively:

 https://giphy.pasteur.fr/PhyloM/bacteria/aln/GENE.ACCN.faa
 https://giphy.pasteur.fr/PhyloM/bacteria/hmm/GENE.ACCN.hmm
 https://giphy.pasteur.fr/PhyloM/bacteria/smp/GENE.ACCN.smp

For example, the reference MSA PMB034 for the gene rpL9 can be downloaded using wget or curl with the following Linux command lines, respectively:

 wget -q https://giphy.pasteur.fr/PhyloM/bacteria/aln/rpL9.PMB034.faa
 curl --silent -O https://giphy.pasteur.fr/PhyloM/bacteria/aln/rpL9.PMB034.faa
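
The three URL models above can be wrapped in a small helper that fetches the MSA, HMM and PSSM files of one marker in a single call. This is a minimal sketch: the function names marker_url and fetch_marker are hypothetical, and whether a given GENE/ACCN pair resolves depends on the server.

```shell
#!/bin/sh
BASE_URL="https://giphy.pasteur.fr/PhyloM/bacteria"

# marker_url KIND GENE ACCN
# KIND is one of: aln (MSA, .faa), hmm (HMM, .hmm), smp (PSSM, .smp).
marker_url() {
  case "$1" in
    aln) ext=faa ;;
    hmm) ext=hmm ;;
    smp) ext=smp ;;
    *) echo "unknown kind: $1" >&2; return 1 ;;
  esac
  printf '%s/%s/%s.%s.%s\n' "$BASE_URL" "$1" "$2" "$3" "$ext"
}

# fetch_marker GENE ACCN
# Downloads the three files of one marker into the current directory.
fetch_marker() {
  for kind in aln hmm smp; do
    url=$(marker_url "$kind" "$1" "$2") || return 1
    # --fail makes curl return a non-zero status on HTTP errors
    curl --silent --fail -O "$url" || return 1
  done
}
```

For instance, `fetch_marker alaS PMB051` would request alaS.PMB051.faa, alaS.PMB051.hmm and alaS.PMB051.smp, using the gene name and PMB identifier listed in the marker table.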


Downloading archives

For each marker set MSET (i.e. Cic31, Cre40, Dup107, PhyEco, Aln36, Lan73, bac120, Col61, UBCG2, Tia85, PhyloM), three different files can be downloaded:
 • a tar.gz archive containing the recommended reference MSA files (PMB identifiers),
 • a tar.gz archive containing the associated HMM files,
 • a tar.gz archive containing the associated PSSM files.

The MSA, HMM and PSSM archives associated with a marker set MSET can be accessed via the following URL models, respectively:

 https://giphy.pasteur.fr/PhyloM/bacteria/tgz/MSET.aln.tar.gz
 https://giphy.pasteur.fr/PhyloM/bacteria/tgz/MSET.hmm.tar.gz
 https://giphy.pasteur.fr/PhyloM/bacteria/tgz/MSET.smp.tar.gz

For example, the 74 HMM files associated with the marker set PhyloM can be downloaded using wget or curl with the following Linux command lines, respectively:

 wget -q https://giphy.pasteur.fr/PhyloM/bacteria/tgz/PhyloM.hmm.tar.gz
 curl --silent -O https://giphy.pasteur.fr/PhyloM/bacteria/tgz/PhyloM.hmm.tar.gz
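
The archive URL models can likewise be scripted to fetch and unpack the three archives of a marker set. This is a minimal sketch: the function names archive_url and fetch_set are hypothetical, and the internal layout of the archives is not described on this page, so the files are simply extracted into one directory.

```shell
#!/bin/sh
BASE_URL="https://giphy.pasteur.fr/PhyloM/bacteria"

# archive_url MSET KIND  (KIND: aln, hmm or smp)
archive_url() {
  printf '%s/tgz/%s.%s.tar.gz\n' "$BASE_URL" "$1" "$2"
}

# fetch_set MSET [DIR]
# Downloads the MSA, HMM and PSSM archives of MSET and unpacks them into DIR
# (DIR defaults to a directory named after the marker set).
fetch_set() {
  dir=${2:-$1}
  mkdir -p "$dir" || return 1
  for kind in aln hmm smp; do
    file="$dir/$1.$kind.tar.gz"
    # --fail makes curl return a non-zero status on HTTP errors
    curl --silent --fail -o "$file" "$(archive_url "$1" "$kind")" || return 1
    tar -xzf "$file" -C "$dir" || return 1
  done
}
```

For instance, `fetch_set PhyloM` would place the three PhyloM archives and their extracted contents in a directory named PhyloM.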


Literature cited

Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, Ijaz UZ, Lahti L, Loman NJ, Andersson AF, Quince C (2014) Binning metagenomic contigs by coverage and composition. Nature Methods, 11:1144–1146. doi:10.1038/nmeth.3103

Ankenbrand MJ, Keller A (2016) bcgTree: automatized phylogenetic tree building from bacterial core genomes. Genome, 59(10). doi:10.1139/gen-2015-0175

Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P (2006) Toward automatic reconstruction of a highly resolved tree of life. Science, 311(5765):1283-1287. doi:10.1126/science.1123061

Coleman GA, Davín AA, Mahendrarajah TA, Szánthó LL, Spang A, Hugenholtz P, Szöllősi GJ, Williams TA (2021) A rooted phylogeny resolves early bacterial evolution. Science, 372(6542):eabe0511. doi:10.1126/science.abe0511

Creevey CJ, Doerks T, Fitzpatrick DA, Raes J, Bork P (2011) Universally distributed single-copy genes indicate a constant rate of horizontal transfer. PLoS ONE, 6(8):e22099. doi:10.1371/journal.pone.0022099

Dupont CL, Rusch DB, Yooseph S, Lombardo M-J, Richter RA, Valas R, Novotny M, Yee-Greenbaum J, Selengut JD, Haft DH, Halpern AL, Lasken RS, Nealson K, Friedman R, Venter JC (2012) Genomic insights to SAR86, an abundant and uncultivated marine bacterial lineage. The ISME Journal, 6(6):1186–1199. doi:10.1038/ismej.2011.189

Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Research, 44:D279-285. doi:10.1093/nar/gkv1344

Galperin MY, Makarova KS, Wolf YI, Koonin EV (2015) Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Research, 43:D261-9. doi:10.1093/nar/gku1223

Galperin MY, Wolf YI, Makarova KS, Alvarez RV, Landsman D, Koonin EV (2021) COG database update: focus on microbial diversity, model organisms, and widespread pathogens. Nucleic Acids Research, 49(D1):D274-D281. doi:10.1093/nar/gkaa1018

Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT, White O (2001) TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Research, 29(1):41-43. doi:10.1093/nar/29.1.41

Haft DH, Selengut JD, White O (2003) The TIGRFAMs database of protein families. Nucleic Acids Research, 31(1):371-373. doi:10.1093/nar/gkg128

Haft DH, Selengut JD, Richter RA, Harkins D, Basu MK, Beck E (2013) TIGRFAMs and Genome Properties in 2013. Nucleic Acids Research, 41:D387-95. doi:10.1093/nar/gks1234

Kim J, Na S-I, Kim D, Chun J (2021) UBCG2: Up-to-date bacterial core genes and pipeline for phylogenomic analysis. Journal of Microbiology, 59(6):609-615. doi:10.1007/s12275-021-1231-4

Mende DR, Sunagawa S, Zeller G, Bork P (2013) Accurate and universal delineation of prokaryotic species. Nature Methods, 10:881-884. doi:10.1038/nmeth.2575

Na S-I, Kim YO, Yoon SH, Ha SM, Baek I, Chun J (2018) UBCG: Up-to-date bacterial core gene set and pipeline for phylogenomic tree reconstruction. Journal of Microbiology, 56:280–285. doi:10.1007/s12275-018-8014-6

Parks DH, Chuvochina M, Chaumeil P-A, Rinke C, Mussig AJ, Hugenholtz P (2020) A complete domain-to-species taxonomy for Bacteria and Archaea. Nature Biotechnology, 38(9):1079-1086. doi:10.1038/s41587-020-0501-8

Parks DH, Chuvochina M, Rinke C, Mussig AJ, Chaumeil P-A, Hugenholtz P (2022) GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Research, 50(D1):D785-D794. doi:10.1093/nar/gkab776

Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil P-A, Hugenholtz P (2018) A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nature Biotechnology, 36:996-1004. doi:10.1038/nbt.4229

Parks DH, Rinke C, Chuvochina M, Chaumeil P-A, Woodcroft BJ, Evans PN, Hugenholtz P, Tyson GW (2017) Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nature Microbiology, 2:1533-1542. doi:10.1038/s41564-017-0012-7

Quince C, Delmont TO, Raguideau S, Alneberg J, Darling AE, Collins G, Eren AM (2017) DESMAN: a new tool for de novo extraction of strains from metagenomes. Genome Biology, 18:181. doi:10.1186/s13059-017-1309-9

Sonnhammer ELL, Eddy SR, Birney E, Bateman A, Durbin R (1998) Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Research, 26:320-322. doi:10.1093/nar/26.1.320

Sonnhammer ELL, Eddy SR, Durbin R (1997) Pfam: a comprehensive database of protein families based on seed alignments. Proteins, 28:405-420. doi:10.1002/(sici)1097-0134(199707)28:3<405::aid-prot10>3.0.co;2-l

Sorek R, Zhu Y, Creevey CJ, Francino MP, Bork P, Rubin EM (2007) Genome-wide experimental determination of barriers to horizontal gene transfer. Science, 318(5855):1449-52. doi:10.1126/science.1147112

Sunagawa S, Mende DR, Zeller G, Izquierdo-Carrasco F, Berger SA, Kultima JR, Coelho LP, Arumugam M, Tap J, Nielsen HB, Rasmussen S, Brunak S, Pedersen O, Guarner F, de Vos WM, Wang J, Li J, Doré J, Ehrlich SD, Stamatakis A, Bork P (2013) Metagenomic species profiling using universal phylogenetic marker genes. Nature Methods, 10(12):1196-1199. doi:10.1038/nmeth.2693

Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4:41. doi:10.1186/1471-2105-4-41

Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on protein families. Science, 278(5338):631-637. doi:10.1126/science.278.5338.631

Tian R, Imanian B (2023) VBCG: 20 validated bacterial core genes for phylogenomic analysis with high fidelity and resolution. Microbiome, 11:247. doi:10.1186/s40168-023-01705-9

Wu M, Eisen JA (2008) A simple, fast, and accurate method of phylogenomic inference. Genome Biology, 9(10):R151. doi:10.1186/gb-2008-9-10-r151

Wu D, Jospin G, Eisen JA (2013) Systematic identification of gene families for use as "markers" for phylogenetic and phylogeny-driven ecological studies of bacteria and archaea and their major subgroups. PLoS One, 8(10):e77033. doi:10.1371/journal.pone.0077033

Wu M, Scott AJ (2012) Phylogenomic analysis of bacterial and archaeal sequences with AMPHORA2. Bioinformatics, 28(7):1033–1034. doi:10.1093/bioinformatics/bts079


Data sources

       COG                      Pfam                       TIGR
 MSA   CDD FTP (fasta.tar.gz)   CDD FTP (fasta.tar.gz)     CDD FTP (fasta.tar.gz)
 HMM   COGcollator              Pfam FTP (Pfam-A.hmm.gz)   TIGRFAMs FTP (release 15.0)
 PSSM  CDD FTP (cdd.tar.gz)     CDD FTP (cdd.tar.gz)       CDD FTP (cdd.tar.gz)


Deprecated versions

[22.07.28]  
[18.07.08]