PhyloM: bacteria is a compilation of markers that are well-suited for phylogenetic tree inference of bacterial taxa.
These selected markers are recommended for phylogenetic reconstruction because they have been shown to correspond to persistent genes within bacterial phyla (close to universal distribution).
This gene selection relies mainly on the independent and complementary work of Bratlie et al. (2010), Creevey et al. (2011), Wu et al. (2013) and Parks et al. (2017).
Bratlie et al. (2010) classified more than 200 genes with respect to operon participation (i.e. strong or weak operon genes) and duplication event involvement (i.e. singleton or duplicate ortholog sequence sets).
Creevey et al. (2011) compiled 40 universal single-copy marker genes (bacteria, archaea and eukaryotes), i.e. genes that are putatively essential and likely to be little affected by duplication events.
Wu et al. (2013) selected 114 genes associated with an informative phylogenetic signal and low variation in copy number across taxa.
Parks et al. (2017) identified 120 phylogenetically informative genes that are generally present as a single-copy sequence in a large proportion of representative genomes (see also Parks et al. 2018, and Chaumeil et al. 2020).
For each phylogenetic marker (and each marker set), this webpage provides the reference multiple amino acid sequence alignments (MSA) and the associated position specific scoring matrices (PSSM) gathered from different sources (when available): COG database (Tatusov et al. 1997, 2003; Galperin et al. 2015), NCBI protein cluster collection (PRK), Pfam database (Finn et al. 2016), TIGRFAMs database (Haft et al. 2001), and PhyEco repository (Wu et al. 2013). MSAs and PSSMs can be used to perform accurate BLAST searches against amino acid and nucleotide sequence databanks, respectively (see Usage).
Each of the compiled marker sets (Bratlie et al. 2010; Creevey et al. 2011; Wu et al. 2013; Parks et al. 2017) is summarized in one of the tables below.
Each table provides download links to:
— a text file containing the gene name list,
— a tar.gz archive containing the reference multiple amino acid sequence alignments (MSA),
— a tar.gz archive containing the associated position specific scoring matrices (PSSM).
The 203 loci classified by Bratlie et al. (2010) are labelled as 'pgoc' (persistent gene operon category) and subdivided according to the authors' classification:
— sso (singleton strong operon): not (or little) involved in duplication events (singleton), and likely not involved in horizontal transfer events (strong operon),
— swo (singleton weak operon): not (or little) involved in duplication events (singleton), but may be involved in horizontal transfer events (weak operon),
— dso (duplicate strong operon): likely involved in duplication events (duplicate), but likely not involved in horizontal transfer events (strong operon),
— dwo (duplicate weak operon): likely involved in duplication events (duplicate), and may also be involved in horizontal transfer events (weak operon).
For more details about the definitions of singleton, duplicate, strong operon and weak operon proteins, see the Introduction of Bratlie et al. (2010).
For each locus subset (sso, swo, dso, dwo), the table below provides download links to the original MSA and PSSM ('pgoc original', from the COG; see Sup. Table S2), as well as selected MSA and PSSM ('pgoc PhyloM', mainly from the NCBI protein cluster collection).
The 67 markers from the set 'sso' can be used to perform phylogenetic tree inference of bacterial taxa, and the selected MSA or PSSM from the set 'pgoc PhyloM' are recommended to perform BLAST searches.
marker set | no. loci | gene names | pgoc (original) | pgoc (PhyloM) |
sso | 67 | |||
swo | 41 | |||
dso | 40 | |||
dwo | 55 |
The 40 genes compiled by Creevey et al. (2011) are labelled as 'uscg' (universal single copy genes). The table below provides download links to the original MSA and PSSM ('uscg original', from the COG; see Table 1), as well as selected MSA and PSSM ('uscg PhyloM', mainly from the NCBI protein cluster collection). These 40 markers can be used to perform phylogenetic tree inference of bacterial taxa, and the selected MSA or PSSM from the set 'uscg PhyloM' are recommended to perform BLAST searches.
marker set | no. loci | gene names | original | PhyloM |
uscg | 40 |
The 114 markers selected by Wu et al. (2013) are labelled as 'PhyEco'. The table below provides download links to the original MSA and PSSM ('PhyEco original'; see bacteria.tgz), as well as selected MSA and PSSM ('PhyEco PhyloM', mainly from the NCBI protein cluster collection). These 114 markers can be used to perform phylogenetic tree inference of any bacterial taxa, and the selected MSA or PSSM from the set 'PhyEco PhyloM' are recommended to perform BLAST searches.
marker set | no. loci | gene names | original | PhyloM |
PhyEco | 114 |
The 120 markers identified by Parks et al. (2017) are labelled as 'bac120'. The table below provides download links to the original MSA and PSSM ('bac120 original'; see Sup. Table S6), as well as selected MSA and PSSM ('bac120 PhyloM', mainly from the NCBI protein cluster collection). These 120 markers can be used to perform phylogenetic tree inference of any bacterial taxa, and the selected MSA or PSSM from the set 'bac120 PhyloM' are recommended to perform BLAST searches.
marker set | no. loci | gene names | original | PhyloM |
bac120 | 120 |
The 236 phylogenetic markers overall are classified into four categories (A to D) according to how many of the four marker sets pgoc, uscg, PhyEco and bac120 (see above) contain them, from all four (A) to only one (D). For each category A-D, the table below provides download links to selected MSA and PSSM (mainly from the NCBI protein cluster collection). The 25 markers of category A are highly recommended for phylogenetic tree inference of any bacterial taxa, as they were identified in all four compiled studies. To deal with larger datasets, it is also recommended to use the 81 markers (25 + 56) of categories A and B, as these markers were identified in most of the four compiled studies.
marker set | no. loci | gene names | PhyloM |
A | 25 | ||
B | 56 | ||
C | 56 | ||
D | 99 |
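As an illustration of combining categories A and B, the sketch below merges two gene name lists into a single de-duplicated list. It assumes the lists have already been downloaded as PhyloM.A.txt and PhyloM.B.txt and that each file holds one gene name per line; the file format is an assumption, and the function name merge_gene_lists is hypothetical.

```shell
#!/usr/bin/env bash
# merge_gene_lists FILE...: print the sorted, de-duplicated union of gene names.
# Assumption: one gene name per line; blank lines are ignored.
merge_gene_lists() {
  cat -- "$@" | sed '/^[[:space:]]*$/d' | sort -u
}

# Example, once the two lists are available locally:
#   merge_gene_lists PhyloM.A.txt PhyloM.B.txt > PhyloM.AB.txt
#   wc -l < PhyloM.AB.txt    # number of markers in the merged list
```

Counting the lines of the merged file is a quick check that the expected number of markers was obtained.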
All 236 phylogenetic markers are listed in the table below.
They are sorted by position in the reference genome of Escherichia coli strain K12 substr. MG1655 (GenBank accn: NC_000913).
Common gene names are available in the column 'name' with a link to the UniProt description.
When a gene is organized within a well-documented operon, a link to the corresponding RegulonDB description is available in the column 'E. coli operon'.
The column 'E. coli CDS' lists the NCBI accession numbers of the E. coli strain K12 substr. MG1655 CDS with links to the corresponding entries in the NCBI Conserved Domain Database.
For each gene, the corresponding accession number (if available) is given for
the COG database ('COG'),
the NCBI protein cluster collection ('PRK'),
the Pfam database ('Pfam'),
the TIGRFAMs database ('TIGRFAMs'), and
the PhyEco repository ('PhyEco').
The column 'PhyloM' contains the MSA and PSSM that were selected (mainly from PRK) because they yield accurate BLAST results in practice.
Presence/absence of each marker in the four compiled studies is indicated in columns 'PhyEco', 'bac120', 'pgoc' and 'uscg', respectively.
PhyloM categories A-D (see above) are indicated in the column 'category'.
name | E. coli operon | E. coli CDS | COG | PRK | Pfam | TIGRFAMs | PhyloM | PhyEco | bac120 | pgoc | uscg | category |
dnaK | NP_414555 | COG0443 | PRK00290 | pfam00012 | TIGR02350 | PMB0082 | dwo | C | ||||
rpsT-S20 | NP_414564 | COG0268 | PRK00239 | pfam01649 | TIGR00029 | PMB0026 | B000077 | swo | B | |||
ribF | NP_414566 | COG0196 | PRK05627 | pfam06574 | TIGR00083 | PMB0027 | B000096 | dwo | B | |||
ileS | NP_414567 | COG0060 | PRK05743 | pfam00133 | TIGR00392 | PMB0083 | dwo | C | ||||
lspA | NP_414568 | COG0597 | PRK00376 | pfam01252 | TIGR00077 | PMB0138 | dso | D | ||||
carA | NP_414573 | COG0505 | PRK12564 | pfam00988 | TIGR01368 | PMB0139 | swo | D | ||||
carB | NP_414574 | COG0458 | PRK05294 | pfam02786 | TIGR01369 | PMB0140 | dwo | D | ||||
rsmA-ksgA | NP_414593 | COG0030 | PRK00274 | pfam00398 | TIGR00755 | PMB0084 | sso | C | ||||
rsmH-mraW | NP_414624 | COG0275 | PRK00050 | pfam01795 | TIGR00006 | PMB0028 | B000067 | sso | B | |||
ftsI | NP_414626 | COG0768 | PRK15105 | pfam00905 | TIGR02214 | PMB0141 | dso | D | ||||
murE | NP_414627 | COG0769 | PRK00139 | pfam08245 | TIGR01085 | PMB0085 | B000105 | dso | C | |||
murF | NP_414628 | COG0770 | PRK10773 | pfam08245 | TIGR01143 | PMB0142 | dso | D | ||||
mraY | NP_414629 | COG0472 | PRK00108 | pfam00953 | TIGR00445 | PMB0086 | sso | C | ||||
murD | NP_414630 | COG0771 | PRK03806 | pfam08245 | TIGR01087 | PMB0029 | B000068 | dso | B | |||
ftsW | NP_414631 | COG0772 | PRK10774 | pfam01098 | TIGR02614 | PMB0143 | dso | D | ||||
murG | NP_414632 | COG0707 | PRK00726 | pfam03033 | TIGR01133 | PMB0087 | B000066 | dso | C | |||
murC | NP_414633 | COG0773 | PRK00421 | pfam08245 | TIGR01082 | PMB0088 | dso | C | ||||
ftsA | NP_414636 | COG0849 | PRK09472 | pfam14450 | TIGR01174 | PMB0144 | B000110 | D | ||||
ftsZ | NP_414637 | COG0206 | PRK09330 | pfam12327 | TIGR00065 | PMB0089 | dwo | C | ||||
secA | NP_414640 | COG0653 | PRK12904 | pfam07517 | TIGR00963 | PMB0090 | dwo | C | ||||
coaE | NP_414645 | COG0237 | PRK00081 | pfam01121 | TIGR00152 | PMB0091 | B000085 | sso | C | |||
map | NP_414710 | COG0024 | PRK05716 | pfam00557 | TIGR00500 | PMB0145 | dwo | D | ||||
rpsB-S2 | NP_414711 | COG0052 | PRK05299 | pfam00318 | TIGR01011 | PMB0001 | B000001 | sso | A | |||
tsf | NP_414712 | COG0264 | PRK09377 | pfam00889 | TIGR00116 | PMB0030 | B000056 | sso | B | |||
pyrH | NP_414713 | COG0528 | PRK00358 | pfam00696 | TIGR02075 | PMB0092 | sso | C | ||||
frr | NP_414714 | COG0233 | PRK00083 | pfam01765 | TIGR00496 | PMB0031 | B000079 | sso | B | |||
dxr | NP_414715 | COG0743 | PRK05447 | pfam02670 | TIGR00243 | PMB0146 | B000088 | D | ||||
uppS | NP_414716 | COG0020 | PRK10240 | pfam01255 | TIGR00055 | PMB0147 | dso | D | ||||
rseP | NP_414718 | COG0750 | PRK10779 | pfam02163 | TIGR00054 | PMB0093 | dso | C | ||||
rnhB | NP_414725 | COG0164 | PRK00015 | pfam01351 | TIGR00729 | PMB0148 | B000039 | D | ||||
dnaE | NP_414726 | COG0587 | PRK05673 | pfam07733 | TIGR00594 | PMB0149 | dwo | D | ||||
tilS-mesJ | NP_414730 | COG0037 | PRK10660 | pfam01171 | TIGR02432 | PMB0032 | B000091 | sso | B | |||
tgt | NP_414940 | COG0343 | PRK00112 | pfam01702 | TIGR00430 | PMB0150 | dwo | D | ||||
nrdR | NP_414947 | COG1327 | PRK00464 | pfam03477 | TIGR00244 | PMB0151 | B000087 | D | ||||
nusB | NP_414950 | COG0781 | PRK00202 | pfam01029 | TIGR01951 | PMB0094 | sso | C | ||||
tig | NP_414970 | COG0544 | PRK01490 | pfam05697 | TIGR00115 | PMB0095 | swo | C | ||||
clpX | NP_414972 | COG1219 | PRK05342 | pfam07724 | TIGR00382 | PMB0033 | B000112 | swo | B | |||
dnaX | NP_415003 | COG2812 | PRK07994 | pfam12170 | TIGR02397 | PMB0096 | sso | C | ||||
recR | NP_415005 | COG0353 | PRK00076 | pfam13662 | TIGR00615 | PMB0034 | B000080 | sso | B | |||
purE | NP_415056 | COG0041 | PLN02948 | pfam00731 | TIGR01162 | PMB0152 | dso | D | ||||
cysS | NP_415059 | COG0215 | PRK00260 | pfam01406 | TIGR00435 | PMB0035 | swo | B | ||||
rsfS | NP_415170 | COG0799 | PRK11538 | pfam02410 | TIGR00090 | PMB0097 | B000062 | C | ||||
holA | NP_415173 | COG1466 | PRK05574 | pfam14840 | TIGR01128 | PMB0153 | D | |||||
leuS | NP_415175 | COG0495 | PRK00390 | pfam13603 | TIGR00396 | PMB0002 | B000108 | swo | A | |||
ybeY | NP_415192 | COG0319 | PRK00016 | pfam02130 | TIGR00043 | PMB0036 | B000081 | sso | B | |||
uvrB | NP_415300 | COG0556 | PRK05298 | pfam12344 | TIGR00631 | PMB0098 | swo | C | ||||
infA | NP_415404 | COG0361 | PRK00276 | pfam01176 | TIGR00008 | PMB0154 | dwo | D | ||||
ftsK | NP_415410 | COG1674 | PRK10263 | pfam01580 | TIGR03928 | PMB0155 | dwo | D | ||||
serS | NP_415413 | COG0172 | PRK05431 | pfam00587 | TIGR00414 | PMB0037 | B000073 | B | ||||
aroA | NP_415428 | COG0128 | PRK11860 | pfam00275 | TIGR01356 | PMB0156 | dso | D | ||||
rpsA-S1 | NP_415431 | COG0539 | PRK06299 | pfam00575 | TIGR00717 | PMB0099 | swo | C | ||||
rpmF-L32 | NP_415607 | COG0333 | PRK01110 | pfam01783 | TIGR01031 | PMB0157 | B000106 | D | ||||
plsX | NP_415608 | COG0416 | PRK05331 | pfam02504 | TIGR00182 | PMB0158 | B000090 | D | ||||
fabD | NP_415610 | COG0331 | PLN02752 | pfam00698 | TIGR00128 | PMB0159 | dso | D | ||||
fabG | NP_415611 | COG1028 | PRK05557 | pfam13561 | TIGR01830 | PMB0160 | dso | D | ||||
acpP | NP_415612 | COG0236 | PRK00982 | pfam00550 | TIGR00517 | PMB0161 | dwo | D | ||||
fabF | NP_415613 | COG0304 | PRK07314 | pfam00109 | TIGR03150 | PMB0162 | dso | D | ||||
tmk | NP_415616 | COG0125 | PRK00698 | pfam02223 | TIGR00041 | PMB0163 | sso | D | ||||
holB | NP_415617 | COG0470 | PRK07993 | pfam09115 | TIGR00678 | PMB0164 | sso | D | ||||
ycfH | NP_415618 | COG0084 | PRK10812 | pfam01026 | TIGR00010 | PMB0165 | dso | D | ||||
mfd | NP_415632 | COG1197 | PRK10689 | pfam03461 | TIGR00580 | PMB0038 | B000046 | swo | B | |||
purB | NP_415649 | COG0015 | PRK09285 | pfam00206 | TIGR00928 | PMB0166 | D | |||||
mnmA-trmU | NP_415651 | COG0482 | PRK00143 | pfam03054 | TIGR00420 | PMB0039 | B000098 | dwo | B | |||
ychF | NP_415721 | COG0012 | PRK09601 | pfam06071 | TIGR00092 | PMB0003 | B000083 | swo | A | |||
pth | NP_415722 | COG0193 | PRK05426 | pfam01195 | TIGR00447 | PMB0100 | B000103 | dso | C | |||
prsA | NP_415725 | COG0462 | PRK01259 | pfam13793 | TIGR01251 | PMB0167 | dwo | D | ||||
ispE | NP_415726 | COG1947 | PRK00343 | pfam08544 | TIGR00154 | PMB0168 | sso | D | ||||
prfA | NP_415729 | COG0216 | PRK00591 | pfam03462 | TIGR00019 | PMB0040 | B000053 | sso | B | |||
pemK-prmC | NP_415730 | COG2890 | PRK09328 | pfam05175 | TIGR00536 | PMB0169 | dso | D | ||||
nth | NP_416150 | COG0177 | PRK10702 | pfam00730 | TIGR01083 | PMB0170 | dso | D | ||||
pheT | NP_416228 | COG0072 | PRK00629 | pfam03483 | TIGR00472 | PMB0041 | B000013 | fsso | B | |||
pheS | NP_416229 | COG0016 | PRK00488 | pfam01409 | TIGR00468 | PMB0004 | B000020 | sso | A | |||
rplT-L20 | NP_416231 | COG0292 | PRK05185 | pfam00453 | TIGR01032 | PMB0042 | B000070 | swo | B | |||
rpmI-L35 | NP_416232 | COG0291 | PRK00172 | pfam01632 | TIGR00001 | PMB0101 | B000099 | swo | C | |||
infC | NP_416233 | COG0290 | PRK00028 | pfam00707 | TIGR00168 | PMB0043 | B000104 | swo | B | |||
thrS | NP_416234 | COG0441 | PRK00413 | pfam00587 | TIGR00418 | PMB0171 | dwo | D | ||||
tsaB-yeaZ | NP_416321 | COG1214 | PRK09604 | pfam00814 | TIGR03725 | PMB0102 | sso | C | ||||
ruvB | NP_416374 | COG2255 | PRK00080 | pfam05496 | TIGR00635 | PMB0044 | B000072 | sso | B | |||
ruvA | NP_416375 | COG0632 | PRK00116 | pfam01330 | TIGR00084 | PMB0045 | B000071 | sso | B | |||
ruvC | NP_416377 | COG0817 | PRK00039 | pfam02075 | TIGR00228 | PMB0172 | B000093 | D | ||||
yebC | NP_416378 | COG0217 | PRK00110 | pfam01709 | TIGR01033 | PMB0173 | dwo | D | ||||
aspS | NP_416380 | COG0173 | PRK00476 | pfam00152 | TIGR00459 | PMB0103 | dwo | C | ||||
argS | NP_416390 | COG0018 | PRK01611 | pfam00750 | TIGR00456 | PMB0046 | dwo | B | ||||
pgsA | NP_416422 | COG0558 | PRK10832 | pfam01066 | TIGR00560 | PMB0174 | swo | D | ||||
uvrC | NP_416423 | COG0322 | PRK00558 | pfam08459 | TIGR00194 | PMB0104 | swo | C | ||||
metG | NP_416617 | COG0143 | PRK00133 | pfam09334 | TIGR00398 | PMB0175 | D | |||||
rplY-L25 | NP_416690 | COG1825 | PRK05943 | pfam01386 | TIGR00731 | PMB0176 | B000059 | D | ||||
trxA | NP_416699 | COG0526 | PRK15412 | pfam08534 | TIGR00385 | PMB0177 | dwo | D | ||||
gyrA | NP_416734 | COG0188 | PRK05560 | pfam00521 | TIGR01063 | PMB0105 | dwo | C | ||||
purF | NP_416815 | COG0034 | PRK09246 | pfam13537 | TIGR01134 | PMB0178 | dwo | D | ||||
folC | NP_416818 | COG0285 | PRK10846 | pfam08245 | TIGR01499 | PMB0179 | dso | D | ||||
truA | NP_416821 | COG0101 | PRK00021 | pfam01416 | TIGR00071 | PMB0180 | dso | D | ||||
aroC | NP_416832 | COG0082 | PRK05382 | pfam01264 | TIGR00033 | PMB0181 | sso | D | ||||
gltX | NP_416899 | COG0008 | PRK01406 | pfam00749 | TIGR00464 | PMB0182 | dwo | D | ||||
ligA | NP_416906 | COG0272 | PRK07956 | pfam01653 | TIGR00575 | PMB0106 | B000109 | dwo | C | |||
purM | NP_416994 | COG0150 | PRK05385 | pfam02769 | TIGR00878 | PMB0107 | B000038 | sso | C | |||
guaB | NP_417003 | COG0516 | PRK05567 | pfam00478 | TIGR01302 | PMB0183 | D | |||||
der-engA | NP_417006 | COG1160 | PRK00093 | pfam01926 | TIGR03594 | PMB0047 | B000043 | swo | B | |||
hisS | NP_417009 | COG0124 | PRK00037 | pfam13393 | TIGR00442 | PMB0048 | dso | B | ||||
era | NP_417061 | COG1159 | PRK00089 | pfam01926 | TIGR00436 | PMB0108 | sso | C | ||||
rnc | NP_417062 | COG0571 | PRK00102 | pfam14622 | TIGR02191 | PMB0109 | dso | C | ||||
lepA | NP_417064 | COG0481 | PRK05433 | pfam00009 | TIGR01393 | PMB0049 | B000004 | dwo | B | |||
clpB | NP_417083 | COG0542 | PRK10865 | pfam07724 | TIGR03346 | PMB0184 | dwo | D | ||||
rluD-sfhB | NP_417085 | COG0564 | PRK11180 | pfam00849 | TIGR00005 | PMB0185 | swo | D | ||||
rplS-L19 | NP_417097 | COG0335 | PRK05338 | pfam01245 | TIGR01024 | PMB0110 | B000069 | swo | C | |||
trmD | NP_417098 | COG0336 | PRK00026 | pfam01746 | TIGR00088 | PMB0050 | B000058 | sso | B | |||
rimM | NP_417099 | COG0806 | PRK00122 | pfam01782 | TIGR02273 | PMB0051 | B000086 | sso | B | |||
rpsP-S16 | NP_417100 | COG0228 | PRK00040 | pfam00886 | TIGR00002 | PMB0111 | B000094 | sso | C | |||
ffh | NP_417101 | COG0541 | PRK10867 | pfam00448 | TIGR00959 | PMB0005 | B000008 | swo | A | |||
grpE | NP_417104 | COG0576 | PRK10325 | pfam01025 | PMB0112 | dwo | C | |||||
nadK-ppnK | NP_417105 | COG0061 | PRK03378 | pfam01513 | PMB0186 | dwo | D | |||||
recN | YP_026172 | COG0497 | PRK10869 | pfam13476 | TIGR00634 | PMB0113 | B000078 | C | ||||
smpB | NP_417110 | COG0691 | PRK05422 | pfam01668 | TIGR00086 | PMB0052 | B000065 | swo | B | |||
alaS | NP_417177 | COG0013 | PRK00252 | pfam01411 | TIGR00344 | PMB0114 | swo | C | ||||
recA | NP_417179 | COG0468 | PRK09354 | pfam00154 | TIGR02012 | PMB0053 | B000095 | dwo | B | |||
mutS | NP_417213 | COG0249 | PRK05399 | pfam00488 | TIGR01070 | PMB0187 | B000076 | D | ||||
ispF | NP_417226 | COG0245 | PRK00084 | pfam02542 | TIGR00151 | PMB0188 | B000089 | D | ||||
eno | NP_417259 | COG0148 | PRK00077 | pfam00113 | TIGR01060 | PMB0189 | dwo | D | ||||
pyrG | NP_417260 | COG0504 | PRK05380 | pfam06418 | TIGR00337 | PMB0054 | B000047 | swo | B | |||
prfB | NP_417367 | COG1186 | PRK00578 | pfam03462 | TIGR00020 | PMB0190 | D | |||||
metK | NP_417417 | COG0192 | PRK05250 | pfam02773 | TIGR01034 | PMB0191 | dwo | D | ||||
rsmE | NP_417421 | COG1385 | PRK11713 | pfam04452 | TIGR00046 | PMB0192 | dwo | D | ||||
yqgF | NP_417424 | COG0816 | PRK00109 | pfam03652 | TIGR00250 | PMB0115 | sso | C | ||||
rdgB | NP_417429 | COG0127 | PRK00120 | pfam01725 | TIGR00042 | PMB0193 | dso | D | ||||
hemW | NP_417430 | COG0635 | PRK05660 | pfam04055 | TIGR00539 | PMB0194 | D | |||||
tsaD | NP_417536 | COG0533 | PRK09604 | pfam00814 | TIGR03723 | PMB0006 | B000006 | swo | A | |||
dnaG | NP_417538 | COG0358 | PRK05667 | pfam08275 | TIGR01391 | PMB0055 | B000092 | swo | B | |||
rpoD | NP_417539 | COG0568 | PRK05658 | pfam04546 | TIGR02393 | PMB0195 | dwo | D | ||||
yraL | NP_417615 | COG0313 | PRK14994 | pfam00590 | TIGR00096 | PMB0196 | dwo | D | ||||
pnp | NP_417633 | COG1185 | PRK11824 | pfam03726 | TIGR03591 | PMB0056 | B000055 | swo | B | |||
rpsO-S15 | NP_417634 | COG0184 | PRK05626 | pfam00312 | TIGR00952 | PMB0057 | B000034 | swo | B | |||
truB | NP_417635 | COG0130 | PRK05033 | pfam01509 | TIGR00431 | PMB0058 | B000032 | sso | B | |||
rbfA | NP_417636 | COG0858 | PRK00521 | pfam02033 | TIGR00082 | PMB0059 | B000063 | sso | B | |||
infB | NP_417637 | COG0532 | PRK05306 | pfam11987 | TIGR00487 | PMB0060 | B000005 | sso | B | |||
nusA | NP_417638 | COG0195 | PRK09202 | pfam08529 | TIGR01953 | PMB0061 | B000041 | sso | B | |||
rimP | NP_417639 | COG0779 | PRK14640 | pfam02576 | PMB0197 | D | ||||||
secG | NP_417642 | COG1314 | PRK06870 | pfam03840 | TIGR00810 | PMB0198 | D | |||||
folP | NP_417644 | COG0294 | PRK11613 | pfam00809 | TIGR01496 | PMB0199 | dso | D | ||||
hflB-ftsH | NP_417645 | COG0465 | PRK10733 | pfam01434 | TIGR01241 | PMB0200 | dwo | D | ||||
greA | NP_417648 | COG0782 | PRK00226 | pfam03449 | TIGR01462 | PMB0201 | dwo | D | ||||
obg | NP_417650 | COG0536 | PRK12298 | pfam01018 | TIGR02729 | PMB0062 | B000049 | swo | B | |||
rpmA-L27 | NP_417652 | COG0211 | PRK05435 | pfam01016 | TIGR00062 | PMB0116 | B000102 | sso | C | |||
rplU-L21 | NP_417653 | COG0261 | PRK05573 | pfam00829 | TIGR00061 | PMB0063 | B000074 | sso | B | |||
ispB | NP_417654 | COG0142 | PRK10888 | pfam00348 | TIGR02749 | PMB0202 | dwo | D | ||||
murA | NP_417656 | COG0766 | PRK09369 | pfam00275 | TIGR01072 | PMB0203 | dwo | D | ||||
rpsI-S9 | NP_417697 | COG0103 | PRK00132 | pfam00380 | TIGR03627 | PMB0007 | B000011 | sso | A | |||
rplM-L13 | NP_417698 | COG0102 | PRK09216 | pfam00572 | TIGR01066 | PMB0008 | B000037 | sso | A | |||
smf | YP_026211 | COG0758 | PRK10736 | pfam02481 | TIGR00732 | PMB0204 | dwo | D | ||||
mreC | NP_417716 | COG1792 | PRK13922 | pfam04085 | TIGR00219 | PMB0205 | B000101 | D | ||||
def | NP_417745 | COG0242 | PRK00150 | pfam01327 | TIGR00079 | PMB0206 | dso | D | ||||
fmt | NP_417746 | COG0223 | PRK00005 | pfam00551 | TIGR00460 | PMB0117 | sso | C | ||||
rplQ-L17 | NP_417753 | COG0203 | PRK05591 | pfam01196 | TIGR00059 | PMB0064 | B000057 | sso | B | |||
rpoA | NP_417754 | COG0202 | PRK05182 | pfam01193 | TIGR02027 | PMB0009 | B000052 | dso | A | |||
rpsD-S4 | NP_417755 | COG0522 | PRK05327 | pfam00163 | TIGR01017 | PMB0065 | dwo | B | ||||
rpsK-S11 | NP_417756 | COG0100 | PRK05309 | pfam00411 | TIGR03632 | PMB0010 | B000029 | sso | A | |||
rpsM-S13 | NP_417757 | COG0099 | PRK05179 | pfam00416 | TIGR03631 | PMB0066 | B000019 | sso | B | |||
prlA-secY | NP_417759 | COG0201 | PRK09204 | pfam00344 | TIGR00967 | PMB0011 | B000048 | dso | A | |||
rplO-L15 | NP_417760 | COG0200 | PRK05592 | pfam00828 | TIGR01071 | PMB0012 | B000021 | sso | A | |||
rpsE-S5 | NP_417762 | COG0098 | PRK00550 | pfam03719 | TIGR01021 | PMB0013 | B000015 | sso | A | |||
rplR-L18 | NP_417763 | COG0256 | PRK05593 | pfam00861 | TIGR00060 | PMB0067 | B000033 | sso | B | |||
rplF-L6 | NP_417764 | COG0097 | PRK05498 | pfam00347 | TIGR03654 | PMB0014 | B000023 | sso | A | |||
rpsH-S8 | NP_417765 | COG0096 | PRK00136 | pfam00410 | PMB0015 | B000031 | sso | A | ||||
rpsN-S14 | NP_417766 | COG0199 | PRK08881 | pfam00253 | PMB0207 | dso | D | |||||
rplE-L5 | NP_417767 | COG0094 | PRK00010 | pfam00673 | PMB0068 | B000025 | sso | B | ||||
rplX-L24 | NP_417768 | COG0198 | PRK00004 | pfam17136 | TIGR01079 | PMB0069 | B000040 | sso | B | |||
rplN-L14 | NP_417769 | COG0093 | PRK05483 | pfam00238 | TIGR01067 | PMB0070 | B000014 | sso | B | |||
rpsQ-S17 | NP_417770 | COG0186 | PRK05610 | pfam00366 | TIGR03635 | PMB0071 | B000036 | dso | B | |||
rpmC-L29 | NP_417771 | COG0255 | PRK00306 | pfam00831 | TIGR00012 | PMB0208 | B000027 | D | ||||
rplP-L16 | NP_417772 | COG0197 | PRK09203 | pfam00252 | TIGR01164 | PMB0016 | B000018 | sso | A | |||
rpsC-S3 | NP_417773 | COG0092 | PRK00310 | pfam00189 | TIGR01009 | PMB0017 | B000028 | sso | A | |||
rplV-L22 | NP_417774 | COG0091 | PRK00565 | pfam00237 | TIGR01044 | PMB0018 | B000007 | sso | A | |||
rpsS-S19 | NP_417775 | COG0185 | PRK00357 | pfam00203 | TIGR01050 | PMB0072 | B000016 | sso | B | |||
rplB-L2 | NP_417776 | COG0090 | PRK09374 | pfam03947 | TIGR01171 | PMB0019 | B000010 | sso | A | |||
rplW-L23 | NP_417777 | COG0089 | PRK05738 | pfam00276 | TIGR03636 | PMB0118 | B000022 | sso | C | |||
rplD-L4 | NP_417778 | COG0088 | PRK05319 | pfam00573 | TIGR03953 | PMB0020 | B000009 | sso | A | |||
rplC-L3 | NP_417779 | COG0087 | PRK00001 | pfam00297 | TIGR03625 | PMB0021 | B000012 | sso | A | |||
rpsJ-S10 | NP_417780 | COG0051 | PRK00596 | pfam00338 | TIGR01049 | PMB0119 | B000002 | sso | C | |||
rpsG-S7 | NP_417800 | COG0049 | PRK05302 | pfam00177 | TIGR01029 | PMB0022 | B000017 | sso | A | |||
rpsL-S12 | NP_417801 | COG0048 | PRK05163 | pfam00164 | TIGR00981 | PMB0073 | B000026 | swo | B | |||
trpS | NP_417843 | COG0180 | PRK00927 | pfam00579 | TIGR00233 | PMB0209 | dwo | D | ||||
ftsY | NP_417921 | COG0552 | PRK10416 | pfam00448 | TIGR00064 | PMB0074 | swo | B | ||||
rsmD | NP_417922 | COG0742 | PRK10909 | pfam03602 | TIGR00095 | PMB0120 | sso | C | ||||
glyS | NP_418016 | COG0751 | PRK01233 | pfam02092 | TIGR00211 | PMB0210 | B000097 | D | ||||
gpsA | NP_418065 | COG0240 | PRK00094 | pfam07479 | TIGR03376 | PMB0211 | dso | D | ||||
kdtB-coaD | NP_418091 | COG0669 | PRK00168 | pfam01467 | TIGR01510 | PMB0121 | swo | C | ||||
rpmB-L28 | NP_418094 | COG0227 | PRK00359 | pfam00830 | TIGR00009 | PMB0212 | dwo | D | ||||
gmk | NP_418105 | COG0194 | PRK00300 | pfam00625 | TIGR03263 | PMB0122 | swo | C | ||||
spoT-relA | NP_418107 | COG0317 | PRK11092 | pfam13328 | TIGR00691 | PMB0213 | dwo | D | ||||
recG | NP_418109 | COG1200 | PRK10917 | pfam00270 | TIGR00643 | PMB0214 | D | |||||
gyrB | YP_026241 | COG0187 | PRK14939 | pfam00204 | TIGR01059 | PMB0123 | dwo | C | ||||
recF | NP_418155 | COG1195 | PRK00064 | pfam02463 | TIGR00611 | PMB0215 | B000113 | D | ||||
dnaN | NP_418156 | COG0592 | PRK05643 | pfam02768 | TIGR00663 | PMB0124 | dwo | C | ||||
dnaA | NP_418157 | COG0593 | PRK00149 | pfam00308 | TIGR00362 | PMB0075 | B000084 | dwo | B | |||
yidC | NP_418161 | COG0706 | PRK01318 | pfam14849 | TIGR03593 | PMB0216 | dwo | D | ||||
thdF | NP_418162 | COG0486 | PRK05291 | pfam12631 | TIGR00450 | PMB0217 | swo | D | ||||
glmS | NP_418185 | COG0449 | PRK00331 | pfam01380 | TIGR01135 | PMB0218 | dwo | D | ||||
atpD | NP_418188 | COG0055 | PRK09280 | pfam00006 | TIGR01039 | PMB0125 | dso | C | ||||
atpG | NP_418189 | COG0224 | PRK05621 | pfam00231 | TIGR01146 | PMB0126 | dso | C | ||||
atpA | NP_418190 | COG0056 | PRK09281 | pfam00006 | TIGR00962 | PMB0219 | dso | D | ||||
atpH | NP_418191 | COG0712 | PRK05758 | pfam00213 | TIGR01145 | PMB0220 | dso | D | ||||
gidB-rsmG | NP_418196 | COG0357 | PRK00107 | pfam02527 | TIGR00138 | PMB0127 | dwo | C | ||||
gidA-mnmG | NP_418197 | COG0445 | PRK05192 | pfam01134 | TIGR00136 | PMB0128 | B000061 | swo | C | |||
hemC | YP_026260 | COG0181 | PRK00072 | pfam01379 | TIGR00212 | PMB0221 | B000035 | D | ||||
uvrD | NP_418258 | COG0210 | PRK11773 | pfam00580 | TIGR01075 | PMB0222 | dwo | D | ||||
polA | NP_418300 | COG0749 | PRK05755 | pfam00476 | TIGR00593 | PMB0076 | B000050 | fdwo | B | |||
hemN | NP_418303 | COG0635 | PRK09249 | pfam04055 | TIGR00538 | PMB0223 | sso | D | ||||
typA | YP_026274 | COG1217 | PRK10218 | pfam00009 | TIGR01394 | PMB0129 | B000111 | C | ||||
tpiA | NP_418354 | COG0149 | PRK00042 | pfam00121 | TIGR00419 | PMB0224 | dwo | D | ||||
priA | NP_418370 | COG1198 | PRK05580 | pfam00270 | TIGR00595 | PMB0130 | B000045 | swo | C | |||
murB | NP_418403 | COG0812 | PRK00046 | pfam02873 | TIGR00179 | PMB0131 | B000114 | dso | C | |||
secE | NP_418408 | COG0690 | PRK05740 | pfam00584 | TIGR00964 | PMB0225 | D | |||||
nusG | NP_418409 | COG0250 | PRK05609 | pfam02357 | TIGR00922 | PMB0132 | sso | C | ||||
rplK-L11 | NP_418410 | COG0080 | PRK00140 | pfam00298 | TIGR01632 | PMB0023 | B000024 | sso | A | |||
rplA-L1 | NP_418411 | COG0081 | PRK05424 | pfam00687 | TIGR01169 | PMB0024 | B000003 | sso | A | |||
rplJ-L10 | NP_418412 | COG0244 | PRK00099 | pfam00466 | PMB0077 | B000030 | sso | B | ||||
rplL-L7L12 | NP_418413 | COG0222 | PRK00157 | pfam00542 | TIGR00855 | PMB0133 | B000107 | swo | C | |||
rpoB | NP_418414 | COG0085 | PRK00405 | pfam00562 | TIGR02013 | PMB0025 | B000042 | sso | A | |||
rpoC | NP_418415 | COG0086 | PRK00566 | pfam04998 | TIGR02386 | PMB0078 | B000044 | swo | B | |||
hemE | NP_418425 | COG0407 | PRK00115 | pfam01208 | TIGR01464 | PMB0226 | B000100 | D | ||||
purD | NP_418433 | COG0151 | PRK00885 | pfam01071 | TIGR00877 | PMB0227 | swo | D | ||||
purH | NP_418434 | COG0138 | PRK00881 | pfam01808 | TIGR00355 | PMB0228 | dso | D | ||||
dnaB | NP_418476 | COG0305 | PRK08006 | pfam03796 | TIGR00665 | PMB0229 | dwo | D | ||||
uvrA | NP_418482 | COG0178 | PRK00349 | pfam00005 | TIGR00630 | PMB0230 | dwo | D | ||||
groES | NP_418566 | COG0234 | PRK00364 | pfam00166 | PMB0231 | dwo | D | |||||
efp | NP_418571 | COG0231 | PRK00529 | pfam09285 | TIGR00038 | PMB0232 | dwo | D | ||||
tsaE-yjeE | NP_418589 | COG0802 | PRK10646 | pfam02367 | TIGR00150 | PMB0233 | dso | D | ||||
mutL | NP_418591 | COG0323 | PRK00095 | pfam01119 | TIGR00585 | PMB0234 | B000064 | D | ||||
miaA | NP_418592 | COG0324 | PRK00091 | pfam01715 | TIGR00174 | PMB0134 | B000082 | swo | C | |||
purA | NP_418598 | COG0104 | PRK01117 | pfam00709 | TIGR00184 | PMB0235 | dwo | D | ||||
rlmB | NP_418601 | COG0566 | PRK11181 | pfam00588 | TIGR00186 | PMB0135 | swo | C | ||||
rpsF-S6 | NP_418621 | COG0360 | PRK00453 | pfam01250 | TIGR00166 | PMB0079 | B000051 | sso | B | |||
rpsR-S18 | NP_418623 | COG0238 | PRK00391 | pfam01084 | TIGR00165 | PMB0136 | B000075 | dso | C | |||
rplI-L9 | NP_418624 | COG0359 | PRK00137 | pfam03948 | TIGR00158 | PMB0080 | B000054 | sso | B | |||
pyrB | NP_418666 | COG0540 | PRK00856 | pfam02729 | TIGR00670 | PMB0236 | dso | D | ||||
valS | NP_418679 | COG0525 | PRK05729 | pfam00133 | TIGR00422 | PMB0137 | swo | C | ||||
sms-radA | NP_418806 | COG1066 | PRK11823 | pfam06745 | TIGR00416 | PMB0081 | B000060 | swo | B |
For each gene name GENE and each accession number ACCN available in the above table, the reference multiple amino acid sequence alignment (MSA) can be accessed via the following URL model:
http://giphy.pasteur.fr/PhyloM/bacteria/aln/GENE.ACCN.faa
and the associated position specific scoring matrix (PSSM) via the following URL model:
http://giphy.pasteur.fr/PhyloM/bacteria/smp/GENE.ACCN.smp
For example, the reference MSA COG0359 for the gene rplI-L9 can be downloaded using wget with the following Linux command line:
wget -q http://giphy.pasteur.fr/PhyloM/bacteria/aln/rplI-L9.COG0359.faa
The same download can also be performed using curl with the following command line:
curl --silent -O http://giphy.pasteur.fr/PhyloM/bacteria/aln/rplI-L9.COG0359.faa
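The two URL models can be wrapped in a small helper, sketched below. The function name phylom_url and the variable PHYLOM_BASE are hypothetical; only the URL patterns come from this page.

```shell
#!/usr/bin/env bash
# Build PhyloM download URLs from a gene name and an accession number,
# following the two URL models given above.
PHYLOM_BASE="http://giphy.pasteur.fr/PhyloM/bacteria"

# phylom_url GENE ACCN KIND   (KIND is "msa" or "pssm")
phylom_url() {
  local gene="$1" accn="$2" kind="$3"
  case "$kind" in
    msa)  printf '%s/aln/%s.%s.faa\n' "$PHYLOM_BASE" "$gene" "$accn" ;;
    pssm) printf '%s/smp/%s.%s.smp\n' "$PHYLOM_BASE" "$gene" "$accn" ;;
    *)    echo "unknown kind: $kind" >&2; return 1 ;;
  esac
}

# Print both URLs for one marker of the table.
phylom_url rplI-L9 COG0359 msa
phylom_url rplI-L9 COG0359 pssm
```

A file can then be fetched with, for instance, wget -q "$(phylom_url rplI-L9 COG0359 msa)".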
For each marker set MSET (i.e. sso, swo, dso, dwo, PhyloM.sso, PhyloM.swo, PhyloM.dso, PhyloM.dwo, uscg, PhyloM.uscg, phyeco, PhyloM.phyeco, bac120, PhyloM.bac120, PhyloM.A, PhyloM.B, PhyloM.C, PhyloM.D), three different files can be downloaded:
— a text file containing the gene name list,
— a tar.gz archive containing the reference MSAs,
— a tar.gz archive containing the associated PSSMs.
The gene name list associated with a marker set MSET can be accessed via the following URL model:
http://giphy.pasteur.fr/PhyloM/bacteria/cat/MSET.txt
The MSAs associated with a marker set MSET can be accessed via the following URL model:
http://giphy.pasteur.fr/PhyloM/bacteria/cat/MSET.aln.tar.gz
The PSSMs associated with a marker set MSET can be accessed via the following URL model:
http://giphy.pasteur.fr/PhyloM/bacteria/cat/MSET.smp.tar.gz
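The three URL models for a marker set can be generated in one step, as in the sketch below. The function name phylom_set_urls and the variable PHYLOM_CAT are hypothetical; the URL patterns and set names are those listed above.

```shell
#!/usr/bin/env bash
# Print the three download URLs (gene list, MSA archive, PSSM archive)
# of a marker set, following the URL models given above.
PHYLOM_CAT="http://giphy.pasteur.fr/PhyloM/bacteria/cat"

# phylom_set_urls MSET
phylom_set_urls() {
  local mset="$1"
  printf '%s/%s.txt\n'        "$PHYLOM_CAT" "$mset"
  printf '%s/%s.aln.tar.gz\n' "$PHYLOM_CAT" "$mset"
  printf '%s/%s.smp.tar.gz\n' "$PHYLOM_CAT" "$mset"
}

# List the files of categories A and B.
for mset in PhyloM.A PhyloM.B; do
  phylom_set_urls "$mset"
done
```

The printed URLs can be passed to a downloader, for instance phylom_set_urls PhyloM.A | xargs -n 1 curl --silent -O, and the archives unpacked with tar -xzf. The internal layout of the archives is not described on this page, so it should be inspected (tar -tzf) before scripting against it.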
Each of the MSA files can be used as a query for performing a psiblast search using the BLAST+ tools (Camacho et al. 2009).
Let cds.faa be a FASTA-formatted amino acid sequence file (e.g. every CDS from a bacterial genome).
This databank should first be formatted with the following Linux command line:
makeblastdb -in cds.faa -dbtype prot
Next, an MSA file msa.faa can be directly used as a query for performing a BLAST search with the following Linux command line model:
psiblast -in_msa msa.faa -db cds.faa -seg no -word_size 2 -evalue 0.05 -xdrop_gap_final 1000
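When many markers are searched at once, the command model above can be applied to every MSA file of a directory. The sketch below is one possible way to do this: the function name search_msa_dir, the PSIBLAST variable and the output file naming are assumptions, and -outfmt 6 is added here only to obtain tabular output.

```shell
#!/usr/bin/env bash
# Run psiblast with every MSA file (*.faa) of a directory against a databank
# already formatted with makeblastdb.
# PSIBLAST may be set to another executable (or a stub) before calling.
PSIBLAST="${PSIBLAST:-psiblast}"

# search_msa_dir DIR DB: writes one tabular result file per MSA, next to it.
search_msa_dir() {
  local dir="$1" db="$2" msa
  for msa in "$dir"/*.faa; do
    [ -e "$msa" ] || continue        # skip when the glob matches nothing
    "$PSIBLAST" -in_msa "$msa" -db "$db" -seg no -word_size 2 \
      -evalue 0.05 -xdrop_gap_final 1000 -outfmt 6 \
      > "${msa%.faa}.psiblast.tsv"
  done
}

# Example (hypothetical directory holding the extracted MSA files):
#   search_msa_dir ./msa cds.faa
```

Each result file holds the hits of one marker, so markers without any hit show up as empty files.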
Each of the PSSM files can be used as a query for performing a tblastn search using the BLAST+ tools (Camacho et al. 2009).
Let seq.fna be a FASTA-formatted nucleotide sequence file (e.g. de novo assembly of a bacterial genome).
This databank should first be formatted with the following Linux command line:
makeblastdb -in seq.fna -dbtype nucl
Next, a PSSM file pssm.smp can be directly used as a query for performing a BLAST search with the following Linux command line model:
tblastn -in_pssm pssm.smp -db seq.fna -seg no -word_size 2 -evalue 0.05 -xdrop_gap_final 1000
Of note, the corresponding full CDS can be easily extracted by using the program eFASTA along with fields 2, 9 and 10 (subject sequence identifier, subject start and subject end) output by the tblastn option -outfmt 6.
The tool eCDS can also be used to easily extract the full CDS associated to each tblastn hit.
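Fields 2, 9 and 10 of the default -outfmt 6 output can be pulled out with awk, as in the sketch below. It only reports the hit region on the subject sequence (with the start and end swapped for reverse-strand hits) and does not reproduce what eFASTA or eCDS do; the function name hit_coords is hypothetical.

```shell
#!/usr/bin/env bash
# hit_coords: read tblastn -outfmt 6 lines on stdin and print, tab-separated,
# the subject identifier, leftmost position, rightmost position and strand.
# In the default tabular format, field 2 is the subject identifier and
# fields 9 and 10 are the subject start and end; start > end means a hit
# on the reverse strand.
hit_coords() {
  awk -F '\t' 'BEGIN { OFS = "\t" }
    {
      if (($9 + 0) <= ($10 + 0)) print $2, $9, $10, "+";
      else                       print $2, $10, $9, "-";
    }'
}

# Example with one made-up tabular line (reverse-strand hit on contig_7).
printf 'rplI-L9\tcontig_7\t85.2\t149\t22\t0\t1\t149\t5210\t4764\t1e-80\t250\n' | hit_coords
```

In practice the function would be used as tblastn ... -outfmt 6 | hit_coords, with the same tblastn options as in the command model above.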
Bratlie MS, Johansen J, Drablos F (2010) Relationship between operon preference and functional properties of persistent genes in bacterial genomes. BMC Genomics, 11:71. doi:10.1186/1471-2164-11-71
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10:421. doi:10.1186/1471-2105-10-421
Chaumeil P-A, Mussig AJ, Hugenholtz P, Parks DH (2020) GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics, 36(6):1925-1927. doi:10.1093/bioinformatics/btz848
Creevey CJ, Doerks T, Fitzpatrick DA, Raes J, Bork P (2011) Universally distributed single-copy genes indicate a constant rate of horizontal transfer. PLoS ONE, 6(8):e22099. doi:10.1371/journal.pone.0022099
Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Research, 44:D279-285. doi:10.1093/nar/gkv1344
Galperin MY, Makarova KS, Wolf YI, Koonin EV (2015) Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Research, 43:D261-269. doi:10.1093/nar/gku1223
Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT, White O (2001) TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Research, 29(1):41-43. doi:10.1093/nar/29.1.41
Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil P-A, Hugenholtz P (2018) A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nature Biotechnology, 36:996-1004. doi:10.1038/nbt.4229
Parks DH, Rinke C, Chuvochina M, Chaumeil P-A, Woodcroft BJ, Evans PN, Hugenholtz P, Tyson GW (2017) Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nature Microbiology, 2:1533-1542. doi:10.1038/s41564-017-0012-7
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4:41. doi:10.1186/1471-2105-4-41
Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on protein families. Science, 278(5338):631-637. doi:10.1126/science.278.5338.631
Wu D, Jospin G, Eisen JA (2013) Systematic identification of gene families for use as "markers" for phylogenetic and phylogeny-driven ecological studies of bacteria and archaea and their major subgroups. PLoS ONE, 8(10):e77033. doi:10.1371/journal.pone.0077033