PhyloM: bacteria is a selection of markers that are well suited to phylogenetic tree inference of bacterial taxa. These markers are recommended for phylogenetic reconstruction because they have been shown to correspond to persistent genes within bacterial phyla (i.e. with a close to universal distribution). This gene selection mainly relies on the independent and complementary works of Bratlie et al. (2010), Creevey et al. (2011) and Wu et al. (2013). Bratlie et al. (2010) classified 213 genes with respect to operon participation (i.e. strong or weak operon genes) and involvement in duplication events (i.e. singleton or duplicate ortholog sequence sets). Creevey et al. (2011) compiled 40 universal single-copy marker genes (bacteria, archaea and eukaryotes), i.e. genes that are putatively essential and likely little affected by duplication events. Wu et al. (2013) selected 114 genes associated with an informative phylogenetic signal and low variation in copy number across taxa.
The 114 PhyEco marker candidates selected by Wu et al. (2013) are labelled as PhyEco.
The 40 genes selected by Creevey et al. (2011) are labelled as uscg (universal single copy gene).
Finally, the loci selected by Bratlie et al. (2010) are labelled as pgoc (persistent gene operon category) or, following the authors' classification, as:
— sso (singleton strong operon): rarely or never involved in duplication events (singleton) and likely rarely involved in horizontal transfer events (strong operon),
— swo (singleton weak operon): rarely or never involved in duplication events (singleton), but possibly involved in horizontal transfer events (weak operon),
— dso (duplicate strong operon): likely involved in duplication events (duplicate), but likely rarely involved in horizontal transfer events (strong operon),
— dwo (duplicate weak operon): likely involved in duplication events (duplicate) and possibly also involved in horizontal transfer events (weak operon).
For more details about the definition of singleton, duplicate, strong and weak operon proteins, respectively, see the Introduction subsection in Bratlie et al. (2010).
The 226 phylogenetic markers overall are classified into 16 categories (from A to P) according to their membership in pgoc, uscg and/or PhyEco. These categories are described below. For each of them, the associated gene name list is available, as well as the reference multiple amino acid sequence alignments and their associated position-specific scoring matrices (PSSM) gathered from the COG database (Tatusov et al. 1997, 2003; Galperin 2015), the NCBI protein cluster collection (PRK), the Pfam database (Finn et al. 2016), the TIGRFAMs database (Haft et al. 2001), and the PhyEco repository (Wu et al. 2013).
Categories A and B (52 markers identified by at least two studies) are highly recommended for phylogenetic inference, as very little paralogy or xenology is expected.
category | pgoc | uscg | PhyEco | no. | ||||||
A | sso | 24 | COG | PRK | Pfam | TIGRFAMs | PhyEco | |||
B | sso | 28 | COG | PRK | Pfam | TIGRFAMs | PhyEco |
Categories C to F (29 markers identified by at least two studies) are recommended for phylogenetic inference. Very little paralogy is expected, but xenology may sometimes be observed.
category | pgoc | uscg | PhyEco | no. | ||||||
C | swo | 6 | COG | PRK | Pfam | TIGRFAMs | PhyEco | |||
D | swo | 3 | COG | PRK | Pfam | TIGRFAMs | PhyEco | |||
E | swo | 19 | COG | PRK | Pfam | TIGRFAMs | PhyEco | |||
F | 1 | COG | PRK | Pfam | TIGRFAMs | PhyEco |
Categories G to I (49 markers, each specific to one of two studies) can be used for phylogenetic inference. Little paralogy is expected, but xenology may sometimes be observed. Deletion may also be observed within some phyla.
category | pgoc | uscg | PhyEco | no. | ||||||
G | sso | 16 | COG | PRK | Pfam | TIGRFAMs | PhyEco | |||
H | 20 | COG | PRK | Pfam | TIGRFAMs | PhyEco | ||||
I | swo | 13 | COG | PRK | Pfam | TIGRFAMs | PhyEco |
Categories J to L (10 markers identified by at least two studies) are suggested for phylogenetic inference. Paralogy may be observed, but little xenology is expected.
category | pgoc | uscg | PhyEco | no. | ||||||
J | dso | 3 | COG | PRK | Pfam | TIGRFAMs | PhyEco | |||
K | dso | 6 | COG | PRK | Pfam | TIGRFAMs | PhyEco | |||
L | dso | 1 | COG | PRK | Pfam | TIGRFAMs | PhyEco |
Categories M and N (9 markers identified by at least two studies) are not particularly recommended for phylogenetic inference, as both paralogy and xenology may be observed.
category | pgoc | uscg | PhyEco | no. | ||||||
M | dwo | 2 | COG | PRK | Pfam | TIGRFAMs | PhyEco | |||
N | dwo | 7 | COG | PRK | Pfam | TIGRFAMs | PhyEco |
Categories O and P (77 markers identified by only one study) are not recommended for phylogenetic inference, as paralogy, xenology or deletion may be observed.
category | pgoc | uscg | PhyEco | no. | ||||||
O | dso | 30 | COG | PRK | Pfam | TIGRFAMs | PhyEco | |||
P | dwo | 47 | COG | PRK | Pfam | TIGRFAMs | PhyEco |
The 226 genes are listed in the table below. They are sorted according to their order in the reference genome of Escherichia coli strain K12 substrain MG1655 (GenBank accession: U00096). Common gene names are given in the column name, with a link to the UniProt description. When a gene is organized within a well-documented operon, a link to the corresponding RegulonDB description is available in the column E. coli operon. The column E. coli CDS lists the NCBI accession numbers of the Escherichia coli strain K12 substrain MG1655 CDS, with links to the corresponding entries in the NCBI Conserved Domain Database. For each gene, the corresponding accession number (if available) is given for the COG database (Tatusov et al. 1997, 2003; Galperin 2015), the NCBI protein cluster collection (PRK), the Pfam database (Finn et al. 2016), the TIGRFAMs database (Haft et al. 2001), and the PhyEco repository (Wu et al. 2013).
name | E. coli operon | E. coli CDS | COG | PRK | Pfam | TIGRFAMs | PhyEco | pgoc | uscg | category |
dnaK | NP_414555 | COG0443 | PRK00290 | pfam00012 | TIGR02350 | dwo | P | |||
rpsT-S20 | NP_414564 | COG0268 | PRK00239 | pfam01649 | TIGR00029 | B000077 | swo | E | ||
ribF | NP_414566 | COG0196 | PRK05627 | pfam06574 | TIGR00083 | B000096 | dwo | N | ||
ileS | NP_414567 | COG0060 | PRK05743 | pfam00133 | TIGR00392 | dwo | P | |||
lspA | NP_414568 | COG0597 | PRK00376 | pfam01252 | TIGR00077 | dso | O | |||
carA | NP_414573 | COG0505 | PRK12564 | pfam00988 | TIGR01368 | swo | I | |||
carB | NP_414574 | COG0458 | PRK05294 | pfam02786 | TIGR01369 | dwo | P | |||
rsmA-ksgA | NP_414593 | COG0030 | PRK00274 | pfam00398 | TIGR00755 | sso | G | |||
rsmH-mraW | NP_414624 | COG0275 | PRK00050 | pfam01795 | TIGR00006 | B000067 | sso | B | ||
ftsI | NP_414626 | COG0768 | PRK15105 | pfam00905 | TIGR02214 | dso | O | |||
murE | NP_414627 | COG0769 | PRK00139 | pfam08245 | TIGR01085 | B000105 | dso | K | ||
murF | NP_414628 | COG0770 | PRK10773 | pfam08245 | TIGR01143 | dso | O | |||
mraY | NP_414629 | COG0472 | PRK00108 | pfam00953 | TIGR00445 | sso | G | |||
murD | NP_414630 | COG0771 | PRK03806 | pfam08245 | TIGR01087 | B000068 | dso | K | ||
ftsW | NP_414631 | COG0772 | PRK10774 | pfam01098 | TIGR02614 | dso | O | |||
murG | NP_414632 | COG0707 | PRK00726 | pfam03033 | TIGR01133 | B000066 | dso | K | ||
murC | NP_414633 | COG0773 | PRK00421 | pfam08245 | TIGR01082 | dso | O | |||
ftsA | NP_414636 | COG0849 | PRK09472 | pfam14450 | TIGR01174 | B000110 | H | |||
ftsZ | NP_414637 | COG0206 | PRK09330 | pfam12327 | TIGR00065 | dwo | P | |||
secA | NP_414640 | COG0653 | PRK12904 | pfam07517 | TIGR00963 | dwo | P | |||
coaE | NP_414645 | COG0237 | PRK00081 | pfam01121 | TIGR00152 | B000085 | sso | B | ||
map | NP_414710 | COG0024 | PRK05716 | pfam00557 | TIGR00500 | dwo | P | |||
rpsB-S2 | NP_414711 | COG0052 | PRK05299 | pfam00318 | TIGR01011 | B000001 | sso | A | ||
tsf | NP_414712 | COG0264 | PRK09377 | pfam00889 | TIGR00116 | B000056 | sso | B | ||
pyrH | NP_414713 | COG0528 | PRK00358 | pfam00696 | TIGR02075 | sso | G | |||
frr | NP_414714 | COG0233 | PRK00083 | pfam01765 | TIGR00496 | B000079 | sso | B | ||
dxr | NP_414715 | COG0743 | PRK05447 | pfam02670 | TIGR00243 | B000088 | H | |||
uppS | NP_414716 | COG0020 | PRK10240 | pfam01255 | TIGR00055 | dso | O | |||
rseP | NP_414718 | COG0750 | PRK10779 | pfam02163 | TIGR00054 | dso | O | |||
rnhB | NP_414725 | COG0164 | PRK00015 | pfam01351 | TIGR00729 | B000039 | H | |||
dnaE | NP_414726 | COG0587 | PRK05673 | pfam07733 | TIGR00594 | dwo | P | |||
tilS-mesJ | NP_414730 | COG0037 | PRK10660 | pfam01171 | TIGR02432 | B000091 | sso | B | ||
tgt | NP_414940 | COG0343 | PRK00112 | pfam01702 | TIGR00430 | dwo | P | |||
nrdR | NP_414947 | COG1327 | PRK00464 | pfam03477 | TIGR00244 | B000087 | H | |||
nusB | NP_414950 | COG0781 | PRK00202 | pfam01029 | TIGR01951 | sso | G | |||
tig | NP_414970 | COG0544 | PRK01490 | pfam05697 | TIGR00115 | swo | I | |||
clpX | NP_414972 | COG1219 | PRK05342 | pfam07724 | TIGR00382 | B000112 | swo | E | ||
dnaX | NP_415003 | COG2812 | PRK07994 | pfam12170 | TIGR02397 | sso | G | |||
recR | NP_415005 | COG0353 | PRK00076 | pfam13662 | TIGR00615 | B000080 | sso | B | ||
purE | NP_415056 | COG0041 | PLN02948 | pfam00731 | TIGR01162 | dso | O | |||
cysS | NP_415059 | COG0215 | PRK00260 | pfam01406 | TIGR00435 | swo | D | |||
rsfS | NP_415170 | COG0799 | PRK11538 | pfam02410 | TIGR00090 | B000062 | H | |||
leuS | NP_415175 | COG0495 | PRK00390 | pfam13603 | TIGR00396 | B000108 | swo | C | ||
ybeY | NP_415192 | COG0319 | PRK00016 | pfam02130 | TIGR00043 | B000081 | sso | B | ||
uvrB | NP_415300 | COG0556 | PRK05298 | pfam12344 | TIGR00631 | swo | I | |||
infA | NP_415404 | COG0361 | PRK00276 | pfam01176 | TIGR00008 | dwo | P | |||
ftsK | NP_415410 | COG1674 | PRK10263 | pfam01580 | TIGR03928 | dwo | P | |||
serS | NP_415413 | COG0172 | PRK05431 | pfam00587 | TIGR00414 | B000073 | F | |||
aroA | NP_415428 | COG0128 | PRK11860 | pfam00275 | TIGR01356 | dso | O | |||
rpsA-S1 | NP_415431 | COG0539 | PRK06299 | pfam00575 | TIGR00717 | swo | I | |||
rpmF-L32 | NP_415607 | COG0333 | PRK01110 | pfam01783 | TIGR01031 | B000106 | H | |||
plsX | NP_415608 | COG0416 | PRK05331 | pfam02504 | TIGR00182 | B000090 | H | |||
fabD | NP_415610 | COG0331 | PLN02752 | pfam00698 | TIGR00128 | dso | O | |||
fabG | NP_415611 | COG1028 | PRK05557 | pfam13561 | TIGR01830 | dso | O | |||
acpP | NP_415612 | COG0236 | PRK00982 | pfam00550 | TIGR00517 | dwo | P | |||
fabF | NP_415613 | COG0304 | PRK07314 | pfam00109 | TIGR03150 | dso | O | |||
tmk | NP_415616 | COG0125 | PRK00698 | pfam02223 | TIGR00041 | sso | G | |||
holB | NP_415617 | COG0470 | PRK07993 | pfam09115 | TIGR00678 | sso | G | |||
ycfH | NP_415618 | COG0084 | PRK10812 | pfam01026 | TIGR00010 | dso | O | |||
mfd | NP_415632 | COG1197 | PRK10689 | pfam03461 | TIGR00580 | B000046 | swo | E | ||
mnmA-trmU | NP_415651 | COG0482 | PRK00143 | pfam03054 | TIGR00420 | B000098 | dwo | N | ||
ychF | NP_415721 | COG0012 | PRK09601 | pfam06071 | TIGR00092 | B000083 | swo | C | ||
pth | NP_415722 | COG0193 | PRK05426 | pfam01195 | TIGR00447 | B000103 | dso | K | ||
prsA | NP_415725 | COG0462 | PRK01259 | pfam13793 | TIGR01251 | dwo | P | |||
ispE | NP_415726 | COG1947 | PRK00343 | pfam08544 | TIGR00154 | sso | G | |||
prfA | NP_415729 | COG0216 | PRK00591 | pfam03462 | TIGR00019 | B000053 | sso | B | ||
pemK-prmC | NP_415730 | COG2890 | PRK09328 | pfam05175 | TIGR00536 | dso | O | |||
nth | NP_416150 | COG0177 | PRK10702 | pfam00730 | TIGR01083 | dso | O | |||
pheT | NP_416228 | COG0072 | PRK00629 | pfam03483 | TIGR00472 | B000013 | sso | B | ||
pheS | NP_416229 | COG0016 | PRK00488 | pfam01409 | TIGR00468 | B000020 | sso | A | ||
rplT-L20 | NP_416231 | COG0292 | PRK05185 | pfam00453 | TIGR01032 | B000070 | swo | E | ||
rpmI-L35 | NP_416232 | COG0291 | PRK00172 | pfam01632 | TIGR00001 | B000099 | swo | E | ||
infC | NP_416233 | COG0290 | PRK00028 | pfam00707 | TIGR00168 | B000104 | swo | E | ||
thrS | NP_416234 | COG0441 | PRK00413 | pfam00587 | TIGR00418 | dwo | P | |||
tsaB-yeaZ | NP_416321 | COG1214 | PRK09604 | pfam00814 | TIGR03725 | sso | G | |||
ruvB | NP_416374 | COG2255 | PRK00080 | pfam05496 | TIGR00635 | B000072 | sso | B | ||
ruvA | NP_416375 | COG0632 | PRK00116 | pfam01330 | TIGR00084 | B000071 | sso | B | ||
ruvC | NP_416377 | COG0817 | PRK00039 | pfam02075 | TIGR00228 | B000093 | H | |||
yebC | NP_416378 | COG0217 | PRK00110 | pfam01709 | TIGR01033 | dwo | P | |||
aspS | NP_416380 | COG0173 | PRK00476 | pfam00152 | TIGR00459 | dwo | P | |||
argS | NP_416390 | COG0018 | PRK01611 | pfam00750 | TIGR00456 | dwo | M | |||
pgsA | NP_416422 | COG0558 | PRK10832 | pfam01066 | TIGR00560 | swo | I | |||
uvrC | NP_416423 | COG0322 | PRK00558 | pfam08459 | TIGR00194 | swo | I | |||
rplY-L25 | NP_416690 | COG1825 | PRK05943 | pfam01386 | TIGR00731 | B000059 | H | |||
trxA | NP_416699 | COG0526 | PRK15412 | pfam08534 | TIGR00385 | dwo | P | |||
gyrA | NP_416734 | COG0188 | PRK05560 | pfam00521 | TIGR01063 | dwo | P | |||
purF | NP_416815 | COG0034 | PRK09246 | pfam13537 | TIGR01134 | dwo | P | |||
folC | NP_416818 | COG0285 | PRK10846 | pfam08245 | TIGR01499 | dso | O | |||
truA | NP_416821 | COG0101 | PRK00021 | pfam01416 | TIGR00071 | dso | O | |||
aroC | NP_416832 | COG0082 | PRK05382 | pfam01264 | TIGR00033 | sso | G | |||
gltX | NP_416899 | COG0008 | PRK01406 | pfam00749 | TIGR00464 | dwo | P | |||
ligA | NP_416906 | COG0272 | PRK07956 | pfam01653 | TIGR00575 | B000109 | dwo | N | ||
purM | NP_416994 | COG0150 | PRK05385 | pfam02769 | TIGR00878 | B000038 | sso | B | ||
der-engA | NP_417006 | COG1160 | PRK00093 | pfam01926 | TIGR03594 | B000043 | swo | E | ||
hisS | NP_417009 | COG0124 | PRK00037 | pfam13393 | TIGR00442 | dso | L | |||
era | NP_417061 | COG1159 | PRK00089 | pfam01926 | TIGR00436 | sso | G | |||
rnc | NP_417062 | COG0571 | PRK00102 | pfam14622 | TIGR02191 | dso | O | |||
lepA | NP_417064 | COG0481 | PRK05433 | pfam00009 | TIGR01393 | B000004 | dwo | N | ||
clpB | NP_417083 | COG0542 | PRK10865 | pfam07724 | TIGR03346 | dwo | P | |||
rluD-sfhB | NP_417085 | COG0564 | PRK11180 | pfam00849 | TIGR00005 | swo | I | |||
rplS-L19 | NP_417097 | COG0335 | PRK05338 | pfam01245 | TIGR01024 | B000069 | swo | E | ||
trmD | NP_417098 | COG0336 | PRK00026 | pfam01746 | TIGR00088 | B000058 | sso | B | ||
rimM | NP_417099 | COG0806 | PRK00122 | pfam01782 | TIGR02273 | B000086 | sso | B | ||
rpsP-S16 | NP_417100 | COG0228 | PRK00040 | pfam00886 | TIGR00002 | B000094 | sso | B | ||
ffh | NP_417101 | COG0541 | PRK10867 | pfam00448 | TIGR00959 | B000008 | swo | C | ||
grpE | NP_417104 | COG0576 | PRK10325 | pfam01025 | dwo | P | ||||
nadK-ppnK | NP_417105 | COG0061 | PRK03378 | pfam01513 | dwo | P | ||||
recN | YP_026172 | COG0497 | PRK10869 | pfam13476 | TIGR00634 | B000078 | H | |||
smpB | NP_417110 | COG0691 | PRK05422 | pfam01668 | TIGR00086 | B000065 | swo | E | ||
alaS | NP_417177 | COG0013 | PRK00252 | pfam01411 | TIGR00344 | swo | I | |||
recA | NP_417179 | COG0468 | PRK09354 | pfam00154 | TIGR02012 | B000095 | dwo | N | ||
mutS | NP_417213 | COG0249 | PRK05399 | pfam00488 | TIGR01070 | B000076 | H | |||
ispF | NP_417226 | COG0245 | PRK00084 | pfam02542 | TIGR00151 | B000089 | H | |||
eno | NP_417259 | COG0148 | PRK00077 | pfam00113 | TIGR01060 | dwo | P | |||
pyrG | NP_417260 | COG0504 | PRK05380 | pfam06418 | TIGR00337 | B000047 | swo | E | ||
metK | NP_417417 | COG0192 | PRK05250 | pfam02773 | TIGR01034 | dwo | P | |||
rsmE | NP_417421 | COG1385 | PRK11713 | pfam04452 | TIGR00046 | dwo | P | |||
yqgF | NP_417424 | COG0816 | PRK00109 | pfam03652 | TIGR00250 | sso | G | |||
rdgB | NP_417429 | COG0127 | PRK00120 | pfam01725 | TIGR00042 | dso | O | |||
tsaD | NP_417536 | COG0533 | PRK09604 | pfam00814 | TIGR03723 | B000006 | swo | C | ||
dnaG | NP_417538 | COG0358 | PRK05667 | pfam08275 | TIGR01391 | B000092 | swo | E | ||
rpoD | NP_417539 | COG0568 | PRK05658 | pfam04546 | TIGR02393 | dwo | P | |||
yraL | NP_417615 | COG0313 | PRK14994 | pfam00590 | TIGR00096 | dwo | P | |||
pnp | NP_417633 | COG1185 | PRK11824 | pfam01138 | TIGR03591 | B000055 | swo | E | ||
rpsO-S15 | NP_417634 | COG0184 | PRK05626 | pfam00312 | TIGR00952 | B000034 | swo | C | ||
truB | NP_417635 | COG0130 | PRK05033 | pfam01509 | TIGR00431 | B000032 | sso | B | ||
rbfA | NP_417636 | COG0858 | PRK00521 | pfam02033 | TIGR00082 | B000063 | sso | B | ||
infB | NP_417637 | COG0532 | PRK05306 | pfam11987 | TIGR00487 | B000005 | sso | B | ||
nusA | NP_417638 | COG0195 | PRK09202 | pfam08529 | TIGR01953 | B000041 | sso | B | ||
folP | NP_417644 | COG0294 | PRK11613 | pfam00809 | TIGR01496 | dso | O | |||
hflB-ftsH | NP_417645 | COG0465 | PRK10733 | pfam01434 | TIGR01241 | dwo | P | |||
greA | NP_417648 | COG0782 | PRK00226 | pfam03449 | TIGR01462 | dwo | P | |||
obg | NP_417650 | COG0536 | PRK12298 | pfam01018 | TIGR02729 | B000049 | swo | E | ||
rpmA-L27 | NP_417652 | COG0211 | PRK05435 | pfam01016 | TIGR00062 | B000102 | sso | B | ||
rplU-L21 | NP_417653 | COG0261 | PRK05573 | pfam00829 | TIGR00061 | B000074 | sso | B | ||
ispB | NP_417654 | COG0142 | PRK10888 | pfam00348 | TIGR02749 | dwo | P | |||
murA | NP_417656 | COG0766 | PRK09369 | pfam00275 | TIGR01072 | dwo | P | |||
rpsI-S9 | NP_417697 | COG0103 | PRK00132 | pfam00380 | TIGR03627 | B000011 | sso | A | ||
rplM-L13 | NP_417698 | COG0102 | PRK09216 | pfam00572 | TIGR01066 | B000037 | sso | A | ||
smf | YP_026211 | COG0758 | PRK10736 | pfam02481 | TIGR00732 | dwo | P | |||
mreC | NP_417716 | COG1792 | PRK13922 | pfam04085 | TIGR00219 | B000101 | H | |||
def | NP_417745 | COG0242 | PRK00150 | pfam01327 | TIGR00079 | dso | O | |||
fmt | NP_417746 | COG0223 | PRK00005 | pfam00551 | TIGR00460 | sso | G | |||
rplQ-L17 | NP_417753 | COG0203 | PRK05591 | pfam01196 | TIGR00059 | B000057 | sso | B | ||
rpoA | NP_417754 | COG0202 | PRK05182 | pfam01193 | TIGR02027 | B000052 | dso | J | ||
rpsD-S4 | NP_417755 | COG0522 | PRK05327 | pfam00163 | TIGR01017 | dwo | M | |||
rpsK-S11 | NP_417756 | COG0100 | PRK05309 | pfam00411 | TIGR03632 | B000029 | sso | A | ||
rpsM-S13 | NP_417757 | COG0099 | PRK05179 | pfam00416 | TIGR03631 | B000019 | sso | A | ||
prlA-secY | NP_417759 | COG0201 | PRK09204 | pfam00344 | TIGR00967 | B000048 | dso | J | ||
rplO-L15 | NP_417760 | COG0200 | PRK05592 | pfam00828 | TIGR01071 | B000021 | sso | A | ||
rpsE-S5 | NP_417762 | COG0098 | PRK00550 | pfam03719 | TIGR01021 | B000015 | sso | A | ||
rplR-L18 | NP_417763 | COG0256 | PRK05593 | pfam00861 | TIGR00060 | B000033 | sso | A | ||
rplF-L6 | NP_417764 | COG0097 | PRK05498 | pfam00347 | TIGR03654 | B000023 | sso | A | ||
rpsH-S8 | NP_417765 | COG0096 | PRK00136 | pfam00410 | B000031 | sso | A | |||
rpsN-S14 | NP_417766 | COG0199 | PRK08881 | pfam00253 | dso | O | ||||
rplE-L5 | NP_417767 | COG0094 | PRK00010 | pfam00673 | B000025 | sso | A | |||
rplX-L24 | NP_417768 | COG0198 | PRK00004 | pfam17136 | TIGR01079 | B000040 | sso | B | ||
rplN-L14 | NP_417769 | COG0093 | PRK05483 | pfam00238 | TIGR01067 | B000014 | sso | A | ||
rpsQ-S17 | NP_417770 | COG0186 | PRK05610 | pfam00366 | TIGR03635 | B000036 | dso | J | ||
rpmC-L29 | NP_417771 | COG0255 | PRK00306 | pfam00831 | TIGR00012 | B000027 | H | |||
rplP-L16 | NP_417772 | COG0197 | PRK09203 | pfam00252 | TIGR01164 | B000018 | sso | A | ||
rpsC-S3 | NP_417773 | COG0092 | PRK00310 | pfam00189 | TIGR01009 | B000028 | sso | A | ||
rplV-L22 | NP_417774 | COG0091 | PRK00565 | pfam00237 | TIGR01044 | B000007 | sso | A | ||
rpsS-S19 | NP_417775 | COG0185 | PRK00357 | pfam00203 | TIGR01050 | B000016 | sso | A | ||
rplB-L2 | NP_417776 | COG0090 | PRK09374 | pfam03947 | TIGR01171 | B000010 | sso | A | ||
rplW-L23 | NP_417777 | COG0089 | PRK05738 | pfam00276 | TIGR03636 | B000022 | sso | B | ||
rplD-L4 | NP_417778 | COG0088 | PRK05319 | pfam00573 | TIGR03953 | B000009 | sso | A | ||
rplC-L3 | NP_417779 | COG0087 | PRK00001 | pfam00297 | TIGR03625 | B000012 | sso | A | ||
rpsJ-S10 | NP_417780 | COG0051 | PRK00596 | pfam00338 | TIGR01049 | B000002 | sso | B | ||
rpsG-S7 | NP_417800 | COG0049 | PRK05302 | pfam00177 | TIGR01029 | B000017 | sso | A | ||
rpsL-S12 | NP_417801 | COG0048 | PRK05163 | pfam00164 | TIGR00981 | B000026 | swo | C | ||
trpS | NP_417843 | COG0180 | PRK00927 | pfam00579 | TIGR00233 | dwo | P | |||
ftsY | NP_417921 | COG0552 | PRK10416 | pfam00448 | TIGR00064 | swo | D | |||
rsmD | NP_417922 | COG0742 | PRK10909 | pfam03602 | TIGR00095 | sso | G | |||
glyS | NP_418016 | COG0751 | PRK01233 | pfam02092 | TIGR00211 | B000097 | H | |||
gpsA | NP_418065 | COG0240 | PRK00094 | pfam07479 | TIGR03376 | dso | O | |||
kdtB-coaD | NP_418091 | COG0669 | PRK00168 | pfam01467 | TIGR01510 | swo | I | |||
rpmB-L28 | NP_418094 | COG0227 | PRK00359 | pfam00830 | TIGR00009 | dwo | P | |||
gmk | NP_418105 | COG0194 | PRK00300 | pfam00625 | TIGR03263 | swo | I | |||
spoT-relA | NP_418107 | COG0317 | PRK11092 | pfam13328 | TIGR00691 | dwo | P | |||
gyrB | YP_026241 | COG0187 | PRK14939 | pfam00204 | TIGR01059 | dwo | P | |||
recF | NP_418155 | COG1195 | PRK00064 | pfam02463 | TIGR00611 | B000113 | H | |||
dnaN | NP_418156 | COG0592 | PRK05643 | pfam02768 | TIGR00663 | dwo | P | |||
dnaA | NP_418157 | COG0593 | PRK00149 | pfam00308 | TIGR00362 | B000084 | dwo | N | ||
yidC | NP_418161 | COG0706 | PRK01318 | pfam14849 | TIGR03593 | dwo | P | |||
thdF | NP_418162 | COG0486 | PRK05291 | pfam12631 | TIGR00450 | swo | I | |||
glmS | NP_418185 | COG0449 | PRK00331 | pfam01380 | TIGR01135 | dwo | P | |||
atpD | NP_418188 | COG0055 | PRK09280 | pfam00006 | TIGR01039 | dso | O | |||
atpG | NP_418189 | COG0224 | PRK05621 | pfam00231 | TIGR01146 | dso | O | |||
atpA | NP_418190 | COG0056 | PRK09281 | pfam00006 | TIGR00962 | dso | O | |||
atpH | NP_418191 | COG0712 | PRK05758 | pfam00213 | TIGR01145 | dso | O | |||
gidB-rsmG | NP_418196 | COG0357 | PRK00107 | pfam02527 | TIGR00138 | dwo | P | |||
gidA-mnmG | NP_418197 | COG0445 | PRK05192 | pfam01134 | TIGR00136 | B000061 | swo | E | ||
hemC | YP_026260 | COG0181 | PRK00072 | pfam01379 | TIGR00212 | B000035 | H | |||
uvrD | NP_418258 | COG0210 | PRK11773 | pfam00580 | TIGR01075 | dwo | P | |||
polA | NP_418300 | COG0749 | PRK05755 | pfam00476 | TIGR00593 | B000050 | dwo | N | ||
hemN | NP_418303 | COG0635 | PRK09249 | pfam04055 | TIGR00538 | sso | G | |||
typA | YP_026274 | COG1217 | PRK10218 | pfam00009 | TIGR01394 | B000111 | H | |||
tpiA | NP_418354 | COG0149 | PRK00042 | pfam00121 | TIGR00419 | dwo | P | |||
priA | NP_418370 | COG1198 | PRK05580 | pfam00270 | TIGR00595 | B000045 | swo | E | ||
murB | NP_418403 | COG0812 | PRK00046 | pfam02873 | TIGR00179 | B000114 | dso | K | ||
nusG | NP_418409 | COG0250 | PRK05609 | pfam02357 | TIGR00922 | sso | G | |||
rplK-L11 | NP_418410 | COG0080 | PRK00140 | pfam00298 | TIGR01632 | B000024 | sso | A | ||
rplA-L1 | NP_418411 | COG0081 | PRK05424 | pfam00687 | TIGR01169 | B000003 | sso | A | ||
rplJ-L10 | NP_418412 | COG0244 | PRK00099 | pfam00466 | B000030 | sso | B | |||
rplL-L7L12 | NP_418413 | COG0222 | PRK00157 | pfam00542 | TIGR00855 | B000107 | swo | E | ||
rpoB | NP_418414 | COG0085 | PRK00405 | pfam00562 | TIGR02013 | B000042 | sso | A | ||
rpoC | NP_418415 | COG0086 | PRK00566 | pfam04998 | TIGR02386 | B000044 | swo | E | ||
hemE | NP_418425 | COG0407 | PRK00115 | pfam01208 | TIGR01464 | B000100 | H | |||
purD | NP_418433 | COG0151 | PRK00885 | pfam01071 | TIGR00877 | swo | I | |||
purH | NP_418434 | COG0138 | PRK00881 | pfam01808 | TIGR00355 | dso | O | |||
dnaB | NP_418476 | COG0305 | PRK08006 | pfam03796 | TIGR00665 | dwo | P | |||
uvrA | NP_418482 | COG0178 | PRK00349 | pfam00005 | TIGR00630 | dwo | P | |||
groES | NP_418566 | COG0234 | PRK00364 | pfam00166 | dwo | P | ||||
efp | NP_418571 | COG0231 | PRK00529 | pfam09285 | TIGR00038 | dwo | P | |||
tsaE-yjeE | NP_418589 | COG0802 | PRK10646 | pfam02367 | TIGR00150 | dso | O | |||
mutL | NP_418591 | COG0323 | PRK00095 | pfam01119 | TIGR00585 | B000064 | H | |||
miaA | NP_418592 | COG0324 | PRK00091 | pfam01715 | TIGR00174 | B000082 | swo | E | ||
purA | NP_418598 | COG0104 | PRK01117 | pfam00709 | TIGR00184 | dwo | P | |||
rlmB | NP_418601 | COG0566 | PRK11181 | pfam00588 | TIGR00186 | swo | I | |||
rpsF-S6 | NP_418621 | COG0360 | PRK00453 | pfam01250 | TIGR00166 | B000051 | sso | B | ||
rpsR-S18 | NP_418623 | COG0238 | PRK00391 | pfam01084 | TIGR00165 | B000075 | dso | K | ||
rplI-L9 | NP_418624 | COG0359 | PRK00137 | pfam03948 | TIGR00158 | B000054 | sso | B | ||
pyrB | NP_418666 | COG0540 | PRK00856 | pfam02729 | TIGR00670 | dso | O | |||
valS | NP_418679 | COG0525 | PRK05729 | pfam00133 | TIGR00422 | swo | D | |||
sms-radA | NP_418806 | COG1066 | PRK11823 | pfam06745 | TIGR00416 | B000060 | swo | E |
For each gene name GENE and each accession number ACCN from the databank BANK (i.e. COG, PRK, Pfam, TIGR or PhyEco) listed in the table above, the reference multiple amino acid sequence alignment (MSA) can be accessed via the following URL model:
http://giphy.pasteur.fr/PhyloM/bacteria/aln/BANK/GENE.ACCN.faa
and the associated position-specific scoring matrix (PSSM) via the following URL model:
http://giphy.pasteur.fr/PhyloM/bacteria/smp/BANK/GENE.ACCN.smp
For example, the reference MSA COG0359 for the gene rplI-L9 can be downloaded with wget using the following Linux command line:
wget -q http://giphy.pasteur.fr/PhyloM/bacteria/aln/COG/rplI-L9.COG0359.faa
The same download can also be performed with curl using the following command line:
curl --silent -O http://giphy.pasteur.fr/PhyloM/bacteria/aln/COG/rplI-L9.COG0359.faa
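The two URL models above lend themselves to a small helper. The following is only a minimal sketch: the function name phylom_url and its argument order are assumptions made for illustration and are not part of PhyloM; only the URL layout comes from the models above.

```shell
# Hypothetical helper (not part of PhyloM): print the URL of one PhyloM file.
#   $1: aln (reference MSA, .faa) or smp (PSSM, .smp)
#   $2: databank BANK (COG, PRK, Pfam, TIGR or PhyEco)
#   $3: gene name GENE, as written in the table
#   $4: accession number ACCN
phylom_url() {
  case "$1" in
    aln) ext=faa ;;
    smp) ext=smp ;;
    *) echo "unknown file kind: $1" >&2; return 1 ;;
  esac
  printf 'http://giphy.pasteur.fr/PhyloM/bacteria/%s/%s/%s.%s.%s\n' \
         "$1" "$2" "$3" "$4" "$ext"
}

# Print the URL of the reference MSA COG0359 for the gene rplI-L9.
phylom_url aln COG rplI-L9 COG0359
```

The printed URL can then be handed to wget or curl, e.g. curl --silent -O "$(phylom_url smp COG rplI-L9 COG0359)" for the matching PSSM.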
Gene names from a given category CTG (i.e. from A to P) can be accessed via the following URL model:
http://giphy.pasteur.fr/PhyloM/bacteria/cat/CTG.txt
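As a sketch of how the category lists might be selected by recommendation level, the helper below maps each level described earlier to its categories and prints the corresponding list URLs. The level labels (highly_recommended, recommended, ...) and both function names are assumptions chosen for this example; only the grouping of categories and the URL model come from the text.

```shell
# Hypothetical helper: categories for one recommendation level.
# The level labels are illustrative names, not PhyloM identifiers.
phylom_categories() {
  case "$1" in
    highly_recommended)   echo "A B" ;;
    recommended)          echo "C D E F" ;;
    usable)               echo "G H I" ;;
    suggested)            echo "J K L" ;;
    not_very_recommended) echo "M N" ;;
    not_recommended)      echo "O P" ;;
    *) echo "unknown level: $1" >&2; return 1 ;;
  esac
}

# Print one gene name list URL per category of the given level.
phylom_category_urls() {
  cats=$(phylom_categories "$1") || return 1
  for ctg in $cats; do
    printf 'http://giphy.pasteur.fr/PhyloM/bacteria/cat/%s.txt\n' "$ctg"
  done
}

# List the URLs of the gene name lists for categories A and B.
phylom_category_urls highly_recommended
```

Each printed URL can be fetched with wget -q or curl --silent -O as shown earlier.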
Every reference MSA from the databank BANK (i.e. COG, PRK, Pfam, TIGR or PhyEco) associated with a given category CTG (i.e. from A to P) can be accessed via the following URL model:
http://giphy.pasteur.fr/PhyloM/bacteria/aln/BANK/BANK.CTG.aln.tar.gz
Similarly, every PSSM from the databank BANK (i.e. COG, PRK, Pfam, TIGR or PhyEco) associated with a given category CTG (i.e. from A to P) can be accessed via the following URL model:
http://giphy.pasteur.fr/PhyloM/bacteria/smp/BANK/BANK.CTG.smp.tar.gz
For example, every COG PSSM file that belongs to category A can be downloaded with curl and uncompressed with tar using the following Linux command line:
curl --silent http://giphy.pasteur.fr/PhyloM/bacteria/smp/COG/COG.A.smp.tar.gz | tar -xz
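When several categories are needed, the archive URLs can be generated in a loop. This is a minimal sketch: the function name phylom_bundle_url is an assumption made for illustration, and only the URL layout is taken from the two archive models above.

```shell
# Hypothetical helper: print the URL of one per-category archive.
#   $1: aln (reference MSAs) or smp (PSSMs)
#   $2: databank BANK (COG, PRK, Pfam, TIGR or PhyEco)
#   $3: category CTG (A to P)
phylom_bundle_url() {
  printf 'http://giphy.pasteur.fr/PhyloM/bacteria/%s/%s/%s.%s.%s.tar.gz\n' \
         "$1" "$2" "$2" "$3" "$1"
}

# Print the COG PSSM archive URLs for the two highly recommended categories.
for ctg in A B; do
  phylom_bundle_url smp COG "$ctg"
done
```

Replacing the body of the loop with curl --silent "$(phylom_bundle_url smp COG "$ctg")" | tar -xz would download and unpack each archive in turn.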
Each of the PhyloM MSA files could be used as a query for performing a psiblast search with the BLAST+ tools (Camacho et al. 2009).
Let cds.faa be a FASTA-formatted amino acid sequence file (e.g. every CDS from a bacterial genome). This databank should first be formatted with the following Linux command line:
makeblastdb -in cds.faa -dbtype prot
Next, a PhyloM MSA file msa.faa can be directly used as a query for performing a BLAST search with the following Linux command line model:
psiblast -in_msa msa.faa -db cds.faa -seg no -word_size 2 -evalue 1E-20 -xdrop_gap_final 1000
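To search with a whole set of downloaded MSA files, the command above can be wrapped in a loop. The sketch below assumes that the MSA files sit in their own directory, and it adds the standard BLAST+ options -outfmt 6 and -out to write one tabular report per marker; the function name search_msa_dir and the .tsv output naming are choices made for this example.

```shell
# Hypothetical wrapper: run psiblast once per PhyloM MSA file.
#   $1: directory holding the downloaded *.faa MSA files only
#   $2: protein databank already formatted with makeblastdb
#   $3: output directory (created if missing)
search_msa_dir() {
  msa_dir=$1; db=$2; out_dir=$3
  mkdir -p "$out_dir" || return 1
  for msa in "$msa_dir"/*.faa; do
    [ -e "$msa" ] || continue              # no .faa file in the directory
    name=$(basename "$msa" .faa)           # e.g. rplI-L9.COG0359
    psiblast -in_msa "$msa" -db "$db" -seg no -word_size 2 \
             -evalue 1E-20 -xdrop_gap_final 1000 \
             -outfmt 6 -out "$out_dir/$name.tsv"
  done
}
```

A call such as search_msa_dir msa cds.faa hits would then leave one file per marker in the directory hits. The databank cds.faa should be kept outside the MSA directory, since every .faa file found there is treated as a query.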
Each of the PhyloM PSSM files could be used as a query for performing a tblastn search with the BLAST+ tools (Camacho et al. 2009).
Let seq.fna be a FASTA-formatted nucleotide sequence file (e.g. de novo assembly of a bacterial genome). This databank should first be formatted with the following Linux command line:
makeblastdb -in seq.fna -dbtype nucl
Next, a PhyloM PSSM file pssm.smp can be directly used as a query for performing a BLAST search with the following Linux command line model:
tblastn -in_pssm pssm.smp -db seq.fna -seg no -word_size 2 -evalue 1E-20 -xdrop_gap_final 1000
Of note, the corresponding full CDS can be easily extracted by using the program eFASTA along with fields 2, 9 and 10 output by the tblastn option -outfmt 6.
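In the default -outfmt 6 layout, field 2 is the subject sequence identifier and fields 9 and 10 are the start and end of the hit on the subject sequence. The following sketch pulls these three fields out of a tabular tblastn report and reorders the coordinates for hits on the reverse strand. The function name hit_coords and the added strand column are assumptions made for this example; how the result is then passed to eFASTA is not covered here.

```shell
# Hypothetical helper: read tblastn -outfmt 6 lines (stdin or files) and print
# subject id, lower coordinate, upper coordinate and strand, tab-separated.
hit_coords() {
  awk -F '\t' -v OFS='\t' '{
    if ($9 + 0 <= $10 + 0) print $2, $9, $10, "+";   # hit on the forward strand
    else                   print $2, $10, $9, "-";   # reverse strand: sstart > send
  }' "$@"
}
```

For instance, tblastn -in_pssm pssm.smp -db seq.fna -seg no -word_size 2 -evalue 1E-20 -xdrop_gap_final 1000 -outfmt 6 | hit_coords would list one line per hit.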
Bratlie MS, Johansen J, Drablos F (2010) Relationship between operon preference and functional properties of persistent genes in bacterial genomes. BMC Genomics, 11:71. doi:10.1186/1471-2164-11-71
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10:421. doi:10.1186/1471-2105-10-421
Creevey CJ, Doerks T, Fitzpatrick DA, Raes J, Bork P (2011) Universally distributed single-copy genes indicate a constant rate of horizontal transfer. PLoS ONE, 6(8):e22099. doi:10.1371/journal.pone.0022099
Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Research, 44:D279-285. doi:10.1093/nar/gkv1344
Galperin MY, Makarova KS, Wolf YI, Koonin EV (2015) Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Research, 43:D261-269. doi:10.1093/nar/gku1223
Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT, White O (2001) TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Research, 29(1):41-43. doi:10.1093/nar/29.1.41
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4:41. doi:10.1186/1471-2105-4-41
Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on protein families. Science, 278(5338):631-637. doi:10.1126/science.278.5338.631
Wu D, Jospin G, Eisen JA (2013) Systematic identification of gene families for use as "markers" for phylogenetic and phylogeny-driven ecological studies of bacteria and archaea and their major subgroups. PLoS ONE, 8(10):e77033. doi:10.1371/journal.pone.0077033