Institut Pasteur | DBC | Bioinformatics and Biostatistics Hub | GIPhy

PhyloM: bacteria


Description

PhyloM: bacteria is a selection of markers that are well suited for phylogenetic tree inference of bacterial taxa. These markers are recommended for phylogenetic reconstruction because they have been shown to correspond to persistent genes within bacterial phyla (close to universal distribution). The selection mainly relies on the independent and complementary works of Bratlie et al. (2010), Creevey et al. (2011) and Wu et al. (2013). Bratlie et al. (2010) classified 213 genes with respect to operon participation (i.e. strong or weak operon genes) and involvement in duplication events (i.e. singleton or duplicate ortholog sequence sets). Creevey et al. (2011) compiled 40 universal single-copy marker genes (bacteria, archaea and eukaryotes), i.e. genes that are putatively essential and likely little affected by duplication events. Wu et al. (2013) selected 114 genes associated with an informative phylogenetic signal and low variation in copy number across taxa.


Marker categories

The 114 marker candidates selected by Wu et al. (2013) are labelled as PhyEco. The 40 genes selected by Creevey et al. (2011) are labelled as uscg (universal single copy gene). Finally, the loci selected by Bratlie et al. (2010) are labelled in the column pgoc (persistent gene operon category) according to the authors' classification:
 - sso (singleton strong operon): rarely or never involved in duplication events (singleton) and probably seldom involved in horizontal transfer events (strong operon),
 - swo (singleton weak operon): rarely or never involved in duplication events (singleton), but possibly involved in horizontal transfer events (weak operon),
 - dso (duplicate strong operon): likely involved in duplication events (duplicate), but probably seldom involved in horizontal transfer events (strong operon),
 - dwo (duplicate weak operon): likely involved in duplication events (duplicate) and possibly also involved in horizontal transfer events (weak operon).
For more details about the definitions of singleton, duplicate, strong operon and weak operon proteins, see the Introduction subsection in Bratlie et al. (2010).

Overall, the 226 phylogenetic markers are classified into 16 categories (from A to P) according to their membership in pgoc, uscg and/or PhyEco. These categories are described below. For each of them, the associated list of gene names is available, as well as the reference multiple amino acid sequence alignments and their associated position-specific scoring matrices (PSSM), gathered from the COG database (Tatusov et al. 1997, 2003; Galperin et al. 2015), the NCBI protein cluster collection (PRK), the Pfam database (Finn et al. 2016), the TIGRFAMs database (Haft et al. 2001), and the PhyEco repository (Wu et al. 2013).

Categories A and B (52 markers identified by at least two studies) are highly recommended for phylogenetic inference, as very little paralogy or xenology is expected.

category  pgoc  uscg  PhyEco  no.
A         sso   yes   yes     24
B         sso   -     yes     28

Categories C to F (29 markers identified by at least two studies) are recommended for phylogenetic inference. Very little paralogy is expected, but xenology could sometimes be observed.

category  pgoc  uscg  PhyEco  no.
C         swo   yes   yes     6
D         swo   yes   -       3
E         swo   -     yes     19
F         -     yes   yes     1

Categories G to I (49 markers, each specific to one of two studies) could be used for phylogenetic inference. Little paralogy is expected, but xenology could sometimes be observed. Deletion could also be observed within some phyla.

category  pgoc  uscg  PhyEco  no.
G         sso   -     -       16
H         -     -     yes     20
I         swo   -     -       13

Categories J to L (10 markers identified by at least two studies) are suggested for phylogenetic inference. Paralogy could be observed, but little xenology is expected.

category  pgoc  uscg  PhyEco  no.
J         dso   yes   yes     3
K         dso   -     yes     6
L         dso   yes   -       1

Categories M and N (9 markers identified by at least two studies) are not strongly recommended for phylogenetic inference, as paralogy and xenology could be observed.

category  pgoc  uscg  PhyEco  no.
M         dwo   yes   -       2
N         dwo   -     yes     7

Categories O and P (77 markers identified by only one study) are not recommended for phylogenetic inference, as paralogy, xenology or deletion could be observed.

category  pgoc  uscg  PhyEco  no.
O         dso   -     -       30
P         dwo   -     -       47
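
As a quick arithmetic check on the category sizes given in the tables above, the following shell sketch sums the marker counts over any set of categories. The variable counts and the helper sum_categories are illustrative names only; the numbers are copied from the tables.

```shell
#!/bin/sh
# Marker counts per category, copied from the category tables above.
counts="A:24 B:28 C:6 D:3 E:19 F:1 G:16 H:20 I:13 J:3 K:6 L:1 M:2 N:7 O:30 P:47"

# sum_categories LETTERS (hypothetical helper)
# Prints the total number of markers over the given category letters, e.g. "AB".
sum_categories() {
  total=0
  for pair in $counts; do
    case "$1" in
      # ${pair%%:*} is the category letter, ${pair#*:} is its marker count
      *"${pair%%:*}"*) total=$((total + ${pair#*:})) ;;
    esac
  done
  echo "$total"
}

sum_categories AB                 # 52 (highly recommended)
sum_categories ABCDEF             # 81 (highly recommended + recommended)
sum_categories ABCDEFGHIJKLMNOP   # 226 (all markers)
```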


Marker list

The 226 genes are listed in the table below. They are sorted according to their order in the reference genome of Escherichia coli strain K12 substrain MG1655 (GenBank accn: U00096). Common gene names are available in the column name, with a link to the UniProt description. When a gene is organized within a well-documented operon, a link to the corresponding RegulonDB description is available in the column E. coli operon. The column E. coli CDS lists the NCBI accession numbers of the Escherichia coli strain K12 substrain MG1655 CDS, with links to the corresponding entries in the NCBI Conserved Domain Database. For each gene, the corresponding accession number (if available) is given for the COG database (Tatusov et al. 1997, 2003; Galperin et al. 2015), the NCBI protein cluster collection (PRK), the Pfam database (Finn et al. 2016), the TIGRFAMs database (Haft et al. 2001), and the PhyEco repository (Wu et al. 2013).

name E. coli operon E. coli CDS COG PRK Pfam TIGRFAMs PhyEco pgoc uscg category
dnaK   NP_414555 COG0443 PRK00290 pfam00012 TIGR02350   dwo   P
rpsT-S20   NP_414564 COG0268 PRK00239 pfam01649 TIGR00029 B000077 swo   E
ribF NP_414566 COG0196 PRK05627 pfam06574 TIGR00083 B000096 dwo   N
ileS NP_414567 COG0060 PRK05743 pfam00133 TIGR00392   dwo   P
lspA NP_414568 COG0597 PRK00376 pfam01252 TIGR00077   dso   O
carA NP_414573 COG0505 PRK12564 pfam00988 TIGR01368   swo   I
carB NP_414574 COG0458 PRK05294 pfam02786 TIGR01369   dwo   P
rsmA-ksgA   NP_414593 COG0030 PRK00274 pfam00398 TIGR00755   sso   G
rsmH-mraW NP_414624 COG0275 PRK00050 pfam01795 TIGR00006 B000067 sso   B
ftsI NP_414626 COG0768 PRK15105 pfam00905 TIGR02214   dso   O
murE NP_414627 COG0769 PRK00139 pfam08245 TIGR01085 B000105 dso   K
murF NP_414628 COG0770 PRK10773 pfam08245 TIGR01143   dso   O
mraY NP_414629 COG0472 PRK00108 pfam00953 TIGR00445   sso   G
murD NP_414630 COG0771 PRK03806 pfam08245 TIGR01087 B000068 dso   K
ftsW NP_414631 COG0772 PRK10774 pfam01098 TIGR02614   dso   O
murG NP_414632 COG0707 PRK00726 pfam03033 TIGR01133 B000066 dso   K
murC NP_414633 COG0773 PRK00421 pfam08245 TIGR01082   dso   O
ftsA   NP_414636 COG0849 PRK09472 pfam14450 TIGR01174 B000110     H
ftsZ   NP_414637 COG0206 PRK09330 pfam12327 TIGR00065   dwo   P
secA   NP_414640 COG0653 PRK12904 pfam07517 TIGR00963   dwo   P
coaE   NP_414645 COG0237 PRK00081 pfam01121 TIGR00152 B000085 sso   B
map   NP_414710 COG0024 PRK05716 pfam00557 TIGR00500   dwo   P
rpsB-S2 NP_414711 COG0052 PRK05299 pfam00318 TIGR01011 B000001 sso A
tsf NP_414712 COG0264 PRK09377 pfam00889 TIGR00116 B000056 sso   B
pyrH   NP_414713 COG0528 PRK00358 pfam00696 TIGR02075   sso   G
frr   NP_414714 COG0233 PRK00083 pfam01765 TIGR00496 B000079 sso   B
dxr   NP_414715 COG0743 PRK05447 pfam02670 TIGR00243 B000088     H
uppS   NP_414716 COG0020 PRK10240 pfam01255 TIGR00055   dso   O
rseP   NP_414718 COG0750 PRK10779 pfam02163 TIGR00054   dso   O
rnhB NP_414725 COG0164 PRK00015 pfam01351 TIGR00729 B000039     H
dnaE NP_414726 COG0587 PRK05673 pfam07733 TIGR00594   dwo   P
tilS-mesJ   NP_414730 COG0037 PRK10660 pfam01171 TIGR02432 B000091 sso   B
tgt NP_414940 COG0343 PRK00112 pfam01702 TIGR00430   dwo   P
nrdR NP_414947 COG1327 PRK00464 pfam03477 TIGR00244 B000087     H
nusB NP_414950 COG0781 PRK00202 pfam01029 TIGR01951   sso   G
tig   NP_414970 COG0544 PRK01490 pfam05697 TIGR00115   swo   I
clpX NP_414972 COG1219 PRK05342 pfam07724 TIGR00382 B000112 swo   E
dnaX   NP_415003 COG2812 PRK07994 pfam12170 TIGR02397   sso   G
recR NP_415005 COG0353 PRK00076 pfam13662 TIGR00615 B000080 sso   B
purE NP_415056 COG0041 PLN02948 pfam00731 TIGR01162   dso   O
cysS   NP_415059 COG0215 PRK00260 pfam01406 TIGR00435   swo D
rsfS NP_415170 COG0799 PRK11538 pfam02410 TIGR00090 B000062     H
leuS   NP_415175 COG0495 PRK00390 pfam13603 TIGR00396 B000108 swo C
ybeY NP_415192 COG0319 PRK00016 pfam02130 TIGR00043 B000081 sso   B
uvrB   NP_415300 COG0556 PRK05298 pfam12344 TIGR00631   swo   I
infA   NP_415404 COG0361 PRK00276 pfam01176 TIGR00008   dwo   P
ftsK   NP_415410 COG1674 PRK10263 pfam01580 TIGR03928   dwo   P
serS   NP_415413 COG0172 PRK05431 pfam00587 TIGR00414 B000073   F
aroA NP_415428 COG0128 PRK11860 pfam00275 TIGR01356   dso   O
rpsA-S1 NP_415431 COG0539 PRK06299 pfam00575 TIGR00717   swo   I
rpmF-L32 NP_415607 COG0333 PRK01110 pfam01783 TIGR01031 B000106     H
plsX NP_415608 COG0416 PRK05331 pfam02504 TIGR00182 B000090     H
fabD NP_415610 COG0331 PLN02752 pfam00698 TIGR00128   dso   O
fabG NP_415611 COG1028 PRK05557 pfam13561 TIGR01830   dso   O
acpP NP_415612 COG0236 PRK00982 pfam00550 TIGR00517   dwo   P
fabF NP_415613 COG0304 PRK07314 pfam00109 TIGR03150   dso   O
tmk   NP_415616 COG0125 PRK00698 pfam02223 TIGR00041   sso   G
holB   NP_415617 COG0470 PRK07993 pfam09115 TIGR00678   sso   G
ycfH   NP_415618 COG0084 PRK10812 pfam01026 TIGR00010   dso   O
mfd   NP_415632 COG1197 PRK10689 pfam03461 TIGR00580 B000046 swo   E
mnmA-trmU   NP_415651 COG0482 PRK00143 pfam03054 TIGR00420 B000098 dwo   N
ychF NP_415721 COG0012 PRK09601 pfam06071 TIGR00092 B000083 swo C
pth NP_415722 COG0193 PRK05426 pfam01195 TIGR00447 B000103 dso   K
prsA NP_415725 COG0462 PRK01259 pfam13793 TIGR01251   dwo   P
ispE NP_415726 COG1947 PRK00343 pfam08544 TIGR00154   sso   G
prfA NP_415729 COG0216 PRK00591 pfam03462 TIGR00019 B000053 sso   B
pemK-prmC NP_415730 COG2890 PRK09328 pfam05175 TIGR00536   dso   O
nth   NP_416150 COG0177 PRK10702 pfam00730 TIGR01083   dso   O
pheT NP_416228 COG0072 PRK00629 pfam03483 TIGR00472 B000013 sso   B
pheS NP_416229 COG0016 PRK00488 pfam01409 TIGR00468 B000020 sso A
rplT-L20 NP_416231 COG0292 PRK05185 pfam00453 TIGR01032 B000070 swo   E
rpmI-L35 NP_416232 COG0291 PRK00172 pfam01632 TIGR00001 B000099 swo   E
infC NP_416233 COG0290 PRK00028 pfam00707 TIGR00168 B000104 swo   E
thrS NP_416234 COG0441 PRK00413 pfam00587 TIGR00418   dwo   P
tsaB-yeaZ   NP_416321 COG1214 PRK09604 pfam00814 TIGR03725   sso   G
ruvB NP_416374 COG2255 PRK00080 pfam05496 TIGR00635 B000072 sso   B
ruvA NP_416375 COG0632 PRK00116 pfam01330 TIGR00084 B000071 sso   B
ruvC NP_416377 COG0817 PRK00039 pfam02075 TIGR00228 B000093     H
yebC NP_416378 COG0217 PRK00110 pfam01709 TIGR01033   dwo   P
aspS   NP_416380 COG0173 PRK00476 pfam00152 TIGR00459   dwo   P
argS   NP_416390 COG0018 PRK01611 pfam00750 TIGR00456   dwo M
pgsA   NP_416422 COG0558 PRK10832 pfam01066 TIGR00560   swo   I
uvrC   NP_416423 COG0322 PRK00558 pfam08459 TIGR00194   swo   I
rplY-L25   NP_416690 COG1825 PRK05943 pfam01386 TIGR00731 B000059     H
trxA   NP_416699 COG0526 PRK15412 pfam08534 TIGR00385   dwo   P
gyrA   NP_416734 COG0188 PRK05560 pfam00521 TIGR01063   dwo   P
purF NP_416815 COG0034 PRK09246 pfam13537 TIGR01134   dwo   P
folC   NP_416818 COG0285 PRK10846 pfam08245 TIGR01499   dso   O
truA NP_416821 COG0101 PRK00021 pfam01416 TIGR00071   dso   O
aroC   NP_416832 COG0082 PRK05382 pfam01264 TIGR00033   sso   G
gltX   NP_416899 COG0008 PRK01406 pfam00749 TIGR00464   dwo   P
ligA NP_416906 COG0272 PRK07956 pfam01653 TIGR00575 B000109 dwo   N
purM NP_416994 COG0150 PRK05385 pfam02769 TIGR00878 B000038 sso   B
der-engA NP_417006 COG1160 PRK00093 pfam01926 TIGR03594 B000043 swo   E
hisS   NP_417009 COG0124 PRK00037 pfam13393 TIGR00442   dso L
era NP_417061 COG1159 PRK00089 pfam01926 TIGR00436   sso   G
rnc NP_417062 COG0571 PRK00102 pfam14622 TIGR02191   dso   O
lepA NP_417064 COG0481 PRK05433 pfam00009 TIGR01393 B000004 dwo   N
clpB   NP_417083 COG0542 PRK10865 pfam07724 TIGR03346   dwo   P
rluD-sfhB   NP_417085 COG0564 PRK11180 pfam00849 TIGR00005   swo   I
rplS-L19 NP_417097 COG0335 PRK05338 pfam01245 TIGR01024 B000069 swo   E
trmD NP_417098 COG0336 PRK00026 pfam01746 TIGR00088 B000058 sso   B
rimM NP_417099 COG0806 PRK00122 pfam01782 TIGR02273 B000086 sso   B
rpsP-S16 NP_417100 COG0228 PRK00040 pfam00886 TIGR00002 B000094 sso   B
ffh NP_417101 COG0541 PRK10867 pfam00448 TIGR00959 B000008 swo C
grpE   NP_417104 COG0576 PRK10325 pfam01025     dwo   P
nadK-ppnK   NP_417105 COG0061 PRK03378 pfam01513     dwo   P
recN   YP_026172 COG0497 PRK10869 pfam13476 TIGR00634 B000078     H
smpB   NP_417110 COG0691 PRK05422 pfam01668 TIGR00086 B000065 swo   E
alaS   NP_417177 COG0013 PRK00252 pfam01411 TIGR00344   swo   I
recA NP_417179 COG0468 PRK09354 pfam00154 TIGR02012 B000095 dwo   N
mutS   NP_417213 COG0249 PRK05399 pfam00488 TIGR01070 B000076     H
ispF   NP_417226 COG0245 PRK00084 pfam02542 TIGR00151 B000089     H
eno NP_417259 COG0148 PRK00077 pfam00113 TIGR01060   dwo   P
pyrG NP_417260 COG0504 PRK05380 pfam06418 TIGR00337 B000047 swo   E
metK   NP_417417 COG0192 PRK05250 pfam02773 TIGR01034   dwo   P
rsmE   NP_417421 COG1385 PRK11713 pfam04452 TIGR00046   dwo   P
yqgF NP_417424 COG0816 PRK00109 pfam03652 TIGR00250   sso   G
rdgB NP_417429 COG0127 PRK00120 pfam01725 TIGR00042   dso   O
tsaD   NP_417536 COG0533 PRK09604 pfam00814 TIGR03723 B000006 swo C
dnaG NP_417538 COG0358 PRK05667 pfam08275 TIGR01391 B000092 swo   E
rpoD NP_417539 COG0568 PRK05658 pfam04546 TIGR02393   dwo   P
yraL   NP_417615 COG0313 PRK14994 pfam00590 TIGR00096   dwo   P
pnp NP_417633 COG1185 PRK11824 pfam01138 TIGR03591 B000055 swo   E
rpsO-S15 NP_417634 COG0184 PRK05626 pfam00312 TIGR00952 B000034 swo C
truB NP_417635 COG0130 PRK05033 pfam01509 TIGR00431 B000032 sso   B
rbfA NP_417636 COG0858 PRK00521 pfam02033 TIGR00082 B000063 sso   B
infB NP_417637 COG0532 PRK05306 pfam11987 TIGR00487 B000005 sso   B
nusA NP_417638 COG0195 PRK09202 pfam08529 TIGR01953 B000041 sso   B
folP   NP_417644 COG0294 PRK11613 pfam00809 TIGR01496   dso   O
hflB-ftsH   NP_417645 COG0465 PRK10733 pfam01434 TIGR01241   dwo   P
greA   NP_417648 COG0782 PRK00226 pfam03449 TIGR01462   dwo   P
obg NP_417650 COG0536 PRK12298 pfam01018 TIGR02729 B000049 swo   E
rpmA-L27 NP_417652 COG0211 PRK05435 pfam01016 TIGR00062 B000102 sso   B
rplU-L21 NP_417653 COG0261 PRK05573 pfam00829 TIGR00061 B000074 sso   B
ispB   NP_417654 COG0142 PRK10888 pfam00348 TIGR02749   dwo   P
murA NP_417656 COG0766 PRK09369 pfam00275 TIGR01072   dwo   P
rpsI-S9 NP_417697 COG0103 PRK00132 pfam00380 TIGR03627 B000011 sso A
rplM-L13 NP_417698 COG0102 PRK09216 pfam00572 TIGR01066 B000037 sso A
smf   YP_026211 COG0758 PRK10736 pfam02481 TIGR00732   dwo   P
mreC NP_417716 COG1792 PRK13922 pfam04085 TIGR00219 B000101     H
def NP_417745 COG0242 PRK00150 pfam01327 TIGR00079   dso   O
fmt NP_417746 COG0223 PRK00005 pfam00551 TIGR00460   sso   G
rplQ-L17 NP_417753 COG0203 PRK05591 pfam01196 TIGR00059 B000057 sso   B
rpoA NP_417754 COG0202 PRK05182 pfam01193 TIGR02027 B000052 dso J
rpsD-S4 NP_417755 COG0522 PRK05327 pfam00163 TIGR01017   dwo M
rpsK-S11 NP_417756 COG0100 PRK05309 pfam00411 TIGR03632 B000029 sso A
rpsM-S13 NP_417757 COG0099 PRK05179 pfam00416 TIGR03631 B000019 sso A
prlA-secY NP_417759 COG0201 PRK09204 pfam00344 TIGR00967 B000048 dso J
rplO-L15 NP_417760 COG0200 PRK05592 pfam00828 TIGR01071 B000021 sso A
rpsE-S5 NP_417762 COG0098 PRK00550 pfam03719 TIGR01021 B000015 sso A
rplR-L18 NP_417763 COG0256 PRK05593 pfam00861 TIGR00060 B000033 sso A
rplF-L6 NP_417764 COG0097 PRK05498 pfam00347 TIGR03654 B000023 sso A
rpsH-S8 NP_417765 COG0096 PRK00136 pfam00410   B000031 sso A
rpsN-S14 NP_417766 COG0199 PRK08881 pfam00253     dso   O
rplE-L5 NP_417767 COG0094 PRK00010 pfam00673   B000025 sso A
rplX-L24 NP_417768 COG0198 PRK00004 pfam17136 TIGR01079 B000040 sso   B
rplN-L14 NP_417769 COG0093 PRK05483 pfam00238 TIGR01067 B000014 sso A
rpsQ-S17 NP_417770 COG0186 PRK05610 pfam00366 TIGR03635 B000036 dso J
rpmC-L29 NP_417771 COG0255 PRK00306 pfam00831 TIGR00012 B000027     H
rplP-L16 NP_417772 COG0197 PRK09203 pfam00252 TIGR01164 B000018 sso A
rpsC-S3 NP_417773 COG0092 PRK00310 pfam00189 TIGR01009 B000028 sso A
rplV-L22 NP_417774 COG0091 PRK00565 pfam00237 TIGR01044 B000007 sso A
rpsS-S19 NP_417775 COG0185 PRK00357 pfam00203 TIGR01050 B000016 sso A
rplB-L2 NP_417776 COG0090 PRK09374 pfam03947 TIGR01171 B000010 sso A
rplW-L23 NP_417777 COG0089 PRK05738 pfam00276 TIGR03636 B000022 sso   B
rplD-L4 NP_417778 COG0088 PRK05319 pfam00573 TIGR03953 B000009 sso A
rplC-L3 NP_417779 COG0087 PRK00001 pfam00297 TIGR03625 B000012 sso A
rpsJ-S10 NP_417780 COG0051 PRK00596 pfam00338 TIGR01049 B000002 sso   B
rpsG-S7 NP_417800 COG0049 PRK05302 pfam00177 TIGR01029 B000017 sso A
rpsL-S12 NP_417801 COG0048 PRK05163 pfam00164 TIGR00981 B000026 swo C
trpS NP_417843 COG0180 PRK00927 pfam00579 TIGR00233   dwo   P
ftsY   NP_417921 COG0552 PRK10416 pfam00448 TIGR00064   swo D
rsmD NP_417922 COG0742 PRK10909 pfam03602 TIGR00095   sso   G
glyS NP_418016 COG0751 PRK01233 pfam02092 TIGR00211 B000097     H
gpsA NP_418065 COG0240 PRK00094 pfam07479 TIGR03376   dso   O
kdtB-coaD NP_418091 COG0669 PRK00168 pfam01467 TIGR01510   swo   I
rpmB-L28 NP_418094 COG0227 PRK00359 pfam00830 TIGR00009   dwo   P
gmk   NP_418105 COG0194 PRK00300 pfam00625 TIGR03263   swo   I
spoT-relA NP_418107 COG0317 PRK11092 pfam13328 TIGR00691   dwo   P
gyrB   YP_026241 COG0187 PRK14939 pfam00204 TIGR01059   dwo   P
recF NP_418155 COG1195 PRK00064 pfam02463 TIGR00611 B000113     H
dnaN NP_418156 COG0592 PRK05643 pfam02768 TIGR00663   dwo   P
dnaA NP_418157 COG0593 PRK00149 pfam00308 TIGR00362 B000084 dwo   N
yidC   NP_418161 COG0706 PRK01318 pfam14849 TIGR03593   dwo   P
thdF   NP_418162 COG0486 PRK05291 pfam12631 TIGR00450   swo   I
glmS NP_418185 COG0449 PRK00331 pfam01380 TIGR01135   dwo   P
atpD NP_418188 COG0055 PRK09280 pfam00006 TIGR01039   dso   O
atpG NP_418189 COG0224 PRK05621 pfam00231 TIGR01146   dso   O
atpA NP_418190 COG0056 PRK09281 pfam00006 TIGR00962   dso   O
atpH NP_418191 COG0712 PRK05758 pfam00213 TIGR01145   dso   O
gidB-rsmG NP_418196 COG0357 PRK00107 pfam02527 TIGR00138   dwo   P
gidA-mnmG NP_418197 COG0445 PRK05192 pfam01134 TIGR00136 B000061 swo   E
hemC   YP_026260 COG0181 PRK00072 pfam01379 TIGR00212 B000035     H
uvrD   NP_418258 COG0210 PRK11773 pfam00580 TIGR01075   dwo   P
polA   NP_418300 COG0749 PRK05755 pfam00476 TIGR00593 B000050 dwo   N
hemN   NP_418303 COG0635 PRK09249 pfam04055 TIGR00538   sso   G
typA   YP_026274 COG1217 PRK10218 pfam00009 TIGR01394 B000111     H
tpiA   NP_418354 COG0149 PRK00042 pfam00121 TIGR00419   dwo   P
priA   NP_418370 COG1198 PRK05580 pfam00270 TIGR00595 B000045 swo   E
murB   NP_418403 COG0812 PRK00046 pfam02873 TIGR00179 B000114 dso   K
nusG NP_418409 COG0250 PRK05609 pfam02357 TIGR00922   sso   G
rplK-L11 NP_418410 COG0080 PRK00140 pfam00298 TIGR01632 B000024 sso A
rplA-L1 NP_418411 COG0081 PRK05424 pfam00687 TIGR01169 B000003 sso A
rplJ-L10 NP_418412 COG0244 PRK00099 pfam00466   B000030 sso   B
rplL-L7L12 NP_418413 COG0222 PRK00157 pfam00542 TIGR00855 B000107 swo   E
rpoB NP_418414 COG0085 PRK00405 pfam00562 TIGR02013 B000042 sso A
rpoC NP_418415 COG0086 PRK00566 pfam04998 TIGR02386 B000044 swo   E
hemE   NP_418425 COG0407 PRK00115 pfam01208 TIGR01464 B000100     H
purD NP_418433 COG0151 PRK00885 pfam01071 TIGR00877   swo   I
purH NP_418434 COG0138 PRK00881 pfam01808 TIGR00355   dso   O
dnaB   NP_418476 COG0305 PRK08006 pfam03796 TIGR00665   dwo   P
uvrA   NP_418482 COG0178 PRK00349 pfam00005 TIGR00630   dwo   P
groES   NP_418566 COG0234 PRK00364 pfam00166     dwo   P
efp   NP_418571 COG0231 PRK00529 pfam09285 TIGR00038   dwo   P
tsaE-yjeE NP_418589 COG0802 PRK10646 pfam02367 TIGR00150   dso   O
mutL NP_418591 COG0323 PRK00095 pfam01119 TIGR00585 B000064     H
miaA NP_418592 COG0324 PRK00091 pfam01715 TIGR00174 B000082 swo   E
purA   NP_418598 COG0104 PRK01117 pfam00709 TIGR00184   dwo   P
rlmB   NP_418601 COG0566 PRK11181 pfam00588 TIGR00186   swo   I
rpsF-S6 NP_418621 COG0360 PRK00453 pfam01250 TIGR00166 B000051 sso   B
rpsR-S18 NP_418623 COG0238 PRK00391 pfam01084 TIGR00165 B000075 dso   K
rplI-L9 NP_418624 COG0359 PRK00137 pfam03948 TIGR00158 B000054 sso   B
pyrB NP_418666 COG0540 PRK00856 pfam02729 TIGR00670   dso   O
valS NP_418679 COG0525 PRK05729 pfam00133 TIGR00422   swo D
sms-radA NP_418806 COG1066 PRK11823 pfam06745 TIGR00416 B000060 swo   E

Usage

Downloading multiple sequence alignment (MSA) or position-specific scoring matrix (PSSM) files

For each gene name GENE and each accession number ACCN from the databank BANK (i.e. COG, PRK, Pfam, TIGR or PhyEco) listed in the table above, the reference multiple amino acid sequence alignment (MSA) can be accessed via the following URL model:

 http://giphy.pasteur.fr/PhyloM/bacteria/aln/BANK/GENE.ACCN.faa

and the associated position specific scoring matrix (PSSM) via the following URL model:

 http://giphy.pasteur.fr/PhyloM/bacteria/smp/BANK/GENE.ACCN.smp

For example, the reference MSA COG0359 for the gene rplI-L9 can be downloaded with wget using the following Linux command line:

 wget -q http://giphy.pasteur.fr/PhyloM/bacteria/aln/COG/rplI-L9.COG0359.faa

The same download can also be performed with curl using the following command line:

 curl --silent -O http://giphy.pasteur.fr/PhyloM/bacteria/aln/COG/rplI-L9.COG0359.faa
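
The URL models above can be filled in mechanically from a row of the marker table. The sketch below is a hypothetical helper (row_to_urls is not part of PhyloM): it recognizes each accession by its prefix, so rows with missing accessions are handled without relying on column positions. Mapping PLN accessions, which appear in the PRK column of the table, to the PRK directory is an assumption.

```shell
#!/bin/sh
# row_to_urls [aln|smp] (hypothetical helper)
# Reads marker-table rows on stdin and prints one URL per accession found,
# following the URL models given above (aln -> .faa MSA, smp -> .smp PSSM).
row_to_urls() {
  awk -v kind="${1:-aln}" '
    BEGIN {
      base = "http://giphy.pasteur.fr/PhyloM/bacteria"
      ext = (kind == "smp") ? "smp" : "faa"
    }
    {
      # $1 is the gene name; the other fields are classified by their prefix
      for (i = 2; i <= NF; i++) {
        bank = ""
        if      ($i ~ /^COG[0-9]+$/)       bank = "COG"
        else if ($i ~ /^(PRK|PLN)[0-9]+$/) bank = "PRK"    # PLN -> PRK is assumed
        else if ($i ~ /^pfam[0-9]+$/)      bank = "Pfam"
        else if ($i ~ /^TIGR[0-9]+$/)      bank = "TIGR"
        else if ($i ~ /^B[0-9]+$/)         bank = "PhyEco"
        if (bank != "")
          printf "%s/%s/%s/%s.%s.%s\n", base, kind, bank, $1, $i, ext
      }
    }'
}

# Example with the rplI-L9 row of the marker table: prints five MSA URLs.
printf '%s\n' 'rplI-L9 NP_418624 COG0359 PRK00137 pfam03948 TIGR00158 B000054 sso   B' | row_to_urls aln
```

The printed URLs can then be piped to xargs -n 1 curl --silent -O to download the files one by one.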

Downloading file sets associated with a given category

Gene names from a given category CTG (i.e. from A to P) can be accessed via the following URL model:

 http://giphy.pasteur.fr/PhyloM/bacteria/cat/CTG.txt

Every reference MSA from the databank BANK (i.e. COG, PRK, Pfam, TIGR or PhyEco) associated with a given category CTG (i.e. from A to P) can be accessed via the following URL model:

 http://giphy.pasteur.fr/PhyloM/bacteria/aln/BANK/BANK.CTG.aln.tar.gz

Similarly, every PSSM from the databank BANK (i.e. COG, PRK, Pfam, TIGR or PhyEco) associated with a given category CTG (i.e. from A to P) can be accessed via the following URL model:

 http://giphy.pasteur.fr/PhyloM/bacteria/smp/BANK/BANK.CTG.smp.tar.gz

For example, every COG PSSM file that belongs to category A can be downloaded with curl and uncompressed with tar using the following Linux command line:

 curl --silent http://giphy.pasteur.fr/PhyloM/bacteria/smp/COG/COG.A.smp.tar.gz | tar -xz
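
To fetch several categories at once (for instance the highly recommended categories A and B), the three URL models of this subsection can be generated in a loop. This is a minimal sketch; the helper name phylom_category_urls is hypothetical and the function only prints URLs, so nothing is downloaded until its output is passed to curl or wget.

```shell
#!/bin/sh
# Base URL taken from the URL models above.
PHYLOM_BASE="http://giphy.pasteur.fr/PhyloM/bacteria"

# phylom_category_urls BANK CTG... (hypothetical helper)
# For each category, prints the gene list URL, then the MSA and PSSM archive URLs.
phylom_category_urls() {
  bank=$1
  shift
  for ctg in "$@"; do
    printf '%s/cat/%s.txt\n' "$PHYLOM_BASE" "$ctg"
    printf '%s/aln/%s/%s.%s.aln.tar.gz\n' "$PHYLOM_BASE" "$bank" "$bank" "$ctg"
    printf '%s/smp/%s/%s.%s.smp.tar.gz\n' "$PHYLOM_BASE" "$bank" "$bank" "$ctg"
  done
}

# Print the six URLs for the COG files of categories A and B.
phylom_category_urls COG A B
```

Piping the output to xargs -n 1 curl --silent -O saves each file under its remote name; the .tar.gz archives can then be unpacked with tar -xzf.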

Using an MSA for performing a BLAST search against an amino acid sequence databank

Each of the PhyloM MSA files can be used as a query for performing a psiblast search with the BLAST+ tools (Camacho et al. 2009). Let cds.faa be a FASTA-formatted amino acid sequence file (e.g. every CDS from a bacterial genome). This databank should first be formatted with the following Linux command line:

 makeblastdb  -in cds.faa  -dbtype prot

Next, a PhyloM MSA file msa.faa can be directly used as a query for performing a BLAST search with the following Linux command line model:

 psiblast  -in_msa msa.faa  -db cds.faa  -seg no  -word_size 2  -evalue 1E-20  -xdrop_gap_final 1000
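
When a whole set of MSA files has been downloaded, the same search can be repeated for each of them. The sketch below is one possible way to do so: the function name search_all_msa, the DRY_RUN switch and the output file naming are illustrative assumptions, while -outfmt 6 and -out are the standard BLAST+ options for writing tabular results to a file.

```shell
#!/bin/sh
# search_all_msa DB DIR (hypothetical helper)
# Runs psiblast with the parameters above for every *.faa MSA file in DIR
# against the formatted databank DB, writing one tabular file per MSA.
# When DRY_RUN=1, the commands are printed instead of being run.
search_all_msa() {
  db=$1
  dir=$2
  for msa in "$dir"/*.faa; do
    [ -e "$msa" ] || continue                 # no MSA file in DIR
    out="${msa%.faa}.psiblast.tsv"            # assumed output naming
    set -- psiblast -in_msa "$msa" -db "$db" -seg no -word_size 2 \
           -evalue 1E-20 -xdrop_gap_final 1000 -outfmt 6 -out "$out"
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$@"; else "$@"; fi
  done
}

# Preview the commands for the MSA files of the current directory.
DRY_RUN=1
search_all_msa cds.faa .
```

Setting DRY_RUN=0 runs the searches; note that cds.faa itself would be picked up as a query if it sits in the same directory as the MSA files, so keeping the alignments in a separate directory is advisable.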

Using a PSSM for performing a BLAST search against a nucleotide sequence databank

Each of the PhyloM PSSM files can be used as a query for performing a tblastn search with the BLAST+ tools (Camacho et al. 2009). Let seq.fna be a FASTA-formatted nucleotide sequence file (e.g. de novo assembly of a bacterial genome). This databank should first be formatted with the following Linux command line:

 makeblastdb  -in seq.fna  -dbtype nucl

Next, a PhyloM PSSM file pssm.smp can be directly used as a query for performing a BLAST search with the following Linux command line model:

 tblastn  -in_pssm pssm.smp  -db seq.fna  -seg no  -word_size 2  -evalue 1E-20  -xdrop_gap_final 1000

Of note, the corresponding full CDS can be easily extracted with the program eFASTA, using the fields 2, 9 and 10 (subject sequence identifier, subject start and subject end) output by tblastn with the option -outfmt 6.
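
The three fields mentioned above can be pulled out of the tabular output with awk. In the following sketch (hits_to_regions is a hypothetical helper name), hits on the reverse strand, which tblastn reports with a subject start greater than the subject end, are reordered and flagged. How these regions are then passed to eFASTA is not shown here, as its command-line options are not described in this document.

```shell
#!/bin/sh
# hits_to_regions (hypothetical helper)
# Reads tblastn -outfmt 6 lines on stdin and prints, tab-separated:
# subject id (field 2), region start, region end (start <= end) and strand.
hits_to_regions() {
  awk 'BEGIN { FS = OFS = "\t" }
       {
         if ($9 + 0 <= $10 + 0) print $2, $9, $10, "+"   # forward strand hit
         else                   print $2, $10, $9, "-"   # reverse strand hit
       }'
}

# Example with two made-up hits (one per strand) on a contig named contig1.
printf 'q\tcontig1\t90.0\t100\t5\t0\t1\t100\t301\t600\t1e-30\t200\n' | hits_to_regions
printf 'q\tcontig1\t90.0\t100\t5\t0\t1\t100\t900\t601\t1e-30\t200\n' | hits_to_regions
```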


Literature cited

Bratlie MS, Johansen J, Drablos F (2010) Relationship between operon preference and functional properties of persistent genes in bacterial genomes. BMC Genomics, 11:71. doi:10.1186/1471-2164-11-71

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10:421. doi:10.1186/1471-2105-10-421

Creevey CJ, Doerks T, Fitzpatrick DA, Raes J, Bork P (2011) Universally distributed single-copy genes indicate a constant rate of horizontal transfer. PLoS ONE, 6(8):e22099. doi:10.1371/journal.pone.0022099

Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Research, 44:D279-285. doi:10.1093/nar/gkv1344

Galperin MY, Makarova KS, Wolf YI, Koonin EV (2015) Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Research, 43:D261-269. doi:10.1093/nar/gku1223

Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT, White O (2001) TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Research, 29(1):41-43. doi:10.1093/nar/29.1.41

Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4:41. doi:10.1186/1471-2105-4-41

Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on protein families. Science, 278(5338):631-637. doi:10.1126/science.278.5338.631

Wu D, Jospin G, Eisen JA (2013) Systematic identification of gene families for use as "markers" for phylogenetic and phylogeny-driven ecological studies of bacteria and archaea and their major subgroups. PLoS ONE, 8(10):e77033. doi:10.1371/journal.pone.0077033