Institut Pasteur | DBC | Bioinformatics and Biostatistics Hub | GIPhy


PhyloM: bacteria


Description

PhyloM: bacteria is a compilation of markers that are well-suited for phylogenetic tree inference of bacterial taxa. These selected markers are recommended for phylogenetic reconstruction because they have been shown to correspond to persistent genes within bacterial phyla (close to universal distribution). This gene selection mainly relies on the independent and complementary works of Bratlie et al. (2010), Creevey et al. (2011), Wu et al. (2013) and Parks et al. (2017).
Bratlie et al. (2010) classified more than 200 genes with respect to operon participation (i.e. strong or weak operon genes) and duplication event involvement (i.e. singleton or duplicate ortholog sequence sets).
Creevey et al. (2011) compiled 40 universal single-copy marker genes (bacteria, archaea and eukaryotes), leading to genes that are putatively essential and likely little affected by duplication events.
Wu et al. (2013) selected 114 genes associated with informative phylogenetic signal and low variation in copy number across taxa.
Parks et al. (2017) identified 120 phylogenetically informative genes that are generally present as a single-copy sequence in a large proportion of representative genomes (see also Parks et al. 2018, and Chaumeil et al. 2020).

For each phylogenetic marker (and each marker set), this webpage provides the reference multiple amino acid sequence alignments (MSA) and the associated position specific scoring matrices (PSSM) gathered from different sources (when available): COG database (Tatusov et al. 1997, 2003; Galperin et al. 2015), NCBI protein cluster collection (PRK), Pfam database (Finn et al. 2016), TIGRFAMs database (Haft et al. 2001), and PhyEco repository (Wu et al. 2013). MSA and PSSM files can be used to perform accurate BLAST searches against amino acid and nucleotide sequence databanks, respectively (see Usage).


Marker sets

Each of the compiled marker sets (Bratlie et al. 2010; Creevey et al. 2011; Wu et al. 2013; Parks et al. 2017) is summarized in the tables below. Each table provides download links to:
 — a text file containing the gene name list,
 — a tar.gz archive containing the reference multiple amino acid sequence alignments (MSA),
 — a tar.gz archive containing the associated position specific scoring matrices (PSSM).

▹ Persistent Gene Operon Category ('pgoc'; Bratlie et al. 2010)

The 203 loci classified by Bratlie et al. (2010) are labelled as 'pgoc' (persistent gene operon category) or according to the authors' classification:
 — sso (singleton strong operon): not (or little) involved in duplication events (singleton), and likely not involved in horizontal transfer events (strong operon),
 — swo (singleton weak operon): not (or little) involved in duplication events (singleton), but may be involved in horizontal transfer events (weak operon),
 — dso (duplicate strong operon): likely involved in duplication events (duplicate), but likely not involved in horizontal transfer events (strong operon),
 — dwo (duplicate weak operon): likely involved in duplication events (duplicate), and may also be involved in horizontal transfer events (weak operon).
For more details about the definitions of singleton, duplicate, strong operon and weak operon proteins, see the Introduction subsection in Bratlie et al. (2010). For each locus subset (sso, swo, dso, dwo), the table below provides download links to the original MSA and PSSM ('pgoc original', from the COG; see Sup. Table S2), as well as selected MSA and PSSM ('pgoc PhyloM', mainly from the NCBI protein cluster collection). The 67 markers from the set 'sso' can be used to perform phylogenetic tree inference of bacterial taxa, and the selected MSA or PSSM from the set 'pgoc PhyloM' are recommended to perform BLAST searches.

marker set no. loci gene names pgoc (original) pgoc (PhyloM)
sso 67    
swo 41    
dso 40    
dwo 55    

▹ Universal Single Copy Genes ('uscg'; Creevey et al. 2011)

The 40 genes compiled by Creevey et al. (2011) are labelled as 'uscg' (universal single copy genes). The table below provides download links to the original MSA and PSSM ('uscg original', from the COG; see Table 1), as well as selected MSA and PSSM ('uscg PhyloM', mainly from the NCBI protein cluster collection). These 40 markers can be used to perform phylogenetic tree inference of bacterial taxa, and the selected MSA or PSSM from the set 'uscg PhyloM' are recommended to perform BLAST searches.

marker set no. loci gene names  original   PhyloM 
uscg 40    

▹ PhyEco (Wu et al. 2013)

The 114 markers selected by Wu et al. (2013) are labelled as 'PhyEco'. The table below provides download links to the original MSA and PSSM ('PhyEco original'; see bacteria.tgz), as well as selected MSA and PSSM ('PhyEco PhyloM', mainly from the NCBI protein cluster collection). These 114 markers can be used to perform phylogenetic tree inference of any bacterial taxa, and the selected MSA or PSSM from the set 'PhyEco PhyloM' are recommended to perform BLAST searches.

marker set no. loci gene names  original   PhyloM 
PhyEco 114    

▹ bac120 (Parks et al. 2017)

The 120 markers identified by Parks et al. (2017) are labelled as 'bac120'. The table below provides download links to the original MSA and PSSM ('bac120 original'; see Sup. Table S6), as well as selected MSA and PSSM ('bac120 PhyloM', mainly from the NCBI protein cluster collection). These 120 markers can be used to perform phylogenetic tree inference of any bacterial taxa, and the selected MSA or PSSM from the set 'bac120 PhyloM' are recommended to perform BLAST searches.

marker set no. loci gene names  original   PhyloM 
bac120 120    

▹ PhyloM categories

The overall 236 phylogenetic markers are classified into four categories (A to D) according to the number of marker sets (pgoc, uscg, PhyEco and bac120; see above) that include them, from all four sets (A) down to only one (D). For each category A-D, the table below provides download links to selected MSA and PSSM (mainly from the NCBI protein cluster collection). The 25 markers from category A are highly recommended to perform phylogenetic tree inference of any bacterial taxa, as these markers were identified in all four compiled studies (see above). However, to deal with larger datasets, it is also recommended to use the 81 (= 25 + 56) markers from categories A and B, as these markers were identified in most of the four compiled studies.

marker set no. loci gene names  PhyloM 
A 25  
B 56  
C 56  
D 99  
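
As an illustration of the A-D scheme, here is a minimal shell sketch that counts, for each gene name, how many marker-set lists contain it. It rests on assumptions that are not stated above: that categories B and C correspond to three and two sets, respectively, that each list file holds one gene name per line, and that the four pgoc subsets have been merged into a single list. The function name categorize and the file names are hypothetical.

```shell
# categorize LIST1 LIST2 LIST3 LIST4
# Each argument is a text file with one gene name per line (assumed format).
# Prints "gene<TAB>number of sets<TAB>category", with A=4, B=3, C=2, D=1 sets.
categorize() {
  for f in "$@"; do
    sort -u "$f"          # drop duplicate names within a single list
  done | sort | uniq -c | # count in how many lists each name occurs
    awk '{ print $2 "\t" $1 "\t" substr("DCBA", $1, 1) }'
}
```

With local copies of the gene name lists (e.g. a pgoc.txt built by concatenating the sso, swo, dso and dwo lists), calling categorize pgoc.txt uscg.txt phyeco.txt bac120.txt would print one line per gene. The published categories remain the reference; this sketch only shows the counting logic.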

Marker list

The overall 236 phylogenetic markers are listed in the table below. They are sorted according to the reference genome of Escherichia coli strain K12 substr. MG1655 (GenBank accession: NC_000913).
Common gene names are available in the column 'name' with a link to the UniProt description.
When a gene is organized within a well-documented operon, a link to the corresponding RegulonDB description is available in the column 'E. coli operon'.
The column 'E. coli CDS' lists the NCBI accession numbers of the E. coli strain K12 substr. MG1655 CDS with links to the corresponding entries in the NCBI Conserved Domain Database.
For each gene, the corresponding accession number (if available) is given for the COG database ('COG'), the NCBI protein cluster collection ('PRK'), the Pfam database ('Pfam'), the TIGRFAMs database ('TIGRFAMs'), and the PhyEco repository ('PhyEco'). The column 'PhyloM' contains MSA and PSSM that were selected (mainly from PRK) to observe accurate BLAST results in practice.
Presence/absence of each marker in the four compiled studies is indicated in columns 'PhyEco', 'bac120', 'pgoc' and 'uscg', respectively. PhyloM categories A-D (see above) are indicated in the column 'category'.

name E. coli operon E. coli CDS COG PRK Pfam TIGRFAMs PhyloM PhyEco bac120 pgoc uscg category
dnaK   NP_414555 COG0443 PRK00290 pfam00012 TIGR02350 PMB0082   dwo   C
rpsT-S20   NP_414564 COG0268 PRK00239 pfam01649 TIGR00029 PMB0026 B000077 swo   B
ribF NP_414566 COG0196 PRK05627 pfam06574 TIGR00083 PMB0027 B000096 dwo   B
ileS NP_414567 COG0060 PRK05743 pfam00133 TIGR00392 PMB0083   dwo   C
lspA NP_414568 COG0597 PRK00376 pfam01252 TIGR00077 PMB0138     dso   D
carA NP_414573 COG0505 PRK12564 pfam00988 TIGR01368 PMB0139     swo   D
carB NP_414574 COG0458 PRK05294 pfam02786 TIGR01369 PMB0140     dwo   D
rsmA-ksgA   NP_414593 COG0030 PRK00274 pfam00398 TIGR00755 PMB0084   sso   C
rsmH-mraW NP_414624 COG0275 PRK00050 pfam01795 TIGR00006 PMB0028 B000067 sso   B
ftsI NP_414626 COG0768 PRK15105 pfam00905 TIGR02214 PMB0141     dso   D
murE NP_414627 COG0769 PRK00139 pfam08245 TIGR01085 PMB0085 B000105   dso   C
murF NP_414628 COG0770 PRK10773 pfam08245 TIGR01143 PMB0142     dso   D
mraY NP_414629 COG0472 PRK00108 pfam00953 TIGR00445 PMB0086   sso   C
murD NP_414630 COG0771 PRK03806 pfam08245 TIGR01087 PMB0029 B000068 dso   B
ftsW NP_414631 COG0772 PRK10774 pfam01098 TIGR02614 PMB0143     dso   D
murG NP_414632 COG0707 PRK00726 pfam03033 TIGR01133 PMB0087 B000066   dso   C
murC NP_414633 COG0773 PRK00421 pfam08245 TIGR01082 PMB0088   dso   C
ftsA NP_414636 COG0849 PRK09472 pfam14450 TIGR01174 PMB0144 B000110       D
ftsZ NP_414637 COG0206 PRK09330 pfam12327 TIGR00065 PMB0089   dwo   C
secA   NP_414640 COG0653 PRK12904 pfam07517 TIGR00963 PMB0090   dwo   C
coaE   NP_414645 COG0237 PRK00081 pfam01121 TIGR00152 PMB0091 B000085   sso   C
map   NP_414710 COG0024 PRK05716 pfam00557 TIGR00500 PMB0145     dwo   D
rpsB-S2 NP_414711 COG0052 PRK05299 pfam00318 TIGR01011 PMB0001 B000001 sso A
tsf NP_414712 COG0264 PRK09377 pfam00889 TIGR00116 PMB0030 B000056 sso   B
pyrH   NP_414713 COG0528 PRK00358 pfam00696 TIGR02075 PMB0092   sso   C
frr   NP_414714 COG0233 PRK00083 pfam01765 TIGR00496 PMB0031 B000079 sso   B
dxr   NP_414715 COG0743 PRK05447 pfam02670 TIGR00243 PMB0146 B000088       D
uppS   NP_414716 COG0020 PRK10240 pfam01255 TIGR00055 PMB0147     dso   D
rseP   NP_414718 COG0750 PRK10779 pfam02163 TIGR00054 PMB0093   dso   C
rnhB NP_414725 COG0164 PRK00015 pfam01351 TIGR00729 PMB0148 B000039       D
dnaE NP_414726 COG0587 PRK05673 pfam07733 TIGR00594 PMB0149     dwo   D
tilS-mesJ   NP_414730 COG0037 PRK10660 pfam01171 TIGR02432 PMB0032 B000091 sso   B
tgt NP_414940 COG0343 PRK00112 pfam01702 TIGR00430 PMB0150     dwo   D
nrdR NP_414947 COG1327 PRK00464 pfam03477 TIGR00244 PMB0151 B000087       D
nusB NP_414950 COG0781 PRK00202 pfam01029 TIGR01951 PMB0094   sso   C
tig   NP_414970 COG0544 PRK01490 pfam05697 TIGR00115 PMB0095   swo   C
clpX NP_414972 COG1219 PRK05342 pfam07724 TIGR00382 PMB0033 B000112 swo   B
dnaX   NP_415003 COG2812 PRK07994 pfam12170 TIGR02397 PMB0096   sso   C
recR NP_415005 COG0353 PRK00076 pfam13662 TIGR00615 PMB0034 B000080 sso   B
purE NP_415056 COG0041 PLN02948 pfam00731 TIGR01162 PMB0152     dso   D
cysS   NP_415059 COG0215 PRK00260 pfam01406 TIGR00435 PMB0035   swo B
rsfS NP_415170 COG0799 PRK11538 pfam02410 TIGR00090 PMB0097 B000062     C
holA NP_415173 COG1466 PRK05574 pfam14840 TIGR01128 PMB0153       D
leuS NP_415175 COG0495 PRK00390 pfam13603 TIGR00396 PMB0002 B000108 swo A
ybeY NP_415192 COG0319 PRK00016 pfam02130 TIGR00043 PMB0036 B000081 sso   B
uvrB   NP_415300 COG0556 PRK05298 pfam12344 TIGR00631 PMB0098   swo   C
infA NP_415404 COG0361 PRK00276 pfam01176 TIGR00008 PMB0154     dwo   D
ftsK   NP_415410 COG1674 PRK10263 pfam01580 TIGR03928 PMB0155     dwo   D
serS   NP_415413 COG0172 PRK05431 pfam00587 TIGR00414 PMB0037 B000073   B
aroA NP_415428 COG0128 PRK11860 pfam00275 TIGR01356 PMB0156     dso   D
rpsA-S1 NP_415431 COG0539 PRK06299 pfam00575 TIGR00717 PMB0099   swo   C
rpmF-L32 NP_415607 COG0333 PRK01110 pfam01783 TIGR01031 PMB0157 B000106       D
plsX NP_415608 COG0416 PRK05331 pfam02504 TIGR00182 PMB0158 B000090       D
fabD NP_415610 COG0331 PLN02752 pfam00698 TIGR00128 PMB0159     dso   D
fabG NP_415611 COG1028 PRK05557 pfam13561 TIGR01830 PMB0160     dso   D
acpP NP_415612 COG0236 PRK00982 pfam00550 TIGR00517 PMB0161     dwo   D
fabF NP_415613 COG0304 PRK07314 pfam00109 TIGR03150 PMB0162     dso   D
tmk NP_415616 COG0125 PRK00698 pfam02223 TIGR00041 PMB0163     sso   D
holB NP_415617 COG0470 PRK07993 pfam09115 TIGR00678 PMB0164     sso   D
ycfH NP_415618 COG0084 PRK10812 pfam01026 TIGR00010 PMB0165     dso   D
mfd   NP_415632 COG1197 PRK10689 pfam03461 TIGR00580 PMB0038 B000046 swo   B
purB NP_415649 COG0015 PRK09285 pfam00206 TIGR00928 PMB0166       D
mnmA-trmU   NP_415651 COG0482 PRK00143 pfam03054 TIGR00420 PMB0039 B000098 dwo   B
ychF NP_415721 COG0012 PRK09601 pfam06071 TIGR00092 PMB0003 B000083 swo A
pth NP_415722 COG0193 PRK05426 pfam01195 TIGR00447 PMB0100 B000103   dso   C
prsA NP_415725 COG0462 PRK01259 pfam13793 TIGR01251 PMB0167     dwo   D
ispE NP_415726 COG1947 PRK00343 pfam08544 TIGR00154 PMB0168     sso   D
prfA NP_415729 COG0216 PRK00591 pfam03462 TIGR00019 PMB0040 B000053 sso   B
pemK-prmC NP_415730 COG2890 PRK09328 pfam05175 TIGR00536 PMB0169     dso   D
nth   NP_416150 COG0177 PRK10702 pfam00730 TIGR01083 PMB0170     dso   D
pheT NP_416228 COG0072 PRK00629 pfam03483 TIGR00472 PMB0041 B000013 sso   B
pheS NP_416229 COG0016 PRK00488 pfam01409 TIGR00468 PMB0004 B000020 sso A
rplT-L20 NP_416231 COG0292 PRK05185 pfam00453 TIGR01032 PMB0042 B000070 swo   B
rpmI-L35 NP_416232 COG0291 PRK00172 pfam01632 TIGR00001 PMB0101 B000099   swo   C
infC NP_416233 COG0290 PRK00028 pfam00707 TIGR00168 PMB0043 B000104 swo   B
thrS NP_416234 COG0441 PRK00413 pfam00587 TIGR00418 PMB0171     dwo   D
tsaB-yeaZ   NP_416321 COG1214 PRK09604 pfam00814 TIGR03725 PMB0102   sso   C
ruvB NP_416374 COG2255 PRK00080 pfam05496 TIGR00635 PMB0044 B000072 sso   B
ruvA NP_416375 COG0632 PRK00116 pfam01330 TIGR00084 PMB0045 B000071 sso   B
ruvC NP_416377 COG0817 PRK00039 pfam02075 TIGR00228 PMB0172 B000093       D
yebC NP_416378 COG0217 PRK00110 pfam01709 TIGR01033 PMB0173     dwo   D
aspS   NP_416380 COG0173 PRK00476 pfam00152 TIGR00459 PMB0103   dwo   C
argS   NP_416390 COG0018 PRK01611 pfam00750 TIGR00456 PMB0046   dwo B
pgsA   NP_416422 COG0558 PRK10832 pfam01066 TIGR00560 PMB0174     swo   D
uvrC NP_416423 COG0322 PRK00558 pfam08459 TIGR00194 PMB0104   swo   C
metG   NP_416617 COG0143 PRK00133 pfam09334 TIGR00398 PMB0175       D
rplY-L25   NP_416690 COG1825 PRK05943 pfam01386 TIGR00731 PMB0176 B000059       D
trxA   NP_416699 COG0526 PRK15412 pfam08534 TIGR00385 PMB0177     dwo   D
gyrA   NP_416734 COG0188 PRK05560 pfam00521 TIGR01063 PMB0105   dwo   C
purF NP_416815 COG0034 PRK09246 pfam13537 TIGR01134 PMB0178     dwo   D
folC NP_416818 COG0285 PRK10846 pfam08245 TIGR01499 PMB0179     dso   D
truA NP_416821 COG0101 PRK00021 pfam01416 TIGR00071 PMB0180     dso   D
aroC   NP_416832 COG0082 PRK05382 pfam01264 TIGR00033 PMB0181     sso   D
gltX   NP_416899 COG0008 PRK01406 pfam00749 TIGR00464 PMB0182     dwo   D
ligA NP_416906 COG0272 PRK07956 pfam01653 TIGR00575 PMB0106 B000109   dwo   C
purM NP_416994 COG0150 PRK05385 pfam02769 TIGR00878 PMB0107 B000038   sso   C
guaB   NP_417003 COG0516 PRK05567 pfam00478 TIGR01302 PMB0183       D
der-engA NP_417006 COG1160 PRK00093 pfam01926 TIGR03594 PMB0047 B000043 swo   B
hisS   NP_417009 COG0124 PRK00037 pfam13393 TIGR00442 PMB0048   dso B
era NP_417061 COG1159 PRK00089 pfam01926 TIGR00436 PMB0108   sso   C
rnc NP_417062 COG0571 PRK00102 pfam14622 TIGR02191 PMB0109   dso   C
lepA NP_417064 COG0481 PRK05433 pfam00009 TIGR01393 PMB0049 B000004 dwo   B
clpB   NP_417083 COG0542 PRK10865 pfam07724 TIGR03346 PMB0184     dwo   D
rluD-sfhB   NP_417085 COG0564 PRK11180 pfam00849 TIGR00005 PMB0185     swo   D
rplS-L19 NP_417097 COG0335 PRK05338 pfam01245 TIGR01024 PMB0110 B000069   swo   C
trmD NP_417098 COG0336 PRK00026 pfam01746 TIGR00088 PMB0050 B000058 sso   B
rimM NP_417099 COG0806 PRK00122 pfam01782 TIGR02273 PMB0051 B000086 sso   B
rpsP-S16 NP_417100 COG0228 PRK00040 pfam00886 TIGR00002 PMB0111 B000094   sso   C
ffh   NP_417101 COG0541 PRK10867 pfam00448 TIGR00959 PMB0005 B000008 swo A
grpE   NP_417104 COG0576 PRK10325 pfam01025   PMB0112   dwo   C
nadK-ppnK   NP_417105 COG0061 PRK03378 pfam01513   PMB0186     dwo   D
recN   YP_026172 COG0497 PRK10869 pfam13476 TIGR00634 PMB0113 B000078     C
smpB   NP_417110 COG0691 PRK05422 pfam01668 TIGR00086 PMB0052 B000065 swo   B
alaS   NP_417177 COG0013 PRK00252 pfam01411 TIGR00344 PMB0114   swo   C
recA NP_417179 COG0468 PRK09354 pfam00154 TIGR02012 PMB0053 B000095 dwo   B
mutS   NP_417213 COG0249 PRK05399 pfam00488 TIGR01070 PMB0187 B000076       D
ispF   NP_417226 COG0245 PRK00084 pfam02542 TIGR00151 PMB0188 B000089       D
eno NP_417259 COG0148 PRK00077 pfam00113 TIGR01060 PMB0189     dwo   D
pyrG NP_417260 COG0504 PRK05380 pfam06418 TIGR00337 PMB0054 B000047 swo   B
prfB   NP_417367 COG1186 PRK00578 pfam03462 TIGR00020 PMB0190       D
metK   NP_417417 COG0192 PRK05250 pfam02773 TIGR01034 PMB0191     dwo   D
rsmE   NP_417421 COG1385 PRK11713 pfam04452 TIGR00046 PMB0192     dwo   D
yqgF NP_417424 COG0816 PRK00109 pfam03652 TIGR00250 PMB0115   sso   C
rdgB NP_417429 COG0127 PRK00120 pfam01725 TIGR00042 PMB0193     dso   D
hemW NP_417430 COG0635 PRK05660 pfam04055 TIGR00539 PMB0194       D
tsaD   NP_417536 COG0533 PRK09604 pfam00814 TIGR03723 PMB0006 B000006 swo A
dnaG NP_417538 COG0358 PRK05667 pfam08275 TIGR01391 PMB0055 B000092 swo   B
rpoD NP_417539 COG0568 PRK05658 pfam04546 TIGR02393 PMB0195     dwo   D
yraL   NP_417615 COG0313 PRK14994 pfam00590 TIGR00096 PMB0196     dwo   D
pnp NP_417633 COG1185 PRK11824 pfam03726 TIGR03591 PMB0056 B000055 swo   B
rpsO-S15 NP_417634 COG0184 PRK05626 pfam00312 TIGR00952 PMB0057 B000034   swo B
truB NP_417635 COG0130 PRK05033 pfam01509 TIGR00431 PMB0058 B000032 sso   B
rbfA NP_417636 COG0858 PRK00521 pfam02033 TIGR00082 PMB0059 B000063 sso   B
infB NP_417637 COG0532 PRK05306 pfam11987 TIGR00487 PMB0060 B000005 sso   B
nusA NP_417638 COG0195 PRK09202 pfam08529 TIGR01953 PMB0061 B000041 sso   B
rimP NP_417639 COG0779 PRK14640 pfam02576   PMB0197       D
secG NP_417642 COG1314 PRK06870 pfam03840 TIGR00810 PMB0198       D
folP NP_417644 COG0294 PRK11613 pfam00809 TIGR01496 PMB0199     dso   D
hflB-ftsH   NP_417645 COG0465 PRK10733 pfam01434 TIGR01241 PMB0200     dwo   D
greA   NP_417648 COG0782 PRK00226 pfam03449 TIGR01462 PMB0201     dwo   D
obg NP_417650 COG0536 PRK12298 pfam01018 TIGR02729 PMB0062 B000049 swo   B
rpmA-L27 NP_417652 COG0211 PRK05435 pfam01016 TIGR00062 PMB0116 B000102   sso   C
rplU-L21 NP_417653 COG0261 PRK05573 pfam00829 TIGR00061 PMB0063 B000074 sso   B
ispB   NP_417654 COG0142 PRK10888 pfam00348 TIGR02749 PMB0202     dwo   D
murA NP_417656 COG0766 PRK09369 pfam00275 TIGR01072 PMB0203     dwo   D
rpsI-S9 NP_417697 COG0103 PRK00132 pfam00380 TIGR03627 PMB0007 B000011 sso A
rplM-L13 NP_417698 COG0102 PRK09216 pfam00572 TIGR01066 PMB0008 B000037 sso A
smf   YP_026211 COG0758 PRK10736 pfam02481 TIGR00732 PMB0204     dwo   D
mreC NP_417716 COG1792 PRK13922 pfam04085 TIGR00219 PMB0205 B000101       D
def NP_417745 COG0242 PRK00150 pfam01327 TIGR00079 PMB0206     dso   D
fmt NP_417746 COG0223 PRK00005 pfam00551 TIGR00460 PMB0117   sso   C
rplQ-L17 NP_417753 COG0203 PRK05591 pfam01196 TIGR00059 PMB0064 B000057 sso   B
rpoA NP_417754 COG0202 PRK05182 pfam01193 TIGR02027 PMB0009 B000052 dso A
rpsD-S4 NP_417755 COG0522 PRK05327 pfam00163 TIGR01017 PMB0065   dwo B
rpsK-S11 NP_417756 COG0100 PRK05309 pfam00411 TIGR03632 PMB0010 B000029 sso A
rpsM-S13 NP_417757 COG0099 PRK05179 pfam00416 TIGR03631 PMB0066 B000019   sso B
prlA-secY NP_417759 COG0201 PRK09204 pfam00344 TIGR00967 PMB0011 B000048 dso A
rplO-L15 NP_417760 COG0200 PRK05592 pfam00828 TIGR01071 PMB0012 B000021 sso A
rpsE-S5 NP_417762 COG0098 PRK00550 pfam03719 TIGR01021 PMB0013 B000015 sso A
rplR-L18 NP_417763 COG0256 PRK05593 pfam00861 TIGR00060 PMB0067 B000033   sso B
rplF-L6 NP_417764 COG0097 PRK05498 pfam00347 TIGR03654 PMB0014 B000023 sso A
rpsH-S8 NP_417765 COG0096 PRK00136 pfam00410   PMB0015 B000031 sso A
rpsN-S14 NP_417766 COG0199 PRK08881 pfam00253   PMB0207     dso   D
rplE-L5 NP_417767 COG0094 PRK00010 pfam00673   PMB0068 B000025   sso B
rplX-L24 NP_417768 COG0198 PRK00004 pfam17136 TIGR01079 PMB0069 B000040 sso   B
rplN-L14 NP_417769 COG0093 PRK05483 pfam00238 TIGR01067 PMB0070 B000014   sso B
rpsQ-S17 NP_417770 COG0186 PRK05610 pfam00366 TIGR03635 PMB0071 B000036   dso B
rpmC-L29 NP_417771 COG0255 PRK00306 pfam00831 TIGR00012 PMB0208 B000027       D
rplP-L16 NP_417772 COG0197 PRK09203 pfam00252 TIGR01164 PMB0016 B000018 sso A
rpsC-S3 NP_417773 COG0092 PRK00310 pfam00189 TIGR01009 PMB0017 B000028 sso A
rplV-L22 NP_417774 COG0091 PRK00565 pfam00237 TIGR01044 PMB0018 B000007 sso A
rpsS-S19 NP_417775 COG0185 PRK00357 pfam00203 TIGR01050 PMB0072 B000016   sso B
rplB-L2 NP_417776 COG0090 PRK09374 pfam03947 TIGR01171 PMB0019 B000010 sso A
rplW-L23 NP_417777 COG0089 PRK05738 pfam00276 TIGR03636 PMB0118 B000022   sso   C
rplD-L4 NP_417778 COG0088 PRK05319 pfam00573 TIGR03953 PMB0020 B000009 sso A
rplC-L3 NP_417779 COG0087 PRK00001 pfam00297 TIGR03625 PMB0021 B000012 sso A
rpsJ-S10 NP_417780 COG0051 PRK00596 pfam00338 TIGR01049 PMB0119 B000002   sso   C
rpsG-S7 NP_417800 COG0049 PRK05302 pfam00177 TIGR01029 PMB0022 B000017 sso A
rpsL-S12 NP_417801 COG0048 PRK05163 pfam00164 TIGR00981 PMB0073 B000026   swo B
trpS NP_417843 COG0180 PRK00927 pfam00579 TIGR00233 PMB0209     dwo   D
ftsY   NP_417921 COG0552 PRK10416 pfam00448 TIGR00064 PMB0074   swo B
rsmD NP_417922 COG0742 PRK10909 pfam03602 TIGR00095 PMB0120   sso   C
glyS NP_418016 COG0751 PRK01233 pfam02092 TIGR00211 PMB0210 B000097       D
gpsA NP_418065 COG0240 PRK00094 pfam07479 TIGR03376 PMB0211     dso   D
kdtB-coaD NP_418091 COG0669 PRK00168 pfam01467 TIGR01510 PMB0121   swo   C
rpmB-L28 NP_418094 COG0227 PRK00359 pfam00830 TIGR00009 PMB0212     dwo   D
gmk   NP_418105 COG0194 PRK00300 pfam00625 TIGR03263 PMB0122   swo   C
spoT-relA NP_418107 COG0317 PRK11092 pfam13328 TIGR00691 PMB0213     dwo   D
recG NP_418109 COG1200 PRK10917 pfam00270 TIGR00643 PMB0214       D
gyrB   YP_026241 COG0187 PRK14939 pfam00204 TIGR01059 PMB0123   dwo   C
recF NP_418155 COG1195 PRK00064 pfam02463 TIGR00611 PMB0215 B000113       D
dnaN NP_418156 COG0592 PRK05643 pfam02768 TIGR00663 PMB0124   dwo   C
dnaA NP_418157 COG0593 PRK00149 pfam00308 TIGR00362 PMB0075 B000084 dwo   B
yidC   NP_418161 COG0706 PRK01318 pfam14849 TIGR03593 PMB0216     dwo   D
thdF   NP_418162 COG0486 PRK05291 pfam12631 TIGR00450 PMB0217     swo   D
glmS NP_418185 COG0449 PRK00331 pfam01380 TIGR01135 PMB0218     dwo   D
atpD NP_418188 COG0055 PRK09280 pfam00006 TIGR01039 PMB0125   dso   C
atpG NP_418189 COG0224 PRK05621 pfam00231 TIGR01146 PMB0126   dso   C
atpA NP_418190 COG0056 PRK09281 pfam00006 TIGR00962 PMB0219     dso   D
atpH NP_418191 COG0712 PRK05758 pfam00213 TIGR01145 PMB0220     dso   D
gidB-rsmG NP_418196 COG0357 PRK00107 pfam02527 TIGR00138 PMB0127   dwo   C
gidA-mnmG NP_418197 COG0445 PRK05192 pfam01134 TIGR00136 PMB0128 B000061   swo   C
hemC YP_026260 COG0181 PRK00072 pfam01379 TIGR00212 PMB0221 B000035       D
uvrD   NP_418258 COG0210 PRK11773 pfam00580 TIGR01075 PMB0222     dwo   D
polA   NP_418300 COG0749 PRK05755 pfam00476 TIGR00593 PMB0076 B000050 dwo   B
hemN   NP_418303 COG0635 PRK09249 pfam04055 TIGR00538 PMB0223     sso   D
typA   YP_026274 COG1217 PRK10218 pfam00009 TIGR01394 PMB0129 B000111     C
tpiA   NP_418354 COG0149 PRK00042 pfam00121 TIGR00419 PMB0224     dwo   D
priA   NP_418370 COG1198 PRK05580 pfam00270 TIGR00595 PMB0130 B000045   swo   C
murB NP_418403 COG0812 PRK00046 pfam02873 TIGR00179 PMB0131 B000114   dso   C
secE NP_418408 COG0690 PRK05740 pfam00584 TIGR00964 PMB0225       D
nusG NP_418409 COG0250 PRK05609 pfam02357 TIGR00922 PMB0132   sso   C
rplK-L11 NP_418410 COG0080 PRK00140 pfam00298 TIGR01632 PMB0023 B000024 sso A
rplA-L1 NP_418411 COG0081 PRK05424 pfam00687 TIGR01169 PMB0024 B000003 sso A
rplJ-L10 NP_418412 COG0244 PRK00099 pfam00466   PMB0077 B000030 sso   B
rplL-L7L12 NP_418413 COG0222 PRK00157 pfam00542 TIGR00855 PMB0133 B000107   swo   C
rpoB NP_418414 COG0085 PRK00405 pfam00562 TIGR02013 PMB0025 B000042 sso A
rpoC NP_418415 COG0086 PRK00566 pfam04998 TIGR02386 PMB0078 B000044 swo   B
hemE   NP_418425 COG0407 PRK00115 pfam01208 TIGR01464 PMB0226 B000100       D
purD NP_418433 COG0151 PRK00885 pfam01071 TIGR00877 PMB0227     swo   D
purH NP_418434 COG0138 PRK00881 pfam01808 TIGR00355 PMB0228     dso   D
dnaB   NP_418476 COG0305 PRK08006 pfam03796 TIGR00665 PMB0229     dwo   D
uvrA   NP_418482 COG0178 PRK00349 pfam00005 TIGR00630 PMB0230     dwo   D
groES NP_418566 COG0234 PRK00364 pfam00166   PMB0231     dwo   D
efp   NP_418571 COG0231 PRK00529 pfam09285 TIGR00038 PMB0232     dwo   D
tsaE-yjeE NP_418589 COG0802 PRK10646 pfam02367 TIGR00150 PMB0233     dso   D
mutL NP_418591 COG0323 PRK00095 pfam01119 TIGR00585 PMB0234 B000064       D
miaA NP_418592 COG0324 PRK00091 pfam01715 TIGR00174 PMB0134 B000082   swo   C
purA   NP_418598 COG0104 PRK01117 pfam00709 TIGR00184 PMB0235     dwo   D
rlmB NP_418601 COG0566 PRK11181 pfam00588 TIGR00186 PMB0135   swo   C
rpsF-S6 NP_418621 COG0360 PRK00453 pfam01250 TIGR00166 PMB0079 B000051 sso   B
rpsR-S18 NP_418623 COG0238 PRK00391 pfam01084 TIGR00165 PMB0136 B000075   dso   C
rplI-L9 NP_418624 COG0359 PRK00137 pfam03948 TIGR00158 PMB0080 B000054 sso   B
pyrB NP_418666 COG0540 PRK00856 pfam02729 TIGR00670 PMB0236     dso   D
valS NP_418679 COG0525 PRK05729 pfam00133 TIGR00422 PMB0137     swo C
sms-radA NP_418806 COG1066 PRK11823 pfam06745 TIGR00416 PMB0081 B000060 swo   B

Usage

Downloading multiple sequence alignment (MSA) or position specific scoring matrix (PSSM) files

For each gene name GENE and each accession number ACCN available in the above table, the reference multiple amino acid sequence alignment (MSA) can be accessed via the following URL model:

 http://giphy.pasteur.fr/PhyloM/bacteria/aln/GENE.ACCN.faa

and the associated position specific scoring matrix (PSSM) via the following URL model:

 http://giphy.pasteur.fr/PhyloM/bacteria/smp/GENE.ACCN.smp

For example, the reference MSA COG0359 for the gene rplI-L9 can be downloaded using wget with the following Linux command line:

 wget -q http://giphy.pasteur.fr/PhyloM/bacteria/aln/rplI-L9.COG0359.faa

The same download can also be performed using curl with the following command line:

 curl --silent -O http://giphy.pasteur.fr/PhyloM/bacteria/aln/rplI-L9.COG0359.faa
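
The two URL models above can be wrapped in a small helper, sketched here under the assumption that both the MSA and the PSSM exist for the chosen gene and accession number. The function name phylom_url is hypothetical; only the URL patterns come from this page.

```shell
# Base URL taken from the URL models above.
PHYLOM_BASE="http://giphy.pasteur.fr/PhyloM/bacteria"

# phylom_url KIND GENE ACCN
# KIND is 'aln' (MSA, extension .faa) or 'smp' (PSSM, extension .smp).
# Prints the corresponding file URL; returns 1 for an unknown KIND.
phylom_url() {
  case "$1" in
    aln) ext="faa" ;;
    smp) ext="smp" ;;
    *) echo "unknown kind: $1" >&2; return 1 ;;
  esac
  printf '%s/%s/%s.%s.%s\n' "$PHYLOM_BASE" "$1" "$2" "$3" "$ext"
}

# Example (commented out, as it needs network access):
# wget -q "$(phylom_url aln rplI-L9 COG0359)" "$(phylom_url smp rplI-L9 COG0359)"
```

Building the URL in one place keeps the gene name and accession number consistent between the MSA and PSSM downloads.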

Downloading files associated with a marker set

For each marker set MSET (i.e. sso, swo, dso, dwo, PhyloM.sso, PhyloM.swo, PhyloM.dso, PhyloM.dwo, uscg, PhyloM.uscg, phyeco, PhyloM.phyeco, bac120, PhyloM.bac120, PhyloM.A, PhyloM.B, PhyloM.C, PhyloM.D), three different files can be downloaded:
 — a text file containing the gene name list,
 — a tar.gz archive containing the reference MSAs,
 — a tar.gz archive containing the associated PSSM.

The gene name list associated with a marker set MSET can be accessed via the following URL model:

 http://giphy.pasteur.fr/PhyloM/bacteria/cat/MSET.txt

The MSAs associated with a marker set MSET can be accessed via the following URL model:

 http://giphy.pasteur.fr/PhyloM/bacteria/cat/MSET.aln.tar.gz

The PSSMs associated with a marker set MSET can be accessed via the following URL model:

 http://giphy.pasteur.fr/PhyloM/bacteria/cat/MSET.smp.tar.gz
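
The three files of a marker set can be fetched in one go with a helper along the following lines. This is only a sketch: the function names mset_urls and fetch_mset and the target directory are hypothetical, and the URL patterns are the ones listed above.

```shell
PHYLOM_CAT="http://giphy.pasteur.fr/PhyloM/bacteria/cat"

# mset_urls MSET
# Prints the three URLs of a marker set: gene list, MSA archive, PSSM archive.
mset_urls() {
  for suffix in txt aln.tar.gz smp.tar.gz; do
    printf '%s/%s.%s\n' "$PHYLOM_CAT" "$1" "$suffix"
  done
}

# fetch_mset MSET [DIR]
# Downloads the three files into DIR (default: current directory) using curl.
fetch_mset() {
  dir="${2:-.}"
  mkdir -p "$dir" || return 1
  mset_urls "$1" | while read -r url; do
    (cd "$dir" && curl --silent --fail -O "$url") || echo "failed: $url" >&2
  done
}

# Example (commented out, as it needs network access):
# fetch_mset PhyloM.A markers && tar -xzf markers/PhyloM.A.aln.tar.gz -C markers
```

The curl option --fail makes a missing file show up as an error message instead of leaving an HTML error page saved under the archive name.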

Using an MSA for performing a BLAST search against an amino acid sequence databank

Each of the MSA files can be used as a query for performing a psiblast search using the BLAST+ tools (Camacho et al. 2009). Let cds.faa be a FASTA-formatted amino acid sequence file (e.g. every CDS from a bacterial genome). This databank should first be formatted with the following Linux command line (recent BLAST+ versions require the option -dbtype to be set explicitly):

 makeblastdb  -in cds.faa  -dbtype prot

Next, an MSA file msa.faa can be directly used as a query for performing a BLAST search with the following Linux command line model:

 psiblast  -in_msa msa.faa  -db cds.faa  -seg no  -word_size 2  -evalue 0.05  -xdrop_gap_final 1000
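
When several markers are searched against the same databank, the psiblast command above can be run in a loop. The sketch below assumes that the MSA files end in .faa and that tabular output (-outfmt 6) written to one file per marker is wanted; the function name search_msas and the PSIBLAST override variable are hypothetical conveniences, not part of BLAST+.

```shell
# search_msas DB MSA...
# Runs one psiblast search per MSA file against the formatted databank DB,
# writing tabular hits (-outfmt 6) to <MSA basename>.tsv in the current directory.
# Setting PSIBLAST="echo psiblast" prints the commands instead of running them.
search_msas() {
  db="$1"; shift
  for msa in "$@"; do
    out="$(basename "$msa" .faa).tsv"
    ${PSIBLAST:-psiblast} -in_msa "$msa" -db "$db" -seg no -word_size 2 \
      -evalue 0.05 -xdrop_gap_final 1000 -outfmt 6 -out "$out"
  done
}

# Example (commented out, as it needs BLAST+ and a formatted databank):
# search_msas cds.faa aln/*.faa
```

The search parameters are those of the command line model above; only the output options -outfmt and -out are added.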

Using a PSSM for performing a BLAST search against a nucleotide sequence databank

Each of the PSSM files can be used as a query for performing a tblastn search using the BLAST+ tools (Camacho et al. 2009). Let seq.fna be a FASTA-formatted nucleotide sequence file (e.g. de novo assembly of a bacterial genome). This databank should first be formatted with the following Linux command line:

 makeblastdb  -in seq.fna  -dbtype nucl

Next, a PSSM file pssm.smp can be directly used as a query for performing a BLAST search with the following linux command line model:

 tblastn  -in_pssm pssm.smp  -db seq.fna  -seg no  -word_size 2  -evalue 0.05  -xdrop_gap_final 1000

Of note, the corresponding full CDS can be easily extracted by using the program eFASTA along with fields 2, 9 and 10 output by the tblastn option -outfmt 6 (i.e. the subject sequence identifier, and the start and end of the alignment on the subject sequence). The tool eCDS can also be used to easily extract the full CDS associated with each tblastn hit.
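
Fields 2, 9 and 10 can be pulled out of the tabular tblastn output with awk, as in the following sketch. It assumes the default -outfmt 6 column order and infers the strand from the order of the two coordinates (for tblastn hits on the reverse strand, the subject start is larger than the subject end). The function name hit_coords is hypothetical, and the sketch only reports the aligned region, not the full CDS that eFASTA or eCDS would extract.

```shell
# hit_coords
# Reads tblastn -outfmt 6 lines on standard input and prints, for each hit:
# subject id (field 2), leftmost and rightmost subject coordinates
# (fields 9 and 10, reordered if needed) and the inferred strand (+ or -).
hit_coords() {
  awk -F '\t' 'BEGIN { OFS = "\t" }
    {
      if ($9 + 0 <= $10 + 0) print $2, $9, $10, "+"
      else                   print $2, $10, $9, "-"
    }'
}

# Example (commented out, as it needs BLAST+ and a formatted databank):
# tblastn -in_pssm pssm.smp -db seq.fna -seg no -word_size 2 -evalue 0.05 \
#   -xdrop_gap_final 1000 -outfmt 6 | hit_coords
```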


Literature cited

Bratlie MS, Johansen J, Drablos F (2010) Relationship between operon preference and functional properties of persistent genes in bacterial genomes. BMC Genomics, 11:71. doi:10.1186/1471-2164-11-71

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10:421. doi:10.1186/1471-2105-10-421

Chaumeil P-A, Mussig AJ, Hugenholtz P, Parks DH (2020) GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics, 36(6):1925-1927. doi:10.1093/bioinformatics/btz848

Creevey CJ, Doerks T, Fitzpatrick DA, Raes J, Bork P (2011) Universally distributed single-copy genes indicate a constant rate of horizontal transfer. PLoS ONE, 6(8):e22099. doi:10.1371/journal.pone.0022099

Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Research, 44:D279-285. doi:10.1093/nar/gkv1344

Galperin MY, Makarova KS, Wolf YI, Koonin EV (2015) Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Research, 43:D261-9. doi:10.1093/nar/gku1223

Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT, White O (2001) TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Research, 29(1):41-43. doi:10.1093/nar/29.1.41

Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil P-A, Hugenholtz P (2018) A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nature Biotechnology, 36:996-1004. doi:10.1038/nbt.4229

Parks DH, Rinke C, Chuvochina M, Chaumeil P-A, Woodcroft BJ, Evans PN, Hugenholtz P, Tyson GW (2017) Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nature Microbiology, 2:1533-1542. doi:10.1038/s41564-017-0012-7

Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4:41. doi:10.1186/1471-2105-4-41

Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on protein families. Science, 278(5338):631-637. doi:10.1126/science.278.5338.631

Wu D, Jospin G, Eisen JA (2013) Systematic identification of gene families for use as "markers" for phylogenetic and phylogeny-driven ecological studies of bacteria and archaea and their major subgroups. PLoS ONE, 8(10):e77033. doi:10.1371/journal.pone.0077033