PhyloM: Bacillaceae is a selection of markers well suited to inferring phylogenetic trees of members of the Bacillaceae family. This bacterial family (phylum: Firmicutes; class: Bacilli; order: Bacillales) contains more than one hundred genera, including the well-known genus Bacillus. The gene selection relies on the 87 Firmicutes markers (dataset BM1) of Wu et al. (2013), sometimes named the phyloeco dataset (e.g. Patel and Gupta 2020, Gupta et al. 2020). The dataset BM1 is supplemented with two smaller gene sets (datasets BM2 and BM3) described by Patel and Gupta (2020).
Three sets of Bacillaceae markers (91 loci in total) are summarized below: BM1, BM2 and BM3 (Wu et al. 2013, Patel and Gupta 2020, Gupta et al. 2020). For each of them, the list of associated gene names is available. In addition, tar.gz archives can be downloaded, each containing reference multiple amino acid sequence alignments (MSA) or position-specific scoring matrices (PSSM) gathered from the COG database (Tatusov et al. 1997, 2003; Galperin et al. 2015), the NCBI protein cluster collection (PRK), the Pfam database (Finn et al. 2016), the TIGRFAMs database (Haft et al. 2001), and the PhyEco repository (Wu et al. 2013). Curated MSA and PSSM inferred specifically from 513 Bacillaceae type strain genomes are also downloadable (PhyloM). For each dataset and source, the total number of available MSA/PSSM is indicated in parentheses.
dataset | genes | COG | PRK | Pfam | TIGR | PhyEco | PhyloM |
BM1 | (87) | (87) | (86) | (87) | (81) | (81) | (87) |
BM2 | (4) | (4) | (4) | (4) | (4) | (2) | (4) |
BM3 | (2) | (2) | (2) | (2) | (2) | (0) | (2) |
Of note, the 91 initial (non-curated) MSA built from 513 Bacillaceae type strain genomes to create the PhyloM databank (last column) are also available for download.
All the 91 selected genes are listed in the table below.
They are sorted according to their order in the reference genome of Bacillus subtilis strain 168 (GenBank accession: NC_000964).
Common gene names are given in the column name, with a link to the UniProt description.
The column B. subtilis CDS lists the NCBI accession numbers of the B. subtilis str. 168 CDS with links to the corresponding entries in the NCBI Conserved Domain Database.
Datasets BM1, BM2 or BM3 are indicated in the last column BM.
For each gene, the corresponding accession number (if available) is given for
the COG database (Tatusov et al. 1997, 2003; Galperin et al. 2015),
the NCBI protein cluster collection (PRK),
the Pfam database (Finn et al. 2016),
the TIGRFAMs database (Haft et al. 2001),
the PhyEco repository (Wu et al. 2013), and
the current PhyloM: Bacillaceae repository.
For each non-empty entry, the corresponding MSA is available (with the consensus as first sequence), as well as the associated PSSM.
name | B. subtilis CDS | COG | PRK | Pfam | TIGRFAMs | PhyEco | PhyloM | BM |
recF | NP_387885 | COG1195 | PRK00064 | pfam13175 | TIGR00611 | FIRM000113 | PMBAC001 | 1 |
gyrB | NP_387887 | COG0187 | PRK05644 | pfam00204 | TIGR01059 | | PMBAC088 | 2 |
gyrA | NP_387888 | COG0188 | PRK05560 | pfam00521 | TIGR01063 | | PMBAC089 | 2 |
dnaX | NP_387900 | COG2812 | PRK05563 | pfam13177 | TIGR02397 | FIRM000123 | PMBAC002 | 1 |
recR | NP_387902 | COG0353 | PRK00076 | pfam13662 | TIGR00615 | FIRM000080 | PMBAC003 | 1 |
rsmI | NP_387917 | COG0313 | PRK14994 | pfam00590 | TIGR00096 | FIRM000119 | PMBAC004 | 1 |
metG | NP_387919 | COG0143 | PRK12267 | pfam09334 | TIGR00398 | FIRM000139 | PMBAC005 | 1 |
pth | NP_387934 | COG0193 | PRK05426 | pfam01195 | TIGR00447 | FIRM000103 | PMBAC006 | 1 |
radA | NP_387968 | COG1066 | PRK11823 | pfam18073 | TIGR00416 | FIRM000060 | PMBAC007 | 1 |
cysS | NP_387975 | COG0215 | PRK00260 | pfam01406 | TIGR00435 | FIRM000133 | PMBAC008 | 1 |
rplA-L1 | NP_387984 | COG0081 | PRK05424 | pfam00687 | TIGR01169 | FIRM000003 | PMBAC009 | 1 |
rplJ-L10 | NP_387985 | COG0244 | PRK00099 | pfam00466 | | FIRM000030 | PMBAC010 | 1 |
rplL-L12 | NP_387986 | COG0222 | PRK00157 | pfam00542 | TIGR00855 | FIRM000107 | PMBAC011 | 1 |
rpoB | NP_387988 | COG0085 | PRK00405 | pfam00562 | TIGR02013 | FIRM000042 | PMBAC012 | 1,2 |
rpoC | NP_387989 | COG0086 | PRK00566 | pfam04997 | TIGR02386 | FIRM000044 | PMBAC013 | 1,2 |
rpsL-S12 | NP_387991 | COG0048 | PRK05163 | pfam00164 | TIGR00981 | FIRM000026 | PMBAC014 | 1 |
rpsG-S7 | NP_387992 | COG0049 | PRK05302 | pfam00177 | TIGR01029 | FIRM000017 | PMBAC015 | 1 |
rplD-L4 | NP_387998 | COG0088 | PRK05319 | pfam00573 | TIGR03953 | FIRM000009 | PMBAC016 | 1 |
rplB-L2 | NP_388000 | COG0090 | PRK09374 | pfam03947 | TIGR01171 | FIRM000010 | PMBAC017 | 1 |
rplV-L22 | NP_388002 | COG0091 | PRK00565 | pfam00237 | TIGR01044 | FIRM000007 | PMBAC018 | 1 |
rpsC-S3 | NP_388003 | COG0092 | PRK00310 | pfam00189 | TIGR01009 | FIRM000028 | PMBAC019 | 1 |
rplP-L16 | NP_388004 | COG0197 | PRK09203 | pfam00252 | TIGR01164 | FIRM000018 | PMBAC020 | 1 |
rpsQ-S17 | NP_388006 | COG0186 | PRK05610 | pfam00366 | TIGR03635 | FIRM000036 | PMBAC021 | 1 |
rplN-L14 | NP_388007 | COG0093 | PRK05483 | pfam00238 | TIGR01067 | FIRM000014 | PMBAC022 | 1 |
rplX-L24 | NP_388008 | COG0198 | PRK00004 | pfam17136 | TIGR01079 | FIRM000040 | PMBAC023 | 1 |
rplE-L5 | NP_388009 | COG0094 | PRK00010 | pfam00673 | | FIRM000025 | PMBAC024 | 1 |
rpsH-S8 | NP_388011 | COG0096 | PRK00136 | pfam00410 | | FIRM000031 | PMBAC025 | 1 |
rplF-L6 | NP_388012 | COG0097 | PRK05498 | pfam00347 | TIGR03654 | FIRM000023 | PMBAC026 | 1 |
rplR-L18 | NP_388013 | COG0256 | PRK05593 | pfam00861 | TIGR00060 | FIRM000033 | PMBAC027 | 1 |
rpsE-S5 | NP_388014 | COG0098 | PRK00550 | pfam00333 | TIGR01021 | FIRM000015 | PMBAC028 | 1 |
rplO-L15 | NP_388016 | COG0200 | PRK05592 | pfam00828 | TIGR01071 | FIRM000021 | PMBAC029 | 1 |
adk | NP_388018 | COG0563 | PRK00279 | pfam00406 | TIGR01351 | FIRM000131 | PMBAC030 | 1 |
rpsM-S13 | NP_388022 | COG0099 | PRK05179 | pfam00416 | TIGR03631 | FIRM000019 | PMBAC031 | 1 |
rpsK-S11 | NP_388023 | COG0100 | PRK05309 | pfam00411 | TIGR03632 | FIRM000029 | PMBAC032 | 1 |
rpoA | NP_388024 | COG0202 | PRK05182 | pfam01193 | TIGR02027 | FIRM000052 | PMBAC033 | 1 |
rplQ-L17 | NP_388025 | COG0203 | PRK05591 | pfam01196 | TIGR00059 | FIRM000057 | PMBAC034 | 1 |
rpsI-S9 | NP_388031 | COG0103 | PRK00132 | pfam00380 | | FIRM000011 | PMBAC035 | 1 |
tsaE | NP_388472 | COG0802 | PRK10646 | pfam02367 | TIGR00150 | FIRM000121 | PMBAC036 | 1 |
tsaB | NP_388473 | COG1214 | PRK14878 | pfam00814 | TIGR03725 | FIRM000134 | PMBAC037 | 1 |
tsaD | NP_388475 | COG0533 | PRK09604 | pfam00814 | TIGR03723 | FIRM000006 | PMBAC038 | 1 |
pcrA-uvrD | NP_388543 | COG0210 | PRK11773 | pfam00580 | TIGR01073 | | PMBAC090 | 3 |
ligA | NP_388544 | COG0272 | PRK07956 | pfam01653 | TIGR00575 | FIRM000109 | PMBAC039 | 1 |
rsmD | NP_389384 | COG0742 | PRK10909 | pfam03602 | TIGR00095 | FIRM000129 | PMBAC040 | 1 |
coaD | NP_389385 | COG0669 | PRK00168 | pfam01467 | TIGR01510 | FIRM000116 | PMBAC041 | 1 |
ftsZ | NP_389412 | COG0206 | PRK09330 | pfam12327 | TIGR00065 | FIRM000117 | PMBAC042 | 1 |
priA | NP_389453 | COG1198 | PRK05580 | pfam17764 | TIGR00595 | FIRM000045 | PMBAC043 | 1 |
plsX | NP_389471 | COG0416 | PRK05331 | pfam02504 | TIGR00182 | FIRM000090 | PMBAC044 | 1 |
rnc | NP_389475 | COG0571 | PRK00102 | pfam14622 | TIGR02191 | FIRM000128 | PMBAC045 | 1 |
trmD | NP_389485 | COG0336 | PRK00026 | pfam01746 | TIGR00088 | FIRM000058 | PMBAC046 | 1 |
rplS-L19 | NP_389486 | COG0335 | PRK05338 | pfam01245 | TIGR01024 | FIRM000069 | PMBAC047 | 1 |
rpsB-S2 | NP_389531 | COG0052 | PRK05299 | pfam00318 | TIGR01011 | FIRM000001 | PMBAC048 | 1 |
tsf | NP_389532 | COG0264 | PRK09377 | pfam00889 | TIGR00116 | FIRM000056 | PMBAC049 | 1 |
frr | NP_389534 | COG0233 | PRK00083 | pfam01765 | TIGR00496 | FIRM000079 | PMBAC050 | 1 |
uppS | NP_389535 | COG0020 | PRK14830 | pfam01255 | TIGR00055 | FIRM000127 | PMBAC051 | 1 |
cdsA | NP_389536 | COG0575 | PRK11624 | pfam01148 | | FIRM000126 | PMBAC052 | 1 |
rasP | NP_389538 | COG0750 | PRK10779 | pfam02163 | TIGR00054 | FIRM000125 | PMBAC053 | 1 |
nusA | NP_389542 | COG0195 | PRK12327 | pfam08529 | TIGR01953 | FIRM000041 | PMBAC054 | 1 |
infB | NP_389545 | COG0532 | PRK05306 | pfam11987 | TIGR00487 | FIRM000005 | PMBAC055 | 1 |
rbfA | NP_389547 | COG0858 | PRK00521 | pfam02033 | TIGR00082 | FIRM000063 | PMBAC056 | 1 |
mutS | NP_389586 | COG0249 | PRK05399 | pfam00488 | TIGR01070 | FIRM000076 | PMBAC057 | 1 |
mutL | NP_389587 | COG0323 | PRK00095 | pfam08676 | TIGR00585 | FIRM000064 | PMBAC058 | 1 |
miaA | NP_389615 | COG0324 | PRK00091 | pfam01715 | TIGR00174 | FIRM000082 | PMBAC059 | 1 |
cmk | NP_390170 | COG0283 | PRK00023 | pfam02224 | TIGR00017 | FIRM000115 | PMBAC060 | 1 |
recN | NP_390304 | COG0497 | PRK10869 | pfam13476 | TIGR00634 | FIRM000078 | PMBAC061 | 1 |
xseA | NP_390310 | COG1570 | PRK00286 | pfam02601 | TIGR00237 | FIRM000135 | PMBAC062 | 1 |
dnaG | NP_390400 | COG0358 | PRK05667 | pfam08275 | TIGR01391 | FIRM000092 | PMBAC063 | 1 |
grpE | NP_390426 | COG0576 | PRK14140 | pfam01025 | | FIRM000130 | PMBAC064 | 1 |
lepA | NP_390429 | COG0481 | PRK05433 | pfam06421 | TIGR01393 | FIRM000004 | PMBAC065 | 1 |
holA | NP_390434 | COG1466 | PRK05574 | pfam06144 | TIGR01128 | FIRM000142 | PMBAC066 | 1 |
rimF | NP_390617 | COG0816 | PRK00109 | pfam03652 | TIGR00250 | FIRM000140 | PMBAC067 | 1 |
dtd | NP_390637 | COG1490 | PRK05273 | pfam02580 | TIGR00256 | FIRM000124 | PMBAC068 | 1 |
relA-spoT | NP_390638 | COG0317 | PRK11092 | pfam04607 | TIGR00691 | FIRM000132 | PMBAC069 | 1 |
ruvB | YP_054590 | COG2255 | PRK00080 | pfam05496 | TIGR00635 | FIRM000072 | PMBAC070 | 1 |
ruvA | NP_390652 | COG0632 | PRK00116 | pfam01330 | TIGR00084 | FIRM000071 | PMBAC071 | 1 |
spo0B | NP_390670 | COG0536 | PRK12297 | pfam01018 | TIGR02729 | FIRM000049 | PMBAC072 | 1 |
rpmA-L27 | NP_390672 | COG0211 | PRK05435 | pfam01016 | TIGR00062 | FIRM000102 | PMBAC073 | 1 |
rplU-L21 | NP_390674 | COG0261 | PRK05573 | pfam00829 | TIGR00061 | FIRM000074 | PMBAC074 | 1 |
valS | NP_390687 | COG0525 | PRK05729 | pfam00133 | TIGR00422 | FIRM000136 | PMBAC075 | 1 |
pheS | NP_390742 | COG0016 | PRK00488 | pfam01409 | TIGR00468 | FIRM000020 | PMBAC076 | 1 |
rplT-L20 | NP_390763 | COG0292 | PRK05185 | pfam00453 | TIGR01032 | FIRM000070 | PMBAC077 | 1 |
coaE | NP_390784 | COG0237 | PRK00081 | pfam01121 | TIGR00152 | FIRM000085 | PMBAC078 | 1 |
polA | NP_390787 | COG0749 | PRK05755 | pfam00476 | TIGR00593 | | PMBAC091 | 3 |
murC | NP_390857 | COG0773 | PRK00421 | pfam08245 | TIGR01082 | FIRM000120 | PMBAC079 | 1 |
smpB | NP_391240 | COG0691 | PRK05422 | pfam01668 | TIGR00086 | FIRM000065 | PMBAC080 | 1 |
whiA | NP_391355 | COG1481 | | pfam02650 | TIGR00647 | FIRM000137 | PMBAC081 | 1 |
uvrB | NP_391397 | COG0556 | PRK05298 | pfam17757 | TIGR00631 | FIRM000122 | PMBAC082 | 1 |
hpf-raiA | NP_391411 | COG1544 | PRK10470 | pfam02482 | TIGR00741 | FIRM000143 | PMBAC083 | 1 |
tsaC-ywlC | NP_391576 | COG0009 | PRK10634 | pfam01300 | TIGR00057 | FIRM000138 | PMBAC084 | 1 |
prmC | NP_391581 | COG2890 | PRK09328 | pfam17827 | TIGR00536 | FIRM000141 | PMBAC085 | 1 |
rplI-L9 | NP_391930 | COG0359 | PRK00137 | pfam03948 | TIGR00158 | FIRM000054 | PMBAC086 | 1 |
gidB-rsmG | NP_391980 | COG0357 | PRK00107 | pfam02527 | TIGR00138 | FIRM000118 | PMBAC087 | 1 |
For each gene name GENE and each accession number ACCN from the databank BANK (i.e. COG, PRK, Pfam, TIGR, PhyEco or PhyloM) available in the above table, the reference multiple amino acid sequence alignment (MSA) can be accessed via the following URL model:
http://giphy.pasteur.fr/PhyloM/Bacillaceae/aln/BANK/GENE.ACCN.faa
and the associated position specific scoring matrix (PSSM) via the following URL model:
http://giphy.pasteur.fr/PhyloM/Bacillaceae/smp/BANK/GENE.ACCN.smp
For example, the reference MSA PMBAC086 for the gene rplI-L9 can be downloaded using wget with the following Linux command line:
wget -q http://giphy.pasteur.fr/PhyloM/Bacillaceae/aln/PhyloM/rplI-L9.PMBAC086.faa
The same download can also be performed using curl with the following command line:
curl --silent -O http://giphy.pasteur.fr/PhyloM/Bacillaceae/aln/PhyloM/rplI-L9.PMBAC086.faa
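The two URL models above can be wrapped in a small shell helper. The following is only a minimal sketch: the function name phylom_url is a hypothetical helper introduced here for illustration and is not part of the PhyloM distribution.

```shell
#!/bin/sh
# Hypothetical helper (not part of PhyloM): builds the URL of a reference
# MSA ("aln") or PSSM ("smp") file from the two URL models given above.
# usage: phylom_url KIND BANK GENE ACCN   with KIND = aln or smp
phylom_url() {
  kind=$1 bank=$2 gene=$3 accn=$4
  case "$kind" in
    aln) ext=faa ;;   # MSA files end with .faa
    smp) ext=smp ;;   # PSSM files end with .smp
    *)   echo "unknown kind: $kind (expected aln or smp)" >&2; return 1 ;;
  esac
  echo "http://giphy.pasteur.fr/PhyloM/Bacillaceae/$kind/$bank/$gene.$accn.$ext"
}

# prints the URL of the MSA PMBAC086 used in the wget and curl examples
phylom_url aln PhyloM rplI-L9 PMBAC086
```

The printed URL can then be given to wget or curl, e.g. curl --silent -O "$(phylom_url smp PhyloM rplI-L9 PMBAC086)" for the associated PSSM.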
Gene names from a given dataset DATASET (i.e. BM1, BM2 or BM3) can be accessed via the following URL model:
http://giphy.pasteur.fr/PhyloM/Bacillaceae/aln/BM/DATASET.txt
Every reference MSA from the databank BANK (i.e. COG, PRK, Pfam, TIGR, PhyEco or PhyloM) associated with a given dataset DATASET (i.e. BM1, BM2 or BM3) can be accessed via the following URL model:
http://giphy.pasteur.fr/PhyloM/Bacillaceae/aln/BM/BANK.DATASET.aln.tar.gz
Similarly, every PSSM from the databank BANK (i.e. COG, PRK, Pfam, TIGR, PhyEco or PhyloM) associated with a given dataset DATASET (i.e. BM1, BM2 or BM3) can be accessed via the following URL model:
http://giphy.pasteur.fr/PhyloM/Bacillaceae/smp/BM/BANK.DATASET.smp.tar.gz
For example, the COG PSSM files that belong to the dataset BM1 can be downloaded using curl and uncompressed using tar with the following Linux command line:
curl --silent http://giphy.pasteur.fr/PhyloM/Bacillaceae/smp/BM/COG.BM1.smp.tar.gz | tar -xz
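To fetch the archives of every databank for one dataset, the archive URL models above can be enumerated in a loop. The sketch below only prints the URLs (a dry run); the function name phylom_archive_urls is hypothetical and not part of PhyloM.

```shell
#!/bin/sh
# Hypothetical helper (not part of PhyloM): prints the archive URL of every
# databank for one dataset, following the two archive URL models given above.
# usage: phylom_archive_urls KIND DATASET   with KIND = aln or smp
phylom_archive_urls() {
  kind=$1 dataset=$2
  for bank in COG PRK Pfam TIGR PhyEco PhyloM
  do
    echo "http://giphy.pasteur.fr/PhyloM/Bacillaceae/$kind/BM/$bank.$dataset.$kind.tar.gz"
  done
}

# dry run: list the six PSSM archive URLs of the dataset BM1
phylom_archive_urls smp BM1
```

Each printed URL can be passed to curl and tar as in the example above. According to the summary table, no PhyEco file exists for BM3, so the corresponding archive may be absent or empty.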
Each of the MSA files can be used as a query for performing a psiblast search with the BLAST+ tools (Camacho et al. 2009).
Let cds.faa be a FASTA-formatted amino acid sequence file (e.g. every CDS from a bacterial genome).
This coding sequence file should first be formatted using the following Linux command line:
makeblastdb -in cds.faa -dbtype prot
Next, an MSA file msa.faa can be directly used as a query for performing a BLAST search with the following Linux command line model:
psiblast -in_msa msa.faa -db cds.faa -seg no -word_size 2 -evalue 1E-20 -xdrop_gap_final 1000
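When many MSA files are used as queries against the same database, one command line per file can be generated from the model above. The following sketch only prints the command lines. The function name psiblast_commands, the use of -outfmt 6 and the output file names are assumptions added here for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of PhyloM or BLAST+): prints one psiblast
# command line per MSA file (*.faa) found in a directory, using the options of
# the command line model above; -outfmt 6 and the redirection are additions.
# usage: psiblast_commands MSA_DIR DB
# assumption: file names contain no whitespace (true for GENE.ACCN.faa names)
psiblast_commands() {
  dir=$1 db=$2
  for msa in "$dir"/*.faa
  do
    [ -e "$msa" ] || continue          # skips the unexpanded pattern when no file matches
    out=$(basename "$msa" .faa).tsv    # e.g. recF.COG1195.faa gives recF.COG1195.tsv
    echo "psiblast -in_msa $msa -db $db -seg no -word_size 2 -evalue 1E-20 -xdrop_gap_final 1000 -outfmt 6 > $out"
  done
}
```

After checking the printed lines, they can be executed by piping them to sh, e.g. psiblast_commands msa_dir cds.faa | sh, where msa_dir is a placeholder for the directory holding the extracted MSA files.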
Each of the PSSM files can be used as a query for performing a tblastn search with the BLAST+ tools (Camacho et al. 2009).
Let seq.fna be a FASTA-formatted nucleotide sequence file (e.g. bacterial genome contig sequences).
This nucleotide sequence file should first be formatted using the following Linux command line:
makeblastdb -in seq.fna -dbtype nucl
Next, a PSSM file pssm.smp can be directly used as a query for performing a BLAST search with the following Linux command line model:
tblastn -in_pssm pssm.smp -db seq.fna -seg no -word_size 2 -evalue 1E-20 -xdrop_gap_final 1000
Of note, the corresponding full CDS can be easily extracted by using the program eFASTA along with the fields 2, 9 and 10 output by the tblastn option -outfmt 6.
Note also that searching and extracting a CDS from a genome sequence using an associated PSSM file can be easily carried out with the tool eCDS.
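The fields 2, 9 and 10 of the -outfmt 6 output mentioned above are the subject sequence identifier, the subject start and the subject end. The sketch below shows one way to pull them out with awk. It does not reproduce the eFASTA interface, the function name hit_coordinates is hypothetical, and the input line is invented for illustration.

```shell
#!/bin/sh
# Sketch (not the eFASTA interface): reads tblastn -outfmt 6 lines on stdin and
# prints, for each hit, the subject id (field 2) and the subject start and end
# (fields 9 and 10) reordered so that start <= end, followed by the strand.
# In tblastn output, sstart > send indicates a hit on the reverse strand.
hit_coordinates() {
  awk -F '\t' -v OFS='\t' '{
    if ($9 <= $10) print $2, $9, $10, "+"
    else           print $2, $10, $9, "-"
  }'
}

# invented example line (12 default columns of -outfmt 6), for illustration only
printf 'rplI-L9\tcontig_7\t68.5\t149\t47\t0\t1\t149\t5210\t4764\t3e-60\t201\n' | hit_coordinates
```

These coordinates delimit only the aligned region of each hit; recovering the full CDS around that region is the task the text above assigns to eFASTA or eCDS.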
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10:421. doi:10.1186/1471-2105-10-421
Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Research, 44:D279-D285. doi:10.1093/nar/gkv1344
Galperin MY, Makarova KS, Wolf YI, Koonin EV (2015) Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Research, 43:D261-D269. doi:10.1093/nar/gku1223
Gupta RS, Patel S, Saini N, Chen S (2020) Robust demarcation of 17 distinct Bacillus species clades, proposed as novel Bacillaceae genera, by phylogenomics and comparative genomic analyses: description of Robertmurraya kyonggiensis sp. nov. and proposal for an emended genus Bacillus limiting it only to the members of the Subtilis and Cereus clades of species. International Journal of Systematic and Evolutionary Microbiology, 70(11):5753-5798. doi:10.1099/ijsem.0.004475
Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT, White O (2001) TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Research, 29(1):41-43. doi:10.1093/nar/29.1.41
Patel S, Gupta RS (2020) A phylogenomic and comparative genomic framework for resolving the polyphyly of the genus Bacillus: Proposal for six new genera of Bacillus species, Peribacillus gen. nov., Cytobacillus gen. nov., Mesobacillus gen. nov., Neobacillus gen. nov., Metabacillus gen. nov. and Alkalihalobacillus gen. nov. International Journal of Systematic and Evolutionary Microbiology, 70:406-438. doi:10.1099/ijsem.0.003775
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4:41. doi:10.1186/1471-2105-4-41
Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on protein families. Science, 278(5338):631-637. doi:10.1126/science.278.5338.631
Wu D, Jospin G, Eisen JA (2013) Systematic Identification of Gene Families for Use as “Markers” for Phylogenetic and Phylogeny-Driven Ecological Studies of Bacteria and Archaea and Their Major Subgroups. PLoS ONE, 8(10):e77033. doi:10.1371/journal.pone.0077033