Institut Pasteur | DBC | Bioinformatics and Biostatistics Hub | GIPhy

PhyloM: Bacillaceae


Description

PhyloM: Bacillaceae is a selection of markers well suited to inferring phylogenetic trees of members of the Bacillaceae family. This bacterial family (phylum: Firmicutes; class: Bacilli; order: Bacillales) contains more than one hundred genera, including the well-known genus Bacillus. The selection is based on the 87 Firmicutes markers (dataset BM1) of Wu et al. (2013), sometimes referred to as the phyloeco dataset (e.g. Patel and Gupta 2020, Gupta et al. 2020). Dataset BM1 is supplemented with two smaller gene sets (datasets BM2 and BM3) described by Patel and Gupta (2020).

Datasets

Three sets of Bacillaceae markers are summarized below: BM1, BM2 and BM3 (Wu et al. 2013, Patel and Gupta 2020, Gupta et al. 2020). Together they cover 91 distinct loci, since rpoB and rpoC belong to both BM1 and BM2. For each set, the list of associated gene names is available. In addition, tar.gz archives can be downloaded, each containing reference multiple amino acid sequence alignments (MSA) or position-specific scoring matrices (PSSM) gathered from the COG database (Tatusov et al. 1997, 2003; Galperin et al. 2015), the NCBI protein cluster collection (PRK), the Pfam database (Finn et al. 2016), the TIGRFAMs database (Haft et al. 2001), and the PhyEco repository (Wu et al. 2013). Curated MSA and PSSM inferred specifically from 513 Bacillaceae type strain genomes can also be downloaded (PhyloM). For each dataset and source, the number of available MSA/PSSM is given in parentheses.

dataset genes COG PRK Pfam TIGR PhyEco PhyloM
BM1 (87) (87) (86) (87) (81) (81) (87)
BM2 (4) (4) (4) (4) (4) (2) (4)
BM3 (2) (2) (2) (2) (2) (0) (2)

Of note, the 91 initial (non-curated) MSA built from 513 Bacillaceae type strain genomes to create the PhyloM databank (last column) are also available for download.

Marker list

All 91 selected genes are listed in the table below. They are sorted according to their order in the reference genome of Bacillus subtilis strain 168 (GenBank accession: NC_000964).
Common gene names are given in the column name, with a link to the UniProt description. The column B. subtilis CDS lists the NCBI accession numbers of the B. subtilis str. 168 CDS, with links to the corresponding entries in the NCBI Conserved Domain Database. The dataset(s) to which each gene belongs (BM1, BM2 or BM3) are indicated in the last column, BM.
For each gene, the corresponding accession number (if available) is given for the COG database (Tatusov et al. 1997, 2003; Galperin et al. 2015), the NCBI protein cluster collection (PRK), the Pfam database (Finn et al. 2016), the TIGRFAMs database (Haft et al. 2001), the PhyEco repository (Wu et al. 2013), and the current PhyloM: Bacillaceae repository. For each non-empty entry, the corresponding MSA (with the consensus as first sequence) and the associated PSSM are available.

name B. subtilis CDS COG PRK Pfam TIGRFAMs PhyEco PhyloM BM
recF NP_387885 COG1195 PRK00064 pfam13175 TIGR00611 FIRM000113 PMBAC001 1
gyrB NP_387887 COG0187 PRK05644 pfam00204 TIGR01059 - PMBAC088 2
gyrA NP_387888 COG0188 PRK05560 pfam00521 TIGR01063 - PMBAC089 2
dnaX NP_387900 COG2812 PRK05563 pfam13177 TIGR02397 FIRM000123 PMBAC002 1
recR NP_387902 COG0353 PRK00076 pfam13662 TIGR00615 FIRM000080 PMBAC003 1
rsmI NP_387917 COG0313 PRK14994 pfam00590 TIGR00096 FIRM000119 PMBAC004 1
metG NP_387919 COG0143 PRK12267 pfam09334 TIGR00398 FIRM000139 PMBAC005 1
pth NP_387934 COG0193 PRK05426 pfam01195 TIGR00447 FIRM000103 PMBAC006 1
radA NP_387968 COG1066 PRK11823 pfam18073 TIGR00416 FIRM000060 PMBAC007 1
cysS NP_387975 COG0215 PRK00260 pfam01406 TIGR00435 FIRM000133 PMBAC008 1
rplA-L1 NP_387984 COG0081 PRK05424 pfam00687 TIGR01169 FIRM000003 PMBAC009 1
rplJ-L10 NP_387985 COG0244 PRK00099 pfam00466 - FIRM000030 PMBAC010 1
rplL-L12 NP_387986 COG0222 PRK00157 pfam00542 TIGR00855 FIRM000107 PMBAC011 1
rpoB NP_387988 COG0085 PRK00405 pfam00562 TIGR02013 FIRM000042 PMBAC012 1,2
rpoC NP_387989 COG0086 PRK00566 pfam04997 TIGR02386 FIRM000044 PMBAC013 1,2
rpsL-S12 NP_387991 COG0048 PRK05163 pfam00164 TIGR00981 FIRM000026 PMBAC014 1
rpsG-S7 NP_387992 COG0049 PRK05302 pfam00177 TIGR01029 FIRM000017 PMBAC015 1
rplD-L4 NP_387998 COG0088 PRK05319 pfam00573 TIGR03953 FIRM000009 PMBAC016 1
rplB-L2 NP_388000 COG0090 PRK09374 pfam03947 TIGR01171 FIRM000010 PMBAC017 1
rplV-L22 NP_388002 COG0091 PRK00565 pfam00237 TIGR01044 FIRM000007 PMBAC018 1
rpsC-S3 NP_388003 COG0092 PRK00310 pfam00189 TIGR01009 FIRM000028 PMBAC019 1
rplP-L16 NP_388004 COG0197 PRK09203 pfam00252 TIGR01164 FIRM000018 PMBAC020 1
rpsQ-S17 NP_388006 COG0186 PRK05610 pfam00366 TIGR03635 FIRM000036 PMBAC021 1
rplN-L14 NP_388007 COG0093 PRK05483 pfam00238 TIGR01067 FIRM000014 PMBAC022 1
rplX-L24 NP_388008 COG0198 PRK00004 pfam17136 TIGR01079 FIRM000040 PMBAC023 1
rplE-L5 NP_388009 COG0094 PRK00010 pfam00673 - FIRM000025 PMBAC024 1
rpsH-S8 NP_388011 COG0096 PRK00136 pfam00410 - FIRM000031 PMBAC025 1
rplF-L6 NP_388012 COG0097 PRK05498 pfam00347 TIGR03654 FIRM000023 PMBAC026 1
rplR-L18 NP_388013 COG0256 PRK05593 pfam00861 TIGR00060 FIRM000033 PMBAC027 1
rpsE-S5 NP_388014 COG0098 PRK00550 pfam00333 TIGR01021 FIRM000015 PMBAC028 1
rplO-L15 NP_388016 COG0200 PRK05592 pfam00828 TIGR01071 FIRM000021 PMBAC029 1
adk NP_388018 COG0563 PRK00279 pfam00406 TIGR01351 FIRM000131 PMBAC030 1
rpsM-S13 NP_388022 COG0099 PRK05179 pfam00416 TIGR03631 FIRM000019 PMBAC031 1
rpsK-S11 NP_388023 COG0100 PRK05309 pfam00411 TIGR03632 FIRM000029 PMBAC032 1
rpoA NP_388024 COG0202 PRK05182 pfam01193 TIGR02027 FIRM000052 PMBAC033 1
rplQ-L17 NP_388025 COG0203 PRK05591 pfam01196 TIGR00059 FIRM000057 PMBAC034 1
rpsI-S9 NP_388031 COG0103 PRK00132 pfam00380 - FIRM000011 PMBAC035 1
tsaE NP_388472 COG0802 PRK10646 pfam02367 TIGR00150 FIRM000121 PMBAC036 1
tsaB NP_388473 COG1214 PRK14878 pfam00814 TIGR03725 FIRM000134 PMBAC037 1
tsaD NP_388475 COG0533 PRK09604 pfam00814 TIGR03723 FIRM000006 PMBAC038 1
pcrA-uvrD NP_388543 COG0210 PRK11773 pfam00580 TIGR01073 - PMBAC090 3
ligA NP_388544 COG0272 PRK07956 pfam01653 TIGR00575 FIRM000109 PMBAC039 1
rsmD NP_389384 COG0742 PRK10909 pfam03602 TIGR00095 FIRM000129 PMBAC040 1
coaD NP_389385 COG0669 PRK00168 pfam01467 TIGR01510 FIRM000116 PMBAC041 1
ftsZ NP_389412 COG0206 PRK09330 pfam12327 TIGR00065 FIRM000117 PMBAC042 1
priA NP_389453 COG1198 PRK05580 pfam17764 TIGR00595 FIRM000045 PMBAC043 1
plsX NP_389471 COG0416 PRK05331 pfam02504 TIGR00182 FIRM000090 PMBAC044 1
rnc NP_389475 COG0571 PRK00102 pfam14622 TIGR02191 FIRM000128 PMBAC045 1
trmD NP_389485 COG0336 PRK00026 pfam01746 TIGR00088 FIRM000058 PMBAC046 1
rplS-L19 NP_389486 COG0335 PRK05338 pfam01245 TIGR01024 FIRM000069 PMBAC047 1
rpsB-S2 NP_389531 COG0052 PRK05299 pfam00318 TIGR01011 FIRM000001 PMBAC048 1
tsf NP_389532 COG0264 PRK09377 pfam00889 TIGR00116 FIRM000056 PMBAC049 1
frr NP_389534 COG0233 PRK00083 pfam01765 TIGR00496 FIRM000079 PMBAC050 1
uppS NP_389535 COG0020 PRK14830 pfam01255 TIGR00055 FIRM000127 PMBAC051 1
cdsA NP_389536 COG0575 PRK11624 pfam01148 - FIRM000126 PMBAC052 1
rasP NP_389538 COG0750 PRK10779 pfam02163 TIGR00054 FIRM000125 PMBAC053 1
nusA NP_389542 COG0195 PRK12327 pfam08529 TIGR01953 FIRM000041 PMBAC054 1
infB NP_389545 COG0532 PRK05306 pfam11987 TIGR00487 FIRM000005 PMBAC055 1
rbfA NP_389547 COG0858 PRK00521 pfam02033 TIGR00082 FIRM000063 PMBAC056 1
mutS NP_389586 COG0249 PRK05399 pfam00488 TIGR01070 FIRM000076 PMBAC057 1
mutL NP_389587 COG0323 PRK00095 pfam08676 TIGR00585 FIRM000064 PMBAC058 1
miaA NP_389615 COG0324 PRK00091 pfam01715 TIGR00174 FIRM000082 PMBAC059 1
cmk NP_390170 COG0283 PRK00023 pfam02224 TIGR00017 FIRM000115 PMBAC060 1
recN NP_390304 COG0497 PRK10869 pfam13476 TIGR00634 FIRM000078 PMBAC061 1
xseA NP_390310 COG1570 PRK00286 pfam02601 TIGR00237 FIRM000135 PMBAC062 1
dnaG NP_390400 COG0358 PRK05667 pfam08275 TIGR01391 FIRM000092 PMBAC063 1
grpE NP_390426 COG0576 PRK14140 pfam01025 - FIRM000130 PMBAC064 1
lepA NP_390429 COG0481 PRK05433 pfam06421 TIGR01393 FIRM000004 PMBAC065 1
holA NP_390434 COG1466 PRK05574 pfam06144 TIGR01128 FIRM000142 PMBAC066 1
rimF NP_390617 COG0816 PRK00109 pfam03652 TIGR00250 FIRM000140 PMBAC067 1
dtd NP_390637 COG1490 PRK05273 pfam02580 TIGR00256 FIRM000124 PMBAC068 1
relA-spoT NP_390638 COG0317 PRK11092 pfam04607 TIGR00691 FIRM000132 PMBAC069 1
ruvB YP_054590 COG2255 PRK00080 pfam05496 TIGR00635 FIRM000072 PMBAC070 1
ruvA NP_390652 COG0632 PRK00116 pfam01330 TIGR00084 FIRM000071 PMBAC071 1
spo0B NP_390670 COG0536 PRK12297 pfam01018 TIGR02729 FIRM000049 PMBAC072 1
rpmA-L27 NP_390672 COG0211 PRK05435 pfam01016 TIGR00062 FIRM000102 PMBAC073 1
rplU-L21 NP_390674 COG0261 PRK05573 pfam00829 TIGR00061 FIRM000074 PMBAC074 1
valS NP_390687 COG0525 PRK05729 pfam00133 TIGR00422 FIRM000136 PMBAC075 1
pheS NP_390742 COG0016 PRK00488 pfam01409 TIGR00468 FIRM000020 PMBAC076 1
rplT-L20 NP_390763 COG0292 PRK05185 pfam00453 TIGR01032 FIRM000070 PMBAC077 1
coaE NP_390784 COG0237 PRK00081 pfam01121 TIGR00152 FIRM000085 PMBAC078 1
polA NP_390787 COG0749 PRK05755 pfam00476 TIGR00593 - PMBAC091 3
murC NP_390857 COG0773 PRK00421 pfam08245 TIGR01082 FIRM000120 PMBAC079 1
smpB NP_391240 COG0691 PRK05422 pfam01668 TIGR00086 FIRM000065 PMBAC080 1
whiA NP_391355 COG1481 - pfam02650 TIGR00647 FIRM000137 PMBAC081 1
uvrB NP_391397 COG0556 PRK05298 pfam17757 TIGR00631 FIRM000122 PMBAC082 1
hpf-raiA NP_391411 COG1544 PRK10470 pfam02482 TIGR00741 FIRM000143 PMBAC083 1
tsaC-ywlC NP_391576 COG0009 PRK10634 pfam01300 TIGR00057 FIRM000138 PMBAC084 1
prmC NP_391581 COG2890 PRK09328 pfam17827 TIGR00536 FIRM000141 PMBAC085 1
rplI-L9 NP_391930 COG0359 PRK00137 pfam03948 TIGR00158 FIRM000054 PMBAC086 1
gidB-rsmG NP_391980 COG0357 PRK00107 pfam02527 TIGR00138 FIRM000118 PMBAC087 1
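
Some genes have no entry in one of the databanks, so when a row of the table above is split on whitespace, the accession prefix is a safer guide than the column position. The following is a minimal sketch under that assumption. The function name row_entries is hypothetical, and the prefix-to-databank mapping is simply read off the table (COG, PRK, pfam, TIGR, FIRM for PhyEco, PMBAC for PhyloM).

```shell
#!/bin/sh
# row_entries (hypothetical helper): read marker-table rows on stdin and print
# one "gene <TAB> databank <TAB> accession" line per available entry.
row_entries() {
  awk '{
    # Field 1 is the gene name, field 2 the B. subtilis CDS accession,
    # and the last field the BM column; everything in between is an accession.
    for (i = 3; i < NF; i++) {
      bank = ""
      if      ($i ~ /^COG/)   bank = "COG"
      else if ($i ~ /^PRK/)   bank = "PRK"
      else if ($i ~ /^pfam/)  bank = "Pfam"
      else if ($i ~ /^TIGR/)  bank = "TIGR"
      else if ($i ~ /^FIRM/)  bank = "PhyEco"
      else if ($i ~ /^PMBAC/) bank = "PhyloM"
      # Fields that match no known prefix (e.g. a placeholder) are skipped.
      if (bank != "") print $1 "\t" bank "\t" $i
    }
  }'
}

# Row copied from the table: gyrB has no PhyEco entry.
echo "gyrB NP_387887 COG0187 PRK05644 pfam00204 TIGR01059 PMBAC088 2" | row_entries
```

For this gyrB row, five lines are printed, one each for COG, PRK, Pfam, TIGR and PhyloM.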

Usage

Downloading multiple sequence alignment (MSA) or position specific scoring matrix (PSSM) files

For each gene name GENE and each accession number ACCN from the databank BANK (i.e. COG, PRK, Pfam, TIGR, PhyEco or PhyloM) available in the above table, the reference multiple amino acid sequence alignment (MSA) can be accessed via the following URL model:

 http://giphy.pasteur.fr/PhyloM/Bacillaceae/aln/BANK/GENE.ACCN.faa

and the associated position specific scoring matrix (PSSM) via the following URL model:

 http://giphy.pasteur.fr/PhyloM/Bacillaceae/smp/BANK/GENE.ACCN.smp

For example, the reference MSA PMBAC086 for the gene rplI-L9 can be downloaded using wget with the following Linux command line:

 wget -q http://giphy.pasteur.fr/PhyloM/Bacillaceae/aln/PhyloM/rplI-L9.PMBAC086.faa

The same download can also be performed using curl with the following command line:

 curl --silent -O http://giphy.pasteur.fr/PhyloM/Bacillaceae/aln/PhyloM/rplI-L9.PMBAC086.faa
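
The two URL models above can be wrapped in a small shell helper. This is a minimal sketch, not part of the PhyloM resource: the function name phylom_url is hypothetical, and it only assembles a URL string, assuming that gene names and accession numbers are used exactly as printed in the marker table.

```shell
#!/bin/sh
# phylom_url KIND BANK GENE ACCN  (hypothetical helper)
#   KIND is "aln" (MSA, extension .faa) or "smp" (PSSM, extension .smp).
# It only builds the URL from the models above; it does not download anything.
phylom_url() {
  kind=$1; bank=$2; gene=$3; accn=$4
  case "$kind" in
    aln) ext=faa ;;
    smp) ext=smp ;;
    *)   echo "unknown kind: $kind" >&2; return 1 ;;
  esac
  echo "http://giphy.pasteur.fr/PhyloM/Bacillaceae/$kind/$bank/$gene.$accn.$ext"
}

# Print the MSA and PSSM URLs of rplI-L9 in the PhyloM databank.
phylom_url aln PhyloM rplI-L9 PMBAC086
phylom_url smp PhyloM rplI-L9 PMBAC086
```

The printed URL can then be passed to either download tool, e.g. curl --silent -O "$(phylom_url smp PhyloM rplI-L9 PMBAC086)".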

Downloading all files associated with a given dataset

Gene names from a given dataset DATASET (i.e. BM1, BM2 or BM3) can be accessed via the following URL model:

 http://giphy.pasteur.fr/PhyloM/Bacillaceae/aln/BM/DATASET.txt

Every reference MSA from the databank BANK (i.e. COG, PRK, Pfam, TIGR, PhyEco or PhyloM) associated with a given dataset DATASET (i.e. BM1, BM2 or BM3) can be accessed via the following URL model:

 http://giphy.pasteur.fr/PhyloM/Bacillaceae/aln/BM/BANK.DATASET.aln.tar.gz

Similarly, every PSSM from the databank BANK (i.e. COG, PRK, Pfam, TIGR, PhyEco or PhyloM) associated with a given dataset DATASET (i.e. BM1, BM2 or BM3) can be accessed via the following URL model:

 http://giphy.pasteur.fr/PhyloM/Bacillaceae/smp/BM/BANK.DATASET.smp.tar.gz

For example, the COG PSSM files that belong to the dataset BM1 can be downloaded using curl and uncompressed using tar with the following Linux command line:

 curl --silent http://giphy.pasteur.fr/PhyloM/Bacillaceae/smp/BM/COG.BM1.smp.tar.gz | tar -xzf -
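
To fetch a dataset from several databanks at once, the archive URL model can be expanded in a loop. The sketch below is illustrative only: the function name archive_urls is hypothetical, and it prints URLs without checking that each archive exists (the summary table reports no PhyEco entry for BM3, so that archive may be absent).

```shell
#!/bin/sh
# archive_urls DATASET KIND  (hypothetical helper)
#   DATASET is BM1, BM2 or BM3; KIND is "aln" (MSA) or "smp" (PSSM).
# Print one archive URL per databank, following the URL models above.
archive_urls() {
  dataset=$1; kind=$2
  for bank in COG PRK Pfam TIGR PhyEco PhyloM; do
    echo "http://giphy.pasteur.fr/PhyloM/Bacillaceae/$kind/BM/$bank.$dataset.$kind.tar.gz"
  done
}

# List the six MSA archives of dataset BM2.
archive_urls BM2 aln
```

Each printed URL can be downloaded and unpacked as in the example above, for instance by reading the list line by line and running curl --silent --fail on each URL, so that a missing archive is skipped rather than passed to tar.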

Using an MSA for performing a BLAST search against an amino acid sequence databank

Each MSA file can be used as a query for a psiblast search with the BLAST+ tools (Camacho et al. 2009). Let cds.faa be a FASTA-formatted amino acid sequence file (e.g. every CDS from a bacterial genome). This file should first be formatted as a BLAST database using the following Linux command line:

 makeblastdb  -in cds.faa  -dbtype prot

Next, an MSA file msa.faa can be used directly as a query for a BLAST search with the following Linux command line model:

 psiblast  -in_msa msa.faa  -db cds.faa  -seg no  -word_size 2  -evalue 1E-20  -xdrop_gap_final 1000
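
When many MSA files are searched against the same databank, it can help to generate one command per file and inspect it before running anything. The following sketch assumes a hypothetical layout (MSA files under aln/, results written as tab-separated files in the current directory) and adds the standard BLAST+ options -outfmt 6 and -out to the command model above. The function name psiblast_cmd is not part of BLAST+.

```shell
#!/bin/sh
# psiblast_cmd MSA DB  (hypothetical helper)
# Print, without running it, the psiblast command for one MSA file.
# The output file name is derived from the MSA file name (assumed to end in .faa).
psiblast_cmd() {
  msa=$1; db=$2
  out="$(basename "$msa" .faa).psiblast.tsv"
  echo "psiblast -in_msa $msa -db $db -seg no -word_size 2 -evalue 1E-20 -xdrop_gap_final 1000 -outfmt 6 -out $out"
}

# Show the command that would be run for one PhyloM alignment.
psiblast_cmd aln/rplI-L9.PMBAC086.faa cds.faa
```

Once the printed command looks right, it can be executed by piping it to sh. This works only as long as file names contain no whitespace, which holds for the GENE.ACCN.faa names used here.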

Using a PSSM for performing a BLAST search against a nucleotide sequence databank

Each PSSM file can be used as a query for a tblastn search with the BLAST+ tools (Camacho et al. 2009). Let seq.fna be a FASTA-formatted nucleotide sequence file (e.g. bacterial genome contig sequences). This file should first be formatted as a BLAST database using the following Linux command line:

 makeblastdb  -in seq.fna  -dbtype nucl

Next, a PSSM file pssm.smp can be used directly as a query for a BLAST search with the following Linux command line model:

 tblastn  -in_pssm pssm.smp  -db seq.fna  -seg no  -word_size 2  -evalue 1E-20  -xdrop_gap_final 1000

Of note, the corresponding full CDS can be easily extracted using the program eFASTA together with fields 2, 9 and 10 (subject identifier, subject start and subject end) of the tblastn tabular output (option -outfmt 6).
Note also that searching for and extracting a CDS from a genome sequence with an associated PSSM file can be easily carried out with the tool eCDS.
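
Fields 2, 9 and 10 can be pulled out of the tabular output with awk. The sketch below is an illustration, not the eFASTA or eCDS procedure: the function name best_hit is hypothetical, the choice of the highest bit score (field 12) as selection criterion is an assumption, and the two input lines are made up. The reported interval is the aligned region only, which may be shorter than the full CDS.

```shell
#!/bin/sh
# best_hit (hypothetical helper): read tblastn -outfmt 6 lines on stdin and,
# for the hit with the highest bit score (field 12), print the subject id
# (field 2), the subject coordinates (fields 9 and 10) reordered so that
# start <= end, and the strand ("-" when the original start exceeds the end).
best_hit() {
  awk -F'\t' '
    NR == 1 || $12 + 0 > best { best = $12 + 0; id = $2; s = $9 + 0; e = $10 + 0 }
    END {
      if (NR == 0) exit 1
      if (s <= e) print id "\t" s "\t" e "\t+"
      else        print id "\t" e "\t" s "\t-"
    }'
}

# Two made-up hit lines, for illustration only (not real search results).
{
  printf 'query\tcontig1\t90.0\t100\t10\t0\t1\t100\t900\t601\t1e-50\t180\n'
  printf 'query\tcontig2\t60.0\t100\t40\t0\t1\t100\t10\t309\t1e-25\t95\n'
} | best_hit
# prints: contig1 <TAB> 601 <TAB> 900 <TAB> -
```
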


Literature cited

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10:421. doi:10.1186/1471-2105-10-421

Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Research, 44:D279-285. doi:10.1093/nar/gkv1344

Galperin MY, Makarova KS, Wolf YI, Koonin EV (2015) Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Research, 43:D261-269. doi:10.1093/nar/gku1223

Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT, White O (2001) TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Research, 29(1):41-43. doi:10.1093/nar/29.1.41

Gupta RS, Patel S, Saini N, Chen S (2020) Robust demarcation of 17 distinct Bacillus species clades, proposed as novel Bacillaceae genera, by phylogenomics and comparative genomic analyses: description of Robertmurraya kyonggiensis sp. nov. and proposal for an emended genus Bacillus limiting it only to the members of the Subtilis and Cereus clades of species. International Journal of Systematic and Evolutionary Microbiology, 70(11):5753-5798. doi:10.1099/ijsem.0.004475

Patel S, Gupta RS (2020) A phylogenomic and comparative genomic framework for resolving the polyphyly of the genus Bacillus: Proposal for six new genera of Bacillus species, Peribacillus gen. nov., Cytobacillus gen. nov., Mesobacillus gen. nov., Neobacillus gen. nov., Metabacillus gen. nov. and Alkalihalobacillus gen. nov. International Journal of Systematic and Evolutionary Microbiology, 70:406-438. doi:10.1099/ijsem.0.003775

Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4:41. doi:10.1186/1471-2105-4-41

Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on protein families. Science, 278(5338):631-637. doi:10.1126/science.278.5338.631

Wu D, Jospin G, Eisen JA (2013) Systematic Identification of Gene Families for Use as “Markers” for Phylogenetic and Phylogeny-Driven Ecological Studies of Bacteria and Archaea and Their Major Subgroups. PLoS ONE, 8(10):e77033. doi:10.1371/journal.pone.0077033