PhyloM: Bacillaceae

Description

PhyloM: Bacillaceae is a selection of markers that are well suited to inferring phylogenetic trees of members of the Bacillaceae family. This bacterial family (phylum: Firmicutes; class: Bacilli; order: Bacillales) contains more than one hundred genera, including the well-known genus Bacillus. The gene selection relies on the 87 Firmicutes markers (dataset BM1) of Wu et al. (2013), sometimes called the phyloeco dataset (e.g. Patel and Gupta 2020, Gupta et al. 2020). Dataset BM1 is complemented by two smaller gene sets (datasets BM2 and BM3) described by Patel and Gupta (2020).

Datasets

Three sets of Bacillaceae markers (91 loci in total) are summarized below: BM1, BM2 and BM3 (Wu et al. 2013, Patel and Gupta 2020, Gupta et al. 2020). For each of them, the associated gene name list is available. In addition, tar.gz archives can be downloaded, each containing reference multiple amino acid sequence alignments (MSA) or position-specific scoring matrices (PSSM) gathered from the COG database (Tatusov et al. 1997, 2003; Galperin et al. 2015), the NCBI protein cluster collection (PRK), the Pfam database (Finn et al. 2016), the TIGRFAMs database (Haft et al. 2001), and the PhyEco repository (Wu et al. 2013). Curated MSA and PSSM inferred specifically from 513 Bacillaceae type strain genomes are also downloadable (PhyloM). For each dataset and source, the total number of available MSA/PSSM is indicated in parentheses.

dataset	genes	COG	PRK	Pfam	TIGR	PhyEco	PhyloM
BM1	(87)	(87)	(86)	(87)	(81)	(81)	(87)
BM2	(4)	(4)	(4)	(4)	(4)	(2)	(4)
BM3	(2)	(2)	(2)	(2)	(2)	(0)	(2)

Of note, the 91 initial (non-curated) MSA built from 513 Bacillaceae type strain genomes to create the databank PhyloM (last column) are available here:

Marker list

All 91 selected genes are listed in the table below. They are sorted according to their order in the reference genome of Bacillus subtilis strain 168 (GenBank accession: NC_000964).
Common gene names are given in the column name, with a link to the UniProt description. The column B. subtilis CDS lists the NCBI accession numbers of the B. subtilis str. 168 CDS, with links to the corresponding entries in the NCBI Conserved Domain Database. The dataset membership (BM1, BM2 or BM3) is indicated in the last column, BM.
For each gene, the corresponding accession number (if available) is given for the COG database (Tatusov et al. 1997, 2003; Galperin et al. 2015), the NCBI protein cluster collection (PRK), the Pfam database (Finn et al. 2016), the TIGRFAMs database (Haft et al. 2001), the PhyEco repository (Wu et al. 2013), and the current PhyloM: Bacillaceae repository. For each non-empty entry, the corresponding MSA is available (with the consensus as first sequence), as well as the associated PSSM.

name	B. subtilis CDS	COG	PRK	Pfam	TIGRFAMs	PhyEco	PhyloM	BM
recF	NP_387885	COG1195	PRK00064	pfam13175	TIGR00611	FIRM000113	PMBAC001	1
gyrB	NP_387887	COG0187	PRK05644	pfam00204	TIGR01059		PMBAC088	2
gyrA	NP_387888	COG0188	PRK05560	pfam00521	TIGR01063		PMBAC089	2
dnaX	NP_387900	COG2812	PRK05563	pfam13177	TIGR02397	FIRM000123	PMBAC002	1
recR	NP_387902	COG0353	PRK00076	pfam13662	TIGR00615	FIRM000080	PMBAC003	1
rsmI	NP_387917	COG0313	PRK14994	pfam00590	TIGR00096	FIRM000119	PMBAC004	1
metG	NP_387919	COG0143	PRK12267	pfam09334	TIGR00398	FIRM000139	PMBAC005	1
pth	NP_387934	COG0193	PRK05426	pfam01195	TIGR00447	FIRM000103	PMBAC006	1
radA	NP_387968	COG1066	PRK11823	pfam18073	TIGR00416	FIRM000060	PMBAC007	1
cysS	NP_387975	COG0215	PRK00260	pfam01406	TIGR00435	FIRM000133	PMBAC008	1
rplA-L1	NP_387984	COG0081	PRK05424	pfam00687	TIGR01169	FIRM000003	PMBAC009	1
rplJ-L10	NP_387985	COG0244	PRK00099	pfam00466		FIRM000030	PMBAC010	1
rplL-L12	NP_387986	COG0222	PRK00157	pfam00542	TIGR00855	FIRM000107	PMBAC011	1
rpoB	NP_387988	COG0085	PRK00405	pfam00562	TIGR02013	FIRM000042	PMBAC012	1,2
rpoC	NP_387989	COG0086	PRK00566	pfam04997	TIGR02386	FIRM000044	PMBAC013	1,2
rpsL-S12	NP_387991	COG0048	PRK05163	pfam00164	TIGR00981	FIRM000026	PMBAC014	1
rpsG-S7	NP_387992	COG0049	PRK05302	pfam00177	TIGR01029	FIRM000017	PMBAC015	1
rplD-L4	NP_387998	COG0088	PRK05319	pfam00573	TIGR03953	FIRM000009	PMBAC016	1
rplB-L2	NP_388000	COG0090	PRK09374	pfam03947	TIGR01171	FIRM000010	PMBAC017	1
rplV-L22	NP_388002	COG0091	PRK00565	pfam00237	TIGR01044	FIRM000007	PMBAC018	1
rpsC-S3	NP_388003	COG0092	PRK00310	pfam00189	TIGR01009	FIRM000028	PMBAC019	1
rplP-L16	NP_388004	COG0197	PRK09203	pfam00252	TIGR01164	FIRM000018	PMBAC020	1
rpsQ-S17	NP_388006	COG0186	PRK05610	pfam00366	TIGR03635	FIRM000036	PMBAC021	1
rplN-L14	NP_388007	COG0093	PRK05483	pfam00238	TIGR01067	FIRM000014	PMBAC022	1
rplX-L24	NP_388008	COG0198	PRK00004	pfam17136	TIGR01079	FIRM000040	PMBAC023	1
rplE-L5	NP_388009	COG0094	PRK00010	pfam00673		FIRM000025	PMBAC024	1
rpsH-S8	NP_388011	COG0096	PRK00136	pfam00410		FIRM000031	PMBAC025	1
rplF-L6	NP_388012	COG0097	PRK05498	pfam00347	TIGR03654	FIRM000023	PMBAC026	1
rplR-L18	NP_388013	COG0256	PRK05593	pfam00861	TIGR00060	FIRM000033	PMBAC027	1
rpsE-S5	NP_388014	COG0098	PRK00550	pfam00333	TIGR01021	FIRM000015	PMBAC028	1
rplO-L15	NP_388016	COG0200	PRK05592	pfam00828	TIGR01071	FIRM000021	PMBAC029	1
adk	NP_388018	COG0563	PRK00279	pfam00406	TIGR01351	FIRM000131	PMBAC030	1
rpsM-S13	NP_388022	COG0099	PRK05179	pfam00416	TIGR03631	FIRM000019	PMBAC031	1
rpsK-S11	NP_388023	COG0100	PRK05309	pfam00411	TIGR03632	FIRM000029	PMBAC032	1
rpoA	NP_388024	COG0202	PRK05182	pfam01193	TIGR02027	FIRM000052	PMBAC033	1
rplQ-L17	NP_388025	COG0203	PRK05591	pfam01196	TIGR00059	FIRM000057	PMBAC034	1
rpsI-S9	NP_388031	COG0103	PRK00132	pfam00380		FIRM000011	PMBAC035	1
tsaE	NP_388472	COG0802	PRK10646	pfam02367	TIGR00150	FIRM000121	PMBAC036	1
tsaB	NP_388473	COG1214	PRK14878	pfam00814	TIGR03725	FIRM000134	PMBAC037	1
tsaD	NP_388475	COG0533	PRK09604	pfam00814	TIGR03723	FIRM000006	PMBAC038	1
pcrA-uvrD	NP_388543	COG0210	PRK11773	pfam00580	TIGR01073		PMBAC090	3
ligA	NP_388544	COG0272	PRK07956	pfam01653	TIGR00575	FIRM000109	PMBAC039	1
rsmD	NP_389384	COG0742	PRK10909	pfam03602	TIGR00095	FIRM000129	PMBAC040	1
coaD	NP_389385	COG0669	PRK00168	pfam01467	TIGR01510	FIRM000116	PMBAC041	1
ftsZ	NP_389412	COG0206	PRK09330	pfam12327	TIGR00065	FIRM000117	PMBAC042	1
priA	NP_389453	COG1198	PRK05580	pfam17764	TIGR00595	FIRM000045	PMBAC043	1
plsX	NP_389471	COG0416	PRK05331	pfam02504	TIGR00182	FIRM000090	PMBAC044	1
rnc	NP_389475	COG0571	PRK00102	pfam14622	TIGR02191	FIRM000128	PMBAC045	1
trmD	NP_389485	COG0336	PRK00026	pfam01746	TIGR00088	FIRM000058	PMBAC046	1
rplS-L19	NP_389486	COG0335	PRK05338	pfam01245	TIGR01024	FIRM000069	PMBAC047	1
rpsB-S2	NP_389531	COG0052	PRK05299	pfam00318	TIGR01011	FIRM000001	PMBAC048	1
tsf	NP_389532	COG0264	PRK09377	pfam00889	TIGR00116	FIRM000056	PMBAC049	1
frr	NP_389534	COG0233	PRK00083	pfam01765	TIGR00496	FIRM000079	PMBAC050	1
uppS	NP_389535	COG0020	PRK14830	pfam01255	TIGR00055	FIRM000127	PMBAC051	1
cdsA	NP_389536	COG0575	PRK11624	pfam01148		FIRM000126	PMBAC052	1
rasP	NP_389538	COG0750	PRK10779	pfam02163	TIGR00054	FIRM000125	PMBAC053	1
nusA	NP_389542	COG0195	PRK12327	pfam08529	TIGR01953	FIRM000041	PMBAC054	1
infB	NP_389545	COG0532	PRK05306	pfam11987	TIGR00487	FIRM000005	PMBAC055	1
rbfA	NP_389547	COG0858	PRK00521	pfam02033	TIGR00082	FIRM000063	PMBAC056	1
mutS	NP_389586	COG0249	PRK05399	pfam00488	TIGR01070	FIRM000076	PMBAC057	1
mutL	NP_389587	COG0323	PRK00095	pfam08676	TIGR00585	FIRM000064	PMBAC058	1
miaA	NP_389615	COG0324	PRK00091	pfam01715	TIGR00174	FIRM000082	PMBAC059	1
cmk	NP_390170	COG0283	PRK00023	pfam02224	TIGR00017	FIRM000115	PMBAC060	1
recN	NP_390304	COG0497	PRK10869	pfam13476	TIGR00634	FIRM000078	PMBAC061	1
xseA	NP_390310	COG1570	PRK00286	pfam02601	TIGR00237	FIRM000135	PMBAC062	1
dnaG	NP_390400	COG0358	PRK05667	pfam08275	TIGR01391	FIRM000092	PMBAC063	1
grpE	NP_390426	COG0576	PRK14140	pfam01025		FIRM000130	PMBAC064	1
lepA	NP_390429	COG0481	PRK05433	pfam06421	TIGR01393	FIRM000004	PMBAC065	1
holA	NP_390434	COG1466	PRK05574	pfam06144	TIGR01128	FIRM000142	PMBAC066	1
rimF	NP_390617	COG0816	PRK00109	pfam03652	TIGR00250	FIRM000140	PMBAC067	1
dtd	NP_390637	COG1490	PRK05273	pfam02580	TIGR00256	FIRM000124	PMBAC068	1
relA-spoT	NP_390638	COG0317	PRK11092	pfam04607	TIGR00691	FIRM000132	PMBAC069	1
ruvB	YP_054590	COG2255	PRK00080	pfam05496	TIGR00635	FIRM000072	PMBAC070	1
ruvA	NP_390652	COG0632	PRK00116	pfam01330	TIGR00084	FIRM000071	PMBAC071	1
spo0B	NP_390670	COG0536	PRK12297	pfam01018	TIGR02729	FIRM000049	PMBAC072	1
rpmA-L27	NP_390672	COG0211	PRK05435	pfam01016	TIGR00062	FIRM000102	PMBAC073	1
rplU-L21	NP_390674	COG0261	PRK05573	pfam00829	TIGR00061	FIRM000074	PMBAC074	1
valS	NP_390687	COG0525	PRK05729	pfam00133	TIGR00422	FIRM000136	PMBAC075	1
pheS	NP_390742	COG0016	PRK00488	pfam01409	TIGR00468	FIRM000020	PMBAC076	1
rplT-L20	NP_390763	COG0292	PRK05185	pfam00453	TIGR01032	FIRM000070	PMBAC077	1
coaE	NP_390784	COG0237	PRK00081	pfam01121	TIGR00152	FIRM000085	PMBAC078	1
polA	NP_390787	COG0749	PRK05755	pfam00476	TIGR00593		PMBAC091	3
murC	NP_390857	COG0773	PRK00421	pfam08245	TIGR01082	FIRM000120	PMBAC079	1
smpB	NP_391240	COG0691	PRK05422	pfam01668	TIGR00086	FIRM000065	PMBAC080	1
whiA	NP_391355	COG1481		pfam02650	TIGR00647	FIRM000137	PMBAC081	1
uvrB	NP_391397	COG0556	PRK05298	pfam17757	TIGR00631	FIRM000122	PMBAC082	1
hpf-raiA	NP_391411	COG1544	PRK10470	pfam02482	TIGR00741	FIRM000143	PMBAC083	1
tsaC-ywlC	NP_391576	COG0009	PRK10634	pfam01300	TIGR00057	FIRM000138	PMBAC084	1
prmC	NP_391581	COG2890	PRK09328	pfam17827	TIGR00536	FIRM000141	PMBAC085	1
rplI-L9	NP_391930	COG0359	PRK00137	pfam03948	TIGR00158	FIRM000054	PMBAC086	1
gidB-rsmG	NP_391980	COG0357	PRK00107	pfam02527	TIGR00138	FIRM000118	PMBAC087	1
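
The BM column can be used to recover the gene list of each dataset from the table itself. The following is a minimal sketch, assuming the table has been saved as a tab-separated file with the same nine columns; the helper name genes_in_dataset and the temporary file are illustrative assumptions, not part of the PhyloM distribution.

```shell
#!/bin/sh
# genes_in_dataset N FILE (hypothetical helper): print the names of the genes
# whose BM column (9th tab-separated field) lists the dataset number N.
genes_in_dataset() {
  awk -F '\t' -v n="$1" 'NR > 1 {
    k = split($9, bm, ",")              # "1,2" means member of BM1 and BM2
    for (i = 1; i <= k; i++)
      if (bm[i] == n) { print $1; break }
  }' "$2"
}

# Demo on a header line and four rows copied from the table above.
tsv=$(mktemp)
{
  printf 'name\tB. subtilis CDS\tCOG\tPRK\tPfam\tTIGRFAMs\tPhyEco\tPhyloM\tBM\n'
  printf 'recF\tNP_387885\tCOG1195\tPRK00064\tpfam13175\tTIGR00611\tFIRM000113\tPMBAC001\t1\n'
  printf 'gyrB\tNP_387887\tCOG0187\tPRK05644\tpfam00204\tTIGR01059\t\tPMBAC088\t2\n'
  printf 'rpoB\tNP_387988\tCOG0085\tPRK00405\tpfam00562\tTIGR02013\tFIRM000042\tPMBAC012\t1,2\n'
  printf 'polA\tNP_390787\tCOG0749\tPRK05755\tpfam00476\tTIGR00593\t\tPMBAC091\t3\n'
} > "$tsv"

genes_in_dataset 2 "$tsv"   # prints gyrB, then rpoB
rm -f "$tsv"
```

Genes shared by two datasets, such as rpoB and rpoC, are reported for each of them because the BM field is split on commas.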

Usage

Downloading multiple sequence alignment (MSA) or position specific scoring matrix (PSSM) files

For each gene name GENE and each accession number ACCN from the databank BANK (i.e. COG, PRK, Pfam, TIGR, PhyEco or PhyloM) available in the above table, the reference multiple amino acid sequence alignment (MSA) can be accessed via the following URL model:

 http://giphy.pasteur.fr/PhyloM/Bacillaceae/aln/BANK/GENE.ACCN.faa

and the associated position specific scoring matrix (PSSM) via the following URL model:

 http://giphy.pasteur.fr/PhyloM/Bacillaceae/smp/BANK/GENE.ACCN.smp

For example, the reference MSA PMBAC086 for the gene rplI-L9 can be downloaded using wget with the following Linux command line:

 wget -q http://giphy.pasteur.fr/PhyloM/Bacillaceae/aln/PhyloM/rplI-L9.PMBAC086.faa

The same download can also be performed using curl with the following command line:

 curl --silent -O http://giphy.pasteur.fr/PhyloM/Bacillaceae/aln/PhyloM/rplI-L9.PMBAC086.faa
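
The two URL models above can be wrapped in a small shell function, so that any MSA or PSSM file is addressed from its databank, gene name and accession number. This is only a sketch: the function name phylom_url is a hypothetical helper, and the base URL and file extensions are the ones given in the URL models.

```shell
#!/bin/sh
# Base URL taken from the URL models above.
BASE=http://giphy.pasteur.fr/PhyloM/Bacillaceae

# phylom_url KIND BANK GENE ACCN (hypothetical helper)
# KIND is "aln" for an MSA (.faa file) or "smp" for a PSSM (.smp file).
phylom_url() {
  case "$1" in
    aln) ext=faa ;;
    smp) ext=smp ;;
    *)   echo "unknown kind: $1" >&2; return 1 ;;
  esac
  echo "$BASE/$1/$2/$3.$4.$ext"
}

# Print the MSA and PSSM URLs of rplI-L9 in the PhyloM databank.
phylom_url aln PhyloM rplI-L9 PMBAC086
phylom_url smp PhyloM rplI-L9 PMBAC086

# To download instead of printing, pass the URL to curl, e.g.:
#   curl --silent -O "$(phylom_url smp PhyloM rplI-L9 PMBAC086)"
```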

Downloading all files associated with a given dataset

Gene names from a given dataset DATASET (i.e. BM1, BM2 or BM3) can be accessed via the following URL model:

 http://giphy.pasteur.fr/PhyloM/Bacillaceae/aln/BM/DATASET.txt

Every reference MSA from the databank BANK (i.e. COG, PRK, Pfam, TIGR, PhyEco or PhyloM) associated with a given dataset DATASET (i.e. BM1, BM2 or BM3) can be accessed via the following URL model:

 http://giphy.pasteur.fr/PhyloM/Bacillaceae/aln/BM/BANK.DATASET.aln.tar.gz

Similarly, every PSSM from the databank BANK (i.e. COG, PRK, Pfam, TIGR, PhyEco or PhyloM) associated with a given dataset DATASET (i.e. BM1, BM2 or BM3) can be accessed via the following URL model:

 http://giphy.pasteur.fr/PhyloM/Bacillaceae/smp/BM/BANK.DATASET.smp.tar.gz

For example, the COG PSSM files that belong to the dataset BM1 can be downloaded using curl and uncompressed using tar with the following Linux command line:

 curl --silent http://giphy.pasteur.fr/PhyloM/Bacillaceae/smp/BM/COG.BM1.smp.tar.gz | tar -xz
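
To fetch the archives of every databank for one dataset, the archive URL model can be expanded in a loop. The sketch below only prints the URLs; the function name archive_urls is a hypothetical helper. Note that, according to the summary table, some combinations contain no file (e.g. PhyEco for BM3), so the corresponding archive may be empty or absent.

```shell
#!/bin/sh
# Base URL taken from the URL models above.
BASE=http://giphy.pasteur.fr/PhyloM/Bacillaceae

# archive_urls KIND DATASET (hypothetical helper)
# KIND is "aln" (MSA archives) or "smp" (PSSM archives);
# DATASET is BM1, BM2 or BM3. One URL is printed per databank.
archive_urls() {
  for bank in COG PRK Pfam TIGR PhyEco PhyloM; do
    echo "$BASE/$1/BM/$bank.$2.$1.tar.gz"
  done
}

# List the six PSSM archives of dataset BM2.
archive_urls smp BM2

# To download and unpack them, feed each URL to curl and tar, e.g.:
#   archive_urls smp BM2 | while read -r url; do
#     curl --silent "$url" | tar -xz
#   done
```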

Using an MSA for performing a BLAST search against an amino acid sequence databank

Each of the MSA files can be used as a query for performing a psiblast search with the BLAST+ tools (Camacho et al. 2009). Let cds.faa be a FASTA-formatted amino acid sequence file (e.g. every CDS from a bacterial genome). This coding sequence file should first be formatted as a protein BLAST database using the following Linux command line (the option -dbtype is mandatory for makeblastdb):

 makeblastdb  -in cds.faa  -dbtype prot

Next, an MSA file msa.faa can be directly used as a query for performing a BLAST search with the following Linux command line model:

 psiblast  -in_msa msa.faa  -db cds.faa  -seg no  -word_size 2  -evalue 1E-20  -xdrop_gap_final 1000
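
When several MSA files are available, the same psiblast command can be run in a loop, one query per file. The following is a sketch under stated assumptions: the helper name search_all_msa, the directory names and the choice of tabular output (-outfmt 6) are illustrative and not prescribed by PhyloM.

```shell
#!/bin/sh
# search_all_msa MSA_DIR DB OUT_DIR (hypothetical helper)
# Run psiblast once per .faa file found in MSA_DIR against the protein
# database DB, writing one tabular result file per MSA into OUT_DIR.
search_all_msa() {
  mkdir -p "$3"
  for msa in "$1"/*.faa; do
    [ -e "$msa" ] || continue            # MSA_DIR contains no .faa file
    name=$(basename "$msa" .faa)         # e.g. rplI-L9.PMBAC086
    psiblast -in_msa "$msa" -db "$2" -seg no -word_size 2 \
             -evalue 1E-20 -xdrop_gap_final 1000 -outfmt 6 > "$3/$name.tsv"
  done
}

# Example call (requires BLAST+ and a database built from cds.faa):
#   search_all_msa aln cds.faa hits
```

An empty result file then simply indicates that no hit passed the E-value threshold for that marker.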

Using a PSSM for performing a BLAST search against a nucleotide sequence databank

Each of the PSSM files can be used as a query for performing a tblastn search with the BLAST+ tools (Camacho et al. 2009). Let seq.fna be a FASTA-formatted nucleotide sequence file (e.g. bacterial genome contig sequences). This nucleotide sequence file should first be formatted using the following Linux command line:

 makeblastdb  -in seq.fna  -dbtype nucl

Next, a PSSM file pssm.smp can be directly used as a query for performing a BLAST search with the following Linux command line model:

 tblastn  -in_pssm pssm.smp  -db seq.fna  -seg no  -word_size 2  -evalue 1E-20  -xdrop_gap_final 1000

Of note, the corresponding full CDS can be easily extracted using the program eFASTA along with fields 2, 9 and 10 output by the tblastn option -outfmt 6.
Note also that searching for and extracting a CDS from a genome sequence using an associated PSSM file can be easily carried out with the tool eCDS.
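
In the default -outfmt 6 layout, fields 2, 9 and 10 are the subject sequence identifier, the subject start and the subject end; for tblastn hits on the reverse strand, the start is larger than the end. The sketch below turns these three fields into ordered coordinates with a strand sign. The helper name hit_regions is hypothetical, and the two input lines are made-up values used for illustration only, not real search results.

```shell
#!/bin/sh
# hit_regions (hypothetical helper): read tblastn -outfmt 6 lines on stdin
# and print, for each hit: contig, leftmost position, rightmost position, strand.
hit_regions() {
  awk -F '\t' '{
    if ($9 + 0 <= $10 + 0) print $2 "\t" $9  "\t" $10 "\t+"
    else                   print $2 "\t" $10 "\t" $9  "\t-"
  }'
}

# Two fabricated hits: one on the forward strand, one on the reverse strand.
{
  printf 'rplI-L9\tcontig_1\t98.0\t149\t3\t0\t1\t149\t1201\t1647\t1e-90\t280\n'
  printf 'rplI-L9\tcontig_7\t95.3\t149\t7\t0\t1\t149\t5447\t5001\t1e-85\t270\n'
} | hit_regions
```

These coordinates delimit only the aligned region, which may be shorter than the full CDS.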

Literature cited

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10:421. doi:10.1186/1471-2105-10-421

Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Research, 44:D279-D285. doi:10.1093/nar/gkv1344

Galperin MY, Makarova KS, Wolf YI, Koonin EV (2015) Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Research, 43:D261-D269. doi:10.1093/nar/gku1223

Gupta RS, Patel S, Saini N, Chen S (2020) Robust demarcation of 17 distinct Bacillus species clades, proposed as novel Bacillaceae genera, by phylogenomics and comparative genomic analyses: description of Robertmurraya kyonggiensis sp. nov. and proposal for an emended genus Bacillus limiting it only to the members of the Subtilis and Cereus clades of species. International Journal of Systematic and Evolutionary Microbiology, 70(11):5753-5798. doi:10.1099/ijsem.0.004475

Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT, White O (2001) TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Research, 29(1):41-43. doi:10.1093/nar/29.1.41

Patel S, Gupta RS (2020) A phylogenomic and comparative genomic framework for resolving the polyphyly of the genus Bacillus: Proposal for six new genera of Bacillus species, Peribacillus gen. nov., Cytobacillus gen. nov., Mesobacillus gen. nov., Neobacillus gen. nov., Metabacillus gen. nov. and Alkalihalobacillus gen. nov. International Journal of Systematic and Evolutionary Microbiology, 70:406-438. doi:10.1099/ijsem.0.003775

Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on protein families. Science, 278(5338):631-637. doi:10.1126/science.278.5338.631

Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4:41. doi:10.1186/1471-2105-4-41

Wu D, Jospin G, Eisen JA (2013) Systematic Identification of Gene Families for Use as “Markers” for Phylogenetic and Phylogeny-Driven Ecological Studies of Bacteria and Archaea and Their Major Subgroups. PLoS ONE, 8(10):e77033. doi:10.1371/journal.pone.0077033