PhyloM: NCLDV is a selection of markers that are well suited for phylogenetic tree inference of members of the Nucleocytoplasmic large DNA virus (NCLDV) group. NCLDV are a group of large and giant eukaryotic dsDNA viruses that includes Poxviridae, Iridoviridae, Ascoviridae, Asfarviridae, Marseilleviridae, Mimiviridae and Phycodnaviridae as well as several lineages of unclassified viruses such as pithoviruses, pandoraviruses, molliviruses and faustoviruses (Koonin and Yutin 2012). This gene selection relies on the NCLDV core genome as defined in Guglielmini et al. (2019). From this core genome, 8 markers were selected because (i) they are present in more than 90% of the NCLDV used in the study and (ii) they show signals of co-evolution, which makes it possible to concatenate their alignments into one supermatrix.
The 8 genes are listed in the table below, sorted according to their order in the reference genome of Acanthamoeba polyphaga mimivirus (APMV) (GenBank accession: NC_014649). Common gene names are given in the column name, with a link to the UniProt description. The column APMV CDS lists the NCBI accession numbers of the APMV CDS, with links to the corresponding entries in the NCBI Conserved Domain Database. The reference multiple amino acid sequence alignments (MSA) and their associated position-specific scoring matrices (PSSM) are available in the last two columns.
full name | name | APMV CDS | MSA | PSSM
DNA primase | primase | YP_003986703 | primase.faa | primase.smp
DNA-directed RNA polymerase subunit 2 | rpo2 | YP_003986740 | rpo2.faa | rpo2.smp
DNA polymerase | polB | YP_003986825 | polB.faa | polB.smp
Transcription factor SII | tfIIS | YP_003986841 | tfIIS.faa | tfIIS.smp
Major capsid protein | mcp | YP_003986929 | mcp.faa | mcp.smp
Late transcription factor | vltF3 | YP_003986933 | vltF3.faa | vltF3.smp
Packaging ATPase | pATPase | YP_003986942 | pATPase.faa | pATPase.smp
DNA-directed RNA polymerase subunit 1 | rpo1 | YP_003987013 | rpo1.faa | rpo1.smp
For each gene name NAME listed in the table above, the reference multiple amino acid sequence alignment (MSA) can be accessed via the following URL model:
http://giphy.pasteur.fr/PhyloM/NCLDV/aln/NAME.faa
and the associated position-specific scoring matrix (PSSM) via the following URL model:
http://giphy.pasteur.fr/PhyloM/NCLDV/smp/NAME.smp
For example, the reference MSA for the gene rpo2 can be downloaded with wget using the following Linux command line:
wget -q http://giphy.pasteur.fr/PhyloM/NCLDV/aln/rpo2.faa
The same download can also be performed with curl using the following command line:
curl --silent -O http://giphy.pasteur.fr/PhyloM/NCLDV/aln/rpo2.faa
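The two URL models above can be applied to every marker in one pass. The following is a minimal sketch, assuming the eight gene names from the table; the variable names GENES and BASE and the helper phylom_urls are hypothetical and not part of PhyloM. It only prints the URLs, so nothing is downloaded until the list is handed to a download tool.

```shell
#!/bin/sh
# Gene names as listed in the table above.
GENES="primase rpo2 polB tfIIS mcp vltF3 pATPase rpo1"
BASE="http://giphy.pasteur.fr/PhyloM/NCLDV"

# phylom_urls NAME: print the MSA and PSSM URLs of one marker
# (hypothetical helper, built from the two URL models).
phylom_urls() {
  printf '%s/aln/%s.faa\n' "$BASE" "$1"
  printf '%s/smp/%s.smp\n' "$BASE" "$1"
}

# Print the 16 URLs (8 MSA + 8 PSSM), one per line.
for g in $GENES; do
  phylom_urls "$g"
done
```

If the script is saved as, say, phylom_urls.sh (an arbitrary name), its output can be piped to wget, which reads a URL list from standard input with the option -i -: sh phylom_urls.sh | wget -q -i -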
All MSA and PSSM are also available as a tar.gz archive here:
Each of the PhyloM MSA files can be used as a query for a psiblast search with the BLAST+ tools (Camacho et al. 2009).
Let cds.faa be a FASTA-formatted amino acid sequence file (e.g. every CDS from a bacterial genome).
This databank should first be formatted with the following Linux command line:
makeblastdb -in cds.faa -dbtype prot
Next, a PhyloM MSA file msa.faa can be used directly as a query for a BLAST search with the following Linux command line model:
psiblast -in_msa msa.faa -db cds.faa -seg no -word_size 2 -evalue 1E-20 -xdrop_gap_final 1000
Each of the PhyloM PSSM files can be used as a query for a tblastn search with the BLAST+ tools (Camacho et al. 2009).
Let seq.fna be a FASTA-formatted nucleotide sequence file (e.g. de novo assembly of a bacterial genome).
This databank should first be formatted with the following Linux command line:
makeblastdb -in seq.fna -dbtype nucl
Next, a PhyloM PSSM file pssm.smp can be used directly as a query for a BLAST search with the following Linux command line model:
tblastn -in_pssm pssm.smp -db seq.fna -seg no -word_size 2 -evalue 1E-20 -xdrop_gap_final 1000
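The command line model above can be repeated for each of the eight markers. The following is a minimal sketch, assuming the PSSM files were downloaded under their original names (NAME.smp) and that seq.fna has already been formatted with makeblastdb; the function name search_markers and the output naming NAME.tsv are hypothetical choices, not part of PhyloM.

```shell
#!/bin/sh
# Marker names from the table above.
GENES="primase rpo2 polB tfIIS mcp vltF3 pATPase rpo1"

# search_markers DB [PSSM_DIR]: run tblastn once per marker against DB,
# using the same options as the command line model, and write the
# tabular hits (-outfmt 6) to NAME.tsv in the current directory.
search_markers() {
  db=$1
  dir=${2:-.}   # directory holding the NAME.smp files (default: current)
  for g in $GENES; do
    tblastn -in_pssm "$dir/$g.smp" -db "$db" -seg no -word_size 2 \
            -evalue 1E-20 -xdrop_gap_final 1000 -outfmt 6 > "$g.tsv"
  done
}
```

A call such as search_markers seq.fna would then produce one tabular result file per marker, which makes it easy to see which markers have no hit in the assembly (empty files).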
Of note, the corresponding full CDS can be easily extracted with the program eFASTA, using fields 2, 9 and 10 (subject sequence identifier, subject start and subject end) of the output produced by the tblastn option -outfmt 6.
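The three fields mentioned above can be pulled out of the tabular output with awk. The following is a minimal sketch; the function name hit_coords is hypothetical, and the sample line is made-up data used only to show the field layout. Because tblastn reports hits on the reverse strand with a start greater than the end, the sketch reorders the two coordinates and adds a strand column. How these coordinates are then passed to eFASTA is not shown here, since that depends on the eFASTA command line.

```shell
#!/bin/sh
# hit_coords: read BLAST tabular lines (-outfmt 6) on standard input and
# print, tab-separated: subject id (field 2), start, end, strand,
# with start <= end whatever the hit orientation.
hit_coords() {
  awk -F '\t' '{
    if (($9 + 0) <= ($10 + 0)) print $2 "\t" $9  "\t" $10 "\t+";
    else                       print $2 "\t" $10 "\t" $9  "\t-";
  }'
}

# Made-up example line: a reverse-strand hit on contig_1 (5300 -> 2601).
printf 'polB\tcontig_1\t45.2\t900\t400\t12\t1\t890\t5300\t2601\t1e-150\t520\n' | hit_coords
# prints: contig_1	2601	5300	-
```

In practice the function would sit at the end of a pipeline, e.g. tblastn ... -outfmt 6 | hit_coords.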
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10:421. doi:10.1186/1471-2105-10-421
Guglielmini J, Woo A, Krupovic M, Forterre P, Gaia M (2019) Diversification of giant and large eukaryotic dsDNA viruses predated the origin of modern eukaryotes. Proceedings of the National Academy of Sciences, 116(39):19585-19592. doi:10.1073/pnas.1912006116
Koonin EV, Yutin N (2012) Nucleo‐cytoplasmic Large DNA Viruses (NCLDV) of Eukaryotes. In: eLS. John Wiley & Sons, Ltd: Chichester. doi:10.1002/9780470015902.a0023268