Institut Pasteur | DBC | Bioinformatics and Biostatistics Hub | GIPhy

PhyloM: NCLDV


Description

PhyloM: NCLDV is a selection of markers that are well suited for phylogenetic tree inference of members of the Nucleocytoplasmic large DNA virus (NCLDV) group. The NCLDV are a group of large and giant eukaryotic dsDNA viruses that includes Poxviridae, Iridoviridae, Ascoviridae, Asfarviridae, Marseilleviridae, Mimiviridae and Phycodnaviridae, as well as several lineages of unclassified viruses such as pithoviruses, pandoraviruses, molliviruses and faustoviruses (Koonin and Yutin 2012). This gene selection relies on the NCLDV core genome as defined in Guglielmini et al. (2019). From this core genome, 8 markers were selected because (i) they are present in more than 90% of the NCLDV used in the study and (ii) they show signals of co-evolution, which makes it possible to concatenate their alignments into one supermatrix.


Marker list

The 8 genes are listed in the table below. They are sorted according to their position in the reference genome of Acanthamoeba polyphaga mimivirus (APMV; GenBank accession NC_014649). The name column gives the common gene name, with a link to the UniProt description. The APMV CDS column lists the NCBI accession numbers of the APMV CDS, with links to the corresponding entries in the NCBI Conserved Domain Database. The reference multiple amino acid sequence alignments (MSA) and their associated position specific scoring matrices (PSSM) are available in the last two columns.

full name                               name      APMV CDS       MSA           PSSM
DNA primase                             primase   YP_003986703   primase.faa   primase.smp
DNA-directed RNA polymerase subunit 2   rpo2      YP_003986740   rpo2.faa      rpo2.smp
DNA polymerase                          polB      YP_003986825   polB.faa      polB.smp
Transcription factor SII                tfIIS     YP_003986841   tfIIS.faa     tfIIS.smp
Major capsid protein                    mcp       YP_003986929   mcp.faa       mcp.smp
Late transcription factor               vltF3     YP_003986933   vltF3.faa     vltF3.smp
Packaging ATPase                        pATPase   YP_003986942   pATPase.faa   pATPase.smp
DNA-directed RNA polymerase subunit 1   rpo1      YP_003987013   rpo1.faa      rpo1.smp

Usage

Downloading multiple sequence alignment (MSA) or position specific scoring matrix (PSSM) files

For each gene name NAME listed in the table above, the reference multiple amino acid sequence alignment (MSA) can be accessed via the following URL model:

 http://giphy.pasteur.fr/PhyloM/NCLDV/aln/NAME.faa

and the associated position specific scoring matrix (PSSM) via the following URL model:

 http://giphy.pasteur.fr/PhyloM/NCLDV/smp/NAME.smp

For example, the reference MSA for the gene rpo2 can be downloaded with wget using the following Linux command line:

 wget -q http://giphy.pasteur.fr/PhyloM/NCLDV/aln/rpo2.faa

The same download can also be performed with curl using the following command line:

 curl --silent -O http://giphy.pasteur.fr/PhyloM/NCLDV/aln/rpo2.faa

All MSA and PSSM files are also available as a single tar.gz archive.
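
When the files are fetched one by one rather than from the archive, the two URL models above can be expanded for the 8 gene names of the marker table. The following is a minimal sketch: the base URL and the gene names come from this page, whereas the function names phylom_urls and phylom_download_all are illustrative and not part of PhyloM.

```shell
#!/usr/bin/env bash
# Base URL and gene names as given on this page.
BASE=http://giphy.pasteur.fr/PhyloM/NCLDV
NAMES="primase rpo2 polB tfIIS mcp vltF3 pATPase rpo1"

# Print the MSA and PSSM URLs of every marker, one per line
# (illustrative helper, not part of PhyloM).
phylom_urls() {
  for name in $NAMES; do
    echo "$BASE/aln/$name.faa"
    echo "$BASE/smp/$name.smp"
  done
}

# Download every listed file into the current directory;
# wget reads the URL list from standard input (-i -).
phylom_download_all() {
  phylom_urls | wget -q -i -
}
```

Calling phylom_urls only prints the 16 URLs, which allows them to be checked before phylom_download_all is run.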

Using an MSA for performing a BLAST search against an amino acid sequence databank

Each of the PhyloM MSA files can be used as a query for performing a psiblast search with the BLAST+ tools (Camacho et al. 2009). Let cds.faa be a FASTA-formatted amino acid sequence file (e.g. every CDS from a bacterial genome). This databank should first be formatted with the following Linux command line:

 makeblastdb  -in cds.faa  -dbtype prot

Next, a PhyloM MSA file msa.faa can be used directly as a query for performing a BLAST search with the following Linux command line model:

 psiblast  -in_msa msa.faa  -db cds.faa  -seg no  -word_size 2  -evalue 1E-20  -xdrop_gap_final 1000
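
The command line model above can be repeated for each of the 8 markers. The sketch below assumes that the MSA files were downloaded into the current directory under their original names. The function name psiblast_all_markers, the PSIBLAST override variable and the output file names are illustrative assumptions. The BLAST+ options -outfmt 6 and -out are added here to write one tabular result file per marker.

```shell
#!/usr/bin/env bash
# Gene names as listed in the marker table.
NAMES="primase rpo2 polB tfIIS mcp vltF3 pATPase rpo1"

# Run one psiblast search per marker against the databank given as $1.
# Setting PSIBLAST=echo prints the command arguments instead of running
# the searches (illustrative dry-run switch).
psiblast_all_markers() {
  db=$1   # databank previously formatted with makeblastdb
  for name in $NAMES; do
    "${PSIBLAST:-psiblast}" -in_msa "$name.faa" -db "$db" \
      -seg no -word_size 2 -evalue 1E-20 -xdrop_gap_final 1000 \
      -outfmt 6 -out "$name.psiblast.tsv"
  done
}
```

For example, psiblast_all_markers cds.faa would write primase.psiblast.tsv, rpo2.psiblast.tsv, and so on, one file per marker.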

Using a PSSM for performing a BLAST search against a nucleotide sequence databank

Each of the PhyloM PSSM files can be used as a query for performing a tblastn search with the BLAST+ tools (Camacho et al. 2009). Let seq.fna be a FASTA-formatted nucleotide sequence file (e.g. de novo assembly of a bacterial genome). This databank should first be formatted with the following Linux command line:

 makeblastdb  -in seq.fna  -dbtype nucl

Next, a PhyloM PSSM file pssm.smp can be used directly as a query for performing a BLAST search with the following Linux command line model:

 tblastn  -in_pssm pssm.smp  -db seq.fna  -seg no  -word_size 2  -evalue 1E-20  -xdrop_gap_final 1000

Of note, the corresponding full CDS can be easily extracted by using the program eFASTA along with fields 2, 9 and 10 (subject sequence identifier, subject start and subject end) output by the tblastn option -outfmt 6.
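
As an illustration of these three fields, the sketch below reads the default tabular output of -outfmt 6 and prints, for each hit, the subject identifier, the hit coordinates in ascending order and the strand. tblastn reports hits on the reverse strand with a start greater than the end. The function name hit_coords is illustrative, and the way these coordinates are then passed to eFASTA is not covered here.

```shell
#!/usr/bin/env bash
# Read tblastn -outfmt 6 lines (tab-separated, default 12 fields) from the
# given files or from standard input, and print for each hit:
#   subject id, lower coordinate, upper coordinate, strand
# (illustrative helper, not part of PhyloM or eFASTA).
hit_coords() {
  awk -F '\t' 'BEGIN { OFS = "\t" }
    {
      if ($9 + 0 <= $10 + 0) print $2, $9, $10, "+";
      else                   print $2, $10, $9, "-";
    }' "$@"
}
```

For example, hit_coords can be applied to a result file written by tblastn with the options -outfmt 6 and -out.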


Literature cited

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10:421. doi:10.1186/1471-2105-10-421

Guglielmini J, Woo A, Krupovic M, Forterre P, Gaia M (2019) Diversification of giant and large eukaryotic dsDNA viruses predated the origin of modern eukaryotes. Proceedings of the National Academy of Sciences, 116(39):19585-19592. doi:10.1073/pnas.1912006116

Koonin EV, Yutin N (2012) Nucleo‐cytoplasmic Large DNA Viruses (NCLDV) of Eukaryotes. In: eLS. John Wiley & Sons, Ltd: Chichester. doi:10.1002/9780470015902.a0023268