Evolutionary distance estimates

How to estimate a matrix of p-distances from a FASTA-formatted multiple sequence alignment?
How to estimate a matrix of evolutionary distances from a FASTA-formatted multiple nucleotide sequence alignment?
How to estimate a matrix of evolutionary distances from a FASTA-formatted multiple amino-acid sequence alignment?

How to estimate a matrix of p-distances from a FASTA-formatted multiple sequence alignment?

The p-distance is the proportion of characters at which the two sequences to be compared are different.

goalign

The program goalign could be used to compute every pairwise p-distance from the file $infile containing a multiple sequence alignment. The resulting p-distance matrix is written into the file $outfile in PHYLIP square format.

goalign compute distance -m pdist -i $infile -o $outfile

Of note, the overall speed could be improved by launching goalign on multiple threads (i.e. option -t).

[170315fl]

awk

The following awk program allows creating the p-distance matrix from the file $infile containing a multiple nucleotide sequence alignment. The resulting p-distance matrix is written into the file $outfile in PHYLIP lower-triangular format ($prec is the number of decimal places).

Importantly, the Bash variable $unk is the list of every character state that should be considered as unknown (separated by pipe symbols |).

When dealing with nucleotide sequences, use: unk="R|Y|S|W|K|M|B|D|H|V|N|X|?|-".

When dealing with amino acid sequences, use: unk="X|?|-"

awk -v p=$prec -v u="$unk" '/^>/{seq[n]=s;lbl[++n]=substr($0,2);gsub(" ","_",lbl[n]);(m<(x=length(lbl[n])))&&m=x;s="";next}  {s=s""toupper($0)}
                            END{seq[n]=s; print(b=" ")n; x=0.5;while((x*=2)<m)b=b""b;
                                while(++i<=n){split(seq[i],si,"");printf substr(lbl[i]b,1,m);j=0;while(++j<i){c=k=split(seq[j],sj,"");
                                  d=0;++c;while(--c)(((ri=si[c])~u)||((rj=sj[c])~u))&&(--k)||((ri!=rj)&&(++d));printf(" %."p"f",(d/k))}print""}}' $infile > $outfile

[200209ac]

How to estimate a matrix of evolutionary distances from a FASTA-formatted multiple nucleotide sequence alignment?

Jukes and Cantor (1969) evolutionary model

The evolutionary distance based on the evolutionary model proposed by Jukes and Cantor (1969) is denoted JC69 here.

goalign

The program goalign could be used to compute every pairwise evolutionary distance from the file $infile containing a multiple nucleotide sequence alignment. The resulting JC69 distance matrix is written into the file $outfile in PHYLIP square format.

goalign compute distance -m jc -i $infile -o $outfile

Of note, the overall speed could be improved by launching goalign on multiple threads (i.e. option -t).

[170506fl]

awk

The following awk program allows creating the JC69 distance matrix from the file $infile containing a multiple nucleotide sequence alignment. The resulting JC69 distance matrix is written into the file $outfile in PHYLIP lower-triangular format ($prec is the number of decimal places).

unk="R|Y|S|W|K|M|B|D|H|V|N|X|?|-";
awk -v p=$prec -v u="$unk" '/^>/{seq[n]=s;lbl[++n]=substr($0,2);gsub(" ","_",lbl[n]);(m<(x=length(lbl[n])))&&m=x;s="";next}  {s=s""toupper($0)}
                            END{seq[n]=s; print(b=" ")n; x=0.5;while((x*=2)<m)b=b""b;
                                while(++i<=n){split(seq[i],si,"");printf substr(lbl[i]b,1,m);j=0;while(++j<i){c=k=split(seq[j],sj,"");
                                  d=0;++c;while(--c)(((ri=si[c])~u)||((rj=sj[c])~u))&&(--k)||((ri!=rj)&&(++d));printf(" %."p"f",(d==0)?0:-0.75*log(1-4*d/(3*k)))}print""}}' $infile > $outfile

[200209ac]

Kimura (1980) evolutionary model

The evolutionary distance based on the evolutionary model proposed by Kimura (1980) is denoted K80 here.

The program goalign could be used to compute every pairwise evolutionary distances from the file $infile containing a multiple nucleotide sequence alignment. The resulting K80 distance matrix is written into the file $outfile in PHYLIP square format.

goalign compute distance -m k2p -i $infile -o $outfile

Of note, the overall speed could be improved by launching goalign on multiple threads (i.e. option -t).

[170506fl]

Felsenstein (1981) evolutionary model

The evolutionary distance based on the evolutionary model proposed by Felsenstein (1981) is denoted F81 here.

The program goalign could be used to compute every pairwise evolutionary distances from the file $infile containing a multiple nucleotide sequence alignment. The resulting F81 distance matrix is written into the file $outfile in PHYLIP square format.

goalign compute distance -m f81 -i $infile -o $outfile

Of note, the overall speed could be improved by launching goalign on multiple threads (i.e. option -t).

[170506fl]

Felsenstein (1984) evolutionary model

For more details about the F84 evolutionary model and its related evolutionary distance estimate, see e.g. Kishino and Hasegawa (1989), Felsenstein and Churchill (1996), and McGuire et al. (1999).

The program goalign could be used to compute every pairwise evolutionary distances from the file $infile containing a multiple nucleotide sequence alignment. The resulting F84 distance matrix is written into the file $outfile in PHYLIP square format.

goalign compute distance -m f84 -i $infile -o $outfile

Of note, the overall speed could be improved by launching goalign on multiple threads (i.e. option -t).

[170506fl]

Tamura and Nei (1993) evolutionary model

The evolutionary distance based on the evolutionary model proposed by Tamura and Nei (1993) is denoted TN93 here.

The program goalign could be used to compute every pairwise evolutionary distances from the file $infile containing a multiple nucleotide sequence alignment. The resulting TN93 distance matrix is written into the file $outfile in PHYLIP square format.

goalign compute distance -m tn93 -i $infile -o $outfile

Of note, the overall speed could be improved by launching goalign on multiple threads (i.e. option -t).

[170506fl]

How to estimate a matrix of evolutionary distances from a FASTA-formatted multiple amino-acid sequence alignment?

Given an empirical model of amino acid substitution, an estimate of the evolutionary distance can be simply derived from the p-distance between each pair of aligned amino acid sequences; see e.g. Bigot, Guglielmini and Criscuolo (2019). Based on this property, the following awk program allows creating a distance matrix from the file $infile containing a multiple amino acid sequence alignment. The resulting distance matrix is written into the file $outfile in PHYLIP lower-triangular format ($prec is the number of decimal places).

Importantly, the two variables a and b should be explicitely specified to obtain the evolutionary distances corresponding to the selected evolutionary model.

For example, when expecting evolutionary distances derived from the LG model of amino acid substitution, use: a=3.56820; b=0.94051.

For a complete list of empirical models of amino acid substitution and their associated nucmerical constants a et b, see Bigot, Guglielmini and Criscuolo (2019).

a=3.56820; b=0.94051; # LG model
awk -v a=$a -v b=$b -v p=$prec -v u="X|?|-" '/^>/{seq[n]=s;lbl[++n]=substr($0,2);gsub(" ","_",lbl[n]);(m<(x=length(lbl[n])))&&m=x;s="";next}  {s=s""toupper($0)}
                                             END{seq[n]=s; print(bl=" ")n; x=0.5;while((x*=2)<m)bl=bl""bl;
                                                 while(++i<=n){split(seq[i],si,"");printf substr(lbl[i]bl,1,m);j=0;while(++j<i){c=k=split(seq[j],sj,"");
                                                   d=0;++c;while(--c)(((ri=si[c])~u)||((rj=sj[c])~u))&&(--k)||((ri!=rj)&&(++d));printf(" %."p"f",(d==0)?0:a*b*((1-d/(b*k))^(-1/a)-1))}print""}}' $infile > $outfile

[200209ac]