The p-distance is the proportion of characters at which the two sequences to be compared are different.
The program goalign could be used to compute every pairwise p-distance from the file $infile
containing a multiple sequence alignment. The resulting p-distance matrix is written into the file $outfile
in PHYLIP square format.
Of note, the overall speed could be improved by launching goalign on multiple threads (i.e. option -t
).
The following awk program allows creating the p-distance matrix from the file $infile
containing a multiple nucleotide sequence alignment. The resulting p-distance matrix is written into the file $outfile
in PHYLIP lower-triangular format ($prec
is the number of decimal places).
Importantly, the Bash variable $unk
is the list of every character state that should be considered as unknown (separated by pipe symbols |
).
When dealing with nucleotide sequences, use: unk="R|Y|S|W|K|M|B|D|H|V|N|X|?|-"
.
When dealing with amino acid sequences, use: unk="X|?|-"
awk -v p=$prec -v u="$unk" '/^>/{seq[n]=s;lbl[++n]=substr($0,2);gsub(" ","_",lbl[n]);(m<(x=length(lbl[n])))&&m=x;s="";next} {s=s""toupper($0)}
END{seq[n]=s; print(b=" ")n; x=0.5;while((x*=2)<m)b=b""b;
while(++i<=n){split(seq[i],si,"");printf substr(lbl[i]b,1,m);j=0;while(++j<i){c=k=split(seq[j],sj,"");
d=0;++c;while(--c)(((ri=si[c])~u)||((rj=sj[c])~u))&&(--k)||((ri!=rj)&&(++d));printf(" %."p"f",(d/k))}print""}}' $infile > $outfile
The evolutionary distance based on the evolutionary model proposed by Jukes and Cantor (1969) is denoted JC69 here.
The program goalign could be used to compute every pairwise evolutionary distance from the file $infile
containing a multiple nucleotide sequence alignment. The resulting JC69 distance matrix is written into the file $outfile
in PHYLIP square format.
-t
).
The following awk program allows creating the JC69 distance matrix from the file $infile
containing a multiple nucleotide sequence alignment. The resulting JC69 distance matrix is written into the file $outfile
in PHYLIP lower-triangular format ($prec
is the number of decimal places).
unk="R|Y|S|W|K|M|B|D|H|V|N|X|?|-";
awk -v p=$prec -v u="$unk" '/^>/{seq[n]=s;lbl[++n]=substr($0,2);gsub(" ","_",lbl[n]);(m<(x=length(lbl[n])))&&m=x;s="";next} {s=s""toupper($0)}
END{seq[n]=s; print(b=" ")n; x=0.5;while((x*=2)<m)b=b""b;
while(++i<=n){split(seq[i],si,"");printf substr(lbl[i]b,1,m);j=0;while(++j<i){c=k=split(seq[j],sj,"");
d=0;++c;while(--c)(((ri=si[c])~u)||((rj=sj[c])~u))&&(--k)||((ri!=rj)&&(++d));printf(" %."p"f",(d==0)?0:-0.75*log(1-4*d/(3*k)))}print""}}' $infile > $outfile
The evolutionary distance based on the evolutionary model proposed by Kimura (1980) is denoted K80 here.
The program goalign could be used to compute every pairwise evolutionary distances from the file $infile
containing a multiple nucleotide sequence alignment. The resulting K80 distance matrix is written into the file $outfile
in PHYLIP square format.
-t
).
The evolutionary distance based on the evolutionary model proposed by Felsenstein (1981) is denoted F81 here.
The program goalign could be used to compute every pairwise evolutionary distances from the file $infile
containing a multiple nucleotide sequence alignment. The resulting F81 distance matrix is written into the file $outfile
in PHYLIP square format.
-t
).
For more details about the F84 evolutionary model and its related evolutionary distance estimate, see e.g. Kishino and Hasegawa (1989), Felsenstein and Churchill (1996), and McGuire et al. (1999).
The program goalign could be used to compute every pairwise evolutionary distances from the file $infile
containing a multiple nucleotide sequence alignment. The resulting F84 distance matrix is written into the file $outfile
in PHYLIP square format.
-t
).
The evolutionary distance based on the evolutionary model proposed by Tamura and Nei (1993) is denoted TN93 here.
The program goalign could be used to compute every pairwise evolutionary distances from the file $infile
containing a multiple nucleotide sequence alignment. The resulting TN93 distance matrix is written into the file $outfile
in PHYLIP square format.
-t
).
Given an empirical model of amino acid substitution, an estimate of the evolutionary distance can be simply derived from the p-distance between each pair of aligned amino acid sequences; see e.g. Bigot, Guglielmini and Criscuolo (2019). Based on this property, the following awk program allows creating a distance matrix from the file $infile
containing a multiple amino acid sequence alignment. The resulting distance matrix is written into the file $outfile
in PHYLIP lower-triangular format ($prec
is the number of decimal places).
Importantly, the two variables a
and b
should be explicitely specified to obtain the evolutionary distances corresponding to the selected evolutionary model.
For example, when expecting evolutionary distances derived from the LG model of amino acid substitution, use: a=3.56820; b=0.94051
.
For a complete list of empirical models of amino acid substitution and their associated nucmerical constants a
et b
, see Bigot, Guglielmini and Criscuolo (2019).
a=3.56820; b=0.94051; # LG model
awk -v a=$a -v b=$b -v p=$prec -v u="X|?|-" '/^>/{seq[n]=s;lbl[++n]=substr($0,2);gsub(" ","_",lbl[n]);(m<(x=length(lbl[n])))&&m=x;s="";next} {s=s""toupper($0)}
END{seq[n]=s; print(bl=" ")n; x=0.5;while((x*=2)<m)bl=bl""bl;
while(++i<=n){split(seq[i],si,"");printf substr(lbl[i]bl,1,m);j=0;while(++j<i){c=k=split(seq[j],sj,"");
d=0;++c;while(--c)(((ri=si[c])~u)||((rj=sj[c])~u))&&(--k)||((ri!=rj)&&(++d));printf(" %."p"f",(d==0)?0:a*b*((1-d/(b*k))^(-1/a)-1))}print""}}' $infile > $outfile