The following gawk one-liner selects only the lower-triangular part of the square distance matrix file $infile and writes it into $outfile:
gawk 'NR>1{(m<(l=length(lbl[++n]=$(c=j=1))))&&m=l;--j;while(++c<=n)d[n][(++j)]=$c}
END{print(b=" ")n;x=0.5;while((x*=2)<m)b=b""b;while(++i<=n){printf substr(lbl[i]b,1,m);j=0;while(++j<i)printf" "d[i][j];print""}}' $infile > $outfile
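To make the input and output layouts concrete, here is a minimal sketch in portable awk of the same selection on a hypothetical 3-taxon matrix (labels A, B, C and their distances are made up for illustration). It is a simplified equivalent, not the one-liner itself: rows are streamed instead of stored, labels are not padded to a common width, and labels are assumed to contain no blanks.

```shell
# Hypothetical 3-taxon square matrix in PHYLIP format (illustrative values).
infile=$(mktemp)
cat > "$infile" <<'EOF'
 3
A 0.0 0.1 0.2
B 0.1 0.0 0.3
C 0.2 0.3 0.0
EOF

# For matrix row r (r = NR - 1), keep the label and the first r-1 distances.
lower=$(awk 'NR == 1 { print; next }       # copy the header line holding n
NR > 1 {
    r = NR - 1                             # index of the current matrix row
    line = $1                              # taxon label
    for (c = 2; c <= r; c++)               # fields 2..r hold d[r][1..r-1]
        line = line " " $c
    print line
}' "$infile")

printf '%s\n' "$lower"
rm -f "$infile"
```

With this input, the output keeps the header line, then `A`, `B 0.1` and `C 0.2 0.3`.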
The following gawk one-liner reads the lower-triangular distance matrix file $infile and writes its equivalent square matrix into $outfile:
gawk 'NR>1{(m<(l=length(lbl[++n]=$(c=j=1))))&&m=l;--j;while(++c<=n)d[j][n]=d[n][(++j)]=$c}
END{print(b=" ")n;x=0.5;while((x*=2)<m)b=b""b;z=substr("0.0000000000000000000",1,length(d[1][2]));
while(++i<=n){d[i][i]=z;printf substr(lbl[i]b,1,m);j=0;while(++j<=n)printf" "d[i][j];print""}}' $infile > $outfile
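The mirroring step can be sketched in portable awk as follows, on a hypothetical 3-taxon lower-triangular file. This is a simplified illustration under stated assumptions: the diagonal is hard-coded as `0.0` (the one-liner above instead matches the width of the other entries), labels are not padded, and labels contain no blanks.

```shell
# Hypothetical lower-triangular matrix (illustrative values).
infile=$(mktemp)
cat > "$infile" <<'EOF'
 3
A
B 0.1
C 0.2 0.3
EOF

square=$(awk 'NR > 1 {
    n++; lbl[n] = $1
    for (c = 2; c <= n; c++) {             # row n holds d[n][1..n-1]
        d[n, c-1] = $c
        d[c-1, n] = $c                     # mirror into the upper triangle
    }
}
END {
    print " " n
    for (i = 1; i <= n; i++) {
        line = lbl[i]
        for (j = 1; j <= n; j++)           # assumption: diagonal written as 0.0
            line = line " " (i == j ? "0.0" : d[i, j])
        print line
    }
}' "$infile")

printf '%s\n' "$square"
rm -f "$infile"
```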
Given a PHYLIP-formatted square distance matrix file $infile and a list of labels $taxfile (one per line), the following gawk one-liner extracts the corresponding submatrix and writes it into $outfile:
gawk 'NR==1{next}
NR==FNR{(m<(l=length(lbl[++n]=$(c=j=1))))&&m=l;--j;while(++c<=NF)d[n][(++j)]=$c;next}
{j=0;while(++j<=n)if(lbl[j]==$1){s[++ns]=j;break}}
END{print(b=" ")ns;x=0.5;while((x*=2)<m)b=b""b;i=0;
while(++i<=ns){printf substr(lbl[si=s[i]]b,1,m);j=0;while(++j<=ns)printf" "d[si][s[j]];print""}}' $infile $taxfile > $outfile
Of note, the above one-liner can also be used to reorder the distance matrix so that it follows the order of the labels in $taxfile.
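The extraction and reordering behaviour can be illustrated with a small sketch in portable awk. The matrix, the labels (A, B, C) and the label list (C then A) are hypothetical, and the code is a simplified equivalent rather than the one-liner itself: labels are looked up through an associative array, are not padded, and are assumed to contain no blanks.

```shell
# Hypothetical square matrix and label list (illustrative values).
infile=$(mktemp); taxfile=$(mktemp)
cat > "$infile" <<'EOF'
 3
A 0.0 0.1 0.2
B 0.1 0.0 0.3
C 0.2 0.3 0.0
EOF
printf 'C\nA\n' > "$taxfile"

sub=$(awk 'NR == FNR {                     # first file: the square matrix
    if (FNR > 1) {                         # skip the header line
        n++; row[$1] = n                   # remember the row index of each label
        for (c = 2; c <= NF; c++) d[n, c-1] = $c
    }
    next
}
($1 in row) { s[++ns] = $1 }               # second file: requested labels, in order
END {
    print " " ns
    for (i = 1; i <= ns; i++) {
        line = s[i]
        for (j = 1; j <= ns; j++)
            line = line " " d[row[s[i]], row[s[j]]]
        print line
    }
}' "$infile" "$taxfile")

printf '%s\n' "$sub"
rm -f "$infile" "$taxfile"
```

Because the label list gives C before A, the resulting 2x2 submatrix lists C first, which is the reordering effect mentioned above.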
The OEPL format is useful for dealing with matrix files, especially when the entries are estimated simultaneously (e.g. by parallel computing). The format is very simple: the first line contains the n labels separated by blank spaces, and each remaining line contains three columns: the row index i, the column index j, and the value of entry ij (row and column indices start at 1).
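As an illustration of this layout, the sketch below writes a small OEPL file for three hypothetical labels (A, B, C, with made-up distances) and counts its entries. It assumes that each unordered pair of labels is listed exactly once, which the format description does not state explicitly.

```shell
# Hypothetical OEPL file: one line of labels, then one "i j value" line per entry.
oepl=$(mktemp)
cat > "$oepl" <<'EOF'
A B C
2 1 0.1
3 1 0.2
3 2 0.3
EOF

# Sanity check: print n, the number of entries found, and n(n-1)/2.
summary=$(awk 'NR == 1 { n = NF; next }    # first line: the n labels
{ k++ }                                    # every other line: one entry
END { print n, k, n * (n - 1) / 2 }' "$oepl")

printf '%s\n' "$summary"
rm -f "$oepl"
```

Under the stated assumption, the last two numbers printed are equal for a complete file.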
A PHYLIP-formatted distance matrix file $infile (either square or lower-triangular) can easily be transformed into an OEPL-formatted file $outfile with the following command line:
or the following one without tac:
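Such a conversion can be sketched in portable awk as shown below. This is an illustrative reconstruction under stated assumptions, not necessarily either of the commands referred to above: labels contain no blanks, each matrix row sits on a single line, and only the entries below the diagonal are written (one line per unordered pair). The input file and its values are hypothetical.

```shell
# Hypothetical square matrix (a lower-triangular file is handled the same way).
infile=$(mktemp)
cat > "$infile" <<'EOF'
 3
A 0.0 0.1 0.2
B 0.1 0.0 0.3
C 0.2 0.3 0.0
EOF

oepl=$(awk 'NR > 1 {
    n++; lbl[n] = $1
    for (c = 2; c <= n; c++)               # fields 2..n hold d[n][1..n-1]
        ent[++k] = n " " (c - 1) " " $c    # buffer "i j value" until labels are known
}
END {
    line = lbl[1]
    for (i = 2; i <= n; i++) line = line " " lbl[i]
    print line                             # first line: the n labels
    for (e = 1; e <= k; e++) print ent[e]  # then one entry per line
}' "$infile")

printf '%s\n' "$oepl"
rm -f "$infile"
```

The entries are buffered because the first OEPL line needs every label, which is only known once the whole matrix has been read.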
Reciprocally, an OEPL-formatted distance matrix file $infile can easily be transformed into a PHYLIP-formatted file $outfile with the following gawk one-liner ($prec is the number of decimal places):
gawk -v p=$prec 'NR==1{while(++n<=NF){(m<(l=length(lbl[n]=$n)))&&m=l;d[n][n]=0}b=" ";x=0.5;while((x*=2)<m)b=b""b;next} {d[$1][$2]=d[$2][$1]=$3}
END{print" "(n-1);while(++i<n){printf substr(lbl[i]b,1,m);j=0;while(++j<n)printf(" %."p"f",d[i][j]);print""}}' $infile > $outfile
Of note, the above gawk one-liner returns a square matrix. A lower-triangular matrix can be obtained by replacing while(++j<n) with while(++j<i).