The following command lines extract the list of leaf names (one per line) from a phylogenetic tree contained in a NEWICK-formatted tree file $treefile. To obtain the list on a single line, append | xargs echo to these command lines. To sort the list in alphanumerical order, append | sort.
The program gotree could be used to return the list of leaf names with the following command line:
Alternatively, the following Bash command line returns the same list:
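As a minimal sketch of such a Bash command line (not necessarily the one originally intended here), the pipeline below strips branch lengths and internal node labels, then prints the remaining tokens one per line. It assumes that leaf names are unquoted and contain none of the characters ( ) , : ; or whitespace. The function name leaf_names and the example tree are hypothetical.

```shell
# Hypothetical helper: print the leaf names of a NEWICK tree, one per line.
# Assumption: labels are unquoted and free of ( ) , : ; and whitespace.
leaf_names() {
  tr -d ' \t\r\n' < "$1" |      # drop all whitespace and line breaks
    sed 's/:[^(),;]*//g' |      # remove branch lengths
    sed 's/)[^(),;]*/)/g' |     # remove internal node labels and supports
    tr '(),;' '\n\n\n\n' |      # put each remaining token on its own line
    grep -v '^$'                # discard the empty lines
}

# Usage example on a small tree written to a temporary file.
treefile=$(mktemp)
echo '((A:0.1,B:0.2)90:0.05,(C:0.3,D:0.4)85:0.05);' > "$treefile"
leaf_names "$treefile"
```

Piping the output into xargs echo or sort, as described above, works unchanged with this function.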
The following command lines return the number of leaves of a phylogenetic tree contained in a NEWICK-formatted tree file $treefile.
The program gotree could be used with the following command line:
Alternatively, the following Bash command line could also be used:
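One possible Bash command line, given here as a sketch under stated assumptions, relies on the fact that an internal node with k children contributes k-1 commas to the NEWICK string, so the number of leaves equals the number of commas plus one. It assumes that no label contains a comma. The function name count_leaves is hypothetical.

```shell
# Hypothetical helper: number of leaves = number of commas + 1.
# Assumption: no leaf or node label contains a comma.
count_leaves() {
  echo $(( $(tr -cd ',' < "$1" | wc -c) + 1 ))
}

# Usage example on a five-leaf tree written to a temporary file.
treefile=$(mktemp)
echo '((A:0.1,B:0.2):0.05,(C:0.3,D:0.4):0.05,E:0.6);' > "$treefile"
count_leaves "$treefile"   # prints 5
```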
The following command lines sum the lengths of all branches of a phylogenetic tree contained in a NEWICK-formatted file $treefile.
The program gotree returns the total branch length of a tree with the following command line:
Alternatively, the following Bash command line returns the same result:
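A minimal sketch of such a Bash command line is given below. It extracts every token that follows a colon and sums these values with awk, assuming that colons only ever introduce branch lengths (i.e. no label contains a colon). The function name total_branch_length is hypothetical.

```shell
# Hypothetical helper: sum every branch length found after a colon.
# Assumption: colons occur only in front of branch lengths.
total_branch_length() {
  grep -o ':[0-9.eE+-]*' "$1" |          # one ":length" match per line
    tr -d ':' |                          # keep the numeric part only
    awk '{ s += $1 } END { print s + 0 }'
}

# Usage example on a small tree written to a temporary file.
treefile=$(mktemp)
echo '((A:1,B:2):0.5,(C:3,D:4):0.5);' > "$treefile"
total_branch_length "$treefile"   # prints 11
```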
The following command lines return the boolean true if the phylogenetic trees contained in the two NEWICK-formatted files $treefile1 and $treefile2 have the same leaf set.
The program gotree could be used with the following command line:
The same result could be obtained with the following Bash command line:
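As a sketch of one way to do this in Bash, the two sorted lists of leaf names can be compared as strings. The helper functions leaf_names and same_leaf_set are hypothetical, and leaf names are assumed to be unquoted and free of the characters ( ) , : ; and whitespace.

```shell
# Hypothetical helper: print the leaf names of a NEWICK tree, one per line.
# Assumption: labels are unquoted and free of ( ) , : ; and whitespace.
leaf_names() {
  tr -d ' \t\r\n' < "$1" | sed 's/:[^(),;]*//g' | sed 's/)[^(),;]*/)/g' |
    tr '(),;' '\n\n\n\n' | grep -v '^$'
}

# Hypothetical helper: print true when both trees have the same leaf set.
same_leaf_set() {
  if [ "$(leaf_names "$1" | sort)" = "$(leaf_names "$2" | sort)" ]; then
    echo true
  else
    echo false
  fi
}

# Usage example: same leaves, different topologies and leaf orders.
treefile1=$(mktemp); treefile2=$(mktemp)
echo '((A:0.1,B:0.2):0.05,(C:0.3,D:0.4):0.05);' > "$treefile1"
echo '((D,A),(C,B));' > "$treefile2"
same_leaf_set "$treefile1" "$treefile2"   # prints true
```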
The patristic distance between two leaves is the sum of the lengths of all branches on the path connecting them in a phylogenetic tree. The program gotree can be used to compute the square matrix of patristic distances from the phylogenetic tree contained in the NEWICK-formatted file $treefile and write it into the PHYLIP-formatted file $outfile with the following command line:
Given a NEWICK-formatted tree file $treefile, the following command line writes into the text file $bmfile the sorted list of its leaf names (on the first line), followed by every non-trivial split in binary format (one per line).
sed 's/:[0-9.eE+-]*//g' $treefile | sed 's/)[0-9\.-]*/)/g' |
awk '{while(++i<length()){if((c=substr($0,i,1))=="(")p[++j]=i;else if(c==")"){print substr($0,p[j],i-p[j]);--j}}}' |
sed 's/[(,)]\+/ /g' | tac |
awk '(NR==1){while(++n<=NF)lbl[n]=$n;asort(lbl);printf"0";while(++i<n)printf" "lbl[i];print"";next}
{delete s;i=0;while(++i<=NF)s[$i];x=(lbl[1]in s)?1:0;y=1-x;i=0;while(++i<n)printf(lbl[i]in s)?x:y;print""}' |
sort -u | sed 's/0 //' > $bmfile
Of note, a transposed binary representation $tbmfile (i.e. each row corresponding to one leaf) can be obtained from the file $bmfile with the following awk one-liner:
awk 'NR==1{while(++n<=NF)m=(m>(l=length(lbl[n]=$n)))?m:(++l);b=" ";while(length(b)<m)b=b""b;l=0;next} {s[++l]=$0}
     END{while(++i<n){printf "%s",substr(lbl[i]""b,1,m);c=0;while(++c<=l)printf "%s",substr(s[c],i,1);print""}}' $bmfile > $tbmfile
Concerning the binary representation inside file $bmfile, it should be stressed that: 1. the leaf names in the first line are sorted in alphanumerical order, 2. the first leaf name is always encoded by 1, and 3. the binary splits are sorted.
As a consequence, every NEWICK representation of the same tree topology always leads to the same file $bmfile (branch lengths and supports are not considered). The topology of any phylogenetic tree can therefore be easily hashed with e.g. the following command line:
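As an illustrative sketch, any checksum tool applied to $bmfile gives such a hash. The example below uses the POSIX cksum utility for portability, which is an assumption rather than the tool originally intended; md5sum or sha256sum can be substituted where they are available. The function name topology_hash and the file contents are hypothetical.

```shell
# Hypothetical helper: print a checksum of a binary matrix file.
# cksum writes "<CRC> <size>" for standard input; only the CRC is kept.
topology_hash() {
  cksum < "$1" | awk '{ print $1 }'
}

# Usage example on a hand-written binary matrix file
# (leaf names on the first line, then one split per line).
bmfile=$(mktemp)
printf 'A B C D E\n11000\n11001\n' > "$bmfile"
topology_hash "$bmfile"
```

Two trees with the same topology yield identical files and therefore identical hash values. CRC32 is short, so a stronger digest is preferable when many topologies are compared.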
If $treefile1 and $treefile2 are two files, each containing one NEWICK-formatted phylogenetic tree, that have been processed by the above command line to create the two binary matrix representation files $bmfile1 and $bmfile2, respectively, checking that they have the same topology can therefore be easily performed with the following command line:
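Since identical topologies produce identical files, a byte-wise comparison is sufficient. The sketch below uses cmp for this purpose, which is one option among several (diff -q would work as well). The function name same_topology and the file contents are hypothetical.

```shell
# Hypothetical helper: print true when the two binary matrix files are identical.
same_topology() {
  if cmp -s "$1" "$2"; then
    echo true
  else
    echo false
  fi
}

# Usage example with two hand-written binary matrix files.
bmfile1=$(mktemp); bmfile2=$(mktemp)
printf 'A B C D E\n11000\n11001\n' > "$bmfile1"
printf 'A B C D E\n11000\n11001\n' > "$bmfile2"
same_topology "$bmfile1" "$bmfile2"   # prints true
```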
More generally, when the two trees are binary ones, a bipartition distance between them could be easily derived with the following command line:
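One way to derive such a distance, sketched here under stated assumptions, is to count the splits that occur in exactly one of the two files, i.e. the size of the symmetric difference of the two split sets. The sketch assumes that both trees have the same leaf set (identical first lines) and that each file lists each split only once, as guaranteed by the sort -u step above. The function name bipartition_distance is hypothetical, and the original command line may use a different normalization.

```shell
# Hypothetical helper: number of splits present in exactly one of the two files.
# Assumptions: same leaf set in both files, each split listed once per file.
bipartition_distance() {
  { tail -n +2 "$1"; tail -n +2 "$2"; } |   # skip the leaf-name lines
    sort | uniq -u |                        # keep splits seen only once
    awk 'END { print NR }'                  # count them
}

# Usage example: ((A,B),(C,D),E) versus ((A,B),C,(D,E)).
bmfile1=$(mktemp); bmfile2=$(mktemp)
printf 'A B C D E\n11000\n11001\n' > "$bmfile1"
printf 'A B C D E\n11000\n11100\n' > "$bmfile2"
bipartition_distance "$bmfile1" "$bmfile2"   # prints 2
```

For two binary trees on the same n leaves, each file lists n-3 splits, so this count is always even and half of it is the number of splits of one tree that are missing from the other.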