NEWICK-formatted tree manipulation

How to get the list of taxa from a NEWICK-formatted tree?
How to get the number of leaves from a NEWICK-formatted tree?
How to compute the total branch length of a NEWICK-formatted tree?
How to assess whether two NEWICK-formatted trees have the same set of leaf names?
How to compute the matrix of patristic distances from a NEWICK-formatted tree?
How to generate a binary matrix representation from a NEWICK-formatted tree?

How to get the list of taxa from a NEWICK-formatted tree?

The following command lines extract the list of leaf names (one per line) from a phylogenetic tree contained in a NEWICK-formatted tree file $treefile. To obtain the list on a single line, append | xargs echo to the command line. To sort the list in alphanumerical order, append | sort.

gotree

The program gotree could be used to return the list of leaf names with the following command line:

gotree stats tips -i $treefile | awk 'NR>1{print$NF}'

[170403fl]

Bash

Alternatively, the following Bash command line returns the same list:

tr '(,' '\n' < $treefile | grep -o '^[^:)]*'

[181208ac]
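
As a quick check, the Bash command line above can be tried on a small hand-made tree. The sketch below is only an illustration: the function name leaf_names and the example tree are assumptions introduced here, and the tree is written to a temporary file.

```shell
# leaf_names FILE: print one leaf name per line (same pipeline as above)
leaf_names() {
  tr '(,' '\n' < "$1" | grep -o '^[^:)]*'
}

# hypothetical example tree with four leaves
treefile=$(mktemp)
echo '((tB:0.2,tA:0.1):0.05,(tD:0.4,tC:0.3):0.15);' > "$treefile"

leaf_names "$treefile"                # one name per line, in tree order
leaf_names "$treefile" | xargs echo   # all names on a single line
leaf_names "$treefile" | sort         # names in alphanumerical order

rm -f "$treefile"
```

The names are read in the order in which they appear in the NEWICK string, which is why the sort suffix is needed for a canonical list.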

How to get the number of leaves from a NEWICK-formatted tree?

The following command lines return the number of leaves from a phylogenetic tree contained in a NEWICK-formatted tree file $treefile.

gotree

The program gotree could be used with the following command line:

gotree stats -i $treefile | tail -1 | awk '{print$3}'

[170403fl]

Bash

Alternatively, the following Bash command line could also be used:

tr '(,' '\n' < $treefile | grep -c -v "^$"

[170403ac]
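
The Bash count works because, for a tree written on a single line, cutting the string at every ( and , leaves exactly one non-empty chunk per leaf. A minimal sketch on a hand-made tree follows; the function name count_leaves and the example tree (with support values after the closing parentheses) are assumptions made for illustration.

```shell
# count_leaves FILE: number of non-empty chunks after cutting at '(' and ','
count_leaves() {
  tr '(,' '\n' < "$1" | grep -c -v '^$'
}

# hypothetical example tree: four leaves, two supported internal branches
treefile=$(mktemp)
echo '((tB:0.2,tA:0.1)0.95:0.05,(tD:0.4,tC:0.3)0.80:0.15);' > "$treefile"

count_leaves "$treefile"

# cross-check: the number of extracted leaf names should be the same
tr '(,' '\n' < "$treefile" | grep -o '^[^:)]*' | wc -l

rm -f "$treefile"
```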

How to compute the total branch length of a NEWICK-formatted tree?

The following command lines sum the lengths of all branches of a phylogenetic tree contained in a NEWICK-formatted file $treefile.

gotree

The program gotree returns the total branch length of a tree with the following command line:

gotree stats -i $treefile | awk 'NR>1{print $6}'

[170403fl]

Bash

Alternatively, the following Bash command line returns the same result:

grep -o ":[0-9\.-]*" $treefile | tr -d : | paste -sd+ | bc | sed 's/^\./0./'

[170403ac]
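
The character class used in the Bash command line above contains only digits, dots and minus signs, so a branch length written in scientific notation (e.g. 1e-05) would be cut at the letter e. The following sketch is one possible variant, not the original method: it also accepts exponents and lets awk do the summation instead of bc. The function name total_brlen is an assumption introduced here.

```shell
# total_brlen FILE: sum every value that follows a ':' in the NEWICK string
# (variant accepting scientific notation; result printed with 6 significant digits)
total_brlen() {
  grep -oE ':[0-9.eE+-]+' "$1" | tr -d : | awk '{ s += $1 } END { printf "%g\n", s }'
}

# hypothetical example tree: 0.1 + 0.2 + 0.05 + 0.3
treefile=$(mktemp)
echo '((A:0.1,B:0.2):0.05,C:0.3);' > "$treefile"
total_brlen "$treefile"
rm -f "$treefile"
```

The %g format keeps the output short; a wider format (e.g. %.12g) could be used when more digits are needed.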

How to assess whether two NEWICK-formatted trees have the same set of leaf names?

The following command lines return an exit status of 0 (true) if the phylogenetic trees contained in the two NEWICK-formatted files $treefile1 and $treefile2 have the same leaf set.

gotree

The program gotree could be used with the following command line:

test -z "$(gotree compare tips -i $treefile1 -c $treefile2 | sed '$d')"

[170403fl]

Bash

The same result could be obtained with the following Bash command line:

test -z "$(diff <(tr '(,' '\n' < $treefile1 | grep -o '^[^:)]*' | sort) <(tr '(,' '\n' < $treefile2 | grep -o '^[^:)]*' | sort))"

[181208ac]
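
Because the test is expressed through an exit status, it can be used directly in an if statement. The sketch below is a variant of the Bash command line above that compares the two sorted leaf lists as strings, which avoids process substitution; the function names leaf_set and same_leaf_set and the three example trees are assumptions made for illustration.

```shell
# leaf_set FILE: sorted list of leaf names, one per line
leaf_set() {
  tr '(,' '\n' < "$1" | grep -o '^[^:)]*' | sort
}

# same_leaf_set FILE1 FILE2: exit status 0 when both sorted lists are identical
same_leaf_set() {
  [ "$(leaf_set "$1")" = "$(leaf_set "$2")" ]
}

# hypothetical example trees
t1=$(mktemp); t2=$(mktemp); t3=$(mktemp)
echo '((A:1,B:1):1,(C:1,D:1):1);' > "$t1"
echo '((D:2,A:2):2,(B:2,C:2):2);' > "$t2"   # same leaves, different topology
echo '((A:1,B:1):1,(C:1,E:1):1);' > "$t3"   # leaf E instead of leaf D

if same_leaf_set "$t1" "$t2"; then echo "same leaf set"; else echo "different leaf sets"; fi
if same_leaf_set "$t1" "$t3"; then echo "same leaf set"; else echo "different leaf sets"; fi

rm -f "$t1" "$t2" "$t3"
```

Only the leaf names are compared, so two trees with different topologies or branch lengths still pass the test when they share the same leaves.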

How to compute the matrix of patristic distances from a NEWICK-formatted tree?

The patristic distance between two leaves is the sum of the lengths of the branches on the path connecting them in a phylogenetic tree. The program gotree can be used to compute the square matrix of patristic distances from the phylogenetic tree contained in the NEWICK-formatted file $treefile and write it into the PHYLIP-formatted file $outfile with the following command line:

gotree matrix -i $treefile -o $outfile

[181225ac]
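
Once the matrix is written, a single distance can be looked up by name. The sketch below assumes a whitespace-separated square PHYLIP layout (number of taxa on the first line, then one row per leaf with its name followed by the distances, columns in the same order as rows). The function name pdist is an assumption, and the example matrix is hand-written for the tree ((A:0.1,B:0.2):0.05,C:0.3); it is not actual gotree output.

```shell
# pdist FILE NAME1 NAME2: print the distance between two leaves
pdist() {
  awk -v a="$2" -v b="$3" '
    NR == 1 { next }                       # first line: number of taxa
    { name[NR - 1] = $1                    # remember the label of each row
      if ($1 == a)                         # keep the whole row of NAME1
        for (i = 2; i <= NF; i++) row[i - 1] = $i }
    END { for (j in name) if (name[j] == b) print row[j] }
  ' "$1"
}

# hand-written matrix: d(A,B)=0.1+0.2, d(A,C)=0.1+0.05+0.3, d(B,C)=0.2+0.05+0.3
outfile=$(mktemp)
printf '%s\n' '3' 'A 0 0.3 0.45' 'B 0.3 0 0.55' 'C 0.45 0.55 0' > "$outfile"

pdist "$outfile" A C
pdist "$outfile" B C

rm -f "$outfile"
```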

How to generate a binary matrix representation from a NEWICK-formatted tree?

Given a NEWICK-formatted tree file $treefile, the following command line writes into the text file $bmfile the sorted list of its leaf names (on the first line), followed by every non-trivial split in binary format (one per line).

sed 's/:[0-9\.-]*//g' $treefile | sed 's/)[0-9\.-]*/)/g' | 
  awk '{while(++i<length()){if((c=substr($0,i,1))=="(")p[++j]=i;else if(c==")"){print substr($0,p[j],i-p[j]);--j}}}' | 
    sed 's/[(,)]\+/ /g' | tac |
      awk '(NR==1){while(++n<=NF)lbl[n]=$n;asort(lbl);printf"0";while(++i<n)printf" "lbl[i];print"";next}
           {delete s;i=0;while(++i<=NF)s[$i];x=(lbl[1]in s)?1:0;y=1-x;i=0;while(++i<n)printf(lbl[i]in s)?x:y;print""}' | 
        sort -u | sed 's/0 //' > $bmfile

Of note, a transposed binary representation $tbmfile (i.e. each row corresponding to one leaf) can be obtained from the file $bmfile with the following awk one-liner:

awk 'NR==1{while(++n<=NF)m=(m>(l=length(lbl[n]=$n)))?m:(++l);b=" ";while(++x<m)b=b""b;next} {s[++l]=$0}
     END{while(++i<n){printf substr(lbl[i]""b,1,m);c=0;while(++c<=l)printf substr(s[c],i,1);print""}}' $bmfile > $tbmfile

Concerning the binary representation inside file $bmfile, it should be stressed that: 1. the leaf names in the first line are sorted in alphanumerical order, 2. the first leaf name is always encoded by 1, and 3. the binary splits are sorted.
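
Rules 1 and 2 can be illustrated on a single split. The sketch below is not part of the pipeline above: it is a small stand-alone helper (the name encode_split is an assumption) that takes the sorted leaf list and the leaves found on one side of a branch, and prints the corresponding binary line so that the first leaf is always encoded by 1.

```shell
# encode_split "SORTED LEAVES" "LEAVES OF ONE CLADE": print the binary split
encode_split() {
  awk -v leaves="$1" -v clade="$2" 'BEGIN {
    n = split(leaves, lbl, " ")            # leaf names, assumed already sorted
    m = split(clade, tmp, " ")
    for (i = 1; i <= m; i++) inclade[tmp[i]] = 1
    x = (lbl[1] in inclade) ? 1 : 0        # side of the first leaf gets code 1
    out = ""
    for (i = 1; i <= n; i++) {
      bit = (lbl[i] in inclade) ? x : 1 - x
      out = out bit
    }
    print out
  }'
}

encode_split "A B C D E" "A B"   # clade containing the first leaf
encode_split "A B C D E" "C D"   # clade not containing the first leaf
encode_split "A B C D E" "A B E" # complement of the previous clade, same split
```

Since a clade and its complement yield the same line, the encoding does not depend on where the tree is rooted in its NEWICK representation.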

As a consequence, every NEWICK representation of the same tree topology leads to the same file $bmfile (branch lengths and support values are not considered). The topology of any phylogenetic tree can then be easily hashed with e.g. the following command line:

md5sum $bmfile | awk '{print$1}'

If $treefile1 and $treefile2 each contain one NEWICK-formatted phylogenetic tree, and both have been processed by the above command line to create the binary matrix representation files $bmfile1 and $bmfile2, respectively, then checking whether the two trees have the same topology can be performed with the following command line:

test -z "$(diff $bmfile1 $bmfile2)"

More generally, when the two trees are binary and share the same leaf set, the number of bipartitions found in the first tree but not in the second (i.e. half of their Robinson-Foulds distance) can be obtained with the following command line:

diff $bmfile1 $bmfile2 | grep -c "<"

[181208ac]
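
The two comparisons above can be tried on small binary matrix files. In the sketch below, the three files are hand-written to follow the encoding rules described in this entry (sorted leaf names, first leaf encoded by 1, sorted splits) for three five-leaf trees; they are assumptions made for illustration, as are the function names same_topology and splits_only_in_first.

```shell
# same_topology BM1 BM2: exit status 0 when both files are identical
same_topology() {
  test -z "$(diff "$1" "$2")"
}

# splits_only_in_first BM1 BM2: number of lines of BM1 that diff reports as absent from BM2
splits_only_in_first() {
  diff "$1" "$2" | grep -c '<'
}

bm1=$(mktemp); bm2=$(mktemp); bm3=$(mktemp)
printf '%s\n' 'A B C D E' '11000' '11001' > "$bm1"   # ((A,B),(C,D),E);
printf '%s\n' 'A B C D E' '10100' '10101' > "$bm2"   # ((A,C),(B,D),E);
printf '%s\n' 'A B C D E' '11000' '11010' > "$bm3"   # ((A,B),(C,E),D);

same_topology "$bm1" "$bm1" && echo "identical topologies"
splits_only_in_first "$bm1" "$bm2"   # no split shared
splits_only_in_first "$bm1" "$bm3"   # split AB|CDE shared

rm -f "$bm1" "$bm2" "$bm3"
```

Because the split lines are sorted in every file, diff aligns the shared splits and reports only those specific to one of the two trees.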