Multiple sequence alignment manipulation

How to generate character bootstrap replicates from a FASTA-formatted multiple sequence alignment?
How to generate block bootstrap replicates from a FASTA-formatted multiple sequence alignment?
How to infer a consensus sequence from a FASTA-formatted multiple sequence alignment?

How to generate character bootstrap replicates from a FASTA-formatted multiple sequence alignment?

The following awk one-liner writes $nrep FASTA-formatted files, each containing one character bootstrap replicate generated from an initial multiple sequence alignment file $infile in FASTA format.

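Random selection is controlled by the integer seed value $seed. Each output file name is built by concatenating $basefile, the replicate number r (from 1 to $nrep) and the suffix .fasta, so basefile=rep. yields rep.1.fasta, rep.2.fasta, and so on. Output is appended (>>), so any existing files with these names should be removed beforehand.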

awk -v s="$seed" -v r="$nrep" -v f="$basefile" '/^>/{seq[n]=sn;lbl[++n]=$0;sn="";next} {sn=sn$0}
                                                END{l=length(seq[n]=sn);srand(s);++r;
                                                    while(--r>0){i=c=0;while(++c<=l)b[c]=1+int(l*rand());++l;o=f""r".fasta";
                                                                 while(++i<=n){sb="";si=seq[i];c=l;while(--c>0)sb=sb""substr(si,b[c],1);print lbl[i]"\n"sb>>o}close(o);--l}}' "$infile"
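The idea behind the one-liner can be sketched more readably as follows. This is only an illustration under stated assumptions: the function name char_bootstrap, the toy alignment and the temporary directory are hypothetical, and the sketch writes a single replicate to standard output instead of $nrep files. The key point it shows is that alignment columns are drawn with replacement once per replicate, and the same draw is applied to every sequence.

```shell
# Toy alignment (hypothetical data): two sequences of length 6,
# the second being the lower-case copy of the first.
tmp=$(mktemp -d)
printf '>a\nACGTAC\n>b\nacgtac\n' > "$tmp/toy.fasta"

# Sketch of one character bootstrap replicate.
# usage: char_bootstrap seed infile
char_bootstrap() {
  awk -v s="$1" '
    /^>/ { lbl[++n] = $0; next }        # header line: start a new sequence
         { seq[n] = seq[n] $0 }         # sequence line: append to the current sequence
    END {
      L = length(seq[1]); srand(s)
      # draw L column indices in 1..L, with replacement
      for (c = 1; c <= L; c++) col[c] = 1 + int(L * rand())
      # apply the same column draw to every sequence
      for (i = 1; i <= n; i++) {
        out = ""
        for (c = 1; c <= L; c++) out = out substr(seq[i], col[c], 1)
        print lbl[i]; print out
      }
    }' "$2"
}

char_bootstrap 1 "$tmp/toy.fasta" > "$tmp/rep1.fasta"
cat "$tmp/rep1.fasta"
```

Because both toy sequences are resampled with the same columns, the second output sequence is always the lower-case copy of the first, whatever the seed.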

[200503ac]

How to generate block bootstrap replicates from a FASTA-formatted multiple sequence alignment?

The following awk one-liner writes $nrep FASTA-formatted files, each containing one block bootstrap replicate generated from an initial multiple sequence alignment file $infile in FASTA format.

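Blocks are non-overlapping and of size $bsize (e.g. 3 for a codon bootstrap); the alignment length must be a multiple of $bsize. Random selection is controlled by the integer seed value $seed. Each output file name is built by concatenating $basefile, the replicate number r (from 1 to $nrep) and the suffix .fasta, so basefile=rep. yields rep.1.fasta, rep.2.fasta, and so on. Output is appended (>>), so any existing files with these names should be removed beforehand.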

awk -v s="$seed" -v r="$nrep" -v x="$bsize" -v f="$basefile" '/^>/{seq[n]=sn;lbl[++n]=$0;sn="";next} {sn=sn$0}
                                                              END{if((l=length(seq[n]=sn))%x!=0){print"alignment length ("l") is not a multiple of "x;exit 1}l/=x;srand(s);++r;
                                                                  while(--r>0){i=c=0;while(++c<=l)b[c]=int(l*rand());++l;o=f""r".fasta";
                                                                               while(++i<=n){sb="";si=seq[i];c=l;while(--c>0)sb=sb""substr(si,1+x*b[c],x);print lbl[i]"\n"sb>>o}close(o);--l}}' "$infile"
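A more readable sketch of the same block resampling is given below. It rests on assumptions: the function name block_bootstrap, the toy alignment and the temporary directory are hypothetical, and a single replicate is written to standard output. With blocks of size 3, whole codons are drawn with replacement, so every codon of the replicate is one of the codons of the original alignment.

```shell
# Toy alignment (hypothetical data): three codons per sequence.
tmp=$(mktemp -d)
printf '>a\nAAACCCGGG\n>b\naaacccggg\n' > "$tmp/toy.fasta"

# Sketch of one block bootstrap replicate.
# usage: block_bootstrap seed bsize infile
block_bootstrap() {
  awk -v s="$1" -v x="$2" '
    /^>/ { lbl[++n] = $0; next }        # header line: start a new sequence
         { seq[n] = seq[n] $0 }         # sequence line: append to the current sequence
    END {
      L = length(seq[1])
      if (L % x != 0) {
        print "alignment length (" L ") is not a multiple of " x
        exit 1
      }
      nb = L / x; srand(s)
      # draw nb block indices in 0..nb-1, with replacement
      for (c = 1; c <= nb; c++) blk[c] = int(nb * rand())
      # apply the same block draw to every sequence
      for (i = 1; i <= n; i++) {
        out = ""
        for (c = 1; c <= nb; c++) out = out substr(seq[i], 1 + x * blk[c], x)
        print lbl[i]; print out
      }
    }' "$3"
}

block_bootstrap 1 3 "$tmp/toy.fasta" > "$tmp/rep1.fasta"
cat "$tmp/rep1.fasta"
```

Calling the sketch with a block size that does not divide the alignment length (e.g. block_bootstrap 1 2 on the toy file) prints the error message and returns a non-zero exit status.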

[200503ac]

How to infer a consensus sequence from a FASTA-formatted multiple sequence alignment?

The following awk one-liner returns a consensus sequence from the FASTA-formatted multiple sequence alignment file $infile: at each alignment position, the consensus is the most frequent character state among the sequences, or ? when every state at that position is unknown. The awk variable unk lists every character state to be treated as unknown, separated by pipe symbols (|).

awk -v unk="?|-|X" '/^>/{seq[n]=sn;++n;sn="";next} {sn=sn$0}
                    END{l=length(seq[n]=sn);while(++s<=l){delete no;c="?";i=m=0;while(++i<=n)if(!((r=substr(seq[i],s,1))~unk)&&(x=++no[r])>m){m=x;c=r}printf "%s",c}print""}' "$infile"
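The per-column majority rule can be sketched more readably as follows. Several things here are assumptions and not part of the one-liner above: the function name consensus and the toy alignment are hypothetical, and unknown states are given as a plain string of characters ("?-X") tested with index() instead of a pipe-separated list. Ties are resolved in favour of the state that first reaches the highest count.

```shell
# Toy alignment (hypothetical data): four sequences of length 4.
tmp=$(mktemp -d)
printf '>a\nACG-\n>b\nACT-\n>c\nAGT?\n>d\nTCTX\n' > "$tmp/toy.fasta"

# Sketch of a majority-rule consensus.
# usage: consensus infile
consensus() {
  awk -v unk="?-X" '
    /^>/ { n++; next }                  # header line: start a new sequence
         { seq[n] = seq[n] $0 }         # sequence line: append to the current sequence
    END {
      L = length(seq[1])
      for (s = 1; s <= L; s++) {
        split("", cnt)                  # empty the per-column counts
        best = "?"; max = 0
        for (i = 1; i <= n; i++) {
          r = substr(seq[i], s, 1)
          if (index(unk, r) > 0) continue          # skip unknown states
          if (++cnt[r] > max) { max = cnt[r]; best = r }
        }
        printf "%s", best
      }
      print ""
    }' "$1"
}

consensus "$tmp/toy.fasta"   # prints ACT?
```

The last column contains only unknown states (-, ? and X), so the consensus character there is ?.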

[200503ac]