Let $infile be a FASTA-formatted multiple sequence alignment file and $gaprate a real value (from 0 to 1). The following command lines will discard every aligned character containing a proportion of gaps that is higher than $gaprate and write the remaining characters into the FASTA-formatted file $outfile.
The program goalign allows gapped characters to be filtered out with the following command line:
awk -v r="$gaprate" '!/^>/{s=s$0;next} {seq[n]=s;s="";lbl[++n]=$0}
END{l=length(seq[n]=s);++l;g=n*r;i=(++n);while(--i>0){split(seq[i],si,"");m=l;while(--m>0)(si[m]=="-")&&gap[m]++}
while(++i<n){print lbl[i];split(seq[i],si,"");m=0;while(++m<l)printf("%s",(gap[m]<=g)?si[m]:"");print""}}' "$infile" > "$outfile"
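
As a quick sanity check of the awk filter above, the following sketch runs it on a small made-up alignment. The sequences, the labels s1 to s4, the temporary directory $workdir and the cutoff 0.5 are illustrative assumptions, not data from the original analysis. The awk program is the one given above, written with quoted shell variables and an explicit printf format. It relies on split() with an empty separator to cut a sequence into single characters, which gawk, mawk and BusyBox awk support but POSIX does not guarantee.

```shell
# Hypothetical toy input: 4 sequences, 6 aligned characters each.
# The third sequence is wrapped over two lines to show that wrapped
# FASTA records are concatenated before filtering.
workdir=$(mktemp -d)
infile="$workdir/in.fasta"
outfile="$workdir/out.fasta"
gaprate=0.5

cat > "$infile" <<'EOF'
>s1
AC-G-T
>s2
A--G-T
>s3
ACT
G-T
>s4
A--GAT
EOF

# Gap counts per character: 0, 2, 3, 0, 3, 0. With 4 sequences and
# gaprate=0.5 the threshold is 4*0.5=2 gaps, so characters 3 and 5
# (3 gaps each) are discarded and character 2 (exactly 2 gaps) is kept.
awk -v r="$gaprate" '!/^>/{s=s$0;next} {seq[n]=s;s="";lbl[++n]=$0}
END{l=length(seq[n]=s);++l;g=n*r;i=(++n);while(--i>0){split(seq[i],si,"");m=l;while(--m>0)(si[m]=="-")&&gap[m]++}
while(++i<n){print lbl[i];split(seq[i],si,"");m=0;while(++m<l)printf("%s",(gap[m]<=g)?si[m]:"");print""}}' "$infile" > "$outfile"

cat "$outfile"
```

On this input the filtered alignment contains the four remaining characters of each sequence (ACGT, A-GT, ACGT, A-GT), each written on a single line after its label. A character whose gap proportion equals $gaprate is kept, because only proportions strictly higher than $gaprate are discarded.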