FASTA-formatted file manipulation

How to convert a FASTA file with sequences written on several lines into a FASTA file with each sequence written on one line?
How to convert a FASTA file with each sequence written on one line into a FASTA file with each sequence broken into several equal length lines?
How to reverse-complement a FASTA-formatted nucleotide sequence?
How to extract a specific region from a FASTA-formatted nucleotide sequence file?
How to translate a FASTA-formatted codon sequence?

How to convert a FASTA file with sequences written on several lines into a FASTA file with each sequence written on one line?

The following awk one-liner converts the FASTA-formatted sequence file $infile into the reformatted FASTA file $outfile:

awk '!/^>/{s=s$0;next}(s!=""){print s;s=""}{print}END{print s}' $infile > $outfile

[170212ac]
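To make the logic of this one-liner easier to follow, the sketch below spells out the same awk rules inside a commented shell function and runs it on a small two-record file. The function name unwrap_fasta and the temporary demo file are illustrative assumptions, not existing tools.

```shell
# unwrap_fasta FILE: print FILE with each sequence joined onto a single line.
# (unwrap_fasta is a hypothetical helper name, not an existing tool.)
unwrap_fasta() {
  awk '
    !/^>/   { s = s $0; next }        # sequence line: append it to the buffer
    s != "" { print s; s = "" }       # header reached: flush the previous sequence
            { print }                 # print the header itself
    END     { if (s != "") print s }  # flush the last sequence
  ' "$1"
}

# Worked example on a small two-record file (created only for illustration)
demo=$(mktemp)
printf '>seq1\nACGT\nACGT\n>seq2\nGG\n' > "$demo"
unwrap_fasta "$demo"
# >seq1
# ACGTACGT
# >seq2
# GG
rm -f "$demo"
```

Unlike the one-liner, this version prints nothing for an empty input file, because the final flush is skipped when the buffer is empty.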

How to convert a FASTA file with each sequence written on one line into a FASTA file with each sequence broken into several equal length lines?

The following one-liners break each sequence in the FASTA-formatted file $infile into lines of at most $cutoff characters and write the result into the FASTA file $outfile.

awk

awk -v l=$cutoff '/^>/{print;next}{for(c=1;c<=length($0);c+=l)print substr($0,c,l)}' $infile > $outfile

[170212ac]

Bash

while IFS= read -r line ; do [[ $line == \>* ]] && echo "$line" && continue ; echo "$line" | fold -w $cutoff ; done < $infile > $outfile

[170212ac]
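A quick way to check the result of either one-liner is to measure the longest sequence line of the output file and compare it with $cutoff. The sketch below is one possible check; the function name max_seq_line_length is an illustrative assumption.

```shell
# max_seq_line_length FILE: print the length of the longest non-header line.
# (max_seq_line_length is a hypothetical helper name.)
max_seq_line_length() {
  awk '!/^>/ && length($0) > m { m = length($0) } END { print m + 0 }' "$1"
}

# Worked example: sequence lines of 4, 4 and 2 characters
demo=$(mktemp)
printf '>seq1\nACGT\nACGT\nAC\n' > "$demo"
max_seq_line_length "$demo"   # prints 4
rm -f "$demo"
```

A wrapped file is consistent with the requested width when [ "$(max_seq_line_length $outfile)" -le $cutoff ] succeeds.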

How to reverse-complement a FASTA-formatted nucleotide sequence?

Given a FASTA-formatted file $infile containing a nucleotide sequence, the following command lines will write its reverse-complement into the FASTA-formatted file $outfile.

tee >(grep "^>" > $outfile) < $infile | grep -v "^>" | tr -d '\n' | rev | tr 'ACGTacgt' 'TGCAtgca' >> $outfile

[200208ac]
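The command line above assumes a single sequence in $infile. For a file holding several sequences, one possible approach is sketched below: a pure awk function that reverse-complements each record separately and leaves the header lines untouched. The function name revcomp_fasta is an illustrative assumption. Only A, C, G and T (upper or lower case) are complemented, and any other character is reversed but kept as-is.

```shell
# revcomp_fasta FILE: reverse-complement every sequence of FILE, one line per sequence.
# (revcomp_fasta is a hypothetical helper name.)
revcomp_fasta() {
  awk '
    BEGIN {
      # complement table; characters outside it are left unchanged
      n = split("A C G T a c g t", from, " ")
      split("T G C A t g c a", to, " ")
      for (i = 1; i <= n; i++) comp[from[i]] = to[i]
    }
    function emit(   i, c, out) {
      out = ""
      for (i = length(seq); i >= 1; i--) {      # walk the sequence backwards
        c = substr(seq, i, 1)
        out = out ((c in comp) ? comp[c] : c)   # complement when possible
      }
      if (out != "") print out
      seq = ""
    }
    /^>/ { emit(); print; next }                # header: flush the previous record
         { seq = seq $0 }                       # sequence line: accumulate
    END  { emit() }
  ' "$1"
}

# Worked example
demo=$(mktemp)
printf '>s1\nAACG\nT\n>s2\nggcc\n' > "$demo"
revcomp_fasta "$demo"
# >s1
# ACGTT
# >s2
# ggcc
rm -f "$demo"
```

The character-by-character loop keeps the sketch short but can be slow on very long sequences; the rev and tr pipeline above remains preferable for a single large sequence.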

How to extract a specific region from a FASTA-formatted nucleotide sequence file?

Given a string $name and two nucleotide indexes (i.e. $start and $end), the following command lines will search for the first sequence containing the pattern $name inside the FASTA-formatted file $infile, and then write into $outfile its subsequence determined by $start and $end (both inclusive).

awk

awk -v n="$name" -v s=$start -v e=$end '/^>/&&!h&&index($0,n){h=$0;next}(!h){next}/^>/{exit}{q=q$0}END{print h"::"s"-"e;print substr(q,s,++e-s)}' $infile > $outfile

[170328ac]
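To illustrate the coordinate convention of the awk approach (positions counted from 1, both ends included), the sketch below wraps the same logic in a commented function and applies it to a 12-nucleotide sequence. The function name extract_region and the non-zero exit status when $name is not found are illustrative assumptions.

```shell
# extract_region FILE NAME START END: print the region START..END (1-based, inclusive)
# of the first sequence whose header contains NAME.
# (extract_region is a hypothetical helper name.)
extract_region() {
  awk -v n="$2" -v s="$3" -v e="$4" '
    /^>/ && !h && index($0, n) { h = $0; next }  # first header containing NAME
    !h   { next }                                # still before the matching record
    /^>/ { exit }                                # next record reached: stop reading
         { q = q $0 }                            # accumulate the matched sequence
    END {
      if (!h) exit 1                             # NAME not found
      print h "::" s "-" e
      print substr(q, s, e - s + 1)
    }
  ' "$1"
}

# Worked example: positions 3 to 8 of ACGTACGTACGT
demo=$(mktemp)
printf '>chr1 test\nACGTAC\nGTACGT\n>chr10\nTTTT\n' > "$demo"
extract_region "$demo" chr1 3 8
# >chr1 test::3-8
# GTACGT
rm -f "$demo"
```

Because the match is a plain substring search, the name chr1 also occurs in the header >chr10; only the first matching record is used.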

eFASTA

The program eFASTA is dedicated to the extraction of subsequences.

eFASTA -fasta $infile -coord $name:$start-$end -outname $outfile ; 
mv $outfile.fna $outfile ;

[170328ac]

How to translate a FASTA-formatted codon sequence?

Given a codon sequence inside the FASTA-formatted file $infile, the following command lines will translate it (based on the standard genetic code) and write the resulting amino acid sequence into the FASTA-formatted file $outfile.

Bash

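Of note, any STOP codon will be translated into ?, and any codon that is not recognized (e.g. one containing an ambiguous or lowercase character) will be translated into X.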

tee >(grep "^>" > $outfile) < $infile | grep -v "^>" | tr -d '\n' | sed -n -e 's/\(...\)/\1 /gp' | sed 's/GC. /A/g;s/AG[AG] /R/g;s/CG. /R/g;s/AA[CT] /N/g;s/GA[CT] /D/g;s/TG[CT] /C/g;s/CA[AG] /Q/g;s/GA[AG] /E/g;s/GG. /G/g;s/CA[CT] /H/g;s/AT[ACT] /I/g;s/CT. /L/g;s/TT[AG] /L/g;s/AA[AG] /K/g;s/ATG /M/g;s/TT[CT] /F/g;s/CC. /P/g;s/TC. /S/g;s/AG[CT] /S/g;s/AC. /T/g;s/TGG /W/g;s/TA[CT] /Y/g;s/GT. /V/g;s/TA[AG] /?/g;s/TGA /?/g;s/... /X/g' >> $outfile

[200209ac]
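As an alternative to the chain of sed substitutions, the translation can be sketched with an awk lookup table built from the standard genetic code listed in TCAG order. The function name translate_fasta is an illustrative assumption. The sketch keeps the conventions used above (? for a STOP codon, X for an unrecognized codon), but it differs in three ways that are design choices of this example: it handles several sequences per file, it accepts lowercase input, and it drops a trailing incomplete codon.

```shell
# translate_fasta FILE: translate every codon sequence of FILE (standard genetic code).
# (translate_fasta is a hypothetical helper name.)
translate_fasta() {
  awk '
    BEGIN {
      # the 64 codons in TCAG order and their amino acids; ? marks a STOP codon
      b  = "TCAG"
      aa = "FFLLSSSSYY??CC?WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
      for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
          for (k = 0; k < 4; k++)
            code[substr(b, i+1, 1) substr(b, j+1, 1) substr(b, k+1, 1)] = substr(aa, 16*i + 4*j + k + 1, 1)
    }
    function emit(   c, cod, p) {
      p = ""
      for (c = 1; c + 2 <= length(seq); c += 3) {   # complete codons only
        cod = toupper(substr(seq, c, 3))
        p = p ((cod in code) ? code[cod] : "X")     # X for any unrecognized codon
      }
      if (p != "") print p
      seq = ""
    }
    /^>/ { emit(); print; next }                    # header: flush the previous record
         { seq = seq $0 }                           # sequence line: accumulate
    END  { emit() }
  ' "$1"
}

# Worked example: ATG GCC TAA
demo=$(mktemp)
printf '>cds1\nATGGCC\nTAA\n' > "$demo"
translate_fasta "$demo"
# >cds1
# MA?
rm -f "$demo"
```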

eFASTA

The program eFASTA can also be used for translating a codon sequence.

eFASTA -f $infile -c $(grep ">" $infile):1-$(egrep -v "^>|^$" $infile | tr -d '\n' | wc -m) -o $outfile -cds ; 
rm -f $outfile.fna ; 
mv $outfile.faa $outfile ;

[170210ac]