Sequence mining and searches

How to build a database from a FASTA-formatted nucleotide sequence databank for performing BLAST searches?
How to build a database from a FASTA-formatted amino acid sequence databank for performing BLAST searches?
How to perform a nucleotide sequence similarity search by BLAST?
How to perform an amino acid sequence similarity search by BLAST?
How to perform a blastp search from a single short query sequence?
How to create an HMM profile from a FASTA-formatted multiple amino acid sequence alignment?
How to query an HMM profile against a FASTA-formatted amino acid sequence databank?
How to query a motif against a sequence databank?
How to extract a set of sequences containing patterns from a FASTA-formatted databank?

How to build a database from a FASTA-formatted nucleotide sequence databank for performing BLAST searches?

The program makeblastdb available from the BLAST+ suite allows building a local database $blastdb from the FASTA-formatted nucleotide sequence file $databank with the following command line:

makeblastdb -in $databank -dbtype nucl -input_type fasta -out $blastdb

[170328ac]

How to build a database from a FASTA-formatted amino acid sequence databank for performing BLAST searches?

The program makeblastdb available from the BLAST+ suite allows building a local database $blastdb from the FASTA-formatted amino acid sequence file $databank with the following command line:

makeblastdb -in $databank -dbtype prot -input_type fasta -out $blastdb

[170328ac]

How to perform a nucleotide sequence similarity search by BLAST?

The program blastn available from the BLAST+ suite can be used to perform a sequence similarity search from the FASTA-formatted nucleotide query sequence file $infile against the nucleotide sequence database $blastdb (built with makeblastdb, as described above) with the following basic command line:

blastn -query $infile -db $blastdb -out $outfile

Of note, numerous options exist to control both the sensitivity and the specificity of the nucleotide sequence similarity search (see Table C2 in www.ncbi.nlm.nih.gov/books/NBK279675/). For example, it is important to select a reward/penalty ratio different from the default one (i.e. 2/-3) when the subject sequences are expected to be more or less divergent from the query (see e.g. Table D1 here).

[170328ac]
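
The database building and search steps above can be chained on a toy example. The following lines are only a sketch under stated assumptions: the 60-nt sequence, the temporary file names and the use of the tabular report (-outfmt 6) are illustrative choices, and the searches are run only when makeblastdb and blastn are found in the PATH.

```bash
# Toy data (assumption): one 60-nt subject sequence, queried with itself.
tmp=$(mktemp -d)
printf '>subject1\nATGGCGTACGTTAGCCTGAATCGGATACCGTTAGCAAGTCCGATTGCACGTTAACGGCTA\n' > "$tmp/databank.fasta"
cp "$tmp/databank.fasta" "$tmp/query.fasta"

if command -v makeblastdb >/dev/null 2>&1 && command -v blastn >/dev/null 2>&1; then
  # Build the nucleotide database, then search it; -outfmt 6 requests the
  # tab-separated report (one line per hit).
  if makeblastdb -in "$tmp/databank.fasta" -dbtype nucl -input_type fasta -out "$tmp/blastdb" >/dev/null &&
     blastn -query "$tmp/query.fasta" -db "$tmp/blastdb" -outfmt 6 -out "$tmp/hits.tsv"; then
    status=ran
    cat "$tmp/hits.tsv"
  else
    status=failed
  fi
else
  # BLAST+ is not installed: nothing is run.
  status=skipped
  echo "BLAST+ not found in PATH, skipping" >&2
fi
rm -rf "$tmp"
```

The same skeleton applies to amino acid data by switching to -dbtype prot and blastp.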

How to perform an amino acid sequence similarity search by BLAST?

The program blastp available from the BLAST+ suite can be used to perform a sequence similarity search from the FASTA-formatted amino acid query sequence file $infile against the amino acid database $blastdb (built with makeblastdb, as described above) with the following basic command line:

blastp -query $infile -db $blastdb -out $outfile

Alternatively, the program tblastn can be used to perform a sequence similarity search from the FASTA-formatted amino acid query sequence file $infile against the nucleotide database $blastdb (built with makeblastdb, as described above) with the following basic command line:

tblastn -query $infile -db $blastdb -out $outfile

Of note, tuning both the substitution matrix (option -matrix) and the gap opening cost (option -gapopen) can lead to better sensitivity and specificity than the default options (i.e. -matrix BLOSUM62 -gapopen 11), especially when dealing with short query sequences (see the next section). Moreover, as for nucleotide query sequences (see the previous section), using alternative substitution matrices can also improve searches when query sequences are very long and subject sequences are expected to be very dissimilar (see e.g. www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html).

[170328ac]

How to perform a blastp search from a single short query sequence?

Because the program blastp should be tuned when dealing with short amino acid query sequences, the following command lines set the substitution matrix and the gap opening cost according to the query length $lq, following the general recommendations (see www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html):

lq=$(grep -v ">" $infile | tr -d '\n' | wc -m)
if [ $lq -lt 35 ]; then mat=PAM30; gop=9; elif [ $lq -lt 50 ]; then mat=PAM70; gop=10; elif [ $lq -lt 85 ]; then mat=BLOSUM80; gop=10; else mat=BLOSUM62; gop=11; fi
blastp -query $infile -out $outfile -db $blastdb -matrix $mat -gapopen $gop

[170328ac]
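
To check which settings a query of a given length would receive before launching the search, the thresholds above can be wrapped in a small function. This is only a sketch: the function name pick_blastp_params is hypothetical, and the thresholds simply mirror those of the command lines above.

```bash
# Hypothetical helper: print the matrix and the gap opening cost assigned
# to a query of the given length (same thresholds as above).
pick_blastp_params() {
  if   [ "$1" -lt 35 ]; then echo "PAM30 9"
  elif [ "$1" -lt 50 ]; then echo "PAM70 10"
  elif [ "$1" -lt 85 ]; then echo "BLOSUM80 10"
  else                       echo "BLOSUM62 11"
  fi
}

# Show the settings selected for a few illustrative query lengths.
for lq in 20 40 60 120; do
  echo "length=$lq: $(pick_blastp_params "$lq")"
done
```

Each threshold is exclusive: a query of exactly 35 residues is assigned PAM70, and one of exactly 85 residues is assigned BLOSUM62.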

How to create an HMM profile from a FASTA-formatted multiple amino acid sequence alignment?

The program hmmbuild available from the HMMER suite allows building an HMM protein profile file $profile from the FASTA-formatted multiple amino acid sequence alignment file $infile with the following command line:

hmmbuild --informat afa $profile $infile

[170525jg]

How to query an HMM profile against a FASTA-formatted amino acid sequence databank?

The program hmmsearch from the HMMER suite can be used to perform a sequence similarity search from the protein HMM profile file $profile (built with hmmbuild, as described above) against the FASTA-formatted amino acid sequence databank $databank and to write the results into the text file $outfile:

hmmsearch $profile $databank > $outfile

[170525jg]

How to query a motif against a sequence databank?

The following command line allows searching a given $motif (e.g. oligonucleotide, regular expression) inside a FASTA-formatted sequence $databank. Every sequence inside $databank that contains the given $motif is written inside the FASTA-formatted file $outfile.

awk '!/^>/{s=s$0;next}(s!=""){print s;s=""}{print}END{print s}' $databank | grep -B1 "$motif" | grep -v "^--$" > $outfile

[170208ac]
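
The role of each step of this pipeline can be seen on a toy example. The databank below is an assumption made for illustration: seq1 carries the motif GAATTC split across two lines, whereas seq2 does not carry it.

```bash
# Toy databank (assumption): two sequences written over two lines each.
databank=$(mktemp)
printf '>seq1\nACGTGA\nATTCGG\n>seq2\nTTTTCC\nCCGGAA\n' > "$databank"
motif="GAATTC"

# Step 1 (awk): join every sequence on a single line, so that a motif
#               spanning a line break can still be matched.
# Step 2 (grep -B1): keep each matching line together with the line before
#               it, i.e. the header of the matching sequence.
# Step 3 (grep -v): drop the "--" separators printed between groups.
result=$(awk '!/^>/{s=s$0;next}(s!=""){print s;s=""}{print}END{print s}' "$databank" |
  grep -B1 "$motif" | grep -v "^--$")
echo "$result"
rm -f "$databank"
```

Only seq1 is reported, with its sequence written on one line. Note that grep also scans the header lines, so a motif that happens to occur in a header would be reported as well.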

However, if a more thorough search is required, the program fimo from the MEME suite can be used. It allows searching one or more motifs from the MEME-formatted file $motifile against the FASTA-formatted nucleotide or amino acid sequence databank $databank. The results are saved in the text file $outfile. Contrary to the previous method, fimo uses a probabilistic approach that allows detecting non-exact matches with associated p-values.

fimo --text $motifile $databank > $outfile

[170328jg]

How to extract a set of sequences containing patterns from a FASTA-formatted databank?

Let $pfile be a file containing a list of patterns (one per line). Every sequence from the FASTA-formatted file $databank whose header contains at least one of the patterns listed in $pfile is written into $outfile with the following command lines:

Bash

ok=false; while IFS= read -r l; do if [[ $l == \>* ]]; then echo "$l" | grep -q -f $pfile && ok=true || ok=false; fi; $ok && echo "$l"; done < $databank > $outfile

[181206ac]

awk

LC_ALL=C awk -v p="$(grep -v "^$" $pfile | tr '\n' '|' | sed 's/.$//')" '/^>/{w=$0~p}w' $databank > $outfile

[181227ac]
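
The awk variant can be followed step by step on a toy example. The three sequences and the two patterns below are assumptions made for illustration.

```bash
# Toy input (assumption): three sequences and two header patterns.
databank=$(mktemp); pfile=$(mktemp)
printf '>sp|P1 kinase\nMKT\nAYI\n>sp|P2 lipase\nGGS\n>sp|P3 kinase-like\nQRS\n' > "$databank"
printf 'P1\nlipase\n' > "$pfile"

# Build the alternation "P1|lipase" from the non-empty lines of the pattern
# file, then print every record whose header matches it: the flag w is
# re-evaluated at each header line and kept for the sequence lines below it.
p=$(grep -v "^$" "$pfile" | tr '\n' '|' | sed 's/.$//')
result=$(LC_ALL=C awk -v p="$p" '/^>/{w=$0~p}w' "$databank")
echo "$result"
rm -f "$databank" "$pfile"
```

The first two records are written with all their sequence lines, and the third one is skipped. Both variants interpret the patterns as regular expressions, but with different syntaxes: in the awk variant a character such as | inside a pattern acts as an alternation, whereas grep -f treats it as a literal character by default.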