The program makeblastdb from the BLAST+ suite builds a local database $blastdb from the FASTA-formatted nucleotide sequence file $databank with the following command line:
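A minimal sketch of that command, wrapped in a shell function so that each argument is explicit. The function name build_nucl_db is hypothetical; -in, -dbtype and -out are standard makeblastdb options.

```shell
# build_nucl_db is a hypothetical wrapper name, not part of BLAST+.
build_nucl_db() {
  databank=$1   # FASTA-formatted nucleotide sequence file
  blastdb=$2    # base name of the database files to create
  makeblastdb -in "$databank" -dbtype nucl -out "$blastdb"
}
```

For example, build_nucl_db genomes.fasta genomes_db (hypothetical file names) would create database files sharing the genomes_db prefix.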
The program makeblastdb from the BLAST+ suite builds a local database $blastdb from the FASTA-formatted amino acid sequence file $databank with the following command line:
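A minimal sketch of the protein case, under the same assumptions as for nucleotide databases: the only difference is the -dbtype prot value. The function name build_prot_db is hypothetical.

```shell
# build_prot_db is a hypothetical wrapper name, not part of BLAST+.
build_prot_db() {
  databank=$1   # FASTA-formatted amino acid sequence file
  blastdb=$2    # base name of the database files to create
  makeblastdb -in "$databank" -dbtype prot -out "$blastdb"
}
```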
The program blastn from the BLAST+ suite can be used to perform a sequence similarity search with the FASTA-formatted nucleotide query sequence file $infile against the nucleotide sequence database $blastdb (built with makeblastdb, as described above) with the following basic command line:
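A minimal sketch of such a basic search, assuming the results should go to a file named by $outfile (a variable used elsewhere in this document). The function name run_blastn is hypothetical; -query, -db and -out are standard blastn options.

```shell
# run_blastn is a hypothetical wrapper name, not part of BLAST+.
run_blastn() {
  infile=$1    # FASTA-formatted nucleotide query file
  blastdb=$2   # nucleotide database built with makeblastdb
  outfile=$3   # file receiving the search report
  blastn -query "$infile" -db "$blastdb" -out "$outfile"
}
```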
Of note, numerous options control both the sensitivity and the specificity of the nucleotide sequence similarity search (see Table C2 in www.ncbi.nlm.nih.gov/books/NBK279675/). For example, it is important to select a reward/penalty ratio different from the default one (i.e. 2/-3) when the subject sequences are expected to be more or less divergent from the query (see e.g. Table D1 here).
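As an illustration of where these scores are set, the sketch below passes the match reward and mismatch penalty through the -reward and -penalty options of blastn. The function name run_blastn_scored is hypothetical, and no particular pair of values is recommended here: blastn may reject combinations of reward, penalty and gap costs that it does not support, so the chosen pair should be checked against the tables cited above.

```shell
# run_blastn_scored is a hypothetical wrapper name, not part of BLAST+.
run_blastn_scored() {
  infile=$1; blastdb=$2; outfile=$3
  reward=$4    # match reward (positive integer)
  penalty=$5   # mismatch penalty (negative integer)
  blastn -query "$infile" -db "$blastdb" -out "$outfile" \
         -reward "$reward" -penalty "$penalty"
}
```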
The program blastp from the BLAST+ suite can be used to perform a sequence similarity search with the FASTA-formatted amino acid query sequence file $infile against the amino acid database $blastdb (built with makeblastdb, as described above) with the following basic command line:
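A minimal sketch of the protein search with default scoring, assuming the report is written to $outfile. The function name run_blastp is hypothetical.

```shell
# run_blastp is a hypothetical wrapper name, not part of BLAST+.
run_blastp() {
  infile=$1    # FASTA-formatted amino acid query file
  blastdb=$2   # amino acid database built with makeblastdb
  outfile=$3   # file receiving the search report
  blastp -query "$infile" -db "$blastdb" -out "$outfile"
}
```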
Alternatively, the program tblastn can be used to perform a sequence similarity search with the FASTA-formatted amino acid query sequence file $infile against the nucleotide database $blastdb (built with makeblastdb, as described above) with the following basic command line:
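A minimal sketch of the translated search, assuming the report is written to $outfile. The function name run_tblastn is hypothetical; tblastn takes the same -query, -db and -out options as the other BLAST+ search programs.

```shell
# run_tblastn is a hypothetical wrapper name, not part of BLAST+.
run_tblastn() {
  infile=$1    # FASTA-formatted amino acid query file
  blastdb=$2   # nucleotide database built with makeblastdb
  outfile=$3   # file receiving the search report
  tblastn -query "$infile" -db "$blastdb" -out "$outfile"
}
```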
Of note, tuning both the substitution matrix (option -matrix) and the gap opening cost (option -gapopen) can lead to better sensitivity and specificity than the default options (i.e. -matrix BLOSUM62 -gapopen 11), especially when dealing with short query sequences (see below). Moreover, as for nucleotide query sequences (see above), using alternative substitution matrices can also lead to better searches when query sequences are very long and subject sequences are expected to be very dissimilar (see e.g. www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html).
As the program blastp should be tuned when dealing with short amino acid query sequences, the following command lines select the search options from the query length according to the general recommendations (see www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html):
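# Query length: number of characters on the non-header lines of $infile.
lq=$(grep -v "^>" "$infile" | tr -d '\n' | wc -m);
# Choose the substitution matrix and the gap opening cost from the query length.
if [ "$lq" -lt 35 ]; then mat=PAM30; gop=9; elif [ "$lq" -lt 50 ]; then mat=PAM70; gop=10; elif [ "$lq" -lt 85 ]; then mat=BLOSUM80; gop=10; else mat=BLOSUM62; gop=11; fi
blastp -query "$infile" -out "$outfile" -db "$blastdb" -matrix "$mat" -gapopen "$gop" ;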
The program hmmbuild from the HMMER suite builds an HMM protein profile file $profile from the FASTA-formatted multiple amino acid sequence alignment file $infile with the following command line:
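A minimal sketch of that command. hmmbuild takes the output profile first and the alignment second; the --amino option, which declares the alignment as protein, is an optional addition here, and the function name build_profile is hypothetical.

```shell
# build_profile is a hypothetical wrapper name, not part of HMMER.
build_profile() {
  profile=$1   # HMM profile file to create
  infile=$2    # aligned FASTA file of amino acid sequences
  hmmbuild --amino "$profile" "$infile"
}
```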
The program hmmsearch from the HMMER suite can be used to perform a sequence similarity search with the protein HMM profile file $profile (built with hmmbuild, as described above) against the FASTA-formatted amino acid sequence databank $databank and to write the results into the text file $outfile.
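One way to write this search is sketched below, using the -o option of hmmsearch to direct the report to a file. The function name search_profile is hypothetical.

```shell
# search_profile is a hypothetical wrapper name, not part of HMMER.
search_profile() {
  profile=$1    # HMM profile built with hmmbuild
  databank=$2   # FASTA-formatted amino acid sequence databank
  outfile=$3    # text file receiving the report
  hmmsearch -o "$outfile" "$profile" "$databank"
}
```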
The following command line searches for a given $motif (e.g. oligonucleotide, regular expression) inside a FASTA-formatted sequence databank $databank. Every sequence in $databank that contains the given $motif is written to the FASTA-formatted file $outfile.
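One possible way to do this is sketched below with awk; it is not necessarily the command originally intended. It assumes that the motif is an extended regular expression without backslash escapes, that matching is case-sensitive, and that it is acceptable to write each matching sequence on a single line. Wrapped sequence lines are joined before matching, so a motif spanning a line break is still found. The function name grep_motif is hypothetical.

```shell
# grep_motif is a hypothetical helper name.
# $1: motif (extended regular expression), $2: FASTA databank, $3: output FASTA
grep_motif() {
  motif=$1; databank=$2; outfile=$3
  awk -v motif="$motif" '
    # Print the pending record when its joined sequence matches the motif.
    function emit() { if (hdr != "" && seq ~ motif) print hdr "\n" seq }
    /^>/ { emit(); hdr = $0; seq = ""; next }   # new header: close previous record
         { seq = seq $0 }                        # join wrapped sequence lines
    END  { emit() }                              # handle the last record
  ' "$databank" > "$outfile"
}
```

For example, grep_motif 'GAATTC' "$databank" "$outfile" keeps every sequence that contains the GAATTC site.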
However, if a more thorough search is required, the program fimo from the MEME suite can be used. It searches for one or more motifs from the MEME-formatted file $motifile in the FASTA-formatted nucleotide or amino acid sequence databank $databank. The results are saved in the text file $outfile. Unlike the previous method, fimo uses a probabilistic approach that allows the detection of non-exact matches with associated p-values.
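A minimal sketch of such a scan, assuming the --text option of fimo, which prints the matches as plain text on standard output so that they can be redirected to $outfile. The function name scan_motifs is hypothetical.

```shell
# scan_motifs is a hypothetical wrapper name, not part of the MEME suite.
scan_motifs() {
  motifile=$1   # MEME-formatted motif file
  databank=$2   # FASTA-formatted sequence databank
  outfile=$3    # text file receiving the matches
  fimo --text "$motifile" "$databank" > "$outfile"
}
```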
Let $pfile be a file containing a list of patterns (one per line). Every sequence from the FASTA-formatted file $databank whose header contains at least one of the patterns in $pfile is written into $outfile with the following command lines.
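One possible implementation is sketched below with awk. It assumes that the patterns are literal strings rather than regular expressions and that empty lines in $pfile can be ignored; the function name filter_by_header is hypothetical.

```shell
# filter_by_header is a hypothetical helper name.
# $1: pattern file (one literal pattern per line), $2: FASTA databank, $3: output FASTA
filter_by_header() {
  pfile=$1; databank=$2; outfile=$3
  awk '
    NR == FNR { if ($0 != "") pat[$0]; next }   # first file: load patterns
    # On each header, decide whether the whole record is kept.
    /^>/ { keep = 0; for (p in pat) if (index($0, p)) { keep = 1; break } }
    keep                                        # print header and sequence lines
  ' "$pfile" "$databank" > "$outfile"
}
```

Sequence lines are copied unchanged, so the original line wrapping of $databank is preserved in $outfile.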