BLAST(1) BLAST(1)
29 April 1992
NAME
blastp, blastn, blastx, tblastn - rapid sequence database query
programs using the BLAST algorithm
SYNOPSIS
blastp aadb aaquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#]
[M=subfile] [Y=#] [Z=#] [K=#] [L=#] [H=#] [V=#] [B=#]
blastn ntdb ntquery [E=#] [S=#] [W=#] [X=#] [M=#] [N=#] [Y=#] [Z=#]
[K=#] [L=#] [H=#] [V=#] [B=#] [[top][bottom]]
blastx aadb ntquery [E=#] [S=#] [W=#] [T=#] [X=#] [M=subfile]
[Y=#] [Z=#] [C=#] [K=#] [L=#] [V=#] [B=#]
[[top][bottom]]
tblastn ntdb aaquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#]
[M=subfile] [Y=#] [Z=#] [C=#] [K=#] [L=#]
[H=#] [V=#] [B=#] [[top][bottom]]
DESCRIPTION
BLAST (Basic Local Alignment Search Tool) is the heuristic search algorithm
employed by the programs blastp, blastn, blastx, and tblastn. The
four programs are used for the following purposes:
blastp
to compare an amino acid query sequence vs. a protein sequence
database;
blastn
to compare a nucleotide query sequence vs. a nucleotide sequence
database;
blastx
to compare a nucleotide query sequence translated in all reading
frames vs. a protein sequence database;
tblastn
to compare a protein query sequence vs. a nucleotide sequence
database dynamically translated in all reading frames.

Whenever a nucleotide query sequence or nucleotide database is involved,
both strands (or all 6 reading frames) are searched by default.
The "top" and "bottom" options may be used to restrict a search
to the specified strand. (If both options are specified, both
strands will be searched.)

The unit of BLAST algorithm output is the High-scoring Segment Pair
(HSP), where the two segments of the pair are runs of contiguous
residues of equal but arbitrary length.
In the programmatic implementations of the algorithm described
here, an HSP is a pair of segments, one from the query sequence
and one from a database sequence, where the score of their
ungapped alignment meets or exceeds a parameterized, positive-
valued cutoff. A set of zero or more HSPs is thus defined by two
sequences, an alignment scoring scheme, and a cutoff score. A
Maximal-scoring Segment Pair (MSP) is defined by two sequences
and a scoring scheme and is the highest-scoring of all segment
pairs on all diagonals. Depending on the parameters of a BLAST
sequence comparison, there may be a non-zero probability of not
finding one or more HSPs of which the MSP is a member.
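
The definitions above can be illustrated with a brute-force sketch in Python. This is not how the BLAST programs search (they use the word-hit heuristic described under PARAMETERS); it simply scores every diagonal exhaustively. The function names, the match/mismatch scores, and the example sequences are assumptions made for illustration, and the sketch reports at most one segment pair per diagonal.

```python
def best_segment_on_diagonal(query, subject, offset, score):
    """Kadane-style scan along one diagonal (offset = subject index - query index).
    Returns (best_score, query_start, length) of the best ungapped segment."""
    best = (0, 0, 0)
    run, run_start = 0, 0
    for i in range(len(query)):
        j = i + offset
        if j < 0 or j >= len(subject):
            continue
        if run <= 0:
            # A non-positive prefix can never help; restart the segment here.
            run, run_start = 0, i
        run += score(query[i], subject[j])
        if run > best[0]:
            best = (run, run_start, i - run_start + 1)
    return best


def msp_and_hsps(query, subject, score, cutoff):
    """Return (MSP, HSPs) as tuples (score, query_start, subject_start, length)."""
    per_diagonal = []
    for offset in range(-(len(query) - 1), len(subject)):
        s, q_start, length = best_segment_on_diagonal(query, subject, offset, score)
        if s > 0:
            per_diagonal.append((s, q_start, q_start + offset, length))
    msp = max(per_diagonal, default=None)           # highest score on any diagonal
    hsps = sorted((p for p in per_diagonal if p[0] >= cutoff), reverse=True)
    return msp, hsps


def match_score(a, b):
    # Illustrative scores only: +5 for a match, -4 for a mismatch.
    return 5 if a == b else -4


msp, hsps = msp_and_hsps("GATTACA", "CCGATTACACC", match_score, cutoff=20)
print(msp)   # (35, 0, 2, 7)
print(hsps)  # [(35, 0, 2, 7)]
```

Raising the cutoff above 35 in this example leaves the MSP unchanged but yields an empty HSP list, which mirrors the statement that a set of HSPs depends on the cutoff while the MSP does not.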
PARAMETERS
Parameters are modified using a name=value syntax, e.g., E=0.05 or
S=100.

E is interpreted as the expected number of MSPs that will
satisfy the cutoff score under the random sequence model. The value
of E approximates the expected number of HSPs that will be found
during the course of an entire database search. The default value for
E is 10, and the permitted range for this real-valued parameter is
0 < E <= 1000.

S is the cutoff score for reporting HSPs. Higher scores
correspond to increasing statistical significance, a lower
probability, or a reduced expected frequency of occurrence by chance.
Any positive-scoring alignments which the programs find but which
score below S go unreported. Unless S is explicitly set on the
command line, its default value is calculated from the value of E.
The values for E and S are interconvertible, a process which is
dependent on the following factors: the length and residue composition
of the query sequence; the length of the database and a fixed,
hypothetical residue composition for it; and the scoring scheme
employed. The scoring scheme used by blastp, blastx, and tblastn is a
substitution matrix; the scoring scheme used by blastn is a positive
reward score for matching residues and a negative penalty score for
mismatching residues. When both of the parameters E and S are
specified on the command line, the one resulting in the highest (most
restrictive) cutoff score will be used. When neither of these
parameters is specified on the command line, the default value for E
is used to calculate the cutoff score. For a given value of E (e.g.,
the default value of 10), a given query sequence, and a single scoring
scheme, the calculated value of the cutoff score S will be different
when searching databases of different lengths. To normalize the
statistics reported when databases of different lengths are searched,
the parameter Z (see below) may be set to a constant value for all
database searches. S takes on only integral values in the present
implementations of the BLAST algorithm. When the cutoff score is set
implicitly via E, S is rounded to the least integral value required to
satisfy E. Since the rounding procedure can decrease the effective
value of E, the calculated value for S is used to back-calculate the
effective value for E. For example, if the user specifies E = 50 on
the command line, a cutoff score that is rounded up by 0.9 units to
the smallest satisfying integer might correspond to an expected number
of HSPs of only 43. In this case, the value displayed for E at the
end of the program's report will be 43, not 50.

When at least one HSP
is found involving any given database sequence, the programs blastp
and tblastn search the database sequence a second time for HSPs that
satisfy a lower cutoff score, S2. In essence, the second-pass search
gives these programs the opportunity to report any low-significance
HSPs they may have found that might be of interest within the context
of finding one or more higher-scoring (perhaps statistically
significant) HSPs. Poisson statistics may indicate that the lower-
scoring (higher-probability) HSPs are statistically significant when
their frequencies of occurrence are considered. In a relationship
similar to that between the parameters E and S, S2 can be set
explicitly on the command line or it will be calculated from the
setting of E2. Whereas S is related to E by the size of the database
and the length of the query sequence, S2 is related to E2 by the
lengths of a pair of hypothetical protein sequences of 300 residues
each. In other words, E2 approximates the number of HSPs one would
expect to find when comparing two protein sequences of length 300, one
having the composition of the query sequence and the other having the
hypothetical residue composition of the database. If a second-pass
search is not desired, setting E2 to zero (0) turns this feature off.
If S2 happens to be equal to or greater than the primary cutoff score,
a second-pass search is likewise not performed.

The user should be
forewarned that, with no other knowledge about a positive-scoring
segment pair than its score, the probability that the BLAST algorithm
will detect the alignment decreases as the score of the alignment
decreases. Consequently, low-scoring HSPs looked for in the second-
pass search have a lesser chance individually of being found than the
original HSP. With a fixed scoring scheme, the probability of missing
an alignment can be decreased by: lowering the neighborhood word-score
threshold, T, while keeping the word size, W, constant; lowering both
W and T appropriately (see Altschul et al., 1990); and/or raising the
word-hit-extension drop-off score X (described below).

W is the word
size for finding initial hits against the database sequences. Each
hit is extended in both directions along the corresponding diagonal of
an imaginary 2-dimensional matrix until the segment score drops off by
at least the quantity X. The default value for W is 3 amino acids for
blastp, blastx, and tblastn, and 12 nucleotides for blastn. The value
of W used by blastn should not be changed, as the logic of the program
source code has not been validated for use with values other than the
default (particularly smaller values). For the other programs, which
perform sequence comparisons at the level of individual amino acids, W
should generally be restricted to values less than 5 or else the value
for T should be specified disproportionately larger to avoid consuming
vast quantities of memory for the neighborhood word list (see below).
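
The hit-extension step described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the programs' source code: the function names, the scoring function, and the example values are hypothetical. Extension in each direction stops once the running score has fallen at least X below the best score seen so far.

```python
def extend_one_way(query, subject, qi, si, step, score, x_drop):
    """Extend along the diagonal from (qi, si) in direction step (+1 or -1).
    Returns (gain, n_residues) for the best-scoring extension found."""
    best = run = 0
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        run += score(query[qi], subject[si])
        length += 1
        if run > best:
            best, best_len = run, length
        elif best - run >= x_drop:
            break  # score has dropped off by at least X: stop extending
        qi += step
        si += step
    return best, best_len


def extend_hit(query, subject, qi, si, w, score, x_drop):
    """Extend a word hit of length w that starts at query[qi] and subject[si].
    Returns (score, query_start, subject_start, length)."""
    word = sum(score(query[qi + k], subject[si + k]) for k in range(w))
    right, r_len = extend_one_way(query, subject, qi + w, si + w, +1, score, x_drop)
    left, l_len = extend_one_way(query, subject, qi - 1, si - 1, -1, score, x_drop)
    return word + left + right, qi - l_len, si - l_len, l_len + w + r_len


def match_score(a, b):
    # Illustrative scores only: +5 for a match, -4 for a mismatch.
    return 5 if a == b else -4


# The word "TTA" occurs at query position 2 and subject position 4.
print(extend_hit("GATTACA", "CCGATTACACC", 2, 4, 3, match_score, x_drop=10))
# (35, 0, 2, 7)
```

A larger x_drop lets an extension run through longer stretches of mismatches before giving up, which is why raising X can recover HSPs at the cost of search time.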

T is the word score threshold for generating neighborhood words of
length W from the query sequence, prior to scanning the database
(blastp, blastx, and tblastn only). Words which have an aggregate
score (through summation of the individual residue substitution
scores) of at least T when aligned with words from the query sequence
are included in the neighborhood list. Raising the value of T
increases the likelihood of completely missing HSPs, but can decrease
the search time and memory requirements of the programs by decreasing
the size of the neighborhood list. One of the key features of the
BLAST algorithm is the user-selectable trade-off in sensitivity for
speed. A generally suitable value for T is calculated at run-time,
using the residue composition and length of the query sequence and the
substitution matrix employed. The neighborhood word-score threshold
is set using an ad hoc equation that is a function of Lambda and H.
Lambda is the number of nats of information gained per unit increase
in score of an alignment (approximately 0.69315 times the number of
bits per unit score); H is the relative entropy of the target and
background residue frequencies [Karlin and Altschul, 1990], or the
expected information available per position in an alignment to
distinguish it from chance. The supplied PAM120 amino acid
substitution matrix, with a scale of ln(2)/2, yields a value for
Lambda that is close to 0.5 bit per unit score for query sequences of
typical residue compositions. Occasionally it may be necessary to
manually set the neighborhood word-score threshold via the command
line, for which 13 may be a good value to try, but this is highly
dependent on the substitution matrix and word-length, W, being
employed.

X is a positive integer representing the maximum
permissible drop-off of the cumulative segment score during word-hit
extension. Raising X may decrease the chance that the BLAST algorithm
overlooks an HSP, but it may significantly increase the search time,
as well. If computation time is of little concern, X might be
increased several points from its default value, but only a very
marginal increase in sensitivity might be expected. For blastp,
blastx, and tblastn, the default value of X is calculated to be the
minimum integral score representing at least 10 bits of information,
or a reduction in the statistical significance of the alignment by a
factor of 2 to the power of 10 (or about 1,000). For blastn, the
default value of X is the minimum integral score that represents at
least 20 bits of information, or a reduction in the statistical
significance of the alignment by a factor of 2 to the power of 20 (or
about one million).

The command line parameters K and L can be used
to set fixed values for the Karlin statistics' K and lambda
parameters, respectively. Users should generally avoid setting these
parameters unless the full ramifications of doing so are understood.
As an example of one of the less obvious effects of manually choosing
these parameters, the value of the H statistic reported at the end of
each program's output (which is distinct from the command line
parameter of the same name) is a function of lambda; and the default
value for the neighborhood word-score threshold parameter T is in turn
a function of H.
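
The role of W and T in building the neighborhood word list, described earlier in this section, can be sketched in Python with a toy example. The four-letter alphabet and the scores in TOY_MATRIX are invented for illustration and are not PAM values; neighborhood() is a hypothetical helper, and its brute-force enumeration is practical only for small W.

```python
from itertools import product

ALPHABET = "ACDE"
# Toy symmetric substitution scores (NOT a real PAM matrix).
TOY_MATRIX = {
    ("A", "A"): 3, ("C", "C"): 9, ("D", "D"): 5, ("E", "E"): 5,
    ("A", "C"): -3, ("A", "D"): 0, ("A", "E"): 0,
    ("C", "D"): -7, ("C", "E"): -7, ("D", "E"): 3,
}


def pair_score(a, b):
    """Look up a substitution score, trying both orderings of the pair."""
    if (a, b) in TOY_MATRIX:
        return TOY_MATRIX[(a, b)]
    return TOY_MATRIX[(b, a)]


def neighborhood(query, w, t):
    """Map each query position to the words of length w whose aggregate
    score against the query word starting there is at least t."""
    table = {}
    for pos in range(len(query) - w + 1):
        q_word = query[pos:pos + w]
        table[pos] = [
            "".join(cand)
            for cand in product(ALPHABET, repeat=w)
            if sum(pair_score(a, b) for a, b in zip(q_word, cand)) >= t
        ]
    return table


print(neighborhood("DEC", w=3, t=17))  # {0: ['DDC', 'DEC', 'EEC']}
print(neighborhood("DEC", w=3, t=19))  # {0: ['DEC']}
```

In this toy case, raising T from 17 to 19 shrinks the list from three words to the query word alone, which illustrates the trade-off between sensitivity and the size of the neighborhood list.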
SCORING SCHEMES
With blastp, blastx, and tblastn, the M option can be used to select
an alternate substitution matrix file. The default PAM120 matrix is
recommended for general protein similarity searches (Altschul, 1991).
While only the PAM120 and the PAM250 matrices are provided, the pam(1)
program can be used to produce PAM matrices of any desired generation
from 2 to 511. For rigorous searches where the mutational distance
between potential homologs is unknown, Altschul (1991) recommends
performing three searches, one each with the PAM-40, PAM-120, and
PAM-250 matrices.

In blastn, M is the score for a single-letter
match; N is the score for a single-letter mismatch. M and N must be
positive and negative integers, respectively. Given the assumption
made by blastn that the 4 nucleotides A, C, G, and T are represented
equally in the database, the expected score for the query sequence
must be negative.
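
The negativity requirement can be checked directly, and the same uniform-composition assumption allows Lambda to be found numerically. The Python sketch below is an illustration under stated assumptions: the four nucleotides are equally frequent (so two random letters match with probability 0.25), the scores M=5 and N=-4 are example values only, and the function names are hypothetical. Lambda is taken as the positive solution of sum_ij p_i p_j exp(Lambda * s_ij) = 1 (Karlin and Altschul, 1990).

```python
import math


def expected_score(m, n, p_match=0.25):
    """Expected score per aligned letter pair under uniform composition."""
    return p_match * m + (1.0 - p_match) * n


def solve_lambda(m, n, p_match=0.25, tol=1e-12):
    """Bisection for the positive root of
    p_match * exp(lam * m) + (1 - p_match) * exp(lam * n) = 1."""
    if expected_score(m, n, p_match) >= 0:
        raise ValueError("expected score must be negative")

    def f(lam):
        return p_match * math.exp(lam * m) + (1.0 - p_match) * math.exp(lam * n) - 1.0

    lo, hi = 0.0, 1.0
    while f(hi) < 0:        # grow the bracket until f changes sign
        hi *= 2.0
    # f is negative just above 0 and non-negative at hi, so the root is in (lo, hi].
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


print(expected_score(5, -4))   # -1.75
print(solve_lambda(5, -4))     # nats per unit score, roughly 0.19
```

If the expected score were zero or positive, the equation would have no positive root, which is one way to see why the requirement exists.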
SEQUENCE LENGTH AND STATISTICAL SIGNIFICANCE
For the purpose of calculating significance levels, Y is the effective
length of the query sequence and Z is the effective length of the
database, both measured in residues. The default values for these
parameters are the actual lengths of the query sequence and database,
respectively. Larger values signify more degrees of freedom for
aligning the sequences and reduced statistical significance for an
alignment of any given score.
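
The interplay of E, S, Y, and Z can be sketched in Python with the relation E = K * Y * Z * exp(-Lambda * S) (Karlin and Altschul, 1990). This is an illustration under stated assumptions, not the programs' code: it ignores any length or edge corrections the programs may apply, the function names are hypothetical, and the values of K, Lambda, and the lengths are made up.

```python
import math


def expect_from_score(s, k, lam, y, z):
    """Expected number of chance HSPs scoring at least s."""
    return k * y * z * math.exp(-lam * s)


def cutoff_from_expect(e, k, lam, y, z):
    """Smallest integer S whose expected count does not exceed e."""
    return math.ceil(math.log(k * y * z / e) / lam)


K, LAM = 0.1, 0.3            # made-up Karlin parameters
Y, Z = 250, 10_000_000       # effective query and database lengths

s = cutoff_from_expect(10, K, LAM, Y, Z)
print(s)                                    # 57
print(expect_from_score(s, K, LAM, Y, Z))   # effective E, somewhat below 10

# A tenfold larger database raises the cutoff needed for the same E.
print(cutoff_from_expect(10, K, LAM, Y, 10 * Z))
```

Because S is rounded up to an integer, the back-calculated E is generally a little smaller than the requested value, as described under PARAMETERS. Fixing Z to one constant across searches makes the resulting cutoffs and Expect values comparable between databases of different sizes.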
GENETIC CODES
C is a non-negative integer that determines the genetic code that will
be used by blastx (tblastn) to translate the query sequence (database
sequences). The default genetic code (C=0) corresponds to the so-
called Standard or Universal genetic code. To obtain a listing of the
nine available genetic codes and their associated numerical
identifiers, invoke either blastx or tblastn with the command line
parameter C=list. The genetic codes currently available and their
associated values for parameter C are:

0  Standard or Universal
1  Vertebrate Mitochondrial
2  Yeast Mitochondrial
3  Mold Mitochondrial and Mycoplasma
4  Invertebrate Mitochondrial
5  Ciliate Macronuclear
6  Protozoan Mitochondrial
7  Plant Mitochondrial
8  Echinodermate Mitochondrial
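
As an illustration of translating a nucleotide sequence in all six reading frames with the Standard code (C=0), here is a minimal Python sketch. The function names and the frame labels (+1 to +3, -1 to -3) are hypothetical, codons containing ambiguity letters are rendered as X for simplicity, and the other eight genetic codes are not implemented.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def normalize(seq):
    """Upper-case the sequence and treat U as T."""
    return seq.upper().replace("U", "T")


def reverse_complement(seq):
    return normalize(seq).translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become X."""
    seq = normalize(seq)
    return "".join(
        STANDARD_CODE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return {frame: protein} for frames +1..+3 and -1..-3."""
    top = normalize(seq)
    bottom = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(top[offset:])
        frames[-(offset + 1)] = translate(bottom[offset:])
    return frames


print(six_frames("ATGGCCTAA"))
# {1: 'MA*', -1: 'LGH', 2: 'WP', -2: '*A', 3: 'GL', -3: 'RP'}
```

Restricting the dictionary to the positive or the negative keys corresponds loosely to searching only one strand.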
POISSON STATISTICS
The occurrence of two or more HSPs involving the query sequence and
the same database sequence is modeled as a Poisson process. An
important result of applying Poisson statistics is that an HSP with a
low score and high Expect value (low significance) may be discovered
to be statistically significant when appearing in the context of one
or more additional matches of equal or higher score against the same
database sequence. The Poisson P-value for any given HSP is a
function of its expected frequency of occurrence and the number of
HSPs actually observed with scores at least as high. The Poisson P-
value for a group of HSP events is the probability that at least as
many HSPs would occur by chance, each with a score at least as high as
the lowest-scoring member of the group. HSPs which appear on opposite
strands of a nucleotide query or database sequence are considered
independent and distinguishable events, and so are counted separately.
Given the score of an HSP, when the expected length for an alignment
with that score (see the description of H above) is a significant
fraction of the length of the query sequence, the Expect value used in
estimation of the Poisson P-value is reduced proportionately.
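
The Poisson P-value described above can be sketched in Python as the upper tail of a Poisson distribution: the probability of observing at least n HSPs when a given number are expected by chance. The function name is hypothetical, and the sketch omits the length-based reduction of the Expect value mentioned in the last sentence.

```python
import math


def poisson_p_value(expect, n_observed):
    """P(N >= n_observed) for N ~ Poisson(expect)."""
    below = sum(
        math.exp(-expect) * expect ** k / math.factorial(k)
        for k in range(n_observed)
    )
    return 1.0 - below


# One HSP with an Expect of 0.5 is unremarkable on its own ...
print(poisson_p_value(0.5, 1))   # about 0.393
# ... but three HSPs at least that good against one sequence are not.
print(poisson_p_value(0.5, 3))   # about 0.014
```

This shows how an HSP of low individual significance can become significant in the context of additional matches against the same database sequence.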
P-VALUES, ALIGNMENT SCORES, AND INFORMATION
The Expect and P-values of HSPs reported by the programs are dependent
on numerous factors including: the scoring scheme employed, the
residue composition of the query sequence, the assumed residue
composition for a typical database sequence, and the lengths of the
query sequence and database. Independent of the query and database
lengths are the HSP scores themselves, which may be readily compared
between different program runs even if the databases searched are of
different lengths, as long as all of the other relevant factors listed
here were unchanged. Further isolation from the many variables of a
search in one's assessment of an HSP may be obtained by observing the
information content reported (in bits) for the alignments. While the
information content of an HSP may change when fundamentally different
scoring schemes are used (e.g., different generations of PAM
matrices), the number of bits reported for an HSP will be independent
of the scales to which the matrices were generated. (In practice,
this statement is not quite true because the substitution scores used
by these programs are real values that have been rounded to the
nearest integers and thus lack a high degree of precision.)
When communicating the statistical significance of an alignment, the
alignment score itself is generally not so important as the
combination of the substitution matrix employed and the actual
information content of the alignment.
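
The conversion from a raw alignment score to information content can be sketched in Python as bits = Lambda * S / ln(2), with Lambda expressed in nats per unit score as defined under PARAMETERS. The function name and the example values are assumptions chosen for illustration.

```python
import math


def score_to_bits(score, lam):
    """Information content in bits of an alignment with the given raw score,
    where lam is Lambda in nats per unit score."""
    return score * lam / math.log(2)


# Two hypothetical matrices encoding the same information at different scales:
# the second uses scores twice as large, so its Lambda is half as big.
lam_a = math.log(2) / 2      # 0.5 bits per unit score
lam_b = math.log(2) / 4      # 0.25 bits per unit score

print(score_to_bits(60, lam_a))    # close to 30 bits
print(score_to_bits(120, lam_b))   # close to 30 bits
```

The raw scores differ by a factor of two, yet the number of bits is the same, which is the sense in which reported bits are independent of matrix scale.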
REGULATING OUTPUT
The output is organized into three independently regulated sections: a
histogram of word-hit extension scores; one-line descriptions of the
database sequences that yielded one or more HSPs; and the high-scoring
segment pairs themselves. Each section of the output can be
selectively suppressed by setting the parameters H, V, and B to 0
(zero).

Parameter H regulates the display of a histogram of the
scores of the highest-scoring hit extensions for each database
sequence. If H is assigned a non-zero value on the command line, the
histogram will be displayed (except for the blastx program, which
never displays a histogram but retains the H parameter for command-
line compatibility with the other programs). The default value for H
is 0 (no histogram).

Parameter V is the maximum number of database
sequences for which one-line descriptions will be reported. The
default value for V is 500. A warning message is prominently
displayed at the end of the one-line descriptions section when HSPs
are found in more than V sequences. When V is zero, no one-line
descriptions are reported and no warning is given. Negative values
for V are undefined and disallowed. As an example of how V can be
used advantageously, if a high value for E is desired to virtually
assure in all cases that at least one HSP will be found, selecting a
small value for V will ensure that the output will not be too
voluminous; only the most statistically significant matches will be
reported.

Parameter B regulates the display of the high-scoring
segment pairs. For positive values, B is the maximum number of
database sequences for which high-scoring segment pairs will be
reported. This may be much smaller than the actual number of high-
scoring segment pairs reported, since any given database sequence may
yield several HSPs. The default value for B is 250. Negative values
for B are undefined and disallowed.
SUPPORT UTILITIES
Databases to be searched by these programs must first be processed
either by the program setdb for protein sequence databases (used by
blastp and blastx) or the program pressdb for nucleotide sequence
databases (used by blastn and tblastn). The input database format is
FASTA/Pearson.

Point accepted mutation (PAM) matrices of various generations can be
produced automatically with the pam program. The output can be saved
in a file whose name is then specified in the M=filename option of a
blastp, blastx, or tblastn query.
BUGS
blastn uses a large value for the wordlength, W, and does no
neighboring on these words. Consequently, the program is suitable for
finding nearly identical sequences rapidly. To identify weak amino
acid similarities encoded by nucleic acid, use blastx or tblastn.

In blastp, blastx, and tblastn, ad hoc equations have not been
implemented yet for calculating appropriate default values for T when
W has a value other than 3 or 4.

When nucleotide sequence databases
are processed into searchable form by the pressdb program, IUPAC
ambiguity letters are replaced by an appropriate random selection from
the list A, C, G and T. (For example, an R would be replaced on the
average half of the time by an A and half of the time by a G).
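
The random replacement of ambiguity letters can be sketched in Python as below. The IUPAC nucleotide table is standard; the function name and the optional seedable generator are assumptions, and nothing here reflects how pressdb is actually implemented.

```python
import random

# IUPAC nucleotide ambiguity codes and the bases each one stands for.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def resolve_ambiguities(seq, rng=None):
    """Replace each ambiguity letter with a uniformly random compatible base.
    Unambiguous letters are passed through unchanged."""
    rng = rng or random.Random()
    return "".join(
        rng.choice(IUPAC[c]) if c in IUPAC else c for c in seq.upper()
    )


# Passing a seeded generator makes the replacement reproducible.
print(resolve_ambiguities("ACRYN", random.Random(0)))
```

Because the substituted letters are random, a match found at such a position is an accident of the replacement, which is why the original sequences must be examined again once an HSP is found.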
Similarly, blastn replaces ambiguity letters in the query sequence
with appropriate random selections. Only after an HSP is found that
satisfies the cutoff score are the original sequences with their
ambiguities intact examined. With blastn, if a random replacement
letter happened to produce a match, the score of the re-examined
alignment will decrease and may consequently fall below the cutoff
score. With blastx and tblastn,
the outcome will depend upon whether a specific amino acid can be
deduced despite the presence of ambiguity codes.

tblastn uses only
one genetic code to translate the entire nucleotide sequence database,
although the particular genetic code employed is selectable via the
parameter C.

blastn, blastx, and tblastn treat U and T residues in
nucleotide sequences as the same residue (i.e., they match).

With two
exceptions, any letter in the query sequence which is not a member of
the relevant IUPAC amino acid or nucleotide code is stripped and does
not contribute to the sequence coordinate numbers reported by the
programs. The exceptions are asterisks (*) and hyphens (-) in amino
acid sequences, which are interpreted as translation stops and gap
characters, respectively. In protein sequence databases that are
processed into searchable form by the setdb program, non-IUPAC
letters, including any punctuation but excluding asterisks and
hyphens, are also stripped. The pressdb program does not strip
non-IUPAC codes, but treats them similarly to Ns.

blastn does not
incorporate the concept of a partial- or half-match, such as when a
purine in one sequence is juxtaposed with a purine from the other.
For two residues to match at all, they both must be members of the set
A, C, G and T (or U).

When calculating the Poisson statistics, some
HSPs may be incompatible with each other (not all of them may be
simultaneously alignable without reusing some portion of either
sequence) and yet they are (incorrectly) counted as independent
events.

The user may note that the nucleotide composition of a blastn
query sequence is irrelevant to the resulting Karlin parameters,
Lambda and K. This is due to the residue composition assumed for a
typical database sequence being 25% for each of the four nucleotides
A, C, G, and T. The values of the Karlin parameters are still
affected by the scoring scheme employed (parameters M and N).
Furthermore, the individual who compiles these programs is certainly
not prevented from setting a non-uniform residue composition for the
database sequences, in which case the query composition does become
relevant and will impact the values of the Karlin parameters that are
calculated by blastn.
SEE ALSO
blast3(1).
REFERENCES
Karlin, Samuel and Stephen F. Altschul (1990). Methods for assessing
the statistical significance of molecular sequence features by using
general scoring schemes. Proc. Natl. Acad. Sci. USA 87:2264-2268.

Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and
David J. Lipman (1990). Basic local alignment search tool. J. Mol.
Biol. 215:403-410.

Altschul, Stephen F. (1991). Amino acid substitution matrices from an
information theoretic perspective. J. Mol. Biol. 219:555-565.
- 8 - Formatted: October 29, 2025