


 BLAST(1)                                                           BLAST(1)
                                29 April 1992



 NAME
      blastp, blastn, blastx, tblastn - rapid sequence database query
      programs using the BLAST algorithm

 SYNOPSIS
      blastp aadb aaquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#]
                          [M=subfile] [Y=#] [Z=#] [K=#] [L=#] [H=#] [V=#] [B=#]

      blastn ntdb ntquery [E=#] [S=#] [W=#] [X=#] [M=#] [N=#] [Y=#] [Z=#]
                          [K=#] [L=#] [H=#] [V=#] [B=#] [[top][bottom]]

      blastx aadb ntquery [E=#] [S=#] [W=#] [T=#] [X=#] [M=subfile]
                          [Y=#] [Z=#] [C=#] [K=#] [L=#] [V=#] [B=#]
                          [[top][bottom]]

      tblastn ntdb aaquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#]
                          [M=subfile] [Y=#] [Z=#] [C=#] [K=#] [L=#]
                          [H=#] [V=#] [B=#] [[top][bottom]]

 DESCRIPTION
      BLAST (Basic Local Alignment Search Tool) is the heuristic search
      algorithm employed by the programs blastp, blastn, blastx, and
      tblastn.  The four programs are used for the following purposes:

      blastp
           to compare an amino acid query sequence vs. a protein sequence
           database;

      blastn
           to compare a nucleotide query sequence vs. a nucleotide sequence
           database;

      blastx
           to compare a nucleotide query sequence translated in all reading
           frames vs. a protein sequence database;

      tblastn
           to compare a protein query sequence vs. a nucleotide sequence
           database dynamically translated in all reading frames.

      Whenever a nucleotide query sequence or nucleotide database is
      involved, both strands (or all 6 reading frames) are searched by
      default.  The "top" and "bottom" options may be used to restrict a
      search to the specified strand.  (If both options are specified, both
      strands will be searched.)

      The unit of BLAST algorithm output is the High-scoring Segment Pair
      (HSP), where each segment in the pair is an equal but arbitrarily long
      run of contiguous residues.  In the programmatic implementations of
      the algorithm described here, an HSP is a pair of segments, one from
      the query sequence and one from a database sequence, where the score
      of their ungapped alignment meets or exceeds a parameterized,
      positive-valued cutoff.  A set of zero or more HSPs is thus defined by
      two



                                    - 1 -         Formatted:  April 25, 2024






      sequences, an alignment scoring scheme, and a cutoff score.  A
      Maximal-scoring Segment Pair (MSP) is defined by two sequences and a
      scoring scheme and is the highest-scoring of all segment pairs on all
      diagonals.  Depending on the parameters of a BLAST sequence
      comparison, there may be a non-zero probability of not finding one or
      more HSPs of which the MSP is a member.
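
      The HSP and MSP definitions above can be illustrated with a small
      brute-force sketch in Python.  It is not how the programs search
      (they use word hits and extensions); it simply scans every diagonal
      exhaustively.  The function names and the blastn-style scores of 5
      for a match and -4 for a mismatch are assumptions chosen for
      illustration only.

```python
def best_segment_on_diagonal(query, subject, offset, match=5, mismatch=-4):
    """Best ungapped segment pair on the diagonal where
    subject index = query index + offset.
    Returns (score, q_start, s_start, length); score 0 means none found."""
    best = (0, 0, 0, 0)
    run_score = 0
    run_start = 0
    q_lo = max(0, -offset)
    q_hi = min(len(query), len(subject) - offset)
    for q in range(q_lo, q_hi):
        if run_score <= 0:
            # A non-positive prefix never helps: start a new segment here.
            run_score, run_start = 0, q
        run_score += match if query[q] == subject[q + offset] else mismatch
        if run_score > best[0]:
            best = (run_score, run_start, run_start + offset,
                    q - run_start + 1)
    return best


def segment_pairs(query, subject, cutoff, **scores):
    """Best segment pair of each diagonal whose score meets the cutoff."""
    hits = []
    for offset in range(-(len(query) - 1), len(subject)):
        seg = best_segment_on_diagonal(query, subject, offset, **scores)
        if seg[0] >= cutoff:
            hits.append(seg)
    return sorted(hits, reverse=True)


def maximal_segment_pair(query, subject, **scores):
    """The MSP: the highest-scoring segment pair over all diagonals."""
    hits = segment_pairs(query, subject, cutoff=1, **scores)
    return hits[0] if hits else None


print(maximal_segment_pair("ACGTACGT", "TTACGTACGTTT"))
```

      Raising the cutoff passed to segment_pairs() shrinks the set of HSPs
      but never changes the MSP, which depends only on the two sequences
      and the scoring scheme.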

 PARAMETERS
      Parameters are modified using a name=value syntax, e.g., E=0.05 or
      S=100.

      E is interpreted as the expected number of MSPs that will satisfy the
      cutoff score under the random sequence model.  The value of E
      approximates the expected number of HSPs that will be found during the
      course of an entire database search.  The default value for E is 10,
      and the permitted range for this Real valued parameter is
      0 < E <= 1000.

      S is the cutoff score for reporting HSPs.  Higher scores correspond to
      increasing statistical significance, a lower probability, or a reduced
      expected frequency of occurrence by chance.  Any positive-scoring
      alignments which the programs find but which score below S go
      unreported.  Unless S is explicitly set on the command line, its
      default value is calculated from the value of E.

      The values for E and S are interconvertible, a process which is
      dependent on the following factors: the length and residue composition
      of the query sequence; the length of the database and a fixed,
      hypothetical residue composition for it; and the scoring scheme
      employed.  The scoring scheme used by blastp, blastx, and tblastn is a
      substitution matrix; the scoring scheme used by blastn is a positive
      reward score for matching residues and a negative penalty score for
      mismatching residues.

      When both of the parameters E and S are specified on the command line,
      the one resulting in the higher (more restrictive) cutoff score will
      be used.  When neither of these parameters is specified on the command
      line, the default value for E is used to calculate the cutoff score.

      For a given value of E (e.g., the default value of 10), a given query
      sequence, and a single scoring scheme, the calculated value of the
      cutoff score S will be different when searching databases of different
      lengths.  To normalize the statistics reported when databases of
      different lengths are searched, the parameter Z (see below) may be set
      to a constant value for all database searches.

      S takes on only integral values in the present implementations of the
      BLAST algorithm.  When the cutoff score is set implicitly via E, S is
      rounded up to the least integral value required to satisfy E.  Since
      the rounding procedure can decrease the effective value of E, the
      calculated value for S is used to back-calculate the effective value
      for E.  For example, if the user specifies E = 50 on the command line,
      a cutoff score that is rounded up by 0.9 units to the smallest
      satisfying integer might correspond to an expected number of HSPs of
      only 43.  In this case, the value displayed for E at the end of the
      program's report will be 43, not 50.

      When at least one HSP is found involving any given database sequence,
      the programs blastp and tblastn search the database sequence a second
      time for HSPs that satisfy a lower cutoff score, S2.  In essence, the
      second-pass search gives these programs the opportunity to report any
      low-significance
      HSPs they may have found that might be of interest within the context
      of finding one or more higher-scoring (perhaps statistically
      significant) HSPs.  Poisson statistics may indicate that the lower-
      scoring (higher-probability) HSPs are statistically significant when
      their frequencies of occurrence are considered.

      In a relationship similar to that between the parameters E and S, S2
      can be set explicitly on the command line or it will be calculated
      from the setting of E2.  Whereas S is related to E by the size of the
      database and the length of the query sequence, S2 is related to E2 by
      the lengths of a pair of hypothetical protein sequences of 300
      residues each.  In other words, E2 approximates the number of HSPs one
      would expect to find when comparing two protein sequences of length
      300, one having the composition of the query sequence and the other
      having the hypothetical residue composition of the database.  If a
      second-pass search is not desired, setting E2 to zero (0) turns this
      feature off.  A second-pass search is likewise not performed if S2
      happens to be equal to or greater than the primary cutoff score.

      The user should be forewarned that, with no other knowledge about a
      positive-scoring segment pair than its score, the probability that the
      BLAST algorithm will detect the alignment decreases as the score of
      the alignment decreases.  Consequently, low-scoring HSPs looked for in
      the second-pass search have a lesser chance individually of being
      found than the original HSP.  With a fixed scoring scheme, the
      probability of missing an alignment can be decreased by: lowering the
      neighborhood word-score threshold, T, while keeping the word size, W,
      constant; lowering both W and T appropriately (see Altschul et al.,
      1990); and/or raising the word-hit-extension drop-off score X
      (described below).

      W is the word size for finding initial hits against the database
      sequences.  Each hit is extended in both directions along the
      corresponding diagonal of an imaginary 2-dimensional matrix until the
      segment score drops off by at least the quantity X.  The default value
      for W is 3 amino acids for blastp, blastx, and tblastn, and 12
      nucleotides for blastn.  The value of W used by blastn should not be
      changed, as the logic of the program source code has not been
      validated for use with values other than the default (particularly
      smaller values).  For the other programs, which perform sequence
      comparisons at the level of individual amino acids, W should generally
      be restricted to values less than 5 or else the value for T should be
      specified disproportionately larger to avoid consuming vast quantities
      of memory for the neighborhood word list (see below).
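
      The extension of a word hit along its diagonal, stopping once the
      score has dropped off by at least X, can be sketched as follows.
      This is a minimal illustration and not the programs' source code:
      the function name, the argument layout, and the match/mismatch
      scores of 5 and -4 are assumptions.

```python
def extend_hit(query, subject, q_pos, s_pos, word_size, x_drop,
               match=5, mismatch=-4):
    """Extend a word hit in both directions along its diagonal.

    Extension in a direction stops once the running score has fallen
    x_drop or more below the best score seen in that direction.
    Returns (score, q_start, s_start, length) of the best segment."""
    def score(a, b):
        return match if a == b else mismatch

    # Score of the initial word hit itself.
    total = sum(score(query[q_pos + i], subject[s_pos + i])
                for i in range(word_size))

    def extend(step, q, s):
        best = running = 0
        best_len = n = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += score(query[q], subject[s])
            n += 1
            if running > best:
                best, best_len = running, n
            elif best - running >= x_drop:
                break  # dropped off by at least X: give up
            q += step
            s += step
        return best, best_len

    right, right_len = extend(+1, q_pos + word_size, s_pos + word_size)
    left, left_len = extend(-1, q_pos - 1, s_pos - 1)
    return (total + left + right, q_pos - left_len, s_pos - left_len,
            left_len + word_size + right_len)


print(extend_hit("GGACGTACGTCC", "TTACGTACGTAA", 4, 4, 4, x_drop=8))
```

      A larger x_drop lets the extension walk through longer mismatching
      stretches before giving up, which is why raising X costs search
      time.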

      T is the word score threshold for generating neighborhood words of
      length W from the query sequence, prior to scanning the database
      (blastp, blastx, and tblastn only).  Words which have an aggregate
      score (through summation of the individual residue substitution
      scores) of at least T when aligned with words from the query sequence
      are included in the neighborhood list.  Raising the value of T
      increases the likelihood of completely missing HSPs, but can decrease
      the search time and memory requirements of the programs by decreasing
      the size of the neighborhood list.  One of the key features of the
      BLAST algorithm is the user-selectable trade-off between sensitivity
      and speed.

      A generally suitable value for T is calculated at run-time, using the
      residue composition and length of the query sequence and the
      substitution matrix employed.  The neighborhood word-score threshold
      is set using an ad hoc equation that is a function of Lambda and H.
      Lambda is the number of nats of information gained per unit increase
      in score of an alignment (approximately 0.69315 times the number of
      bits per unit score); H is the relative entropy of the target and
      background residue frequencies (Karlin and Altschul, 1990), or the
      expected information available per position in an alignment to
      distinguish it from chance.  The supplied PAM120 amino acid
      substitution matrix, with a scale of ln(2)/2, yields a value for
      Lambda that is close to 0.5 bit per unit score for query sequences of
      typical residue compositions.  Occasionally it may be necessary to
      manually set the neighborhood word-score threshold via the command
      line, for which 13 may be a good value to try, but this is highly
      dependent on the substitution matrix and word-length, W, being
      employed.

      X is a positive integer representing the maximum permissible drop-off
      of the cumulative segment score during word-hit extension.  Raising X
      may decrease the chance that the BLAST algorithm overlooks an HSP, but
      it may also significantly increase the search time.  If computation
      time is of little concern, X might be increased several points from
      its default value, but only a very marginal increase in sensitivity
      might be expected.  For blastp, blastx, and tblastn, the default value
      of X is calculated to be the minimum integral score representing at
      least 10 bits of information, or a reduction in the statistical
      significance of the alignment by a factor of 2 to the power of 10 (or
      about 1,000).  For blastn, the default value of X is the minimum
      integral score that represents at least 20 bits of information, or a
      reduction in the statistical significance of the alignment by a factor
      of 2 to the power of 20 (or about one million).

      The command line parameters K and L can be used to set fixed values
      for the Karlin statistics' K and lambda parameters, respectively.
      Users should generally avoid setting these parameters unless the full
      ramifications of doing so are understood.  As an example of one of the
      less obvious effects of manually choosing these parameters, the value
      of the H statistic reported at the end of each program's output (which
      is distinct from the command line parameter of the same name) is a
      function of lambda; and the default value for the neighborhood word-
      score threshold parameter T is in turn a function of H.
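
      The neighborhood word list governed by W and T, as described in this
      section, can be sketched as below.  The scoring function is a toy
      stand-in and not PAM120; it, the four-letter alphabet, and the
      function names are assumptions made for illustration.

```python
from itertools import product


def neighborhood_words(query, word_size, threshold, alphabet, score):
    """Map every word that scores at least `threshold` against some
    query word to the query positions it was generated from."""
    neighbors = {}
    for pos in range(len(query) - word_size + 1):
        q_word = query[pos:pos + word_size]
        # Enumerate all candidate words of length W over the alphabet.
        for cand in product(alphabet, repeat=word_size):
            total = sum(score(a, b) for a, b in zip(q_word, cand))
            if total >= threshold:
                neighbors.setdefault("".join(cand), []).append(pos)
    return neighbors


def toy_score(a, b):
    # Hypothetical scores: identity 5, the similar pair I/L 2, else -3.
    if a == b:
        return 5
    if {a, b} == {"I", "L"}:
        return 2
    return -3


for t in (9, 12, 15):
    words = neighborhood_words("LIV", 3, t, "ILVK", toy_score)
    print(t, sorted(words))
```

      The list shrinks as T rises, which is the memory and speed saving
      (and the sensitivity cost) described for T.  The number of candidate
      words grows as the alphabet size to the power W, which is why large
      W is costly for amino acid comparisons.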

 SCORING SCHEMES
      With blastp, blastx, and tblastn, the M option can be used to select
      an alternate substitution matrix file.  The default PAM120 matrix is
      recommended for general protein similarity searches (Altschul, 1991).
      While only the PAM120 and the PAM250 matrices are provided, the pam(1)
      program can be used to produce PAM matrices of any desired generation
      from 2 to 511.  For rigorous searches where the mutational distance
      between potential homologs is unknown, Altschul (1991) recommends
      performing three searches, one each with the PAM40, PAM120, and
      PAM250 matrices.

      In blastn, M is the score for a single-letter match; N is the score
      for a single-letter mismatch.  M and N must be positive and negative
      integers, respectively.  Given the assumption
      made by blastn that the 4 nucleotides A, C, G, and T are represented
      equally in the database, the expected score for the query sequence
      must be negative.
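
      The negative-expected-score requirement for blastn can be checked
      with a short sketch.  With uniform database composition a random
      pair of letters matches with probability 0.25, so the expected
      score is 0.25*M + 0.75*N.  The sketch also solves the standard
      Karlin-Altschul equation for Lambda by bisection; the function
      names and the example values M=5, N=-4 are assumptions, and this is
      not taken from the programs' source.

```python
import math


def expected_score(match, mismatch, p_match=0.25):
    """Expected per-position score under the random sequence model."""
    return p_match * match + (1.0 - p_match) * mismatch


def solve_lambda(match, mismatch, p_match=0.25, iterations=200):
    """Positive root of p*exp(lam*M) + (1-p)*exp(lam*N) = 1."""
    if expected_score(match, mismatch, p_match) >= 0:
        raise ValueError("expected score must be negative")

    def f(lam):
        return (p_match * math.exp(lam * match)
                + (1.0 - p_match) * math.exp(lam * mismatch) - 1.0)

    hi = 1.0
    while f(hi) <= 0.0:      # grow until f is positive
        hi *= 2.0
    lo = hi
    while f(lo) >= 0.0:      # shrink until f is negative
        lo /= 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0


print(expected_score(5, -4))   # negative, so the scheme is usable
print(solve_lambda(5, -4))
```

      If the expected score were zero or positive, the equation would have
      no positive root and the significance estimates would be undefined,
      which is one way to see why the restriction exists.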

 SEQUENCE LENGTH AND STATISTICAL SIGNIFICANCE
      For the purpose of calculating significance levels, Y is the effective
      length of the query sequence and Z is the effective length of the
      database, both measured in residues.  The default values for these
      parameters are the actual lengths of the query sequence and database,
      respectively.  Larger values signify more degrees of freedom for
      aligning the sequences and reduced statistical significance for an
      alignment of any given score.
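
      The effect of Y and Z on significance can be sketched with the
      standard Karlin-Altschul estimate E = K * Y * Z * exp(-Lambda * S)
      (Karlin and Altschul, 1990).  The man page does not spell out the
      exact formula the programs use, so treat this as an illustration
      under that assumption; the K and Lambda values below are
      hypothetical.

```python
import math


def expected_hsps(score, K, lam, query_len, db_len):
    """Expected number of HSPs scoring at least `score` by chance."""
    return K * query_len * db_len * math.exp(-lam * score)


def cutoff_score(E, K, lam, query_len, db_len):
    """Smallest integer score whose expectation does not exceed E."""
    return math.ceil(math.log(K * query_len * db_len / E) / lam)


K, lam = 0.1, 0.3            # hypothetical Karlin parameters
for Z in (10_000_000, 100_000_000):
    S = cutoff_score(10, K, lam, query_len=250, db_len=Z)
    effective_E = expected_hsps(S, K, lam, 250, Z)
    print(Z, S, round(effective_E, 2))
```

      Because S is rounded up to an integer, the back-calculated
      expectation is at most the requested E, and a larger Z yields a
      higher cutoff for the same E.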

 GENETIC CODES
      C is a non-negative integer that determines the genetic code that will
      be used by blastx to translate the query sequence or by tblastn to
      translate the database sequences.  The default genetic code (C=0)
      corresponds to the so-called Standard or Universal genetic code.  To
      obtain a listing of the nine available genetic codes and their
      associated numerical identifiers, invoke either blastx or tblastn with
      the command line parameter C=list.  The current list of genetic codes
      and their associated values for parameter C is:

           0    Standard or Universal
           1    Vertebrate Mitochondrial
           2    Yeast Mitochondrial
           3    Mold Mitochondrial and Mycoplasma
           4    Invertebrate Mitochondrial
           5    Ciliate Macronuclear
           6    Protozoan Mitochondrial
           7    Plant Mitochondrial
           8    Echinoderm Mitochondrial
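
      Translation of a nucleotide sequence in all six reading frames with
      the Standard genetic code can be sketched as follows.  The frame
      numbering (1 to 3 for the top strand, -1 to -3 for the bottom
      strand), the use of X for untranslatable codons, and the function
      names are assumptions of this sketch; the other codes listed above
      would be represented by swapping in a different codon dictionary.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
                 for i, a in enumerate(BASES)
                 for j, b in enumerate(BASES)
                 for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq, code=STANDARD_CODE):
    """Translate one frame; U is treated as T, unknown codons become X."""
    seq = seq.upper().replace("U", "T")
    return "".join(code.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq, strands=("top", "bottom"), code=STANDARD_CODE):
    """Return {frame: protein}; restrict strands as with top/bottom."""
    seq = seq.upper().replace("U", "T")
    frames = {}
    if "top" in strands:
        for f in range(3):
            frames[f + 1] = translate(seq[f:], code)
    if "bottom" in strands:
        rc = seq.translate(COMPLEMENT)[::-1]   # reverse complement
        for f in range(3):
            frames[-(f + 1)] = translate(rc[f:], code)
    return frames


print(six_frames("ATGGCCTAA"))
```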

 POISSON STATISTICS
      The occurrence of two or more HSPs involving the query sequence and
      the same database sequence is modeled as a Poisson process.  An
      important result of applying Poisson statistics is that an HSP with a
      low score and high Expect value (low significance) may be discovered
      to be statistically significant when appearing in the context of one
      or more additional matches of equal or higher score against the same
      database sequence.  The Poisson P-value for any given HSP is a
      function of its expected frequency of occurrence and the number of
      HSPs actually observed with scores at least as high.  The Poisson P-
      value for a group of HSP events is the probability that at least as
      many HSPs would occur by chance, each with a score at least as high as
      the lowest-scoring member of the group.  HSPs which appear on opposite
      strands of a nucleotide query or database sequence are considered
      independent and distinguishable events, and so are counted separately.
      Given the score of an HSP, when the expected length for an alignment
      with that score (see the description of H above) is a significant
      fraction of the length of the query sequence, the Expect value used in
      estimation of the Poisson P-value is reduced proportionately.
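
      The Poisson P-value for a group of HSPs can be sketched as the
      probability of observing at least n events when the mean is the
      Expect value of the lowest-scoring member.  This is a minimal
      illustration; it does not reproduce the length adjustment of the
      Expect value mentioned above, and the function name is an
      assumption.

```python
import math


def poisson_p_value(expect, n_observed):
    """P(at least n_observed events) for a Poisson mean of `expect`."""
    below = sum(math.exp(-expect) * expect ** k / math.factorial(k)
                for k in range(n_observed))
    return 1.0 - below


# One HSP with Expect 0.5 versus two HSPs each at least that good.
print(poisson_p_value(0.5, 1))
print(poisson_p_value(0.5, 2))
```

      The second value is smaller than the first, which is how a low-
      scoring HSP can become significant in the company of other matches
      against the same database sequence.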

 P-VALUES, ALIGNMENT SCORES, AND INFORMATION
      The Expect and P-values of HSPs reported by the programs depend on
      numerous factors, including: the scoring scheme employed, the
      residue composition of the query sequence, the assumed residue
      composition for a typical database sequence, and the lengths of the
      query sequence and database.  Independent of the query and database
      lengths are the HSP scores themselves, which may be readily compared
      between different program runs even if the databases searched are of
      different lengths, as long as all of the other relevant factors listed
      here were unchanged.  Further isolation from the many variables of a
      search in one's assessment of an HSP may be obtained by observing the
      information content reported (in bits) for the alignments.  While the
      information content of an HSP may change when fundamentally different
      scoring schemes are used (e.g., different generations of PAM
      matrices), the number of bits reported for an HSP will be independent
      of the scales to which the matrices were generated.  (In practice,
      this statement is not quite true because the substitution scores used
      by these programs are floating point or Real values which have been
      rounded to the nearest integers and thus lack a high degree of
      precision.)
      When communicating the statistical significance of an alignment, the
      alignment score itself is generally not so important as the
      combination of the substitution matrix employed and the actual
      information content of the alignment.
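
      The scale independence of the information content can be sketched
      by converting a score to bits as Lambda * S / ln(2), which follows
      from the description of Lambda as nats per unit score.  Whether the
      programs apply any further correction when reporting bits is not
      stated here, so this is an assumption, as are the function names
      and example scores.

```python
import math


def score_to_bits(score, lam):
    """Information content in bits of an alignment score."""
    return lam * score / math.log(2)


def min_score_for_bits(bits, lam):
    """Smallest integer score worth at least `bits` bits (the small
    epsilon guards against floating point round-off)."""
    return math.ceil(bits * math.log(2) / lam - 1e-9)


half_bit = math.log(2) / 2     # matrix scaled to 1/2 bit per unit
third_bit = math.log(2) / 3    # same matrix scaled to 1/3 bit per unit

# The same alignment scores 60 at one scale and 90 at the other.
print(score_to_bits(60, half_bit), score_to_bits(90, third_bit))
print(min_score_for_bits(10, half_bit))
```

      Both scores map to the same number of bits.  The second function
      mirrors the way the default for X is described: the minimum
      integral score representing at least a given number of bits.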

 REGULATING OUTPUT
      The output is organized into three independently regulated sections: a
      histogram of word-hit extension scores; one-line descriptions of the
      database sequences that yielded one or more HSPs; and the high-scoring
      segment pairs themselves.  Each section of the output can be
      selectively suppressed by setting the parameters H, V, and B to 0
      (zero).

      Parameter H regulates the display of a histogram of the scores of the
      highest-scoring hit extensions for each database sequence.  If H is
      assigned a non-zero value on the command line, the histogram will be
      displayed (except for the blastx program, which never displays a
      histogram but retains the H parameter for command-line compatibility
      with the other programs).  The default value for H is 0 (no
      histogram).

      Parameter V is the maximum number of database sequences for which
      one-line descriptions will be reported.  The default value for V is
      500.  A warning message is prominently displayed at the end of the
      one-line descriptions section when HSPs are found in more than V
      sequences.  When V is zero, no one-line descriptions are reported and
      no warning is given.  Negative values for V are undefined and
      disallowed.  As an example of how V can be used advantageously, if a
      high value for E is desired to virtually assure in all cases that at
      least one HSP will be found, selecting a small value for V will ensure
      that the output will not be too voluminous; only the most
      statistically significant matches will be reported.

      Parameter B regulates the display of the high-scoring segment pairs.
      For positive values, B is the maximum number of database sequences for
      which high-scoring segment pairs will be reported.  This may be much
      smaller than the actual number of high-scoring segment pairs reported,
      since any given database sequence may yield several HSPs.  The default
      value for B is 250.  Negative values for B are undefined and
      disallowed.




 SUPPORT UTILITIES
      Databases to be searched by these programs must first be processed
      either by the program setdb for protein sequence databases (used by
      blastp and blastx) or by the program pressdb for nucleotide sequence
      databases (used by blastn and tblastn).  The input database format is
      FASTA/Pearson.

      Point accepted mutation (PAM) matrices of various generations can be
      produced automatically with the pam program.  The output can be saved
      in a file whose name is then specified in the M=filename option of a
      blastp, blastx, or tblastn query.

 BUGS
      blastn uses a large value for the wordlength, W, and does no
      neighboring on these words.  Consequently, the program is suitable for
      finding nearly identical sequences rapidly.  To identify weak amino
      acid similarities encoded by nucleic acid, use blastx or tblastn.

      In blastp, blastx, and tblastn, ad hoc equations have not been
      implemented yet for calculating appropriate default values for T when
      W has a value other than 3 or 4.

      When nucleotide sequence databases are processed into searchable form
      by the pressdb program, IUPAC ambiguity letters are replaced by an
      appropriate random selection from the list A, C, G and T.  (For
      example, an R would be replaced on the average half of the time by an
      A and half of the time by a G.)  Similarly, blastn replaces ambiguity
      letters in the query sequence with appropriate random selections.
      Only after an HSP is found that satisfies the cutoff score are the
      original sequences with their ambiguities intact examined.  With
      blastn, the alignment score will decrease and may consequently fall
      below the cutoff score if the random replacement letter happened to
      match.  With blastx and tblastn, the outcome will depend upon whether
      a specific amino acid can be deduced despite the presence of ambiguity
      codes.

      tblastn uses only one genetic code to translate the entire nucleotide
      sequence database, although the particular genetic code employed is
      selectable via the parameter C.

      blastn, blastx, and tblastn treat U and T residues in nucleotide
      sequences as the same residue (i.e., they match).

      With two exceptions, any letter in the query sequence which is not a
      member of the relevant IUPAC amino acid or nucleotide code is stripped
      and does not contribute to the sequence coordinate numbers reported by
      the programs.  The exceptions are asterisks (*) and hyphens (-) in
      amino acid sequences, which are interpreted as translation stops and
      gap characters, respectively.  In protein sequence databases that are
      processed into searchable form by the setdb program, non-IUPAC
      letters, including any punctuation but excluding asterisks and
      hyphens, are also stripped.  The pressdb program does not strip non-
      IUPAC codes, but treats them similarly to Ns.

      blastn does not incorporate the concept of a partial- or half-match,
      such as when a purine in one sequence is juxtaposed with a purine from
      the other.  For two residues to match at all, they both must be
      members of the set A, C, G and T (or U).

      When calculating the Poisson statistics, some HSPs may be incompatible
      with each other (not all of them may be simultaneously alignable
      without reusing some portion of either sequence) and yet they are
      (incorrectly) counted as independent
      events.

      The user may note that the nucleotide composition of a blastn query
      sequence is irrelevant to the resulting Karlin parameters, Lambda and
      K.  This is due to the residue composition assumed for a typical
      database sequence being 25% for each of the four nucleotides A, C, G,
      and T.  The values of the Karlin parameters are still affected by the
      scoring scheme employed (parameters M and N).  Furthermore, the
      individual who compiles these programs is certainly not prevented from
      setting a non-uniform residue composition for the database sequences,
      in which case the query composition does become relevant and will
      impact the values of the Karlin parameters that are calculated by
      blastn.
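
      The random replacement of IUPAC ambiguity letters described in this
      section can be sketched as below.  The table holds the standard
      IUPAC nucleotide ambiguity codes; the function name, the handling
      of U as T, and the choice to pass other letters through unchanged
      are assumptions of the sketch and do not describe what pressdb or
      blastn do with non-IUPAC input.

```python
import random

# Standard IUPAC nucleotide ambiguity codes and the bases they cover.
IUPAC = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
         "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}


def resolve_ambiguities(seq, rng=random):
    """Replace each ambiguity letter by a uniformly random member of
    the set of bases it stands for."""
    out = []
    for letter in seq.upper():
        if letter == "U":
            letter = "T"
        if letter in IUPAC:
            letter = rng.choice(IUPAC[letter])
        out.append(letter)
    return "".join(out)


print(resolve_ambiguities("ACRTNNGY", random.Random(1)))
```

      An R is resolved to A about half of the time and to G the other
      half, so a hit found against the resolved sequence still has to be
      rescored against the original letters.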

 SEE ALSO
      blast3(1).

 REFERENCES
      Karlin, Samuel and Stephen F. Altschul (1990).  Methods for assessing
      the statistical significance of molecular sequence features by using
      general scoring schemes.  Proc. Natl. Acad. Sci. USA 87:2264-2268.

      Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and
      David J. Lipman (1990).  Basic local alignment search tool.  J. Mol.
      Biol. 215:403-410.

      Altschul, Stephen F. (1991).  Amino acid substitution matrices from an
      information theoretic perspective.  J. Mol. Biol. 219:555-565.