
BLAST(1) BLAST(1) 29 April 1992

NAME
blastp, blastn, blastx, tblastn - rapid sequence database query programs using the BLAST algorithm

SYNOPSIS
blastp aadb aaquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#] [M=subfile] [Y=#] [Z=#] [K=#] [L=#] [H=#] [V=#] [B=#]

blastn ntdb ntquery [E=#] [S=#] [W=#] [X=#] [M=#] [N=#] [Y=#] [Z=#] [K=#] [L=#] [H=#] [V=#] [B=#] [[top][bottom]]

blastx aadb ntquery [E=#] [S=#] [W=#] [T=#] [X=#] [M=subfile] [Y=#] [Z=#] [C=#] [K=#] [L=#] [V=#] [B=#] [[top][bottom]]

tblastn ntdb aaquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#] [M=subfile] [Y=#] [Z=#] [C=#] [K=#] [L=#] [H=#] [V=#] [B=#] [[top][bottom]]

DESCRIPTION
BLAST (Basic Local Alignment Search Tool) is the heuristic search algorithm employed by the programs blastp, blastn, blastx, and tblastn. The four programs are used for the following purposes:

   blastp    to compare an amino acid query sequence vs. a protein sequence database;
   blastn    to compare a nucleotide query sequence vs. a nucleotide sequence database;
   blastx    to compare a nucleotide query sequence translated in all reading frames vs. a protein sequence database;
   tblastn   to compare a protein query sequence vs. a nucleotide sequence database dynamically translated in all reading frames.

Whenever a nucleotide query sequence or nucleotide database is involved, both strands (or all 6 reading frames) are searched by default. The "top" and "bottom" options may be used to restrict a search to the specified strand. (If both options are specified, both strands will be searched.)

The unit of BLAST algorithm output is the High-scoring Segment Pair (HSP), where the two segments in the pair are runs of contiguous residues of equal but arbitrary length. In the programmatic implementations of the algorithm described here, an HSP is a pair of segments, one from the query sequence and one from a database sequence, where the score of their ungapped alignment meets or exceeds a parameterized, positive-valued cutoff.
A set of zero or more HSPs is thus defined by two sequences, an alignment scoring scheme, and a cutoff score. A Maximal-scoring Segment Pair (MSP) is defined by two sequences and a scoring scheme and is the highest-scoring of all segment pairs on all diagonals. Depending on the parameters of a BLAST sequence comparison, there may be a non-zero probability of failing to find one or more of the HSPs, a set that includes the MSP.

PARAMETERS
Parameters are modified using a name=value syntax, e.g., E=0.05 or S=100.

E is interpreted as the expected number of MSPs that will satisfy the cutoff score under the random sequence model. The value of E approximates the expected number of HSPs that will be found during the course of an entire database search. The default value for E is 10, and the permitted range for this Real-valued parameter is 0 < E <= 1000.

S is the cutoff score for reporting HSPs. Higher scores correspond to increasing statistical significance, a lower probability, or a reduced expected frequency of occurrence by chance. Any positive-scoring alignments which the programs find but which score below S go unreported. Unless S is explicitly set on the command line, its default value is calculated from the value of E.

The values for E and S are interconvertible, a process which is dependent on the following factors: the length and residue composition of the query sequence; the length of the database and a fixed, hypothetical residue composition for it; and the scoring scheme employed. The scoring scheme used by blastp, blastx, and tblastn is a substitution matrix; the scoring scheme used by blastn is a positive reward score for matching residues and a negative penalty score for mismatching residues.

When both of the parameters E and S are specified on the command line, the one resulting in the highest (most restrictive) cutoff score will be used.
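The interconversion of E and S can be illustrated with a minimal sketch. It assumes the relation E = K * m * n * exp(-lambda * S) from Karlin and Altschul (1990), where m and n are the query and database lengths. The function names and the numeric values of K and lambda used below are hypothetical and chosen only for illustration; they are not taken from the programs' source code. The result is rounded up to an integer because S takes only integral values in these programs.

```python
import math


def cutoff_from_expect(e, k, lam, query_len, db_len):
    """Smallest integer score S for which K*m*n*exp(-lam*S) <= e."""
    s = math.log(k * query_len * db_len / e) / lam
    return max(1, math.ceil(s))


def expect_from_cutoff(s, k, lam, query_len, db_len):
    """Expected number of chance segment pairs scoring at least s."""
    return k * query_len * db_len * math.exp(-lam * s)


def effective_cutoff(k, lam, query_len, db_len, e=None, s=None, default_e=10.0):
    """Pick the cutoff the way the text describes: the more restrictive
    of the two when both E and S are given, the default E when neither is."""
    if e is None and s is None:
        e = default_e
    candidates = []
    if e is not None:
        candidates.append(cutoff_from_expect(e, k, lam, query_len, db_len))
    if s is not None:
        candidates.append(s)
    return max(candidates)


if __name__ == "__main__":
    # Illustrative (assumed) statistics: K=0.1, lambda=0.3 nats per unit score.
    k, lam, m, n = 0.1, 0.3, 250, 10_000_000
    s = cutoff_from_expect(10.0, k, lam, m, n)
    print("cutoff S:", s)
    print("effective E:", expect_from_cutoff(s, k, lam, m, n))
```

Because the cutoff is rounded up, the effective E recomputed from the integer score is never larger than the E that was requested.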
When neither of these parameters is specified on the command line, the default value for E is used to calculate the cutoff score.

For a given value of E (e.g., the default value of 10), a given query sequence, and a single scoring scheme, the calculated value of the cutoff score S will be different when searching databases of different lengths. To normalize the statistics reported when databases of different lengths are searched, the parameter Z (see below) may be set to a constant value for all database searches.

S takes on only integral values in the present implementations of the BLAST algorithm. When the cutoff score is set implicitly via E, S is rounded up to the smallest integer that satisfies E. Since the rounding procedure can decrease the effective value of E, the calculated value for S is used to back-calculate the effective value for E. For example, if the user specifies E = 50 on the command line, a cutoff score that is rounded up by 0.9 units to the smallest satisfying integer might correspond to an expected number of HSPs of only 43. In this case, the value displayed for E at the end of the program's report will be 43, not 50.

When at least one HSP is found involving any given database sequence, the programs blastp and tblastn search the database sequence a second time for HSPs that satisfy a lower cutoff score, S2. In essence, the second-pass search gives these programs the opportunity to report any low-significance HSPs they may have found that might be of interest within the context of finding one or more higher-scoring (perhaps statistically significant) HSPs. Poisson statistics may indicate that the lower-scoring (higher-probability) HSPs are statistically significant when their frequencies of occurrence are considered.

In a relationship similar to that between the parameters E and S, S2 can be set explicitly on the command line or it will be calculated from the setting of E2.
Whereas S is related to E by the size of the database and the length of the query sequence, S2 is related to E2 by the lengths of a pair of hypothetical protein sequences of 300 residues each. In other words, E2 approximates the number of HSPs one would expect to find when comparing two protein sequences of length 300, one having the composition of the query sequence and the other having the hypothetical residue composition of the database. If a second-pass search is not desired, setting E2 to zero (0) turns this feature off. A second-pass search is also skipped if S2 happens to be equal to or greater than the primary cutoff score.

The user should be forewarned that, with no other knowledge about a positive-scoring segment pair than its score, the probability that the BLAST algorithm will detect the alignment decreases as the score of the alignment decreases. Consequently, low-scoring HSPs looked for in the second-pass search individually have a lesser chance of being found than the original HSP. With a fixed scoring scheme, the probability of missing an alignment can be decreased by: lowering the neighborhood word-score threshold, T, while keeping the word size, W, constant; lowering both W and T appropriately (see Altschul et al., 1990); and/or raising the word-hit-extension drop-off score X (described below).

W is the word size for finding initial hits against the database sequences. Each hit is extended in both directions along the corresponding diagonal of an imaginary 2-dimensional matrix until the segment score drops off by at least the quantity X. The default value for W is 3 amino acids for blastp, blastx, and tblastn, and 12 nucleotides for blastn. The value of W used by blastn should not be changed, as the logic of the program source code has not been validated for use with values other than the default (particularly smaller values).
For the other programs, which perform sequence comparisons at the level of individual amino acids, W should generally be restricted to values less than 5, or else T should be set disproportionately larger to avoid consuming vast quantities of memory for the neighborhood word list (see below).

T is the word score threshold for generating neighborhood words of length W from the query sequence, prior to scanning the database (blastp, blastx, and tblastn only). Words which have an aggregate score (through summation of the individual residue substitution scores) of at least T when aligned with words from the query sequence are included in the neighborhood list. Raising the value of T increases the likelihood of completely missing HSPs, but can decrease the search time and memory requirements of the programs by decreasing the size of the neighborhood list. One of the key features of the BLAST algorithm is the user-selectable trade-off of sensitivity for speed.

A generally suitable value for T is calculated at run-time, using the residue composition and length of the query sequence and the substitution matrix employed. The neighborhood word-score threshold is set using an ad hoc equation that is a function of Lambda and H. Lambda is the number of nats of information gained per unit increase in score of an alignment (approximately 0.69315 times the number of bits per unit score); H is the relative entropy of the target and background residue frequencies [Karlin and Altschul, 1990], or the expected information available per position in an alignment to distinguish it from chance. The supplied PAM120 amino acid substitution matrix, with a scale of ln(2)/2, yields a value for Lambda that is close to 0.5 bit per unit score for query sequences of typical residue compositions.
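The effect of T on the neighborhood list can be sketched as follows. This is a brute-force illustration only, not the method used by the programs: the four-letter alphabet and the toy_score function are hypothetical stand-ins for the 20 amino acids and a real substitution matrix such as PAM120, and every possible word of length W is enumerated and scored against each query word.

```python
from itertools import product


def toy_score(a, b):
    """Hypothetical substitution scores: identity 4, the 'similar'
    pair D/E 2, any other mismatch -1. Not a real PAM matrix."""
    if a == b:
        return 4
    if {a, b} == {"D", "E"}:
        return 2
    return -1


def neighborhood(query, w, t, alphabet, score):
    """For each word position in the query, list every word of length w
    whose summed residue scores against the query word are at least t."""
    result = {}
    for i in range(len(query) - w + 1):
        qword = query[i:i + w]
        result[i] = [
            "".join(cand)
            for cand in product(alphabet, repeat=w)
            if sum(score(a, b) for a, b in zip(qword, cand)) >= t
        ]
    return result


if __name__ == "__main__":
    for t in (12, 10, 7):
        words = neighborhood("ADE", 3, t, "ACDE", toy_score)[0]
        print("T =", t, "->", len(words), "neighborhood words")
```

Lowering T in this toy setting admits more words per query position, which mirrors the memory and search-time cost described above; raising it shrinks the list at the price of sensitivity.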
Occasionally it may be necessary to manually set the neighborhood word-score threshold via the command line, for which 13 may be a good value to try, but this is highly dependent on the substitution matrix and word length, W, being employed.

X is a positive integer representing the maximum permissible drop-off of the cumulative segment score during word-hit extension. Raising X may decrease the chance that the BLAST algorithm overlooks an HSP, but it may also significantly increase the search time. If computation time is of little concern, X might be increased several points from its default value, but only a very marginal increase in sensitivity might be expected. For blastp, blastx, and tblastn, the default value of X is calculated to be the minimum integral score representing at least 10 bits of information, or a reduction in the statistical significance of the alignment by a factor of 2 to the power of 10 (or about 1,000). For blastn, the default value of X is the minimum integral score that represents at least 20 bits of information, or a reduction in the statistical significance of the alignment by a factor of 2 to the power of 20 (or about one million).

The command line parameters K and L can be used to set fixed values for the Karlin statistics' K and lambda parameters, respectively. Users should generally avoid setting these parameters unless the full ramifications of doing so are understood. As an example of one of the less obvious effects of manually choosing these parameters, the value of the H statistic reported at the end of each program's output (which is distinct from the command line parameter of the same name) is a function of lambda; and the default value for the neighborhood word-score threshold parameter T is in turn a function of H.

SCORING SCHEMES
With blastp, blastx, and tblastn, the M option can be used to select an alternate substitution matrix file.
The default PAM120 matrix is recommended for general protein similarity searches (Altschul, 1991). While only the PAM120 and the PAM250 matrices are provided, the pam(1) program can be used to produce PAM matrices of any desired generation from 2 to 511. For rigorous searches where the mutational distance between potential homologs is unknown, Altschul (1991) recommends performing three searches, one each with the PAM-40, PAM-120, and PAM-250 matrices.

In blastn, M is the score for a single-letter match; N is the score for a single-letter mismatch. M and N must be positive and negative integers, respectively. Given the assumption made by blastn that the 4 nucleotides A, C, G, and T are represented equally in the database, the expected score for the query sequence must be negative (with a uniform database composition this amounts to M + 3N < 0).

SEQUENCE LENGTH AND STATISTICAL SIGNIFICANCE
For the purpose of calculating significance levels, Y is the effective length of the query sequence and Z is the effective length of the database, both measured in residues. The default values for these parameters are the actual lengths of the query sequence and database, respectively. Larger values signify more degrees of freedom for aligning the sequences and reduced statistical significance for an alignment of any given score.

GENETIC CODES
C is a non-negative integer that determines the genetic code that will be used by blastx to translate the query sequence, or by tblastn to translate the database sequences. The default genetic code (C=0) corresponds to the so-called Standard or Universal genetic code. To obtain a listing of the nine available genetic codes and their associated numerical identifiers, invoke either blastx or tblastn with the command line parameter C=list.
The current list of genetic codes and their associated values for parameter C are:

   0   Standard or Universal
   1   Vertebrate Mitochondrial
   2   Yeast Mitochondrial
   3   Mold Mitochondrial and Mycoplasma
   4   Invertebrate Mitochondrial
   5   Ciliate Macronuclear
   6   Protozoan Mitochondrial
   7   Plant Mitochondrial
   8   Echinodermate Mitochondrial

POISSON STATISTICS
The occurrence of two or more HSPs involving the query sequence and the same database sequence is modeled as a Poisson process. An important result of applying Poisson statistics is that an HSP with a low score and high Expect value (low significance) may be discovered to be statistically significant when appearing in the context of one or more additional matches of equal or higher score against the same database sequence.

The Poisson P-value for any given HSP is a function of its expected frequency of occurrence and the number of HSPs actually observed with scores at least as high. The Poisson P-value for a group of HSP events is the probability that at least as many HSPs would occur by chance, each with a score at least as high as the lowest-scoring member of the group. HSPs which appear on opposite strands of a nucleotide query or database sequence are considered independent and distinguishable events, and so are counted separately.

Given the score of an HSP, when the expected length for an alignment with that score (see the description of H above) is a significant fraction of the length of the query sequence, the Expect value used in estimation of the Poisson P-value is reduced proportionately.

P-VALUES, ALIGNMENT SCORES, AND INFORMATION
The Expect and P-values of HSPs reported by the programs are dependent on numerous factors including: the scoring scheme employed, the residue composition of the query sequence, the assumed residue composition for a typical database sequence, and the lengths of the query sequence and database.
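The relationship between an Expect value and a Poisson P-value for a group of HSPs, as described under POISSON STATISTICS, can be sketched with the standard Poisson tail probability. This is a minimal illustration under stated assumptions, not the programs' implementation: the function name is hypothetical, the Expect value passed in is taken to be that of the lowest-scoring member of the group, and the proportional reduction of Expect for long alignments mentioned above is not modeled.

```python
import math


def poisson_p_value(expect, n_observed):
    """Probability of seeing at least n_observed chance events when the
    expected number of such events is `expect` (Poisson upper tail)."""
    if n_observed < 1:
        raise ValueError("n_observed must be at least 1")
    below = sum(
        math.exp(-expect) * expect ** i / math.factorial(i)
        for i in range(n_observed)
    )
    return 1.0 - below


if __name__ == "__main__":
    # Hypothetical Expect value of 0.5 for the lowest-scoring HSP in a group.
    for n in (1, 2, 3):
        print(n, "HSP(s):", poisson_p_value(0.5, n))
```

For a single HSP the expression reduces to 1 - exp(-E), and each additional HSP of at least the same score lowers the P-value, which is how a match of low individual significance can become significant in the company of others.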
Independent of the query and database lengths are the HSP scores themselves, which may be readily compared between different program runs even if the databases searched are of different lengths, as long as all of the other relevant factors listed here are unchanged. Further isolation from the many variables of a search in one's assessment of an HSP may be obtained by observing the information content reported (in bits) for the alignments. While the information content of an HSP may change when fundamentally different scoring schemes are used (e.g., different generations of PAM matrices), the number of bits reported for an HSP will be independent of the scales to which the matrices were generated. (In practice, this statement is not quite true because the substitution scores used by these programs are floating point or Real values which have been rounded to the nearest integers and thus lack a high degree of precision.) When communicating the statistical significance of an alignment, the alignment score itself is generally not so important as the combination of the substitution matrix employed and the actual information content of the alignment.

REGULATING OUTPUT
The output is organized into three independently regulated sections: a histogram of word-hit extension scores; one-line descriptions of the database sequences that yielded one or more HSPs; and the high-scoring segment pairs themselves. Each section of the output can be selectively suppressed by setting the parameters H, V, and B to 0 (zero).

Parameter H regulates the display of a histogram of the scores of the highest-scoring hit extensions for each database sequence. If H is assigned a non-zero value on the command line, the histogram will be displayed (except for the blastx program, which never displays a histogram but retains the H parameter for command-line compatibility with the other programs). The default value for H is 0 (no histogram).
Parameter V is the maximum number of database sequences for which one-line descriptions will be reported. The default value for V is 500. A warning message is prominently displayed at the end of the one-line descriptions section when HSPs are found in more than V sequences. When V is zero, no one-line descriptions are reported and no warning is given. Negative values for V are undefined and disallowed. As an example of how V can be used advantageously, if a high value for E is desired to virtually assure in all cases that at least one HSP will be found, selecting a small value for V will ensure that the output will not be too voluminous; only the most statistically significant matches will be reported.

Parameter B regulates the display of the high-scoring segment pairs. For positive values, B is the maximum number of database sequences for which high-scoring segment pairs will be reported. This may be much smaller than the actual number of high-scoring segment pairs reported, since any given database sequence may yield several HSPs. The default value for B is 250. Negative values for B are undefined and disallowed.

SUPPORT UTILITIES
Databases to be searched by these programs must first be processed either by the program setdb for protein sequence databases (re: blastp and blastx) or the program pressdb for nucleotide sequence databases (re: blastn and tblastn). The input database format is FASTA/Pearson.

Point accepted mutation (PAM) matrices of various generations can be produced automatically with the pam program. The output can be saved in a file whose name is then specified in the M=filename option of a blastp, blastx, or tblastn query.

BUGS
blastn uses a large value for the word length, W, and does no neighboring on these words. Consequently, the program is suitable for finding nearly identical sequences rapidly.
To identify weak amino acid similarities encoded by nucleic acid, use blastx or tblastn.

In blastp, blastx, and tblastn, ad hoc equations have not been implemented yet for calculating appropriate default values for T when W has a value other than 3 or 4.

When nucleotide sequence databases are processed into searchable form by the pressdb program, IUPAC ambiguity letters are replaced by an appropriate random selection from the list A, C, G and T. (For example, an R would be replaced on average half of the time by an A and half of the time by a G.) Similarly, blastn replaces ambiguity letters in the query sequence with appropriate random selections. Only after an HSP is found that satisfies the cutoff score are the original sequences with their ambiguities intact examined. With blastn, if a random replacement letter happened to match, the alignment score will decrease at this stage and may consequently fall below the cutoff score. With blastx and tblastn, the outcome will depend upon whether a specific amino acid can be deduced despite the presence of ambiguity codes.

tblastn uses only one genetic code to translate the entire nucleotide sequence database, although the particular genetic code employed is selectable via the parameter C.

blastn, blastx, and tblastn treat U and T residues in nucleotide sequences as the same residue (i.e., they match).

With two exceptions, any letter in the query sequence which is not a member of the relevant IUPAC amino acid or nucleotide code is stripped and does not contribute to the sequence coordinate numbers reported by the programs. The exceptions are asterisks (*) and hyphens (-) in amino acid sequences, which are interpreted as translation stops and gap characters, respectively. In protein sequence databases that are processed into searchable form by the setdb program, non-IUPAC letters, including any punctuation but excluding asterisks and hyphens, are also stripped.
The pressdb program does not strip non-IUPAC codes, but treats them similarly to Ns.

blastn does not incorporate the concept of a partial- or half-match, such as when a purine in one sequence is juxtaposed with a purine from the other. For two residues to match at all, they both must be members of the set A, C, G and T (or U).

When calculating the Poisson statistics, some HSPs may be incompatible with each other (not all of them may be simultaneously alignable without reusing some portion of either sequence) and yet they are (incorrectly) counted as independent events.

The user may note that the nucleotide composition of a blastn query sequence is irrelevant to the resulting Karlin parameters, Lambda and K. This is due to the residue composition assumed for a typical database sequence being 25% for each of the four nucleotides A, C, G, and T. The values of the Karlin parameters are still affected by the scoring scheme employed (parameters M and N). Furthermore, whoever compiles these programs is free to set a non-uniform residue composition for the database sequences, in which case the query composition does become relevant and will affect the values of the Karlin parameters that are calculated by blastn.

SEE ALSO
blast3(1).

REFERENCES
Karlin, Samuel and Stephen F. Altschul (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87:2264-2268.

Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman (1990). Basic local alignment search tool. J. Mol. Biol. 215:403-410.

Altschul, Stephen F. (1991). Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219:555-565.

Formatted: May 14, 2025