WU BLAST 2.0 Description

WU BLAST 2.0 TOPICS

Description
Key Features
Manifest
To Fly... (last updated 2001-11-14)
Comparable WU/NCBI BLAST Parameters
Examples
New Command Line Options
Filters and Masks
Bugs
Memory Requirements
Supported Platforms
Precompiled Executables (Old and Obsolete but Free)
Installation
Licensing of the Current Version 2.0 Software
Citing BLAST
Historical Notes
References

Description

Washington University BLAST (WU BLAST) version 2.0 is a powerful software package for gene and protein identification, using sensitive, selective and rapid similarity searches of protein and nucleotide sequence databases. WU BLAST 2.0 builds upon WU BLAST 1.4, which in turn was based on the public domain NCBI BLAST version 1.4 (Gish, unpublished, 1994; Altschul et al., 1990; Gish and States, 1993). While NCBI BLAST and WU BLAST 1.4 are in the public domain, WU BLAST 2.0 contains significant new features and extended capabilities, the development of which began in late 1994, at Washington University in Saint Louis. First released in May 1996, or more than a year ahead of the NCBI, WU BLAST 2.0 is the original gapped BLAST with statistics and is known for setting higher standards for sensitivity, speed, correctness and accuracy, scalability and reliability than competing programs and implementations. WU BLAST is not a re-hash of NCBI BLAST and essentially shares no code with it, except for small portions that both packages derived from ungapped NCBI BLAST 1.4.

WU BLAST has been built to be the most trusted database search tool in your software arsenal. Its unique combination of speed, accuracy, efficiency, flexibility, scalability, reliability and consistency across all supported platforms is achieved through careful software coding, the use of extensive error checks, anticipation of future needs, and superior design.

[Note: In spite of many similar or identical characteristics of the algorithms employed, WU BLAST 2.0 and NCBI Gapped BLAST are distinctly different software packages that, in ways of varying importance, carry out their work differently. Consequently, the two packages often yield different results, particularly in the areas of the default level of sensitivity, details in how the statistics are employed, and occasionally in the accuracy or completeness of the results.]

The feature list for the licensed version of WU BLAST 2.0 is large and continues to expand. Much of this is outlined below. The primary purpose of the freely available version 2.0a19 is to allow users to demonstrate for themselves the effectiveness of using gapped alignments instead of ungapped, when combined with the evaluation of the joint probability of multiple regions of similarity, using Karlin and Altschul (1993) "Sum" statistics. Not surprisingly, with version 2.0a19 one can obtain markedly improved results over version 1.4, primarily due to the introduction of gapped alignments. WU BLAST 2.0a19 executables for several UNIX platforms can be downloaded from http://blast.wustl.edu/blast/executables. The complete suite of search programs (blastp, blastn, blastx, tblastn, and tblastx) is included, as well as several support programs. Users of the freely available version 2.0a19 should keep in mind that its reliability, features, flexibility, scalability and speed are generally not comparable to the licensed version 2.0.

WU BLAST 2.0 is copyrighted and may not be sold, redistributed or modified in any form or by any means, without the express written consent of the Washington University School of Medicine in St. Louis. Other than the aforementioned restrictions, the version 2.0a19 executables posted here may be freely used for commercial, nonprofit, or academic purposes.

DISCLAIMER: THIS SOFTWARE IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND.

Key Features

Some of the key features of WU BLAST 2.0 are described below, many of which are only available in the licensed version.

Gapped alignment routines are available (and used by default) in all BLAST search modes: BLASTP, BLASTN, and TBLASTN (Altschul et al., 1990), as well as BLASTX (Gish and States, 1993) and TBLASTX (Gish, W., 1994, unpublished). Gaps can optionally be turned off in any mode if desired.
Potentially multiple regions of similarity are identified and reported for each database sequence, thus yielding increased sensitivity and selectivity. This feature is essential for finding: all exons in a multi-exon gene sequence, not just the longest or best-matching exon; all complete or partial copies of a repetitive element in a genomic sequence, not just the best matching one; and multiple, discrete domains of similarity between sequences, not just the highest-scoring one.
Karlin and Altschul (1993) "Sum statistics" are available (and used by default) in all search modes, to evaluate the joint probability of multiple regions of similarity, as described by Altschul and Gish (1996). By this technique, sets of similar regions are often found to be statistically significant that individually would be insignificant and go unreported. The combination of well-chosen heuristics and statistics in WU BLAST is often more sensitive/selective than: the full dynamic programming approach of Smith and Waterman (1981), that finds and evaluates the significance of only the highest scoring alignment with each database sequence; and other approaches or BLAST implementations that identify multiple regions of local similarity which are then evaluated individually for statistical significance rather than jointly.
Poisson statistics are available as an option to Karlin-Altschul Sum statistics in all search modes. Simpler Karlin-Altschul (1990) statistics, that do not involve joint probability calculations, are also available as an option.
Using the postsw option, a full Smith-Waterman alignment is performed on query-subject pairs of sequences that will be reported by BLASTP. The Smith-Waterman scores and alignments are combined with the initial BLAST results and redundancy is removed. This may alter the relative ranking of database matches before output. Use of this option is recommended, although it may be supplanted in the future by other option(s) or by a redefined default behavior.
The execution of WU BLAST 2.0 has been optimized such that gapped searches typically run faster than the ungapped version 1.4 programs, when using identical ungapped BLAST parameters. An exception to this rule is BLASTN (already quite fast) which runs about 10% slower with the new gapped alignment searches and default parameters. Any of the search modes may still run more slowly, however, if non-random sequence such as poly-A tracts, repetitive elements, and low-complexity regions are not filtered or masked from the query or database prior to performing the search.
The classical ungapped BLAST algorithm has not been changed in WU BLAST 2.0, thus retaining the sensitivity and control characteristics that users became accustomed to with previous versions of BLAST. When assessed at the same sensitivity level, the optimized, classical BLAST algorithm in WU BLAST 2.0 is nearly the same speed and uses much less memory than the 2-hit algorithm described by Altschul et al. (1997). For users who wish even greater speed, a 2-hit algorithm is available in a higher-performance WU implementation (see the hitdist option). While the classical 1-hit BLAST algorithm is the default in all WU BLAST search modes, the 2-hit algorithm is available as an option in all search modes, including BLASTN.
WU BLAST 2.0 is a virtual drop-in replacement for version 1.4, utilizing the same inputs and command line arguments, while producing almost the same format of output as before. A parser of version 1.4 output is only prone to breaking on version 2 output if it checks for just a few kinds of consistency in the results, such as equal lengths for the aligned query and subject sequences (which often isn't the case when gaps are introduced) or if the parser doesn't accept the hyphens that are used to represent gaps. In any case, if gapped output simply can not be tolerated by one's parser, one can still take advantage of the bug fixes, improved speed, and other features of WU BLAST 2.0 by using the compat1.4 compatibility option.
Gapped alignments in the blastn search mode are evaluated correctly, using different statistical parameter values (lambda, K and H) than those used to evaluate the significance of ungapped alignments. This is indeed the case for all WU BLAST search modes. And if the correct parameters are not available for the particular combination of gap penalties and scoring system being used, a prominent warning to this effect is displayed.
Unique to WU blastn is support for fully-specified scoring matrices, not just simple match/mismatch scoring systems. This allows (for example) transitions to be scored differently than transversions; and permits positive G-A substitution scores for the design of siRNAs (small interfering RNAs) where G-U base pairing is allowed. Scoring matrices may also be tailored to improve the design of PCR primers. Contrary to W. Miller (2001), scoring matrices were first supported in 1994, by the NCBI's ungapped BLASTN version 1.4 (Gish, W., unpublished; see http://blast.wustl.edu/blast-1.4). Support for nucleotide scoring matrices was dropped by the NCBI's blastall 2.0 program first released in 1997, but has been maintained continuously in all WU versions of the software since the migration to Washington University in 1994.
Word lengths (re: the W parameter) as short as 1 have been supported continuously by WU blastn, as are nucleotide neighborhood words, using the neighborhood word score threshold parameter, T. Using neighborhood words, nucleotide sequence similarity can be detected even in the absence of any identical residues between two sequences. Users are cautioned, however, that careless use of the T parameter can result in vast and overwhelming amounts of memory being requested by the software; T should likely be used only in conjunction with very short word lengths.
Information describing "consistent" groups of alignments (HSPs) is provided by licensed BLAST 2.0, when the topcomboN or links options are used. This facility can help with construction of distinct gene structure(s) from a barrage of alignments.
Licensed WU BLAST 2.0 supports the eXtended Database Format (XDF), a power user's dream in many ways for working with peptide and nucleotide sequences. Both the NCBI BLAST 2.0 database format and the NCBI implementation of the BLAST search algorithm are restricted to sequences under 16 Mbp in length, whereas human genome contigs exceeded 25 Mbp in the previous century (Hattori et al., 2000) and extend to several tens of megabases today. In contrast, XDF can accurately store individual sequences of up to 1 Gbp (billion bp) with ambiguity codes intact. Other BLAST software, such as the NCBI's, limits database files to 2 gigabytes, whereas WU BLAST's XDF supports databases (and database files) of virtually unlimited size -- provided of course that the underlying operating system supports these so-called "large files", which most modern operating systems do.
In support of XDF, a new database formatting tool named xdformat is provided in the WU BLAST 2.0 package. Among other distinct capabilities and advantages to using XDF and xdformat are:
- fast appends of new sequences to existing databases (both protein and nucleotide) -- no need to reformat a database completely just to add more sequence(s) to it;
- xdformat runs twice as fast as the NCBI's formatdb program, while offering a vast superset of features and greater reliability;
- safe roll-backs of database updates when file I/O (e.g., disk full errors) or parse errors are encountered;
- huge databases need not be broken into multiple volumes individually composed of several files but can be managed simply with just 3 files, regardless of the database size;
- flexible indexing of all sequence identifiers -- not just a subset of the NCBI identifiers -- including user-defined identifiers;
- index support for duplicate occurrences of the same identifier, even the identical "gi" identifiers that cause some indexing programs to abort;
- identifier indexing is supported not only when creating an XDF database but when appending new sequences to an existing XDF database; and if a database was originally created without an identifier index, one can be added later in one fast step;
- identifier indexes can be quickly re-built if necessary using different indexing policies, without having to reformat the entire database;
- intelligent retrieval of indexed sequences using a complementary program named xdget; xdget can retrieve sequences by identifier even if it is not told what name space (e.g., gi, accession, locus, user-defined, etc.) the identifier came from; for more on identifier indexing, see this;
- xdformat and xdget accept and work intelligently with identifiers that obey the International DDBJ/EBI/NCBI collaboration's Accession.Version identifier syntax (e.g., the programs know that BAA84643.2 is a newer version of BAA84643, but will retrieve BAA84643.1 if specifically requested); in addition, the parsing programs that come with WU BLAST for converting GenBank and EMBL database flat files into "FASTA" format report not only gi identifiers but also Accessions with Versions;
- greatly reduced memory requirements and search initiation times for databases containing large numbers of entries, which is particularly important when memory is in short supply or when multiple processors are standing by waiting for a single-threaded initialization phase to be completed;
- the ability to dump (or recover) the contents of an XDF database back into FASTA format with the original annotation and ambiguity codes intact;
- both the X and N ambiguity codes are supported in nucleotide sequences, thus permitting the use of alternative substitution scores for these letters and the use of PHRED/PHRAP sequence output "as is" as input to xdformat.
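The Accession.Version behavior described in the list above can be illustrated with a small Python sketch. This is not xdget's code: the in-memory `index` layout, the function names, and the placeholder records are all hypothetical, and the rule that an unversioned request returns the newest version is an assumption made for the example.

```python
def parse_identifier(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is None when the identifier has no numeric '.N' suffix.
    """
    accession, dot, version = identifier.partition(".")
    if dot and version.isdigit():
        return accession, int(version)
    return identifier, None


def lookup(index, identifier):
    """Return the record for an identifier, or None if it is not indexed.

    Assumption of this sketch: a request without a version returns the
    highest-numbered version on file.
    """
    accession, version = parse_identifier(identifier)
    versions = index.get(accession, {})
    if not versions:
        return None
    if version is None:
        version = max(versions)
    return versions.get(version)


# Hypothetical index: accession -> {version number: record}
index = {
    "BAA84643": {
        1: "<record for BAA84643.1>",
        2: "<record for BAA84643.2>",
    }
}

print(lookup(index, "BAA84643"))    # newest version on file
print(lookup(index, "BAA84643.1"))  # the specific version requested
```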
Compared to the classical BLAST 1.4 database format, XDF also provides the ability to use FASTA/Pearson format input files with unjustified (i.e., ragged or blank) input lines. With nucleotide sequence databases, there is also no longer the need to retain the original FASTA input file in order to access the ambiguity codes during a database search.
Support for XDF by the BLAST search programs does not come at the expense of backward compatibility. WU BLAST can search databases in either XDF or the classical BLAST 1.4 database formats. Furthermore, by simply installing new versions of "setdb" and "pressdb", the migration to using XDF can be performed swiftly and transparently, without making any changes whatsoever to existing database maintenance scripts. While providing this drop-in upgrade path to XDF, support for legacy databases in the BLAST 1.4 database format is retained transparently, as well: the WU BLAST search programs automatically identify the database format being used and adjust their operations accordingly. This allows users to migrate incrementally to XDF, at their own pace and as they see fit, without losing the ability to study or reproduce results obtained with older databases. Even so, users are encouraged to make the migration to XDF, as there are definite benefits to the new format, including an improved nucleotide sequence data representation and the ability to index sequences by their identifiers. More information about the simplicity of migrating to XDF is provided to licensed users of WU BLAST 2.0.
When searching very large databases, virtual memory requirements are dramatically reduced in WU BLAST 2.0, eliminating program failures that occurred when system resource limits were unexpectedly reached.
Virtual databases are supported in licensed BLAST 2.0. Virtual databases can be specified on the command line as a white space-delimited list of component database names. Virtual databases can be composed of components in either XDF or classical BLAST 1.4 format, although mixing the two formats within one virtual database is not supported. For example:
```
  blastn "pri rod mam vrt htg" myquery.nt
```
With licensed BLAST 2.0, virtually no file size limits exist for databases and other files, unless the underlying operating system does not support large files. Operating systems such as Linux (kernel version 2.2 and earlier) for 32-bit Intel computing platforms are often incapable of using files larger than 2 GB, although virtual database support (see above) helps avoid this limitation by allowing large databases to be segmented into files of a manageable size. Linux users in need of large file support should use a version 2.4.* kernel.
Licensed BLAST 2.0 supports segmented query sequences, such as the contigs that result from shotgun sequencing assembly or perhaps multiple short probes for a given gene. For example, all of the contigs from a given clone can be concatenated together with a single hyphen (-) character to delimit each contig. Segment boundaries are therefore clearly distinguishable from purely ambiguous regions of the sequence, while consuming little storage. BLAST 2.0 honors segment boundaries by guaranteeing that no alignment, be it ungapped or gapped, will cross a boundary. Support for segmented database sequences is in progress.
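As a rough illustration of the segmented-query convention, the Python sketch below joins contigs with a hyphen, recovers the segment coordinates, and checks whether a candidate alignment interval would cross a boundary. The function names and the toy contigs are hypothetical; the search programs enforce the boundary rule internally, so this only mimics the idea.

```python
def join_contigs(contigs):
    """Concatenate contigs into one segmented query, delimited by '-'."""
    return "-".join(contigs)


def segment_ranges(query):
    """Return half-open (start, end) ranges of each segment, delimiters excluded."""
    ranges = []
    start = 0
    for i, ch in enumerate(query):
        if ch == "-":
            ranges.append((start, i))
            start = i + 1
    ranges.append((start, len(query)))
    return ranges


def crosses_boundary(query, aln_start, aln_end):
    """True if the half-open query interval [aln_start, aln_end) spans a delimiter."""
    return "-" in query[aln_start:aln_end]


query = join_contigs(["ACGTACGT", "GGCC", "TTTTAAAA"])
print(query)                           # ACGTACGT-GGCC-TTTTAAAA
print(segment_ranges(query))           # [(0, 8), (9, 13), (14, 22)]
print(crosses_boundary(query, 2, 7))   # False: lies within the first contig
print(crosses_boundary(query, 6, 11))  # True: would span the first delimiter
```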
Multi-sequence query files are supported in licensed BLAST 2.0, such that every sequence in the FASTA file is searched against the specified database. Previous versions of the software only compared the first sequence in the query file against the database. Each search result is separated from the next by a single ASCII form feed character (control-L). See the new qrecmin and qrecmax options.
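Because each result is separated from the next by a form feed, the combined output of a multi-sequence search can be split back into per-query reports with very little code. The Python sketch below assumes the whole output has been read into a string; the report text shown is placeholder data, not real BLAST output.

```python
def split_reports(output_text):
    """Split concatenated search output into one report per query sequence.

    Reports are delimited by the ASCII form feed character (control-L, '\\f').
    Empty fragments are dropped.
    """
    return [report for report in output_text.split("\f") if report.strip()]


# Placeholder text standing in for three concatenated reports
combined = (
    "report for query 1\n"
    "\f"
    "report for query 2\n"
    "\f"
    "report for query 3\n"
)

reports = split_reports(combined)
print(len(reports))  # 3
```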
The format of all dates reported in BLAST output is controlled by the UN*X standard CFTIME environment variable, so that (for example) dates can be reported in ISO 8601 standard format. Date strings produced by the xdformat program are also governed by CFTIME.
Both sequence filtering and word masking of query sequences are supported. The terms "filter" and "mask" are sometimes used alone and interchangeably; however, there are two distinct techniques people can use, which deserve separate names. Lower case alphabetic letters in the query sequence can be used to inform the BLAST search program as to which residues it should either filter (convert to X or N) or mask (skip when generating neighborhood words but otherwise leave intact). See the lcfilter and lcmask options, respectively.
Multiple filter=<filter> specifications can be provided on the BLAST command line. Each of the filters is executed independently and their results are OR-ed at the end.
Whereas NCBI BLAST 2.0 uses the original external filtering technique of BLAST 1.3 (Gish, W., unpublished), which utilizes the UNIX popen() system call and temporary files, WU BLAST 2.0 avoids these problematic system interfaces.
One or more word masks can be specified on the command line, using the "wordmask=<mask>" option, where <mask> may be a classical filter program such as seg, xnu, or dust. Whereas sequence filters convert certain letters in the query sequence into ambiguity codes (X for amino acid and N for nucleotide), word masks do not alter the sequence. Word masks instead cause the indicated portion(s) of the query sequence to be skipped during BLAST neighborhood word generation. This leaves the query sequence intact for generating alignments that are seeded by word hits arising in flanking, unmasked regions of the sequence.
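The difference between filtering and word masking can be made concrete with a small Python sketch that uses lower-case letters to flag residues, in the spirit of lcfilter and lcmask. This is an illustration only: it lists plain query words rather than neighborhood words, and the rule that a word is skipped when it overlaps any masked residue is an assumption of the sketch, not a statement about the programs' internals.

```python
def lc_filter(query, ambiguity="N"):
    """Filtering: replace lower-case residues with an ambiguity code.

    Use 'N' for nucleotide queries and 'X' for protein queries.
    """
    return "".join(ambiguity if ch.islower() else ch for ch in query)


def lc_mask_words(query, w):
    """Word masking: leave the sequence intact, but skip words that
    overlap a lower-case residue when listing words of length w.

    Returns (position, word) pairs for the words that remain.
    """
    words = []
    for i in range(len(query) - w + 1):
        word = query[i:i + w]
        if not any(ch.islower() for ch in word):
            words.append((i, word))
    return words


query = "ACGTaaaaGGCT"
print(lc_filter(query))         # ACGTNNNNGGCT (sequence altered)
print(lc_mask_words(query, 3))  # [(0, 'ACG'), (1, 'CGT'), (8, 'GGC'), (9, 'GCT')]
print(query)                    # ACGTaaaaGGCT (masking leaves the sequence intact)
```

With masking, an alignment seeded by a word hit in the flanking upper-case region is still free to extend through the lower-case letters, because those residues were never replaced.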
The BLAST algorithm's word length parameter, W, can be set from 1 to 1024 in all search modes (BLASTP, BLASTN, BLASTX, TBLASTN, TBLASTX).
WU BLAST 2.0 reliably supports parallel processing on a variety of SMP (symmetric multiprocessing) computing platforms. WU BLAST is the only BLAST that threads properly across multiple CPUs on dual-processor Apple PowerMacs running MacOS X and does not require a G4 processor. POSIX threads are used under Compaq Tru64 UNIX 4.0+, Linux for X86 and Alpha processors, IBM AIX, MacOS X, IRIX 6.5, and HP/UX 11. While POSIX threads are available under Solaris 2+ (SPARC and X86), Solaris threads are specifically used instead for slightly better performance. The IRIX m_fork() system call provides parallel processing under older versions of IRIX 4 through 6.4; and DCE threads are used under Digital UNIX 3.2.
To illustrate just some of the flexibility available in WU BLAST 2.0, the licensed package includes a PERL script named wu-blastall that translates an NCBI blastall command line into a roughly equivalent WU BLAST command line and then invokes the appropriate WU BLAST search mode. The output remains in WU BLAST format, but the wu-blastall script may help users of the NCBI blastall program migrate to WU BLAST and start to discover its power.
NOTE: the wu-blastall script is not at all intended to provide a literal replacement for the NCBI blastall program and is not an appropriate method for assessing the relative performance (sensitivity, specificity, accuracy or speed) of the NCBI and WU packages.
The MaskerAid substitute for CrossMatch (Phil Green, unpublished) provides another example of how the unique combination of flexibility and speed of WU BLAST 2.0 can be applied to yield 30-fold faster performance of RepeatMasker in its slow mode, while maintaining sensitivity. MaskerAid is licensed separately from WU BLAST.
Many additional command line options are available, some of which are described below. Others are described in the README.html file that accompanies the licensed software. For users' convenience, the licensed package includes compiled versions of the nrdb program, as well.

A reverse chronological list of changes is available in the HISTORY file; however, this file is outdated with respect to licensed versions of WU BLAST 2.0. A reader of that file might also get the unfortunate impression that WU BLAST 2.0 is unreliable, when in fact the licensed version has shown itself to be very robust. Furthermore, any bugs that have been found have typically been fixed within 24 hours of their being reported. For current HISTORY information, licensed users should consult the HISTORY file that accompanies the licensed software distribution.

Please send bug reports, questions, or suggestions to

Manifest

The licensed BLAST 2.0 package includes the following data analysis and utility programs:

blasta - the unified database search program, which provides blastp, blastn, blastx, tblastn, and tblastx search functionality.
xdformat - the recommended program for rapidly converting sequences from FASTA format into the native XDF format read by blasta. The program can also append new sequences to an existing database; automatically rollback on errors; provides flexible indexing and verification services; and can dump data back into FASTA format.
xdget - a flexible tool for retrieving sequences (or segments thereof) from an indexed XDF database; retrieved sequences are optionally reverse-complemented and translated in the case of nucleotide sequences. xdformat and xdget are actually one-and-the-same program, to ensure their compatibility.
nrdb - a tool for rapidly removing trivial redundancy (i.e., duplicate sequences) from one or more input files in FASTA format. A simple hash table is used, combined with data compression techniques to allow larger nucleotide sequence data sets to be manipulated in memory.
patdb - a tool for rapidly removing trivial redundancy, as well as identifying perfect substrings, from one or more input files in FASTA format. A Patricia Tree is used, combined with a Finite State Automaton. This tool is perhaps most useful when applied to protein sequences, which often differ in their inclusion of the initiator methionine or other post-translational modifications. Patdb can also be more practically applied to protein sequences than to nucleotide sequences, because the data compression techniques of the nrdb program, which are so effective with nucleotide sequences, are not employed by patdb.
wu-blastall - a PERL script for converting an NCBI blastall command line into a roughly equivalent blasta command line and then invoking blasta. The output is still in WU BLAST format. This is primarily intended as a technology demonstration tool but may also assist users in their migration from NCBI BLAST to the more accurate WU BLAST. For benchmarking of BLASTs, careful tweaking of parameters may be required, but even with great care, benchmarking for speed can still be confounded by inaccuracies in NCBI BLAST.
wu-formatdb - a PERL script for converting an NCBI formatdb command line into the equivalent xdformat command line and then invoking xdformat. This is primarily intended as a technology demonstration tool but may also assist users in their migration from NCBI BLAST to WU BLAST.
pam - a program to compute amino acid substitution scoring matrices having arbitrary scales, using the Dayhoff PAM model.
pressdb.real - the legacy pressdb program for users who are reliant on the NCBI BLAST 1.4 database format for nucleotide sequences.
setdb.real - the legacy setdb program for users who are reliant on the NCBI BLAST 1.4 database format for amino acid sequences.
gb2fasta - a parser to extract nucleotide sequences from GenBank flat files into FASTA format.
gt2fasta - a parser to extract amino acid sequences from CDS features in GenBank flat files and output them in FASTA format.
sp2fasta - a parser to extract protein or nucleotide sequences from EMBL, TrEMBL, or SWISS-PROT database files and output them in FASTA format.
pir2fasta - a parser to extract protein sequences from NBRF PIR database files and output them in FASTA format.
seg - a low-complexity filter for protein and nucleotide sequences (Wootton and Federhen, 1993; Wootton and Federhen, 1996). The program identifies low compositional complexity regions.
dust - a low-complexity filter for nucleotide sequences (Hancock and Armstrong, 1994; Tatusov and Lipman, unpublished).
xnu - a low-complexity filter for protein sequences (Claverie and States, 1993). The program identifies short-periodicity repeats.
sysblast - a sample configuration file that system administrators may wish to modify and install as /etc/sysblast. Parameter settings in this file can be used to: limit the number of threads employed by each BLAST process; change the default number of threads employed per process; and alter the "nice" value for BLAST processes.
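The kind of trivial-redundancy removal attributed to nrdb in the list above can be sketched in Python with an ordinary dictionary acting as the hash table. This is not nrdb's implementation: it omits the data compression, and the case-insensitive comparison and the " | " separator used to join description lines are assumptions made for the example.

```python
def parse_fasta(text):
    """Yield (defline, sequence) pairs from FASTA-formatted text.

    Sequence lines may be ragged; they are simply concatenated.
    """
    defline, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if defline is not None:
                yield defline, "".join(chunks)
            defline, chunks = line[1:], []
        elif line and defline is not None:
            chunks.append(line)
    if defline is not None:
        yield defline, "".join(chunks)


def remove_duplicates(records):
    """Collapse records with identical sequences, keeping every defline."""
    merged = {}  # hash table: sequence -> list of deflines, in input order
    for defline, seq in records:
        merged.setdefault(seq.upper(), []).append(defline)
    return [(" | ".join(deflines), seq) for seq, deflines in merged.items()]


fasta = (
    ">seq1 first copy\nMKTAYIAK\nQRQISFVK\n"
    ">seq2 unique\nMKVLAAGI\n"
    ">seq3 second copy\nMKTAYIAKQRQISFVK\n"
)

for defline, seq in remove_duplicates(parse_fasta(fasta)):
    print(">" + defline)
    print(seq)
```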

To Fly...

If the gapped alignments are nice, but even more speed or less memory use is desired, read how to make the programs fly.

Examples

Here are some sample WU BLAST 2.0 results produced using generally default parameters, plus the oft-recommended low-complexity filter seg and the -postsw option of WU BLASTP 2.0. Exceptions to the defaults are noted and their corresponding results provided, as well.
Default parameters for NCBI blastall were also used, with the exception of using -G7 -E2 to make the scoring system identical to the WU default (penalty of 9 for the first residue in a gap).
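
The equivalence claimed above can be checked with a little arithmetic. In the Python sketch below, the WU convention follows the Q and R option descriptions (Q is the penalty for a gap of length one, R the penalty for each additional residue). The NCBI convention, an opening cost G plus E for every gapped residue, is an assumption inferred from the statement that -G7 -E2 yields a penalty of 9 for the first residue in a gap. The function names are hypothetical.

```python
def wu_gap_cost(length, q=9, r=2):
    """WU convention: Q for the first gapped residue, R for each additional one."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return q + r * (length - 1)


def ncbi_gap_cost(length, g=7, e=2):
    """Assumed NCBI blastall convention: G to open a gap, plus E per residue in it."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return g + e * length


for n in (1, 2, 5):
    print(n, wu_gap_cost(n), ncbi_gap_cost(n))
# 1 9 9
# 2 11 11
# 5 17 17
```

Under these assumptions, Q=9 R=2 and -G7 -E2 assign the same cost to a gap of any length, since 9 + 2(n - 1) = 7 + 2n.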

Example 1 produced with and without gaps; with gaps but without Karlin-Altschul "Sum" statistics; and with gaps and Karlin-Altschul "Sum" statistics but without the -postsw option.
Example 1 NCBI blastall output.
Example 2 produced with and without gaps; and with gaps but without Karlin-Altschul "Sum" statistics.
Example 2 NCBI blastall output.
Example 3 produced with and without gaps.
Example 3 NCBI blastall output.

New Command Line Options

Command line options for WU BLAST version 1.4 often apply to version 2.0 without change. (See the version 1.4 manual page in Adobe Acrobat (PDF) format.) New command line options in version 2.0 include the following. Some of these options are not available in the freely available alpha releases of WU BLAST 2.0. Terse program usage can also be obtained by entering one of the program names on the command line without any arguments.
Note: parsing of command line options is alphabetic case-independent.

Option Description

Q=<q> set the penalty for a gap of length one to q (default Q=9 for proteins; Q=10 for BLASTN)

R=<r> set the per-residue penalty for extending a gap to r (default R=2 for proteins; R=10 for BLASTN)

H=<h> Set the value for the relative entropy to be used in Karlin-Altschul statistics of ungapped alignment scores. In earlier versions of BLAST, the H option was used to invoke the display of a histogram.

postsw perform full Smith-Waterman alignment of sequences and re-rank the database matches accordingly, prior to output (currently supported in BLASTP only)

hitdist=<hitdist> invoke a 2-hit BLAST algorithm similar to that of Altschul et al. (1997), with the maximum distance between word hits of <hitdist>. Altschul et al. (1997) use the equivalent of hitdist=40 in the BLASTP, BLASTX, TBLASTN and TBLASTX search modes. In WU BLASTN, setting hitdist=W and wink=W, where W is the word length, is akin to using double-length words generated on W-mer boundaries.
NOTE: in protein-level comparisons, for best sensitivity (or the best sensitivity for the amount of memory used), 2-hit BLAST should generally be avoided.
This option is only available in the licensed 2.0 software.

wink=<wink> generate word hits at every winkth ("W increment") position along the query, where the default wink=1 produces neighborhood words at every position. For best sensitivity, this option (setting wink greater than 1) should not be used. Wink is best used to find identical or nearly identical sequences rapidly. When used in conjunction with the hitdist option to obtain the highest search speed, care should be taken that desirable alignments are not precluded by these parameters. This option is only available in the licensed 2.0 software.
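
As a rough picture of what wink does, the Python sketch below lists the query positions at which words of length W would be generated for a given increment. The function name is hypothetical, and starting at offset 0 is an assumption of the sketch.

```python
def word_starts(query_len, w, wink=1):
    """Return the query offsets at which words of length w are generated
    when stepping along the query in increments of wink."""
    if w < 1 or wink < 1:
        raise ValueError("w and wink must be positive")
    return list(range(0, query_len - w + 1, wink))


print(len(word_starts(20, 4)))     # 17: a word at every position (wink=1)
print(word_starts(20, 4, wink=4))  # [0, 4, 8, 12, 16]: words on W-mer boundaries
```

With wink=W the words tile the query without overlapping, which reduces the number of words roughly W-fold and is why alignments that lack a hit at one of the sampled positions can be missed.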

wordmask=<masker> mask letters in the query sequence without altering the sequence itself, during neighborhood word generation.

lcfilter filter lower case letters in the query sequence, by replacing lower case letters with the appropriate ambiguity code (N for nucleotide sequences, X for protein sequences).

lcmask mask lower case letters in the query sequence without altering the sequence, during neighborhood word generation.

maskextra=<extra> word-mask an additional <extra> letters on each side flanking an already-masked region. This helps avoid the appearance of spurious alignments through low-complexity regions initiated by chance word hits immediately adjacent to masked regions.
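
The effect of maskextra on a set of masked regions can be sketched in Python as interval widening. The half-open interval representation and the merging of regions that come to overlap are choices made for this example, not a description of the programs' internals.

```python
def extend_mask(intervals, extra, seq_len):
    """Widen each masked half-open interval (start, end) by `extra` letters
    on both sides, clip to the sequence, and merge intervals that now
    overlap or touch."""
    widened = sorted(
        (max(0, start - extra), min(seq_len, end + extra))
        for start, end in intervals
    )
    merged = []
    for start, end in widened:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


masked = [(10, 20), (24, 30), (95, 100)]
print(extend_mask(masked, 3, 100))  # [(7, 33), (92, 100)]
```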

nogaps do not create gapped alignments, in essence reverting to WU BLAST 1.4 behavior

pingpong Perform additional work to help ensure the alignments produced are locally optimal. This option typically adds 3-10% to the execution time, without affecting the results. Only rarely is an alignment and its associated score improved, for all the work involved.

gapall effectively generate a gapped alignment for every ungapped HSP found. This is the default behavior.
See also: gapE.

gapE=<gapE> generate gapped alignments for all HSPs between sequences whose expected frequency of chance occurrence is less than or equal to <gapE>. Default value is gapE=infinity, i.e., gapall is in effect.

gapW=<gapW> set the window width (or band width) within which gapped alignments are generated (default is gapW=32 for protein comparisons, gapW=16 for BLASTN).

noseqs produces greatly abbreviated output that omits sequence alignments and yet may be interpreted correctly by existing parsers.

hspmax=<hspmax> establishes <hspmax> as the maximum number of ungapped HSPs that will be saved per subject sequence or pairwise sequence comparison. Saved HSPs are then fed to the gapped alignment phase of the program or are statistically evaluated if gapped alignments are not to be performed. If more than <hspmax> HSPs are found, only the best-scoring HSPs are retained for subsequent processing.
The default value is 1000; a value of 0 implies no limit.
See also: gspmax and spoutmax.
NOTE: this usage of hspmax is subtly, but importantly, different from the parameter's classical interpretation, wherein all ungapped HSPs that satisfied the S2 score threshold were saved and <hspmax> merely limited the number of HSPs (gapped or ungapped) that would be reported. The new interpretation was instituted to provide vastly improved speed on large problems, while imparting no effect on small problems and many medium-sized problems. The new behavior can help guard against horrendously slow searches resulting from an inadvertent omission of a low-complexity filter. Adverse effects on sensitivity may result, however, if every HSP is sacred. To restore classical behavior, specify hspmax=0. As a compromise between sensitivity and speed, set a higher value than the default.
NOTE: the B and V options limit the number of subject sequences for which any results whatsoever are reported, regardless of the number of HSPs or GSPs found in each case.

gspmax=<gspmax> establishes <gspmax> as the maximum number of GSPs (gapped HSPs) to report per subject sequence or pairwise sequence comparison. If more than <gspmax> GSPs are found, only the best-scoring GSPs are retained for subsequent processing and reporting. The setting of gspmax will have no effect, if the nogaps option is specified or if the setting of hspmax is more restrictive.
The default value is 1000; a value of 0 implies no limit.
See also: hspmax and spoutmax.
NOTE: the B and V options limit the number of subject sequences for which any results whatsoever are reported, regardless of the number of HSPs or GSPs found.

spoutmax=<spoutmax> establishes <spoutmax> as the maximum number of segment pairs to report in program output per subject sequence or pairwise comparison, however many HSPs or GSPs were actually found and evaluated. If more than <spoutmax> segment pairs are found, the segment pairs are sorted by the criteria in effect for the search and only the first <spoutmax> segment pairs are reported. The setting of spoutmax will have no effect if either <hspmax> or <gspmax> is more restrictive.
The default value is 1000; a value of 0 implies no limit.
See also: hspmax and gspmax.
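
The capping behavior shared by hspmax, gspmax and spoutmax can be pictured with a short Python sketch. It is illustrative only and is not the programs' implementation: the function name keep_best and the representation of segment pairs as (score, label) tuples are assumptions made for this example, and ordering by score alone stands in for whatever sort criteria are in effect for a search.

```python
import heapq

def keep_best(segment_pairs, limit=1000):
    """Keep at most `limit` best-scoring segment pairs (0 means no limit).

    `segment_pairs` is a list of (score, label) tuples, a hypothetical
    representation chosen for this sketch.
    """
    if limit == 0:
        # No limit: keep everything, best score first.
        return sorted(segment_pairs, key=lambda sp: sp[0], reverse=True)
    # nlargest returns the retained items in descending score order.
    return heapq.nlargest(limit, segment_pairs, key=lambda sp: sp[0])

hsps = [(52, "a"), (310, "b"), (87, "c"), (145, "d")]
saved = keep_best(hsps, limit=3)      # like hspmax=3: the lowest-scoring HSP is dropped
reported = keep_best(saved, limit=2)  # like spoutmax=2: only two segment pairs are reported
```

Because each stage only sees what the previous stage retained, a later limit that is larger than an earlier one has no effect, which matches the remarks above about the more restrictive setting winning.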

compat1.4 produces BLAST version 1.4-style output (no gaps), but with bug fixes and performance enhancements in place.

kap use Karlin-Altschul (1990) statistics on individual alignment scores (i.e., do not evaluate the joint probability of multiple scores, such as with Poisson or Karlin-Altschul (1993) "Sum" statistics).

restest causes statistical significance estimates to depend upon the size of the database, as determined by the total number of residues it contains. Restest is the default method for determining the database size in the blastn, tblastn, and tblastx search modes.
See seqtest.

seqtest causes statistical significance estimates to depend upon the size of the database, as determined by the number of sequences it contains. Seqtest is the default method for determining the database size in the blastp and blastx search modes. For backward compatibility with legacy BLAST software -- in all search modes, including blastp and blastx -- if the Z option is specified, Z is expected to be expressed in units of residues, unless the seqtest option is also specified.
See restest.

links display consistent link information for each alignment, indicating all of the "consistent" alignments used in joint statistical significance calculations.

topcomboN=<n> report at most n "topcombo" groups of consistent (colinear) local alignments (HSPs). Each local alignment is allowed to be a member of only one group. Use of this option causes the addition of a "Group = #" indicator in the output for each HSP. Groups of HSPs tend to be assembled in decreasing order of statistical significance. Members of the most significant group thus tend to be reported with "Group = 1". See also: topcomboE.

topcomboE=<E_ratio> E_ratio is the maximum ratio of E_current/E_best for which the current "topcombo" group of consistent (colinear) local alignments will be reported for a given database sequence. The "best" group is reported in the output as "Group = 1" and tends to be the most statistically significant. The default behavior is to impose no limit on this ratio, in which case all topcombo groups satisfying E are reported (up to a maximum of topcomboN). See also: topcomboN.
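
The interaction of topcomboN and topcomboE can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the programs' code: the function name select_topcombo_groups is hypothetical, each group is reduced to a single E-value, and using None to mean "no limit" is a convention of this sketch only.

```python
def select_topcombo_groups(group_evalues, topcombo_n=None, topcombo_e=None):
    """Return the E-values of the groups that would be reported.

    group_evalues: E-values of the candidate groups for one database sequence
    topcombo_n:    maximum number of groups to report (None = no limit here)
    topcombo_e:    maximum allowed ratio E_current / E_best (None = no limit)
    """
    ordered = sorted(group_evalues)  # most significant (smallest E) first
    if not ordered:
        return []
    e_best = ordered[0]
    kept = []
    for e_current in ordered:
        if topcombo_n is not None and len(kept) >= topcombo_n:
            break
        # The ratio test is skipped when e_best is 0 to avoid dividing by zero
        # (a simplification made for this sketch).
        if topcombo_e is not None and e_best > 0 and e_current / e_best > topcombo_e:
            break
        kept.append(e_current)
    return kept
```

For example, with group E-values of 1e-30, 1e-28 and 1e-10 and a ratio limit of 1000, the third group is excluded because its E-value is far more than 1000 times that of the best group.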

olmax=<len> maximum permitted length of overlap (in residues), len, of two ungapped alignments for their joint (Sum or Poisson) probability to be computed. The default is unlimited length, with the maximum extent of overlap being governed only by the olfraction parameter.

golmax=<len> maximum permitted length of overlap (in residues), len, of two gapped alignments for their joint (Sum or Poisson) probability to be computed. The default is unlimited length, with the maximum extent of overlap being governed only by the golfraction parameter.

hspsepqmax maximum distance allowed along the query sequence between two "consistent" HSPs. (Useful when the query is genomic with relatively short intragenic regions).

hspsepsmax maximum distance allowed along the subject (database) sequence between two consistent HSPs. (Useful when the database contains genomic sequences with relatively short intragenic regions).

gapsepqmax maximum distance allowed along the query sequence between two consistent gapped alignments. (Useful when the query is genomic with relatively short intragenic regions).

gapsepsmax maximum distance allowed along the subject sequence between two consistent gapped alignments. (Useful when the database contains genomic sequences with relatively short intragenic regions).

gapK=<k> set the value of the Karlin-Altschul statistics' K parameter to use when evaluating the significance of gapped alignment scores. Useful when precomputed values are unavailable in the internal tables for the chosen scoring matrix and gap penalty combination.

gapL=<l> set the value of the Karlin-Altschul statistics' lambda parameter to use when evaluating the significance of gapped alignment scores.

gapH=<h> set the value of the Karlin-Altschul statistics' H parameter to use when evaluating the significance of gapped alignment scores.

dbchunks=<nchunks> establishes the granularity of the database, as it is divided into slices for assignment to individual threads, to make more efficient use of all CPUs when multiple CPUs are employed for a given search. Higher values are appropriate when the database contains relatively few sequences and/or when the sequences vary greatly in length, composition or content (e.g., genomic contigs). Lower values are appropriate when the database contains many sequences of comparable length (e.g., the EST division of GenBank). The minimum assignable value is the number of threads employed, but this setting is ill-advised; the optimal value for any given search type is likely to be a large multiple of the number of threads employed (although it need not be an exact multiple). When searching mammalian genomic contigs, a good value may be 1000. The default value is 500.

qrecmin=<m> in a multi-sequence query file, start database searches using the query sequence numbered m. (The first record is numbered 1).

qrecmax=<n> in a multi-sequence query file, end database searches with the query sequence numbered n.
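
The 1-based, inclusive record numbering used by qrecmin and qrecmax can be illustrated with a small Python sketch. The function name select_query_records is hypothetical, and the sketch only mimics the selection of records from a multi-sequence FASTA file; it does not run any search.

```python
def select_query_records(fasta_text, qrecmin=1, qrecmax=None):
    """Return the FASTA records numbered qrecmin..qrecmax (1-based, inclusive)."""
    records = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            records.append([line])        # a header line starts a new record
        elif records:
            records[-1].append(line)      # sequence lines belong to the current record
    last = len(records) if qrecmax is None else qrecmax
    return ["\n".join(rec) for rec in records[qrecmin - 1:last]]

queries = ">q1\nACGT\n>q2\nGGCC\nTTAA\n>q3\nCCCC\n"
subset = select_query_records(queries, qrecmin=2, qrecmax=3)
```

Here subset holds the second and third records, so a long query file can be split into ranges and processed in separate runs.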

putenv="NAME=VALUE" set the environment variable named NAME to the value VALUE in the environment local to the BLAST search program.

endputenv for security in WWW server installations, where the command line may sometimes be left open to users, ignore any subsequent putenv options found on the command line during left-to-right parsing.

getenv="NAME" display the value of the environment variable named NAME. This may be useful for verifying that the settings of environment variables on a web server or in an analysis pipeline have been propagated all the way to the BLAST search program.

endgetenv ignore any subsequent getenv options found on the command line during left-to-right parsing.
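
The left-to-right semantics of putenv, endputenv, getenv and endgetenv can be sketched in Python. This is an illustration of the rule described above rather than the programs' parser: the function name parse_env_options is hypothetical, and the arguments are assumed to arrive as the shell would deliver them, with the surrounding quotes already removed.

```python
def parse_env_options(args):
    """Return (environment settings, names to display) after left-to-right parsing."""
    env, to_display = {}, []
    putenv_open = getenv_open = True
    for arg in args:
        if arg == "endputenv":
            putenv_open = False            # later putenv options are ignored
        elif arg == "endgetenv":
            getenv_open = False            # later getenv options are ignored
        elif arg.startswith("putenv="):
            if putenv_open:
                name, _, value = arg[len("putenv="):].partition("=")
                env[name] = value
        elif arg.startswith("getenv="):
            if getenv_open:
                to_display.append(arg[len("getenv="):])
    return env, to_display

# A server-side wrapper places its own settings first, then endputenv,
# so that a user-supplied putenv further right cannot override them.
settings, shown = parse_env_options(
    ["putenv=BLASTDB=/data/db", "endputenv", "putenv=BLASTDB=/tmp/other", "getenv=BLASTDB"]
)
```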

cdb search nucleotide sequence databases in their compressed form. This option is only effective in the BLASTN search mode for word lengths > 6. See ucdb.

ucdb search nucleotide sequence databases in their uncompressed form, with any-and-all ambiguity codes in place. This option may be used to increase sensitivity in the presence of ambiguity codes, at the expense of memory and possibly speed. This is the standard behavior for word lengths < 7, and is not recommended for use with the default or longer word lengths, particularly for longer sequences, due to the increased memory requirements; when comparing long sequences, however, if sufficient memory is available, use of this option can yield a significant increase in speed. This option offers improved sensitivity when searching databases in XDF format that contain ambiguity codes. The option is accepted by the software but offers no improvement in sensitivity for databases in the earlier BLAST 1.4 database format. (BLASTN search mode only).

mmio turn off the use of memory-mapped I/O when reading database files. Use of this option will usually retard the search, particularly when multiple processors are being used, but it serves both to demonstrate the effectiveness of this form of I/O and to validate the I/O routines. Note that no special daemon or support programs (such as the old memfile program) are required to take full advantage of memory-mapped I/O.

Environment Variables

In WU BLAST 2.0, the BLASTDB environment variable can be a list of one or more directory names in which the programs are to look for database files. (In UNIX parlance, such an environment variable might be called a path for the database files). Multiple directory names should be separated from one another by a colon (":"). If the BLASTDB environment variable is not set, the programs use a default path of ".:/usr/ncbi/blast/db", such that the programs first look in the current working directory (".") for the requested database; then they will look in the "/usr/ncbi/blast/db" directory. For backward compatibility with programs that expect BLASTDB to be a single directory specification and not a path, if the user has set a value for BLASTDB but omitted the current working directory, the version 2 programs will still look for database files in the current working directory as a last resort.
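
The search order described above can be modeled with a short Python sketch. It is a model of the documented rules only, not the programs' code; the function names database_search_dirs and find_database_file are hypothetical.

```python
import os

DEFAULT_BLASTDB = ".:/usr/ncbi/blast/db"  # default path quoted in the text above

def database_search_dirs(environ=None):
    """Return the directories to search for database files, in order."""
    environ = os.environ if environ is None else environ
    value = environ.get("BLASTDB")
    if value is None:
        return DEFAULT_BLASTDB.split(":")
    dirs = [d for d in value.split(":") if d]
    if "." not in dirs:
        dirs.append(".")  # current working directory as a last resort
    return dirs

def find_database_file(filename, environ=None):
    """Return the first existing path for `filename`, or None if not found."""
    for directory in database_search_dirs(environ):
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None
```

With BLASTDB set to "/a:/b", the sketch searches /a, then /b, then the current working directory, mirroring the backward-compatibility behavior for a BLASTDB value that omits ".".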

The BLASTFILTER environment variable can be set to the directory containing the filter programs, such as seg and xnu. The default directory for the filter programs is /usr/ncbi/blast/filter. This usage is unchanged from version 1.4.

The BLASTMAT environment variable can be set to the parent directory for all scoring matrix files. The default directory for these files is /usr/ncbi/blast/matrix, beneath which are nt and aa subdirectories for storing scoring matrix files appropriate for nucleotide and amino acid alphabets. This usage is unchanged from version 1.4.

For more information about environment variables, see the Installation instructions.

Filters and Masks

WU BLAST provides a highly flexible means of applying both "hard" and "soft" masks to a query sequence, supporting alternative, user-defined filter programs, as well as non-standard parameters to the standard filters. The filter (for hard masking) and wordmask (for soft masking) command line options provide the basic interface. Multiple specifications of each type are acceptable on the BLAST command line; and individual filter and wordmask specifications may consist of entire pipelines of commands.

For example, three filters are used in succession by this pipeline:

      filter="myfilter1 | myfilter2 | myfilter3 -x5 -"

The first two filters in this case are expecting to read their input from UN*X standard input (also known as stdin), whereas myfilter3 apparently needs to be told (with the usual "-" or hyphen argument) to read data from stdin. The standard output (stdout) from myfilter1 will be read via stdin by myfilter2, which in turn processes the query before handing its results to myfilter3; finally, myfilter3 reports its results to stdout, which the BLAST program itself reads to obtain the fully masked sequence. The final output from the filter pipeline is expected by the BLAST program to be in FASTA format.

Instead of running all 3 filters in the above example as part of one pipeline, they could instead be specified as separate filter options like this:

    filter=myfilter1  filter=myfilter2  filter="myfilter3 -x5 -"

The same choice of running as a pipeline or running separately is available for wordmasks, too. And of course the two approaches can be combined on the same command line. An advantage to using the pipeline approach is that all 3 filters in the example above may complete a little bit faster, because much of the I/O is avoided. Furthermore, when used in the pipeline, there's no requirement that the output from myfilter1 and myfilter2 actually be in FASTA format. Those two programs could potentially pass any information between themselves and to myfilter3. The only absolute requirement is that myfilter1 must read FASTA data from stdin and myfilter3 must output FASTA data (of the same length as the query!) to stdout.
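
To make the contract for a user-defined filter concrete, here is a minimal Python sketch of such a program: it accepts a FASTA record and returns a FASTA record whose sequence has the same length. Everything specific in it is an assumption made for illustration, including the masking rule (runs of six or more identical residues), the use of "X" as the mask character, the 60-column line wrapping, and the function names.

```python
import re
import sys

def mask_runs(sequence, min_run=6, mask_char="X"):
    """Replace every run of >= min_run identical residues with mask_char."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), sequence)

def filter_fasta(text, min_run=6, mask_char="X"):
    """Mask a single-sequence FASTA record, preserving header and length."""
    lines = text.splitlines()
    header = lines[0]
    sequence = "".join(lines[1:])
    masked = mask_runs(sequence, min_run, mask_char)
    if len(masked) != len(sequence):
        raise ValueError("masked sequence must have the same length as the query")
    wrapped = [masked[i:i + 60] for i in range(0, len(masked), 60)]
    return "\n".join([header] + wrapped) + "\n"

def run(infile=sys.stdin, outfile=sys.stdout):
    """Read FASTA from stdin and write the masked FASTA to stdout."""
    outfile.write(filter_fasta(infile.read()))
```

Calling run() at the end of the script would turn it into a stand-alone filter that could occupy any position in a pipeline, provided the programs on either side of it also exchange FASTA data.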

It should be noted that with some filter programs, passing the query sequence sequentially through a pipeline of filters may yield a different result than processing the query independently with each filter and OR-ing the results. The script seg+xnu included in the filter/ directory provides an example with which to test this. Specifying filter=seg+xnu on the BLAST command line invokes a seg and xnu pipeline that is built into the search programs; whereas specifying filter="seg+xnu -" causes the seg+xnu script to be invoked on the query, which independently executes seg and xnu, then ORs the separate results with pmerge. (The echofilter option can be used to see the results of filtering displayed in search program output). While the built-in seg+xnu pipeline is historically the way these two filters have been implemented, the latter interpretation, as illustrated by the seg+xnu script with pmerge, may be more desirable.
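
The idea of OR-ing independently produced masks can be sketched in Python. This is a conceptual illustration only and makes no claim about how pmerge is implemented: the function name or_masks is hypothetical, and any position where a filtered copy differs from the original query is treated as masked.

```python
def or_masks(original, *masked_versions, mask_char="X"):
    """Combine independently masked copies of `original` by OR-ing their masks."""
    for masked in masked_versions:
        if len(masked) != len(original):
            raise ValueError("masked sequences must match the query length")
    return "".join(
        mask_char if any(m[i] != original[i] for m in masked_versions) else residue
        for i, residue in enumerate(original)
    )

query = "ACDEFGHIKL"
from_filter_a = "XXDEFGHIKL"   # hypothetical output of one filter run on the query
from_filter_b = "ACDEFGHXXL"   # hypothetical output of another filter run on the query
combined = or_masks(query, from_filter_a, from_filter_b)
```

In a sequential pipeline, by contrast, the second filter sees the first filter's masked output rather than the original query, which is why the two approaches can disagree.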

Bugs

The following list describes bugs that are known to exist in the WU BLAST 2.0a19 binaries posted here. These are all fixed in the licensed version 2.0 of WU BLAST, in which there are no known bugs. If you are a user of the licensed version and believe you see a bug, please send a report. Even users of the licensed version should read the second set of potential problem areas or pitfalls listed below this initial bug list.

Due to a 2 GB file size limit associated with many 32-bit computing platforms (e.g., Solaris 2.5 and earlier, IRIX 5 and earlier, and many versions of Linux on X86), users will be unable to search nucleotide sequence databases where the accompanying FASTA file is larger than 2 GB. "Loss-less" compression programs like nrdb can help keep the data size below the 2 GB limit (while speeding up searches proportionately), but this is merely a stop-gap solution, as the public and private data sets are growing at exponential rates. A comprehensive set of GenBank nucleotide sequences in FASTA format currently (01-Oct-2001) exceeds 16 GB. Switching to a 64-bit computing platform -- or one which provides "largefile" support, such as Solaris 2.6+ under SPARC or Intel processors -- would be sufficient to break the 2 GB barrier for these BLAST searches, but additional problems in 2.0a19 still preclude its use beyond 2 GB.
The low-complexity sequence -filter option causes fatal errors in the 2.0a19 alpha version of blastn.
Use of the -consistency or -poissonp options (or both) results in severely truncated output and exposes a major memory leak.
The search programs blastn, tblastn, and tblastx may crash due to a segmentation fault when a (nucleotide) database sequence contains one or more ambiguity codes. For blastn, a word length, W, less than 11 is also required to elicit the bug. This bug may appear sporadically and may disappear/reappear when using the identical query and command line parameters, if the database has been reprocessed with pressdb. The chance of encountering the bug increases with: shorter neighborhood word lengths; lower values for the ungapped alignment score cutoff, S2; and longer queries. As a stop-gap measure, the bug can be avoided entirely by making the FASTA-format database file inaccessible to the search programs after processing by pressdb; however, any ambiguity codes in the database sequences will not appear in the search results and their absence during the search may yield incorrect alignment scores.
Sometimes crashes occur when only the top or bottom strand of the query sequence is being compared.
Occasional floating point exceptions may arise.
If the search program is interrupted during the preliminary filtering of the query sequence, a temporary file (usually created in the /var/tmp or /tmp directory) used in the filtering process is not removed. Over time, these files can accumulate, fill up the associated disk partition, and thus make it impossible for any further searches involving filtering to proceed. Other programs which use these partitions for temporary files may have their operation impaired as well; the "vi" editor is one such program.

The above mentioned bugs are not applicable to the licensed version of WU BLAST 2.0, but the licensed version does have some characteristics worth mentioning that could trip up or confuse even the most knowledgeable of BLAST users. Any unexpected behavior might be construed as a bug, so the following information is provided to help avoid the unexpected. If you should encounter problems or confusing areas other than those described below, or if you have questions or suggestions, please send them to

The statistical significance of gapped alignment scores is computed using values for lambda, K and H looked up in precomputed tables. (The values for lambda, K and H used to assess the significance of ungapped alignment scores are still computed at run time, as is practical). Values are chosen from the tables based on the scoring matrix and gap penalties being used. Precomputed values are not available for all scoring matrix and gap penalty combinations, however; and the precomputed values may not be well-suited to an unusual residue composition of the query sequence. When precomputed values are unavailable, the programs issue a WARNING and proceed to evaluate gapped alignment scores using values of lambda, K and H that are known to be incorrect: the values computed for ungapped alignments. In such cases, the reported significance estimates may be highly inaccurate and will be biased towards being overly significant. If the user knows more accurate values for their situation, the gapK, gapL and gapH command line options should be used to set them.

Precomputed values for lambda, K and H are available for BLASTN searches with the following match,mismatch (M,N) scoring systems, using gap penalties {Q,R}:

    "+1,-3", {3,3} {3,2} {3,1}
    "+1,-2", {2,2} {2,1} {1,1}
    "+3,-5", {10,5} {6,3} {5,5}
    "+4,-5", {10,5}
    "+1,-1", {3,1} {2,1}
    "+5,-4", {20,10} {10,10}
    "+5,-11", {22,22} {22,11} {12,2} {11,11}

pupy = 
	{ 20, 10}
	{ 10, 10}

Precomputed values for lambda, K and H are available for protein-level searches with the following scoring matrix and gap penalty combinations (or gap penalty ranges for R) {Q, R}:

blosum50 =
    { 16,  1-4}
    { 15,  1-4,6,8}
    { 14,  1-5,8}
    { 13,  1-5,8}
    { 12,  2-5,7}
    { 11,  2-4,6,8}
    { 10,  2-6,8}
    { 9,  3-5,7}
    { 8,  4-8}
    { 7,  6,7}

blosum55 =
    { 16,  1-4}
    { 15,  1-4,5,6,8}
    { 14,  1-5,7}
    { 13,  2-5,8}
    { 12,  2-5,8}
    { 11,  2-6,8}
    { 10,  3-6,9}
    { 9,  3-5,7}
    { 8,  4-8}
    { 7,  7}

blosum62 =
    { 12,  1-3}
    { 11,  1-3}
    { 10,  1-4}
    { 9,  1-5}
    { 8,  2-7}
    { 7,  2-6}
    { 6,  3-5}
    { 5,  5}

blosum80 =
    { 12, 2-12}
    { 11, 2-11}
    { 10, 2-10}
    {  9,  3-9}
    {  8,  4-8}
    {  7,  5-7}

pam40 =
    { 12,  1,2,6}
    { 11,  1,2,7}
    { 10,  1-3,7}
    { 9,  1-3,6}
    { 8,  1-4}
    { 7,  1-4}
    { 6,  2-5}
    { 5,  2-5}
    { 4,  3,4}

pam120 =
    { 12,  1,2,4}
    { 11,  1-3}
    { 10,  1-3,5}
    { 9,  1-3,5}
    { 8,  1-4,6}
    { 7,  2-4,6}
    { 6,  2-5}
    { 5,  3-5}

pam250 =
    { 16,  1-4}
    { 15,  1-5}
    { 14,  1-6}
    { 13,  1-6}
    { 12,  2-7}
    { 11,  2-7}
    { 10,  3-8}
    { 9,  3-7}
    { 8,  5-7}
    { 7,  7}
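
The tables above lend themselves to a quick check, before launching a search, of whether a given matrix and gap penalty combination has precomputed values. The Python sketch below transcribes only the blosum62 and blosum80 entries as printed above; the names PRECOMPUTED, expand_ranges and has_precomputed are hypothetical, and the sketch is a reading aid for the tables rather than part of the software.

```python
# {matrix: {Q: permitted values of R}}, copied from the tables above.
PRECOMPUTED = {
    "blosum62": {12: "1-3", 11: "1-3", 10: "1-4", 9: "1-5",
                 8: "2-7", 7: "2-6", 6: "3-5", 5: "5"},
    "blosum80": {12: "2-12", 11: "2-11", 10: "2-10",
                 9: "3-9", 8: "4-8", 7: "5-7"},
}

def expand_ranges(spec):
    """Expand a specification such as '1-4,6,8' into the set {1, 2, 3, 4, 6, 8}."""
    values = set()
    for part in spec.split(","):
        low, _, high = part.partition("-")
        values.update(range(int(low), int(high or low) + 1))
    return values

def has_precomputed(matrix, q, r):
    """True if the tables list lambda, K and H for this matrix and {Q, R}."""
    spec = PRECOMPUTED.get(matrix.lower(), {}).get(q)
    return spec is not None and r in expand_ranges(spec)
```

A combination for which has_precomputed returns False corresponds to the situation in which the programs issue a WARNING and the gapK, gapL and gapH options become relevant.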

Selecting an alternative scoring matrix does not alter the gap penalties (Q and R) from their default values. This can not only result in alignments with undesirable gap characteristics, but depending on the scoring matrix chosen, this can unwittingly create a situation in which the programs do not have precomputed values for lambda, K and H. As described earlier, a WARNING message will be displayed when precomputed values are not available; nevertheless, the statistics will be unreliable.
The hspsepqmax, gapsepqmax, etc. parameters are measures of distance in residues along the sequences in the specific form in which they are being compared. For instance, in a BLASTX search (conceptually translated nt. query sequence compared against a protein sequence database), hspsepqmax refers to a distance measured in amino acid residues, not the underlying nucleotides in the query.
In blastn output, if an alignment contains no gaps, the mid-line displayed between the aligned query and subject sequences will contain vertical bars ("|") to indicate identities; but if there are one or more gaps in the alignment, the midline will contain nucleotide residue codes to indicate positions of identity. This was originally intended to be a feature, enabling gapped and ungapped alignments to be readily discerned by eye, but it has caused some confusion and, therefore, was removed from the licensed version.
The gap penalty parameters Q and R of WU BLAST are similar to the parameters G and E of NCBI Gapped BLAST, but differ from them in an important way. While the two extension penalties R (WU BLAST) and E (NCBI BLAST) are analogous, Q (WU BLAST) is analogous to the sum of G and E with NCBI BLAST. In other words, where Q is the total penalty for a gap of length 1, NCBI Gapped BLAST computes this penalty as G + E.
ASN.1 formatted output is broken in all releases of WU BLAST 2.0.
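
The relationship between the WU BLAST gap penalties {Q, R} and the NCBI parameters G and E noted in the list above can be written out as a short Python sketch. The function names are hypothetical, and the cost functions assume that each gap position beyond the first costs R (WU BLAST) or that every gap position costs E on top of G (NCBI), consistent with Q = G + E for a gap of length 1.

```python
def ncbi_to_wu(g, e):
    """Convert NCBI gap penalties (G, E) to WU BLAST (Q, R)."""
    return g + e, e

def wu_to_ncbi(q, r):
    """Convert WU BLAST gap penalties (Q, R) to NCBI (G, E)."""
    return q - r, r

def wu_gap_cost(length, q, r):
    # Q covers the first gap position; each further position is assumed to cost R.
    return q + r * (length - 1)

def ncbi_gap_cost(length, g, e):
    # G is charged once per gap; every gap position is assumed to cost E.
    return g + e * length
```

For instance, NCBI penalties G=11, E=1 correspond to Q=12, R=1 under this mapping, and the two cost functions then agree for gaps of any length.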

Supported Platforms

The computing platforms currently supported by BLAST 2.0 (licensed version only) include the following:

Apple MacOS X 10.1 and 10.2 ("Jaguar") for both PowerPC G3 and G4
Compaq Tru64 UNIX V4.0F for Alpha (upwardly compatible with Tru64 V5.X)
FreeBSD 4.5 for Intel i686 (PentiumPro/II/III)
Hewlett-Packard HP-UX 11 for HP PA-RISC and Intel IA-64 (Itanium)
IBM AIX 5.1 for Power3 and Power4
Linux kernel version 2.2 for i586 (original Pentium) and i686
Linux kernel version 2.4 for i686, i786 (Pentium4), and IA-64
SGI IRIX 6.5 for MIPS R5000, R10000 and R12000
Sun Solaris 8 for SPARC and Solaris 9 for Intel i686

The list of supported platforms is subject to change without notice.
Multiple processors (multithreading or parallel processing) are effectively and efficiently supported by WU BLAST on all of the above platforms. WU BLAST 2.0 also supports large files (files greater than 2 GB in size) when the underlying operating system and file system support large files.

Under MacOS X, WU BLAST is the only BLAST that runs faster on multiple G4 processors. Other BLAST implementations either don't use multiple G4 processors or -- as is the case with Apple's modified version of NCBI BLAST -- actually run slower when two processors are used. Unlike other BLASTs, WU BLAST won't crash or hang your system when the use of multiple CPUs is attempted and it yields the most accurate results. WU BLAST also does not utilize any G4-specific instructions for peak performance, so you can even run it on an iBook. You don't have to scrap your old G3 in order to run the fastest BLAST -- WU BLAST runs on the same hardware that MacOS X runs on -- and it has done so since even before MacOS X was publicly released -- but it will run faster on a G4 and again up to twice as fast on a dual G4.

Hewlett-Packard HP-UX and IBM AIX operating systems may need to be patched for error-free support of large files over NFS. If a large file/NFS problem does exist with your HP or IBM system, it should immediately reveal itself when an attempt is made to search a large-file database over NFS: the search will simply fail to run and the application will exit non-zero. If necessary, simply contact your vendor for the patch. Both companies promptly addressed this issue over a year ago, in the first half of 2001.

Please note: while WU BLAST version 2.0a19 binaries dated February 1998 are freely available here for some platforms, newer, full-featured binaries for the above platforms are only available upon licensing. Besides containing several bugs, WU BLAST 2.0a19 lacks some of the more prominent features supported by the licensed version, including:

support for databases of virtually unlimited size -- without splitting -- as well as the other unique and powerful features of the XDF database format, such as complete identifier indexing and rapid appending;
support for so-called virtual databases comprised of multiple real databases that are combined at run time;
significantly improved speed;
64-bit virtual addressing (on supporting platforms);
more efficient 32-bit virtual addressing on many 64-bit platforms;
two-hit BLAST option in all search modes;
many of the new command line options described above, which improve the speed, sensitivity and selectivity of the software;
uniform support for large files (>2 GB) by all programs in the suite, for any and all input and output files.

Installation

Users of the licensed version of BLAST 2.0 should refer to the README.html file that accompanies the software distribution for more relevant instructions. While similar in nature, the following information is specifically for users of version 2.0a19.

To install WU BLAST 2.0, the first step is to download the UNIX tar archive of executables appropriate for your computing platform from here. Scoring matrix files are included in each package, but sequence complexity filters are not. (Several common complexity filters are however included with the licensed version of WU BLAST 2.0). It is advised that the archive be unpacked in a new, empty directory.

The executable programs from the tar archive may be placed in any directory listed in users' PATH environment variable, whether this means adding the new directory to their PATH or moving the executables into an existing directory already listed in their PATH.

If you already had BLAST 1.4 installed (with blastable databases), the installation steps for WU BLAST 2.0 are now complete. If you do not have BLAST 1.4 installed, read on...

Unpacking the tar archive creates a matrix/ subdirectory containing scoring matrix files. Wherever this directory ultimately resides, the BLASTMAT environment variable should be set to point there. In the absence of this environment variable being set, the programs look for scoring matrix files in /usr/ncbi/blast/matrix.

Low-complexity sequence filters or masking programs -- e.g., seg, xnu and dust -- are now included in the licensed software packages. The filter programs are not required for running the search programs, but they can enormously reduce the amount of garbage output, memory use, and search time. Hence, it is highly recommended that these programs be made available to users. You will need to build (compile and link) the programs from source code posted off the WU BLAST Archives home page. Whatever directory you install the filter programs in, the BLASTFILTER environment variable should be set to point there. In the absence of this environment variable being set, the programs look for masking programs in /usr/ncbi/blast/filter. Note: unlike the latest NCBI BLAST search programs, the WU BLAST search programs do not employ sequence filtering by default.

Of course the databases themselves are missing from the tar archives, too! Once the databases have been downloaded from any of many sources on the Internet, the database files are typically uncompressed and processed into FASTA format. Included in the tar archives are several utility programs for converting textual database files:

gb2fasta converts the nucleotide sequences in GenBank flat files into FASTA format.
gt2fasta converts the CDS translations (peptide sequences) in GenBank flat files into FASTA format.
sp2fasta converts EMBL or SWISS-PROT flat files into FASTA format.

The NCBI software Toolbox also contains parsers, including asn2fast, a program that converts both nucleotide and peptide sequences in GenBank ASN.1 format into FASTA format files.

All of the above parsers can read from standard input (sometimes signified by a single dash, "-"), so their input files can be maintained on disk in compressed format and dynamically zcat-ed or gunzip-ed directly into the parsers, thus saving the time and storage required for the uncompressed data. Because a dash is often used to signify the start of each command line option, if a dash is needed to specify standard input for the required input filename argument, some of these programs require that a double-dash (--) be specified before the single-dash. This double-dash signifies the end of the command line options and the start of the required arguments.

Once the databases are in FASTA format, the setdb and pressdb programs are used to convert them into blastable format. Simple usage instructions for these programs can be obtained by invoking them without command line arguments. When producing a blastable database, each program creates 3 output files whose names are derived from the name of the input FASTA-format file. The 3 output files are given distinct filename extensions and together comprise the blastable database. For nucleotide sequences containing ambiguity codes (e.g., ESTs which often contain many Ns), the FASTA file will be referenced later (if still accessible) by the search programs, to obtain ambiguity codes for matching sequences that contain such codes. More information about the blastable database file formats is available here.

The blastable database files can be placed anywhere, but the BLASTDB environment variable should point to their directory location. If the BLASTDB environment variable is not set, the programs look for their databases in /usr/ncbi/blast/db and in the current working directory. If the search programs are to find them, nucleotide sequence FASTA files must be located in the same directory as the blastable databases. Sometimes it is more convenient to maintain the FASTA files in a separate directory on another disk partition, with UNIX soft links in the BLASTDB directory pointing to FASTA files stored elsewhere. In addition, on systems where NCBI BLAST will not be in use, blastable databases can be maintained in multiple directories listed in the BLASTDB environment variable, delimiting the directory names with colons just as directory names are delimited in the PATH environment variable.

On multi-processor computer systems, the search programs will by default employ as many CPUs as are installed (up to 4 CPUs in the case of BLASTN, unless more are requested), but this may make inefficient use of the computer when more than about 4 CPUs are used. Depending on how many processors are in your box, you may want to wrap the search programs in a shell script that sets a lower number of CPUs via the cpus=# (or the deprecated P=#) command line option. Another approach to changing the default number of CPUs follows below, for BLAST managers possessing "root" or "SuperUser" privileges.

With licensed distributions of WU BLAST, a sample file named sysblast is provided to help with establishing certain system-wide configuration parameters that govern the behavior of BLAST processes. When installed under the name /etc/sysblast, the default number of CPUs employed can be altered; a hard limit can be imposed on the number of CPUs employable by any single BLAST process; and the "nice" value of BLAST processes can be set. /etc/sysblast resides in a directory that is local to any given computer, so parameter values can be configured differently and as needed for each computer, even if the software itself is maintained on a shared disk partition. The sysblast file is only effective if installed in the /etc directory and the /etc directory should only be writable by "root". See the comments included in the sample sysblast file for further details. Unlike the shell script wrapper approach described above, the limits set in /etc/sysblast can not be circumvented.

For further information, the outdated manual page for the BLAST version 1.4 (ungapped) search programs is still sometimes useful for its description of procedures and parameters that have not changed.

Licensing

Site licenses for the full-featured BLAST 2.0 are available free for academic and nonprofit use; commercial licenses are available from Washington University for a fee. Academic and nonprofit licenses are typically arranged through the institutions' respective offices of technology transfer. Upon obtaining written permission from Washington University, licensees are welcome to install the software for public BLAST services. Washington University seeks additional licensees for commercial development and marketing and invites interested parties to submit proposals.

Please address all e-mail requests for licensing information and limited evaluation copies to Be sure to include the name and address of your company or institution and the name and e-mail address of your lab head (if not you). Washington University typically negotiates site licenses for BLAST, so only one license needs to be executed per institution. If your institution already has a license, you will be informed of this upon inquiring, and your lab head will be provided with download instructions. If a license for your institution does not exist, you may be provided with a draft of the license agreement, which will need to be signed by authorized representatives of both institutions. If you do not receive a response within the next business day, please re-send your message and indicate it is a repeat request. Please note that responses during holiday periods may be slower than usual.

Citing BLAST

Citations or acknowledgements of WU BLAST usage are greatly appreciated, as are any personal accounts of how the software is being used that you might wish to share. When URLs are acceptable, please cite with:

   Gish, W. (1996-2003) http://blast.wustl.edu

When URLs are not acceptable, please use:

   Gish, W., personal communication.

The WU BLAST search program may also be referred to by the name BLASTA. I know of no other program (BLAST-related or otherwise) going by this name.

In scientific communications, it is typically important to report the program name, as well as the specific version used. In the case of WU BLAST or BLASTA, the version is a combination of the "2.0" moniker and the release date. The release date can be found on the first line of output, and it is the first date displayed. For example, consider this introductory line of output:

  BLASTN 2.0MP-WashU [02-Apr-2002] [sol8-ultra-ILP32F64 2002-04-03T01:25:46]

The software release date is April 2, 2002, whereas the compilation or build date of the Solaris 8 binary was April 3rd at 1:25 AM.
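
When many reports have to be processed, the release date can be pulled out of the banner programmatically. The sketch below is based only on the single example line shown above; the regular expression, the function name `parse_banner`, and the field names are assumptions, and other builds may format the banner differently.

```python
import re
from datetime import date, datetime

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Pattern inferred from the one example banner; treat it as a guess.
BANNER_RE = re.compile(
    r"^\s*(?P<program>\S+)\s+(?P<version>\S+)\s+"
    r"\[(?P<day>\d{2})-(?P<mon>[A-Z][a-z]{2})-(?P<year>\d{4})\]\s+"
    r"\[(?P<platform>\S+)\s+(?P<built>[^\]\s]+)\]"
)


def parse_banner(line):
    """Split a banner line into program, version, release and build info."""
    m = BANNER_RE.match(line)
    if m is None or m.group("mon") not in MONTHS:
        raise ValueError("unrecognized banner line: %r" % line)
    return {
        "program": m.group("program"),
        "version": m.group("version"),
        # The first date on the line is the software release date.
        "release": date(int(m.group("year")),
                        MONTHS.index(m.group("mon")) + 1,
                        int(m.group("day"))),
        "platform": m.group("platform"),
        # The second timestamp is when the binary was built.
        "built": datetime.fromisoformat(m.group("built")),
    }


info = parse_banner(
    "  BLASTN 2.0MP-WashU [02-Apr-2002] "
    "[sol8-ultra-ILP32F64 2002-04-03T01:25:46]"
)
print(info["program"], info["version"], info["release"].isoformat())
```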

Historical Notes

Historical notes and additional citation information for some earlier versions of NCBI and WU BLAST include:

The first description of the classical ungapped BLAST algorithm was published by Altschul et al. (1990). This paper focuses on BLASTP and BLASTN, and makes mention of TBLASTN. (It is perhaps of interest that TBLASTN hadn't been written at the time the manuscript was submitted, but it was available by the time of publication.)
The NCBI Experimental BLAST Network Service was opened to the public in December 1989, providing Internet access to the latest, parallelized search programs and sequence databases updated on a daily basis. Around the same time, the "nr" (quasi-non-redundant) databases were established (W. Gish, unpublished). The experimental service was ultimately discontinued more than a decade later in March 2000. At the request of NCBI upper management, a report on the experimental service was never published and remains W. Gish (unpublished). Awareness of the service spread by word-of-mouth, much as is the case with WU BLAST.
BLASTX first appeared in BLAST 1.1 in July 1990, and was later described by Gish and States (1993). The BLAST3 program (Altschul and Lipman, 1990) was also folded into this release and parallelized. The use of Poisson statistics, as suggested by Karlin and Altschul (1990) to evaluate the joint probability of multiple HSPs, was also first featured in BLAST 1.1.
BLASTC, a version of BLASTX that considered codon usage information in addition to sequence similarity (States and Gish, 1994), only appeared in the BLAST 1.3 distribution. The BLAST 1.3 distribution was also the last to include the BLAST3 program.
The first version of BLAST to use Karlin and Altschul (1993) "Sum" statistics to evaluate the joint probability of multiple HSPs was BLAST 1.4 (W. Gish, unpublished).
TBLASTX first appeared in BLAST version 1.4 and remains attributable to W. Gish (unpublished).
The first release of WU BLAST was version 1.4, which was virtually identical to NCBI BLAST 1.4, save for a few bug fixes. The WU BLAST Archives (http://blast.wustl.edu) first appeared on the Internet in 1995, to provide continued support for the work begun at the NCBI, as well as to provide a central location where BLAST-related software, information, and earlier software versions could be obtained.
Starting in late 1994, Stephen Altschul and I engaged in a collaboration to provide support for my conjecture that fixed lambda, K and H values, along with Sum statistics, could be practically applied to the evaluation of locally optimal gapped alignment scores. This work eventually appeared in Altschul and Gish (1996) and provides much of the foundation for today's WU BLAST 2.0 and NCBI blastall.
The first complete implementation of gapped BLAST (BLASTP, BLASTN, BLASTX, TBLASTN and TBLASTX) with statistical significance estimates (both Poisson and Sum) was publicly released as WU BLAST version 2.0d1 (W. Gish, unpublished), in time for presentation at the Cold Spring Harbor Genome Mapping and Sequencing conference in May 1996.
The NCBI published its BLAST version 2, or Gapped BLAST, including a description of the 2-hit BLAST and PSI-BLAST algorithms, in Altschul et al. (1997), in September 1997.
The NCBI published a description of PHI-BLAST in Zhang et al. (1998).

References

Altschul, SF, and W Gish (1996). Local alignment statistics. ed. R. Doolittle. Methods in Enzymology 266:460-80.

Altschul, SF, and DJ Lipman (1990). Protein database searches for multiple alignments. Proc. Natl. Acad. Sci. USA 87:5509-13.

Altschul, SF, Gish, W, Miller, W, Myers, EW, and DJ Lipman (1990). Basic local alignment search tool. J. Mol. Biol. 215:403-10.

Altschul, SF, Madden, TL, Schaffer, AA, Zhang, J, Zhang, Z, Miller, W, and DJ Lipman (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17):3389-402.

Claverie, JM, and DJ States (1993). Information enhancement methods for large scale sequence analysis. Computers in Chemistry 17:191-201.

Gish, W, and DJ States (1993). Identification of protein coding regions by database similarity search. Nature Genetics 3:266-72.

Hancock, JM, and JS Armstrong (1994). SIMPLE34: an improved and enhanced implementation for VAX and Sun computers of the SIMPLE algorithm for analysis of clustered repetitive motifs in nucleotide sequences. Comput. Appl. Biosci. 10:67-70.

Karlin, S, and SF Altschul (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87:2264-8.

Karlin, S, and SF Altschul (1993). Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. USA 90:5873-7.

Smith, TF, and MS Waterman (1981). Identification of common molecular subsequences. J. Mol. Biol. 147:195-7.

States, DJ, and W Gish (1994). Combined use of sequence similarity and codon bias for coding region identification. J. Comp. Biol. 1:39-50.

Wootton, JC, and S Federhen (1993). Statistics of local complexity in amino acid sequences and sequence databases. Computers in Chemistry 17:149-63.

Wootton, JC, and S Federhen (1996). Analysis of compositionally biased regions in sequence databases. ed. R. Doolittle. Methods in Enzymology 266:554-71.

Zhang, Z, Schaffer, AA, Miller, W, Madden, TL, Lipman, DJ, Koonin, EV, and SF Altschul (1998). Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res. 26:3986-90.


Option	Description
Q=<q>	set the penalty for a gap of length one to q (default Q=9 for proteins; Q=10 for `BLASTN`)
R=<r>	set the per-residue penalty for extending a gap to r (default R=2 for proteins; R=10 for `BLASTN`)
H=<h>	Set the value for the relative entropy to be used in Karlin-Altschul statistics of ungapped alignment scores. In earlier versions of BLAST, the H option was used to invoke the display of a histogram.
postsw	perform full Smith-Waterman alignment of sequences and re-rank the database matches accordingly, prior to output (currently supported in `BLASTP` only)
hitdist=<hitdist>	invoke a 2-hit BLAST algorithm similar to that of Altschul et al. (1997), with the maximum distance between word hits of <hitdist>. Altschul et al. (1997) use the equivalent of hitdist=40 in the `BLASTP`, `BLASTX`, `TBLASTN` and `TBLASTX` search modes. In WU BLASTN, setting hitdist=W and wink=W, where W is the word length, is akin to using double-length words generated on W-mer boundaries. NOTE: in protein-level comparisons, for best sensitivity (or the best sensitivity for the amount of memory used), 2-hit BLAST should generally be avoided. This option is only available in the licensed 2.0 software.
wink=<wink>	generate word hits at every winkth ("W increment") position along the query, where the default wink=1 produces neighborhood words at every position. For best sensitivity, this option (setting wink greater than 1) should not be used. Wink is best used to find identical or nearly identical sequences rapidly. When used in conjunction with the hitdist option to obtain the highest search speed, care should be taken that desirable alignments are not precluded by these parameters. This option is only available in the licensed 2.0 software.
wordmask=<masker>	mask letters in the query sequence without altering the sequence itself, during neighborhood word generation.
lcfilter	filter lower case letters in the query sequence, by replacing lower case letters with the appropriate ambiguity code (N for nucleotide sequences, X for protein sequences).
lcmask	mask lower case letters in the query sequence without altering the sequence, during neighborhood word generation.
maskextra=<extra>	word-mask an additional extra letters on each side flanking an already-masked region. This helps avoid the appearance of spurious alignments through low-complexity regions initiated by chance word hits immediately adjacent to masked regions.
nogaps	do not create gapped alignments, in essence reverting to WU BLAST 1.4 behavior
pingpong	Perform additional work to help ensure the alignments produced are locally optimal. This option typically adds 3-10% to the execution time, without affecting the results. Only rarely is an alignment and its associated score improved, for all the work involved.
gapall	effectively generate a gapped alignment for every ungapped HSP found. This is the default behavior. See also: gapE.
gapE=<gapE>	generate gapped alignments for all HSPs between sequences whose expected frequency of chance occurrence is less than or equal to <gapE>. Default value is gapE=infinity, i.e., gapall is in effect.
gapW=<gapW>	set the window width (or band width) within which gapped alignments are generated (default is gapW=32 for protein comparisons, gapW=16 for `BLASTN`).
noseqs	produces greatly abbreviated output that omits sequence alignments and yet may be interpreted correctly by existing parsers.
hspmax=<hspmax>	establishes <hspmax> as the maximum number of ungapped HSPs that will be saved per subject sequence or pairwise sequence comparison. Saved HSPs are then fed to the gapped alignment phase of the program or are statistically evaluated if gapped alignments are not to be performed. If more than <hspmax> HSPs are found, only the best-scoring HSPs are retained for subsequent processing. The default value is 1000; a value of 0 implies no limit. See also: gspmax and spoutmax. NOTE: this usage of hspmax is subtly, but importantly, different from the parameter's classical interpretation, wherein all ungapped HSPs that satisfied the S2 score threshold were saved and <hspmax> merely limited the number of HSPs (gapped or ungapped) that would be reported. The new interpretation was instituted to provide vastly improved speed on large problems, while imparting no effect on small problems and many medium-sized problems. The new behavior can help guard against horrendously slow searches resulting from an inadvertent omission of a low-complexity filter. Adverse effects on sensitivity may result, however, if every HSP is sacred. To restore classical behavior, specify hspmax=0. As a compromise between sensitivity and speed, set a higher value than the default. NOTE: the B and V options limit the number of subject sequences for which any results whatsoever are reported, regardless of the number of HSPs or GSPs found in each case.
gspmax=<gspmax>	establishes <gspmax> as the maximum number of GSPs (gapped HSPs) to report per subject sequence or pairwise sequence comparison. If more than <gspmax> GSPs are found, only the best-scoring GSPs are retained for subsequent processing and reporting. The setting of gspmax will have no effect, if the nogaps option is specified or if the setting of hspmax is more restrictive. The default value is 1000; a value of 0 implies no limit. See also: hspmax and spoutmax. NOTE: the B and V options limit the number of subject sequences for which any results whatsoever are reported, regardless of the number of HSPs or GSPs found.
spoutmax=<spoutmax>	establishes <spoutmax> as the maximum number of segment pairs to report in program output per subject sequence or pairwise comparison, however many HSPs or GSPs were actually found and evaluated. If more than <spoutmax> segment pairs are found, the segment pairs are sorted by the criteria in effect for the search and only the first <spoutmax> segment pairs are reported. The setting of spoutmax will have no effect if either <hspmax> or <gspmax> is more restrictive. The default value is 1000; a value of 0 implies no limit. See also: hspmax and gspmax.
compat1.4	produces BLAST version 1.4-style output (no gaps), but with bug fixes and performance enhancements in place.
kap	use Karlin-Altschul (1990) statistics on individual alignment scores (i.e., do not evaluate the joint probability of multiple scores, such as with Poisson or Karlin-Altschul (1993) "Sum" statistics).
restest	causes statistical significance estimates to depend upon the size of the database, as determined by the total number of residues it contains. Restest is the default method for determining the database size in the blastn, tblastn, and tblastx search modes. See seqtest.
seqtest	causes statistical significance estimates to depend upon the size of the database, as determined by the number of sequences it contains. Seqtest is the default method for determining the database size in the blastp and blastx search modes. For backward compatibility with legacy BLAST software -- in all search modes, including blastp and blastx -- if the Z option is specified, Z is expected to be expressed in units of residues, unless the seqtest option is also specified. See restest.
links	display consistent link information for each alignment, indicating all of the "consistent" alignments used in joint statistical significance calculations.
topcomboN=<n>	report at most n "topcombo" groups of consistent (colinear) local alignments (HSPs). Each local alignment is allowed to be a member of only one group. Use of this option causes the addition of a "Group = #" indicator in the output for each HSP. Groups of HSPs tend to be assembled in decreasing order of statistical significance. Members of the most significant group thus tend to be reported with "Group = 1". See also: topcomboE.
topcomboE=<E_ratio>	E_ratio is the maximum ratio of E_current/E_best for which the current "topcombo" group of consistent (colinear) local alignments will be reported for a given database sequence. The "best" group is reported in the output as "Group = 1" and tends to be the most statistically significant. The default behavior is to impose no limit on this ratio, in which case all topcombo groups satisfying E are reported (up to a maximum of topcomboN). See also: topcomboN.
olmax=<len>	maximum permitted length of overlap (in residues), len, of two ungapped alignments for their joint (Sum or Poisson) probability to be computed. The default is unlimited length, with the maximum extent of overlap being governed only by the olfraction parameter.
golmax=<len>	maximum permitted length of overlap (in residues), len, of two gapped alignments for their joint (Sum or Poisson) probability to be computed. The default is unlimited length, with the maximum extent of overlap being governed only by the golfraction parameter.
hspsepqmax	maximum distance allowed along the query sequence between two "consistent" HSPs. (Useful when the query is genomic with relatively short intragenic regions).
hspsepsmax	maximum distance allowed along the subject (database) sequence between two consistent HSPs. (Useful when the database contains genomic sequences with relatively short intragenic regions).
gapsepqmax	maximum distance allowed along the query sequence between two consistent gapped alignments. (Useful when the query is genomic with relatively short intragenic regions).
gapsepsmax	maximum distance allowed along the subject sequence between two consistent gapped alignments. (Useful when the database contains genomic sequences with relatively short intragenic regions).
gapK=<k>	set the value of the Karlin-Altschul statistics' K parameter to use when evaluating the significance of gapped alignment scores. Useful when precomputed values are unavailable in the internal tables for the chosen scoring matrix and gap penalty combination.
gapL=<l>	set the value of the Karlin-Altschul statistics' lambda parameter to use when evaluating the significance of gapped alignment scores
gapH=<h>	set the value of the Karlin-Altschul statistics' H parameter to use when evaluating the significance of gapped alignment scores
dbchunks=<nchunks>	establishes the granularity of the database, as it is divided into slices for assignment to individual threads, to make more efficient use of all CPUs when multiple CPUs are employed for a given search. Higher values are appropriate when the database contains relatively few sequences and/or when the sequences vary greatly in length, composition or content (e.g., genomic contigs). Lower values are appropriate when the database contains many sequences of comparable length (e.g., the EST division of GenBank). The minimum assignable value is the number of threads employed, but this setting is ill-advised; the optimal value for any given search type is likely to be a large multiple of the number of threads employed (although it need not be an exact multiple). When searching mammalian genomic contigs, a good value may be 1000. The default value is 500.
qrecmin=<m>	in a multi-sequence query file, start database searches using the query sequence numbered m. (The first record is numbered 1).
qrecmax=<n>	in a multi-sequence query file, end database searches with the query sequence numbered n.
putenv="NAME=VALUE"	in the local environment to the BLAST search program, set the environment variable named NAME to the value VALUE.
endputenv	for security in WWW server installations, where the command line may sometimes be left open to users, ignore any subsequent putenv options found on the command line during left-to-right parsing.
getenv="NAME"	display the value of the environment variable named NAME. This may be useful for verifying that the settings of environment variables on a web server or in an analysis pipeline have been propagated all the way to the BLAST search program.
endgetenv	ignore any subsequent getenv options found on the command line during left-to-right parsing.
cdb	search nucleotide sequence databases in their compressed form. This option is only effective in the `BLASTN` search mode for word lengths > 6. See ucdb.
ucdb	search nucleotide sequence databases in their uncompressed form, with any-and-all ambiguity codes in place. This option may be used to increase sensitivity in the presence of ambiguity codes, at the expense of memory and possibly speed. This is the standard behavior for word lengths < 7, and is not recommended for use with the default or longer word lengths, particularly for longer sequences, due to the increased memory requirements; when comparing long sequences, however, if sufficient memory is available, use of this option can yield a significant increase in speed. This option offers improved sensitivity when searching databases in XDF format that contain ambiguity codes. The option is accepted by the software but offers no improvement in sensitivity for databases in the earlier BLAST 1.4 database format. (`BLASTN` search mode only).
mmio	turn off the use of memory-mapped I/O when reading database files. Use of this option will usually retard the search, particularly when multiple processors are being used, but it serves both to demonstrate the effectiveness of this form of I/O and to validate the I/O routines. Note that no special daemon or support programs (such as the old memfile program) are required to take full advantage of memory-mapped I/O.
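
The interaction of hspmax, gspmax and spoutmax described in the table can be summarized with a small sketch. It computes only the upper bound on segment pairs reported per subject sequence, following the statements that each limit has no effect when another is more restrictive, that 0 means no limit, and that gspmax is ignored under nogaps. The function name `report_cap` is hypothetical, and the actual number reported depends on how many alignments a search finds.

```python
def report_cap(hspmax=1000, gspmax=1000, spoutmax=1000, nogaps=False):
    """Upper bound on segment pairs reported per subject sequence.

    A value of 0 means "no limit", mirroring the option convention,
    so 0 is returned when no applicable limit is finite.
    """
    limits = [hspmax, spoutmax]
    if not nogaps:
        # gspmax only matters when gapped alignments are produced.
        limits.append(gspmax)
    finite = [n for n in limits if n > 0]
    return min(finite) if finite else 0


print(report_cap())                                  # defaults
print(report_cap(hspmax=0, gspmax=0, spoutmax=50))   # only spoutmax limits
print(report_cap(hspmax=0, gspmax=10, spoutmax=0, nogaps=True))
```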