Practical Tips/Help for MS-BLAST Users

1. Abbreviations and Terms

BLAST: Basic Local Alignment Search Tool, a sequence similarity search algorithm described in: Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment search tool, J Mol Biol 215, 403-410.

De novo sequencing software: software capable of automated interpretation of tandem mass spectra. It is often included in the software packages shipped with mass spectrometers: BioMultiview, BioAnalyst (MDS Sciex); BioMassLynx (Micromass); BioTools (Bruker Daltonics). The Lutefisk program (J. A. Taylor and R. S. Johnson (1997) "Sequence database searches via de novo peptide sequencing by tandem mass spectrometry". Rapid Comm. Mass Spec. 11:1067-1075) can be downloaded from: http://www.immunex.com/researcher/lutefisk/

High scoring pair (HSP): a region of high local sequence similarity between a peptide in the query and a protein in the database, identified by database searching. Usually several HSPs are reported for every protein hit.

MS-BLAST: Mass Spectrometry driven BLAST, a specialised BLAST-based protocol developed for identification of proteins by sequence similarity searches using peptide sequences produced by the interpretation of tandem mass spectra. MS-BLAST is described in detail in: Shevchenko, A., Sunyaev, S., Loboda, A., Shevchenko, A., Bork, P., Ens, W., and Standing, K. (2001). Charting the proteomes of organisms with unsequenced genomes by MALDI-Quadrupole Time-of-Flight mass spectrometry and BLAST homology searching, Anal Chem 73, 1917-1926.

MS-BLAST web interface is located at http://genetics.bwh.harvard.edu/msblast/

MS-BLAST query: a string composed of peptide sequences and edited according to MS-BLAST rules.

2. Composing MS-BLAST Queries

2.1. General

An MS-BLAST query is a text file composed of peptide sequence proposals obtained by software-assisted or manual interpretation of tandem mass spectra. The candidate sequences should be edited according to MS-BLAST rules (see below) and separated by the - (minus) symbol. Once completed, the text can be pasted into the query window of the MS-BLAST web interface.

2.2. How to obtain, edit and assemble peptide sequences?

If tandem mass spectra were interpreted by de novo sequencing software, use the entire list of candidate sequences, disregarding their relative scores. Any sequence longer than 7 amino acid residues is worth including in the list. However, it is often enough to consider the 50-100 top-scoring sequence proposals.

If MS/MS spectra are interpreted manually, it is advisable to produce the longest possible sequence stretches, even though their accuracy may be compromised. For example, when sequencing at the femtomole level, chemical noise often hampers interpretation of fragment ion series in the low m/z range. It is better to include a number of complete sequence proposals in the search than to use a single stretch of three or four amino acid residues deduced from the high m/z region of the spectrum, even though such a stretch can often be determined unambiguously.

2.2.1. Dealing with gaps and ambiguities in peptide sequences.

Some de novo sequencing programs (e.g., BioMultiview) may suggest gaps in the peptide sequence that can be filled with various isobaric combinations of amino acid residues. For example:

ASDF[...]FGTR, [...] = [L,T] or [D,V]

If one or two combinations were suggested, it is better to include all of them in the search string by listing every possible ordering of the amino acid residues in each combination:

-ASDFLTFGTR-ASDFTLFGTR-ASDFDVFGTR-ASDFVDFGTR-
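The expansion above can be sketched in a few lines of Python. This is only an illustration: the helper names `expand_gap` and `to_query` are hypothetical and are not part of MS-BLAST or of any de novo sequencing package.

```python
from itertools import permutations


def expand_gap(prefix, suffix, combinations):
    """Fill the gap between prefix and suffix with every ordering of
    each suggested residue combination (hypothetical helper)."""
    variants = []
    for combo in combinations:
        # set() drops duplicate orderings (e.g. for L,L); sorted() keeps
        # the output order reproducible.
        for order in sorted(set(permutations(combo))):
            variants.append(prefix + "".join(order) + suffix)
    return variants


def to_query(sequences):
    """Separate candidate sequences with the minus symbol."""
    return "-" + "-".join(sequences) + "-"


# ASDF[...]FGTR, [...] = [L,T] or [D,V]
variants = expand_gap("ASDF", "FGTR", [("L", "T"), ("D", "V")])
print(to_query(variants))
```

For the example gap this prints the same four-variant string as shown above.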

If more than two combinations were suggested, the symbol X may be used to fill the gap. A zero score is assigned to the X symbol in the scoring matrix, so it matches any amino acid residue (although the score of the match will be decreased):

-ASDFXXFGTR-

Note that MS-BLAST is sensitive to the number of amino acid residues in the gap. If the gap may be filled by a combination of two or three amino acid residues, it is worth considering both options in the query:

-ASDFXXFGTR-ASDFXXXFGTR-
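When the gap length is uncertain, the X-filled variants can be generated rather than typed by hand. The following is a minimal sketch; `x_fill` is a hypothetical helper name, not an MS-BLAST function.

```python
def x_fill(prefix, suffix, gap_lengths):
    """Return one candidate per possible gap length, with the gap
    filled by the corresponding number of X symbols."""
    return [prefix + "X" * n + suffix for n in gap_lengths]


# The gap may hold two or three residues, so both options are queried.
candidates = x_fill("ASDF", "FGTR", [2, 3])
query = "-" + "-".join(candidates) + "-"
print(query)
```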

2.2.2. Symbols for isobaric amino acids and putative cleavage sites.

The following symbols are allowed in MS-BLAST, in addition to the conventional amino acid symbols:

L stands for Leu and Ile

Z stands for Gln and Lys, if indistinguishable in the spectrum

X stands for any amino acid

B stands for a putative trypsin cleavage site, i.e. an Arg or Lys residue preceding the complete sequence

Note that the server will ignore other characters! Some de novo sequencing programs may use special characters for modified amino acid residues (e.g., J for cysteine S-acrylamide, U for oxidized methionine, etc.). Those characters should be replaced by conventional amino acid symbols (C for any form of cysteine, M for any form of methionine, etc.). If you are not sure whether the sequence contains Phe or Met-sulfoxide, simply keep both optional sequences in the query.
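The symbol clean-up can be automated before the query is assembled. The sketch below assumes that your de novo software uses J and U as in the example above (other programs may use different codes, so adjust the mapping) and that you want every Leu/Ile position written as L. The function name `normalize` is hypothetical.

```python
# Assumed mapping, based only on the J and U examples in the text;
# extend or change it to match the codes of your own software.
MODIFIED_TO_STANDARD = {"J": "C", "U": "M"}


def normalize(sequence, modified=MODIFIED_TO_STANDARD):
    """Replace program-specific codes for modified residues with
    conventional symbols and write Ile as L."""
    cleaned = []
    for residue in sequence.upper():
        residue = modified.get(residue, residue)
        if residue == "I":
            residue = "L"  # L stands for both Leu and Ile
        cleaned.append(residue)
    return "".join(cleaned)


print(normalize("AJDIUK"))
```

Gln/Lys positions are deliberately left alone here, since Z should only be used where the two cannot be distinguished in the spectrum.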

2.2.3. How to use the cleavage site symbol?

If the proposed sequence is complete, a putative trypsin cleavage site is added prior to the peptide sequence:

-BASDFLTFGTR-

When interpreting low-energy tandem mass spectra acquired from multiply charged precursor ions of tryptic peptides, it is often impossible to determine two amino acid residues located at the N-terminus of the peptide. In those cases present the candidate peptide sequence as:

-BXXDFLTFGTR-

since BXX residues can then be included into the alignment.
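Both cases, a complete sequence and a sequence with undetermined N-terminal residues, can be handled by one small helper. This is a sketch with a hypothetical function name; it assumes the proposed sequence still includes placeholder residues at the N-terminus that are to be overwritten with X.

```python
def with_cleavage_site(sequence, unknown_n_terminal=0):
    """Prefix the sequence with B and replace the first
    `unknown_n_terminal` residues with X."""
    if not 0 <= unknown_n_terminal <= len(sequence):
        raise ValueError("unknown_n_terminal is out of range")
    return "B" + "X" * unknown_n_terminal + sequence[unknown_n_terminal:]


print(with_cleavage_site("ASDFLTFGTR"))     # complete sequence
print(with_cleavage_site("ASDFLTFGTR", 2))  # first two residues unknown
```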

2.3. Complete and paste the query.

After interpreting MS/MS spectra, all candidate sequence proposals obtained from all fragmented precursors are separated by the minus symbol and merged into a single text file that can be pasted into the query window at the MS-BLAST web interface. The text file may contain space symbols, hard returns, numbers etc., which will be ignored by the server. It is therefore convenient to keep the masses of precursor ions in the query, since this makes retrospective analysis of the data much easier. If the ECHO option is enabled at the web interface, the search string “as read” by the server will be reported in the MS-BLAST output.
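One possible way to assemble such a file is sketched below. The precursor m/z values are made up for illustration, the one-line-per-precursor layout is only a convenience, and `assemble_query` is a hypothetical helper. It relies on the statement above that numbers, spaces and hard returns are ignored by the server.

```python
def assemble_query(proposals):
    """Build the query text from a dict mapping precursor m/z to the
    list of candidate sequences obtained from that precursor."""
    lines = []
    for precursor_mz, sequences in proposals.items():
        if not sequences:
            continue  # nothing to submit for this precursor
        # The m/z label is kept only for retrospective analysis.
        lines.append(f"{precursor_mz:.2f} -" + "-".join(sequences))
    return "\n".join(lines) + "-"


# Hypothetical precursor masses and candidate sequences.
proposals = {
    650.32: ["BASDFLTFGTR", "BXXDFLTFGTR"],
    712.40: ["ASDFXXFGTR"],
}
query_text = assemble_query(proposals)
print(query_text)
```

Enabling the ECHO option is a simple way to confirm that the labels were indeed ignored and only the sequences were read.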

3. MS-BLAST options and settings:

3.1. Options specified via command line (the window “Other advanced options”)

-nogap : absolutely essential; it turns off the gapped alignment method so that only HSPs with no internal gaps are reported.

-span1 : absolutely essential; it identifies and fetches the best matching peptide sequence among similar peptide sequences in the query. The query may therefore contain multiple partially redundant variants of the same peptide sequence (see above) without affecting the total score of the protein hit.

-hspmax 100 : limits the total number of reported HSPs to 100. You can change it to a larger number (for example, to 200) if a large query is submitted and you would like to obtain a complete list of protein hits and HSPs in the output.

-sort_by_totalscore : places the hits that matched multiple high scoring pairs at the top of the list (although the scores of those HSPs may be rather weak). Note that the total score is not displayed but can be calculated, if necessary, by adding up the scores of individual HSPs. Alternatively, MS-BLAST output can be sorted by the best scoring HSP (specify -SORT_BY_HIGHSCORE via the command line).
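If queries are prepared by a script, the option string for the “Other advanced options” window can be composed in the same place. The sketch below uses only the options named in this section and assumes they are typed with a plain hyphen and in the spelling shown here. The function name `build_options` is hypothetical.

```python
def build_options(hspmax=100, sort_by="totalscore"):
    """Compose the advanced-options string from the settings
    described in section 3.1."""
    if hspmax < 1:
        raise ValueError("hspmax must be a positive integer")
    if sort_by == "totalscore":
        sort_flag = "-sort_by_totalscore"
    elif sort_by == "highscore":
        sort_flag = "-SORT_BY_HIGHSCORE"
    else:
        raise ValueError("sort_by must be 'totalscore' or 'highscore'")
    # -nogap and -span1 are always included because both are essential.
    return " ".join(["-nogap", "-span1", f"-hspmax {hspmax}", sort_flag])


print(build_options())
print(build_options(hspmax=200, sort_by="highscore"))
```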

3.2. Options specified via dedicated windows at the interface

EXPECT: It is usually enough to set the EXPECT value to 100. Searching with higher EXPECT values (e.g., 1000) will report short, low-scoring HSPs, which may be useful for matching more fragmented peptide precursors to a protein sequence that has already been identified by higher-scoring HSPs. Note that those lower-scoring HSPs do not increase the confidence of protein identification (see below). The EXPECT setting also does not affect the scores of retrieved HSPs.

Matrix: PAM30MS is a modified scoring matrix that has to be used only with MS-BLAST searching. Do not use it with conventional BLAST searching!

Program: blast2p; database: nrdb95 are the default settings at the MS-BLAST interface. In principle, MS-BLAST can be used as a tblastn program for searching EST or genomic databases. However, this consumes a lot of computational resources and is therefore not currently provided at the EMBL server.

FILTER: Filtering can be set to “none”. However, if the query contains a large number of repeating stretches (e.g., …EQEQEQ…), it is advisable to enable filtering by setting it to “default”.

4. Statistical evaluation of MS-BLAST hits

4.1. General comments

Statistical evaluation is a very important element of the MS-BLAST protocol because the query typically comprises a large number of incorrect or ambiguous peptide sequences. Note that the statistics of conventional BLAST searching are not applicable to MS-BLAST! Please disregard the E-values and P-values that are reported in the MS-BLAST output for each listed HSP. Identities (%) and Positives (%) are listed solely for reference.

4.2. Once MS-BLAST search has been completed, do the following:

Check the number of fragmented precursor ions from which sequences for MS-BLAST searching were obtained. Accordingly, select the appropriate number of expected unique peptides from the table (see below).
Consider the top hit protein in the MS-BLAST output and check the list of HSPs matched to it. Take the HSP with the highest score and compare this score with the threshold value reported in the table for a single reported HSP.
If the score of the HSP is higher than the threshold, the match is statistically significant.
If not, take the score of the second ranked HSP, add it to the score of the first ranked HSP and compare the sum with the threshold score reported in the table for two reported HSPs. Again, if the sum is higher than the threshold, positive identification can be claimed. If not, add the score of the third ranked HSP, compare the sum with the threshold expected for three reported HSPs, and so forth.

Always start the evaluation from the highest ranked HSP reported for the given protein hit!

If necessary, repeat the procedure for other reported protein hits in MS-BLAST output.
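The cumulative comparison described above can be sketched as follows. The thresholds must be copied from the table for the relevant number of unique peptides. The numbers used in the example call are placeholders invented for illustration and are NOT values from the table. The function name `evaluate_hit` is hypothetical, and `None` marks a table entry for which no usable threshold is given.

```python
def evaluate_hit(hsp_scores, thresholds):
    """Return (number_of_hsps, summed_score) for the first cumulative
    sum that exceeds its threshold, or None if none does.

    thresholds[k - 1] is the threshold for k reported HSPs."""
    # Always start from the highest ranked HSP.
    ranked = sorted(hsp_scores, reverse=True)
    total = 0
    for n_hsps, score in enumerate(ranked, start=1):
        if n_hsps > len(thresholds):
            break  # the table offers no threshold for this many HSPs
        total += score
        threshold = thresholds[n_hsps - 1]
        if threshold is not None and total > threshold:
            return n_hsps, total
    return None


# Placeholder thresholds for 1, 2 and 3 HSPs (not taken from the table).
example_thresholds = [70, 100, 130]
print(evaluate_hit([55, 62, 40], example_thresholds))
print(evaluate_hit([50, 30], example_thresholds))
```

In the first call the best HSP alone (62) does not pass, but the sum of the two best HSPs (117) exceeds the two-HSP placeholder threshold. In the second call no cumulative sum passes, so the hit would not be reported as significant.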

^a – n.o.: no random hits with the specified number of HSPs were observed.

^b – The calculated value is statistically unreliable because only a few hits matching with the specified number of HSPs were observed. In those cases the maximal observed score is presented solely as a reference.

4.3. Important notes:

In our experience, MS-BLAST is more likely to miss true hits than to produce false positives. For example, if only a single short peptide can be matched to the protein sequence in a database (protein sequences may be poorly conserved), such a match will likely be discarded as statistically insignificant. However, even “twilight” hits may sometimes provide useful clues for further inspection of mass spectra or guide further biological experiments. Keep your eyes open!

Be cautious when assigning a putative function to a protein solely on the basis of an MS-BLAST identification. The sequenced peptides might originate from a region that contains a conserved sequence motif (e.g., an ATP-binding domain) that is shared between proteins of different functionality.