Software:SpectraST

From SPCTools

Revision as of 03:40, 23 September 2014
Henrylam (Talk | contribs)

SpectraST (short for "Spectra Search Tool" and rhymes with "contrast") is a spectral library building and searching tool designed primarily for shotgun proteomics applications. It is developed at the Institute for Systems Biology (ISB), in the research group of Professor Ruedi Aebersold. The main developer is Henry Lam.

The latest version of SpectraST is 5.0, released as a beta in November 2013 and officially with TPP 4.7 in March 2014. It is distributed by ISB under the LGPL license, as a component of the Trans-Proteomic Pipeline (TPP) suite of software, which is distributed under the same license. The source code repository is at [1], and the official download site for the Windows installer is at [2].

Introduction to Shotgun Proteomics and Spectral Searching

The goal of proteomics is the systematic identification and quantification of all proteins in a biological system. In one of the most frequently practiced workflows, commonly known as shotgun proteomics, a protein sample of interest is first digested with a proteolytic enzyme (trypsin being the most common) to yield peptides that are amenable to LC-MS/MS analysis. The peptides in the resulting mixture are chromatographically resolved, ionized by techniques such as electrospray ionization (ESI) or matrix-assisted laser desorption ionization (MALDI) before being analyzed by a mass spectrometer. A fraction of the peptide ions are selectively isolated by the mass spectrometer and subjected to collision-induced dissociation (CID), in which the peptide ions are bombarded with noble gas atoms to induce fragmentation. (Other types of fragmentation techniques are also rapidly maturing.) The fragment ions are detected and reported by the mass spectrometer as tandem mass (MS/MS) spectra. Because peptide ions tend to fragment mostly along the peptide backbone in a somewhat predictable manner, the MS/MS spectra contain information that can be used to deduce the peptide sequence.

Traditionally, the inference of the peptide sequence from its characteristic tandem mass spectra is done by sequence (database) searching. In sequence searching, a target protein (or translated DNA) database is used as a reference to generate all possible putative peptide sequences by in silico digestion. The search engines then use various rules to predict the theoretical fragmentation pattern of each of these putative peptides, and compare the experimentally observed MS/MS spectra to these theoretical spectra one-by-one. Presumably, a positive identification is made if the experimental spectrum is sufficiently similar to one of the theoretical spectra. Several popular computational tools developed for this purpose have emerged over the years, each employing different algorithms and heuristics to achieve an acceptable balance of sensitivity and accuracy. Unfortunately, traditional sequence searching is a challenging, error-prone, and computationally expensive exercise. Despite the tremendous improvement in computer hardware and software over the past decade, this step often remains the bottleneck of any given proteomics experiment. The requirement of computational resources is also substantial, limiting the use of this powerful technique to only those research groups that can afford the costly computational infrastructure.

Spectral searching is an alternative approach that promises to address some of the shortcomings of sequence searching. In spectral searching, a spectral library is meticulously compiled from a large collection of previously observed and identified peptide MS/MS spectra. The unknown spectrum can then be identified by comparing it to all the candidates in the spectral library for the best match. This approach has been commonly employed for mass spectrometric analysis of small molecules with great success, but has only become possible for proteomics very recently. The main difficulty of generating enough high-quality experimental spectra for compilation into spectral libraries has been overcome by the recent explosion of proteomics data and the availability of public data repositories. Several attempts at creating and searching spectral libraries in the context of proteomics have been published within the past year, all demonstrating the tremendous improvement in search speed and the great potential of this method in complementing, if not replacing, sequence searching in many proteomics applications.

Advantages of Spectral Searching

1. Speed

Spectral searching benefits from a much reduced search space compared to sequence searching. In spectral searching, only peptide ions that are observed and identified in previous experiments will be included in spectral libraries and considered as candidates, whereas in sequence searching, all putative peptide sequences -- plus all permutations of post-translational modification sites, if specified -- in a protein database are considered. Most of these putative peptide ions considered in sequence searching are never observed in practice for a variety of reasons. With typical search parameters, the search space of spectral searching can be several orders of magnitude smaller. It is therefore not surprising that spectral searching can also be several orders of magnitude faster. SpectraST can achieve a top speed of 0.001 to 0.01 second per query spectrum (against a library of about 50,000 entries) on a modern personal computer. In contrast, SEQUEST, one of the most popular sequence search engines, needs about 5 to 20 seconds per query spectrum (against a human IPI database).

2. Preciseness

Spectral searching compares experimental spectra to experimental spectra; sequence searching compares experimental spectra to theoretical spectra. In general, the theoretical spectra considered in sequence searching are very simplistic (e.g., only including b- and y-type ions, at a fixed intensity), and do not resemble the experimental spectra that they are supposed to match. On the other hand, armed with previously observed experimental spectra compiled into spectral libraries, spectral searching can take full advantage of all spectral features, including actual peak intensities, neutral losses from fragments, and various uncommon or even unknown fragments, to determine the best match. The similarity scoring of spectral searching is therefore more precise, and will generally provide better discrimination between good and bad matches. This usually results in much superior statistics (e.g., sensitivity, false discovery rates) for the search results, compared to sequence searching.
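As a rough illustration of what comparing one experimental spectrum against another can look like, the sketch below bins two peak lists and computes a normalized dot product. This is a generic similarity measure chosen for illustration only; it is not SpectraST's actual scoring function, and the bin width of 1.0 and all function names are assumptions.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into m/z bins of the given width (hypothetical helper)."""
    bins = defaultdict(float)
    for mz, intensity in peaks:
        bins[int(mz // bin_width)] += intensity
    return bins


def normalized_dot_product(peaks_a, peaks_b, bin_width=1.0):
    """Cosine similarity between two binned peak lists, between 0 and 1."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up peak lists as (m/z, intensity) pairs.
query = [(175.1, 30.0), (262.2, 100.0), (375.3, 55.0)]
library_entry = [(175.1, 28.0), (262.2, 95.0), (375.3, 60.0)]
unrelated = [(120.0, 80.0), (300.5, 40.0)]

print(normalized_dot_product(query, library_entry))  # close to 1
print(normalized_dot_product(query, unrelated))      # no shared bins
```

Because real peak intensities enter the score, a library spectrum with the right peaks at the wrong relative heights scores lower than a true replicate, which is the kind of discrimination a fixed-intensity theoretical spectrum cannot offer.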

Versions

What's new in SpectraST 5.0

  • New, rank-based similarity scoring function (Old scoring function remains as an option)
  • High mass accuracy MS2 (including HCD) support
  • Spectral archive (unidentified spectral library) building
  • Biological sample fingerprinting by spectral archives
  • Open (blind) modification search
  • Improved decoy generation, including alternative method by precursor swapping
  • Semi-empirical spectrum generation for amino acid substitutions
  • De-noising based on Bayesian classifier
  • Retention time normalization using injected landmark peptides

What's new in SpectraST 4.0

  • ETD support
  • iProphet support
  • Decoy spectrum generation
  • MRM transition list generation
  • User-defined modifications
  • Semi-empirical spectrum generation from real spectrum of closely related identification
  • Searching .mgf files
  • Clickable (HTML) search output format
  • Better book-keeping in library building
  • Various bug fixes and performance enhancements

What's new in SpectraST 3.1

  • Re-mapping peptide identifications of library entries to protein sequence database of choice
  • Rudimentary centroiding for imported spectra in profile mode
  • mzML support via TPP
  • Various bug fixes and performance enhancements

What's new in SpectraST 3.0

  • Creating libraries from sequence search results
  • Library manipulation
    • Union/Intersect/Subtract operations
    • Consensus/Best-replicate library building
    • Filtering based on criteria
    • Quality filters
  • Importing libraries from X!Hunter and BiblioSpec formats
  • File list feature
  • Logging
  • Lib2HTML utility for visualizing library
  • Monoisotopic mass support
  • Various bug fixes and performance enhancements

What’s new in SpectraST 2.0

  • Binary library format, enabling speed gain
  • Library information and statistics in preambles of .sptxt and .pepidx files
  • Searching of .dta files
  • Detecting homologs in hit list
  • Various bug fixes and performance enhancements

User's Guide

Installing SpectraST

SpectraST is an integral component of the Trans Proteomic Pipeline suite of software. Although it can be used alone without other TPP components, SpectraST users are strongly encouraged to download and install the entire TPP suite, which provides other useful functionalities such as raw data importation, automatic validation of search results, protein inference, and quantification and visualization.

Windows users: SpectraST is available as part of TPP for Windows. A one-click installer is available, in which Windows-native executables are compiled by MinGW.

UNIX/LINUX users: Visit the Sashimi project page on SourceForge.net, and download the code as a tarball directly. Compiling, installation and configuration information is available in the README file. Alternatively, follow the instructions for Ubuntu LINUX installation.

Running SpectraST

SpectraST has two modes, the Create mode and the Search mode. In the former, SpectraST creates a searchable spectral library from various formats to prepare for searching. In the latter, SpectraST takes in unknown spectra and searches each of them against the spectral library.

The simplest way of running SpectraST is from the command line of your UNIX/LINUX or Windows cmd shell. The general usage is:

spectrast <options> <list of files of appropriate formats>

Options must be separated by space, and all begin with a hyphen ('-'). Search mode options always have an 's' following the hyphen; Create mode options a 'c'. SpectraST will perform the appropriate action based on the options specified, and complain when there are problems interpreting the command statement. The usage statement, and a list of options can be viewed by issuing the command spectrast by itself.
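For scripted use, the command line can be assembled programmatically. The following is a minimal sketch, assuming only the conventions stated above (options begin with a hyphen, Search mode options continue with 's' and Create mode options with 'c'); the helper names are hypothetical and nothing here is part of SpectraST itself.

```python
import shlex


def option_mode(option):
    """Classify an option token as 'search', 'create' or 'other'."""
    if not option.startswith("-"):
        raise ValueError(f"options must begin with a hyphen: {option!r}")
    if option.startswith("-s"):
        return "search"
    if option.startswith("-c"):
        return "create"
    return "other"


def build_spectrast_command(options, files):
    """Join option tokens and file names into one space-separated command."""
    for option in options:
        option_mode(option)  # raises on tokens that are not options
    return shlex.join(["spectrast", *options, *files])


print(build_spectrast_command(["-sFspectrast.params", "-sLbar.splib"], ["foo.mzXML"]))
```

The resulting string can be passed to a shell, or the token list can be handed to subprocess.run directly to avoid quoting issues.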

Once TPP is installed, SpectraST can also be run from the Petunia web interface, with limited options.

SpectraST Search Mode

SpectraST can perform spectral searching from the following data formats:

  • .mzML format
  • .mzXML (all versions) format
  • .mzData format
  • .mgf (Mascot Generic) format
  • .dta (SEQUEST) format, a simple peak list preceded by precursor information
  • NIST (National Institute of Standards and Technology)’s .msp format

To search, the spectral library must be in SpectraST’s .splib format, which can be created in SpectraST Create Mode.

The results can be outputted to the following formats:

  • .pepXML format
  • .txt format, a fixed-width column text format
  • .xls format, a tab-delimited column text format
  • .html format, an HTML table with clickable links to the spectrum viewer

The search mode is initiated with the option -s, or any of the search mode options. For instance, to search the MS/MS spectra in the file foo.mzXML against the spectral library bar.splib, using the parameters specified in the file spectrast.params, the command is simply:

spectrast -sFspectrast.params -sLbar.splib foo.mzXML

Note: If the library is not specified in the parameter file, or if the parameter file is not given, then the option -sL is mandatory; otherwise SpectraST will not know which spectral library to use.

In the above, -sF and -sL are search mode options that the user can specify to customize the behavior of SpectraST. SpectraST will search all the MS/MS spectra in the file foo.mzXML against the spectral library bar.splib, using the parameters specified in the file spectrast.params. The result will be written to a file named foo.<ext> in the same directory where <ext> specifies the output format (.pep.xml, .txt, .xls, or .html).

For a full list of options, see SpectraST Options.

SpectraST Create Mode

Importing Existing Libraries

SpectraST can create a searchable spectral library from the following formats:

  • NIST (National Institute of Standards and Technology)'s .msp format (Download here)
  • X!Hunter's .hlf format [3]
  • BiblioSpec’s .ms2 format [4]

If files of these extensions are supplied, SpectraST simply converts those spectral libraries into a form suitable for SpectraST searches (the .splib format). (Note, however, that there is no study on how well SpectraST works with X!Hunter and BiblioSpec libraries.) For instance, to import the NIST yeast consensus library, call the resulting library bar.splib and put it in the directory /dir/, the command is:

spectrast -cN/dir/bar yeast_consensus.msp

When it is done, it produces 5 files in the directory /dir/. The file bar.splib is the library itself; it’s in a binary (machine-readable) format. The file bar.sptxt is a text (human-readable) version of bar.splib. This .sptxt file is of no use to SpectraST; it can be deleted after manual inspection. The files bar.spidx and bar.pepidx are indices on the precursor m/z value and peptide, respectively. Keep the indices and the .splib file in the same directory for SpectraST to function properly. Lastly, a file spectrast.log is also created to document the command executed. Some useful information about the library is printed at the beginning of the bar.sptxt and bar.pepidx.
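Since the .splib file and its two indices must stay together, a quick check before searching can catch a partially copied library. The sketch below is one way to do this, assuming the three files share a base name as in the bar example above; the function name is hypothetical.

```python
from pathlib import Path

# Files that the text says must sit in the same directory for searching.
REQUIRED_EXTENSIONS = (".splib", ".spidx", ".pepidx")


def missing_library_files(splib_path):
    """Return the required companion files that do not exist on disk."""
    base = Path(splib_path)
    expected = [base.with_suffix(ext) for ext in REQUIRED_EXTENSIONS]
    return [str(path) for path in expected if not path.exists()]


# Reports all three files unless /dir/bar.* actually exists on this machine.
print(missing_library_files("/dir/bar.splib"))
```

The .sptxt file is deliberately left out of the check, because the text notes that SpectraST does not need it.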

For a full list of SpectraST options, see SpectraST Options.

Creating Libraries from Sequence Search Results

Note: As per TPP convention, the spectrum query must be named:

<mzXML file name>.<start scan>.<end scan>.<charge>

in the .pepXML file, so that SpectraST knows where to find the corresponding experimental spectrum. (If the .pepXML file is created with TPP tools, this should not be an issue.)
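To make the naming convention concrete, the sketch below splits a spectrum query name into its four parts. It is an illustration only, with hypothetical names; splitting from the right keeps any dots inside the file name part intact.

```python
from typing import NamedTuple


class SpectrumQuery(NamedTuple):
    file_name: str
    start_scan: int
    end_scan: int
    charge: int


def parse_spectrum_query(name):
    """Parse '<mzXML file name>.<start scan>.<end scan>.<charge>'."""
    parts = name.rsplit(".", 3)
    if len(parts) != 4:
        raise ValueError(f"not a TPP-style spectrum query name: {name!r}")
    file_name, start_scan, end_scan, charge = parts
    return SpectrumQuery(file_name, int(start_scan), int(end_scan), int(charge))


print(parse_spectrum_query("foo.01234.01234.2"))
```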

SpectraST can create a spectral library from a .pepXML file, which contains peptide identifications from a previous shotgun proteomics experiment. For this purpose, it is preferable that the .pepXML has been processed with PeptideProphet and/or iProphet, such that all the search hits have probabilities assigned. (iProphet probabilities are used over PeptideProphet ones if both are present.)

When importing from a .pepXML file, SpectraST scans through the .pepXML file for confident identifications, and attempts to extract the corresponding experimental spectra from .mzXML files. For instance, the command

spectrast -cNraw -cP0.9 dataset1.xml

will import all peptide identifications with probability at or above 0.9 from the file dataset1.xml, and put them in a library called raw.splib (with the accompanying raw.sptxt, raw.spidx and raw.pepidx files). For a full list of SpectraST options, see SpectraST Options.
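The selection rule described above (keep identifications at or above the threshold, preferring the iProphet probability when both are present) can be sketched as follows. The dictionary keys and function names are assumptions made for illustration; real .pepXML parsing is not shown.

```python
def identification_probability(hit):
    """Use the iProphet probability if present, else the PeptideProphet one."""
    if hit.get("iprophet") is not None:
        return hit["iprophet"]
    return hit.get("peptideprophet")


def select_confident(hits, threshold=0.9):
    """Keep hits whose probability is at or above the threshold."""
    selected = []
    for hit in hits:
        probability = identification_probability(hit)
        if probability is not None and probability >= threshold:
            selected.append(hit)
    return selected


# Made-up search hits.
hits = [
    {"peptide": "AAAK", "peptideprophet": 0.95, "iprophet": 0.85},
    {"peptide": "CCCR", "peptideprophet": 0.92},
    {"peptide": "DDDK", "peptideprophet": 0.40, "iprophet": 0.99},
    {"peptide": "EEEK"},
]
print([hit["peptide"] for hit in select_confident(hits)])
```

In this example the first hit is dropped even though its PeptideProphet probability is high, because the iProphet value takes precedence.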

Manipulating SpectraST Libraries

SpectraST can convert one or more .splib libraries to another, performing various operations. For instance, to create a consensus library from all the entries in bar.splib and foo.splib, the command is:

spectrast -cNconsensus -cJU -cAC bar.splib foo.splib

SpectraST will take the union (specified by the option -cJU) of all the entries in bar.splib and foo.splib, and wherever a certain peptide ion is present as multiple entries (replicates), it will coalesce the replicates into a single consensus spectrum (specified by -cAC).

Some additional examples:

spectrast -cNphospho -cf"Mods =~ Phospho" bar.splib

This will screen the library bar.splib for all entries with a phosphorylation modification, and put the phosphopeptides in the library phospho.splib.

spectrast -cNcommon -cJI dataset1.splib dataset2.splib

This will take the intersection of the two libraries dataset1.splib and dataset2.splib, and put all entries of peptide ions that are seen in both files in the library common.splib.
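Conceptually, the union and intersection operations act on peptide ions. The sketch below models a library as a dictionary from a peptide ion key to its list of replicate spectra; this data model, the key format and the function names are assumptions for illustration and do not reflect the .splib file layout.

```python
def union(*libraries):
    """Keep every peptide ion seen in any library, pooling its replicates."""
    merged = {}
    for library in libraries:
        for ion, replicates in library.items():
            merged.setdefault(ion, []).extend(replicates)
    return merged


def intersection(library_a, library_b):
    """Keep only peptide ions seen in both libraries, pooling their replicates."""
    common = library_a.keys() & library_b.keys()
    return {ion: library_a[ion] + library_b[ion] for ion in sorted(common)}


# Made-up libraries; each value lists replicate spectrum labels.
dataset1 = {"AAAK/2": ["d1_s1"], "CCCR/2": ["d1_s2", "d1_s3"]}
dataset2 = {"CCCR/2": ["d2_s1"], "DDDK/3": ["d2_s2"]}

print(sorted(union(dataset1, dataset2)))
print(intersection(dataset1, dataset2))
```

After a union, a peptide ion may carry several replicates; coalescing them into one consensus spectrum is the separate step selected with -cAC.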

spectrast -cNquality -cAQ -cL2 bar.splib

This will apply SpectraST’s quality filters to the library bar.splib; only those entries that pass the first 2 quality filters will be included in the library quality.splib.

For a full list of SpectraST options, see SpectraST Options. For a typical recipe for creating consensus libraries from sequence search results, see Creating Consensus Libraries.

Creating Consensus Libraries

A recipe for creating consensus libraries from TPP-processed sequence search results is detailed here. Consider the following example:

Dataset Identifier | pepXML Files | mzXML Files
Alpha | A-SEQ.xml (SEQUEST results of A1.mzXML), A-MAS.xml (Mascot results of A1.mzXML) | A1.mzXML
Beta | B1.xml (SEQUEST results of B1.mzXML), B2.xml (SEQUEST results of B2.mzXML) | B1.mzXML, B2.mzXML
Gamma | G.xml (combined SEQUEST results of all .mzXML files) | G1.mzXML, G2.mzXML, G3.mzXML

Note: The library building recipe described in the steps below can alternatively be encoded in a recipe.list file (see SpectraST File List Feature):
? -cNrawA -cnAlpha
A-SEQ.xml
A-MAS.xml
? -cNrawB -cnBeta
B1.xml
B2.xml
? -cNrawG -cnGamma
G.xml
? -cJU -cAC -cNconsABC
rawA.splib
rawB.splib
rawG.splib
? -cAQ -cNconsABC_Q
consABC.splib
The command spectrast recipe.list will complete the entire library building procedure.

The following commands should be issued in succession:

1. Importing the raw spectra into SpectraST
spectrast -cNrawA -cnAlpha A-SEQ.xml A-MAS.xml
spectrast -cNrawB -cnBeta B1.xml B2.xml
spectrast -cNrawG -cnGamma G.xml

These commands will create the raw libraries rawA.splib, rawB.splib and rawG.splib. Identifications from multiple .pepXML files of the same dataset are imported with the same dataset identifier. The same query with identifications from multiple search engines will be combined intelligently. The probability threshold at or above which identifications are imported can be specified with the option -cP<prob>, which defaults to 0.9. This step does not yet coalesce replicates of the same peptide ion identification into a consensus spectrum. Remember that the .mzXML files must be in the same directories as their corresponding .pepXML files.

2. Creating a consensus spectral library
spectrast -cJU -cAC -cNconsABC raw*.splib

This will combine the three raw libraries, then replace multiple replicates of the same peptide ion identification with a consensus spectrum. Many options are available to fine-tune the algorithm; however, the default parameters are usually adequate.

3. Performing quality control of the consensus spectral library
spectrast -cAQ -cNconsABC_Q consABC.splib

This will run the consensus spectra through SpectraST's quality filters. With the default settings, spectra failing either or both of the first 2 filters will be removed, and spectra failing any of the other filters will be marked. Different quality levels can be set with the options -cL and -cl. It is recommended that a consensus spectral library be subjected to some quality control before it is used in spectral searching, in order to minimize misidentified and low-quality spectra in the library. Such questionable spectra can propagate errors from sequence searching, reduce the discriminating power of the spectral search engine, and induce false positive and false negative hits. The optimal quality level reflects the user's desired compromise between library comprehensiveness and library quality.

For details on the consensus and quality filter algorithms, please refer to Lam et al. (2008) Nature Methods 5, 873-875.

4. Appending artificial decoy spectra
spectrast -cAD -cc -cy1 -cNcons_ABC_Q_DECOY consABC_Q.splib

This will generate an equal-size decoy spectral library and append it to the real library consABC_Q.splib. The presence of decoys enables the estimation of false discovery rate (FDR) by the well-established decoy counting method, and improves the accuracy of PeptideProphet in validating spectral search results.

The algorithm of generating artificial decoy spectra is described in Lam et al. (2010) Journal of Proteome Research 9, 605-610.
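The decoy counting idea can be sketched in a few lines. The example assumes the decoy library is the same size as the real library, so the number of decoy hits above a score threshold approximates the number of false hits among the real entries; the input format and function name are hypothetical, and this is not the calculation performed by PeptideProphet.

```python
def estimate_fdr(hits, score_threshold):
    """Estimate FDR as (decoy hits) / (real hits) at or above the threshold.

    hits is a list of (score, is_decoy) pairs, one per top hit.
    """
    real = sum(1 for score, is_decoy in hits if score >= score_threshold and not is_decoy)
    decoy = sum(1 for score, is_decoy in hits if score >= score_threshold and is_decoy)
    if real == 0:
        return 0.0
    return min(1.0, decoy / real)


# Made-up top hits: (score, matched a decoy entry?)
top_hits = [
    (0.92, False), (0.88, False), (0.81, False), (0.77, False),
    (0.74, True), (0.52, False), (0.49, True), (0.45, True),
]
print(estimate_fdr(top_hits, 0.7))
print(estimate_fdr(top_hits, 0.4))
```

Raising the score threshold lowers the estimated FDR at the cost of accepting fewer identifications.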

Miscellaneous Features

SpectraST Parameter Files

Note: All options set in the parameter file will be overridden by command-line options, if specified.

SpectraST allows the use of parameter files to simplify the process of spectral library building and searching. Namely, desired options can be specified in a text file and supplied to SpectraST every time the same action is performed, saving the user from having to specify a lengthy list of command-line options. To invoke the parameter files, specify the options -sF<parameter file> and -cF<parameter file> for Search Mode and Create Mode, respectively. Example parameter files are provided below (these are essentially the defaults):

Search Mode: spectrast.params

Create Mode: spectrast_create.params
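The precedence between option sources can be pictured as a layered lookup. The sketch below assumes options have already been read into dictionaries keyed by their parameter-file names; the parsing itself is not shown, and the function name is hypothetical. It follows the two rules stated in this guide: command-line options override the parameter file, and options on '?' lines of a .list file override the command line.

```python
from collections import ChainMap


def effective_options(parameter_file, command_line, list_file=None):
    """Resolve options; earlier maps in the ChainMap take precedence."""
    return dict(ChainMap(list_file or {}, command_line, parameter_file))


from_parameter_file = {"libraryFile": "default.splib", "outputExtension": "xml"}
from_command_line = {"libraryFile": "bar.splib"}

print(effective_options(from_parameter_file, from_command_line))
```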

High mass accuracy MS2 Support

As of version 5.0, SpectraST supports the handling of high mass accuracy MS2 spectra (including higher-energy collisional dissociation, HCD, spectra). The fragmentation methods CID-QTOF and HCD are created to tag such spectra, and that information is automatically extracted from the data (.mzML/.mzXML) files for library building. When building libraries, SpectraST annotates and aligns CID-QTOF/HCD spectra differently, using a narrower tolerance and considering immonium ions and internal fragments. However, when searching, the user must still specify the "bin size" (equivalent to the product ion tolerance), as the mass accuracy can vary from instrument to instrument.

ETD Support

As of version 4.0, SpectraST supports the import and searching of MS2 spectra by electron-transfer dissociation (ETD). A tag encoding the fragmentation method is added to each library entry to differentiate between CID (collision-induced dissociation) and ETD spectra, and that information is automatically extracted from the data (.mzML/.mzXML) files (if specified therein) for both library building and searching. The user can also explicitly specify the fragmentation method when building a library, using the -cI option.

SpectraST annotates ETD spectra differently than CID spectra, and the spectral matching algorithm is slightly modified to deal with the charge-reduced precursor peaks common in ETD spectra.

Generation of Transition Lists for Selected/Multiple Reaction Monitoring (S/MRM)

Selected reaction monitoring (SRM, also known as Multiple Reaction Monitoring, MRM) is an acquisition mode available on some mass spectrometers (mostly triple quadrupole instruments) which has gained increasing popularity as a targeted quantification technique. It requires as input a list of “transitions” to be monitored over the course of the experiment. Each transition consists of two numbers, Q1 (the precursor m/z) and Q3 (the fragment ion m/z) and must be selected beforehand based on knowledge of the peptide to be quantified. One of the most effective strategies for selecting appropriate transitions is to rely on previously acquired MS2 spectra of the target peptides stored in spectral libraries.

As of version 4.0, SpectraST implements an algorithm to select the N (a user-specified number) most suitable transitions for each library spectrum, and print the list of transitions in a table format. For example, the command:

spectrast -cNfoo_MRM -cM -cQ5 foo.splib

will create a “reduced” spectral library foo_MRM.splib, in which each spectrum retains only the 5 peaks most suited as MRM transitions. In addition, a text file foo_MRM.mrm containing the transitions in a table format will be printed.
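To show the shape of a transition list, the sketch below pairs a precursor m/z (Q1) with the m/z of its N most intense fragment peaks (Q3). Ranking by intensity alone is a stand-in chosen for illustration; the criteria SpectraST actually uses to judge suitability are not described here, and the function name is hypothetical.

```python
def select_transitions(precursor_mz, peaks, n=5):
    """Return up to n (Q1, Q3) pairs from the most intense fragment peaks."""
    ranked = sorted(peaks, key=lambda peak: peak[1], reverse=True)
    return [(precursor_mz, fragment_mz) for fragment_mz, _intensity in ranked[:n]]


# Made-up library spectrum: (fragment m/z, intensity) pairs.
library_peaks = [(175.1, 20.0), (262.2, 100.0), (375.3, 55.0), (488.3, 70.0)]

for q1, q3 in select_transitions(523.8, library_peaks, n=2):
    print(q1, q3)
```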

A fully functional software tool for MRM experiment design, MaRiMba (Sherwood et al. (2009) Journal of Proteome Research 8, 4396-4405), which essentially wraps this transition selection algorithm of SpectraST as well as implements some surrounding functionalities, is freely available as part of the TPP and accessible via the Petunia web interface.

User-defined Modifications

As of version 4.0, users can specify allowable modifications in a text file. (Prior to 4.0, SpectraST internally maintains a list of common modifications and will reject identifications with unrecognized modifications.) This text file should contain space- or comma-delimited strings of the following form:

<token>|<monoisotopic mass change from unmodified amino acid>|<name of modification>

where <token> must be one of the amino acids in one-letter code (capitalized), n (the N terminus) or c (the C terminus), optionally followed by [<tag>], a user-defined short tag for that modification, and <name of modification> is a user-defined formal name of that modification which will be written to the library created.

An example of this file can be found in spectrast.usermods.
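The entry format described above can be parsed with a short routine such as the following. It is a sketch under the stated format only (space- or comma-delimited entries of three fields separated by '|'); the example entries are made up for illustration and are not taken from the distributed spectrast.usermods file, and the function name is hypothetical.

```python
import re

# One-letter amino acid, or n/c for the termini, optionally followed by [<tag>].
TOKEN_PATTERN = re.compile(r"^([A-Znc])(?:\[([^\]]+)\])?$")


def parse_usermods(text):
    """Parse '<token>|<mass change>|<name>' entries into dictionaries."""
    modifications = []
    for entry in re.split(r"[,\s]+", text.strip()):
        if not entry:
            continue
        fields = entry.split("|")
        if len(fields) != 3:
            raise ValueError(f"expected three '|'-separated fields: {entry!r}")
        token, mass_change, name = fields
        match = TOKEN_PATTERN.match(token)
        if match is None:
            raise ValueError(f"invalid modification token: {token!r}")
        modifications.append({
            "site": match.group(1),
            "tag": match.group(2),  # None when no [<tag>] is given
            "mass_change": float(mass_change),
            "name": name,
        })
    return modifications


example = "M[147]|15.994915|Oxidation, n|42.010565|Acetyl"
for modification in parse_usermods(example):
    print(modification)
```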

To activate the modifications specified in this file, add the option -M to the command line whenever identifications containing these modifications are to be processed. By default, SpectraST expects the file to be named spectrast.usermods and to reside in the current working directory. You can specify a different file by appending the file name (with path if necessary) after the -M.

Since a spectral library is meant to be a long-lasting and shared resource, special care should be taken in specifying new modifications. Accurate monoisotopic masses (preferably to at least 4 digits) should be used, and the name of the modification should follow HUPO-PSI convention as much as possible. Consult Unimod to see if your modification is already defined in HUPO-PSI standards.

Semi-empirical Spectrum Generation

As of version 4.0, SpectraST can generate semi-empirical spectra from real spectra of closely related identifications, simply by shifting peaks on the m/z axis wherever appropriate. This is useful when spectra of one modification state are used to build the library, but one wishes to match spectra of the same sequence but another modification state in spectral searching.

To do so, the user needs to turn on the build action of semi-empirical spectrum generation by the option -cAM, and specify what modifications are desired by the -cx<str> option. The <str> in the latter is a string of allowable modification tokens (no space in between). A modification token starts with a one-letter amino acid (A through Z, plus n or c for the termini), followed by [<tag>] if it is not the unmodified form.

For example, the command

spectrast -cNfoo_heavy -cAM -cx'K[134]R[162]' foo.splib

will create a library foo_heavy.splib containing the same peptide ions as foo.splib, but with all lysines and arginines being heavy (both +6 Da). Note that since K and R are not included in the string following -cx, unmodified K or R is not allowed to be used. This is commonly referred to as a static modification.

The command

spectrast -cNfoo_metox -cAM -cx'MM[147]' foo.splib

will create a library foo_metox.splib containing the same peptide ions as foo.splib, but with all possible permutations of normal and oxidized methionine. This is commonly referred to as a variable modification. Note that the string 'MM[147]' means that both M (normal methionine) and M[147] (oxidized methionine) are allowed where an M is present in a sequence.

The command

spectrast -cNfoo_binary -cAM -cx'{K}{K[134]}' foo.splib

specifies a “binary” modification: lysines on the same peptide must be either all light or all heavy. The curly braces ({ }) specify sets of allowable modification tokens; for the same peptide, only tokens from a single set can be used.

Note that while the modification tokens, such as M[147], contain only integer values (of the modified amino acids) as a tag, in calculations SpectraST will use the corresponding accurate mass stored for that particular modification type indicated by the token. Hence, for modifications currently not recognized by SpectraST, the user must define them (along with the accurate mass) in a text file using the -M option.
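The three kinds of -cx strings shown above (static, variable and binary) can be illustrated by enumerating the modified forms they allow for a given sequence. The sketch below works on plain, unmodified one-letter sequences and ignores terminal tokens and peak shifting entirely; it only mirrors the token rules described in this section, and the function names are hypothetical.

```python
import itertools
import re

# A residue letter (or n/c), optionally followed by [<tag>].
TOKEN_PATTERN = re.compile(r"[A-Znc](?:\[[^\]]+\])?")


def parse_token_sets(spec):
    """Split a -cx string into token sets; without braces it is one set."""
    groups = re.findall(r"\{([^}]*)\}", spec)
    if not groups:
        groups = [spec]
    return [TOKEN_PATTERN.findall(group) for group in groups]


def enumerate_forms(sequence, spec):
    """List every modified form of sequence allowed by the -cx string."""
    forms = []
    for tokens in parse_token_sets(spec):
        allowed = {}
        for token in tokens:
            allowed.setdefault(token[0], []).append(token)
        # Residues not mentioned in the set are left unchanged.
        choices = [allowed.get(residue, [residue]) for residue in sequence]
        for combination in itertools.product(*choices):
            form = "".join(combination)
            if form not in forms:
                forms.append(form)
    return forms


print(enumerate_forms("AKR", "K[134]R[162]"))  # static: every K and R heavy
print(enumerate_forms("MAM", "MM[147]"))       # variable: all M permutations
print(enumerate_forms("KAK", "{K}{K[134]}"))   # binary: all light or all heavy
```

The number of variable forms doubles with each additional modifiable residue, so libraries generated this way can grow quickly.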

SpectraST File List Feature

SpectraST allows the user to list the files to be processed in a text file with extension .list. This can be useful when the number of files to be processed is very large, possibly overwhelming the UNIX command line. It is also an easy way to queue up multiple SpectraST tasks and to keep track of them. For example, if the file job.list contains the lines:

# This is a comment line ignored by SpectraST.
? -sLfoo.splib   # '?' signals the start of a new job; options for this job follow the '?'
1.mzXML
2.mzXML

? -sLbar.splib
3.mzXML
4.mzXML
Note: One can mix Search jobs and Create jobs in the same .list file. Command-line options will be overridden by those specified in the .list file with lines preceded by ‘?’.

Then running the command:

spectrast -sFspectrast.params job.list

is equivalent to running

spectrast -sFspectrast.params -sLfoo.splib 1.mzXML 2.mzXML

followed by

spectrast -sFspectrast.params -sLbar.splib 3.mzXML 4.mzXML
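The expansion of a .list file into equivalent commands can be sketched as follows. The parser assumes that '#' always starts a comment and that files listed before any '?' line form a job with no extra options; both are assumptions not confirmed by the text above, and the function names are hypothetical.

```python
import shlex


def parse_list_file(text):
    """Return the jobs in a .list file as (options, files) pairs."""
    jobs = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        if line.startswith("?"):
            jobs.append((line[1:].split(), []))  # '?' starts a new job
        elif jobs:
            jobs[-1][1].append(line)
        else:
            jobs.append(([], [line]))  # assumption: job without '?' options
    return jobs


def equivalent_commands(command_line_options, text):
    """Expand each job into the command line it is equivalent to."""
    return [
        shlex.join(["spectrast", *command_line_options, *options, *files])
        for options, files in parse_list_file(text)
    ]


job_list = """# This is a comment line ignored by SpectraST.
? -sLfoo.splib   # options for this job follow the '?'
1.mzXML
2.mzXML

? -sLbar.splib
3.mzXML
4.mzXML
"""

for command in equivalent_commands(["-sFspectrast.params"], job_list):
    print(command)
```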

SpectraST Options

Commonly used options are shown in bold. The rest are advanced options that should rarely need to be used.

Search Mode Options
Command-line Token | Name in Parameter File | Meaning | Remarks
GENERAL OPTIONS
-s | None | Specify search mode. | Not needed when any other search options are set.
-sF<file> | None | Read search options from <file>. | If <file> is omitted, “spectrast.params” is assumed.
-sL<file> | libraryFile | Specify library file. | Mandatory unless specified in parameter file. <file> must have .splib extension.
-sD<file> | databaseFile | Specify a sequence database file. This will not affect the search in any way, but this information will be included in the output for any downstream data processing. | <file> must have .fasta extension. If not set, SpectraST will try to determine this from the preamble of the library.
-sT<type> | databaseType | Specify the type of the sequence database file. | -sTAA (default) = protein database; -sTDNA = genomic database.
-sR | indexCacheAll | Cache all entries in RAM. Requires a lot of memory (the library will usually be loaded almost in its entirety), but speeds up search for unsorted queries. | Turn on with -sR, off with -sR!. Default is off.
-sS<file> | filterSelectedListFileName | Only search a subset of the query spectra in the search file. Only query spectra with names matching a line of <file> will be searched. | Default is off (search all queries).
CANDIDATE SELECTION AND SCORING OPTIONS
-sM<tol> | indexRetrievalMzTolerance | Specify precursor m/z tolerance in Th. Monoisotopic mass is assumed. | Default is 3.0 Th.
-sA | indexRetrievalUseAverage | Use average mass instead of monoisotopic mass. | Turn on with -sA, off with -sA!. Default is off.
-sC<type> | expectedCysteineMod | Specify the expected kind of cysteine modification. Candidate library entries with a different kind of cysteine modification will be ignored. | -sCICAT_cl = cleavable ICAT; -sCICAT_uc = uncleavable ICAT; -sCCAM = Carbamidomethyl. Default is off (search all candidates).
-sc | ignoreSpectraWithUnmodCysteine | Ignore any candidate library entries with an unmodified cysteine. | Turn on with -sc, off with -sc!. Default is off.
-s_HOM<rank> | detectHomologs | Detect homologous lower hits up to <rank>. Looks for lower hits homologous to the top hit and adjusts delta accordingly. | Default is 4.
-s_NO1 | ignoreChargeOneLibSpectra | Ignore all library entries with +1 charge state. | Turn on with -s_NO1, off with -s_NO1!. Default is off.
-s_NOS | ignoreAbnormalSpectra | Ignore all spectra which have non-Normal status. | Turn on with -s_NOS, off with -s_NOS!. Default is off.
OUTPUT AND DISPLAY OPTIONS
-sE<ext> | outputExtension | Output format. The search result will be written to a file with the same base name as the search file, with extension <ext>. | -sEtxt = Fixed-width text format; -sExls = Tab-delimited text format; -sExml (default) or -sEpepXML = .pepXML format.
-sO<path> | outputDirectory | Specify a directory to hold the search output files. | Default: Same directory as the corresponding search data (.mzML/.mzXML) file.
-s_FV1<thres> | hitListTopHitFvalThreshold | Minimum F value threshold for the top hit. Only top hits having F value greater than <thres> will be printed. | Default = 0.0 (all top hits will be displayed).
-s_FV2<thres> | hitListLowerHitsFvalThreshold | Minimum F value threshold for the lower hits. Only lower hits having F value greater than <thres> will be printed. | Default = 0.45.
-s_SHH | hitListShowHomologs | Always display homologous lower hits regardless of F value. | Turn on with -s_SHH (need -s_HOM on), off with -s_SHH!. Default is on.
-s_SH1 | hitListOnlyTopHit | Only display the top hit for each query. | Turn on with -s_SH1, off with -s_SH1!. Default is on.
-s_SHM | hitListExcludeNoMatch | Do not display queries for which there is no candidate, or the top hit is below the minimum F value threshold specified with -s_FV1. | Turn on with -s_SHM, off with -s_SHM!. Default is on.
SPECTRUM FILTERING AND PROCESSING OPTIONS
-s_XNP<thres> | filterMinPeakCount | Require minimum number of peaks. All query spectra with fewer than <thres> peaks passing the intensity threshold set with -s_CNT will be removed. | Default is 10.
-s_XMZ<m/z> | filterAllPeaksBelowMz | Remove spectra with almost no peaks above a certain m/z value. All query spectra with 95%+ of the total intensity below <m/z> will be removed. | Default is 520.
-s_XIN<inten> | filterMaxIntensityBelow | Filter query spectra with no peaks with intensity above <inten>. | Default is 0.
-s_CNT<thres> | filterCountPeakIntensityThreshold | Minimum peak intensity for peaks to be counted. Only peaks with intensity above <thres> will be counted to meet the requirement for minimum number of peaks. | Default is 2.01.
-s_RNT<thres> | filterRemovePeakIntensityThreshold | Noise peak threshold. All peaks with intensities below <thres> will be zeroed. | Default is 2.01.
-s_R51<thres> | filterRemoveHuge515Threshold | Remove dominant peak at 515.3 Th. All dominant peaks near 515.3 Th (with intensity greater than <thres> of the total intensity of the spectrum) will be zeroed. | Default is off. Dominant 515.3 Th peaks are a common impurity artifact in cleavable ICAT experiments.
-s_RNP<num> | filterMaxPeaksUsed | Remove all but the top <num> peaks in query spectra. | Default is 150.
-s_RDR<num> | filterMaxDynamicRange | Remove all peaks smaller than 1/<num> of the base (highest) peak in query spectra. | Default is 1000.
-s_MZS<mzpow>, -s_INS<intpow> | peakScalingMzPower, peakScalingIntensityPower | Intensity scaling power with respect to the m/z value and the raw intensity. The scaled intensity will be (m/z)^<mzpow> * (raw intensity)^<intpow>. | Default is <mzpow> = 0.0, <intpow> = 0.5.
-s_UAS<factor> | peakScalingUnassignedPeaks | Scaling factor for unassigned peaks in library spectra. Unassigned peaks in the library spectra will be scaled by <factor>. | Default is 0.1.
-s_BIN<num> | peakBinningNumBinsPerMzUnit | Number of bins per Th. | Default is 1.
-s_NEI<frac> | peakBinningFractionToNeighbor | Fraction of the scaled intensity assigned to neighboring bins. | Default is 0.5.
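
Several of the spectrum processing options above amount to simple arithmetic on the peak list. The following Python sketch applies three of them to a list of (m/z, intensity) pairs, using the documented defaults: the dynamic range filter (-s_RDR), the top-N peak filter (-s_RNP), and the intensity scaling (-s_MZS, -s_INS). The function name and the order in which the steps are applied are assumptions made for illustration; SpectraST's actual implementation may differ.

```python
def preprocess_query_peaks(peaks, max_peaks=150, max_dynamic_range=1000.0,
                           mz_power=0.0, intensity_power=0.5):
    """Filter and scale (m/z, intensity) pairs with the documented defaults."""
    if not peaks:
        return []
    base = max(inten for _, inten in peaks)
    # -s_RDR: drop peaks smaller than 1/<num> of the base (highest) peak.
    kept = [(mz, inten) for mz, inten in peaks
            if inten >= base / max_dynamic_range]
    # -s_RNP: keep only the <num> most intense peaks.
    kept = sorted(kept, key=lambda p: p[1], reverse=True)[:max_peaks]
    # -s_MZS / -s_INS: scaled = (m/z)^mzpow * (raw intensity)^intpow.
    scaled = [(mz, (mz ** mz_power) * (inten ** intensity_power))
              for mz, inten in kept]
    return sorted(scaled)  # back to ascending m/z order


# Made-up peak list: the 4.0 peak is below 1/1000 of the base peak.
example = [(200.0, 4.0), (300.0, 10000.0), (400.0, 100.0)]
print(preprocess_query_peaks(example))
```

With the default powers (0.0 and 0.5), the scaled intensity is simply the square root of the raw intensity, which dampens the influence of a few dominant peaks.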


Create Mode Options
Command-line Token | Name in Parameter File | Meaning | Remarks
GENERAL OPTIONS (Applicable with any file input)
-c | None | Specify create mode. | Not needed when any other create options are set.
-cF<file> | None | Read create options from file <file>. | If <file> is omitted, "spectrast_create.params" is assumed.
-cN<name> | outputFileName | Specify output file name for .splib, .sptxt, .spidx and .pepidx files. | If not set, SpectraST will try to construct a sensible name.
-cT<file> | useProbTable | Use probability table in <file>. Only those peptide ions included in the table will be imported, and their probability adjusted optionally. | A probability table is a text file with one peptide ion in the format AC[160]DEFGHIK/2 per line. If a probability is supplied following the peptide ion separated by a tab, it will be used to replace the original probability of that library entry.
-cO<file> | useProteinList | Use protein list in <file>. Only those peptide ions associated with proteins in the list will be imported. | A protein list is a text file with one protein identifier per line. If a number X is supplied following the protein separated by a tab, then at most X peptide ions associated with that protein will be imported. Peptides with more replicates are favored.
-cm<remark> | remark | Remark. Add a Remark=<remark> comment to all library entries created. | Default is off.
-c_ANN | annotatePeaks | Annotate peaks. | Turn on with -c_ANN, off with -c_ANN!. Default is on.
-c_BIN | binaryFormat | Write library in binary format, which enables quicker search. | Turn on with -c_BIN, off with -c_BIN!. Default is on.
-c_DTA | writeDtaFiles | Write all library spectra as .dta files. | Turn on with -c_DTA, off with -c_DTA!. Default is off.
-c_MGF | writeMgfFiles | Write all library spectra as one .mgf file. | Default is off.
-c_PLT<crit> | plotSpectra | Plot the library spectra as they are created. | -c_PLT or -c_PLTALL = Plot every spectrum. -c_PLT<crit> = Plot spectrum when either the Status or the Spec comment value = <crit>.
PEPXML IMPORT OPTIONS (Applicable with .pepXML file input)
-cP<prob> | minimumProbabilityToInclude | Include all spectra identified with probability no less than <prob> in the library. | Default is 0.9.
-cn<name> | datasetName | Specify a dataset identifier for the file to be imported. | If not set, SpectraST will construct it from the path and the name of the .pepXML file.
-cI<type> | setFragmentation | Set the fragmentation type of all spectra, overriding existing information. | -cIETD = tag all library spectra as ETD spectra. Default is off (determined from the data files).
-cg | setDeamidatedNXST | Set all asparagines (N) in the motif NX(S/T) as deamidated (N[115]), and all asparagines not in the motif NX(S/T) as unmodified. Use for glycocaptured peptides. | Turn on with -cg, off with -cg!. Default is off.
-co | addMzXMLFileToDatasetName | Add the originating mzXML file name to the dataset identifier. Good for keeping track of the MS run in which the peptide is observed. | Turn on with -co, off with -co!. Default is off.
-c_NPK<num> | minimumNumPeaksToInclude | Exclude spectra with fewer than <num> peaks. | Default is 10.
-c_NAA<num> | minimumNumAAToInclude | Exclude spectra of peptide IDs shorter than <num> amino acids. | Default is 6.
-c_DCN<num> | minimumDeltaCnToInclude | Exclude spectra with deltaCn smaller than <num>. Useful for excluding spectra with indiscriminate modification sites. | Turn on with -c_DCN, off with -c_DCN!. Default is 0.0.
-c_RNT<thres> | rawSpectraNoiseThreshold | Absolute noise filter. Remove noise peaks with intensity below <thres> in imported spectra. | Default is 0.0.
-c_RDR<range> | rawSpectraMaxDynamicRange | Relative noise filter. Filter out noise peaks with intensity below 1/<range> of that of the highest peak. | Default is 100000.0.
-c_CEN | centroidPeaks | Centroid peaks as raw spectra are imported. | Designed mostly for Q-TOF spectra in profile mode.
-c_XAN | skipRawAnnotation | Skip the annotation of raw spectra as they are imported. | Annotation is quite slow and might be impractical if the number of imported spectra is enormous.
LIBRARY MANIPULATION OPTIONS (Applicable with .splib file input)
-cf<pred> | filterCriteria | Filter library by criteria. Keep only those entries satisfying the predicate <pred>. | <pred> should be in quotes in the form “<attr> <op> <value>”. <attr> can refer to any of the fields and any comment entries. <op> can be ==, !=, <, >, <=, >=, =~ and !~. Multiple predicates can be separated by either '&' (AND logic) or '|' (OR logic), but not both. Default is off.
-cJ | combineAction | Combine action. | -cJU = Union (default). Include all the peptide ions in all the files. -cJI = Intersection. Only include peptide ions that are present in all the files. -cJS = Subtraction. Only include peptide ions in the first file that are not present in any of the other files. -cJH = Subtraction of homologs. Only include peptide ions in the first file that do not have any homologs with similar m/z in any of the other files.
-cA | buildAction | Build action. | -cAB = Best replicate. Pick the best replicate of each peptide ion. -cAC = Consensus. Create the consensus spectrum of all replicate spectra of each peptide ion. -cAQ = Quality filter. Apply quality filters to library. -cAD = Decoy. Generate decoy spectra. -cAM = Semi-empirical. Generate semi-empirical spectra. Default is no build action: all spectra will be included as is.
-cD<file> | refreshDatabase | Refresh protein mappings against the database <file> in FASTA format. | Default is off.
-cu | refreshDeleteUnmapped | Delete entries whose peptide sequences do not map to any protein during refreshing with the -cD option. | Default is off.
-cd | refreshDeleteMultimapped | Delete entries whose peptide sequences map to multiple proteins during refreshing with the -cD option. | Default is off.
CONSENSUS SPECTRUM CREATION OPTIONS (Applicable with -cAC option)
-cr<num> | minimumNumReplicates | Minimum number of replicates required for each library entry. Peptide ions with fewer than <num> replicates will be excluded from the library when creating a consensus library. | Default is 1.
-c_DIS | removeDissimilarReplicates | Remove dissimilar replicates before creating consensus spectrum. | Turn on with -c_DIS, off with -c_DIS!. Default is on.
-c_QUO<frac> | peakQuorum | Specify peak quorum: the fraction of all replicates required to contain a certain peak. Peaks not present in enough replicates will be deleted. | Default is 0.6.
-c_XPU<num> | maximumNumPeaksUsed | Maximum number of peaks in each replicate to be considered in creating consensus. Only the <num> most intense peaks will be considered. | Default is 300.
-c_XNR<num> | maximumNumReplicates | Maximum number of replicates used to build consensus spectrum. | Default is 100.
-c_XPK<num> | maximumNumPeaksKept | De-noise single spectra by keeping only the most intense <num> peaks. | Default is 150. Will not affect consensus spectra of more than one replicate.
-c_WGT<score> | replicateWeight | Select the type of score to weigh and rank the replicates. | -c_WGTS (default) = Use a measure of signal-to-noise ratio as the weight. -c_WGTX = Use a function of the SEQUEST xcorr score as the weight. -c_WGTP = Use a function of the PeptideProphet probability as the weight.
BEST REPLICATE SELECTION OPTIONS (Applicable with -cAB option)
-cr<num> | minimumNumReplicates | Minimum number of replicates required for each library entry. Peptide ions with fewer than <num> replicates will be excluded from the library when creating a best-replicate library. | Default is 1.
-c_DIS | removeDissimilarReplicates | Remove dissimilar replicates before selecting best replicate. | Turn on with -c_DIS, off with -c_DIS!. Default is on.
QUALITY FILTER OPTIONS (Applicable with -cAQ option)
-cr<num> | minimumNumReplicates | Replicate quorum. Its value affects behavior of quality filter (see below). | Default is 1.
-cL<level>, -cl<level> | qualityLevelRemove, qualityLevelMark | Specify the stringency of the quality filter. -cL specifies the level for removal, -cl specifies the level for marking. | <level> = 0: No filter. <level> = 1: Remove/mark impure spectra. <level> = 2: Also remove/mark spectra with a spectrally similar counterpart in the library that is better. <level> = 3: Also remove/mark inquorate entries (defined with -cr) that share no peptide sub-sequences with any other entries in the library. <level> = 4: Also remove/mark all singleton entries. <level> = 5: Also remove/mark all inquorate entries (defined with -cr). Default is -cL2, -cl5.
-c_QP1 | qualityPenalizeSingletons | Apply stricter thresholds to singleton spectra during quality filters. | Turn on with -c_QP1, off with -c_QP1!. Default is on.
-c_QIP<thres> | qualityImmuneProbThreshold | Specify a probability above which library spectra are immune to quality filters. | Default is 0.999.
-c_QIE | qualityImmuneMultipleEngines | Make spectra identified by multiple sequence search engines immune to quality filters. | Turn on with -c_QIE, off with -c_QIE!. Default is on.
DECOY GENERATION OPTIONS (Applicable with -cAD option)
-cc | decoyConcatenate | Concatenate real and decoy libraries. | Default is off: library consisting of only decoy spectra is created.
-cy<num> | decoySizeRatio | Specify the (decoy / real) size ratio. | Default is 1. <num> must be an integer.
SEMI-EMPIRICAL SPECTRUM GENERATION OPTIONS (Applicable with -cAM option)
-cx<string> | allowableModTokens | Specify the set(s) of modifications allowed in semi-empirical spectrum generation by -cAM option. | Default is off: no semi-empirical spectrum generated.
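
The probability table accepted by -cT has a simple line format, which the following Python sketch reads into a dictionary. It is meant as a way to check a table before handing it to SpectraST, not as a description of how SpectraST parses the file. The function name parse_probability_table is hypothetical, and the validation (a peptide string, a slash, and an integer charge) is an assumption based on the AC[160]DEFGHIK/2 example.

```python
def parse_probability_table(text):
    """Map each peptide ion to its replacement probability, or None if absent."""
    table = {}
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        fields = raw.split("\t")
        ion = fields[0].strip()
        # Assumed ion format: <peptide>/<charge>, e.g. AC[160]DEFGHIK/2.
        peptide, sep, charge = ion.rpartition("/")
        if not sep or not peptide or not charge.isdigit():
            raise ValueError(f"not a peptide ion: {ion!r}")
        # An optional probability follows the ion, separated by a tab.
        has_prob = len(fields) > 1 and fields[1].strip()
        table[ion] = float(fields[1]) if has_prob else None
    return table


# Made-up table: the first ion gets a new probability, the second keeps its own.
example = "AC[160]DEFGHIK/2\t0.95\nPEPTIDEK/3\n"
print(parse_probability_table(example))
```

An ion mapped to None is one that would be imported with its original probability unchanged.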


Miscellaneous Options
Command-line Token | Name in Parameter File | Meaning | Remarks
-V | None | Verbose mode. More information displayed to console. | Default is off.
-Q | None | Quiet mode. | Default is off.
-L<file> | None | Specify name of log file. | Default is spectrast.log.
-M<file> | None | Activate user-defined modifications listed in <file>. | Default is off. If <file> is omitted, spectrast.usermods is assumed.

Other SpectraST Utilities

Plotspectrast

Plotspectrast is a spectrum viewer designed for SpectraST. It comes as two programs: a CGI that can be launched from a web page (e.g., from PepXMLViewer), and a stand-alone application. They are included in the TPP and no additional installation is necessary.

The most common use of Plotspectrast is for visualization of spectral matches from PepXMLViewer. When displaying SpectraST results, PepXMLViewer provides a link to invoke plotspectrast.cgi for each spectrum query. The query spectrum will be plotted as a "mirror image" of the best-matched library spectrum, enabling the user to quickly assess the quality of the match. Below the plot there is an ion table, and tables listing information about the library spectrum. The legend of the plots and ion table is as follows:

  • Library spectrum
    • Peak color: Red = Selected annotated peaks; Blue = Unannotated peaks
    • Label color: Red = Selected annotated peaks that have matched peaks in the query spectrum; Black = Unmatched peaks
  • Query spectrum
    • Peak color: Red = Peaks that match selected annotated peaks in the library spectrum; Black = Unmatched peaks
  • Ion table
    • Cell color: Red = Ions present in both spectra; Pink = Ions present in the library spectrum only; White = Ions present in neither spectrum

Various controls are available to the left of the plot to customize how the spectra are displayed:

  • X-Range: The range of X axis (the m/z values) displayed
  • MatchTol: The m/z tolerance within which a peak is considered matched between the library and query spectra. This affects the labeling and coloring of the peaks.
  • Y-Zoom: Zooming factor in the Y axis (the peak intensity).
  • BlankPrecRegion: Blank the region around the precursor m/z. (Note: in SpectraST searching, peaks in this region are ignored.)
  • Annotation Options
    • LabelType: Toggling between displaying the ion type, the m/z value, or no label for selected annotated peaks
    • NumPeaks: The number of peaks considered for labeling, from the highest peak down
    • MinInten: The minimum intensity for a peak to be labeled
    • Ions a, b, y (+1, +2, +3): Whether or not to label that particular type of ion of that charge state
    • -H2O/-NH3/-P: Whether or not to label water/ammonia/phosphate neutral loss peaks of fragment ions
    • Prec losses: Whether or not to label neutral loss peaks of the precursor
    • All: Whether or not to label all annotated peaks
    • ColorAll: Whether to color all the annotated peaks regardless of label selection

The stand-alone plotspectrast application produces a static .png image in the same directory as the query spectrum file. It has the following usage:

plotspectrast <.splib file> <library file offset> <.mzXML file> <query scan number>

Plots the library spectrum at <library file offset> and the query spectrum of <query scan number> in the .mzXML file. The desired value of <library file offset> can be extracted from the .spidx, .pepidx or .sptxt file (BinaryFileOffset in the Comment field).
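
Looking up the offset by hand is tedious for more than a few spectra. The Python sketch below collects the BinaryFileOffset of every entry in the text of a .sptxt file. It assumes that each entry has a line beginning with "Name:" followed by a line beginning with "Comment:", and that the Comment field holds space-separated key=value pairs as in the msp format. The function name and the sample entries are made up for illustration.

```python
import re

# Assumption: the Comment field holds space-separated key=value pairs.
_OFFSET = re.compile(r"(?:^|\s)BinaryFileOffset=(\d+)")


def library_offsets(sptxt_text):
    """Map each entry name to the BinaryFileOffset found in its Comment line."""
    offsets = {}
    name = None
    for line in sptxt_text.splitlines():
        if line.startswith("Name:"):
            name = line[len("Name:"):].strip()
        elif line.startswith("Comment:") and name is not None:
            match = _OFFSET.search(line[len("Comment:"):])
            if match:
                offsets[name] = int(match.group(1))
    return offsets


# Made-up excerpt of a .sptxt file.
sample = """\
Name: AC[160]DEFGHIK/2
Comment: BinaryFileOffset=4321 Prob=0.9900
Name: PEPTIDEK/3
Comment: Prob=0.9500 BinaryFileOffset=98765
"""

for entry, offset in library_offsets(sample).items():
    print(entry, offset)
```

The value found for an entry is what would be passed as <library file offset> in the usage line above.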

plotspectrast <.splib file> <library file offset> <.dta file>

Similar to above, except the query spectrum is in a .dta file.

plotspectrast <.splib file> <library file offset> <.none file>

Plots the library spectrum by itself. It will not actually look for the .none file, but the resulting image file will be named with the same prefix as the .none file and placed in the same directory.

plotspectrast <.msp file of library spectrum> <.msp, .dta, or .none file>

Similar to above, except the library spectrum is given in a .msp file.

plotspectrast <.splib file 1> <library file offset 1> <.splib file 2> <library file offset 2>

Plots two library spectra head to tail.

Lib2HTML

Lib2HTML is an application that converts a SpectraST library into an HTML file for viewing. It is included in the TPP and no additional installation is necessary. In the resulting HTML file, replicates of the same peptide ion will be listed on one row, and links are provided to each replicate to view the spectrum using Plotspectrast. The usage is:

Lib2HTML <options> <full path from webserver root to .splib file>

Options include:

  • -V : Verbose. Displays more information for each entry.
  • -N<num> : Specify the maximum number of replicates displayed for each unique peptide ion. Default is 10.
  • -P<path> : Specify the full path from the webserver root to the plotspectrast.cgi binary.

Developer's Guide

The SpectraST source code contains detailed documentation.

sptxt file format:

The sptxt file format is very closely related to the msp format, whose documentation can be found here.

Annotation syntax:

SpectraST's syntax to annotate a fragment follows the scheme proposed by Roepstorff and Fohlman.

An annotation tag starts with the assigned ion type (a, b, c, x, y or z), followed by the number of amino acid residues present in the fragment. This number may be followed by a signed integer value indicating a modification. Besides post-translational modifications, neutral losses such as loss of water (-18) and loss of ammonia (-17) are also expressed this way. The caret symbol '^' followed by an integer value denotes the charge state of the fragment; its absence indicates a singly charged fragment ion. An additional 'i' at the end of the annotation tag implies that the mass value does not correspond to the expected mass value of the monoisotopic peak, but can be assigned to a different isotopic peak of the fragment. Finally, the annotation pattern contains the average mass deviation (in Da) from the theoretically expected mass. A slash '/' precedes this number.

The list of possible annotations is ordered by ascending charge states, where ties are broken by ascending mass deviations.

Annotation tags can be enclosed by square brackets, indicating that several peaks could be assigned the same particular ion. Usually, SpectraST would resolve such a situation by annotating only one of the ions and leaving the other ones blank. If data is not (sufficiently) centroided, this strategy might lead to a bunch of unresolved peaks, which might in turn cause quality filters to fail. To circumvent this problem, if there are additional intense peaks that look to be the same ion, a bracketed annotation will be given to them.

Besides annotations following the Roepstorff/Fohlman notation SpectraST also assigns immonium ions. The corresponding tag consists of 3 capital letters, always starting with an 'I' (for immonium), followed by the amino acid and an additional letter to designate different residue-specific ions from that amino acid.
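
The tag syntax described above can be captured in a regular expression. The Python sketch below parses a single Roepstorff/Fohlman-style tag into its parts. The grammar is inferred from the prose description only, so the exact pattern, the handling of square brackets, the names parse_annotation and FragmentAnnotation, and the example tags are all assumptions. Immonium tags are not handled and yield None.

```python
import re
from typing import NamedTuple, Optional


class FragmentAnnotation(NamedTuple):
    ion_type: str       # one of a, b, c, x, y, z
    num_residues: int   # number of residues in the fragment
    mass_shift: int     # signed integer after the residue count, 0 if absent
    charge: int         # 1 if no '^<charge>' is present
    isotope: bool       # True if the tag carries a trailing 'i'
    deviation: float    # mass deviation in Da, given after the '/'
    bracketed: bool     # True if the tag was enclosed in square brackets


# Pattern inferred from the description of the annotation syntax.
_TAG = re.compile(
    r"^(?P<type>[abcxyz])(?P<n>\d+)(?P<shift>[+-]\d+)?"
    r"(?:\^(?P<z>\d+))?(?P<iso>i)?/(?P<dev>[+-]?\d+(?:\.\d+)?)$"
)


def parse_annotation(tag: str) -> Optional[FragmentAnnotation]:
    """Parse one annotation tag; return None if it does not fit the pattern."""
    bracketed = tag.startswith("[") and tag.endswith("]")
    core = tag[1:-1] if bracketed else tag
    match = _TAG.match(core)
    if match is None:
        return None
    return FragmentAnnotation(
        ion_type=match["type"],
        num_residues=int(match["n"]),
        mass_shift=int(match["shift"]) if match["shift"] else 0,
        charge=int(match["z"]) if match["z"] else 1,
        isotope=match["iso"] is not None,
        deviation=float(match["dev"]),
        bracketed=bracketed,
    )


# Made-up tags following the described syntax.
for tag in ["y7-18^2i/0.03", "b5/-0.10", "[y3/0.02]"]:
    print(tag, parse_annotation(tag))
```

Returning None for anything that does not match keeps the sketch usable on mixed annotation lists, where unrecognized tags can simply be skipped or reported.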

More tips to developers who want to modify SpectraST will be available shortly.

Where to Get Help

The SPC Tools Discussion Group: spctools-discuss.googlegroups.com

The SPC Tools Announcement Group: spctools-announce.googlegroups.com

Public spectral libraries are available for download at PeptideAtlas

External Links

Reference

  • Keller, Andrew, et al. (2005). "A uniform proteomics MS/MS analysis platform utilizing open XML file formats". Molecular Systems Biology 1, 17.
  • Lam, Henry, et al. (2007). "Development and validation of a spectral library searching method for peptide identification from MS/MS". Proteomics 7 (5), 655-667.
  • Craig, Robertson, et al. (2006). "Using annotated peptide mass spectrum libraries for protein identification". Journal of Proteome Research 5 (8), 1843-1849.
  • Frewen, Barbara, et al. (2006). "Analysis of peptide MS/MS spectra from large-scale proteomics experiments using spectrum libraries". Analytical Chemistry 78 (16), 5678-5684.
  • Lam, Henry, et al. (2008). "Building consensus spectral libraries for peptide identifications in proteomics". Nature Methods 5, 873-875.
  • Picotti, Paola, et al. (2008). "A database of validated assays for the targeted mass spectrometric analysis of the S. cerevisiae proteome". Nature Methods 5, 913-914.
  • Sherwood, Carly, et al. (2009). "MaRiMba: A software application for spectral library-based MRM transition list assembly". Journal of Proteome Research 8, 4396-4405.
  • Lam, Henry, et al. (2010). "Artificial decoy spectral libraries for false discovery rate estimation in spectral library searching in proteomics". Journal of Proteome Research 9, 605-610.