Software:SpectraST

From SPCTools

(Difference between revisions)
Revision as of 09:50, 28 August 2008
Henrylam (Talk | contribs)

← Previous diff
Current revision
Henrylam (Talk | contribs)
(What's new in SpectraST 4.0)
Line 1: Line 1:
'''SpectraST''' (short for "'''Spectra''' '''S'''earch '''T'''ool" and rhymes with "contrast") is a spectral library building and searching tool designed primarily for shotgun proteomics applications. It is developed at the [http://www.systemsbiology.org/ Institute for Systems Biology (ISB)], in the research group of Professor Ruedi Aebersold. The main developer is Henry Lam.
-The latest version of SpectraST is 3.0, released in August 2008. It is distributed by ISB under the LGPL license, as a component of the Trans Proteomic Pipeline (TPP) suite of software, distributed under the same license. The source code repository is at [http://www.sourceforge.net/projects/sashimi], and the official download site for the Windows installer is at [http://tools.proteomecenter.org/software.php]. +The latest version of SpectraST is 5.0, released as a beta in November 2013 and officially with TPP 4.7 in March 2014. It is distributed by ISB under the LGPL license, as a component of the Trans Proteomic Pipeline (TPP) suite of software, distributed under the same license. The source code repository is at [http://www.sourceforge.net/projects/sashimi], and the official download site for the Windows installer is at [http://tools.proteomecenter.org/software.php].
== Introduction to Shotgun Proteomics and Spectral Searching ==
Line 9: Line 9:
Traditionally, the inference of the peptide sequence from its characteristic tandem mass spectra is done by sequence (database) searching. In sequence searching, a target protein (or translated DNA) database is used as a reference to generate all possible putative peptide sequences by in silico digestion. The search engines then use various rules to predict the theoretical fragmentation pattern of each of these putative peptides, and compare the experimentally observed MS/MS spectra to these theoretical spectra one-by-one. Presumably, a positive identification is made if the experimental spectrum is sufficiently similar to one of the theoretical spectra. Several popular computational tools developed for this purpose have emerged over the years, each employing different algorithms and heuristics to achieve an acceptable balance of sensitivity and accuracy. Unfortunately, traditional sequence searching is a challenging, error-prone, and computationally expensive exercise. Despite the tremendous improvement in computer hardware and software over the past decade, this step often remains the bottleneck of any given proteomics experiment. The requirement of computational resources is also substantial, limiting the use of this powerful technique to only those research groups that can afford the costly computational infrastructure.
-Spectral searching is an alternative approach that promises to address some of the shortcomings of sequence searching. In spectral searching, a spectral library is meticulously compiled from a large collection of previously observed and identified peptide MS/MS spectra. The unknown spectrum can then be identified by comparing it to all the candidates in the spectral library for the best match. This approach has been commonly employed for mass spectrometric analysis of small molecules with great success, but has only become possible for proteomics very recently. The chief difficulty, that of generating enough high-quality experimental spectra for compilation into spectral libraries, has been overcome by the recent explosion of proteomics data and the availability of public data repositories. Several attempts at creating and searching spectral libraries in the context of proteomics have been published within the past year, all demonstrating the tremendous improvement in search speed and the great potential of this method in complementing, if not replacing, sequence searching in many proteomics applications. +Spectral searching is an alternative approach that promises to address some of the shortcomings of sequence searching. In spectral searching, a spectral library is meticulously compiled from a large collection of previously observed and identified peptide MS/MS spectra. The unknown spectrum can then be identified by comparing it to all the candidates in the spectral library for the best match. This approach has been commonly employed for mass spectrometric analysis of small molecules with great success, but has only become possible for proteomics very recently. The main difficulty of generating enough high-quality experimental spectra for compilation into spectral libraries has been overcome by the recent explosion of proteomics data and the availability of public data repositories. Several attempts at creating and searching spectral libraries in the context of proteomics have been published within the past year, all demonstrating the tremendous improvement in search speed and the great potential of this method in complementing, if not replacing, sequence searching in many proteomics applications.
=== Advantages of Spectral Searching ===
Line 22: Line 22:
== Versions ==
 +
 +=== What's new in SpectraST 5.0 ===
 +
 +* New, rank-based similarity scoring function (Old scoring function remains as an option)
 +* High mass accuracy MS2 (including HCD) support
 +* Spectral archive (unidentified spectral library) building
 +* Biological sample fingerprinting by spectral archives ''(Contributor: Dr. Wenguang Shao)''
 +* Open (blind) modification search ''(Contributor: Dr. Manson Ma)''
 +* Improved decoy generation, including alternative method by precursor swapping
 +* Semi-empirical spectrum generation for amino acid substitutions ''(Contributor: Dr. Yingwei Hu)''
 +* De-noising based on Bayesian classifier ''(Contributor: Dr. Wenguang Shao)''
 +* Retention time normalization using injected landmark peptides
 +* Support for glycopeptides ''(Contributor: Dr. Yingwei Hu)''
 +
 +=== What's new in SpectraST 4.0 ===
 +
 +* ETD support
 +* iProphet support
 +* Decoy spectrum generation
 +* MRM transition list generation
 +* User-defined modifications
 +* Semi-empirical spectrum generation from real spectrum of closely related identification ''(Contributor: Dr. Yingwei Hu)''
 +* Searching .mgf files
 +* Clickable (HTML) search output format
 +* Better book-keeping in library building
 +* Various bug fixes and performance enhancements
 +
 +=== What's new in SpectraST 3.1 ===
 +
 +* Re-mapping peptide identifications of library entries to protein sequence database of choice
 +* Rudimentary centroiding for imported spectra in profile mode
 +* mzML support via TPP
 +* Various bug fixes and performance enhancements
=== What's new in SpectraST 3.0 ===
Line 52: Line 85:
SpectraST is an integral component of the Trans Proteomic Pipeline suite of software. Although it can be used alone without other TPP components, SpectraST users are strongly encouraged to download and install the entire TPP suite, which provides other useful functionalities such as raw data importation, automatic validation of search results, protein inference, and quantification and visualization.
-Windows users: SpectraST is available as part of TPP for Windows, which is run in the cygwin (UNIX emulator) environment. Download the [http://tools.proteomecenter.org/software/TPP_Cygwin_Setup.exe cygwin installer] and follow [[TPP:Windows_Cygwin_Installation|installation instructions]].+Windows users: SpectraST is available as part of TPP for Windows. A one-click installer is available, in which Windows-native executables are compiled by MinGW.
-UNIX/LINUX users: Visit the [http://sourceforge.net/project/showfiles.php?group_id=69281&package_id=126912 Sashimi project page] on SourceForge.net, and download the code as a tarball directly. Compiling, installation and configuration information is available in the README file. +UNIX/LINUX users: Visit the [http://sourceforge.net/project/showfiles.php?group_id=69281&package_id=126912 Sashimi project page] on SourceForge.net, and download the code as a tarball directly. Compiling, installation and configuration information is available in the README file. Alternatively, follow the [[TPP_4.2.1:_Installing_on_Ubuntu_9.04|instructions]] for Ubuntu LINUX installation.
=== Running SpectraST ===
Line 60: Line 93:
SpectraST has two modes, the Create mode and the Search mode. In the former, SpectraST creates a searchable spectral library from various formats to prepare for searching. In the latter, SpectraST takes in unknown spectra and searches each of them against the spectral library.
-The simplest way of running SpectraST is from the command line of your UNIX/LINUX or cygwin shell. The general usage is:+The simplest way of running SpectraST is from the command line of your UNIX/LINUX or Windows cmd shell. The general usage is:
<code>spectrast <options> <list of files of appropriate formats></code>
Line 72: Line 105:
SpectraST can perform spectral searching from the following data formats:
 +* [http://www.psidev.info/index.php?q=node/257 .mzML] format
* [[Formats:mzXML|.mzXML]] (all versions) format
* .mzData format
 +* .mgf (Mascot Generic) format
* .dta (SEQUEST) format, a simple peak list preceded by precursor information
* NIST (National Institute of Standards and Technology)’s .msp format
-It requires a spectral library in SpectraST’s .splib format, which can be created in [[#SpectraST Create Mode|SpectraST Create Mode]]. +To search, the spectral library must be in SpectraST’s <code>.splib</code> format, which can be created in [[#SpectraST Create Mode|SpectraST Create Mode]].
The results can be outputted to the following formats:
Line 84: Line 119:
* .txt format, a fixed-width column text format
* .xls format, a tab-delimited column text format
 +* .html format, an HTML table with clickable links to a spectrum viewer
-The search mode is initiated with the option -s, or any of the search mode options. For instance, to search the MS/MS spectra in the file foo.mzXML against the spectral library <code>bar.splib</code>, using the parameters specified in the file spectrast.params, the command is simply:+The search mode is initiated with the option <code>-s</code>, or any of the search mode options. For instance, to search the MS/MS spectra in the file foo.mzXML against the spectral library <code>bar.splib</code>, using the parameters specified in the file spectrast.params, the command is simply:
<div class="messagebox" style="float: right; width: 200px; border: thin solid #DDDDFF; padding: 10px; margin-left: 10px;"> Note: If the library is not specified in the parameter file or if the parameter file is not given, then the option -sL is mandatory; otherwise SpectraST will not know which spectral library to use.</div>
<code>spectrast -sFspectrast.params -sLbar.splib foo.mzXML</code>
-In the above, -sF and -sL are search mode options that the user can specify to customize the behavior of SpectraST. SpectraST will search all the MS/MS spectra in the file <code>foo.mzXML</code> against the spectral library <code>bar.splib</code>, using the parameters specified in the file <code>spectrast.params</code>. The result will be written to a file named <code>foo.<ext></code> in the same directory where <code><ext></code> specifies the output format (.xml, .txt or .xls).+In the above, <code>-sF</code> and <code>-sL</code> are search mode options that the user can specify to customize the behavior of SpectraST. SpectraST will search all the MS/MS spectra in the file <code>foo.mzXML</code> against the spectral library <code>bar.splib</code>, using the parameters specified in the file <code>spectrast.params</code>. The result will be written to a file named <code>foo.<ext></code> in the same directory where <code><ext></code> specifies the output format (.pep.xml, .txt, .xls, or .html).
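As a rough illustration of how the pieces of the search command fit together, the following Python sketch assembles the argument list and predicts the name of the result file. The helper names <code>build_search_command</code> and <code>expected_output</code> are hypothetical and not part of SpectraST; only the <code>-sF</code> and <code>-sL</code> options, the listed output extensions, and the <code>foo.<ext></code> naming rule are taken from the text.

```python
from pathlib import Path

# Output extensions listed in the text for search results.
OUTPUT_EXTENSIONS = (".pep.xml", ".txt", ".xls", ".html")


def build_search_command(params_file, library, data_file):
    """Return the argument list for a SpectraST search (hypothetical helper).

    Option values are appended directly to the flag, with no space in between.
    """
    return ["spectrast", f"-sF{params_file}", f"-sL{library}", data_file]


def expected_output(data_file, ext):
    """Return the result path: same directory and base name, new extension."""
    if ext not in OUTPUT_EXTENSIONS:
        raise ValueError(f"unsupported output extension: {ext}")
    path = Path(data_file)
    return str(path.with_name(path.stem + ext))


command = build_search_command("spectrast.params", "bar.splib", "foo.mzXML")
result_file = expected_output("foo.mzXML", ".pep.xml")
```

The list in <code>command</code> could be handed to <code>subprocess.run</code> on a machine where <code>spectrast</code> is on the search path.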
For a full list of options, see [[#SpectraST Options|SpectraST Options]].
Line 117: Line 153:
<code><mzXML file name>.<start scan>.<end scan>.<charge></code>
in the .pepXML file, so that SpectraST knows where to find the corresponding experimental spectrum. (If the .pepXML file is created with TPP tools, this should not be an issue.) </div> in the .pepXML file, so that SpectraST knows where to find the corresponding experimental spectrum. (If the .pepXML file is created with TPP tools, this should not be an issue.) </div>
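The query naming convention described in this note can be checked with a few lines of Python. This is only a sketch: the function name <code>parse_query_name</code> is hypothetical, and it assumes the file-name part may itself contain dots, so the three numeric fields are split off from the right.

```python
from typing import NamedTuple


class SpectrumQuery(NamedTuple):
    base_name: str
    start_scan: int
    end_scan: int
    charge: int


def parse_query_name(name):
    """Split '<file name>.<start scan>.<end scan>.<charge>' into its fields."""
    # Split from the right so that dots inside the file name are preserved.
    parts = name.rsplit(".", 3)
    if len(parts) != 4:
        raise ValueError(f"query name does not follow the convention: {name}")
    base_name, start_scan, end_scan, charge = parts
    # int() raises ValueError if any of the numeric fields is malformed.
    return SpectrumQuery(base_name, int(start_scan), int(end_scan), int(charge))


query = parse_query_name("foo.00123.00123.2")
```

A name that fails to parse this way is a likely sign that SpectraST would be unable to locate the corresponding spectrum.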
-SpectraST can create a spectral library from a .pepXML file, which contains peptide identifications from a previous shotgun proteomics experiment. For this purpose, it is preferable that the .pepXML has been processed with [[Software:PeptideProphet|PeptideProphet]], such that all the search hits have probabilities assigned. When importing from a .pepXML file, SpectraST scans through the .pepXML file for confident identifications, and attempts to extract the corresponding experimental spectra from .mzXML files. For instance, the command+SpectraST can create a spectral library from a .pepXML file, which contains peptide identifications from a previous shotgun proteomics experiment. For this purpose, it is preferable that the .pepXML has been processed with [[Software:PeptideProphet|PeptideProphet]] and/or [[Software:iProphet|iProphet]], such that all the search hits have probabilities assigned. (iProphet probabilities are used over PeptideProphet ones if both are present.)
 + 
 +When importing from a .pepXML file, SpectraST scans through the .pepXML file for confident identifications, and attempts to extract the corresponding experimental spectra from .mzXML files. For instance, the command
<code>spectrast -cNraw -cP0.9 dataset1.xml</code>
will import all peptide identifications with probability at or above 0.9 from the file <code>dataset1.xml</code>, and put them in a library called <code>raw.splib</code> (with the accompanying <code>raw.sptxt</code>, <code>raw.spidx</code> and <code>raw.pepidx</code> files).
- 
For a full list of SpectraST options, see [[#SpectraST Options|SpectraST Options]].
Line 198: Line 235:
This will run the consensus spectra through SpectraST's quality filters. With the default settings, spectra failing either or both of the first 2 filters will be removed, and spectra failing any of the other filters will be marked. Different quality levels can be set with the options <code>-cL</code> and <code>-cl</code>. It is recommended that a consensus spectral library is subject to some quality control before using it in spectral searching; the optimal quality level reflects the user's desired compromise between library comprehensiveness and library quality. This is to minimize mis-identified and low-quality spectra in the library. These questionable spectra can propagate errors from sequence searching, reduce the discriminating power of the spectral search engine, and induce false positive and false negative hits.
-=== SpectraST Parameter File ===+For details on the consensus and quality filter algorithms, please refer to Lam ''et al. '' (2008) ''Nature Methods'' '''5''', 873-875.
 + 
 +4. Appending artificial decoy spectra<br>
 +<code>spectrast -cAD -cc -cy1 -cNcons_ABC_Q_DECOY</code>
 + 
 +This will generate an equal-size decoy spectral library and append it to the real library <code>consABD_Q.splib</code>. The presence of decoys enables the estimation of false discovery rate (FDR) by the well-established decoy counting method, and improves the accuracy of PeptideProphet in validating spectral search results.
 + 
 +The algorithm of generating artificial decoy spectra is described in Lam ''et al. '' (2010) ''Journal of Proteome Research'' '''9''', 605-610.
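The decoy counting idea mentioned above can be sketched in Python as follows. This is a simplified illustration, not SpectraST or PeptideProphet code: the function name <code>estimate_fdr</code> and the <code>(score, is_decoy)</code> input layout are assumptions, and the estimate relies on the decoy library being the same size as the real library.

```python
def estimate_fdr(hits, threshold):
    """Estimate the FDR among target hits scoring at or above the threshold.

    hits is a sequence of (score, is_decoy) pairs, one per top-ranked match.
    With an equal-size decoy library, the number of decoy hits above the
    threshold approximates the number of false target hits above it.
    """
    hits = list(hits)
    targets = sum(1 for score, is_decoy in hits if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in hits if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets


# Illustrative scores only; these are not real search results.
example_hits = [(0.9, False), (0.8, False), (0.7, True), (0.6, False), (0.5, True)]
fdr_at_0_6 = estimate_fdr(example_hits, 0.6)
```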
 + 
 +=== Miscellaneous Features ===
 + 
 +==== SpectraST Parameter Files ====
<div class="messagebox" style="float: right; width: 200px; border: thin solid #DDDDFF; padding: 10px; margin-left: 10px;">Note: All options set in the parameter file will be overridden by command-line options, if specified.</div>
Line 207: Line 255:
Create Mode: [[Spectrast_create.params|spectrast_create.params]]
-=== SpectraST File List Feature ===+==== High Mass Accuracy MS2 Support ====
 + 
 +As of version 5.0, SpectraST supports the handling of high mass accuracy MS2 spectra (including higher-energy collisional dissociation, HCD, spectra). The fragmentation methods CID-QTOF and HCD are created to tag such spectra, and that information is automatically extracted from the data (.mzML/.mzXML) files (if specified therein) for library building. The user can also explicitly specify the fragmentation method, using the -cI option.
 + 
 +When building libraries, SpectraST annotates and aligns CID-QTOF/HCD spectra differently, using a narrower tolerance and considering immonium ions and internal fragments. However, when searching, the user must still specify the "bin size" (equivalent to the product ion tolerance) as the mass accuracy can vary from instrument to instrument.
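To make the role of the "bin size" more concrete, here is a minimal Python sketch of placing peaks into fixed-width m/z bins and comparing two binned spectra with a normalized dot product. It is an assumption-laden illustration: the function names are hypothetical, and the text does not state that SpectraST bins or scores spectra in exactly this way.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_size):
    """Sum peak intensities into m/z bins of width bin_size.

    peaks is a list of (mz, intensity) pairs.
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_size)] += intensity
    return dict(binned)


def normalized_dot(a, b):
    """Return the normalized dot product (cosine) of two binned spectra."""
    dot = sum(value * b.get(index, 0.0) for index, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two peaks 0.28 Th apart: matched with 1.0 Th bins, unmatched with 0.1 Th bins.
query = [(100.02, 10.0)]
library = [(100.30, 10.0)]
wide = normalized_dot(bin_spectrum(query, 1.0), bin_spectrum(library, 1.0))
narrow = normalized_dot(bin_spectrum(query, 0.1), bin_spectrum(library, 0.1))
```

The example shows why the appropriate bin size depends on the instrument: a bin that is too wide lets unrelated peaks match, while one that is too narrow separates peaks that differ only by measurement error.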
 + 
 +==== Spectral Archive (Unidentified Spectral Library) Building ====
 + 
 +As of version 5.0, SpectraST can also build and search spectral libraries without identifications. Such libraries are also referred to as spectral archives. The spectral archive building procedure is as follows:
 + 
 +<code>spectrast -cNraw_unid foo.mzXML</code>
 + 
 +This imports all MS2 spectra in foo.mzXML into the library raw_unid.splib, with some filtering and merging. Each spectrum will be given a unique identifier starting with an underscore '_' in place of the peptide identification.
 + 
 +<code>spectrast -cNclustered_unid -cAS raw_unid.splib</code>
 + 
 +This performs spectral clustering, such that replicate spectra of similar precursor m/z and high spectral similarity are detected and merged into consensus spectra.
 + 
 +Advanced options <code>-c_UCR</code>, <code>-c_UCD</code>, <code>-c_UX1</code>, <code>-c_UNP</code> and <code>-c_USX</code> control some tunable parameters of this procedure.
 + 
 +Spectral archives can be searched by SpectraST in the same way as identified spectral libraries, such as for biological sample fingerprinting (see below). Please note that results obtained by searching spectral archives may not be recognized by downstream TPP tools.
 + 
 +==== Biological Sample Fingerprinting by Spectral Archives ====
 + 
 +Spectral archives can be used for biological sample fingerprinting. As of version 5.0, upon searching such libraries, SpectraST can output a text file reporting "dataset similarities" calculated from counting spectral matches of library spectra from different sources (with <code>-s_FIN</code> option). The details of this algorithm and the instructions can be found in [http://www.ncbi.nlm.nih.gov/pubmed/24625782 Onders ''et al. '' (2014) ''Nature Protocols'' '''9''', 842-850].
 + 
 +==== Open (Blind) Modification Search ====
 + 
 +As of version 5.0, SpectraST can perform open (blind) modification search using the algorithm published in [http://www.ncbi.nlm.nih.gov/pubmed/24661115 Ma ''et al. '' (2014) ''Journal of Proteome Research'' '''13''', 2262-71].
 + 
 +==== De-noising by Bayesian Classifier ====
 + 
 +As of version 5.0, SpectraST implements the algorithm published in [http://www.ncbi.nlm.nih.gov/pubmed/23675732 Shao ''et al. '' (2013) ''Journal of Proteome Research'' '''12''', 3223-32] for de-noising singleton library spectra.
 + 
 +==== Retention Time Normalization using Injected Landmark Peptides ====
 + 
 +As of version 5.0, SpectraST can make use of injected standard peptides to calculate a normalized retention time index (iRT) to be included in the library entry, which removes the variability due to LC settings and columns. This enables retention time information to be used in SWATH/SRM assays, among other uses. It implements two different ways of calculating iRT, linear regression and linear interpolation.
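The two ways of calculating iRT named above, linear regression and linear interpolation, can be sketched in Python as follows. This is a generic illustration rather than SpectraST's implementation: the function names, the <code>(observed_rt, reference_irt)</code> landmark layout, and the example values are all assumptions, and the landmarks are assumed to have distinct observed retention times.

```python
from bisect import bisect_right


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def irt_by_regression(rt, landmarks):
    """Map an observed retention time to iRT with one line fitted to all landmarks."""
    slope, intercept = fit_line([obs for obs, _ in landmarks], [ref for _, ref in landmarks])
    return slope * rt + intercept


def irt_by_interpolation(rt, landmarks):
    """Map an observed retention time to iRT using the two bracketing landmarks.

    Values outside the landmark range are extrapolated from the nearest segment.
    """
    points = sorted(landmarks)
    observed = [obs for obs, _ in points]
    upper = min(max(bisect_right(observed, rt), 1), len(points) - 1)
    (x0, y0), (x1, y1) = points[upper - 1], points[upper]
    return y0 + (y1 - y0) * (rt - x0) / (x1 - x0)


# Hypothetical landmark peptides: (observed RT in minutes, reference iRT).
landmarks = [(10.0, 0.0), (20.0, 50.0), (40.0, 100.0)]
irt_regression = irt_by_regression(25.0, landmarks)
irt_interpolated = irt_by_interpolation(30.0, landmarks)
```

Regression smooths over noise in individual landmarks, whereas interpolation follows local changes in the gradient; which behaves better depends on the data.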
 + 
 +==== ETD Support ====
 + 
 +As of version 4.0, SpectraST supports the import and searching of MS2 spectra by electron-transfer dissociation (ETD). A tag encoding the fragmentation method is added to each library entry to differentiate between CID (collision-induced dissociation) and ETD spectra, and that information is automatically extracted from the data (.mzML/.mzXML) files (if specified therein) for library building. The user can also explicitly specify the fragmentation method, using the -cI option.
 + 
 +SpectraST annotates ETD spectra differently than CID spectra, and the spectral matching algorithm is slightly modified to deal with the charge-reduced precursor peaks common in ETD spectra.
 + 
 +==== Generation of Transition Lists for Selected/Multiple Reaction Monitoring (S/MRM) ====
 + 
 +Selected reaction monitoring (SRM, also known as Multiple Reaction Monitoring, MRM) is an acquisition mode available on some mass spectrometers (mostly triple quadrupole instruments) which has gained increasing popularity as a targeted quantification technique. It requires as input a list of “transitions” to be monitored over the course of the experiment. Each transition consists of two numbers, Q1 (the precursor m/z) and Q3 (the fragment ion m/z) and must be selected beforehand based on knowledge of the peptide to be quantified. One of the most effective strategies for selecting appropriate transitions is to rely on previously acquired MS2 spectra of the target peptides stored in spectral libraries.
 + 
 +As of version 4.0, SpectraST implements an algorithm to select the N (a user-specified number) most suitable transitions for each library spectrum, and print the list of transitions in a table format. For example, the command:
 + 
 +<code>spectrast -cNfoo_MRM -cM -cQ5 foo.splib</code>
 + 
 +will create a “reduced” spectral library <code>foo_MRM.splib</code>, in which each spectrum retains only the 5 peaks most suited as MRM transitions. In addition, a text file <code>foo_MRM.mrm</code> containing the transitions in a table format will be printed.
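The general shape of transition selection can be illustrated with a short Python sketch. It ranks fragment peaks by intensity alone, which is an assumption made for illustration; the criteria SpectraST actually uses to judge suitability are not described here, and the table layout below is hypothetical rather than the real <code>.mrm</code> format.

```python
def select_transitions(precursor_mz, peaks, n):
    """Return the n (Q1, Q3) pairs with the most intense fragment peaks.

    peaks is a list of (mz, intensity) pairs from one library spectrum.
    Ranking by intensity only is a simplifying assumption.
    """
    ranked = sorted(peaks, key=lambda peak: peak[1], reverse=True)[:n]
    return [(precursor_mz, mz) for mz, _ in ranked]


def format_table(transitions):
    """Render the transitions as a tab-delimited table (hypothetical layout)."""
    lines = ["Q1\tQ3"]
    lines += [f"{q1:.4f}\t{q3:.4f}" for q1, q3 in transitions]
    return "\n".join(lines)


# Illustrative peak list, not taken from a real spectrum.
peaks = [(175.1, 20.0), (300.2, 80.0), (450.3, 50.0)]
transitions = select_transitions(500.0, peaks, 2)
table = format_table(transitions)
```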
 + 
 +A fully functional software tool for MRM experiment design, [[Software:TPP-MaRiMba|MaRiMba]] (Sherwood ''et al. '' (2009) ''Journal of Proteome Research'' '''8''', 4396-4405), which essentially wraps this transition selection algorithm of SpectraST as well as implements some surrounding functionalities, is freely available as part of the TPP and accessible via the [[Software:Petunia|Petunia]] web interface.
 + 
 +==== User-defined Modifications ====
 + 
 +As of version 4.0, users can specify allowable modifications in a text file. (Prior to 4.0, SpectraST internally maintained a list of common modifications and rejected identifications with unrecognized modifications.) This text file should contain space- or comma-delimited strings of the following form:
 + 
 +<code><token>|<monoisotopic mass change from unmodified amino acid>|<name of modification></code>
 + 
 +where <code><token></code> must be one of the amino acids in one-letter code (capitalized), n (the N terminus) or c (the C terminus), optionally followed by <code>[<tag>]</code>, a user-defined short tag for that modification, and <code><name of modification></code> is a user-defined formal name of that modification which will be written to the library created.
 + 
 +An example of this file can be found in [[spectrast.usermods|spectrast.usermods]].
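To make the entry format concrete, the following Python sketch parses strings of the form <code><token>|<mass change>|<name></code>. It is not SpectraST's own parser: the function name and the returned dictionary layout are hypothetical, the example entries are for illustration only, and the sketch assumes modification names contain no spaces or commas, since those characters delimit entries.

```python
import re

# <token> is one residue letter (or n/c for a terminus), optionally followed
# by [<tag>]; then the mass change; then the modification name.
ENTRY = re.compile(r"^([A-Zcn])(?:\[([^\]]+)\])?\|([-+]?\d+(?:\.\d+)?)\|(.+)$")


def parse_usermods(text):
    """Parse space- or comma-delimited modification entries into dictionaries."""
    mods = []
    for entry in re.split(r"[\s,]+", text.strip()):
        if not entry:
            continue
        match = ENTRY.match(entry)
        if match is None:
            raise ValueError(f"malformed modification entry: {entry}")
        residue, tag, mass, name = match.groups()
        mods.append({"residue": residue, "tag": tag, "mass_shift": float(mass), "name": name})
    return mods


# Illustrative entries; check the values against Unimod before real use.
example = "K[136]|8.014199|Label:13C(6)15N(2), n[43]|42.010565|Acetyl"
mods = parse_usermods(example)
```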
 + 
 +To activate the modifications specified in this file, add the option <code>-M</code> to the command line whenever identifications containing these modifications are to be processed. By default, SpectraST expects the file to be named <code>spectrast.usermods</code> and to reside in the current working directory. To use a different file, append its name (with path if necessary) directly after the <code>-M</code>.
 + 
 +Since a spectral library is meant to be a long-lasting and shared resource, special care should be taken in specifying new modifications. Accurate monoisotopic masses (preferably to at least 4 digits) should be used, and the name of the modification should follow [http://www.psidev.info/ HUPO-PSI] convention as much as possible. Consult [http://www.unimod.org/ Unimod] to see if your modification is already defined in HUPO-PSI standards.
 + 
 +==== Semi-empirical Spectrum Generation ====
 + 
 +As of version 4.0, SpectraST can generate semi-empirical spectra from real spectra of closely-related identifications, simply by shifting peaks on the m/z axis wherever appropriate. This is useful when spectra of one modification state are used to build the library, but one wishes to match spectra of the same sequence but another modification state in spectral searching.
 + 
 +To do so, the user needs to turn on the build action of semi-empirical spectrum generation by the option <code>-cAM</code>, and specify what modifications are desired by the <code>-cx<str></code> option. The <code><str></code> in the latter is a string of allowable modification tokens (no space in between). A modification token starts with a one-letter amino acid (A through Z, plus n or c for the termini), followed by <code>[<tag>]</code> if it is not the unmodified form.
 + 
 +For example, the command
 + 
 +<code>spectrast -cNfoo_heavy -cAM -cx'K[134]R[162]' foo.splib</code>
 + 
 +will create a library <code>foo_heavy.splib</code> containing the same peptide ions as <code>foo.splib</code>, but with all lysines and arginines being heavy (both +6 Da). Note that since K and R are not included in the string following <code>-cx</code>, unmodified K or R is not allowed to be used. This is commonly referred to as a static modification.
 + 
 +The command
 + 
 +<code>spectrast -cNfoo_metox -cAM -cx'MM[147]' foo.splib</code>
 + 
 +will create a library <code>foo_metox.splib</code> containing the same peptide ions as <code>foo.splib</code>, but with all possible permutations of normal and oxidized methionine. This is commonly referred to as a variable modification. Note that the string <code>'MM[147]'</code> means that both M (normal methionine) and M[147] (oxidized methionine) are allowed where an M is present in a sequence.
 + 
 +The command
 + 
 +<code>spectrast -cNfoo_binary -cAM -cx'{K}{K[134]}' foo.splib</code>
 + 
 +specifies a “binary” modification: lysines on the same peptide must be either all light or all heavy. The curly braces (<code>{ }</code>) delimit sets of allowable modification tokens; for the same peptide only tokens from a single set can be used.
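The static, variable and binary cases described above can be mimicked with a small Python sketch that enumerates the modified forms of one unmodified peptide for a given <code>-cx</code> string. It illustrates the token semantics as described in the text only: the function names are hypothetical, terminal tokens (n, c) are ignored because the sketch handles residues only, and SpectraST itself goes further by shifting peaks to generate the corresponding spectra.

```python
import itertools
import re

# A token is a residue letter (or n/c), optionally followed by [<tag>].
TOKEN = re.compile(r"[A-Zcn](?:\[[^\]]+\])?")


def parse_token_sets(spec):
    """Split a -cx string into sets of tokens; without braces there is one set."""
    groups = re.findall(r"\{([^}]*)\}", spec)
    if not groups:
        groups = [spec]
    return [TOKEN.findall(group) for group in groups]


def expand(peptide, spec):
    """List the modified forms of an unmodified peptide allowed by spec."""
    variants = []
    for tokens in parse_token_sets(spec):
        allowed = {}
        for token in tokens:
            allowed.setdefault(token[0], []).append(token)
        # Residues not mentioned in this set stay as they are.
        choices = [allowed.get(residue, [residue]) for residue in peptide]
        for combination in itertools.product(*choices):
            variant = "".join(combination)
            if variant not in variants:
                variants.append(variant)
    return variants


static_forms = expand("AKR", "K[134]R[162]")
variable_forms = expand("AMKM", "MM[147]")
binary_forms = expand("KAK", "{K}{K[134]}")
```

In this sketch the static case yields a single form, the variable case yields every combination of normal and oxidized methionine, and the binary case yields one all-light and one all-heavy form.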
 + 
 +Note that while the modification tokens, such as M[147], contain only integer values (of the modified amino acids) as a tag, in calculations SpectraST will use the corresponding accurate mass stored for that particular modification type indicated by the token. Hence, for modifications currently not recognized by SpectraST, the user must define them (along with the accurate mass) in a text file using the <code>-M</code> option.
 + 
 +==== SpectraST File List Feature ====
SpectraST allows the user to list the files to be processed in a text file with extension .list. This can be useful when the number of files to be processed is very large, possibly overwhelming the UNIX command line. It is also an easy way to queue up multiple SpectraST tasks and to keep track of them. For example, if the file job.list contains the lines: SpectraST allows the user to list the files to be processed in a text file with extension .list. This can be useful when the number of files to be processed is very large, possibly overwhelming the UNIX command line. It is also an easy way to queue up multiple SpectraST tasks and to keep track of them. For example, if the file job.list contains the lines:
Line 261: Line 405:
|colspan=4|CANDIDATE SELECTION AND SCORING OPTIONS |colspan=4|CANDIDATE SELECTION AND SCORING OPTIONS
|- |-
-| '''-sM<tol>'''||'''indexRetrievalMzTolerance'''||Specify precursor m/z tolerance in Th. Monoisotopic mass is assumed.||Default is 3.0 Th.+| '''-sM<tol>'''||'''precursorMzTolerance'''||Specify precursor m/z tolerance in Th. Monoisotopic mass is assumed.||Default is 3.0 Th.
|- |-
-| '''-sA'''||'''indexRetrievalUseAverage'''||Use average mass instead of monoisotopic mass.||Turn on with -sA, off with -sA!. Default is off.+| -sA||precursorMzUseAverage||Report average mass instead of monoisotopic mass in search results. Precursor m/z window is expanded to account for difference between average and monoisotopic mass.||Turn on with -sA, off with -sA!. Default is off.
|- |-
-| -sC<type>||expectedCysteineMod||Specify the expected kind of cysteine modification. Those candidate library entries with a wrong kind of cysteine modification will be ignored.||-sCICAT_cl = cleavable ICAT<br>-sCICAT_uc = uncleavable ICAT<br>-sCCAM = Carbamidomethyl.<br> Default is off (search all candidates).+| -sz||searchAllCharges||Search all candidate library spectra regardless of precursor charge, ignoring the precursor charge specified in the query data.||Turn on with -sz, off with -sz!. Default is off.
-|-+
-| -sc||ignoreSpectraWithUnmodCysteine||Ignore any candidate library entries with an unmodified cysteine.|| Turn on with -sc, off with -sc!. Default is off.+
|- |-
| -s_HOM<rank>||detectHomologs||Detect homologous lower hits up to <rank>. Looks for lower hits homologous to the top hit and adjusts delta accordingly.||Default is 4. | -s_HOM<rank>||detectHomologs||Detect homologous lower hits up to <rank>. Looks for lower hits homologous to the top hit and adjusts delta accordingly.||Default is 4.
|- |-
-| -s_NO1||ignoreChargeOneLibSpectra||Ignore all library entries with +1 charge state.||Turn on with -s_NO1, off with -s_NO1!. Default is off.+| -s_FDL<frac>||fvalFractionDelta||Specify the fraction of the normalized delta score (delta/dot) in the F-value formula.||Default is 0.4.
|- |-
-| -s_NOS||ignoreAbnormalSpectra||Ignore all spectra which have non-Normal status.||Turn of with -s_NOS, off with -s_NOS!. Default is off. +| -s_SP4||useSp4Scoring||Use original SpectraST (4.0 or earlier) scoring, based on dot products of square-root intensities.||Turn on with -s_SP4, off with -s_SP4!. Default is off.
 +|-
 +| -s_FBI||fvalUseDotBias||Use dot bias to penalize high-scoring matches with massive noise and/or dominant peak.||Turn on with -s_FBI, off with -s_FBI!. Default is on. Only applicable for SP4 scoring.
 +|-
 +| -s_PVL||usePValue||Compute p-value by fitting score distribution of lower hits, and use it solely in F-value, which produces better behaved negative distribution.||Turn on with -s_PVL, off with -s_PVL!. Default is off. NOT applicable for SP4 scoring. Only tested in low-resolution CID data.
 +|-
 +| -s_OMT||useTierwiseOpenModSearch||Perform tier-wise open modification search for modifications within precursor m/z window specified with -sM.||Turn on with -s_OMT, off with -s_OMT!. Default is off. Note that the scoring is different from normal SpectraST searches.
 +|-
 +| -sC<type> ''deprecated''||expectedCysteineMod||Specify the expected kind of cysteine modification. Those candidate library entries with a wrong kind of cysteine modification will be ignored.||-sCICAT_cl = cleavable ICAT<br>-sCICAT_uc = uncleavable ICAT<br>-sCCAM = Carbamidomethyl.<br> Default is off (search all candidates).
 +|-
 +| -sc ''deprecated''||ignoreSpectraWithUnmodCysteine||Ignore any candidate library entries with an unmodified cysteine.|| Turn on with -sc, off with -sc!. Default is off.
 +|-
 +| -s_NO1 ''deprecated''||ignoreChargeOneLibSpectra||Ignore all library entries with +1 charge state.||Turn on with -s_NO1, off with -s_NO1!. Default is off.
 +|-
 +| -s_NOS ''deprecated''||ignoreAbnormalSpectra||Ignore all spectra which have non-Normal status.||Turn on with -s_NOS, off with -s_NOS!. Default is off.
|- |-
|colspan=4|OUTPUT AND DISPLAY OPTIONS |colspan=4|OUTPUT AND DISPLAY OPTIONS
|- |-
| '''-sE<ext>'''||'''outputExtension'''||Output format. The search result will be written to a file with the same base name as the search file, with extension <ext>.||-sEtxt = Fixed-width text format<br>-sExls = Tab-delimited text format<br> -sExml (default) or -sEpepXML = .pepXML format. | '''-sE<ext>'''||'''outputExtension'''||Output format. The search result will be written to a file with the same base name as the search file, with extension <ext>.||-sEtxt = Fixed-width text format<br>-sExls = Tab-delimited text format<br> -sExml (default) or -sEpepXML = .pepXML format.
 +|-
 +| -sO<path> ||outputDirectory||Specify a directory to hold the search output files.||Default: Same directory as the corresponding search data (.mzML/.mzXML) file.
|- |-
| -s_FV1<thres>||hitListTopHitFvalThreshold||Minimum F value threshold for the top hit. Only top hits having F value greater than <thres> will be printed.||Default = 0.0 (all top hits will be displayed) | -s_FV1<thres>||hitListTopHitFvalThreshold||Minimum F value threshold for the top hit. Only top hits having F value greater than <thres> will be printed.||Default = 0.0 (all top hits will be displayed)
Line 284: Line 442:
|- |-
| -s_SHH||hitListShowHomologs||Always displays homologous lower hits regardless of F value.||Turn on with -s_SHH (need -s_HOM on), off with -s_SHH! Default is on. | -s_SHH||hitListShowHomologs||Always displays homologous lower hits regardless of F value.||Turn on with -s_SHH (need -s_HOM on), off with -s_SHH! Default is on.
 +|-
 +| -s_SHR<rank>||hitListShowMaxRank||Maximum rank for hits shown for each query, e.g. -s_SHR3 will show the top 3 hits.||Default is 1.
|- |-
| -s_SH1||hitListOnlyTopHit||Only display the top hit for each query.||Turn on with -s_SH1, off with -s_SH1!. Default is on. | -s_SH1||hitListOnlyTopHit||Only display the top hit for each query.||Turn on with -s_SH1, off with -s_SH1!. Default is on.
Line 289: Line 449:
| -s_SHM||hitListExcludeNoMatch||Do not display queries for which there is no candidate, or the top hit is below the minimum F value threshold specified with -sV.||Turn on with -s_SHM, off with -s_SHM!. Default is on. | -s_SHM||hitListExcludeNoMatch||Do not display queries for which there is no candidate, or the top hit is below the minimum F value threshold specified with -sV.||Turn on with -s_SHM, off with -s_SHM!. Default is on.
|- |-
-| -s_SAV||saveSpectra||Save query and matched library spectra as .msp files.||Turn on with -s_SAV, off with -s_SAV!. Default is off.+| -s_ENZ<enz>||enzymeForPepXMLOutput||Specify the proteolytic enzyme used, for the purpose of pepXML output. <enz> can be trypsin, lysc, etc.||This does not affect SpectraST searching. It only affects how the results are processed by downstream TPP tools.
|- |-
-| -s_TGZ||tgzSavedSpectra||Archive the saved query and matched library spectra as .tgz files to save space.||Turn on with -s_TGZ (need -s_SAV on), off with -s_TGZ!. Default is off.+| -s_FIN<file>||printFingerprintingSummary||Print a text file of name <file> summarizing fingerprinting results.||Default is off.
|- |-
-|colspan=4|SPECTRUM FILTERING AND PROCESSING OPTIONS+|colspan=4|SPECTRUM FILTERING OPTIONS
|- |-
-| -s_XNP<thres>||filterMinPeakCount||Require minimum number of peaks. All query spectra with fewer than <thres> peaks passing the intensity threshold set with -sP will be removed.||Default is 10.+| -s_XNP<thres>||filterMinPeakCount||Discard query spectra with fewer than <thres> peaks above threshold set with -s_CNT.||Default is 10.
|- |-
-| -s_XMZ<m/z>||filterAllPeaksBelowMz||Remove spectra with almost no peaks above a certain m/z value. All query spectra with 95%+ of the total intensity below <m/z> will be removed.|| Default is 520.+| -s_XMZ<m/z>||filterAllPeaksBelowMz||Discard query spectra with almost no peaks above a certain m/z value. All query spectra with 95%+ of the total intensity below <m/z> will be removed.|| Default is 520.
|- |-
-| -s_XIN<inten>||filterMaxIntensityBelow||Filter query spectra with no peaks with intensity above <inten>.||Default is 0.+| -s_XIN<inten>||filterMaxIntensityBelow||Discard query spectra with no peaks with intensity above <inten>.||Default is 0.
 +|-
 +| -s_XMR<range>||filterMinMzRange||Discard query spectra with m/z range narrower than <range>.||Default is 350.
|- |-
| -s_CNT<thres>||filterCountPeakIntensityThreshold||Minimum peak intensity for peaks to be counted. Only peaks with intensity above <thres> will be counted to meet the requirement for minimum number of peaks.|| Default is 2.01 | -s_CNT<thres>||filterCountPeakIntensityThreshold||Minimum peak intensity for peaks to be counted. Only peaks with intensity above <thres> will be counted to meet the requirement for minimum number of peaks.|| Default is 2.01
|- |-
-| -s_RNT<thres>||filterRemovePeakIntensityThreshold||Noise peak threshold. All peaks with intensities below <thres> will be zeroed.|| Default is 2.01+|colspan=4|SPECTRUM PROCESSING OPTIONS
|- |-
-| -s_R51<thres>||filterRemoveHuge515Threshold||Remove dominant peak at 515.3 Th. All dominant peaks near 515.3 Th (with intensity greater than <thres> of the total intensity of the spectrum) will be zeroed.||Default is off. Dominant 515.3 Th peaks are a common impurity artifact in cleavable ICAT experiments.+| -s_RNT<thres>||filterRemovePeakIntensityThreshold||Noise peak threshold. All peaks with intensities below <thres> will be zeroed.|| Default is 2.01
|- |-
| -s_RNP<num>||filterMaxPeaksUsed||Remove all but the top <num> peaks in query spectra.||Default is 150. | -s_RNP<num>||filterMaxPeaksUsed||Remove all but the top <num> peaks in query spectra.||Default is 150.
Line 311: Line 473:
| -s_RDR<num>||filterMaxDynamicRange||Remove all peaks smaller than 1/<num> of the base (highest) peak in query spectra.||Default is 1000. | -s_RDR<num>||filterMaxDynamicRange||Remove all peaks smaller than 1/<num> of the base (highest) peak in query spectra.||Default is 1000.
|- |-
-| -s_MZS<mzpow>,<br>-s_INS<intpow>||peakScalingMzPower,<br>peakScalingIntensityPower||Intensity scaling power with respect to the m/z value and the raw intensity. The scaled intensity will be (m/z)^<mzpow> * (raw intensity)^<intpow>||Default is <mzpow> = 0.0, <intpow> = 0.5. +| -s_MZS<mzpow>,<br>-s_INS<intpow>||peakScalingMzPower,<br>peakScalingIntensityPower||Intensity scaling power with respect to the m/z value and the raw intensity. The scaled intensity will be (m/z)^<mzpow> * (raw intensity)^<intpow>||Default is <mzpow> = 0.0, <intpow> = 0.5. Only applicable in SP4 scoring.
|- |-
-| -s_UAS<factor>||peakScalingUnassignedPeaks||Scaling factor for unassigned peaks in library spectra. Unassigned peaks in the library spectra will be scaled by <factor>.||Default is 0.1.+| -s_UAS<factor>||peakScalingUnassignedPeaks||Scaling factor for unassigned peaks in library spectra. Unassigned peaks in the library spectra will be scaled by <factor>.||Default is 1.0.
 +|-
 +| -s_NOB||peakNoBinning||Disable binning and instead perform peak-to-peak matching.||Turn on with -s_NOB, off with -s_NOB!. Default is off. NOT applicable in SP4 scoring, and not recommended for low-resolution data. Specify fragment m/z tolerance by -s_BIN<num>, tolerance = 1/<num>.
|- |-
| -s_BIN<num>||peakBinningNumBinsPerMzUnit||Number of bins per Th.||Default is 1. | -s_BIN<num>||peakBinningNumBinsPerMzUnit||Number of bins per Th.||Default is 1.
|- |-
| -s_NEI<frac>||peakBinningFractionToNeighbor||Fraction of the scaled intensity assigned to neighboring bins.||Default is 0.5. | -s_NEI<frac>||peakBinningFractionToNeighbor||Fraction of the scaled intensity assigned to neighboring bins.||Default is 0.5.
 +|-
 +| -s_LNP<num>||filterLibMaxPeaksUsed||Remove all but the top <num> peaks in the LIBRARY spectra.||Default is 50.
 +|-
 +| -s_RLI<thres>||filterLighIonsMzThreshold||Remove all light ions with m/z lower than <thres> Th for both library and query spectra.||Default is 180.
 +|-
 +| -s_ITQ||filterITRAQReporterPeaks||Remove iTRAQ reporter peaks in the range 112-122 Th.||Turn on with -s_ITQ, off with -s_ITQ!. Default is off.
 +|-
 +| -s_TMT||filterTMTReporterPeaks||Remove TMT reporter peaks in the range 126-132 Th. ||Turn on with -s_TMT, off with -s_TMT!. Default is off.
 +|-
 +| -s_R51<thres> ''deprecated''||filterRemoveHuge515Threshold||Remove dominant peak at 515.3 Th. All dominant peaks near 515.3 Th (with intensity greater than <thres> of the total intensity of the spectrum) will be zeroed.||Default is off. Dominant 515.3 Th peaks are a common impurity artifact in cleavable ICAT experiments.
|} |}
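
The peak-processing options in the table above (<code>-s_RNT</code>, <code>-s_RDR</code>, <code>-s_RNP</code>, <code>-s_MZS</code>, <code>-s_INS</code>) can be sketched in Python as follows. This is only an illustration of the documented behavior, not SpectraST's implementation: the function name <code>process_query_peaks</code> is hypothetical, the order in which the steps are applied is an assumption, and the scaling step applies only to SP4 scoring according to the table.

```python
def process_query_peaks(peaks, noise_threshold=2.01, max_dynamic_range=1000.0,
                        max_peaks=150, mz_power=0.0, intensity_power=0.5):
    """Filter and scale a query spectrum given as a list of (m/z, intensity) pairs.

    Defaults mirror the documented defaults; the step order is assumed.
    """
    # -s_RNT: discard peaks below the noise threshold.
    kept = [(mz, inten) for mz, inten in peaks if inten >= noise_threshold]
    if not kept:
        return []

    # -s_RDR: discard peaks smaller than 1/<num> of the base (highest) peak.
    base = max(inten for _, inten in kept)
    kept = [(mz, inten) for mz, inten in kept if inten >= base / max_dynamic_range]

    # -s_RNP: keep only the <num> most intense peaks.
    kept = sorted(kept, key=lambda peak: peak[1], reverse=True)[:max_peaks]

    # -s_MZS / -s_INS: scaled intensity = (m/z)^mzpow * (raw intensity)^intpow.
    scaled = [(mz, (mz ** mz_power) * (inten ** intensity_power)) for mz, inten in kept]
    return sorted(scaled)  # report in ascending m/z order


example = [(100.0, 1.0), (200.0, 4.0), (300.0, 10000.0), (400.0, 16.0)]
for mz, inten in process_query_peaks(example):
    print(mz, inten)
```

With the default settings, the peak at 100 Th falls below the noise threshold and the peak at 200 Th falls outside the dynamic range, so only the two remaining peaks are scaled by the square root of their intensity.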
Line 337: Line 511:
|- |-
| -cO<file>||useProteinList||Use protein list in <file>. Only those peptide ions associated with proteins in the list will be imported.||A protein list is a text file with one protein identifier per line. If a number X is supplied following the protein separated by a tab, then at most X peptide ions associated with that protein will be imported. Peptides with more replicates are favored. | -cO<file>||useProteinList||Use protein list in <file>. Only those peptide ions associated with proteins in the list will be imported.||A protein list is a text file with one protein identifier per line. If a number X is supplied following the protein separated by a tab, then at most X peptide ions associated with that protein will be imported. Peptides with more replicates are favored.
 +|-
 +| -cM<format>||printMRMTable||Write all library spectra as SRM transition tables. Leave <format> blank for default tab-delimited table format.||Turn off with -cM!. Default is off.
|- |-
| -cm<remark>||remark||Remark. Add a Remark=<remark> comment to all library entries created.||Default is off. | -cm<remark>||remark||Remark. Add a Remark=<remark> comment to all library entries created.||Default is off.
-|- 
-| -c_ANN||annotatePeaks||Annotate peaks.||Turn on with -c_ANN, off with -c_ANN!. Default is on. 
|- |-
| -c_BIN||binaryFormat||Write library in binary format, which enables quicker search.|| Turn on with -c_BIN, off with -c_BIN!. Default is on. | -c_BIN||binaryFormat||Write library in binary format, which enables quicker search.|| Turn on with -c_BIN, off with -c_BIN!. Default is on.
Line 346: Line 520:
| -c_DTA||writeDtaFiles||Write all library spectra as .dta files.|| Turn on with -c_DTA, off with -c_DTA!. Default is off. | -c_DTA||writeDtaFiles||Write all library spectra as .dta files.|| Turn on with -c_DTA, off with -c_DTA!. Default is off.
|- |-
-| -c_PLT<crit>||plotSpectra||Plot the library spectra as they are created.||-c_PLT or -c_PLTALL = Plot every spectrum. -c_PLT<crit> = Plot spectrum when either the Status or the Spec comment value = <crit>.+| -c_MGF||writeMgfFiles||Write all library spectra as one .mgf file.|| Default is off.
|- |-
-|colspan=4|PEPXML IMPORT OPTIONS (Applicable with .pepXML file input)+| -c_RDY<prefix>||removeDecoyProteins||Remove spectra of decoys, for which proteins have names starting with <prefix>. Also remove decoy proteins from Protein field for peptides mapped to both target and decoy sequences.|| Default is off.
 +|-
 +| -c_PLT<crit> ''deprecated''||plotSpectra||Plot the library spectra as they are created.||-c_PLT or -c_PLTALL = Plot every spectrum. -c_PLT<crit> = Plot spectrum when either the Status or the Spec comment value = <crit>.
 +|-
 +|colspan=4|LIBRARY IMPORT OPTIONS (Applicable with .pep.xml, .tsv, .msp, .hlf, .ms3, .mz(X)ML file input)
|- |-
| '''-cP<prob>'''||'''minimumProbabilityToInclude'''||Include all spectra identified with probability no less than <prob> in the library.||Default is 0.9. | '''-cP<prob>'''||'''minimumProbabilityToInclude'''||Include all spectra identified with probability no less than <prob> in the library.||Default is 0.9.
 +|-
 +| '''-cq<fdr>'''||'''maximumFDRToInclude'''||(Only for pepXML import) Include spectra with global FDR no greater than <fdr> in the library.||Default is 9999.0.
|- |-
| '''-cn<name>'''||'''datasetName'''||Specify a dataset identifier for the file to be imported.||If not set, SpectraST will construct it from the path and the name of the .pepXML file. | '''-cn<name>'''||'''datasetName'''||Specify a dataset identifier for the file to be imported.||If not set, SpectraST will construct it from the path and the name of the .pepXML file.
 +|-
 +| -cI<type>||setFragmentation||Set the fragmentation type of all spectra, overriding existing information.|| -cIETD = tag all library spectra as ETD spectra. <br> -cICID-QTOF = tag all library spectra as Q-TOF (high-resolution) CID spectra.<br> -cIHCD = tag all library spectra as HCD (high-resolution) spectra. <br>Default is off (determined from the data files).
|- |-
| -cg||setDeamidatedNXST||Set all asparagines (N) in the motif NX(S/T) as deamidated (N[115]), and all asparagines not in the motif NX(S/T) as unmodified. Use for glycocaptured peptides.|| Turn on with -cg, off with -cg!. Default is off. | -cg||setDeamidatedNXST||Set all asparagines (N) in the motif NX(S/T) as deamidated (N[115]), and all asparagines not in the motif NX(S/T) as unmodified. Use for glycocaptured peptides.|| Turn on with -cg, off with -cg!. Default is off.
Line 358: Line 540:
| -co||addMzXMLFileToDatasetName||Add the originating mzXML file name to the dataset identifier. Good for keeping track of the MS run in which the peptide is observed.|| Turn on with -co, off with -co!. Default is off. | -co||addMzXMLFileToDatasetName||Add the originating mzXML file name to the dataset identifier. Good for keeping track of the MS run in which the peptide is observed.|| Turn on with -co, off with -co!. Default is off.
|- |-
-| -c_NPK<num>||minimumNumPeaksToInclude||Exclude spectra of peptide IDs shorter than <num> amino acids.||Default is 10.+| -c_CEN||centroidPeaks||Centroid peaks as raw spectra are imported.||Designed mostly for Q-TOF spectra in profile mode.
-|-+
-| -c_NAA<thres>||minimumDeltaCnToInclude||Exclude spectra with fewer than <num> peaks.||Default is 6.+
-|-+
-| -c_DCN<num>||minimumNumAAToInclude||Exclude spectra with deltaCn smaller than <thres>. Useful for excluding spectra with indiscriminate modification sites.|| Turn on with -c_DCN, off with -c_DCN!. Default is 0.0.+
|- |-
| -c_RNT<thres>||rawSpectraNoiseThreshold||Absolute noise filter. Remove noise peaks with intensity below <thres> in imported spectra.||Default is 0.0. | -c_RNT<thres>||rawSpectraNoiseThreshold||Absolute noise filter. Remove noise peaks with intensity below <thres> in imported spectra.||Default is 0.0.
|- |-
| -c_RDR<range>||rawSpectraMaxDynamicRange||Relative noise filter. Filter out noise peaks with intensity below 1/<range> of that of the highest peak.||Default is 100000.0. | -c_RDR<range>||rawSpectraMaxDynamicRange||Relative noise filter. Filter out noise peaks with intensity below 1/<range> of that of the highest peak.||Default is 100000.0.
 +|-
 +| -c_NAA<num>||minimumNumAAToInclude||Exclude spectra of peptide IDs shorter than <num> amino acids.||Default is 6.
 +|-
 +| -c_NPK<num>||minimumNumPeaksToInclude||Exclude spectra with fewer than <num> peaks.||Default is 10.
 +|-
 +| -c_XAN||skipRawAnnotation||Skip the annotation of raw spectra as they are imported.||Annotation is quite slow and might be impractical if the number of imported spectra is enormous.
 +|-
 +| -c_DCN<thres>||minimumDeltaCnToInclude||(Only for pepXML import) Exclude spectra with deltaCn smaller than <thres>. Useful for excluding spectra with indiscriminate modification sites.|| Turn on with -c_DCN, off with -c_DCN!. Default is 0.0.
 +|-
 +| -c_MDF<thres>||maximumMassDiffToInclude||(Only for pepXML import) Exclude spectra with precursor mass difference (absolute value) greater than <thres> Daltons.|| Default is 9999.0.
 +|-
 +| -c_BRK||bracketSpectra||(Only for pepXML import) Bracket import: for each confident ID, also search neighboring scans for repeated scans to import.||Turn on with -c_BRK, off with -c_BRK!. Default is off.
 +|-
 +| -c_BRM||mergeBracket||(Only for pepXML import) Merge bracketed spectra: merge repeated scans of a bracket into one consensus spectrum for import.||Turn on with -c_BRM, off with -c_BRM!. Default is off.
|- |-
|colspan=4|LIBRARY MANIPULATION OPTIONS (Applicable with .splib file input) |colspan=4|LIBRARY MANIPULATION OPTIONS (Applicable with .splib file input)
Line 372: Line 564:
| '''-cf<pred>'''||'''filterCriteria'''||Filter library by criteria. Keep only those entries satisfying the predicate <pred>.||<pred> should be in quotes in the form “<attr> <op> <value>”. <attr> can refer to any of the fields and any comment entries. <op> can be ==, !=, <, >, <=, >=, =~ and !~. Multiple predicates can be separated by either & (AND logic) or <nowiki>|</nowiki> (OR logic), but not both. Default is off. | '''-cf<pred>'''||'''filterCriteria'''||Filter library by criteria. Keep only those entries satisfying the predicate <pred>.||<pred> should be in quotes in the form “<attr> <op> <value>”. <attr> can refer to any of the fields and any comment entries. <op> can be ==, !=, <, >, <=, >=, =~ and !~. Multiple predicates can be separated by either & (AND logic) or <nowiki>|</nowiki> (OR logic), but not both. Default is off.
|- |-
-| '''-cJ'''||'''combineAction'''||Combine action.||-cJU = Union (default). Include all the peptide ions in all the files. <br>-cJI = Intersection. Only include peptide ions that are present in all the files. <br>-cJS = Subtraction. Only include peptide ions in the first file that are not present in any of the other files. <br>-cJH = Subtraction of homologs. Only include peptide ions in the first file that do not have any homologs with similar m/z in any of the other files.+| '''-cJ'''||'''combineAction'''||Combine action.||-cJU = Union (default). Include all the peptide ions in all the files. <br>-cJI = Intersection. Only include peptide ions that are present in all the files. <br>-cJS = Subtraction. Only include peptide ions in the first file that are not present in any of the other files. <br>-cJH = Subtraction of homologs. Only include peptide ions in the first file that do not have any homologs with similar m/z in any of the other files. <br>-cJA = Appending. Each peptide ion is added from only one library: the first one in the command line that contains that peptide ion.
|- |-
-| '''-cA'''||'''buildAction'''||Build action.||-cAB = Best replicate. Pick the best replicate of each peptide ion. <br>-cAC = Consensus. Create the consensus spectrum of all replicate spectra of each peptide ion. <br>-cAQ = Quality filter. Apply quality filters to library.<br>Default is no build action - all spectra will be included as is.+| '''-cA'''||'''buildAction'''||Build action.||-cAB = Best replicate. Pick the best replicate of each peptide ion. <br>-cAC = Consensus. Create the consensus spectrum of all replicate spectra of each peptide ion. <br>-cAQ = Quality filter. Apply quality filters to library.<br>-cAD = Decoy. Generate decoy spectra.<br> -cAN = Sort by descending number of replicates (tie-breaking by probability).<br> -cAM = Semi-empirical. Generate semi-empirical spectra. <br> -cAS = Clustering by spectral similarity. <br>Default is no build action - all spectra will be included as is.
|- |-
-|colspan=4|CONSENSUS SPECTRUM CREATION OPTIONS (Applicable with -cAC option)+| '''-cD<file>'''||'''refreshDatabase'''||Refresh protein mappings against the database <file> in FASTA format.||Default is off.
 +|-
 +| -cQ<num>||reduceSpectra||Produce reduced spectra of at most <num> peaks, based on rules prioritizing desirable SRM transitions.||Default is 0 (keep entire spectrum).
 +|-
 +| -cu||refreshDeleteUnmapped||Delete entries whose peptide sequences do not map to any protein during refreshing with -cD option.||Default is off.
 +|-
 +| -cd||refreshDeleteMultimapped||Delete entries whose peptide sequences map to multiple proteins during refreshing with the -cD option.||Default is off.
 +|-
 +| -c_ANN||reannotatePeaks||Re-annotate peaks.||Turn on with -c_ANN, off with -c_ANN!. Default is off.
 +|-
 +| -c_NPK<num>||minimumNumPeaksToInclude||Exclude spectra with fewer than <num> peaks.||Default is 10.
 +|-
 +| -c_Q3L<thres>||minimumMRMQ3MZ||Specify the lower m/z limit for Q3 in SRM table generation.||Default is 200.
 +|-
 +| -c_Q3H<thres>||maximumMRMQ3MZ||Specify the upper m/z limit for Q3 in SRM table generation.||Default is 1400.
 +|-
 +| -c_RTO||refreshTrypticOnly||When refreshing database (-cD option), only map peptide to protein when the peptide is tryptic in that protein.||Default is off.
 +|-
 +|colspan=4|CONSENSUS/BEST-REPLICATE LIBRARY CREATION OPTIONS (Applicable with -cAC and -cAB options)
|- |-
| '''-cr<num>'''||minimumNumReplicates||Minimum number of replicates required for each library entry. Peptide ions with fewer than <num> replicates will be excluded from library when creating consensus library.||Default is 1. | '''-cr<num>'''||minimumNumReplicates||Minimum number of replicates required for each library entry. Peptide ions with fewer than <num> replicates will be excluded from library when creating consensus library.||Default is 1.
Line 392: Line 604:
| -c_WGT<score>||replicateWeight||Select the type of score to weigh and rank the replicates.||-c_WGTS (default) = Use a measure of signal-to-noise ratio as the weight.<br>-c_WGTX = Use a function of the SEQUEST xcorr score as the weight.<br>-c_WGTP = Use a function of the PeptideProphet probability as the weight. | -c_WGT<score>||replicateWeight||Select the type of score to weigh and rank the replicates.||-c_WGTS (default) = Use a measure of signal-to-noise ratio as the weight.<br>-c_WGTX = Use a function of the SEQUEST xcorr score as the weight.<br>-c_WGTP = Use a function of the PeptideProphet probability as the weight.
|- |-
-|colspan=4|BEST REPLICATE SELECTION OPTIONS (Applicable with -cAB option)+| -c_RRS||recordRawSpectra||Record all raw spectra (in the format file.scan.scan) used in building the consensus in the Comment line.||Default is off.
-|-+
-| '''-cr<num>'''||minimumNumReplicates||Minimum number of replicates required for each library entry. Peptide ions with fewer than <num> replicates will be excluded from library when creating best-replicate library library.||Default is 1.+
-|-+
-| -c_DIS||removeDissimilarReplicates||Remove dissimilar replicates before selecting best replicate.||Turn on with -c_DIS, off with -c_DIS!. Default is on.+
|- |-
|colspan=4|QUALITY FILTER OPTIONS (Applicable with -cAQ option) |colspan=4|QUALITY FILTER OPTIONS (Applicable with -cAQ option)
Line 409: Line 617:
|- |-
| -c_QIE||qualityImmuneMultipleEngines||Make spectra identified by multiple sequence search engines immune to quality filters.||Turn on with -c_QIE, off with -c_QIE!. Default is on. | -c_QIE||qualityImmuneMultipleEngines||Make spectra identified by multiple sequence search engines immune to quality filters.||Turn on with -c_QIE, off with -c_QIE!. Default is on.
 +|-
 +|colspan=4|BAYESIAN DENOISER OPTIONS
 +|-
 +| -c_BDU||useBayesianDenoiser||Use Bayesian denoiser. Default parameters are used unless trained on the fly with -c_BDT option, or read from a file specified by -c_BDF option.||Default is off.
 +|-
 +| -c_BDT||trainBayesianDenoiser||Train Bayesian denoiser. Only active in consensus mode (-cAC option). ||Default is off.
 +|-
 +| -c_BDP<thres>||denoiserMinimumSignalProb||Minimum signal probability to retain a peak when denoiser is used.||Default is 0.0.
 +|-
 +| -c_BDF<file>||denoiserParamFile||Specify parameter file for Bayesian denoiser, for both writing and reading.||Default is off (no writing or reading).
 +|-
 +|colspan=4|DECOY GENERATION OPTIONS (Applicable with -cAD option)
 +|-
 +| '''-cc'''||'''decoyConcatenate'''||Concatenate real and decoy libraries.||Default is off: library consisting of only decoy spectra is created.
 +|-
 +| -cy<num>||decoySizeRatio||Specify the (decoy / real) size ratio.||Default is 1. <num> must be an integer.
 +|-
 +| -c_DPS||decoyPrecursorSwap||Use a modified form of the [http://www.ncbi.nlm.nih.gov/pubmed/23560440 precursor swap method] for generating decoys.||Turn on with -c_DPS, off with -c_DPS! Default is off.
 +|-
 +|colspan=4|RETENTION TIME NORMALIZATION OPTIONS (Applicable with .pep.xml file input)
 +|-
 +| -c_IRT<file>||normalizeRTWithLandmarks||Use landmark peptides in <file> to normalize retention times to iRTs.||Default is off. <file> should be a space-delimited table with two columns: peptide sequence and iRT
 +|-
 +| -c_IRR||normalizeRTLinearRegression||Regress the real RTs of landmark peptides (i.e. assume they form a straight line).||Turn on with -c_IRR, off with -c_IRR!. Default is off.
 +|-
 +|colspan=4|UNIDENTIFIED LIBRARY/CLUSTERING OPTIONS
 +|-
 +| -c_UCR||unidentifiedClusterIndividualRun||Merge neighboring spectra in each run as they are imported from data (mz(X)ML) files.||Turn on with -c_UCR, off with -c_UCR!. Default is off.
 +|-
 +| -c_UCD<thres>||unidentifiedClusterMinimumDot||Specify the minimum dot product for two spectra to be clustered.||Default is 0.7.
 +|-
 +| -c_UX1||unidentifiedRemoveSinglyCharged||Remove spectra that appear to be from singly charged precursors.||Turn on with -c_UX1, off with -c_UX1!. Default is on.
 +|-
 +| -c_UNP<num>||unidentifiedMinimumNumPeaksToInclude||Remove spectra that have fewer than <num> peaks.||Default is 35.
 +|-
 +| -c_USX<thres>||unidentifiedSingletonXreaThreshold||Apply an Xrea (quality measure) filter to singleton spectra after clustering. Only those with Xrea at least <thres> are kept.||Default is 0.6.
 +|-
 +|colspan=4|SEMI-EMPIRICAL SPECTRUM GENERATION OPTIONS (Applicable with -cAM option)
 +|-
 +| -cx<string>||allowableModTokens||Specify the set(s) of modifications allowed in semi-empirical spectrum generation by -cAM option.||Default is off: no semi-empirical spectrum generated.
|} |}
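
The combine actions listed under <code>-cJ</code> behave much like set operations on the peptide ions of the input libraries. The Python sketch below illustrates the union, intersection, subtraction and appending cases only. It is not SpectraST code: the function name <code>combine_ions</code> and the ion identifiers are hypothetical, each library is reduced to a set of peptide-ion identifiers, and the homolog subtraction (<code>-cJH</code>) is left out because it depends on sequence and m/z similarity.

```python
def combine_ions(libraries, action="U"):
    """Decide which peptide ions survive a combine action.

    libraries: list of sets of peptide-ion identifiers, in command-line order.
    Returns a dict mapping each surviving ion to the indices of the
    libraries that contribute it.
    """
    # Collect every ion once, in order of first appearance.
    all_ions = []
    for library in libraries:
        for ion in sorted(library):
            if ion not in all_ions:
                all_ions.append(ion)

    result = {}
    for ion in all_ions:
        sources = [i for i, library in enumerate(libraries) if ion in library]
        if action == "U":            # union: every ion from every file
            result[ion] = sources
        elif action == "I":          # intersection: ion present in all files
            if len(sources) == len(libraries):
                result[ion] = sources
        elif action == "S":          # subtraction: only in the first file
            if sources == [0]:
                result[ion] = sources
        elif action == "A":          # appending: taken from the first file that has it
            result[ion] = sources[:1]
        else:
            raise ValueError("unsupported combine action: " + action)
    return result


libs = [{"AAK/2", "CCR/2"}, {"CCR/2", "DDK/3"}]
for action in "UISA":
    print(action, combine_ions(libs, action))
```

The difference between union and appending shows up for an ion present in both files: under union both libraries contribute to it, while under appending only the first library on the command line does.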
Line 422: Line 670:
|- |-
| '''-L<file>'''||'''None'''||Specify name of log file.||Default is spectrast.log. | '''-L<file>'''||'''None'''||Specify name of log file.||Default is spectrast.log.
 +|-
 +| '''-M<file>'''||'''None'''||Activate user-defined modifications listed in <file>.||Default is off. If <file> is omitted spectrast.usermods is assumed.
|} |}
Line 447: Line 697:
* MatchTol: The m/z tolerance within which a peak is considered matched between the library and query spectra. This affects the labeling and coloring of the peaks. * MatchTol: The m/z tolerance within which a peak is considered matched between the library and query spectra. This affects the labeling and coloring of the peaks.
* Y-Zoom: Zooming factor in the Y axis (the peak intensity). * Y-Zoom: Zooming factor in the Y axis (the peak intensity).
-* BlankPrecRegion: Blank the region around the precursor m/z. (Note: in SpectraST searching peaks in this region are ignored.)+* BlankPrecRegion: Blank the region around the precursor m/z. (Note: in SpectraST searching, peaks in this region are ignored.)
* Annotation Options * Annotation Options
** LabelType: Toggling between displaying the ion type, the m/z value, or no label for selected annotated peaks ** LabelType: Toggling between displaying the ion type, the m/z value, or no label for selected annotated peaks
Line 459: Line 709:
</div> </div>
-The stand-alone plotspectrast application produces a static .jpg image in the same directory as the query spectrum file. It has the following usage:+The stand-alone plotspectrast application produces a static .png image in the same directory as the query spectrum file. It has the following usage:
<code>plotspectrast <.splib file> <library file offset> <.mzXML file> <query scan number></code> <code>plotspectrast <.splib file> <library file offset> <.mzXML file> <query scan number></code>
Line 476: Line 726:
Similar to above, except the library spectrum is given in a .msp file. Similar to above, except the library spectrum is given in a .msp file.
 +
 +<code>plotspectrast <.splib file 1> <library file offset 1> <.splib file 2> <library file offset 2></code>
 +
 +Plots two library spectra head to tail.
==== Lib2HTML ==== ==== Lib2HTML ====
Line 491: Line 745:
== Developer's Guide == == Developer's Guide ==
-The SpectraST source code contains detailed documentation. More tips to developers who want to modify SpectraST will be available shortly.+The SpectraST source code contains detailed documentation.
 + 
 +<h4>sptxt file format:</h4>
 + 
 +<p>The sptxt file format is very closely related to the msp format, whose
 +documentation can be found [http://chemdata.nist.gov/mass-spc/ftp/mass-spc/PepLib.pdf here].</p>
 + 
 +<h4>Annotation syntax:</h4>
 + 
 +SpectraST's syntax to annotate a fragment follows the scheme proposed by
 +Roepstorff and Fohlman.
 + 
 +An annotation tag starts with the assigned ion type (a,b,c,x,y or z) and is
 +followed by the number of amino acid residues present in the fragment. This
 +number is possibly followed by a signed integer value, indicating a
 +modification. Note that, besides post-translational modifications, neutral
 +losses such as loss of water (-18) and loss of ammonia (-17) are also taken into account.
 +The caret symbol '^' followed by an integer value depicts the charge state of
 +the fragment. Its absence indicates a singly charged fragment ion. An
 +additional 'i' at the end of the annotation tag implies that the mass value
 +does not correspond to the expected mass value of the monoisotopic peak, but
 +can be assigned to a different isotopic peak of the fragment. Finally, the
 +annotation pattern contains the average mass deviation (in Da) from the
 +theoretically expected mass. A slash '/' precedes this number.
 + 
 +The list of possible annotations is ordered by ascending charge states,
 +where ties are broken by ascending mass deviations.
 + 
 +Annotation tags can be enclosed by square brackets, indicating that several
 +peaks could be assigned the same particular ion. Usually, SpectraST would
 +resolve such a situation by annotating only one of the ions and leaving the
 +other ones blank. If data is not (sufficiently) centroided, this strategy might
 +lead to a bunch of unresolved peaks, which might in turn cause quality filters
 +to fail. To circumvent this problem, if there are additional intense peaks that
 +look to be the same ion, a bracketed annotation will be given to them.
 + 
 +Besides annotations following the Roepstorff/Fohlman notation SpectraST also
 +assigns immonium ions. The corresponding tag consists of 3 capital letters,
 +always starting with an 'I' (for immonium), followed by the amino acid and an
 +additional letter to designate different residue-specific ions from that amino
 +acid.
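As a rough illustration of the tag grammar described above, the following Python sketch parses a single Roepstorff/Fohlman-style annotation tag. The regular expression, the name parse_annotation and the example tags in the usage note are assumptions derived from the prose, not code taken from SpectraST; immonium ion tags are not handled.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern built from the prose description, not from SpectraST source.
_TAG = re.compile(
    r"^(?P<open>\[)?"             # optional opening square bracket
    r"(?P<ion>[abcxyz])"          # ion type
    r"(?P<length>\d+)"            # number of residues in the fragment
    r"(?P<shift>[+-]\d+)?"        # optional signed mass change, e.g. -18
    r"(?:\^(?P<charge>\d+))?"     # optional charge state; absent means 1+
    r"(?P<isotope>i)?"            # optional flag for a non-monoisotopic peak
    r"/(?P<dev>[+-]?\d*\.?\d+)"   # mass deviation in Da, after a slash
    r"(?P<close>\])?$"            # optional closing square bracket
)


class Annotation(NamedTuple):
    ion_type: str
    num_residues: int
    mass_shift: Optional[int]
    charge: int
    isotopic: bool
    deviation: float
    bracketed: bool


def parse_annotation(tag: str) -> Annotation:
    """Parse one annotation tag; raise ValueError if it does not fit the pattern."""
    m = _TAG.match(tag.strip())
    # Brackets must come as a pair or not at all.
    if m is None or (m["open"] is None) != (m["close"] is None):
        raise ValueError(f"not a recognised annotation tag: {tag!r}")
    return Annotation(
        ion_type=m["ion"],
        num_residues=int(m["length"]),
        mass_shift=int(m["shift"]) if m["shift"] else None,
        charge=int(m["charge"]) if m["charge"] else 1,
        isotopic=m["isotope"] is not None,
        deviation=float(m["dev"]),
        bracketed=m["open"] is not None,
    )
```

For instance, a made-up tag such as b5-18^2i/0.12 would be read as a doubly charged b5 ion with a mass shift of -18, assigned to a non-monoisotopic peak, 0.12 Da away from the expected mass.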
 + 
 +More tips to developers who want to modify SpectraST will be available
 +shortly.
== Where to Get Help == == Where to Get Help ==
Line 511: Line 808:
* [http://www.thegpm.org/hunter/index.html GPM's X!Hunter Project] * [http://www.thegpm.org/hunter/index.html GPM's X!Hunter Project]
* [http://proteome.gs.washington.edu/bibliospec/documentation/ BiblioSpec Project at University of Washington] * [http://proteome.gs.washington.edu/bibliospec/documentation/ BiblioSpec Project at University of Washington]
- 
== Reference == == Reference ==
Line 522: Line 818:
* Frewen, Barbara, ''et al.'' (2006). "Analysis of peptide MS/MS spectra from large-scale proteomics experiments using spectrum libraries". ''Analytical Chemistry'' '''78''' (16), 5678-5684. [http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&TermToSearch=16906711 Abstract] * Frewen, Barbara, ''et al.'' (2006). "Analysis of peptide MS/MS spectra from large-scale proteomics experiments using spectrum libraries". ''Analytical Chemistry'' '''78''' (16), 5678-5684. [http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&TermToSearch=16906711 Abstract]
 +
 +* Lam, Henry, ''et al.'' (2008). "Building consensus spectral libraries for peptide identifications in proteomics". ''Nature Methods'' '''5''', 873-875. [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2637392/?tool=pubmed FullText]
 +
 +* Picotti, Paola, ''et al.'' (2008). "A database of validated assays for the targeted mass spectrometric analysis of the S. cerevisiae proteome". ''Nature Methods'' '''5''', 913-914. [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2770732/?tool=pubmed FullText]
 +
 +* Sherwood, Carly, ''et al.'' (2009). "MaRiMba: A software application for spectral library-based MRM transition list assembly ". ''Journal of Proteome Research'' '''8''', 4396-4405. [http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&TermToSearch=19603829 Abstract]
 +
 +* Lam, Henry, ''et al.'' (2010). " Artificial decoy spectral libraries for false discovery rate estimation in spectral library searching in proteomics". ''Journal of Proteome Research'' '''9''', 605-610. [http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&TermToSearch=19916561 Abstract]

Current revision

SpectraST (short for "Spectra Search Tool" and rhymes with "contrast") is a spectral library building and searching tool designed primarily for shotgun proteomics applications. It is developed at the Institute for Systems Biology (ISB), in the research group of Professor Ruedi Aebersold. The main developer is Henry Lam.

The latest version of SpectraST is 5.0, released as a beta in November 2013 and officially with TPP 4.7 in March 2014. It is distributed by ISB under the LGPL license, as a component of the Trans Proteomic Pipeline (TPP) suite of software, distributed under the same license. The source code repository is at [1], and the official download site for the Windows installer is at [2].

Introduction to Shotgun Proteomics and Spectral Searching

The goal of proteomics is the systematic identification and quantification of all proteins in a biological system. In one of the most frequently practiced workflows, commonly known as shotgun proteomics, a protein sample of interest is first digested with a proteolytic enzyme (trypsin being the most common) to yield peptides that are amenable to LC-MS/MS analysis. The peptides in the resulting mixture are chromatographically resolved, ionized by techniques such as electrospray ionization (ESI) or matrix-assisted laser desorption ionization (MALDI) before being analyzed by a mass spectrometer. A fraction of the peptide ions are selectively isolated by the mass spectrometer and subjected to collision-induced dissociation (CID), in which the peptide ions are bombarded with noble gas atoms to induce fragmentation. (Other types of fragmentation techniques are also rapidly maturing.) The fragment ions are detected and reported by the mass spectrometer as tandem mass (MS/MS) spectra. Because peptide ions tend to fragment mostly along the peptide backbone in a somewhat predictable manner, the MS/MS spectra contain information that can be used to deduce the peptide sequence.

Traditionally, the inference of the peptide sequence from its characteristic tandem mass spectra is done by sequence (database) searching. In sequence searching, a target protein (or translated DNA) database is used as a reference to generate all possible putative peptide sequences by in silico digestion. The search engines then use various rules to predict the theoretical fragmentation pattern of each of these putative peptides, and compare the experimentally observed MS/MS spectra to these theoretical spectra one-by-one. Presumably, a positive identification is made if the experimental spectrum is sufficiently similar to one of the theoretical spectra. Several popular computational tools developed for this purpose have emerged over the years, each employing different algorithms and heuristics to achieve an acceptable balance of sensitivity and accuracy. Unfortunately, traditional sequence searching is a challenging, error-prone, and computationally expensive exercise. Despite the tremendous improvement in computer hardware and software over the past decade, this step often remains the bottleneck of any given proteomics experiment. The requirement of computational resources is also substantial, limiting the use of this powerful technique to only those research groups that can afford the costly computational infrastructure.

Spectral searching is an alternative approach that promises to address some of the shortcomings of sequence searching. In spectral searching, a spectral library is meticulously compiled from a large collection of previously observed and identified peptide MS/MS spectra. The unknown spectrum can then be identified by comparing it to all the candidates in the spectral library for the best match. This approach has been commonly employed for mass spectrometric analysis of small molecules with great success, but has only become possible for proteomics very recently. The main difficulty of generating enough high-quality experimental spectra for compilation into spectral libraries has been overcome by the recent explosion of proteomics data and the availability of public data repositories. Several attempts at creating and searching spectral libraries in the context of proteomics have been published within the past year, all demonstrating the tremendous improvement in search speed and the great potential of this method in complementing, if not replacing, sequence searching in many proteomics applications.

Advantages of Spectral Searching

1. Speed

Spectral searching benefits from a much reduced search space compared to sequence searching. In spectral searching, only peptide ions that are observed and identified in previous experiments will be included in spectral libraries and considered as candidates, whereas in sequence searching, all putative peptide sequences -- plus all permutations of post-translational modification sites, if specified -- in a protein database are considered. Most of these putative peptide ions considered in sequence searching are never observed in practice for a variety of reasons. With typical search parameters, the search space of spectral searching can be several orders of magnitude smaller. It is therefore not surprising that spectral searching can also be several orders of magnitude faster. SpectraST can achieve a top speed of 0.001 to 0.01 second per query spectrum (against a library of about 50,000 entries) on a modern personal computer. In contrast, SEQUEST, one of the most popular sequence search engines, needs about 5 to 20 seconds per query spectrum (against a human IPI database).

2. Preciseness

Spectral searching compares experimental spectra to experimental spectra; sequence searching compares experimental spectra to theoretical spectra. In general, the theoretical spectra considered in sequence searching are very simplistic (e.g., only including b- and y-type ions, at a fixed intensity), and do not resemble the experimental spectra that they are supposed to match. On the other hand, armed with previously observed experimental spectra compiled into spectral libraries, spectral searching can take full advantage of all spectral features, including actual peak intensities, neutral losses from fragments, and various uncommon or even unknown fragments, to determine the best match. The similarity scoring of spectral searching is therefore more precise, and will generally provide better discrimination between good and bad matches. This usually results in much superior statistics (e.g., sensitivity, false discovery rates) for the search results, compared to sequence searching.

Versions

What's new in SpectraST 5.0

  • New, rank-based similarity scoring function (Old scoring function remains as an option)
  • High mass accuracy MS2 (including HCD) support
  • Spectral archive (unidentified spectral library) building
  • Biological sample fingerprinting by spectral archives (Contributor: Dr. Wenguang Shao)
  • Open (blind) modification search (Contributor: Dr. Manson Ma)
  • Improved decoy generation, including alternative method by precursor swapping
  • Semi-empirical spectrum generation for amino acid substitutions (Contributor: Dr. Yingwei Hu)
  • De-noising based on Bayesian classifier (Contributor: Dr. Wenguang Shao)
  • Retention time normalization using injected landmark peptides
  • Support for glycopeptides (Contributor: Dr. Yingwei Hu)

What's new in SpectraST 4.0

  • ETD support
  • iProphet support
  • Decoy spectrum generation
  • MRM transition list generation
  • User-defined modifications
  • Semi-empirical spectrum generation from real spectrum of closely related identification (Contributor: Dr. Yingwei Hu)
  • Searching .mgf files
  • Clickable (HTML) search output format
  • Better book-keeping in library building
  • Various bug fixes and performance enhancements

What's new in SpectraST 3.1

  • Re-mapping peptide identifications of library entries to protein sequence database of choice
  • Rudimentary centroiding for imported spectra in profile mode
  • mzML support via TPP
  • Various bug fixes and performance enhancements

What's new in SpectraST 3.0

  • Creating libraries from sequence search results
  • Library manipulation
    • Union/Intersect/Subtract operations
    • Consensus/Best-replicate library building
    • Filtering based on criteria
    • Quality filters
  • Importing libraries from X!Hunter and BiblioSpec formats
  • File list feature
  • Logging
  • Lib2HTML utility for visualizing library
  • Monoisotopic mass support
  • Various bug fixes and performance enhancements

What’s new in SpectraST 2.0

  • Binary library format, enabling speed gain
  • Library information and statistics in preambles of .sptxt and .pepidx files
  • Searching of .dta files
  • Detecting homologs in hit list
  • Various bug fixes and performance enhancements

User's Guide

Installing SpectraST

SpectraST is an integral component of the Trans Proteomic Pipeline suite of software. Although it can be used alone without other TPP components, SpectraST users are strongly encouraged to download and install the entire TPP suite, which provides other useful functionalities such as raw data importation, automatic validation of search results, protein inference, and quantification and visualization.

Windows users: SpectraST is available as part of TPP for Windows. A one-click installer is available, in which Windows-native executables are compiled by MinGW.

UNIX/LINUX users: Visit the Sashimi project page on SourceForge.net, and download the code as a tarball directly. Compiling, installation and configuration information is available in the README file. Alternatively, follow the instructions for Ubuntu LINUX installation.

Running SpectraST

SpectraST has two modes, the Create mode and the Search mode. In the former, SpectraST creates a searchable spectral library from various formats to prepare for searching. In the latter, SpectraST takes in unknown spectra and searches each of them against the spectral library.

The simplest way of running SpectraST is from the command line of your UNIX/LINUX or Windows cmd shell. The general usage is:

spectrast <options> <list of files of appropriate formats>

Options must be separated by spaces, and all begin with a hyphen ('-'). Search mode options always have an 's' following the hyphen; Create mode options a 'c'. SpectraST will perform the appropriate action based on the options specified, and complain when there are problems interpreting the command statement. The usage statement and a list of options can be viewed by issuing the command spectrast by itself.

Once TPP is installed, SpectraST can also be run from the Petunia web interface, with limited options.

SpectraST Search Mode

SpectraST can perform spectral searching from the following data formats:

  • .mzML format
  • .mzXML (all versions) format
  • .mzData format
  • .mgf (Mascot Generic) format
  • .dta (SEQUEST) format, a simple peak list preceded by precursor information
  • NIST (National Institute of Standards and Technology)’s .msp format

To search, the spectral library must be in SpectraST’s .splib format, which can be created in SpectraST Create Mode.

The results can be outputted to the following formats:

  • .pepXML format
  • .txt format, a fixed-width column text format
  • .xls format, a tab-delimited column text format
  • .html format, a HTML table with clickable links to spectrum viewer

The search mode is initiated with the option -s, or any of the search mode options. For instance, to search the MS/MS spectra in the file foo.mzXML against the spectral library bar.splib, using the parameters specified in the file spectrast.params, the command is simply:

spectrast -sFspectrast.params -sLbar.splib foo.mzXML

Note: If the library is not specified in the parameter file or if the parameter file is not given, then the option -sL is mandatory; otherwise SpectraST will not know which spectral library to use.

In the above, -sF and -sL are search mode options that the user can specify to customize the behavior of SpectraST. SpectraST will search all the MS/MS spectra in the file foo.mzXML against the spectral library bar.splib, using the parameters specified in the file spectrast.params. The result will be written to a file named foo.<ext> in the same directory where <ext> specifies the output format (.pep.xml, .txt, .xls, or .html).
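The search command above can also be assembled from a script. The following Python sketch is one way to do so, using only the -sF and -sL options shown above; the helper names build_search_command and run_search are hypothetical, and run_search assumes a spectrast executable on the PATH.

```python
import subprocess
from typing import List, Optional


def build_search_command(
    data_files: List[str],
    library: Optional[str] = None,
    params: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for a SpectraST search (illustrative helper)."""
    # Without a parameter file there is nowhere else to name the library,
    # so -sL becomes mandatory in that case.
    if library is None and params is None:
        raise ValueError("a library (-sL) or a parameter file (-sF) is required")
    cmd = ["spectrast"]
    if params is not None:
        cmd.append(f"-sF{params}")
    if library is not None:
        cmd.append(f"-sL{library}")
    cmd.extend(data_files)
    return cmd


def run_search(data_files: List[str], **kwargs) -> None:
    """Run the search; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_search_command(data_files, **kwargs), check=True)
```

Calling build_search_command(["foo.mzXML"], library="bar.splib", params="spectrast.params") yields the same arguments as the command line shown above.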

For a full list of options, see SpectraST Options.

SpectraST Create Mode

Importing Existing Libraries

SpectraST can create a searchable spectral library from the following formats:

  • NIST (National Institute of Standards and Technology)'s .msp format (Download here)
  • X!Hunter's .hlf format [3]
  • BiblioSpec’s .ms2 format [4]

If files of these extensions are supplied, SpectraST simply converts those spectral libraries into a form suitable for SpectraST searches (.splib formats). (Note however that there is no study on how well SpectraST works with X!Hunter and BiblioSpec libraries.) For instance, to import the NIST yeast consensus library, and call the resulting library bar.splib and put it in the directory /dir/, the command is:

spectrast -cN/dir/bar yeast_consensus.msp

When it is done, it produces 5 files in the directory /dir/. The file bar.splib is the library itself; it’s in a binary (machine-readable) format. The file bar.sptxt is a text (human-readable) version of bar.splib. This .sptxt file is of no use to SpectraST; it can be deleted after manual inspection. The files bar.spidx and bar.pepidx are indices on the precursor m/z value and peptide, respectively. Keep the indices and the .splib file in the same directory for SpectraST to function properly. Lastly, a file spectrast.log is also created to document the command executed. Some useful information about the library is printed at the beginning of the bar.sptxt and bar.pepidx.

For a full list of SpectraST options, see SpectraST Options.

Creating Libraries from Sequence Search Results

Note: As per TPP convention, the spectrum query must be named:

<mzXML file name>.<start scan>.<end scan>.<charge>

in the .pepXML file, so that SpectraST knows where to find the corresponding experimental spectrum. (If the .pepXML file is created with TPP tools, this should not be an issue.)
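To make the naming convention concrete, here is a small Python sketch that splits a spectrum query name into its four parts. The function and type names are hypothetical; the only assumption taken from the text is the dot-separated layout shown above.

```python
from typing import NamedTuple


class SpectrumQuery(NamedTuple):
    base_name: str
    start_scan: int
    end_scan: int
    charge: int


def parse_spectrum_query(name: str) -> SpectrumQuery:
    """Split '<file name>.<start scan>.<end scan>.<charge>' into its fields."""
    # Split from the right so that dots inside the file name are preserved.
    parts = name.rsplit(".", 3)
    if len(parts) != 4:
        raise ValueError(f"unexpected spectrum query name: {name!r}")
    base, start, end, charge = parts
    # int() raises ValueError if a scan number or charge is not numeric.
    return SpectrumQuery(base, int(start), int(end), int(charge))
```

A name that does not follow the convention raises ValueError, which is a quick way to spot .pepXML files that SpectraST would be unable to link back to their spectra.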

SpectraST can create a spectral library from a .pepXML file, which contains peptide identifications from a previous shotgun proteomics experiment. For this purpose, it is preferable that the .pepXML has been processed with PeptideProphet and/or iProphet, such that all the search hits have probabilities assigned. (iProphet probabilities are used over PeptideProphet ones if both are present.)

When importing from a .pepXML file, SpectraST scans through the .pepXML file for confident identifications, and attempts to extract the corresponding experimental spectra from .mzXML files. For instance, the command

spectrast -cNraw -cP0.9 dataset1.xml

will import all peptide identifications with probability at or above 0.9 from the file dataset1.xml, and put them in a library called raw.splib (with the accompanying raw.sptxt, raw.spidx and raw.pepidx files). For a full list of SpectraST options, see SpectraST Options.

Manipulating SpectraST Libraries

SpectraST can convert one or more .splib libraries to another, performing various operations. For instance, to create a consensus library from all the entries in bar.splib and foo.splib, the command is:

spectrast -cNconsensus -cJU -cAC bar.splib foo.splib

SpectraST will take the union (specified by the option -cJU) of all the entries in bar.splib and foo.splib, and wherever a certain peptide ion is present as multiple entries (replicates), it will coalesce the replicates into a single consensus spectrum (specified by -cAC).

Some additional examples:

spectrast -cNphospho -cf"Mods =~ Phospho" bar.splib

This will screen the library bar.splib for all entries with a phosphorylation modification, and put the phosphopeptides in the library phospho.splib.

spectrast -cNcommon -cJI dataset1.splib dataset2.splib

This will take the intersection of the two libraries dataset1.splib and dataset2.splib, and put all entries of peptide ions that are seen in both files in the library common.splib.

spectrast -cNquality -cAQ -cL2 bar.splib

This will apply SpectraST’s quality filters to the library bar.splib; only those entries that pass the first 2 quality filters will be included in the library quality.splib.

For a full list of SpectraST options, see SpectraST Options. For a typical recipe for creating consensus libraries from sequence search results, see Creating Consensus Libraries.

Creating Consensus Libraries

A recipe for creating consensus libraries from TPP-processed sequence search results is detailed here. Consider the following example:

{|
! Dataset Identifier !! pepXML Files !! mzXML Files
|-
| Alpha || A-SEQ.xml (SEQUEST results of A1.mzXML), A-MAS.xml (Mascot results of A1.mzXML) || A1.mzXML
|-
| Beta || B1.xml (SEQUEST results of B1.mzXML), B2.xml (SEQUEST results of B2.mzXML) || B1.mzXML, B2.mzXML
|-
| Gamma || G.xml (combined SEQUEST results of all .mzXML files) || G1.mzXML, G2.mzXML, G3.mzXML
|}

The following commands should be issued in succession:

Note: Alternatively, the library building recipe can be encoded in a recipe.list file (see SpectraST File List Feature):
? -cNrawA -cnAlpha
A-SEQ.xml
A-MAS.xml
? -cNrawB -cnBeta
B1.xml
B2.xml
? -cNrawG -cnGamma
G.xml
? -cJU -cAC -cNconsABC
rawA.splib
rawB.splib
rawG.splib
? -cAQ -cNconsABC_Q
consABC.splib
The command spectrast recipe.list will complete the entire library building procedure.
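Judging only from the example recipe above, lines beginning with '?' carry the options of one build step and the lines that follow name its input files. Under that assumption (the full syntax is described under SpectraST File List Feature), the Python sketch below expands a recipe into the equivalent individual commands; recipe_to_commands is a hypothetical helper, not part of SpectraST.

```python
from typing import List


def recipe_to_commands(recipe_text: str, executable: str = "spectrast") -> List[List[str]]:
    """Expand a recipe into one argument list per build step (illustrative only)."""
    commands: List[List[str]] = []
    for raw_line in recipe_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("?"):
            # A '?' line starts a new step; the rest of the line holds its options.
            # Note: a plain split() would break options that contain quoted spaces.
            commands.append([executable] + line[1:].split())
        elif commands:
            # Any other line is an input file for the current step.
            commands[-1].append(line)
        else:
            raise ValueError("file name found before any '?' option line")
    return commands
```

Printing " ".join(cmd) for each returned list gives the step-by-step command lines, which can be handy for checking a recipe before running it.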

1. Importing the raw spectra into SpectraST
spectrast -cNrawA -cnAlpha A-SEQ.xml A-MAS.xml
spectrast -cNrawB -cnBeta B1.xml B2.xml
spectrast -cNrawG -cnGamma G.xml

These commands will create the raw libraries rawA.splib, rawB.splib and rawG.splib. Identifications from multiple .pepXML files of the same dataset are imported with the same dataset identifier. The same query with identifications from multiple search engines will be combined intelligently. The probability threshold above which identifications are imported can be specified with the option -cP<prob>, which defaults to 0.9. At this stage, replicates of the same peptide ion identification are not yet coalesced into a consensus spectrum. Remember that the .mzXML files must be in the same directories as their corresponding .pepXML files.

2. Creating a consensus spectral library
spectrast -cJU -cAC -cNconsABC raw*.splib

This will combine the three raw libraries, then replace multiple replicates of the same peptide ion identification with a consensus spectrum. Many options are available to fine-tune the algorithm; however, the default parameters are usually adequate.

3. Performing quality control of the consensus spectral library
spectrast -cAQ -cNconsABC_Q consABC.splib

This will run the consensus spectra through SpectraST's quality filters. With the default settings, spectra failing either or both of the first 2 filters will be removed, and spectra failing any of the other filters will be marked. Different quality levels can be set with the options -cL and -cl. It is recommended that a consensus spectral library be subjected to some quality control before it is used in spectral searching, in order to minimize mis-identified and low-quality spectra in the library. These questionable spectra can propagate errors from sequence searching, reduce the discriminating power of the spectral search engine, and induce false positive and false negative hits. The optimal quality level reflects the user's desired compromise between library comprehensiveness and library quality.

For details on the consensus and quality filter algorithms, please refer to Lam et al. (2008) Nature Methods 5, 873-875.

4. Appending artificial decoy spectra
spectrast -cAD -cc -cy1 -cNcons_ABC_Q_DECOY consABC_Q.splib

This will generate an equal-size decoy spectral library and append it to the real library consABC_Q.splib. The presence of decoys enables the estimation of false discovery rate (FDR) by the well-established decoy counting method, and improves the accuracy of PeptideProphet in validating spectral search results.

The algorithm of generating artificial decoy spectra is described in Lam et al. (2010) Journal of Proteome Research 9, 605-610.
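The decoy counting idea can be sketched in a few lines of Python. The sketch assumes an equal-size decoy library, so that the number of decoy hits above a score threshold approximates the number of false target hits above it; the function name and the (score, is_decoy) input layout are illustrative assumptions, not SpectraST output formats.

```python
from typing import Iterable, Tuple


def estimate_fdr(hits: Iterable[Tuple[float, bool]], threshold: float) -> float:
    """Estimate FDR among target hits scoring at or above the threshold.

    Each hit is a (score, is_decoy) pair for the top match of one query.
    """
    targets = 0
    decoys = 0
    for score, is_decoy in hits:
        if score >= threshold:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
    if targets == 0:
        return 0.0
    # Decoy hits stand in for the unknown number of false target hits.
    return min(1.0, decoys / targets)
```

Scanning the threshold from high to low and recording the estimate at each step gives the usual trade-off curve between the number of accepted identifications and the expected error rate.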

Miscellaneous Features

SpectraST Parameter Files

Note: All options set in the parameter file will be overridden by command-line options, if specified.

SpectraST allows the use of parameter files to simplify the process of spectral library building and searching. Namely, desired options can be specified in a text file, and supplied to SpectraST every time the same action is performed, saving the user from having to specify a lengthy list of command-line options. To invoke the parameter files, specify the options -sF<parameter file> and -cF<parameter file> for Search Mode and Create Mode, respectively. Example parameter files are provided below (these are essentially the defaults):

Search Mode: spectrast.params

Create Mode: spectrast_create.params

High Mass Accuracy MS2 Support

As of version 5.0, SpectraST supports the handling of high mass accuracy MS2 spectra (including higher-energy collisional dissociation, HCD, spectra). The fragmentation methods CID-QTOF and HCD are created to tag such spectra, and that information is automatically extracted from the data (.mzML/.mzXML) files (if specified therein) for library building. The user can also explicitly specify the fragmentation method, using the -cI option.

When building libraries, SpectraST annotates and aligns CID-QTOF/HCD spectra differently, using a narrower tolerance and considering immonium ions and internal fragments. However, when searching, the user must still specify the "bin size" (equivalent to the product ion tolerance) as the mass accuracy can vary from instrument to instrument.

Spectral Archive (Unidentified Spectral Library) Building

As of version 5.0, SpectraST can also build and search spectral libraries without identifications. Such libraries are also referred to as spectral archives. The spectral archive building procedure is as follows:

spectrast -cNraw_unid foo.mzXML

This imports all MS2 spectra in foo.mzXML into the library raw_unid.splib, with some filtering and merging. Each spectrum will be given a unique identifier starting with an underscore '_' in place of the peptide identification.

spectrast -cNclustered_unid -cAS raw_unid.splib

This performs spectral clustering, such that replicate spectra of similar precursor m/z and high spectral similarity are detected and merged into consensus spectra.

Advanced options -c_UCR, -c_UCD, -c_UX1, -c_UNP and -c_USX control some tunable parameters of this procedure.

Spectral archives can be searched by SpectraST in the same way as identified spectral libraries, such as for biological sample fingerprinting (see below). Please note that results obtained by searching spectral archives may not be recognized by downstream TPP tools.

Biological Sample Fingerprinting by Spectral Archives

Spectral archives can be used for biological sample fingerprinting. As of version 5.0, upon searching such libraries, SpectraST can output a text file reporting "dataset similarities" calculated from counting spectral matches of library spectra from different sources (with -s_FIN option). The details of this algorithm and the instructions can be found in Onders et al. (2014) Nature Protocols 9, 842-850.

Open (Blind) Modification Search

As of version 5.0, SpectraST can perform open (blind) modification search using the algorithm published in Ma et al. (2014) Journal of Proteome Research 13, 2262-71.

De-noising by Bayesian Classifier

As of version 5.0, SpectraST implements the algorithm published in Shao et al. (2013) Journal of Proteome Research 12, 3223-32 for de-noising singleton library spectra.

Retention Time Normalization using Injected Landmark Peptides

As of version 5.0, SpectraST can make use of injected standard peptides to calculate a normalized retention time index (iRT) to be included in the library entry, which removes the variability due to LC settings and columns. This enables retention time information to be used in SWATH/SRM assays, among other uses. It implements two different ways of calculating iRT, linear regression and linear interpolation.
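The two calculation modes can be illustrated generically. The Python sketch below maps an observed retention time onto an iRT scale defined by landmark peptides, once with an ordinary least-squares line and once with piecewise linear interpolation. It is a textbook rendering under stated assumptions, not SpectraST's code: the function names are hypothetical, landmark retention times are assumed to be distinct, and values outside the landmark range are extrapolated from the nearest segment.

```python
from typing import Sequence, Tuple


def fit_line(rts: Sequence[float], irts: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept mapping observed RT to iRT."""
    n = len(rts)
    if n < 2 or n != len(irts):
        raise ValueError("need at least two landmark (RT, iRT) pairs")
    mean_x = sum(rts) / n
    mean_y = sum(irts) / n
    sxx = sum((x - mean_x) ** 2 for x in rts)
    if sxx == 0:
        raise ValueError("landmark retention times must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rts, irts))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def irt_by_regression(rt: float, rts: Sequence[float], irts: Sequence[float]) -> float:
    slope, intercept = fit_line(rts, irts)
    return slope * rt + intercept


def irt_by_interpolation(rt: float, rts: Sequence[float], irts: Sequence[float]) -> float:
    points = sorted(zip(rts, irts))
    if len(points) < 2:
        raise ValueError("need at least two landmark (RT, iRT) pairs")
    # Find the segment bracketing rt; outside the landmark range the first
    # or last segment is reused, which amounts to linear extrapolation.
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if rt <= x1:
            break
    return y0 + (y1 - y0) * (rt - x0) / (x1 - x0)
```

Regression smooths over noise in individual landmark measurements, whereas interpolation follows each landmark exactly and can therefore track a non-linear gradient more closely.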

ETD Support

As of version 4.0, SpectraST supports the import and searching of MS2 spectra by electron-transfer dissociation (ETD). A tag encoding the fragmentation method is added to each library entry to differentiate between CID (collision-induced dissociation) and ETD spectra, and that information is automatically extracted from the data (.mzML/.mzXML) files (if specified therein) for library building. The user can also explicitly specify the fragmentation method, using the -cI option.

SpectraST annotates ETD spectra differently than CID spectra, and the spectral matching algorithm is slightly modified to deal with the charge-reduced precursor peaks common in ETD spectra.

Generation of Transition Lists for Selected/Multiple Reaction Monitoring (S/MRM)

Selected reaction monitoring (SRM, also known as Multiple Reaction Monitoring, MRM) is an acquisition mode available on some mass spectrometers (mostly triple quadrupole instruments) which has gained increasing popularity as a targeted quantification technique. It requires as input a list of “transitions” to be monitored over the course of the experiment. Each transition consists of two numbers, Q1 (the precursor m/z) and Q3 (the fragment ion m/z) and must be selected beforehand based on knowledge of the peptide to be quantified. One of the most effective strategies for selecting appropriate transitions is to rely on previously acquired MS2 spectra of the target peptides stored in spectral libraries.

As of version 4.0, SpectraST implements an algorithm to select the N (a user-specified number) most suitable transitions for each library spectrum, and print the list of transitions in a table format. For example, the command:

spectrast -cNfoo_MRM -cM -cQ5 foo.splib

will create a “reduced” spectral library foo_MRM.splib, in which each spectrum retains only the 5 peaks most suited as MRM transitions. In addition, a text file foo_MRM.mrm containing the transitions in a table format will be printed.
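The criteria SpectraST uses to judge suitability are not described here. Purely to show the shape of the result, the Python sketch below substitutes a naive stand-in, ranking peaks by intensity alone, and returns (Q1, Q3, intensity) rows; the function name and data layout are hypothetical, and this is not SpectraST's selection algorithm.

```python
from typing import List, Tuple


def top_n_transitions(
    precursor_mz: float,
    peaks: List[Tuple[float, float]],
    n: int,
) -> List[Tuple[float, float, float]]:
    """Return up to n (Q1, Q3, intensity) rows for one library spectrum.

    peaks is a list of (fragment m/z, intensity) pairs. Ranking is by
    intensity only, a deliberate simplification.
    """
    strongest = sorted(peaks, key=lambda peak: peak[1], reverse=True)[:n]
    # Report the retained peaks in ascending order of fragment m/z.
    return [(precursor_mz, mz, intensity) for mz, intensity in sorted(strongest)]
```

A realistic selection would also weigh factors such as fragment ion type and interference, which is why the rows produced here should not be used as an assay as-is.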

A fully functional software tool for MRM experiment design, MaRiMba (Sherwood et al. (2009) Journal of Proteome Research 8, 4396-4405), which essentially wraps this transition selection algorithm of SpectraST as well as implements some surrounding functionalities, is freely available as part of the TPP and accessible via the Petunia web interface.

User-defined Modifications

As of version 4.0, users can specify allowable modifications in a text file. (Prior to 4.0, SpectraST internally maintained a list of common modifications and rejected identifications with unrecognized modifications.) This text file should contain space- or comma-delimited strings of the following form:

<token>|<monoisotopic mass change from unmodified amino acid>|<name of modification>

where <token> must be one of the amino acids in one-letter code (capitalized), n (the N terminus) or c (the C terminus), optionally followed by [<tag>], a user-defined short tag for that modification, and <name of modification> is a user-defined formal name of that modification which will be written to the library created.

An example of this file can be found in spectrast.usermods.
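
To make the entry format concrete, here is a minimal Python sketch that reads such entries. It is an illustration only, not the parser used inside SpectraST. The names UserMod and parse_usermods are hypothetical, and the sketch assumes that modification names contain no spaces or commas, since those characters act as delimiters.

```python
import re
from typing import List, NamedTuple


class UserMod(NamedTuple):
    token: str         # e.g. "K[134]", "n[43]" or "C"
    mass_shift: float  # monoisotopic mass change from the unmodified residue
    name: str          # formal modification name written to the library


# One residue letter (A-Z), or n / c for the termini, optionally followed by [<tag>].
TOKEN_RE = re.compile(r"^[A-Znc](\[[^\[\]]+\])?$")


def parse_usermods(text: str) -> List[UserMod]:
    """Parse space- or comma-delimited <token>|<mass>|<name> entries."""
    mods = []
    for field in re.split(r"[,\s]+", text.strip()):
        if not field:
            continue
        parts = field.split("|")
        if len(parts) != 3:
            raise ValueError(f"expected <token>|<mass>|<name>, got {field!r}")
        token, mass, name = parts
        if not TOKEN_RE.match(token):
            raise ValueError(f"bad token {token!r}")
        mods.append(UserMod(token, float(mass), name))
    return mods


if __name__ == "__main__":
    # Illustrative entries; the tags and names here are examples, not a shipped file.
    example = "K[134]|6.020129|Label:13C(6), M[147]|15.994915|Oxidation"
    for mod in parse_usermods(example):
        print(mod)
```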

To activate the modifications specified in this file, add the option -M to the command line whenever identifications containing these modifications are to be processed. By default, SpectraST expects the file to be named spectrast.usermods and to reside in the current working directory. You can specify a different file by appending its name (with path if necessary) after the -M.

Since a spectral library is meant to be a long-lasting and shared resource, special care should be taken in specifying new modifications. Accurate monoisotopic masses (preferably to at least 4 digits) should be used, and the name of the modification should follow HUPO-PSI convention as much as possible. Consult Unimod to see if your modification is already defined in HUPO-PSI standards.

Semi-empirical Spectrum Generation

As of version 4.0, SpectraST can generate semi-empirical spectra from real spectra of closely related identifications, simply by shifting peaks on the m/z axis wherever appropriate. This is useful when spectra of one modification state are used to build the library, but one wishes to match spectra of the same sequence in another modification state during spectral searching.

To do so, the user needs to turn on the build action of semi-empirical spectrum generation by the option -cAM, and specify what modifications are desired by the -cx<str> option. The <str> in the latter is a string of allowable modification tokens (no space in between). A modification token starts with a one-letter amino acid (A through Z, plus n or c for the termini), followed by [<tag>] if it is not the unmodified form.

For example, the command

spectrast -cNfoo_heavy -cAM -cx'K[134]R[162]' foo.splib

will create a library foo_heavy.splib containing the same peptide ions as foo.splib, but with all lysines and arginines being heavy (both +6 Da). Note that since K and R are not included in the string following -cx, unmodified K or R is not allowed to be used. This is commonly referred to as a static modification.

The command

spectrast -cNfoo_metox -cAM -cx'MM[147]' foo.splib

will create a library foo_metox.splib containing the same peptide ions as foo.splib, but with all possible combinations of normal and oxidized methionine. This is commonly referred to as a variable modification. Note that the string 'MM[147]' means that both M (normal methionine) and M[147] (oxidized methionine) are allowed where an M is present in a sequence.

The command

spectrast -cNfoo_binary -cAM -cx'{K}{K[134]}' foo.splib

specifies a “binary” modification: lysines on the same peptide must be either all light or all heavy. The curly braces ({ }) delimit sets of allowable modification tokens; for the same peptide, only tokens from a single set can be used.

Note that while the modification tokens, such as M[147], contain only integer values (of the modified amino acids) as a tag, in calculations SpectraST will use the corresponding accurate mass stored for that particular modification type indicated by the token. Hence, for modifications currently not recognized by SpectraST, the user must define them (along with the accurate mass) in a text file using the -M option.
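
The three cases above (static, variable and binary) can be mimicked with a short Python sketch that enumerates the modified forms permitted by a -cx string. This only illustrates the token semantics described here and is not SpectraST's implementation. The function names are hypothetical, the input is assumed to be an unmodified sequence, residues with no token in a set are assumed to stay unchanged, and terminal tokens (n, c) are parsed but not applied.

```python
import itertools
import re
from typing import Dict, List

# A token is a residue letter (or n / c), optionally followed by [<tag>].
TOKEN_RE = re.compile(r"[A-Znc](?:\[[^\[\]]+\])?")


def parse_token_sets(spec: str) -> List[Dict[str, List[str]]]:
    """Split a -cx string into sets; each set maps residue -> allowed tokens."""
    # Without curly braces, the whole string forms a single set.
    groups = re.findall(r"\{([^{}]*)\}", spec) or [spec]
    sets = []
    for group in groups:
        allowed: Dict[str, List[str]] = {}
        for token in TOKEN_RE.findall(group):
            allowed.setdefault(token[0], []).append(token)
        sets.append(allowed)
    return sets


def expand(sequence: str, spec: str) -> List[str]:
    """Enumerate the modified forms of an unmodified sequence allowed by spec."""
    results: List[str] = []
    for allowed in parse_token_sets(spec):
        # Assumption: residues with no token in this set are left unchanged.
        choices = [allowed.get(aa, [aa]) for aa in sequence]
        for combo in itertools.product(*choices):
            form = "".join(combo)
            if form not in results:
                results.append(form)
    return results


if __name__ == "__main__":
    print(expand("AKR", "K[134]R[162]"))   # static: only the heavy form
    print(expand("MAM", "MM[147]"))        # variable: every combination
    print(expand("KAK", "{K}{K[134]}"))    # binary: all light or all heavy
```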

SpectraST File List Feature

SpectraST allows the user to list the files to be processed in a text file with extension .list. This can be useful when the number of files to be processed is very large, possibly overwhelming the UNIX command line. It is also an easy way to queue up multiple SpectraST tasks and to keep track of them. For example, if the file job.list contains the lines:

# This is a comment line ignored by SpectraST.
? -sLfoo.splib   # '?' signals the start of a new job; options for this job follow the '?'
1.mzXML
2.mzXML

? -sLbar.splib
3.mzXML
4.mzXML
Note: One can mix Search jobs and Create jobs in the same .list file. Command-line options will be overridden by those specified in the .list file with lines preceded by ‘?’.

Then running the command:

spectrast -sFspectrast.params job.list

is equivalent to running

spectrast -sFspectrast.params -sLfoo.splib 1.mzXML 2.mzXML

followed by

spectrast -sFspectrast.params -sLbar.splib 3.mzXML 4.mzXML
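
The equivalence above can be reproduced with a small Python sketch that reads a .list file and prints the commands it stands for. This is an illustration of the file layout only, not SpectraST's own reader. The function names are hypothetical. The sketch assumes that everything after a '#' is a comment (so file names cannot contain '#') and that files listed before any '?' line use the command-line options alone.

```python
from typing import List, Tuple

Job = Tuple[List[str], List[str]]  # (options, files)


def parse_list_file(text: str) -> List[Job]:
    """Split the contents of a .list file into jobs."""
    jobs: List[Job] = []
    for raw in text.splitlines():
        # Drop comments; assumes '#' never occurs inside a file name.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if line.startswith("?"):
            # A '?' line starts a new job and carries its options.
            jobs.append((line[1:].split(), []))
        elif jobs:
            jobs[-1][1].append(line)
        else:
            # Assumption: files before any '?' line form a job with no extra options.
            jobs.append(([], [line]))
    return jobs


def equivalent_commands(base_options: List[str], text: str) -> List[str]:
    """Build one spectrast command line per job."""
    return [
        " ".join(["spectrast", *base_options, *options, *files])
        for options, files in parse_list_file(text)
    ]


if __name__ == "__main__":
    with open("job.list") as handle:  # path is illustrative
        for command in equivalent_commands(["-sFspectrast.params"], handle.read()):
            print(command)
```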

SpectraST Options

Commonly used options are shown in bold. The rest are advanced options that should rarely need to be used.

Search Mode Options
Command-line Token | Name in Parameter File | Meaning | Remarks
GENERAL OPTIONS
-s | None | Specify search mode. | Not needed when any other search options are set.
-sF<file> | None | Read search options from <file>. | If <file> is omitted, “spectrast.params” is assumed.
-sL<file> | libraryFile | Specify library file. | Mandatory unless specified in parameter file. <file> must have .splib extension.
-sD<file> | databaseFile | Specify a sequence database file. This will not affect the search in any way, but this information will be included in the output for any downstream data processing. | <file> must have .fasta extension. If not set, SpectraST will try to determine this from the preamble of the library.
-sT<type> | databaseType | Specify the type of the sequence database file. | -sTAA (default) = protein database. -sTDNA = genomic database.
-sR | indexCacheAll | Cache all entries in RAM. Requires a lot of memory (the library will usually be loaded almost in its entirety), but speeds up search for unsorted queries. | Turn on with -sR, off with -sR!. Default is off.
-sS<file> | filterSelectedListFileName | Only search a subset of the query spectra in the search file. Only query spectra with names matching a line of <file> will be searched. | Default is off (search all queries).
CANDIDATE SELECTION AND SCORING OPTIONS
-sM<tol> | precursorMzTolerance | Specify precursor m/z tolerance in Th. Monoisotopic mass is assumed. | Default is 3.0 Th.
-sA | precursorMzUseAverage | Report average mass instead of monoisotopic mass in search results. Precursor m/z window is expanded to account for difference between average and monoisotopic mass. | Turn on with -sA, off with -sA!. Default is off.
-sz | searchAllCharges | Search all candidate library spectra regardless of precursor charge, ignoring the precursor charge specified in the query data. | Turn on with -sz, off with -sz!. Default is off.
-s_HOM<rank> | detectHomologs | Detect homologous lower hits up to <rank>. Looks for lower hits homologous to the top hit and adjusts delta accordingly. | Default is 4.
-s_FDL<frac> | fvalFractionDelta | Specify the fraction of the normalized delta score (delta/dot) in the F-value formula. | Default is 0.4.
-s_SP4 | useSp4Scoring | Use original SpectraST (4.0 or earlier) scoring, based on dot products of square-root intensities. | Turn on with -s_SP4, off with -s_SP4!. Default is off.
-s_FBI | fvalUseDotBias | Use dot bias to penalize high-scoring matches with massive noise and/or dominant peak. | Turn on with -s_FBI, off with -s_FBI!. Default is on. Only applicable for SP4 scoring.
-s_PVL | usePValue | Compute p-value by fitting score distribution of lower hits, and use it solely in F-value, which produces better behaved negative distribution. | Turn on with -s_PVL, off with -s_PVL!. Default is off. NOT applicable for SP4 scoring. Only tested in low-resolution CID data.
-s_OMT | useTierwiseOpenModSearch | Perform tier-wise open modification search for modifications within precursor m/z window specified with -sM. | Turn on with -s_OMT, off with -s_OMT!. Default is off. Note that the scoring is different from normal SpectraST searches.
-sC<type> (deprecated) | expectedCysteineMod | Specify the expected kind of cysteine modification. Those candidate library entries with a wrong kind of cysteine modification will be ignored. | -sCICAT_cl = cleavable ICAT. -sCICAT_uc = uncleavable ICAT. -sCCAM = Carbamidomethyl. Default is off (search all candidates).
-sc (deprecated) | ignoreSpectraWithUnmodCysteine | Ignore any candidate library entries with an unmodified cysteine. | Turn on with -sc, off with -sc!. Default is off.
-s_NO1 (deprecated) | ignoreChargeOneLibSpectra | Ignore all library entries with +1 charge state. | Turn on with -s_NO1, off with -s_NO1!. Default is off.
-s_NOS (deprecated) | ignoreAbnormalSpectra | Ignore all spectra which have non-Normal status. | Turn on with -s_NOS, off with -s_NOS!. Default is off.
OUTPUT AND DISPLAY OPTIONS
-sE<ext> | outputExtension | Output format. The search result will be written to a file with the same base name as the search file, with extension <ext>. | -sEtxt = Fixed-width text format. -sExls = Tab-delimited text format. -sExml (default) or -sEpepXML = .pepXML format.
-sO<path> | outputDirectory | Specify a directory to hold the search output files. | Default: Same directory as the corresponding search data (.mzML/.mzXML) file.
-s_FV1<thres> | hitListTopHitFvalThreshold | Minimum F value threshold for the top hit. Only top hits having F value greater than <thres> will be printed. | Default = 0.0 (all top hits will be displayed).
-s_FV2<thres> | hitListLowerHitsFvalThreshold | Minimum F value threshold for the lower hits. Only lower hits having F value greater than <thres> will be printed. | Default = 0.45.
-s_SHH | hitListShowHomologs | Always display homologous lower hits regardless of F value. | Turn on with -s_SHH (need -s_HOM on), off with -s_SHH!. Default is on.
-s_SHR<rank> | hitListShowMaxRank | Maximum rank for hits shown for each query, e.g. -s_SHR3 will show the top 3 hits. | Default is 1.
-s_SH1 | hitListOnlyTopHit | Only display the top hit for each query. | Turn on with -s_SH1, off with -s_SH1!. Default is on.
-s_SHM | hitListExcludeNoMatch | Do not display queries for which there is no candidate, or the top hit is below the minimum F value threshold specified with -sV. | Turn on with -s_SHM, off with -s_SHM!. Default is on.
-s_ENZ<enz> | enzymeForPepXMLOutput | Specify the proteolytic enzyme used, for the purpose of pepXML output. <enz> can be trypsin, lysc, etc. | This does not affect SpectraST searching. It only affects how the results are processed by downstream TPP tools.
-s_FIN<file> | printFingerprintingSummary | Print a text file of name <file> summarizing fingerprinting results. | Default is off.
SPECTRUM FILTERING OPTIONS
-s_XNP<thres> | filterMinPeakCount | Discard query spectra with fewer than <thres> peaks above threshold set with -s_CNT. | Default is 10.
-s_XMZ<m/z> | filterAllPeaksBelowMz | Discard query spectra with almost no peaks above a certain m/z value. All query spectra with 95%+ of the total intensity below <m/z> will be removed. | Default is 520.
-s_XIN<inten> | filterMaxIntensityBelow | Discard query spectra with no peaks with intensity above <inten>. | Default is 0.
-s_XMR<range> | filterMinMzRange | Discard query spectra with m/z range narrower than <range>. | Default is 350.
-s_CNT<thres> | filterCountPeakIntensityThreshold | Minimum peak intensity for peaks to be counted. Only peaks with intensity above <thres> will be counted to meet the requirement for minimum number of peaks. | Default is 2.01.
SPECTRUM PROCESSING OPTIONS
-s_RNT<thres> | filterRemovePeakIntensityThreshold | Noise peak threshold. All peaks with intensities below <thres> will be zeroed. | Default is 2.01.
-s_RNP<num> | filterMaxPeaksUsed | Remove all but the top <num> peaks in query spectra. | Default is 150.
-s_RDR<num> | filterMaxDynamicRange | Remove all peaks smaller than 1/<num> of the base (highest) peak in query spectra. | Default is 1000.
-s_MZS<mzpow>, -s_INS<intpow> | peakScalingMzPower, peakScalingIntensityPower | Intensity scaling power with respect to the m/z value and the raw intensity. The scaled intensity will be (m/z)^<mzpow> * (raw intensity)^<intpow>. | Default is <mzpow> = 0.0, <intpow> = 0.5. Only applicable in SP4 scoring.
-s_UAS<factor> | peakScalingUnassignedPeaks | Scaling factor for unassigned peaks in library spectra. Unassigned peaks in the library spectra will be scaled by <factor>. | Default is 1.0.
-s_NOB | peakNoBinning | Disable binning and instead perform peak-to-peak matching. | Turn on with -s_NOB, off with -s_NOB!. Default is off. NOT applicable in SP4 scoring, and not recommended for low-resolution data. Specify fragment m/z tolerance by -s_BIN<num>, tolerance = 1/<num>.
-s_BIN<num> | peakBinningNumBinsPerMzUnit | Number of bins per Th. | Default is 1.
-s_NEI<frac> | peakBinningFractionToNeighbor | Fraction of the scaled intensity assigned to neighboring bins. | Default is 0.5.
-s_LNP<num> | filterLibMaxPeaksUsed | Remove all but the top <num> peaks in the LIBRARY spectra. | Default is 50.
-s_RLI<thres> | filterLighIonsMzThreshold | Remove all light ions with m/z lower than <thres> Th for both library and query spectra. | Default is 180.
-s_ITQ | filterITRAQReporterPeaks | Remove iTRAQ reporter peaks in the range 112-122 Th. | Turn on with -s_ITQ, off with -s_ITQ!. Default is off.
-s_TMT | filterTMTReporterPeaks | Remove TMT reporter peaks in the range 126-132 Th. | Turn on with -s_TMT, off with -s_TMT!. Default is off.
-s_R51<thres> (deprecated) | filterRemoveHuge515Threshold | Remove dominant peak at 515.3 Th. All dominant peaks near 515.3 Th (with intensity greater than <thres> of the total intensity of the spectrum) will be zeroed. | Default is off. Dominant 515.3 Th peaks are a common impurity artifact in cleavable ICAT experiments.
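
To show how the peak scaling and binning options fit together, the Python sketch below scales each peak as (m/z)^mzpow * (raw intensity)^intpow, distributes it over bins, and compares two spectra by a normalized dot product. It is a simplified illustration loosely modeled on the SP4-style scoring described in the table, not SpectraST's code. The function names are hypothetical, and the way the neighbor fraction is applied (each adjacent bin receives that fraction of the scaled intensity) is an assumption. The defaults follow the table: -s_MZS 0.0, -s_INS 0.5, -s_BIN 1 and -s_NEI 0.5.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Peak = Tuple[float, float]  # (m/z, raw intensity)


def binned_vector(
    peaks: Iterable[Peak],
    mz_pow: float = 0.0,
    int_pow: float = 0.5,
    bins_per_th: int = 1,
    neighbor_frac: float = 0.5,
) -> Dict[int, float]:
    """Scale each peak, then spread it over its own bin and the two adjacent bins."""
    bins: Dict[int, float] = defaultdict(float)
    for mz, intensity in peaks:
        scaled = (mz ** mz_pow) * (intensity ** int_pow)
        index = int(mz * bins_per_th)
        bins[index] += scaled
        # Assumption: each adjacent bin receives neighbor_frac of the scaled intensity.
        bins[index - 1] += neighbor_frac * scaled
        bins[index + 1] += neighbor_frac * scaled
    return dict(bins)


def dot(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Normalized dot product (cosine similarity) of two binned vectors."""
    numerator = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return numerator / norm if norm else 0.0


if __name__ == "__main__":
    # Hypothetical peak lists, for illustration only.
    library = [(300.2, 100.0), (450.7, 400.0)]
    query = [(300.4, 90.0), (451.6, 350.0)]
    print(round(dot(binned_vector(library), binned_vector(query)), 3))
```

Spreading intensity to neighboring bins lets peaks that fall just across a bin boundary still contribute to the score.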


Create Mode Options
Command-line Token | Name in Parameter File | Meaning | Remarks
GENERAL OPTIONS (Applicable with any file input)
-c | None | Specify create mode. | Not needed when any other create options are set.
-cF<file> | None | Read create options from file <file>. | If <file> is omitted, "spectrast_create.params" is assumed.
-cN<name> | outputFileName | Specify output file name for .splib, .sptxt, .spidx and .pepidx files. | If not set, SpectraST will try to construct a sensible name.
-cT<file> | useProbTable | Use probability table in <file>. Only those peptide ions included in the table will be imported, and their probability adjusted optionally. | A probability table is a text file with one peptide ion in the format AC[160]DEFGHIK/2 per line. If a probability is supplied following the peptide ion separated by a tab, it will be used to replace the original probability of that library entry.
-cO<file> | useProteinList | Use protein list in <file>. Only those peptide ions associated with proteins in the list will be imported. | A protein list is a text file with one protein identifier per line. If a number X is supplied following the protein separated by a tab, then at most X peptide ions associated with that protein will be imported. Peptides with more replicates are favored.
-cM<format> | printMRMTable | Write all library spectra as SRM transition tables. Leave <format> blank for default tab-delimited table format. | Turn off with -cM!. Default is off.
-cm<remark> | remark | Remark. Add a Remark=<remark> comment to all library entries created. | Default is off.
-c_BIN | binaryFormat | Write library in binary format, which enables quicker search. | Turn on with -c_BIN, off with -c_BIN!. Default is on.
-c_DTA | writeDtaFiles | Write all library spectra as .dta files. | Turn on with -c_DTA, off with -c_DTA!. Default is off.
-c_MGF | writeMgfFiles | Write all library spectra as one .mgf file. | Default is off.
-c_RDY<prefix> | removeDecoyProteins | Remove spectra of decoys, for which proteins have names starting with <prefix>. Also remove decoy proteins from Protein field for peptides mapped to both target and decoy sequences. | Default is off.
-c_PLT<crit> (deprecated) | plotSpectra | Plot the library spectra as they are created. | -c_PLT or -c_PLTALL = Plot every spectrum. -c_PLT<crit> = Plot spectrum when either the Status or the Spec comment value = <crit>.
LIBRARY IMPORT OPTIONS (Applicable with .pep.xml, .tsv, .msp, .hlf, .ms3, .mz(X)ML file input)
-cP<prob> | minimumProbabilityToInclude | Include all spectra identified with probability no less than <prob> in the library. | Default is 0.9.
-cq<fdr> | maximumFDRToInclude | (Only for pepXML import) Include spectra with global FDR no greater than <fdr> in the library. | Default is 9999.0.
-cn<name> | datasetName | Specify a dataset identifier for the file to be imported. | If not set, SpectraST will construct it from the path and the name of the .pepXML file.
-cI<type> | setFragmentation | Set the fragmentation type of all spectra, overriding existing information. | -cIETD = tag all library spectra as ETD spectra. -cICID-QTOF = tag all library spectra as Q-TOF (high-resolution) CID spectra. -cIHCD = tag all library spectra as HCD (high-resolution) spectra. Default is off (determined from the data files).
-cg | setDeamidatedNXST | Set all asparagines (N) in the motif NX(S/T) as deamidated (N[115]), and all asparagines not in the motif NX(S/T) as unmodified. Use for glycocaptured peptides. | Turn on with -cg, off with -cg!. Default is off.
-co | addMzXMLFileToDatasetName | Add the originating mzXML file name to the dataset identifier. Good for keeping track of the MS run in which the peptide is observed. | Turn on with -co, off with -co!. Default is off.
-c_CEN | centroidPeaks | Centroid peaks as raw spectra are imported. | Designed mostly for Q-TOF spectra in profile mode.
-c_RNT<thres> | rawSpectraNoiseThreshold | Absolute noise filter. Remove noise peaks with intensity below <thres> in imported spectra. | Default is 0.0.
-c_RDR<range> | rawSpectraMaxDynamicRange | Relative noise filter. Filter out noise peaks with intensity below 1/<range> of that of the highest peak. | Default is 100000.0.
-c_NAA<num> | minimumNumAAToInclude | Exclude spectra of peptide IDs shorter than <num> amino acids. | Default is 6.
-c_NPK<num> | minimumNumPeaksToInclude | Exclude spectra with fewer than <num> peaks. | Default is 10.
-c_XAN | skipRawAnnotation | Skip the annotation of raw spectra as they are imported. | Annotation is quite slow and might be impractical if the number of imported spectra is enormous.
-c_DCN<num> | minimumDeltaCnToInclude | (Only for pepXML import) Exclude spectra with deltaCn smaller than <num>. Useful for excluding spectra with indiscriminate modification sites. | Turn on with -c_DCN, off with -c_DCN!. Default is 0.0.
-c_MDF<thres> | maximumMassDiffToInclude | (Only for pepXML import) Exclude spectra with precursor mass difference (absolute value) greater than <thres> Daltons. | Default is 9999.0.
-c_BRK | bracketSpectra | (Only for pepXML import) Bracket import: for each confident ID, also search neighboring scans for repeated scans to import. | Turn on with -c_BRK, off with -c_BRK!. Default is off.
-c_BRM | mergeBracket | (Only for pepXML import) Merge bracketed spectra: merge repeated scans of a bracket into one consensus spectrum for import. | Turn on with -c_BRM, off with -c_BRM!. Default is off.
LIBRARY MANIPULATION OPTIONS (Applicable with .splib file input)
-cf<pred> | filterCriteria | Filter library by criteria. Keep only those entries satisfying the predicate <pred>. | <pred> should be in quotes in the form “<attr> <op> <value>”. <attr> can refer to any of the fields and any comment entries. <op> can be ==, !=, <, >, <=, >=, =~ and !~. Multiple predicates can be separated by either "&" (AND logic) or "|" (OR logic), but not both. Default is off.
-cJ | combineAction | Combine action. | -cJU = Union (default). Include all the peptide ions in all the files. -cJI = Intersection. Only include peptide ions that are present in all the files. -cJS = Subtraction. Only include peptide ions in the first file that are not present in any of the other files. -cJH = Subtraction of homologs. Only include peptide ions in the first file that do not have any homologs with similar m/z in any of the other files. -cJA = Appending. Each peptide ion is added from only one library: the first one in the command line that contains that peptide ion.
-cA | buildAction | Build action. | -cAB = Best replicate. Pick the best replicate of each peptide ion. -cAC = Consensus. Create the consensus spectrum of all replicate spectra of each peptide ion. -cAQ = Quality filter. Apply quality filters to library. -cAD = Decoy. Generate decoy spectra. -cAN = Sort by descending number of replicates (tie-breaking by probability). -cAM = Semi-empirical. Generate semi-empirical spectra. -cAS = Clustering by spectral similarity. Default is no build action: all spectra will be included as is.
-cD<file> | refreshDatabase | Refresh protein mappings against the database <file> in FASTA format. | Default is off.
-cQ<num> | reduceSpectra | Produce reduced spectra of at most <num> peaks, based on rules prioritizing desirable SRM transitions. | Default is 0 (keep entire spectrum).
-cu | refreshDeleteUnmapped | Delete entries whose peptide sequences do not map to any protein during refreshing with the -cD option. | Default is off.
-cd | refreshDeleteMultimapped | Delete entries whose peptide sequences map to multiple proteins during refreshing with the -cD option. | Default is off.
-c_ANN | reannotatePeaks | Re-annotate peaks. | Turn on with -c_ANN, off with -c_ANN!. Default is off.
-c_NPK<num> | minimumNumPeaksToInclude | Exclude spectra with fewer than <num> peaks. | Default is 10.
-c_Q3L<thres> | minimumMRMQ3MZ | Specify the lower m/z limit for Q3 in SRM table generation. | Default is 200.
-c_Q3H<thres> | maximumMRMQ3MZ | Specify the upper m/z limit for Q3 in SRM table generation. | Default is 1400.
-c_RTO | refreshTrypticOnly | When refreshing database (-cD option), only map peptide to protein when the peptide is tryptic in that protein. | Default is off.
CONSENSUS/BEST-REPLICATE LIBRARY CREATION OPTIONS (Applicable with -cAC and -cAB options)
-cr<num> | minimumNumReplicates | Minimum number of replicates required for each library entry. Peptide ions with fewer than <num> replicates will be excluded from library when creating consensus library. | Default is 1.
-c_DIS | removeDissimilarReplicates | Remove dissimilar replicates before creating consensus spectrum. | Turn on with -c_DIS, off with -c_DIS!. Default is on.
-c_QUO<frac> | peakQuorum | Specify peak quorum: the fraction of all replicates required to contain a certain peak. Peaks not present in enough replicates will be deleted. | Default is 0.6.
-c_XPU<num> | maximumNumPeaksUsed | Maximum number of peaks in each replicate to be considered in creating consensus. Only the <num> most intense peaks will be considered. | Default is 300.
-c_XNR<num> | maximumNumReplicates | Maximum number of replicates used to build consensus spectrum. | Default is 100.
-c_XPK<num> | maximumNumPeaksKept | De-noise single spectra by keeping only the most intense <num> peaks. | Default is 150. Will not affect consensus spectra of more than one replicate.
-c_WGT<score> | replicateWeight | Select the type of score to weigh and rank the replicates. | -c_WGTS (default) = Use a measure of signal-to-noise ratio as the weight. -c_WGTX = Use a function of the SEQUEST xcorr score as the weight. -c_WGTP = Use a function of the PeptideProphet probability as the weight.
-c_RRS | recordRawSpectra | Record all raw spectra (in the format file.scan.scan) used to build the consensus in the Comment line. | Default is off.
QUALITY FILTER OPTIONS (Applicable with -cAQ option)
-cr<num> | minimumNumReplicates | Replicate quorum. Its value affects behavior of quality filter (see below). | Default is 1.
-cL<level>, -cl<level> | qualityLevelRemove, qualityLevelMark | Specify the stringency of the quality filter. -cL specifies the level for removal, -cl specifies the level for marking. | <level> = 0: No filter. <level> = 1: Remove/mark impure spectra. <level> = 2: Also remove/mark spectra with a spectrally similar counterpart in the library that is better. <level> = 3: Also remove/mark inquorate entries (defined with -cr) that share no peptide sub-sequences with any other entries in the library. <level> = 4: Also remove/mark all singleton entries. <level> = 5: Also remove/mark all inquorate entries (defined with -cr). Default is -cL2, -cl5.
-c_QP1 | qualityPenalizeSingletons | Apply stricter thresholds to singleton spectra during quality filters. | Turn on with -c_QP1, off with -c_QP1!. Default is on.
-c_QIP<thres> | qualityImmuneProbThreshold | Specify a probability above which library spectra are immune to quality filters. | Default is 0.999.
-c_QIE | qualityImmuneMultipleEngines | Make spectra identified by multiple sequence search engines immune to quality filters. | Turn on with -c_QIE, off with -c_QIE!. Default is on.
BAYESIAN DENOISER OPTIONS
-c_BDU | useBayesianDenoiser | Use Bayesian denoiser. Default parameters are used unless trained on the fly with -c_BDT option, or read from a file specified by -c_BDF option. | Default is off.
-c_BDT | trainBayesianDenoiser | Train Bayesian denoiser. Only active in consensus mode (-cAC option). | Default is off.
-c_BDP<thres> | denoiserMinimumSignalProb | Minimum signal probability to retain a peak when denoiser is used. | Default is 0.0.
-c_BDF<file> | denoiserParamFile | Specify parameter file for Bayesian denoiser, for both writing and reading. | Default is off (no writing or reading).
DECOY GENERATION OPTIONS (Applicable with -cAD option)
-cc | decoyConcatenate | Concatenate real and decoy libraries. | Default is off: library consisting of only decoy spectra is created.
-cy<num> | decoySizeRatio | Specify the (decoy / real) size ratio. | Default is 1. <num> must be an integer.
-c_DPS | decoyPrecursorSwap | Use a modified form of the precursor swap method for generating decoys. | Turn on with -c_DPS, off with -c_DPS!. Default is off.
RETENTION TIME NORMALIZATION OPTIONS (Applicable with .pep.xml file input)
-c_IRT<file> | normalizeRTWithLandmarks | Use landmark peptides in <file> to normalize retention times to iRTs. | Default is off. <file> should be a space-delimited table with two columns: peptide sequence and iRT.
-c_IRR | normalizeRTLinearRegression | Regress the real RTs of landmark peptides (i.e. assume they form a straight line). | Turn on with -c_IRR, off with -c_IRR!. Default is off.
UNIDENTIFIED LIBRARY/CLUSTERING OPTIONS
-c_UCR | unidentifiedClusterIndividualRun | Merge neighboring spectra in each run as they are imported from data (mz(X)ML) files. | Turn on with -c_UCR, off with -c_UCR!. Default is off.
-c_UCD<thres> | unidentifiedClusterMinimumDot | Specify minimum dot products for two spectra to be clustered. | Default is 0.7.
-c_UX1 | unidentifiedRemoveSinglyCharged | Remove spectra that appear to be from singly charged precursors. | Turn on with -c_UX1, off with -c_UX1!. Default is on.
-c_UNP<num> | unidentifiedMinimumNumPeaksToInclude | Remove spectra that have fewer than <num> peaks. | Default is 35.
-c_USX<thres> | unidentifiedSingletonXreaThreshold | Apply an Xrea (quality measure) filter to singleton spectra after clustering. Only those with Xrea at least <thres> are kept. | Default is 0.6.
SEMI-EMPIRICAL SPECTRUM GENERATION OPTIONS (Applicable with -cAM option)
-cx<string> | allowableModTokens | Specify the set(s) of modifications allowed in semi-empirical spectrum generation by -cAM option. | Default is off: no semi-empirical spectrum generated.
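
The predicate syntax accepted by -cf can be illustrated with a small Python sketch that evaluates such a predicate against a dictionary of entry attributes. It is not how SpectraST parses predicates. The function names, the attribute names in the example, the numeric-then-string comparison rule and the use of regular expression search for =~ and !~ are all assumptions made for illustration. Values that themselves contain operator characters, "&" or "|" are not handled.

```python
import re
from typing import Callable, Dict

# Two-character operators are listed first so that "<=" is not read as "<".
OPS = ["==", "!=", "<=", ">=", "=~", "!~", "<", ">"]


def _compare(left: str, op: str, right: str) -> bool:
    if op == "=~":
        return re.search(right, left) is not None
    if op == "!~":
        return re.search(right, left) is None
    try:
        a, b = float(left), float(right)  # numeric comparison when possible
    except ValueError:
        a, b = left, right                # otherwise compare as strings
    return {"==": a == b, "!=": a != b, "<": a < b,
            ">": a > b, "<=": a <= b, ">=": a >= b}[op]


def compile_filter(pred: str) -> Callable[[Dict[str, str]], bool]:
    """Turn '<attr> <op> <value>' clauses joined by & or | into a test function."""
    if "&" in pred and "|" in pred:
        raise ValueError("use either & or |, not both")
    joiner = any if "|" in pred else all
    clauses = []
    for part in re.split(r"[&|]", pred):
        for op in OPS:
            if op in part:
                attr, value = part.split(op, 1)
                clauses.append((attr.strip(), op, value.strip()))
                break
        else:
            raise ValueError(f"no operator in {part!r}")
    return lambda entry: joiner(
        attr in entry and _compare(str(entry[attr]), op, value)
        for attr, op, value in clauses
    )


if __name__ == "__main__":
    # Hypothetical entry attributes, for illustration only.
    entry = {"Prob": "0.95", "NumPeaks": "42", "Name": "AC[160]DEFGHIK/2"}
    print(compile_filter("Prob >= 0.9 & NumPeaks > 30")(entry))
```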


Miscellaneous Options
Command-line Token | Name in Parameter File | Meaning | Remarks
-V | None | Verbose mode. More information displayed to console. | Default is off.
-Q | None | Quiet mode. | Default is off.
-L<file> | None | Specify name of log file. | Default is spectrast.log.
-M<file> | None | Activate user-defined modifications listed in <file>. | Default is off. If <file> is omitted, spectrast.usermods is assumed.

Other SpectraST Utilities

Plotspectrast

Plotspectrast is a spectrum viewer designed for SpectraST. It comes as two programs: a CGI that can be launched from a web page (e.g., from PepXMLViewer), and a stand-alone application. They are included in the TPP and no additional installation is necessary.

The most common use of Plotspectrast is for visualization of spectral matches from PepXMLViewer. When displaying SpectraST results, PepXMLViewer provides a link to invoke plotspectrast.cgi for each spectrum query. The query spectrum will be plotted as a "mirror image" of the best-matched library spectrum, enabling the user to quickly assess the quality of the match. Below the plot there is an ion table, and tables listing information about the library spectrum. The legend of the plots and ion table is as follows:

  • Library spectrum
    • Peak color: Red = Selected annotated peaks; Blue = Unannotated peaks
    • Label color: Red = Selected annotated peaks that have matched peaks in the query spectrum; Black = Unmatched peaks
  • Query spectrum
    • Peak color: Red = Peaks that match selected annotated peaks in the library spectrum; Black = Unmatched peaks
  • Ion table
    • Cell color: Red = Ions present in both spectra; Pink = Ions present in the library spectrum only; White = Ions present in neither spectrum

Various controls are available to the left of the plot to customize how the spectra are displayed:

  • X-Range: The range of X axis (the m/z values) displayed
  • MatchTol: The m/z tolerance within which a peak is considered matched between the library and query spectra. This affects the labeling and coloring of the peaks.
  • Y-Zoom: Zooming factor in the Y axis (the peak intensity).
  • BlankPrecRegion: Blank the region around the precursor m/z. (Note: in SpectraST searching, peaks in this region are ignored.)
  • Annotation Options
    • LabelType: Toggling between displaying the ion type, the m/z value, or no label for selected annotated peaks
    • NumPeaks: The number of peaks considered for labeling, from the highest peak down
    • MinInten: The minimum intensity for a peak to be labeled
    • Ions a, b, y (+1, +2, +3): Whether or not to label that particular type of ion of that charge state
    • -H2O/-NH3/-P: Whether or not to label water/ammonia/phosphate neutral loss peaks of fragment ions
    • Prec losses: Whether or not to label neutral loss peaks of the precursor
    • All: Whether or not to label all annotated peaks
    • ColorAll: Whether to color all the annotated peaks regardless of label selection

The stand-alone plotspectrast application produces a static .png image in the same directory as the query spectrum file. It has the following usage:

plotspectrast <.splib file> <library file offset> <.mzXML file> <query scan number>

Plots the library spectrum at <library file offset> and the query spectrum of <query scan number> in the .mzXML file. The desired value of <library file offset> can be extracted from the .spidx, .pepidx or .sptxt file (BinaryFileOffset in the Comment field).

plotspectrast <.splib file> <library file offset> <.dta file>

Similar to above, except the query spectrum is in a .dta file.

plotspectrast <.splib file> <library file offset> <.none file>

Plots the library spectrum by itself. It will not actually look for the .none file, but the resulting .jpg file will be named with the same prefix as the .none file and placed in the same directory.

plotspectrast <.msp file of library spectrum> <.msp, .dta, or .none file>

Similar to above, except the library spectrum is given in a .msp file.

plotspectrast <.splib file 1> <library file offset 1> <.splib file 2> <library file offset 2>

Plots two library spectra head to tail.
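
The <library file offset> argument used in the commands above can be looked up from a .sptxt file. The Python sketch below shows one way to collect it. It assumes, without having verified the exact layout, that each entry begins with a "Name:" line and has a "Comment:" line containing a BinaryFileOffset=<number> item. The function name library_offsets is hypothetical.

```python
import re
from typing import List, Tuple


def library_offsets(sptxt_text: str) -> List[Tuple[str, int]]:
    """Return (entry name, BinaryFileOffset) pairs found in .sptxt text."""
    offsets = []
    name = None
    for line in sptxt_text.splitlines():
        if line.startswith("Name:"):
            # Assumption: each library entry starts with a "Name:" line.
            name = line[len("Name:"):].strip()
        elif line.startswith("Comment:") and name is not None:
            match = re.search(r"\bBinaryFileOffset=(\d+)", line)
            if match:
                offsets.append((name, int(match.group(1))))
    return offsets


if __name__ == "__main__":
    with open("foo.sptxt") as handle:  # path is illustrative
        for entry_name, offset in library_offsets(handle.read()):
            print(entry_name, offset)
```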

Lib2HTML

Lib2HTML is an application that converts a SpectraST library into an HTML file for viewing. It is included in the TPP and no additional installation is necessary. In the resulting HTML file, replicates of the same peptide ion will be listed on one row, and links are provided to each replicate to view the spectrum using Plotspectrast. The usage is:

Lib2HTML <options> <full path from webserver root to .splib file>

Options include:

  • -V : Verbose. Displays more information for each entry.
  • -N<num> : Specify the maximum number of replicates displayed for each unique peptide ion. Default is 10.
  • -P<path> : Specify the full path from the webserver root to the plotspectrast.cgi binary.

Developer's Guide

The SpectraST source code contains detailed documentation.

sptxt file format:

The sptxt file format is very closely related to the msp format, whose documentation can be found here.

Annotation syntax:

SpectraST's syntax to annotate a fragment follows the scheme proposed by Roepstorff and Fohlman.

An annotation tag starts with the assigned ion type (a, b, c, x, y or z), followed by the number of amino acid residues present in the fragment. This number may be followed by a signed integer indicating a modification. Note that, besides post-translational modifications, neutral losses such as loss of water (-18) and loss of ammonia (-17) are also taken into account. The caret symbol '^' followed by an integer gives the charge state of the fragment; its absence indicates a singly charged fragment ion. An additional 'i' at the end of the annotation tag means that the mass value does not correspond to the expected mass of the monoisotopic peak, but can be assigned to a different isotopic peak of the fragment. Finally, the annotation contains the average mass deviation (in Da) from the theoretically expected mass, preceded by a slash '/'.

The list of possible annotations is ordered by ascending charge states, where ties are broken by ascending mass deviations.

Annotation tags can be enclosed in square brackets, indicating that several peaks could be assigned to the same ion. Usually, SpectraST resolves such a situation by annotating only one of the peaks and leaving the others blank. If the data are not (sufficiently) centroided, this strategy might lead to a bunch of unannotated peaks, which might in turn cause quality filters to fail. To circumvent this problem, additional intense peaks that appear to be the same ion are given a bracketed annotation.

Besides annotations following the Roepstorff/Fohlman notation, SpectraST also assigns immonium ions. The corresponding tag consists of 3 capital letters: an 'I' (for immonium), followed by the amino acid and an additional letter that distinguishes different residue-specific ions from that amino acid.
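The tag grammar described above can be sketched as a small Python parser. This is an illustration built only from the description in this section, not SpectraST's own code. It assumes that the elements appear in exactly the order described, that the '/' mass deviation may be missing, and that square brackets enclose the whole tag. The names FragmentAnnotation, parse_annotation and parse_immonium are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple


class FragmentAnnotation(NamedTuple):
    ion_type: str              # one of a, b, c, x, y, z
    position: int              # number of residues in the fragment
    mass_shift: int            # signed integer, e.g. -18 for water loss; 0 if absent
    charge: int                # 1 when no '^' is present
    isotopic: bool             # True when the tag carries a trailing 'i'
    mass_dev: Optional[float]  # deviation in Da after '/', None if absent
    bracketed: bool            # True when the tag was enclosed in [ ]


# Assumed element order: type, position, shift, ^charge, i, /deviation.
_TAG = re.compile(
    r"^(?P<ion>[abcxyz])(?P<pos>\d+)"
    r"(?P<shift>[+-]\d+)?"
    r"(?:\^(?P<charge>\d+))?"
    r"(?P<iso>i)?"
    r"(?:/(?P<dev>[+-]?\d*\.?\d+))?$"
)

# Immonium tags: 'I', then the amino acid letter, then one distinguishing letter.
_IMMONIUM = re.compile(r"^I(?P<residue>[A-Z])(?P<variant>[A-Z])$")


def parse_annotation(tag: str) -> FragmentAnnotation:
    """Parse one backbone-fragment annotation tag; raise ValueError if it does not fit."""
    text = tag.strip()
    bracketed = text.startswith("[") and text.endswith("]")
    if bracketed:
        text = text[1:-1]
    match = _TAG.match(text)
    if match is None:
        raise ValueError(f"not a recognised fragment annotation: {tag!r}")
    return FragmentAnnotation(
        ion_type=match.group("ion"),
        position=int(match.group("pos")),
        mass_shift=int(match.group("shift") or 0),
        charge=int(match.group("charge") or 1),
        isotopic=match.group("iso") is not None,
        mass_dev=float(match.group("dev")) if match.group("dev") else None,
        bracketed=bracketed,
    )


def parse_immonium(tag: str) -> Optional[Tuple[str, str]]:
    """Return (amino acid, variant letter) for an immonium tag, else None."""
    match = _IMMONIUM.match(tag.strip())
    return (match.group("residue"), match.group("variant")) if match else None
```

Under these assumptions, a tag such as y5-18^2i/0.02 would be read as a doubly charged y5 ion with a water loss, assigned to a non-monoisotopic peak, 0.02 Da from the expected mass.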

More tips to developers who want to modify SpectraST will be available shortly.

Where to Get Help

The SPC Tools Discussion Group: spctools-discuss.googlegroups.com

The SPC Tools Announcement Group: spctools-announce.googlegroups.com

Public spectral libraries are available for download at PeptideAtlas

External Links

Reference

  • Keller, Andrew, et al. (2005) "A uniform proteomics MS/MS analysis platform utilizing open XML file formats". Molecular Systems Biology 1, 17. Full text
  • Lam, Henry, et al. (2007). "Development and validation of a spectral library searching method for peptide identification from MS/MS". Proteomics 7 (5), 655-667. Abstract
  • Craig, Robertson, et al. (2006). "Using annotated peptide mass spectrum libraries for protein identification". Journal of Proteome Research 5 (8), 1843-1849. Abstract
  • Frewen, Barbara, et al. (2006). "Analysis of peptide MS/MS spectra from large-scale proteomics experiments using spectrum libraries". Analytical Chemistry 78 (16), 5678-5684. Abstract
  • Lam, Henry, et al. (2008). "Building consensus spectral libraries for peptide identifications in proteomics". Nature Methods 5, 873-875. FullText
  • Picotti, Paola, et al. (2008). "A database of validated assays for the targeted mass spectrometric analysis of the S. cerevisiae proteome". Nature Methods 5, 913-914. FullText
  • Sherwood, Carly, et al. (2009). "MaRiMba: A software application for spectral library-based MRM transition list assembly". Journal of Proteome Research 8, 4396-4405. Abstract
  • Lam, Henry, et al. (2010). "Artificial decoy spectral libraries for false discovery rate estimation in spectral library searching in proteomics". Journal of Proteome Research 9, 605-610. Abstract