Expert search and TPP usage
From SPCTools
Revision as of 01:34, 11 June 2009 by Tfarrah (→Experimental Generic Search Parameters File)
Current revision by Tfarrah
This is a guide to searching and running the TPP on MS/MS data from within ISB, using a Unix command line interface.
Preparation
Know the following about your data:
- which amino acid modifications are ubiquitous, and what are their weights?
- which amino acid modifications are present only some of the time?
- how complete was the digest?
- was the data generated by a high mass accuracy instrument?
Data storage
If data is to ultimately be stored in SBEAMS, it should be placed in the following directory structure (Feb. 2009 note -- this disk is nearly full and new data should not be moved there -- still true July 2009):
/regis/sbeams/archive/<investigator>/<project>/<experiment_tag>/<search_descriptor>
For example:
/regis/sbeams/archive/youngah/HsUrine/HsNormFemUrine_163A/XTK_Hs3.38
<investigator> should be the name of the person who generated the data, in the format used in SBEAMS. If that investigator is not registered in SBEAMS, they should be registered.
<search_descriptor> should be a three-letter abbreviation for the search engine, underscore, and a brief descriptor of the database searched.
Raw data and mzXML files should be stored in the <experiment_tag> directory. mzXML files should be symlinked (use ln -s) into each <search_descriptor> directory. Search results and TPP results should be stored in the <search_descriptor> directories.
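The directory convention above can be sketched in a few lines of Python. This is only an illustration, not part of the SBEAMS tooling: the names `search_dir` and `is_valid_search_descriptor` are made up here, and the regular expression encodes just one reading of the rule (three letters, underscore, database descriptor).

```python
import re
from pathlib import PurePosixPath

ARCHIVE_ROOT = PurePosixPath("/regis/sbeams/archive")

# Three letters for the search engine, an underscore, then a brief database
# descriptor. How strictly the convention is meant is an assumption here.
_DESCRIPTOR_RE = re.compile(r"^[A-Za-z]{3}_\S+$")


def is_valid_search_descriptor(descriptor):
    """Return True if the descriptor looks like e.g. XTK_Hs3.38."""
    return bool(_DESCRIPTOR_RE.match(descriptor))


def search_dir(investigator, project, experiment_tag, search_descriptor):
    """Build <root>/<investigator>/<project>/<experiment_tag>/<search_descriptor>."""
    if not is_valid_search_descriptor(search_descriptor):
        raise ValueError(f"unexpected search descriptor: {search_descriptor!r}")
    return ARCHIVE_ROOT / investigator / project / experiment_tag / search_descriptor


print(search_dir("youngah", "HsUrine", "HsNormFemUrine_163A", "XTK_Hs3.38"))
```

The function only builds the path string; it does not create anything on disk.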
Moving XML files
Paths are hardcoded within pepXML and protXML files. If you move these files, you must run the following script in order for the files to work properly with the TPP:
/sbeams/bin/updateAllPaths.pl *.xml *.xls *.shtml
Search and TPP
X!Tandem-K
We use a modification of the publicly-available X!Tandem search engine called X!Tandem-K. It uses a significantly different scoring algorithm, K-score.
# see above for example <search_descriptor>
setenv SEARCHDIR <search_descriptor>
mkdir $SEARCHDIR
# link mzXML files into the search directory (like creating shortcuts)
foreach file ( *.mzXML )
ln -s ../$file $SEARCHDIR/$file
end
cd $SEARCHDIR
cp /regis/sbeams/bin/tandem/tandem.params-cam tandem.params
# edit tandem.params to include the modifications, target DB, and refinement options appropriate for your data
vim tandem.params
echo " -OdA -dDECOY_ -E<experiment_tag>" > xinteract.params
/sbeams/bin/tandem/runtandemsearch *.mzXML
The TPP is automatically run on the search results based on whatever is in xinteract.params and the results, including Mayu and calctppstat output, are stored in a file named zzpostprocessing.log. This file will be emailed to you when the analysis is complete.
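For anyone scripting the setup rather than typing it, the linking step (mkdir plus the foreach/ln -s loop) might look like the following in Python. This is a sketch under assumptions: `link_mzxml_files` is a hypothetical helper, and it covers only directory creation and symlinking, not the parameter files or the call to runtandemsearch.

```python
import os
from pathlib import Path


def link_mzxml_files(experiment_dir, search_descriptor):
    """Create <experiment_dir>/<search_descriptor> and symlink each *.mzXML into it."""
    experiment_dir = Path(experiment_dir)
    search_dir = experiment_dir / search_descriptor
    search_dir.mkdir(exist_ok=True)
    links = []
    for mzxml in sorted(experiment_dir.glob("*.mzXML")):
        link = search_dir / mzxml.name
        if not (link.is_symlink() or link.exists()):
            # Relative target, equivalent to: ln -s ../$file $SEARCHDIR/$file
            os.symlink(os.path.join("..", mzxml.name), link)
        links.append(link)
    return links
```

The relative `../` target keeps the links valid if the whole experiment directory is moved, as long as the search directory stays directly beneath it.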
X!Tandem parameters
See [1] for a description of X!Tandem parameters.
A major choice you must make is whether to do a one-pass search with generous criteria (allowing semi-tryptic matches, modifications, and missed cleavages), or whether to do a two-pass search, the first pass with stricter criteria, and the second pass with generous criteria but only searching those proteins that were matched in the first pass. The two-pass method is called "refine" mode. It is much faster, but violates some of the assumptions made in the TPP and therefore may give slightly less accurate TPP results.
Example parameters for a one-pass search allowing semi-tryptic cleavage:
<note type="input" label="protein, cleavage semi">yes</note>
<note type="input" label="refine">no</note>
Example parameters for a search using refine mode, allowing semi-tryptic cleavage only in the second pass:
<note type="input" label="refine">yes</note>
<note type="input" label="refine, cleavage semi">yes</note>
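To double-check which mode a given tandem.params will run in, the note elements can be read back with a short Python script. This is a rough sketch: the function names are hypothetical, only the three labels shown above are examined, and treating a missing label as "no" is an assumption of the sketch rather than documented X!Tandem behavior. The example wraps the notes in a `<bioml>` root element, as X!Tandem input files do.

```python
import xml.etree.ElementTree as ET


def read_input_notes(params_xml):
    """Return {label: value} for each <note type="input" label="..."> element."""
    root = ET.fromstring(params_xml)
    notes = {}
    for note in root.iter("note"):
        if note.get("type") == "input" and note.get("label"):
            notes[note.get("label")] = (note.text or "").strip()
    return notes


def describe_search_mode(notes):
    """Summarize the refine / semi-tryptic choice from the parsed notes."""
    def is_yes(label):
        # Assumption: a label that is absent is treated as "no".
        return notes.get(label, "no").lower() == "yes"

    if is_yes("refine"):
        if is_yes("refine, cleavage semi"):
            return "two-pass (refine), semi-tryptic allowed in second pass"
        return "two-pass (refine)"
    if is_yes("protein, cleavage semi"):
        return "one-pass, semi-tryptic"
    return "one-pass"


EXAMPLE = """<bioml>
  <note type="input" label="refine">yes</note>
  <note type="input" label="refine, cleavage semi">yes</note>
</bioml>"""

print(describe_search_mode(read_input_notes(EXAMPLE)))
```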
xinteract (TPP) parameters for X!Tandem searches
For a listing of all xinteract parameters, type xinteract | more.
Here is a sample xinteract.params file for an X!Tandem search using decoys on high mass accuracy data; it works with and without refinement:
-OdA -dDECOY_ -EYoungAhFem1912
Key:
- -O (letter, not digit) introduces options for PeptideProphet:
  - d reports decoy hits with a computed probability based on the model learned
  - A says that you have high mass accuracy data that was searched using monoisotopic masses
- -dDECOY_ uses decoy hits to pin down the negative distribution; DECOY_ is the decoy identifier prefix
- -E is the experiment tag: substitute in a tag that describes your experiment
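The flags in the key can be assembled programmatically. The sketch below is illustrative only: `build_xinteract_params` and its argument names are invented here, and it covers just the flags mentioned on this page (-O with d and A, -d, -E, and the -I charge-omission flag described under "calctppstat output").

```python
def build_xinteract_params(experiment_tag, decoy_prefix="DECOY_",
                           report_decoys=True, accurate_mass=True,
                           omit_charges=()):
    """Return the text for xinteract.params using only flags described on this page."""
    pp_options = ""
    if report_decoys:
        pp_options += "d"   # report decoy hits with a computed probability
    if accurate_mass:
        pp_options += "A"   # high mass accuracy data, monoisotopic masses
    parts = []
    if pp_options:
        parts.append("-O" + pp_options)
    parts.append("-d" + decoy_prefix)      # decoy identifier prefix
    parts.append("-E" + experiment_tag)    # experiment tag
    for charge in omit_charges:
        parts.append(f"-I{charge}")        # omit spectra of this charge state
    return " ".join(parts)


print(build_xinteract_params("YoungAhFem1912"))
```

Called with only the experiment tag, this reproduces the sample line above; passing `omit_charges=[1]` appends -I1.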
Checking your results
Here are some notes off the top of my (Terry's) head. They reflect my incomplete understanding of the TPP software.
TPP output
- See if the PeptideProphet modeling failed for any of the charge states. If it failed for charge states 2 or 3, your data is probably quite poor. Modeling will also fail for charge states for which there are no spectra; for non-ETD data these are usually +6 and +7, and you need not worry about them. But if modeling fails for charge states for which spectra do exist, you should re-run the TPP and tell it to ignore those charge states (see below). Warnings look like this:
WARNING: Mixture model quality test failed for charge (2+).
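A quick way to collect these warnings is to scan the log text for the message shown above. This is a minimal sketch: `failed_charge_states` is a hypothetical helper, and it assumes the warning text matches the example exactly. Whether the warnings end up in zzpostprocessing.log or elsewhere is not stated here, so the function takes the log text as a plain string.

```python
import re

_FAILED_RE = re.compile(
    r"WARNING: Mixture model quality test failed for charge \((\d+)\+\)"
)


def failed_charge_states(log_text):
    """Return the sorted charge states for which the mixture model test failed."""
    return sorted({int(m.group(1)) for m in _FAILED_RE.finditer(log_text)})


sample = (
    "some other output\n"
    "WARNING: Mixture model quality test failed for charge (2+).\n"
    "WARNING: Mixture model quality test failed for charge (6+).\n"
)
print(failed_charge_states(sample))
```

A failure reported for charge 2 or 3 deserves attention; failures for charge states with no spectra can be ignored, as noted above.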
calctppstat output
The following script (run automatically by runtandemsearch and the like) gives a lot of useful info about how your searches and TPP went:
/regis/sbeams/bin/calctppstat.pl --input interact-ipro.pep.xml --FDRthresh --full --write
The first line looks like this:
PepP 33949/147523 0.230 | ProP 122 ( 1543, 1526) /regis/sbeams2/archive/pmallick/Plasma/BOB2H_ISB/SPC_HsNIST2.0
Here is a key:
PeptideProphet <spectra-identified>/<spectra-searched> <fraction> | ProteinProphet <proteins> (<distinct-peptides>, <??>) <experiment-location>
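If you want to collect this first line across many experiments, it can be parsed with a regular expression. The sketch below is based solely on the single example line shown above, so the real output may vary. The field names are my own; reading the first count as identified spectra and the second as searched spectra is an inference from the fraction (33949/147523 is about 0.230), and the field marked <??> is kept as `unknown_count`.

```python
import re
from typing import NamedTuple


class TppStatSummary(NamedTuple):
    spectra_identified: int
    spectra_searched: int
    fraction: float
    proteins: int
    distinct_peptides: int
    unknown_count: int   # marked <??> in the key
    location: str


_SUMMARY_RE = re.compile(
    r"PepP\s+(\d+)/(\d+)\s+([\d.]+)\s*\|\s*ProP\s+(\d+)\s*"
    r"\(\s*(\d+)\s*,\s*(\d+)\s*\)\s+(\S+)"
)


def parse_summary_line(line):
    """Parse the first line of calctppstat output into named fields."""
    m = _SUMMARY_RE.search(line)
    if m is None:
        raise ValueError("line does not look like a calctppstat summary")
    ident, searched, frac, prot, pep, unknown, loc = m.groups()
    return TppStatSummary(int(ident), int(searched), float(frac),
                          int(prot), int(pep), int(unknown), loc)


line = ("PepP 33949/147523 0.230 | ProP 122 ( 1543, 1526) "
        "/regis/sbeams2/archive/pmallick/Plasma/BOB2H_ISB/SPC_HsNIST2.0")
print(parse_summary_line(line))
```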
Check for the following:
- Check the line that says "At PepPro FDR 0.010 FDR based on decoys=0.0051". The FDR based on decoys should differ from the PepPro FDR by less than an order of magnitude. Otherwise, PeptideProphet is not modeling the data well and you should consider throwing out problematic charge states and/or experiments, or altering the TPP parameters.
- Check decoy rates for different charge states. If the decoy rate is greater than about 0.02, consider omitting spectra of that charge state by adding a parameter to xinteract. To omit spectra of charge +1, for example, add -I1 to xinteract.params.
- Consider throwing out experiments with very small numbers of identified peptides.
- The fraction of semi-tryptics should be less than 0.1.
- The fraction of missed cleavages should be less than 0.5.
- If the data came from an accurate mass instrument, the histogram at the bottom should be binned at 1 dalton intervals.
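The numeric rules of thumb in this checklist can be written down as a small Python function. This is a sketch under assumptions: `check_tpp_stats` and its arguments are hypothetical, and the values are expected to be read off the calctppstat output by hand, since this page does not document the full output format.

```python
import math


def check_tpp_stats(model_fdr, decoy_fdr, decoy_rate_by_charge,
                    frac_semi_tryptic, frac_missed_cleavage):
    """Return a list of warning strings; an empty list means all checks passed."""
    warnings = []
    # Decoy-based FDR should be within an order of magnitude of the model FDR.
    if model_fdr > 0 and decoy_fdr > 0:
        if abs(math.log10(decoy_fdr / model_fdr)) > 1.0:
            warnings.append(
                f"decoy FDR {decoy_fdr} differs from PepPro FDR {model_fdr} "
                "by more than an order of magnitude"
            )
    # Per-charge decoy rate above about 0.02: consider omitting that charge.
    for charge, rate in sorted(decoy_rate_by_charge.items()):
        if rate > 0.02:
            warnings.append(
                f"charge {charge}+: decoy rate {rate} > 0.02; "
                f"consider adding -I{charge} to xinteract.params"
            )
    if frac_semi_tryptic >= 0.1:
        warnings.append(f"semi-tryptic fraction {frac_semi_tryptic} is not below 0.1")
    if frac_missed_cleavage >= 0.5:
        warnings.append(f"missed-cleavage fraction {frac_missed_cleavage} is not below 0.5")
    return warnings


# Values from the example line quoted in the checklist; the rest are made up.
print(check_tpp_stats(0.010, 0.0051, {2: 0.004, 3: 0.006}, 0.05, 0.2))
```

The thresholds are the approximate ones from the checklist, so treat any warning as a prompt to look at the data, not as a verdict.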
Experimental Generic Search Parameters File
In Fall 2008, Abhishek Pratap wrote a program that can take a generic search parameters file and generate parameters files for X!Tandem, SEQUEST, and Myrimatch. Here is how to use it:
Go to the directory for your experiment and set up a generic search parameters file:
cd /regis/sbeams/archive/youngah/HsUrine/HsUrineNormFem_163A (for example)
# copy a generic search parameter file and edit to suit your data
cp /sbeams/bin/params/search.params .
vim search.params
Go to the directory where you will do your searches and generate a search-specific parameters file:
cd XTK_Hs3.38 (for example)
/sbeams/bin/params/createEngineSpecificParams.pl --config_file ../search.params --output Tandem
In place of Tandem you can type Sequest or Myrimatch.
Finally, you may edit the resulting parameters file if you wish:
vim tandem.params
Eric's instructions for all available searches, March 2009
Hi everyone, I have not advertised this extensively and it still needs work, but I encourage use of the following for testing. The intent is that you can search any set of mzXMLs with any of our supported/beta supported search engines as follows:
X!Tandem:
cd searchSubDir
cp -p /sbeams/bin/tandem/tandem.params-cam tandem.params
vi tandem.params
echo "" > xinteract.params
/sbeams/bin/tandem/runtandemsearch *.mzXML
OMSSA:
cd searchSubDir
cp -p /sbeams/bin/omssa/omssa.params .
vi omssa.params
echo "-OPd -dDECOY -eT" > xinteract.params
setenv TESTDEVPATH /tools/bin/TPP/tpp-dev/bin
/sbeams/bin/tandem/runomssasearch *.mzXML
Note:
1. OMSSA does not read plain FASTA files; the FASTA file must first be formatted with something like: /package/genome/bin/formatdb -i YeastCombNR_20070207_ForwDecoy.fasta -p T -o T -l YeastCombNR_20070207_ForwDecoy.log
2. In OMSSA search results, each protein is given an ID number, such as protein="18037". This number needs to be replaced with the real protein name, such as protein="YLR129W", when running InteractParser. Use the -P option for this purpose.
Myrimatch:
cd searchSubDir
cp -p /sbeams/bin/myrimatch/myrimatch.params .
vi myrimatch.params
echo "-OPd -dDECOY -eT" > xinteract.params
setenv TESTDEVPATH /tools/bin/TPP/tpp-dev/bin
/sbeams/bin/myrimatch/runmyrimatchsearch *.mzXML
InsPect:
cd searchSubDir
cp -p /sbeams/bin/inspect/inspect.params .
vi inspect.params
echo "-OPd -dDECOY -eT" > xinteract.params
setenv TESTDEVPATH /tools/bin/TPP/tpp-dev/bin
/sbeams/bin/inspect/runinspectsearch *.mzXML
SpectraST:
cd searchSubDir
cp -p /sbeams/bin/spectrast/spectrast.params .
vi spectrast.params
echo "-OPd -dDECOY" > xinteract.params
setenv TESTDEVPATH /tools/bin/TPP/tpp-dev/bin
/sbeams/bin/spectrast/runspectrastsearch *.mzXML
CProbID:
cd searchSubDir
cp -p /sbeams/bin/cprobid/cprobid.params .
vi cprobid.params
echo "-OPd -dDECOY -eT" > xinteract.params
setenv TESTDEVPATH /tools/bin/TPP/tpp-dev/bin
/sbeams/bin/cprobid/runcprobidsearch *.mzXML
Do they all work? Yes, mostly, fragilely. You can set a TESTDEVPATH to use a particular version of the TPP. In the absence of that envvar, it defaults to production or dshteynb-bin in a few cases of the truly speculative ones.
You should be able to examine my directory of tests that demonstrate/test the functionality of each as follows:
cd /regis/sbeams/tests
ls -al
more =tests.notes
You’ll find the test data and some example params files in:
/regis/sbeams/tests/referenceData/
(the haloICAT one is probably the one with the most complete examples).
In each case, if the search is successful but the TPP part (which is triggered automatically using the specified xinteract.params flags) fails, you can rerun just the TPP part with:
cd searchSubDir
vi xinteract.params
setenv TESTDEVPATH /tools/bin/TPP/tpp-dev/bin
/sbeams/bin/omssa/postProcessSearch --nowait *.mzXML
(the --nowait is currently necessary, although that should be fixed)
This is all very much a work in progress (although not much progress recently). The vision is that any of the supported search engines could be easily invoked in this manner. Note that this system submits all searches to the “tandem cluster”.
It is all ripe for some more professional/organized setup to make this more widely usable (which I encourage someone to do!). But I encourage all of you to search using this mechanism rather than some other. If it doesn’t work for you, please help refine it so that it does. Please email this list with questions, suggestions, fixes, etc. This is not for the most part yet stable enough for users outside this group.
Thanks!
Eric
Raw notes from Eric, November 2008
If you want to rerun the TPP on a dataset, do:
/bin/rm interact*
/bin/rm zztandem*
setenv TESTDEVPATH /tools/bin/TPP/tpp-dshteynb/bin
/sbeams/bin/tandem/finishtandemsearch --nowait *.mzXML >& zztandempostprocessing.log
(the --nowait is required so that the finisher doesn’t wait around for the .done files)
Note that runtandemsearch and finishtandemsearch will also run on *.mgf or *.mzData or *.pkl
SpectraST:
- To process 1 or more mzXML files, do this:
setenv SEARCHDIR SST_HsNISTIT2.0_aDECOY1
mkdir $SEARCHDIR
foreach file ( *.mzXML )
ln -s ../$file $SEARCHDIR/$file
end
cd $SEARCHDIR
cp /regis/sbeams/bin/spectrast/spectrast.params .
vi spectrast.params
echo "-OPNMd -dDECOY" > xinteract.params
setenv TESTDEVPATH /tools/bin/TPP/tpp-dshteynb/bin
/sbeams/bin/spectrast/runspectrast *.mzXML
(The TPP is automatically run on the search results based on whatever is in xinteract.params, and the whole log is emailed to you.)
If you want to rerun the TPP on a dataset, do:
/bin/rm interact*
/bin/rm zzpost*
setenv TESTDEVPATH /tools/bin/TPP/tpp-dshteynb/bin
/sbeams/bin/spectrast/finishspectrastsearch --nowait *.mzXML >& zzpostprocessing.log
(the --nowait is required so that the finisher doesn’t wait around for the .done files)
Notes:
- For X!Tandem, let’s not use the semi-parametric model.
- For SpectraST, let’s DO use the semi-parametric model.
- iProphet and MHT confidence scores are good. Let’s use them. Confidence scores are conservative but good.
- I think LOGPROBS is a bust. I’m not recommending them at the moment.