Software:ProteinProphet

From SPCTools

Revision as of 21:29, 12 December 2008 by Tfarrah

Getting the software

This software is included in the current TPP distribution.

In a nutshell

ProteinProphet is a tool for generating probabilities for protein identifications based on MS/MS data. ProteinProphet makes use of results from PeptideProphet, which produces validation results for peptide sequence identifications. This software was originally developed at the SPC, part of the ISB. ProteinProphet is an integral part of the Trans-Proteomic Pipeline software distribution.

More info

Since MS/MS spectra are produced by peptides, and not proteins, there is a need for an additional statistical model for validation of the identifications at the protein level. We developed a model that has as input the list of peptides assigned to MS/MS spectra and corresponding probabilities that those peptide assignments are correct. Different peptide identifications corresponding to the same protein are combined together to estimate the probability that their corresponding protein is present in the sample. This protein grouping information is then employed to adjust the individual peptide probabilities, thus making the approach more discriminative. We also address the problem that we call degeneracy, which occurs when one peptide corresponds to several different proteins.
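The combination step can be illustrated with a minimal sketch. It assumes the simplest form of the model described in the cited reference: peptide identifications are treated as independent evidence, so the protein is considered present unless every one of its peptide assignments is wrong. The function name is hypothetical, and the sketch deliberately leaves out the NSP adjustment and the degeneracy weights that ProteinProphet applies in practice.

```python
from math import prod


def protein_probability(peptide_probs):
    """Probability that a protein is present, given its peptide probabilities.

    Simplifying assumption: peptide identifications are independent, so the
    protein is absent only if every peptide assignment is incorrect.
    """
    return 1.0 - prod(1.0 - p for p in peptide_probs)


# Three peptides with probabilities 0.9, 0.6 and 0.4:
# 1 - (0.1 * 0.4 * 0.6) = 0.976
print(round(protein_probability([0.9, 0.6, 0.4]), 3))
```

Note how several moderate-probability peptides combine into a high protein probability; this is the behaviour that makes the large-dataset caution later on this page relevant.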

Usage

As of December 12, 2008, options available for command line users are:

               NOPLOT: do not generate plot png file
               NOOCCAM: non-conservative maximum protein list
               ICAT: highlight peptide cysteines
               GLYC: highlight peptide N-glycosylation motif
               MINPROB: PeptideProphet probability threshold (default=0.05)
               GROUPWTS: check peptide's total weight in the Protein Group against the threshold (default: check peptide's actual weight against threshold)
               ACCURACY: equivalent to MINPROB0
               ASAP: compute ASAP ratios for protein entries
                       (ASAP must have been run previously on interact dataset)
               REFRESH: import manual changes to ASAP ratios
                       (after initially using ASAP option)
               NORMPROTLEN: Normalize NSP using Protein Length
               PROTLEN: Report Protein Length
               INSTANCES: Use Expected Number of Ion Instances to adjust the peptide probabilities prior to NSP adjustment
               PROTMW: Get protein mol weights
               IPROPHET: input is from iProphet
               ASAP_PROPHET: *New and Improved* compute ASAP ratios for protein entries
                       (ASAP must have been run previously on all input interact datasets with mz/XML raw data format)
               DELUDE: do NOT use peptide degeneracy information when assessing proteins
               EXCELPEPS: write output tab delim xls file including all peptides
               EXCELxx: write output tab delim xls file including all protein (group)s
                               with minimum probability xx, where xx is a number between 0 and 1

Some usage notes:

NORMPROTLEN: ProteinProphet grants higher probabilities to proteins with more identified (sibling) peptides (NSP="number of sibling peptides"). NSP is computed as the sum of the probabilities of the peptides: a protein with three peptides of probabilities 0.9, 0.6, and 0.4 would have NSP=1.9. With NORMPROTLEN, NSP is scaled according to protein length. Use of NORMPROTLEN is recommended.
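As a small worked example of the NSP arithmetic described above, the following sketch sums peptide probabilities for one protein. The function name is hypothetical. How NORMPROTLEN scales NSP by protein length is not specified on this page, so that step is intentionally not modeled here.

```python
def nsp(peptide_probs):
    """NSP as described above: the sum of the peptide probabilities."""
    return sum(peptide_probs)


# The example from the text: peptides with probabilities 0.9, 0.6 and 0.4.
print(round(nsp([0.9, 0.6, 0.4]), 2))  # 1.9
```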

IPROPHET: iProphet, or InterProphet, is software under development that further processes PeptideProphet output before processing by ProteinProphet. It can be used to combine PeptideProphet results from several experiments and search engines. The IPROPHET option to ProteinProphet should be used if and only if the input pepXML file(s) were created by iProphet. iProphet is not yet released to the public.

INSTANCES: this option is superseded by iProphet and should not be used with iProphet.

Caution for large datasets

ProteinProphet tends to overestimate probabilities when given large datasets (more than, say, 500,000 spectra). Here is what developer Henry Lam had to say in May 2008 on the TPP developers' forum:

"... in larger datasets it becomes too easy for a protein to get a high probability due to two or even more random incorrect peptide IDs. I suggest trying the INSTANCES option on ProteinProphet and see if it does better (it should), but it still won't match the decoy estimates. We can still use the probabilities to rank and threshold protein identifications. There's also the very useful feature of resolving ambiguous peptide-protein mapping. I just wouldn't trust the predicted false discovery rate. Instead, come up with some threshold, preferably as conservative as is sensible, (e.g. ProteinProphet at least 0.9, at least 3 peptides, each observed more than once at PeptideProphet P>0.9, etc), and then count decoys to estimate FDRs."
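The threshold-then-count-decoys approach suggested in the quote could be sketched as follows. Everything about the data layout is an assumption made for illustration: the ProteinEntry class, the "DECOY_" name prefix, and the reading of "each observed more than once at PeptideProphet P>0.9" as "each peptide has at least two observations with probability above 0.9". The FDR estimate (decoys divided by targets among passing entries) further assumes a search against target and decoy databases of equal size. None of this is ProteinProphet's own code or output format.

```python
from dataclasses import dataclass, field


@dataclass
class ProteinEntry:
    name: str            # protein name; decoys carry a hypothetical prefix
    probability: float   # ProteinProphet probability
    # peptide sequence -> PeptideProphet probabilities, one per observation
    peptides: dict = field(default_factory=dict)


def passes_threshold(entry, min_protein_prob=0.9, min_peptides=3,
                     min_peptide_prob=0.9):
    """Apply the example threshold from the quote (one possible reading)."""
    if entry.probability < min_protein_prob:
        return False
    # A peptide qualifies if it was observed more than once above the cutoff.
    qualifying = sum(
        1 for probs in entry.peptides.values()
        if sum(p > min_peptide_prob for p in probs) > 1
    )
    return qualifying >= min_peptides


def decoy_fdr(entries, decoy_prefix="DECOY_"):
    """Estimate FDR as decoys / targets among entries passing the threshold."""
    passing = [e for e in entries if passes_threshold(e)]
    decoys = sum(e.name.startswith(decoy_prefix) for e in passing)
    targets = len(passing) - decoys
    return decoys / targets if targets else 0.0
```

If the estimated FDR is higher than acceptable, the thresholds can be tightened and the decoys counted again.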

protXML output format

Reference

Nesvizhskii AI, Keller A, Kolker E, Aebersold R. (2003) "A statistical model for identifying proteins by tandem mass spectrometry." Anal Chem 75:4646-58
