Software:ProteinProphet

From SPCTools

Revision as of 22:05, 12 December 2008

1 Getting the software
2 In a nutshell
3 More info
4 Usage
5 Caution for large datasets
6 Protein Groups
7 Reference

Getting the software

This software is included in the current TPP distribution.

In a nutshell

ProteinProphet is a tool for generating probabilities for protein identifications based on MS/MS data. ProteinProphet makes use of results from PeptideProphet, which produces validation results for peptide sequence identifications. This software was originally developed at the SPC, part of the ISB. ProteinProphet is an integral part of the Trans-Proteomic Pipeline software distribution.

More info

Since MS/MS spectra are produced by peptides, and not proteins, there is a need for an additional statistical model for validation of the identifications at the protein level. We developed a model that has as input the list of peptides assigned to MS/MS spectra and corresponding probabilities that those peptide assignments are correct. Different peptide identifications corresponding to the same protein are combined together to estimate the probability that their corresponding protein is present in the sample. This protein grouping information is then employed to adjust the individual peptide probabilities, thus making the approach more discriminative. We also address the problem that we call degeneracy, which occurs when one peptide corresponds to several different proteins.
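
The combination step can be sketched in a few lines. This is a simplified illustration, not the ProteinProphet implementation: it assumes the peptide assignments are independent, so that a protein is absent only if every one of its peptide assignments is wrong, and it leaves out the sibling-peptide adjustment and the handling of degenerate peptides described above. The function name is hypothetical.

```python
from math import prod

def protein_probability(peptide_probs):
    """Probability that at least one peptide assignment is correct.

    Simplified sketch: assumes independent peptide evidence and ignores
    the probability adjustment and degeneracy handling described in the text.
    """
    return 1.0 - prod(1.0 - p for p in peptide_probs)

# Three peptides assigned to the same protein.
print(round(protein_probability([0.9, 0.6, 0.4]), 3))  # 0.976
```

Even a moderately confident peptide raises the protein probability here, which is why the adjustment of peptide probabilities matters in practice.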

Usage

As of December 12, 2008, options available for command line users are:

               NOPLOT: do not generate plot png file
               NOOCCAM: non-conservative maximum protein list
               ICAT: highlight peptide cysteines
               GLYC: highlight peptide N-glycosylation motif
               MINPROB: PeptideProphet probability threshold (default=0.05)
               GROUPWTS: check peptide's total weight in the Protein Group against the threshold (default: check peptide's actual weight against threshold)
               ACCURACY: equivalent to MINPROB0
               ASAP: compute ASAP ratios for protein entries
                       (ASAP must have been run previously on interact dataset)
               REFRESH: import manual changes to ASAP ratios
                       (after initially using ASAP option)
               NORMPROTLEN: Normalize NSP using Protein Length
               PROTLEN: Report Protein Length
               INSTANCES: Use Expected Number of Ion Instances to adjust the peptide probabilities prior to NSP adjustment
               PROTMW: Get protein mol weights
               IPROPHET: input is from iProphet
               ASAP_PROPHET: *New and Improved* compute ASAP ratios for protein entries
                       (ASAP must have been run previously on all input interact datasets with mzXML raw data format)
               DELUDE: do NOT use peptide degeneracy information when assessing proteins
               EXCELPEPS: write output tab delim xls file including all peptides
               EXCELxx: write output tab delim xls file including all protein (group)s
                               with minimum probability xx, where xx is a number between 0 and 1

Some usage notes:

NORMPROTLEN: ProteinProphet grants higher probabilities to proteins with more identified (sibling) peptides (NSP="number of sibling peptides"). NSP is computed as the sum of the probabilities of the peptides: a protein with three peptides of probabilities 0.9, 0.6, and 0.4 would have NSP=1.9. With NORMPROTLEN, NSP is scaled according to protein length. Use of NORMPROTLEN is recommended.
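
The NSP arithmetic in the example above can be written out as a minimal sketch. The function name is hypothetical, and the length scaling applied by NORMPROTLEN is not specified here, so it is not modelled.

```python
def nsp(peptide_probs):
    """NSP as described in the text: the sum of the peptide probabilities."""
    return sum(peptide_probs)

# The three-peptide example from the note above.
print(round(nsp([0.9, 0.6, 0.4]), 2))  # 1.9
```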

IPROPHET: iProphet, or InterProphet, is software under development that further processes PeptideProphet output before processing by ProteinProphet. It can be used to combine PeptideProphet results from several experiments and search engines. The IPROPHET option to ProteinProphet should be used if and only if the input pepXML file(s) were created by iProphet. iProphet is not yet released to the public.

INSTANCES: this option is superseded by iProphet and should not be used with iProphet.
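
As a sketch of how the options and notes above might be put together, the helper below assembles a command line and refuses the INSTANCES plus IPROPHET combination. The executable name, the argument order (input pepXML files, then the output protXML file, then options) and the file names are assumptions for illustration; they are not stated on this page, so check the usage message of your TPP installation before relying on them.

```python
def build_command(pepxml_files, protxml_out, options=()):
    """Assemble a ProteinProphet argument list (hypothetical helper).

    Assumed layout: executable, input pepXML files, output protXML file,
    then option keywords from the list above.
    """
    opts = [o.upper() for o in options]
    # Usage note: INSTANCES should not be used with iProphet input.
    if "IPROPHET" in opts and "INSTANCES" in opts:
        raise ValueError("INSTANCES should not be combined with IPROPHET")
    return ["ProteinProphet", *pepxml_files, protxml_out, *opts]

cmd = build_command(["interact.pep.xml"], "interact.prot.xml",
                    ["NORMPROTLEN", "EXCEL0.9"])
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run once the layout has been confirmed for your installation.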

Caution for large datasets

ProteinProphet tends to overestimate probabilities when given large datasets (more than, say, 500,000 spectra). Here is what developer Henry Lam had to say in May 2008 on the TPP developers' forum:

"... in larger datasets it becomes too easy for a protein to get a high probability due to two or even more random incorrect peptide IDs. I suggest trying the INSTANCES option on ProteinProphet and see if it does better (it should), but it still won't match the decoy estimates. We can still use the probabilities to rank and threshold protein identifications. There's also the very useful feature of resolving ambiguous peptide-protein mapping. I just wouldn't trust the predicted false discovery rate. Instead, come up with some threshold, preferably as conservative as is sensible, (e.g. ProteinProphet at least 0.9, at least 3 peptides, each observed more than once at PeptideProphet P>0.9, etc), and then count decoys to estimate FDRs."

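The decoy-counting step suggested in the quote can be sketched as follows. This is an illustration under stated assumptions: the search used a concatenated target-decoy database with roughly equal numbers of target and decoy entries, decoy accessions carry a recognisable prefix (the "DECOY_" prefix here is hypothetical), and the FDR is estimated as decoys divided by targets among the proteins that pass the threshold.

```python
def decoy_fdr(proteins, min_prob=0.9, decoy_prefix="DECOY_"):
    """Estimate the FDR by counting decoys above a probability threshold.

    proteins: iterable of (accession, probability) pairs.
    Returns decoys / targets among passing entries (0.0 if no targets pass).
    """
    passing = [acc for acc, prob in proteins if prob >= min_prob]
    decoys = sum(acc.startswith(decoy_prefix) for acc in passing)
    targets = len(passing) - decoys
    return decoys / targets if targets else 0.0

# Made-up accessions and probabilities, for illustration only.
example = [("P1", 0.99), ("P2", 0.95), ("DECOY_P3", 0.92), ("P4", 0.91),
           ("P5", 0.50), ("DECOY_P6", 0.30), ("P7", 0.97)]
print(decoy_fdr(example))  # 0.25
```

Additional filters from the quote, such as a minimum number of peptides per protein, would be applied before counting.
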
Protein Groups

Proteins are clustered into groups within the protXML element <protein_group>. Usually there is just one protein per group, but sometimes there is more than one. In these cases, the proteins share identified peptides. Often, only a subset of the proteins in a group is needed to explain the presence of all the peptides in the group. Applying Occam's Razor, we assign a probability of zero to the unneeded proteins. This probability is not to be interpreted literally; rather, it allows us to present the shortest list of proteins needed to explain the data.
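
The idea of keeping only the proteins needed to explain the peptides can be illustrated with a greedy set cover. This is a stand-in chosen for brevity, not the procedure ProteinProphet itself uses, and the function name and the protein and peptide labels are hypothetical.

```python
def minimal_protein_list(protein_to_peptides):
    """Greedy set cover: pick proteins until every peptide is explained.

    protein_to_peptides: dict mapping a protein name to a set of peptides.
    Proteins that are never picked correspond to the "unneeded" proteins.
    """
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # Take the protein explaining the most still-unexplained peptides;
        # ties go to the alphabetically first name.
        best = max(sorted(protein_to_peptides),
                   key=lambda p: len(protein_to_peptides[p] & uncovered))
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen

group = {"A": {"pep1", "pep2", "pep3"}, "B": {"pep1", "pep2"}, "C": {"pep3"}}
print(minimal_protein_list(group))  # ['A']
```

Here protein A alone accounts for all three peptides, so B and C would be the entries reported with probability zero.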

Reference

Nesvizhskii AI, Keller A, Kolker E, Aebersold R. (2003) "A statistical model for identifying proteins by tandem mass spectrometry." Anal Chem 75:4646-58. PDF: http://tools.proteomecenter.org/publications/Nesvizhskii.AnalChem.03.pdf

Retrieved from "http://tools.proteomecenter.org/wiki/index.php?title=Software:ProteinProphet"