Getting the software
This software is included in the current TPP distribution.
In a nutshell
ProteinProphet is a tool for generating probabilities for protein identifications based on MS/MS data. ProteinProphet makes use of results from PeptideProphet, which produces validation results for peptide sequence identifications. This software was originally developed at the Seattle Proteome Center (SPC), part of the Institute for Systems Biology (ISB). ProteinProphet is an integral part of the Trans-Proteomic Pipeline (TPP) software distribution.
Since MS/MS spectra are produced by peptides, and not proteins, there is a need for an additional statistical model for validation of the identifications at the protein level. We developed a model that has as input the list of peptides assigned to MS/MS spectra and corresponding probabilities that those peptide assignments are correct. Different peptide identifications corresponding to the same protein are combined together to estimate the probability that their corresponding protein is present in the sample. This protein grouping information is then employed to adjust the individual peptide probabilities, thus making the approach more discriminative. We also address the problem that we call degeneracy, which occurs when one peptide corresponds to several different proteins.
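To make the idea of combining peptide evidence concrete, here is a minimal sketch, assuming the peptide assignments are independent and using the "at least one peptide is correct" rule described in the reference at the end of this page. It deliberately leaves out the NSP adjustment and the degeneracy weights, so it is not a reimplementation of ProteinProphet. The function name protein_probability is made up for this illustration.

```python
from math import prod


def protein_probability(peptide_probs):
    """Probability that at least one of the peptide assignments is correct.

    Assumes the peptide probabilities are independent (an assumption made
    for this sketch; the real model also adjusts them using NSP and
    degeneracy weights).
    """
    return 1.0 - prod(1.0 - p for p in peptide_probs)


# Three peptides assigned to the same protein.
print(round(protein_probability([0.9, 0.6, 0.4]), 3))
# A single moderate-confidence peptide gives much weaker support.
print(round(protein_probability([0.6]), 3))
```

Note how several moderately confident peptides together give stronger support for the protein than any one of them alone.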
As of December 12, 2008, options available for command line users are:
NOPLOT: do not generate plot png file
NOOCCAM: non-conservative maximum protein list
ICAT: highlight peptide cysteines
GLYC: highlight peptide N-glycosylation motif
MINPROB: PeptideProphet probability threshold (default=0.05)
GROUPWTS: check peptide's total weight in the Protein Group against the threshold (default: check peptide's actual weight against threshold)
ACCURACY: equivalent to MINPROB0
ASAP: compute ASAP ratios for protein entries (ASAP must have been run previously on interact dataset)
REFRESH: import manual changes to ASAP ratios (after initially using ASAP option)
NORMPROTLEN: Normalize NSP using Protein Length
PROTLEN: Report Protein Length
INSTANCES: Use Expected Number of Ion Instances to adjust the peptide probabilities prior to NSP adjustment
PROTMW: Get protein mol weights
IPROPHET: input is from iProphet
ASAP_PROPHET: *New and Improved* compute ASAP ratios for protein entries (ASAP must have been run previously on all input interact datasets with mz/XML raw data format)
DELUDE: do NOT use peptide degeneracy information when assessing proteins
EXCELPEPS: write output tab delim xls file including all peptides
EXCELxx: write output tab delim xls file including all protein (group)s with minimum probability xx, where xx is a number between 0 and 1
Some usage notes:
NORMPROTLEN: ProteinProphet grants higher probabilities to proteins with more identified (sibling) peptides (NSP="number of sibling peptides"). NSP is computed as the sum of the probabilities of the peptides: a protein with three peptides of probabilities 0.9, 0.6, and 0.4 would have NSP=1.9. With NORMPROTLEN, NSP is scaled according to protein length. Use of NORMPROTLEN is recommended.
IPROPHET: iProphet, or InterProphet, is software under development that further processes PeptideProphet output before processing by ProteinProphet. It can be used to combine PeptideProphet results from several experiments and search engines. The IPROPHET option to ProteinProphet should be used if and only if the input pepXML file(s) were created by iProphet. iProphet is not yet released to the public.
INSTANCES: this option is superseded by iProphet and should not be used with iProphet.
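The NSP arithmetic from the NORMPROTLEN note can be sketched as follows. The function nsp follows the definition given above (sum of the peptide probabilities). The function nsp_per_100_residues is purely hypothetical: this page does not say how NORMPROTLEN scales NSP, so the "per 100 residues" scaling is only an assumption used to show the effect of length normalization.

```python
def nsp(peptide_probs):
    """NSP as described above: the sum of the peptide probabilities."""
    return sum(peptide_probs)


def nsp_per_100_residues(peptide_probs, protein_length):
    """Hypothetical length scaling, for illustration only.

    The actual formula used by NORMPROTLEN is not given on this page.
    """
    if protein_length <= 0:
        raise ValueError("protein_length must be positive")
    return 100.0 * nsp(peptide_probs) / protein_length


probs = [0.9, 0.6, 0.4]
print(round(nsp(probs), 2))                          # 1.9, as in the note above
print(round(nsp_per_100_residues(probs, 250), 3))    # a short protein
print(round(nsp_per_100_residues(probs, 2500), 3))   # a long protein, same peptides
```

Under this assumed scaling, the same three peptides count for less when they come from a much longer protein.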
Caution for large datasets
ProteinProphet tends to overestimate probabilities when given large datasets (more than, say, 500,000 spectra). Here is what developer Henry Lam had to say in May 2008 on the TPP developers' forum:
"... in larger datasets it becomes too easy for a protein to get a high probabilities due to two or even more random incorrect peptide IDs. I suggest trying the INSTANCES option on ProteinProphet and see if it does better (it should), but it still won't match the decoy estimates. We can still use the probabilitites to rank and threshold protein identifications. There's also the very useful feature of resolving ambiguous peptide-protein mapping. I just wouldn't trust the predicted false discovery rate. Instead, come up with some threshold, preferably as conservative as is sensible, (e.g. ProteinProphet at least 0.9, at least 3 peptides, each observed more than once at PeptideProphet P>0.9, etc), and then count decoys to estimate FDRs."
Proteins are clustered into groups within the protXML element <protein_group>. Usually there is just one protein per group, but sometimes there is more than one. In these cases, the proteins share identified peptides. Often, only a subset of the proteins in a group is needed to explain the presence of all the peptides in the group. Applying Occam's Razor, we assign a probability of zero to the unneeded proteins. This probability is not to be interpreted literally; rather, it allows us to present the shortest list of proteins needed to explain the data.
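The "shortest list of proteins needed to explain the data" idea can be illustrated with a greedy set-cover sketch. This is a swapped-in textbook heuristic, not the procedure ProteinProphet actually uses (which works with probabilities and peptide weights). The function name minimal_protein_list and the example group are made up for illustration.

```python
def minimal_protein_list(protein_to_peptides):
    """Greedily pick proteins until every observed peptide is explained.

    protein_to_peptides maps a protein name to the set of peptides
    identified for it. Ties are broken alphabetically so the result is
    deterministic. Greedy set cover is a heuristic and is not guaranteed
    to find the smallest possible list in every case.
    """
    uncovered = set().union(*protein_to_peptides.values())
    kept = []
    while uncovered:
        # Choose the protein that explains the most still-unexplained peptides.
        best = max(sorted(protein_to_peptides),
                   key=lambda name: len(protein_to_peptides[name] & uncovered))
        kept.append(best)
        uncovered -= protein_to_peptides[best]
    return kept


group = {
    "A": {"pep1", "pep2", "pep3"},
    "B": {"pep2", "pep3"},          # every peptide of B is also explained by A
    "C": {"pep3", "pep4"},
}
needed = minimal_protein_list(group)
unneeded = sorted(set(group) - set(needed))
print(needed, unneeded)
```

In this example protein B is the kind of entry that would receive a probability of zero: all of its peptides are already explained by proteins that are needed anyway.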
Options still under development
MININDEP: Takes a value between 0 and 1. The larger this parameter is, the more protein grouping will take place and the fewer groups you will get. MININDEP sets the minimum amount of protein independentness needed for a protein to not join a group that it shares a peptide with. A protein's "independentness" from a group = (# proteins in group with which it shares peptides)/(# peptides shared). MININDEP=1 means any protein connected to a group by even one peptide will be grouped. Default value is zero. Recommended values are 0.2 or 0.3.
Author David Shteynberg clarifies: "The independentness is measured as the fraction of independent evidence peptides among all peptides in a given protein. To be grouped at least one of the proteins in the group must not have sufficient independent evidence to earn its own entry in the protXML file."
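One way to read the clarification above is sketched below, under the assumption that a peptide counts as "independent evidence" when no other protein under consideration shares it, and that a protein joins a group when its independentness falls below MININDEP. Both points are interpretations made for this sketch, not confirmed details of the implementation, and the function names are hypothetical.

```python
def independentness(protein, protein_to_peptides):
    """Fraction of a protein's peptides that no other protein shares."""
    own = protein_to_peptides[protein]
    if not own:
        return 0.0
    others = set().union(*(peps for name, peps in protein_to_peptides.items()
                           if name != protein))
    return len(own - others) / len(own)


def joins_group(protein, protein_to_peptides, minindep=0.0):
    """Assumed rule: group the protein if its independentness is below MININDEP."""
    return independentness(protein, protein_to_peptides) < minindep


proteins = {
    "A": {"pep1", "pep2", "pep3", "pep4"},   # 3 of 4 peptides are unique
    "B": {"pep4", "pep5"},                   # 1 of 2 peptides is unique
}
for threshold in (0.0, 0.2, 0.6, 1.0):
    grouped = [name for name in sorted(proteins)
               if joins_group(name, proteins, threshold)]
    print(threshold, grouped)
```

Under this reading the behaviour matches the description of the option: with the default of zero nothing is grouped by this rule, larger values group more proteins, and MININDEP=1 groups any protein that shares even one peptide.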
UNMAPPED: Considers even proteins whose identifiers were appended with "_UNMAPPED" in a previous step; that is, proteins that are not mappable to the desired database.
Nesvizhskii AI, Keller A, Kolker E, Aebersold R. (2003) "A statistical model for identifying proteins by tandem mass spectrometry." Anal Chem 75:4646-58.