Terry's blog
From SPCTools
October 20, 2008
Read AMASS paper on filtering spectra based on (a) match percentage of high-abundance ions, and (b) consecutivity / continuity of matched fragment ions. Did not have time to comprehend the formulae. Paper is not highly cited.
October 14, 2008
Learning what a bad spectrum identification looks like
Talked for about an hour with Jimmy Eng, trying to find out the features of a misidentified spectrum. We looked at the five decoys that made it into the human urine peptide atlas. He was easily able to tag each of them as a poor identification. Concepts:
- Red flag: low intensity ID'd peaks amidst high density of low intensity non-ID'd peaks. Jimmy would devalue/ignore those peak IDs.
- Red flag: no strong ID'd peaks in the higher mass range
- Red flag: gaps in consecutivity, especially in higher mass range and esp. for Y-ions
- For precursor ion of charge N, should see strongest peaks for fragments of charge <= N-1
- Unidentified peaks at -18 (water) and -28 (?) may be OK
- Want to see lots of strong ID'd peaks rising above the level of the non-ID'd peaks
- One decoy had R at N-term (missed cleavage) and no RK at C-term (miscleavage). Must have been a semi-tryptic search. Such a combination of unconventional cleavages is unusual but not extremely rare. I asked Jimmy if he'd devalue the ID and he said, "No, the semi-tryptic search was done because such cleavage patterns are expected." However, I didn't suggest tossing out the ID, just devaluing it. I think devaluing it might be appropriate.
- PeptideProphet does not consider X!Tandem's expect score, which is based on the distribution of scores for a particular spectrum vs. entire sequence DB. Expect = # of peptides we expect to see at this [raw] score by random chance. One decoy had an expect=12 -- very high. PeptideProphet only looks at hyperscore (top hit) and next (next best hit) and does not consider the distribution.
- Low mass portion of spectrum tends to be more messy and less interesting
- We want big, high mass Y-ions
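A minimal sketch of how the consecutivity red flag might be checked for Y-ions. This is my own illustration, not how Jimmy or any search engine scores it. Assumptions: singly charged y-ions only, unmodified residues, a hypothetical match tolerance of 0.5 m/z, and function names (y_ion_mzs, matched_flags, longest_run, gaps_in_upper_half) invented for this example.

```python
# Monoisotopic residue masses in Da (standard values; modifications not handled).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def y_ion_mzs(peptide):
    """Singly charged y-ion m/z values, ordered y1, y2, ..., y(n-1)."""
    mzs = []
    total = WATER + PROTON
    # Walk from the C-terminus; skip the first residue so the intact
    # peptide itself is not counted as a fragment.
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        mzs.append(total)
    return mzs


def matched_flags(theoretical, observed_mzs, tol=0.5):
    """True for each theoretical m/z that has an observed peak within tol (assumed value)."""
    return [any(abs(t - o) <= tol for o in observed_mzs) for t in theoretical]


def longest_run(flags):
    """Length of the longest stretch of consecutive matched ions."""
    best = run = 0
    for matched in flags:
        run = run + 1 if matched else 0
        best = max(best, run)
    return best


def gaps_in_upper_half(flags):
    """Number of unmatched ions in the higher-mass half of the ladder."""
    return flags[len(flags) // 2:].count(False)


if __name__ == "__main__":
    ladder = y_ion_mzs("PEPTIDEK")
    # Made-up observation: y3 and y7 are missing.
    observed = [ladder[0], ladder[1], ladder[3], ladder[4], ladder[5]]
    flags = matched_flags(ladder, observed)
    print(flags, longest_run(flags), gaps_in_upper_half(flags))
```

A short longest run, or several gaps in the upper half, would be a reason to devalue the ID rather than toss it outright.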
I learned the following tangentially:
- PeptideProphet does make use of LC retention time.
- We do expect to see strong peaks just to the left of the precursor m/z.
- Mobile proton theory: Cleavage via CID occurs near any positive charge (proton). When there is a basic C-term (KR) and a basic N-term (N-term always basic), a spare proton is free to roam the length of the peptide because it is equally attracted to both basic ends. This makes cleavage approximately equally likely at all positions. If a peptide lacks that balance, then cleavage will tend to happen closer to any basic residue.
- Enhanced cleavage N-terminal to proline --> strong Y-ion beginning with that proline
- SEQUEST already considers consecutiveness of peak IDs in its preliminary scoring
- Jimmy has tried peak picking by sliding a window across a spectrum, calculating mean/SD of peak heights, to separate signal from noise.
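A rough sketch of the sliding-window idea in the last item, as I understand it. I don't know the details of what Jimmy actually did, so the window half-width (50 m/z), the threshold of mean + 2 SD, and the function name pick_signal_peaks are all my assumptions.

```python
from statistics import mean, pstdev


def pick_signal_peaks(peaks, half_width=50.0, k=2.0):
    """Keep peaks that stand out from their local neighborhood.

    peaks: list of (mz, intensity) tuples.
    A peak is called signal if its intensity exceeds the mean plus k standard
    deviations of all peak intensities within +/- half_width m/z of it.
    half_width and k are assumed values, not tuned on real data.
    """
    signal = []
    for mz, intensity in peaks:
        local = [i for m, i in peaks if abs(m - mz) <= half_width]
        mu = mean(local)
        sd = pstdev(local)
        # sd == 0 means every peak in the window has the same height: no signal.
        if sd > 0 and intensity > mu + k * sd:
            signal.append((mz, intensity))
    return signal


if __name__ == "__main__":
    # Made-up spectrum: low alternating noise with one tall peak at m/z 150.
    spectrum = [(100.0 + 5.0 * i, 1.0 + (i % 2)) for i in range(20)]
    spectrum[10] = (150.0, 100.0)
    print(pick_signal_peaks(spectrum))
```

One caveat with this kind of threshold: in a window with very few peaks, even a tall peak cannot sit many SDs above the mean, so sparse regions may need a different rule.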
October 13, 2008
States et al. paper (Nature Biotechnology 2005) offers a method for estimating protein ID confidence levels which incorporates protein length. Eric implemented this in ProteinProphet and reports results in analysis.out. This method gives similar yield to simply tossing out all singletons, but is perhaps more theoretically sound (sometimes Eric thinks it's just more complicated).
Plan:
- Incorporate Swiss-Prot annotations into PeptideAtlas as a way of learning how to program PeptideAtlas
- Incorporate PeptideProphet information (e.g. probabilities) into PeptideAtlas. When changing schema, also insert spaces for States et al. metrics:
- expected number of false positives per protein
- likelihood of actual # of hits occurring by chance ("probability" = 1 - this)
- Figure out how to get this info into PeptideAtlas
- Reload yeast PeptideAtlas so that this info is viewable, and draw conclusions from the info.
Talked to Lik Wee this morning -- new postdoc in Martin lab, studying proteomics. His first assignment is to learn the different search engines. I explained my project to him. Analyzing spectra for (a) large unidentified peaks, (b) big gaps in peak IDs, and (c) low average peak intensity sounds like a viable approach for appropriately deprecating some PeptideAtlas IDs. Can I use a machine learning algorithm for this? Ask Nikita.
October 8, 2008
I keep thinking the date must be a little later than it actually is. Like, I thought today should be the 9th. At the same time, it seems not so long ago that 2008 felt like a very new year, and it was hard to imagine it would ever feel otherwise.
My task for the next 11 months here at ISB is to reduce the false positive rate in PeptideAtlas. We want to catalog as many proteins as possible while minimizing inclusion of proteins that aren't really in the sample. Eric Deutsch and I have had several discussions about this, starting at my interview in May of this year. This is a place for me to record what I've learned from Eric, and to record ideas of my own and those I've received from others.
Ideas for reducing the false positive rate in PeptideAtlas
- Discard singletons (proteins represented by only a single spectrum)
- Require a much higher probability cutoff for singletons (e.g. 0.99 instead of 0.90)
- Require a much higher probability cutoff for all protein identifications (e.g. 0.99 instead of 0.90)
- Set a fixed FDR, say 0.1%, and set probability cutoffs accordingly
- Local FDRs should match the global FDR
- Use Henry Lam's SpectraST quality filter
- Cutoff of 0.99 for all nobs (cryptic incomplete note)
- Make use of this observation: the decoy estimated FDR is much smaller than that obtained by averaging the probabilities of all (peptide?) identifications and subtracting from 1. Suggests that decoy estimated FDR is too small, or probabilities are too small.
- Recalculate probabilities for short peptides based on ((# short decoys) / (total hits to short peptides))
- Do the search engines make use of LC retention times? If not, try.
- Look at peak intensity. Searches and TPP do not look at this.
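To make the two FDR estimates in the list above concrete, here is a small sketch. It assumes each identification is a (probability, is_decoy) pair and that the decoy database is the same size as the target database, so decoys/targets estimates the FDR. The function names and the toy data are mine, not anything from TPP or PeptideAtlas.

```python
def decoy_fdr(ids, cutoff):
    """Decoy-estimated FDR: accepted decoys / accepted targets."""
    targets = sum(1 for p, is_decoy in ids if p >= cutoff and not is_decoy)
    decoys = sum(1 for p, is_decoy in ids if p >= cutoff and is_decoy)
    return decoys / targets if targets else 0.0


def model_fdr(ids, cutoff):
    """Probability-based FDR: 1 minus the mean probability of accepted targets."""
    probs = [p for p, is_decoy in ids if p >= cutoff and not is_decoy]
    return 1.0 - sum(probs) / len(probs) if probs else 0.0


def cutoff_for_fdr(ids, target_fdr, estimator=decoy_fdr):
    """Lowest probability cutoff whose estimated FDR is at or below target_fdr.

    Returns None if no cutoff among the observed probabilities qualifies.
    """
    for cutoff in sorted({p for p, _ in ids}):
        if estimator(ids, cutoff) <= target_fdr:
            return cutoff
    return None


if __name__ == "__main__":
    # Made-up identifications: (probability, is_decoy).
    ids = [(0.99, False), (0.98, False), (0.95, False),
           (0.95, True), (0.90, False), (0.90, True)]
    print(decoy_fdr(ids, 0.90), model_fdr(ids, 0.90))
    print(cutoff_for_fdr(ids, 0.0))
```

Comparing decoy_fdr and model_fdr at the same cutoff on real data is the check suggested by the observation above: a large disagreement means one of the two estimates is off.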
Spectrum features to watch for
- Big unidentified peaks are bad (possible exception: peak just to left of precursor m/z -- confirm with expert)
- Consecutive identified peaks are good. Breaks in this are bad.
- Important identifications should be on big peaks, not small ones. Searches and TPP do not take into account peak intensity.
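A first-pass sketch of turning the intensity ideas above into numbers. Everything here is an assumption on my part: the feature names, the 20 m/z window used for the "just left of precursor" exception (which still needs confirming with an expert), and the input format (a peak list plus a parallel list of identified/not-identified flags).

```python
def intensity_features(peaks, identified, precursor_mz=None, precursor_window=20.0):
    """Summarize how well the identified peaks account for the spectrum.

    peaks: list of (mz, intensity); identified: parallel list of bools.
    If precursor_mz is given, unidentified peaks within precursor_window
    below it are not counted as "big unidentified peaks" (assumed exception).
    """
    total = sum(i for _, i in peaks)
    base = max(i for _, i in peaks)
    id_total = sum(i for (_, i), hit in zip(peaks, identified) if hit)

    largest_unid = 0.0
    for (mz, inten), hit in zip(peaks, identified):
        if hit:
            continue
        near_precursor = (
            precursor_mz is not None
            and precursor_mz - precursor_window <= mz < precursor_mz
        )
        if not near_precursor:
            largest_unid = max(largest_unid, inten)

    return {
        # Fraction of all intensity that sits on identified peaks.
        "identified_intensity_fraction": id_total / total,
        # Tallest unidentified peak relative to the base peak.
        "largest_unidentified_rel": largest_unid / base,
    }


if __name__ == "__main__":
    # Made-up spectrum with a strong unidentified peak just below the precursor.
    peaks = [(200.0, 10.0), (300.0, 50.0), (400.0, 100.0), (495.0, 80.0), (600.0, 5.0)]
    identified = [False, True, True, False, True]
    print(intensity_features(peaks, identified, precursor_mz=500.0))
```

A low identified fraction or a high largest_unidentified_rel would be grounds for devaluing an ID; where to put the thresholds is something I'd have to learn from real spectra.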