Terry's blog


November 10, 2008

To discuss with Eric today:

Gil Omenn meeting

Swiss-Prot success (prepare). Linking PA and "complete list of human protein seqs"

Modify QualScore to score IDs. Possible features:

  • AMASS consecutivity formula
  • If a proline is present, height of the adjacent peak relative to the average ID'd peak height
  • For several (five?) windows: average height of ID'd peaks relative to average height of non-ID'd peaks. We expect this feature to be more significant in the high m/z range. (Rough sketch after this list.)
  • Given precursor ion charge N, avg. peak height for fragments of charge <= N-1 divided by avg peak height for fragments of charge N (should be less than 1)
  • Some combined measure of the count and height of unidentified peaks within each of several (five?) windows.
  • Ratio of identified Y-ion peaks vs. B-ion peaks (correct for peptides where B-ions are expected to be strong; I think these are certain semi-tryptic peptides)
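
A minimal sketch of the windowed-height feature, assuming each peak is an (m/z, intensity) pair and the m/z values matched by the search engine are available as a set; the five-window default and 0.5 Th match tolerance are my guesses, not anything QualScore does today:

  # Windowed ratio of identified to unidentified peak heights (sketch only).
  # peaks: list of (mz, intensity); identified_mz: set of matched fragment m/z.
  def windowed_id_height_ratios(peaks, identified_mz, n_windows=5, tol=0.5):
      mzs = [mz for mz, _ in peaks]
      lo, hi = min(mzs), max(mzs)
      width = (hi - lo) / n_windows or 1.0
      ratios = []
      for w in range(n_windows):
          w_lo = lo + w * width
          w_hi = hi + 1e-9 if w == n_windows - 1 else lo + (w + 1) * width
          ided, other = [], []
          for mz, inten in peaks:
              if not (w_lo <= mz < w_hi):
                  continue
              if any(abs(mz - m) <= tol for m in identified_mz):
                  ided.append(inten)
              else:
                  other.append(inten)
          if ided and other:
              ratios.append((sum(ided) / len(ided)) / (sum(other) / len(other)))
          else:
              ratios.append(None)  # window empty or one-sided; no ratio
      return ratios

Ratios well above 1 in the high-m/z windows would support an ID; ratios near 1 (or missing) would be a red flag.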

Talk to Nikita about appropriate methods. Can use QualScore (advantage: learn Java) or PeptideProphet as base code?

Make PA use FDR, not probability

November 9, 2008

Lessons and ideas taken away from proteomics course last week:

  • Don't use PeptideProphet to merge results from different samples (as I've been doing with Priska's data)
  • Gained understanding of X!Tandem's refinement option: it allows a two-stage search. The second stage (a) gathers additional spectrum+peptide IDs for already ID'd proteins, and (b) loosens requirements by, for example, allowing missed cleavages, fewer tryptic termini, more modified AAs, etc. Purpose: speed. Stage 2 need only consider already ID'd proteins -- a fraction of the initial protein database -- when applying the time-consuming loosened requirements.
  • NTT option in PeptideProphet makes a huge difference in the probabilities.
    • Acceptable with X!Tandem k-score but not with native X!Tandem
    • Can't use with X!Tandem refinement because refinement breaks assumptions used by NTT modeling.
  • How can we find out for, say, the Urine PeptideAtlas, which proteins were ID'd by SEQUEST and X!Tandem that were missed by SpectraST?
  • Proteomics pipeline of the future (according to Stein): 1. SpectraST searching 2. Search against library of spectra that were unidentified in previous experiments 3. QualScore tags high quality unidentified spectra (TMF inserted this) 4. Search with SEQUEST, X!Tandem, or other engine(s) that compare against theoretical spectra. Implement this. Or, implement only steps 1+4 (easier).
  • Minimum size library for spectral searching is 20,000, else too many FP. If library too small, tack on library from another species.
  • Fundamental limitation in MS/MS proteomics: modified versions of high abundance proteins are still abundant enough to mask low abundance proteins.
  • Can find more proteins by searching DNA databases. Alexei searches ESTs.
  • Immonium ions -- single side chain fragments formed by a combination of a-type and y-type cleavage -- can be diagnostic of the presence of that AA in the sequence. Don't know if any search engines use this.
  • Jimmy likes to use unconstrained search for same purpose as decoy search
  • Can do spectral processing (cleaning) -- helps for some searches, hinders or is neutral for others.
  • Use Pep3D (avail from Petunia) to quickly, visually assess success of MS/MS
  • QualScore can be used to select high quality, unidentified spectra for a less constrained search (DNA dbs, more mods, etc.). (Toy sketch of this selection step below.)
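
Toy sketch of that selection step; the score threshold and the input layout (spectrum name mapped to an identified flag plus a quality score) are my assumptions, not QualScore's actual interface:

  # Select high-quality unidentified spectra for a second, less constrained search.
  # Threshold and data layout are assumptions for illustration.
  def select_for_second_pass(spectra, score_threshold=1.0):
      """spectra: dict of spectrum_name -> (was_identified, quality_score)."""
      return [name for name, (identified, score) in spectra.items()
              if not identified and score >= score_threshold]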

November 6, 2008

Alter QualScore algorithm (or even code) to score peptide IDs. This could be my project.

Eric: "Btw, here is an article about the UniProt 'Complete Proteome'. It would be nice to understand how to correlate the 20,325 with what we use in PeptideAtlas." What would this mean?

  • Adding links to the database from PeptideAtlas. There are already links to UniProt and UniProtKB/TrEMBL for some proteins, and there are Swiss-Prot accession numbers in the biosequence_name field of many/most IPI sequences.

The Nov. 4 release of UniProtKB/Swiss-Prot is shown as having 20,328 human entries. Is this, then, the 'Complete Proteome'?
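
One way I could imagine starting that correlation, assuming I can dump the Swiss-Prot accessions already linked from PeptideAtlas biosequences and a flat list of the ~20,300 human accessions from UniProt; the file names below are placeholders, not real exports:

  # Rough sketch: how many UniProt human accessions are already reachable
  # from PeptideAtlas? Input files are placeholders, one accession per line.
  def load_accessions(path):
      with open(path) as f:
          return {line.strip() for line in f if line.strip()}

  uniprot_human = load_accessions("uniprot_human_accessions.txt")
  pa_swissprot = load_accessions("peptideatlas_swissprot_accessions.txt")
  covered = uniprot_human & pa_swissprot
  print(len(covered), "of", len(uniprot_human), "UniProt entries map to PeptideAtlas")
  print(len(uniprot_human - pa_swissprot), "UniProt entries have no PeptideAtlas link")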

October 28, 2008

Eric thinks it will be useful to apply Mayu after ProteinProphet to compute the FDR (?)

October 27, 2008

Meeting with Eric:

Major topic is how to best incorporate iProphet into the TPP. Should iProphet be run on each experiment (possibly multiple searches) individually, or on many experiments together? Results with HUPO data suggest the latter. With the former, false positives accumulate. Here are some steps to take toward figuring out how to make use of iProphet:

  • Establish this practice: always run xinteract (PeptideProphet) with -E (experiment_label) flag. This will make it easier to adapt createPipelineInput.pl to process iProphet output.
  • iProphet is memory-intensive, which makes it difficult to use on lots of data. Possible solutions: break up the spectra into batches and run iProphet on each separately, or have iProphet generate its models based only on a random set of MAXSPECTRA (~ 1,000,000) spectra.
  • We want to lower the pepXML probability threshold for inclusion in PeptideAtlas. Currently the pipeline takes the user-specified threshold (usually 0.9) and discards every identification whose pepXML probability falls below it, even if the ProteinProphet-adjusted probability would come out above the threshold; that has been acceptable because the adjusted probabilities usually aren't that different. With iProphet, though, the adjusted probabilities could differ substantially, so we no longer want to pre-filter peptides at PeptideProphet prob < 0.9. Instead, we hard-coded createPipelineInput.pl to pre-filter at prob < 0.5 (illustrated in the sketch after this list).
  • SpectraST now uses only PeptideProphet probabilities. Henry is willing to use ProteinProphet probs if we give them to him in a text file. When the time is right, I should email Henry about this.
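
The real pre-filter change lives in createPipelineInput.pl (Perl); the fragment below is just an illustration of the looser cut, assuming the PeptideProphet probabilities are read from the <peptideprophet_result> elements of pepXML:

  # Illustration only: count PSMs surviving a PeptideProphet probability
  # pre-filter by reading <peptideprophet_result probability="..."> from pepXML.
  import xml.etree.ElementTree as ET

  def count_surviving_psms(pepxml_path, min_prob=0.5):
      kept = total = 0
      for _, elem in ET.iterparse(pepxml_path):
          if elem.tag.endswith("peptideprophet_result"):
              total += 1
              if float(elem.get("probability", 0.0)) >= min_prob:
                  kept += 1
              elem.clear()  # keep memory bounded on large files
      return kept, total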

We also talked about stuff that should improve PeptideAtlas quality aside from iProphet:

  • step01 of the PeptideAtlas build pipeline optionally takes a spectrum library and removes identifications that don't appear in the library. We can make use of this by going all the way through step08 (spectrum library creation using SpectraST), then running the build pipeline steps 01-08 again, applying the filter. Should improve our results.
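
A stripped-down picture of what I understand that filter to do, assuming the SpectraST library can be reduced to a set of (peptide, charge) pairs; parsing the actual .splib file is glossed over:

  # Drop identifications whose (peptide, charge) never appears in the library.
  # identifications: iterable of dicts with 'peptide' and 'charge' keys;
  # library_keys: set of (peptide_sequence, charge) tuples from the library.
  def filter_ids_by_library(identifications, library_keys):
      return [psm for psm in identifications
              if (psm["peptide"], psm["charge"]) in library_keys]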

October 21, 2008

Yesterday's meeting with Eric:

We had wished to be first to publish a large, high-quality human plasma proteome. Mann beat us to it just this month. Mann published 697 very high quality IDs. We have about 2400 but they need to be screened. Would be nice to publish 2000. We have a greater variety of data. HUPO plasma proteome project published about 3000 proteins. Leader = Gil Omenn, VP of HUPO. Will visit here. HUPO plasma project phase 1 is now over, a volume was published including a paper by Eric. Now we're in phase II.

AMASS method for screening spectra based on continuity of identified ions: two approaches:

  • incorporate scoring function into PeptideAtlas so we can see how it performs
  • incorporate into a program which processes mzXML files and adds a value in the <search_score> field to the pepXML. David S. has already expressed some interest in allowing an additional arbitrary discriminant score as input to PeptideProphet, which he would then try to model and would use if it were discriminating. The AMASS score could be such a score. If David implements this general functionality, then the score would automatically be used to influence peptide probabilities.
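
A sketch of the second option, assuming the extra value can be attached as a standard pepXML <search_score> element; the score name "amass_continuity" and the stub scoring function are placeholders of mine, not an existing tool:

  # Attach an extra <search_score> to each pepXML search_hit so that a
  # modeling step could later pick it up. The scoring function is a stub.
  import xml.etree.ElementTree as ET

  PEPXML_NS = "http://regis-web.systemsbiology.net/pepXML"
  ET.register_namespace("", PEPXML_NS)

  def amass_continuity_score(search_hit):
      # Placeholder: would score continuity of the identified fragment ions
      # for this hit, using peaks pulled from the corresponding mzXML scan.
      return 0.0

  def annotate_pepxml(in_path, out_path):
      tree = ET.parse(in_path)
      for hit in tree.iter("{%s}search_hit" % PEPXML_NS):
          score = ET.SubElement(hit, "{%s}search_score" % PEPXML_NS)
          score.set("name", "amass_continuity")
          score.set("value", "%.4f" % amass_continuity_score(hit))
      tree.write(out_path, xml_declaration=True, encoding="UTF-8")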

Want to figure out which search engines already consider ID continuity so we don't double model. SEQUEST does not. X!Tandem? Ask Henry.

A-Score method, already used at ISB for figuring out which sites are actually phosphorylated, could be incorporated similarly (?)

October 20, 2008

Read AMASS paper on filtering spectra based on (a) match percentage of high-abundance ions, and (b) consecutivity / continuity of matched fragment ions. Did not have time to comprehend the formulae. Paper is not highly cited.

Am building a PeptideAtlas from iProphet results. Creating wiki page describing the process.

October 14, 2008

Learning what a bad spectrum identification looks like

Talked for about an hour with Jimmy Eng, trying to find out the features of a misidentified spectrum. We looked at the five decoys that made it into the human urine peptide atlas. He was easily able to tag each of them as a poor identification. Concepts:

  • Red flag: low intensity ID'd peaks amidst high density of low intensity non-ID'd peaks. Jimmy would devalue/ignore those peak IDs.
  • Red flag: no strong ID'd peaks in the higher mass range
  • Red flag: gaps in consecutivity, especially in the higher mass range and esp. for Y-ions (a toy measure is sketched after this list)
  • For precursor ion of charge N, should see strongest peaks for fragments of charge <= N-1
  • Unidentified peaks at -18 (water) and -28 (?) may be OK
  • Want to see lots of strong ID'd peaks rising above the level of the non-ID'd peaks
  • One decoy had R at N-term (missed cleavage) and no RK at C-term (miscleavage). Must have been a semi-tryptic search. Such a combination of unconventional cleavages is unusual but not extremely rare. I asked Jimmy if he'd devalue the ID and he said, "No, the semi-tryptic search was done because such cleavage patterns are expected." However, I didn't suggest tossing out the ID, just devaluing it. I think devaluing it might be appropriate.
  • PeptideProphet does not consider X!Tandem's expect score, which is based on the distribution of scores for a particular spectrum vs. entire sequence DB. Expect = # of peptides we expect to see at this [raw] score by random chance. One decoy had an expect=12 -- very high. PeptideProphet only looks at hyperscore (top hit) and next (next best hit) and does not consider the distribution.
  • Low mass portion of spectrum tends to be more messy and less interesting
  • We want big, high mass Y-ions
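
A crude way to quantify the consecutivity red flag (this is my own toy formulation, not Jimmy's criterion or the AMASS formula): given which y-ion positions were matched, report the longest consecutive run and the largest gap.

  # Toy consecutivity summary. matched_y: positions (1..n-1) of identified
  # y-ions; n: peptide length. Not the AMASS formula, just a rough stand-in.
  def consecutivity_summary(matched_y, n):
      matched = set(matched_y)
      longest_run = run = 0
      largest_gap = gap = 0
      for pos in range(1, n):
          if pos in matched:
              run += 1
              gap = 0
              longest_run = max(longest_run, run)
          else:
              gap += 1
              run = 0
              largest_gap = max(largest_gap, gap)
      return {"longest_run": longest_run, "largest_gap": largest_gap}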

I learned the following tangentially:

  • PeptideProphet does make use of LC retention time.
  • We do expect to see strong peaks just to the left of the precursor m/z.
  • Mobile proton theory: Cleavage via CID occurs near any positive charge (proton). When there is a basic C-term (KR) and a basic N-term (N-term always basic), a spare proton is free to roam the length of the peptide because it is equally attracted to both basic ends. This makes cleavage approximately equally likely at all positions. If a peptide lacks that balance, then cleavage will tend to happen closer to any basic residue.
  • Enhanced cleavage N-terminal to proline --> strong Y-ion beginning with that proline
  • SEQUEST already considers consecutiveness of peak IDs in its preliminary scoring
  • Jimmy has tried peak picking by sliding a window across a spectrum, calculating mean/SD of peak heights, to separate signal from noise.
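
A rough sketch of that sliding-window idea; the 100 m/z window and the 2-SD cutoff are my guesses, not Jimmy's actual parameters:

  # Call a peak "signal" if it rises n_sd standard deviations above the mean
  # intensity of its local m/z neighborhood. Window and cutoff are guesses.
  from statistics import mean, stdev

  def pick_signal_peaks(peaks, window=100.0, n_sd=2.0):
      signal = []
      for mz, inten in peaks:
          local = [i for m, i in peaks if abs(m - mz) <= window / 2]
          if len(local) < 3:
              signal.append((mz, inten))  # too few neighbors to judge
          elif inten > mean(local) + n_sd * stdev(local):
              signal.append((mz, inten))
      return signal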

October 13, 2008

States et al. paper (Nature Biotechnology 2005) offers a method for estimating protein ID confidence levels which incorporates protein length. Eric implemented this in ProteinProphet and reports results in analysis.out. This method gives a similar yield to simply tossing out all singletons, but is perhaps more theoretically sound (sometimes Eric thinks it's just more complicated).
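
My reading of the length-based idea (not Eric's actual ProteinProphet code): if false peptide IDs land on proteins roughly in proportion to length, then a protein's expected number of false hits is the overall false-hit count times its share of the database, and the chance of its observed hit count arising purely by chance can be approximated with a Poisson tail. One minus that tail is the "probability" referred to in the plan below.

  # Length-based chance-hit estimate (my sketch of the States et al. idea).
  from math import exp, factorial

  def expected_false_hits(total_false_hits, protein_length, db_length):
      return total_false_hits * protein_length / db_length

  def prob_hits_by_chance(k, lam):
      """P(X >= k) for X ~ Poisson(lam), i.e. chance of k or more false hits."""
      return 1.0 - sum(exp(-lam) * lam**i / factorial(i) for i in range(k))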

Plan:

  • Incorporate Swiss-Prot annotations into PeptideAtlas as a way of learning how to program PeptideAtlas
  • Incorporate PeptideProphet information (e.g. probabilities) into PeptideAtlas. When changing the schema, also insert fields for the States et al. metrics:
    • expected number of false positives per protein
    • likelihood of the actual # of hits occurring by chance ("probability" = 1 - this)
  • Figure out how to get this info into PeptideAtlas
  • Reload yeast PeptideAtlas so that this info is viewable, and draw conclusions from the info.

Talked to Lik Wee this morning -- new postdoc in Martin lab, studying proteomics. His first assignment is to learn the different search engines. I explained my project to him. Analyzing spectra for (a) large unidentified peaks, (b) big gaps in peak IDs, and (c) low average peak intensity sounds like a viable approach for appropriately deprecating some PeptideAtlas IDs. Can I use a machine learning algorithm for this? Ask Nikita.

October 8, 2008

I keep thinking the date must be a little later than it actually is. Like, I thought today should be the 9th. At the same time, it seems not so long ago that 2008 felt like a very new year, and it was hard to imagine it would ever feel otherwise.

My task for the next 11 months here at ISB is to reduce the false positive rate in PeptideAtlas. We want to catalog as many proteins as possible while minimizing inclusion of proteins that aren't really in the sample. Eric Deutsch and I have had several discussions about this, starting at my interview in May of this year. This is a place for me to record what I've learned from Eric, and to record ideas of my own and those I've received from others.

Ideas for reducing the false positive rate in Peptide Atlas

  • Discard singletons (proteins represented by only a single spectrum)
  • Require a much higher probability cutoff for singletons (e.g. 0.99 instead of 0.90)
  • Require a much higher probability cutoff for all protein identifications (e.g. 0.99 instead of 0.90)
  • Set a fixed FDR, say 0.1%, and set probability cutoffs accordingly (rough sketch after this list)
  • Use Henry Lam's SpectraST quality filter
  • Cutoff of 0.99 for all nobs (cryptic incomplete note)
  • Make use of this observation: the decoy estimated FDR is much smaller than that obtained by averaging the probabilities of all (peptide?) identifications and subtracting from 1. Suggests that decoy estimated FDR is too small, or probabilities are too small.
  • Recalculate probabilities for short peptides based on ((# short decoys) / (total hits to short peptides))
  • Do the search engines make use of LC retention times? If not, try.
  • Look at peak intensity. Searches and TPP do not look at this.
  • Implement States et al. (2006) in ProteinProphet. It considers protein length when estimating protein ID confidence. The longer a protein, the more chances there are for the protein to be associated with a false peptide ID. (Better than considering protein length would be # of proteolytic peptides.)
  • See paper by Y. Shen & .... & Richard Smith of PNW National Labs (2008). I didn't understand it after 3 attempts. "... Improved FDRs Using Unique Sequence Tags."
  • Run my data with different kinds of decoys (peptides reversed, proteins reversed, peptides scrambled, different organism), because different kinds of decoys resemble real data in different ways. For example, decoys which are scrambled lose the phobic/philic periodicity of real sequence.
  • Where do arginine/lysine occur in protein structures? Knowing this can help us define properties of real spectra & real tryptic peptides & can help us create better decoys & know which false positive decoys are not worth worrying about because they do not resemble real peptides.
  • Install and use Mayu (previously Pandora). For Peptide Atlas builds that incorporate decoys, it calculates the FDR for each spectrum, peptide, and protein without making use of TPP. It can also calculate FDRs for subpopulations ("local FDR") such as peptides seen only once, or only +2 spectra. Does not calculate probabilities; Eric asked; we're not sure why.
  • Local FDRs should match the global FDR
  • Give SpectraST hits high probability.
  • Use InterProphet. Use (# IDs) / (best prob) as a parameter. (???)
  • Can make ProteinProphet behave better by using a minimum probability for the input peptides. Using spectral counting techniques, get spectrum counts for urine and compute abundances of the various proteins.
  • Add functionality to Peptide Atlas to tag an ID as garbage. (I think such a tag would need to have a userID and be temporarily reversible? Some kind of consensus? Enter reasoning?)
  • Goal: to achieve some particular FDR (e.g. 2%) for the Peptide Atlas serum build and/or to assign probabilities to protein IDs. There are no human plasma decoy searches; they would take a long time. Possible in future, perhaps.
  • The large quantity of data in a Peptide Atlas build allows certain features of the data to emerge. This is what I want to see. For example, it's been observed that if you see a peptide 20 times and the max prob is 1.0, it's probably correct; if the max prob is only 0.95, it's probably wrong. We do not yet assign probabilities to proteins in PA -- only peptides. (If ProteinProphet worked, we could have protein probabilities.) Ning Zhang (was at ISB, now at UW), primary spectral counting author: took all ProteinProphet IDs w/ prob=1.0 -- still got a ton of decoys.
  • Apply AMASS formula for scoring spectra based on continuity.
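
For the fixed-FDR idea and the decoy-vs-probability comparison above, a rough sketch, assuming each PSM is available as a (PeptideProphet probability, is_decoy) pair; the decoy fraction reported here is only a crude stand-in for a proper decoy FDR estimate:

  # Walk down PSMs in order of decreasing probability and keep the largest
  # accepted set whose probability-based FDR (mean of 1 - prob) stays under
  # the target; also report the decoy fraction of that set for comparison.
  def prob_cutoff_for_fdr(psms, target_fdr=0.001):
      ranked = sorted(psms, key=lambda p: p[0], reverse=True)
      best = None
      err_sum = decoys = 0
      for i, (prob, is_decoy) in enumerate(ranked, start=1):
          err_sum += 1.0 - prob
          decoys += is_decoy
          if err_sum / i <= target_fdr:
              best = (prob, err_sum / i, decoys / i)  # cutoff, prob-based FDR, decoy fraction
      return best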

Spectrum features to watch for

  • Big unidentified peaks are bad (possible exception: peak just to left of precursor m/z -- confirm with expert)
  • Consecutive identified peaks are good. Breaks in this are bad.
  • Important identifications should be on big peaks, not small ones. Searches and TPP do not take into account peak intensity.