Processing glycopeptide data
From SPCTools
Revision as of 23:24, 21 April 2009 by Tfarrah
Background
We sometimes want to assay the glycoproteins in a sample. Because they are low-abundance, we first remove non-glycopeptides from the digested sample. N-linked glycosylation occurs at the asparagine (N) of an N[^P][ST] motif (N, followed by anything but P, followed by S or T). However, not all such motifs are glycosylated, and since a peptide may contain more than one such motif, we need to distinguish glycosylated motifs from non-glycosylated ones. Therefore we remove the carbohydrate moiety (something we want to do anyway to reduce sample complexity) using PNGase F, which also converts the N to a D, effectively tagging the formerly glycosylated N. One might then search with an optional modification of N with mass equal to the N-to-D mass difference. But this would also allow N-to-D substitution at Ns that are not part of a glyco motif. So we substitute J or B for the Ns in all the glyco motifs, and allow an optional modification at J (or B).
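To make the motif definition concrete, here is a minimal Python sketch of finding and tagging glyco-site Ns. It is only an illustration, not one of the scripts used for the real analysis; the function names and the choice of J as the default placeholder are assumptions. The lookahead keeps the match to the N alone, so overlapping motifs such as NNST are both found.

```python
import re

# N, then any residue except P, then S or T. The lookahead means only the N
# itself is consumed, so overlapping motifs (e.g. "NNST") are all reported.
GLYCO_MOTIF = re.compile(r"N(?=[^P][ST])")


def glyco_sites(sequence):
    """Return the 0-based positions of every N that starts an N[^P][ST] motif."""
    return [m.start() for m in GLYCO_MOTIF.finditer(sequence)]


def tag_glyco_sites(sequence, placeholder="J"):
    """Replace the N of every glyco motif with a placeholder residue.

    The placeholder (J here) is an assumption; B could be passed instead.
    """
    return GLYCO_MOTIF.sub(placeholder, sequence)
```

For example, in MNGSANPTKNNST the N of NGS and both Ns of NNST are tagged, while the N of NPT is left alone.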
Outline of analysis procedure
- Create a target database where the N of every glyco site is replaced by a J or B, and where sequence IDs are prepended with NXST_. The choice of J or B depends on downstream software limitations and on whether the original sequence database already contains Bs (B is a standard but uncommon abbreviation for "N or D, not distinguishable"). I think we use J.
- Search the glyco data against this database, allowing an optional modification of 0.9840 Da at J/B.
- In the search results (pepXML files), change J/B back to N and remove the NXST_ prefix.
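The first and last steps above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the actual conversion scripts: the function names are hypothetical, J is assumed as the placeholder, each sequence is assumed to sit on a single line, and the back-conversion works on plain peptide and protein ID strings rather than on a real pepXML file.

```python
import re

GLYCO_MOTIF = re.compile(r"N(?=[^P][ST])")  # N of an N[^P][ST] motif
PREFIX = "NXST_"       # ID prefix named in the outline
PLACEHOLDER = "J"      # assumption: J rather than B


def convert_fasta(text):
    """Tag glyco-site Ns and prefix the sequence IDs in FASTA text.

    Assumes each sequence is on one line, so a motif never straddles
    a line break.
    """
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            out.append(">" + PREFIX + line[1:])
        else:
            out.append(GLYCO_MOTIF.sub(PLACEHOLDER, line))
    return "\n".join(out) + "\n"


def change_back(peptide, protein_id):
    """Undo the tagging for one search hit (peptide string and protein ID).

    Only safe if the original database contained no placeholder residues.
    """
    peptide = peptide.replace(PLACEHOLDER, "N")
    if protein_id.startswith(PREFIX):
        protein_id = protein_id[len(PREFIX):]
    return peptide, protein_id
```

Because the back-conversion is a blanket replacement, it cannot tell a placeholder that was introduced from one that was already in the original database, which is one reason the choice between J and B matters.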
Notes on how to analyze
You can look here for a search that was done with Sequest and X!Tandem. The X!Tandem parameters are the same as usual, except that the target database is ipi.HUMAN.v3.38_forwdecoy_nxst.fasta:
/regis/sbeams/archive/jwatts/HsGlycoPlasma35indiv/HsGlycoPlasma35indiv
Pertinent scripts and other useful files are in ~tfarrah/alt_nxst. They are taken out of context, so there might be some unforeseen issues. If they don't work out of the box, just let me know and I'll help you troubleshoot.
The basic method entails searching against a modified database in which every NXS/T is replaced by BXS/T (except for NPS/T and a few other exceptions). B stands for a D that has been substituted in. We run the search with a static modification on B in sequest.params (to give it the same mass as a D), then back-convert the results (substituting Ns for all Bs?) and process as normal (including refresh-parsing against the original database). We modified the method a little to use 'B' (the average of D and N) because our version of Sequest was limited in its ability to accept non-standard amino acids. If you are running X!Tandem, you might want to consider whether there is a better way to do this. I've outlined the files in the archive below; let me know if you have questions.
Atwood-York_GlycopeptideSearchStrategy.pdf - original paper this is based on
nxst_conversion_recipe.txt - README file for this process. Note the perl -pi -e step; it must be done.
sequest.params.ft - modified sequest.params file; the salient line is shown below:
add_B_avg_NandD = 0.4920 ; added to B - avg. 114.5962, mono. 114.53494
make_nxst_db.pl - script to convert the database; assumes each sequence is all on one line, otherwise the regex might need tweaking.
batch_convert.sh - script to translate a batch of xml files.
changeback.pl - perl script called by the batch script above to back-substitute files.
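As a sanity check on the numbers in the sequest.params.ft line, the arithmetic can be reproduced in a few lines of Python. The monoisotopic residue masses below are standard values that I am supplying, not values taken from the files above, and the variable names are just for illustration.

```python
# Monoisotopic residue masses in Da (standard values, supplied as an assumption).
MONO_N = 114.04293  # asparagine residue
MONO_D = 115.02694  # aspartic acid residue

# B treated as the average of N and D, as in the params comment.
mono_b = (MONO_N + MONO_D) / 2

# Static modification that brings B up to the mass of D.
static_mod_b = MONO_D - mono_b

# Optional modification for an N -> D conversion when the placeholder
# residue keeps the mass of N.
optional_mod = MONO_D - MONO_N

print(f"B (mono): {mono_b:.4f}")
print(f"static mod on B: {static_mod_b:.4f}")
print(f"optional N->D mod: {optional_mod:.4f}")
```

The static value agrees with the 0.4920 added to B in the params line, and the N-to-D difference agrees with the 0.9840 used as the optional modification in the outline.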
Henry Lam's comments:
If you're using Tandem, you don't need the modified database. Tandem has a motif-specific modification option, such that you can specify that only the N in an NXS/T motif can be modified. The N's in an NXS/T motif can be unmodified too, and this method theoretically can capture those. In reality, however, the sequence search engine is never quite good enough to tell the unmodified and modified versions apart, so we're often faced with the situation where we have to ASSUME all N's in NXS/T among the found IDs must be modified, hence the modified database idea. Anyway, I think I'm confusing you now.
If you're using SpectraST, you probably want to use a library built by us, not the one built by NIST. NIST libraries don't have glycopeptides, and the overlap between the non-glyco-captured proteome and the glyco-captured proteome is surprisingly small. You don't have to do anything special with search options once you find the right library to use.