Processing glycopeptide data

From SPCTools

Revision as of 22:54, 20 February 2009
Tfarrah (Talk | contribs)

Current revision

Background

We sometimes want to assay the glycoproteins in a sample. Because they are low-abundance, we first remove non-glycopeptides from the digested sample. N-linked glycosylation occurs at the asparagine (N) of an N[^P][ST] motif (N, followed by anything but P, followed by S or T). However, not all such motifs are glycosylated, and since a peptide may contain more than one such motif, we need to distinguish glycosylated motifs from non-glycosylated ones. We therefore remove the carbohydrate moiety (something we want to do anyway to reduce sample complexity) using PNGase F, which also converts the N to a D, effectively tagging the formerly glycosylated N. One might then search with an optional modification of N with mass equal to the N/D mass difference, but this would allow N/D substitution at Ns that are not part of a glyco motif. So instead we substitute J or B for the Ns in all the glyco motifs, and allow an optional modification at J (or B).
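To make the motif definition concrete, here is a minimal sketch that lists the positions of the candidate N sites in a peptide. The function name glyco_sites is hypothetical and is not one of the scripts described on this page; it only illustrates the N[^P][ST] rule, including the case where two motifs overlap.

```shell
# Hypothetical helper (not part of the alt_nxst scripts): print the 1-based
# positions of every N that starts an N[^P][ST] motif in a peptide string.
glyco_sites() {
    printf '%s\n' "$1" | awk '{
        out = ""
        for (i = 1; i <= length($0) - 2; i++) {
            # N, then anything but P, then S or T
            if (substr($0, i, 1) == "N" &&
                substr($0, i + 1, 1) != "P" &&
                substr($0, i + 2, 1) ~ /^[ST]$/) {
                out = out (out == "" ? "" : " ") i
            }
        }
        print out
    }'
}

glyco_sites "NGSANPTNNST"   # prints: 1 8 9
```

In the example, the N at position 5 is skipped because it is followed by P, while the Ns at positions 8 and 9 both qualify even though their motifs overlap.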

Outline of analysis procedure

  1. Create a target database in which the N of every glyco site is swapped for a J or B, and in which sequence IDs are prepended with NXST_. The choice of letter depends on downstream software limitations and on whether the original sequence DB already contains Bs (B is a standard but uncommon abbreviation for "N or D, not distinguishable"). I think we use J.
  2. Search the glyco data against this DB, allowing an optional modification of 0.9840 at J/B.
  3. In the search results (pepXML files), change J/B back to N and remove the NXST_ prefix.
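Step 1 of the outline can be sketched roughly as follows. This is not make_nxst_db.pl; the function name tag_nxst_fasta is hypothetical, J is assumed as the placeholder letter, and each sequence is assumed to sit on a single line (a motif split across two lines would be missed).

```shell
# Hypothetical stand-in for step 1 (NOT the real make_nxst_db.pl).
# Assumes J as the placeholder and one sequence per line.
tag_nxst_fasta() {
    sed -E \
        -e 's/^>/>NXST_/' \
        -e 't' \
        -e ':a' \
        -e 's/N([^P][ST])/J\1/' \
        -e 'ta'
}

# Example: header gets the NXST_ prefix, motif Ns become J, NPT is left alone.
printf '>P1\nANGSNPTNNST\n' | tag_nxst_fasta
```

The substitution is repeated in a loop rather than applied once with a global flag, so that overlapping motifs such as NNST have both Ns replaced. Header lines are only prefixed and are never scanned for motifs.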

Notes on how to analyze

You can look here for a search that was done with Sequest and X!Tandem. The X!Tandem params are the same as usual except that the target database is ipi.HUMAN.v3.38_forwdecoy_nxst.fasta:

/regis/sbeams/archive/jwatts/HsGlycoPlasma35indiv/HsGlycoPlasma35indiv

Pertinent scripts and other useful files are in ~tfarrah/alt_nxst. They are out of context, so there might be some unforeseen issues. If they don't work out of the box, just let me know and I'll help you troubleshoot.

The basic method entails searching against a modified db with all NXS/T replaced by BXS/T (except for NPS/T and a few other exceptions). B stands for a D that's been substituted in. We then run the search with a static modification on B (to make it the same weight as a D) in the sequest.params, back-convert the results (substituting Ns for all Bs?), and process as normal (including refresh-parsing against the original db). We modified the method a little to use 'B' (avg of D and N) because our version of Sequest was limited in its ability to accept non-standard amino acids. If you are running X!Tandem, you might want to consider whether there is a better way to do this. I've outlined the files in the archive below; let me know if you have questions.

Atwood-York_GlycopeptideSearchStrategy.pdf - original paper this is based on

nxst_conversion_recipe.txt - README file for this process. Note perl -pi -e step, this must be done.

# set path to include scripts
# bash syntax
export PATH=${PATH}:/users/tfarrah/alt_nxst/ # db/mimas, dione, ...
export PATH=${PATH}:/proteomics/tfarrah/bin/alt_nxst/ # regis
# csh syntax
setenv PATH ${PATH}:/users/tfarrah/alt_nxst/ #db/mimas, dione, ...
setenv PATH ${PATH}:/proteomics/tfarrah/bin/alt_nxst/ #regis

# create search database with NxS/T changed to BxS/T.  This will
# make nxst and 'reverse' versions of the source database, called
# myfasta_nxst_db.fsa  myfasta_reverse_db.fsa for the example shown.
make_nxst_db.pl myfasta.fsa

# Edit sequest.params file, which has static mass mod for B.
# Edit the database setting in sequest.params to point at the modified db

# Run sequest
runsequest *.mzXML

# Substitute search settings
ls *.pep.xml | xargs  batch_convert.sh

# This step is still manual, substitute db name in pepXML files, as well as
# copy of sequest.params file
cp sequest.params sequest.params.run
perl -pi -e 's/\Qipi.MOUSE.v3.24.fasta-mix7prot_nxst_db.fsa\E/ipi.MOUSE.v3.24.fasta-mix7prot.fsa/' *.xml sequest.params

# Run xinteract


sequest.params.ft - modified sequest.params file, the salient line is shown below:

add_B_avg_NandD = 0.4920               ; added to B - avg. 114.5962, mono. 114.53494
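The numbers on that line can be checked with a little arithmetic. The sketch below uses only values quoted on this page (the monoisotopic mass of B from the params comment and the 0.9840 N/D difference from the outline); the function name b_static_mod is hypothetical. Since B is taken as the mean of N and D, half of the N/D difference is what lifts it to the mass of D.

```shell
# Values are copied from this page; nothing is read from sequest.params.
# Usage: b_static_mod <mass of B> <N/D mass difference>
b_static_mod() {
    awk -v b="$1" -v d="$2" 'BEGIN {
        # half the N/D difference, then B plus that amount
        printf "%.4f %.4f\n", d / 2, b + d / 2
    }'
}

b_static_mod 114.53494 0.9840   # prints: 0.4920 115.0269
```

The first number is the static modification from the params line, and the second is the resulting mass of B, which agrees with the monoisotopic residue mass of D (about 115.0269).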

make_nxst_db.pl - script to convert database; assumes the sequence is all on one line, otherwise the regex might need tweaking.

batch_convert.sh - script to translate batch of xml files.

changeback.pl - perl script called by batch script above to back-substitute files.
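The back-substitution might look something like the sketch below. changeback.pl itself is not shown on this page, so this is only a guess at the idea, and the real script may well handle more cases (modification annotations, for instance). It assumes B is the placeholder, that peptide sequences appear in peptide="..." attributes made of uppercase letters only, and that protein names carry the NXST_ prefix; the function name changeback_sketch is hypothetical.

```shell
# Hypothetical sketch only; NOT the real changeback.pl.
# Reads pepXML-like text on stdin and writes the back-converted text to stdout.
changeback_sketch() {
    sed -E \
        -e 's/protein="NXST_/protein="/g' \
        -e ':a' \
        -e 's/( peptide="[A-Z]*)B/\1N/' \
        -e 'ta'
}

# Example on a single simplified line:
printf '%s\n' '<search_hit peptide="ABGSBK" protein="NXST_P1"/>' | changeback_sketch
```

Restricting the replacement to the peptide attribute avoids touching a B that happens to occur elsewhere in the file, such as in a file name. Note that this turns every B in a peptide into N, so it is only safe if the original database contained no Bs.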


Terry's notes on using B-swap technique with X!Tandem

# link scripts to ~/bin, which is on my path
cd bin; ln -s ../alt_nxst/*{pl,sh} .

# create search database with NxS/T changed to BxS/T.  This will
# make a nxst version of the source database, called
# myfasta_nxst.fasta for the example shown.
# ~tfarrah/alt_nxst version accommodates sequences spanning multiple lines
make_nxst_db.pl myfasta.fasta

# Go to the directory where you'll search

# Edit tandem.params file to add a modification mass of 0.984016 (monoisotopic)
#  or 0.9848 (average) for B, which X!Tandem thinks is same mass as N.
# Edit tandem.params to point at modified db
vim tandem.params

# Make a copy of runtandemsearch and edit it so that it doesn't
# run finishtandemsearch
cp /regis/sbeams/bin/tandem/runtandemsearch .
vim runtandemsearch

# Run tandem.
runtandemsearch *.mzXML

# Substitute search settings into pepXML formatted search results
ls *.pep.xml | xargs  batch_convert_glyco_pepXML.sh
rm *.pep.xml.bak

# Manually substitute db name in pepXML files, as well as
# copy of tandem.params file
cp tandem.params tandem.params.run
sed -i -e 's/forwdecoy_nxst/forwdecoy/g' *.pep.xml tandem.params

# Run xinteract.
/regis/sbeams/bin/tandem/finishtandemsearch *.mzXML >& zztandempostprocessing.log


Henry Lam's comments:

If you're using Tandem, you don't need the modified database. Tandem has a motif-specific modification option, such that you can specify that only the N in an NXS/T motif can be modified. The N's in an NXS/T motif can be unmodified too, and this method theoretically can capture those. In reality, however, the sequence search engine is never quite good enough to tell between the unmodified and modified versions, so we're often faced with the situation where we had to ASSUME all N's in NXS/T among the found IDs must be modified, hence the modified database idea. Anyway, I think I'm confusing you now.

If you're using SpectraST, you probably want to use a library built by us, not the one built by NIST. NIST libraries don't have glyco peptides, and the overlap between non-glyco-captured proteome and glyco-captured proteome is surprisingly small. You don't have to do anything special with search options, once you find the right library to use.
