PABST peptide examples

From SPCTools

Revision as of 21:04, 16 September 2009 by Tfarrah; current revision by Tfarrah.

PABST is a tool that helps users select the best candidate peptides for mass spectrometric identification of a set of proteins. It merges several data sources and scores the results using user-tunable parameters. The current default parameter weightings are shown below; later sections link to example peptides, with comments intended to aid development of the selection algorithm.

Run the script with the -h flag, or with no arguments at all, to see the usage statement below. The only required parameter is build_id, which tells the script which atlas build to export peptides from. The default config file is shown below the usage statement; these values are used unless a user-defined config file is supplied. To get a template config file, run the script with the -d flag; an example config file is written to the current working directory (CWD) and can then be edited as desired.

The config file specifies various sequence attributes, each with an associated score. Each peptide sequence is evaluated against every attribute, and a composite score is computed by multiplying together the scores of all attributes that match. Each peptide has two possible sources: empirical data, from having been observed in the specified atlas build, and theoretical data, from the electronic (in silico) analysis of the reference database. Scores less than 1 penalize matching sequences; scores greater than 1 reward them. For example, if a sequence contains both a proline and a serine, and the score for each is set to 0.5, the final score is multiplied by 0.5 * 0.5, or 0.25. If the bonus_obs param is set to 2, the empirical (observed) suitability score is multiplied by 2.
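
As a rough illustration of the multiplicative scoring described above, here is a minimal Python sketch. It is not the code used by fetch_best_peptides.pl: the function name composite_score, the example peptide, and the rule that a single-letter attribute "matches" whenever that residue occurs anywhere in the sequence are all assumptions made for this example.

```python
from math import prod

def composite_score(sequence, penalties):
    """Multiply together the scores of all single-residue attributes
    present in the sequence (hypothetical matching rule: the residue
    occurs at least once). Returns 1 when nothing matches."""
    return prod(score for residue, score in penalties.items()
                if residue in sequence)

# Example from the text: P and S each scored 0.5.
penalties = {"P": 0.5, "S": 0.5}
factor = composite_score("LPSAK", penalties)   # 0.5 * 0.5

# Assumed handling of the observed-peptide bonus: the empirical
# suitability score is simply multiplied by the bonus value.
bonus_obs = 2
empirical_score = 0.8                # made-up starting score
boosted = empirical_score * bonus_obs

print(factor, boosted)
```

A sequence that contains neither residue keeps a factor of 1, so attributes scored at 1 in the config have no effect on the result.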

usage: /net/dblocal/www/html/sbeams/lib/scripts/PeptideAtlas/fetch_best_peptides.pl -a build_id [ -t outfile -n obs_cutoff -p proteins_file -v -b .3 ]

  -a, --atlas_build    one or more atlas build ids to be queried for observed
                       peptides, will be used in order provided.  Can be
                       specified as a numeric id ( -a 123 -a 189 ) or as a composite
                       id:weight ( -a 123:3 ).  Scores from EPS and ESS will
                       be multiplied by given weight, defaults to 1.
  -c, --config         Config file defining penalties for various sequence
  -d, --default_config prints an example config file with defaults in CWD,
                       named best_peptide.conf, will not overwrite existing
                       file.  Exits after printing.
  -p, --protein_file   file of protein names, one per line.  Should match
                       biosequence.biosequence_name
  -s, --show_builds    Print info about builds in db
      --build_name     Regular expression to limit return values from
                       show builds, will be used in LIKE clause, with wildcard
                       characters added automatically.
  -b, --bioseq_set     Explicitly defined biosequence set.  If not provided,
                       the BSS defined by the first atlas_build specified will
                       be used.
  -t, --tsv_file       print output to specified file rather than stdout
      --n_peptides     number of peptides to return per protein
      --name_prefix    prefix constraint on biosequences, allows subset of
                       bioseqs to be selected.
  -o, --obs_min        Minimum n_obs to consider for observed peptides
  -h, --help           Print usage
  -v, --verbose        Verbose output, prints progress
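
The composite id:weight form accepted by -a can be illustrated with a short Python sketch. This is only a guess at the interpretation described in the usage text (weight defaults to 1 when omitted); the function name parse_build_spec is hypothetical and is not part of the script.

```python
def parse_build_spec(spec):
    """Split an '-a' value such as '123' or '123:3' into
    (build_id, weight). The weight defaults to 1.0 when omitted."""
    build_id, _, weight = spec.partition(":")
    return int(build_id), float(weight) if weight else 1.0

# Builds are kept in the order given, as the usage text describes.
specs = ["123:3", "189"]
builds = [parse_build_spec(s) for s in specs]
print(builds)
```

Under this reading, scores drawn from build 123 would be multiplied by 3, while scores from build 189 would be left unchanged.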


Default config file:

4H      1       # Avoid 4 straight hydrophobic residues
C       0.95    # Avoid C
D       1       # Slightly penalize D or S in general?
DG      1       # Avoid dipeptide DG
DP      1       # Avoid dipeptide DP
M       0.95    # Avoid M
NG      1       # Avoid dipeptide NG
NxST    1       # Penalize peptides which lack NxST motif
P       0.95    # Avoid P
QG      1       # Avoid dipeptide QG
S       1       # Slightly penalize D or S in general?
W       1       # Avoid W
Xc      1       # Avoid any C-terminal peptide
max_l   25      # Maximum length for peptide
max_p   0.2     # Penalty for peptides over max length
min_l   7       # Minimum length for peptide
min_p   0.2     # Penalty for peptides under min length
nE      1       # Avoid N-terminal E
nGPG    1       # Avoid nxyG where x or y is P or G
nM      1       # Avoid N-terminal M
nQ      1       # Avoid N-terminal Q
nX      1       # Avoid any N-terminal peptide
nxxG    1       # Avoid nxxG
obs     2       # Bonus for observed peptides, usually > 1
ssr_p   0.5     # Penalty for very high or low hydrophobicity
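
The config format above (a key, a numeric value, and a trailing # comment) is simple to read programmatically. The following Python sketch parses it and applies the length parameters; the function names are hypothetical, and the boundary handling (a penalty applies only when the length is strictly greater than max_l or strictly less than min_l) is an assumption, not confirmed behavior of the script.

```python
SAMPLE = """\
C       0.95    # Avoid C
max_l   25      # Maximum length for peptide
max_p   0.2     # Penalty for peptides over max length
min_l   7       # Minimum length for peptide
min_p   0.2     # Penalty for peptides under min length
"""

def parse_config(text):
    """Return {key: value} from 'key value # comment' lines."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop the comment
        if not line:
            continue
        key, value = line.split()
        params[key] = float(value)
    return params

def length_factor(peptide, params):
    """Assumed rule: multiply by max_p if the peptide is longer than
    max_l, and by min_p if it is shorter than min_l."""
    factor = 1.0
    if len(peptide) > params["max_l"]:
        factor *= params["max_p"]
    if len(peptide) < params["min_l"]:
        factor *= params["min_p"]
    return factor

cfg = parse_config(SAMPLE)
print(length_factor("ACDEFG", cfg))    # 6 residues, below min_l
print(length_factor("ACDEFGHK", cfg))  # 8 residues, within range
```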


Usage notes

Single-protein examples to illustrate the effects of the parameters

On each of the pages below, scroll down to "PABST best peptides". You will be able to adjust the PABST settings and see how that changes the results. You will need to log in to SBEAMS.


https://db.systemsbiology.net/devDC/sbeams/cgi/PeptideAtlas/GetProtein?atlas_build_id=195&protein_name=YAL003W&action=QUERY


https://db.systemsbiology.net/devDC/sbeams/cgi/PeptideAtlas/GetProtein?atlas_build_id=195&protein_name=YKL098W&action=QUERY



The following short-URL examples are temporarily decommissioned:


Protein: ALCAM, moderate number of observations https://db.systemsbiology.net/devDC/sbeams/cgi/shortURL?key=c87rxtje

Protein with tons of observed peptides, lots of them NT or MC. https://db.systemsbiology.net/devDC/sbeams/cgi/shortURL?key=xsp03v1h

Protein with many fewer observations https://db.systemsbiology.net/devDC/sbeams/cgi/shortURL?key=h5bwnrt2

Protein with moderate number of obs, mixed MGL/SGL https://db.systemsbiology.net/devDC/sbeams/cgi/shortURL?key=s2kvwg9r


Looking at empirical or theoretical peptides only

If you want just empirically observed peptides, filter for lines that do not have "na" in the empirical_proteotypic_score and suitability_score columns:

./fetch_best_peptides.pl --atlas_build 162 --bioseq_set 33 | awk '{if ($5!="na" && $6!="na" ) print}'

If you want just theoretical peptides (an in silico digest of an Atlas proteome), filter for lines that do not have "na" in the predicted_suitability_score column:

./fetch_best_peptides.pl --atlas_build 162 --bioseq_set 33 | awk '{ if ($7!="na") print }'

Of course, some peptides are both theoretical and empirically observed.
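
For anyone post-processing the output outside of awk, the same two filters can be expressed in Python. This sketch mirrors the awk commands above: it splits on whitespace and tests fields 5, 6 and 7 (1-based). The sample lines and the contents of the first four columns are invented for illustration; only the column positions come from the awk examples.

```python
def field(fields, n):
    """Return the n-th (1-based) field, or '' if the line is short,
    matching awk's behavior for missing fields."""
    return fields[n - 1] if len(fields) >= n else ""

def classify(line):
    """Return (is_empirical, is_theoretical) for one output line."""
    f = line.split()
    empirical = field(f, 5) != "na" and field(f, 6) != "na"
    theoretical = field(f, 7) != "na"
    return empirical, theoretical

# Made-up example lines; real output has its own column contents.
lines = [
    "prot1\tK\tPEPTIDER\tA\t0.9\t1.2\tna",    # observed only
    "prot1\tR\tSEQVENCEK\tL\tna\tna\t0.7",    # predicted only
    "prot2\tK\tANQTHERK\tG\t0.8\t1.1\t0.6",   # both
]
for line in lines:
    print(classify(line))
```

A line for which both values are true corresponds to a peptide that is both theoretical and empirically observed.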
