Software:PeptideSieveRetrainingTutorial


PeptideSieve Retraining and Predictor Combining Tutorial

PeptideSieve Retraining

Reference

Mallick P, Schirle M, Chen SS, Flory MR, Lee H, Martin D, Ranish J, Raught B, Schmitt R, Werner T, Kuster B, Aebersold R. Computational prediction of proteotypic peptides for quantitative proteomics. Nat Biotechnol. 2007 Jan;25(1):125-31.

Getting the PeptideSieve Software

The source code can be downloaded from PeptideSieve.

The binary executable can be downloaded from PeptideSieve.

Files required to run the program: properties.txt and one of the experiment design files (MUDPIT_ESI.txt, MUDPIT_ICAT.txt, PAGE_ESI.txt, or PAGE_MALDI.txt).

Getting LNKnet

Download the LNKnet program from LNKnet.

Installing LNKnet

Unzip the package and install LNKnet on your computer according to the INSTALL.src file. It is a good idea to go through the LNKnet quick start guide. When you perform an action in the graphical user interface, it gives you the option to store a shell script, which can then be edited and run from a shell window.

Training Steps

Selecting Proteins

Use the PeptideAtlas build "Yeast non-ICAT PeptideAtlas 2009-12", and select proteins with a minimum length of 200 AA and a peptide coverage of 60-90%. 451 proteins are selected using these criteria.

Selecting Peptides

Proteotypic peptides: 1439 peptides are selected using the following criteria:

1. observed tryptic peptide;

2. n_sample >= 4 (observed in at least 4 samples);

3. Empirical Proteotypic Score (EPS) >= 0.5, where EPS = Nsamples(peptide)/Nsamples(parent protein);

Non-proteotypic peptides: 1645 peptides are selected

All non-observed peptides for the selected proteins.
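
To make these criteria concrete, here is a minimal Python sketch of the labeling step for the peptides of one selected protein. The observation counts in the example are hypothetical placeholders for values that would come from the PeptideAtlas build.

# Label the in-silico tryptic peptides of one selected protein.
# nsamples maps each peptide to the number of samples it was observed in
# (0 if never observed); protein_nsamples is the count for the parent protein.
def label_peptides(nsamples, protein_nsamples):
    labels = {}
    for pep, n in nsamples.items():
        eps = n / protein_nsamples  # Empirical Proteotypic Score
        if n >= 4 and eps >= 0.5:
            labels[pep] = 1  # proteotypic (positive class)
        elif n == 0:
            labels[pep] = 0  # never observed (negative class)
        # observed peptides below the thresholds go into neither class
    return labels

# Example with made-up counts; the parent protein was seen in 10 samples.
print(label_peptides({"LSEPAELTDAVK": 8, "AGLTFPVGR": 1, "IQDKEGIPPDQQR": 0}, 10))
# {'LSEPAELTDAVK': 1, 'IQDKEGIPPDQQR': 0}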

Selecting Features to Start Feature Selection

Parag started his feature selection using 1010 features (numeric physicochemical property scales for amino acids, peptide length, amino acid counts, and peptide mass). It is too slow to do the feature selection using all of these features, so I chose a subset of features based on the features PeptideSieve finally selected across the four different experiment designs, plus the top 35 features from the ESP predictor. In this way, I selected 98 features to start with. The 98 features include peptide mass, peptide length, the 20 amino acid counts, and 38 numeric physicochemical property scales for amino acids (mean and sum of the scores).

Converting the Peptide List to Property Vector (training file)

In order to do the training, each peptide needs to be represented by a fixed-length vector of real- or discrete-valued features. In this case, each peptide is represented by 98 features. Each amino acid of a given peptide is replaced by a numerical value for each selected property, and each property's values are summed and averaged over the peptide, resulting in a 76-dimensional property vector, plus the 20 amino acid composition counts and the length and mass of the peptide. Finally, a binary output class label is added to each feature vector: 1 (positive) for proteotypic peptides and 0 otherwise.
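
A minimal Python sketch of this conversion follows. The Kyte-Doolittle hydropathy scale stands in for one of the 38 physicochemical scales actually used; the example peptide, its precomputed mass, and the scale choice are illustrative only.

AA = "ACDEFGHIKLMNPQRSTVWY"
# Kyte-Doolittle hydropathy, used here as a stand-in for one of the 38 scales.
HYDRO = dict(zip(AA, [1.8, 2.5, -3.5, -3.5, 2.8, -0.4, -3.2, 4.5, -3.9, 3.8,
                      1.9, -3.5, -1.6, -3.5, -4.5, -0.8, -0.7, 4.2, -0.9, -1.3]))

def feature_vector(peptide, scales, mass):
    vec = []
    for scale in scales:  # 38 scales -> 76 features (sum and mean of each)
        vals = [scale[aa] for aa in peptide]
        vec += [sum(vals), sum(vals) / len(vals)]
    vec += [peptide.count(aa) for aa in AA]  # 20 amino acid counts
    vec += [len(peptide), mass]  # peptide length and mass
    return vec  # with all 38 scales: 76 + 20 + 2 = 98 features

# One training row: feature vector plus the binary class label (1 = proteotypic).
row = feature_vector("LSEPAELTDAVK", [HYDRO], mass=1271.66) + [1]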

Feature Selection

After creating the training file, a description file needs to be created. It should have the same base name as the training file; for example, if you name your training file test.train, then the description file should be test.defaults. The description file contains information about the database type, the number of inputs, the number of outputs, the class labels, and the input feature labels. The last two are optional.

It is stated in the LNKnet user guide that "For many classifiers, classification results are improved when the data has been normalized in some way." So I chose simple normalization to normalize the training input, test.train. You can do this either through the graphical user interface or through the command line. Here is an example command.

#!/bin/csh -ef
# ./test.norm.simple.run
# $path is assumed to be set to the data directory earlier in the session
norm_lnk -pathdata $path \
-finput test.train -fdescribe test.defaults -normalization 1 \
-fparam test.norm.simple -debug 0 -verbose 3 \
|& nn_tee -h test.norm.simple.log
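
For reference, here is a small Python sketch of what the normalization does, assuming that LNKnet's simple normalization (-normalization 1) shifts each input feature to zero mean and scales it to unit standard deviation; check the user guide for the exact definition.

# Per-feature zero-mean, unit-variance normalization of the training rows.
def simple_normalize(rows):
    ncols = len(rows[0])
    means = [sum(r[i] for r in rows) / len(rows) for i in range(ncols)]
    stds = []
    for i in range(ncols):
        var = sum((r[i] - means[i]) ** 2 for r in rows) / len(rows)
        stds.append(var ** 0.5 or 1.0)  # guard against constant features
    return [[(r[i] - means[i]) / stds[i] for i in range(ncols)] for r in rows]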

For feature selection, I use forward and backward searches. These searches add or remove features one at a time based on the increase or decrease in the error rate, measured using cross validation and any classifier. Once the feature selection search has completed, a subset of features can be selected for use in classification. This subset can be the first, and presumably most important, features found. Here is the shell script to run forward and backward feature selection; a sketch of the underlying greedy search follows the script.

#!/bin/csh -ef
# ./test.for_bk.S.run
set loc=`pwd`
# feature selection: MLP with 98 inputs, 25 hidden nodes, and 2 output
# classes, evaluated with 4-fold cross validation; -search_type 2 selects
# the forward-and-backward search described above
mlp_feat_sel -pathexp $loc -ferror X1mlp.err.cv -fparam X1mlp.param \
-pathdata $path \
-finput test.train -fdescribe test.defaults -npatterns 3084 -ninputs 98 \
-normalize -fnorm test.norm.simple -cross_valid 4 \
-fcross_valid vowel.test.cv -random_cv -random -seed 0 \
-priors_npatterns 0 -debug 0 -verbose 3 -verror 0 \
-nodes 98,25,2 -alpha 0.6 -etta 0.1 -etta_change_type 0 -epsilon 0.1 \
-kappa 0.01 -etta_nepochs 0 -decay 0 -tolerance 0.01 -hfunction 0 \
-ofunction 0 -sigmoid_param 1 -cost_func 0 -cost_param 1 -epochs 30 \
-batch 1,1,0 -init_mag 0.1 \
-search_type 2 -search_n 0 -search_verbosity 3 \
-search_fparam test.for_bk.S.param -roc_target 0 -roc_lower 0 -roc_upper 1 \
|& nn_tee -h test.for_bk.S.log
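
For intuition, here is a minimal sketch of the greedy forward half of such a search. LNKnet's real search trains an MLP and scores each candidate subset by cross-validated error; that evaluation is abstracted here as a user-supplied error(features) callback.

# Greedy forward feature search: repeatedly add the single feature that
# lowers the cross-validated error the most; stop when nothing helps.
# The backward pass works the same way in reverse, dropping features.
def forward_search(nfeatures, error, max_features=None):
    selected, remaining = [], set(range(nfeatures))
    best_err = error(selected)  # baseline error with no features
    while remaining and (max_features is None or len(selected) < max_features):
        trial = {f: error(selected + [f]) for f in remaining}
        f, err = min(trial.items(), key=lambda kv: kv[1])
        if err >= best_err:
            break  # no remaining feature improves the error
        selected.append(f)
        remaining.remove(f)
        best_err = err
    return selected, best_err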
Training

After the feature selection, you get a list of selected features, and then it is time for the training.

#!/bin/csh -ef
# ./X1mlp.run
set loc=`pwd`
# train the final MLP (13 inputs, 25 hidden nodes, 2 output classes)
# on the 13 features chosen by the feature selection run above
(time mlp_lnk \
-create -pathexp $loc -ferror X1mlp.err.train -fparam X1mlp.param \
-pathdata $path \
-finput test.train -fdescribe test.defaults -npatterns 3084 -ninputs 13 \
-features 70,3,71,60,8,26,45,97,61,22,46,14,81 -normalize -fnorm test.norm.simple \
-cross_valid 10 -fcross_valid vowel.test.cv -random -seed 0 -priors_npatterns 0 \
-debug 0 -verbose 3 -verror 0 \
-nodes 13,25,2 -alpha 0.6 -etta 0.01 -etta_change_type 0 -epsilon 0.1 \
-kappa 0.01 -etta_nepochs 0 -decay 0 -tolerance 0.01 -hfunction 0 \
-ofunction 0 -sig_param_list 1,1 -sigmoid_param 1 -cost_func 0 -cost_param 1 \
-epochs 30 -batch 1,1,0 -init_mag 0.1 \
)|& nn_tee -h X1mlp.log
echo -n "X1mlp.run" >> $path/LNKnet.note
grep "LAST TRAIN EPOCH" X1mlp.log | tail -1 >> $path/LNKnet.note
echo "current directory:" >> X1mlp.log
echo $loc >> X1mlp.log

As you may see from the command above, 13 features are selected from the 98 input features by the feature selection program and used for the final training. You can get slightly different error rates when you vary the number of epoch steps, the etta value, and the number of internal nodes.

Code Generation Using an LNKnet Parameter File

LNKnet has a filter program for each classification algorithm that generates C subroutines for pattern classification. Each filter program takes an algorithm parameter file as an argument; in this case it is the X1mlp.param file. The program prints a subroutine, classify(), to the UNIX standard output stream. This subroutine can be called from a C program to classify patterns. The command to run the code generation:

#!/bin/csh -ef
# ./X1mlp.c.run
mlp2c -model_file X1mlp.param -suffix X1mlp >! X1mlp.c
Creating a New MUDPIT_ESI.txt File for PeptideSieve

Finally, we have reached our goal of creating a new parameter file for PeptideSieve. In this retraining, we only retrain the parameter file for the experiment type, MUDPIT_ESI.

ESPPredictor

Reference: Vincent A. Fusaro, D. R. Mani, Jill P. Mesirov & Steven A. Carr. Prediction of high-responding peptides for targeted protein assays by mass spectrometry. Nature Biotechnology (2009) 27:190-198.

Classifier: Random forest

How to run the module

There are two ways of running it:

  • Using the GenePattern web service tool hosted by the Broad Institute; there is a detailed instruction page on how to run it.
  • Through the command line:
      • System requirements: R, Matlab, and Java installed.
      • Follow the first two steps of "How to run the module" on the instruction page.
      • Click "export" (on the right-hand side of the "reset" button, between the "properties" and "help" text) to export a zip file, which contains the program source files.
You will need to do a little bit of modification on the ESPPredictor.java file so that it parses the command line parameters correctly, since the class CmdSplitter does not exist. After a simple modification, my local ESPPredictor can run using the following command line. The "zzz" phrase is the separator between the input parameters for the Matlab and R programs.


java -classpath <libdir>/../ ESPPredictor.ESPPredictor <libdir> peptideFeatureSet <input.file> zzz \
<R2.5_HOME> <libdir>/ESP_Predictor.R Predict <libdir>PeptideFeatureSet.csv <libdir>ESP_Predictor_Model_020708.RData


Detectability Predictor

Reference: H. Tang, R. J. Arnold, P. Alves, Z. Xun, D. E. Clemmer, M. V. Novotny, J. P. Reilly, P. Radivojac. A computational approach toward label-free protein quantification using predicted peptide detectability. Bioinformatics (2006) 22(14): e481-e488.

Classifier: 30 two-layer feed-forward neural networks trained using the resilient back propagation algorithm

How to run: There are also two ways of running it:

  • Through the Detectability Predictor web service tool hosted by Indiana University.
  • Through the command line: you need to send a request to hatang@indiana.edu in order to get the standalone version.

APEX

This tool is still under development. If you want more information, please contact Lars.

Combining Predictors

We use the same set of peptides used to retrain PeptideSieve to train the combination of the scores produced by the above four predictors. The steps for this training are simple. The predictor scores are the features for this training, so there are four features for each peptide, and the class label is inherited from the previous training file. The feature selection step is skipped, since we want to include all the predictors. The training step is the same as above.
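
A minimal Python sketch of assembling the combined training file, assuming the four predictors' scores have already been collected into per-peptide dictionaries; the function name and the file layout (features followed by the class label, as in the earlier training file) are assumptions, not part of any of the tools.

# Write one row per peptide: the four predictor scores as features,
# followed by the class label inherited from the PeptideSieve training set.
# labels: peptide -> 0/1; sieve/esp/detect/apex: peptide -> predictor score.
def write_combined_training(path, labels, sieve, esp, detect, apex):
    with open(path, "w") as out:
        for pep, label in labels.items():
            row = [sieve[pep], esp[pep], detect[pep], apex[pep], label]
            out.write(" ".join(str(v) for v in row) + "\n")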
