Software:PeptideSieveRetrainingTurtorial

PeptideSieve Retraining:
 1. Get the PeptideSieve program:
    Source code: http://proteowizard.svn.sourceforge.net/viewvc/proteowizard/trunk/pwiz/pwiz_aux/sfcap/peptideSieve/
    Binary executable: http://proteowizard.svn.sourceforge.net/viewvc/proteowizard/trunk/pwiz/pwiz_aux/sfcap/peptideSieve/bin/
    Files required to run the program: properties.txt, MUDPIT_ESI.txt, MUDPIT_ICAT.txt, PAGE_ESI.txt, PAGE_MALDI.txt

 2. Get the training program: download LNKnet from http://www.ll.mit.edu/mission/communications/ist/lnknet/index.html

 3. Install LNKnet:
    Unzip the package and install LNKnet on your computer following the instructions in the INSTALL.src file.
    It is a good idea to work through the LNKnet quick start guide: http://www.ll.mit.edu/mission/communications/ist/lnknet/quickstart.pdf
    Whenever you perform an action in the graphical interface, a C shell script is written to the working directory; running that script repeats
    exactly what you just did.

 4. Training steps:

    a. Choose proteins: from the build "Yeast non-ICAT PeptideAtlas 2009-12", choose proteins with a minimum length of 200 AAs and
       peptide coverage of 60-90%.
       451 proteins were selected.
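
       A minimal sketch of this filtering step, assuming the build has been exported to a tab-delimited file named
       yeast_build_proteins.tsv with hypothetical columns protein_accession, protein_length and percent_coverage (the real
       export format and column names will differ):

       #!/usr/bin/env python
       # Hypothetical filter for step 4a: keep proteins >= 200 AAs with 60-90% peptide coverage.
       import csv

       selected = []
       with open("yeast_build_proteins.tsv") as handle:
           for row in csv.DictReader(handle, delimiter="\t"):
               length = int(row["protein_length"])
               coverage = float(row["percent_coverage"])
               if length >= 200 and 60.0 <= coverage <= 90.0:
                   selected.append(row["protein_accession"])

       print("%d proteins selected" % len(selected))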

    b. Choose peptides:
       Proteotypic peptides: observed tryptic peptides with
          n_sample >= 4 (observed in at least 4 samples) and
          Empirical Proteotypic Score (EPS) >= 0.5, where EPS = Nsamples(peptide) / Nsamples(parent protein).
          1439 peptides were selected.
       Non-proteotypic peptides: all unobserved tryptic peptides from the selected proteins.
          1645 peptides were selected.
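
       A minimal sketch of the peptide split, assuming per-peptide and per-protein sample counts have already been collected into
       Python dictionaries (all names here are illustrative, not part of any existing tool):

       # n_samples_peptide: {peptide: number of samples the peptide was observed in}
       # n_samples_protein: {protein: number of samples the parent protein was observed in}
       # peptide_to_protein: {peptide: parent protein}

       def empirical_proteotypic_score(peptide, peptide_to_protein, n_samples_peptide, n_samples_protein):
           """EPS = Nsamples(peptide) / Nsamples(parent protein)."""
           parent = peptide_to_protein[peptide]
           return float(n_samples_peptide.get(peptide, 0)) / n_samples_protein[parent]

       def split_peptides(all_tryptic_peptides, peptide_to_protein, n_samples_peptide, n_samples_protein):
           proteotypic, non_proteotypic = [], []
           for peptide in all_tryptic_peptides:
               observed = n_samples_peptide.get(peptide, 0)
               if observed == 0:
                   non_proteotypic.append(peptide)   # never observed -> negative class
               elif observed >= 4 and empirical_proteotypic_score(
                       peptide, peptide_to_protein, n_samples_peptide, n_samples_protein) >= 0.5:
                   proteotypic.append(peptide)       # observed often enough -> positive class
           return proteotypic, non_proteotypic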

    c. Choose features to start with: Parag started his feature selection from about 1100 features (numeric physicochemical property
       scales for the amino acids, peptide length, amino acid counts, and peptide mass). Feature selection with that many features is
       too slow, so I instead chose a starting set based on the feature list finally selected by PeptideSieve plus the top 35 features
       from the ESP predictor. This gives 98 starting features: peptide mass, peptide length, the 20 amino acid counts, and 38 numeric
       physicochemical property scales for the amino acids (both the mean and the sum of the per-residue scores, i.e. 76 scale-derived
       features).

    d. Convert the peptide list to property vectors (the training file):
       To do the training, each peptide needs to be represented by a fixed-length vector of real- or discrete-valued features.
       In this case each peptide is represented by 98 features: each amino acid of a given peptide was replaced by its numerical value
       for each selected property, and the property values were summed and averaged over the peptide, resulting in a 76-dimensional
       property vector, plus the 20 amino acid composition counts and the length and mass of the peptide. Finally, a binary output
       class label was added to each feature vector: 1 (positive) for proteotypic peptides and 0 otherwise.
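
       A minimal sketch of building one such feature vector, assuming the 38 property scales are available as dictionaries mapping
       each amino acid to a numeric value (the scales and the residue masses below are placeholders, not the actual values used):

       AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
       RESIDUE_MASS = {aa: 110.0 for aa in AMINO_ACIDS}   # placeholder residue masses, illustrative only
       WATER_MASS = 18.011

       def peptide_features(peptide, property_scales, is_proteotypic):
           """property_scales: list of 38 dicts, each mapping an amino acid letter to a numeric value."""
           features = []
           # 76 scale-derived features: sum and mean of each of the 38 property scales.
           for scale in property_scales:
               values = [scale[aa] for aa in peptide]
               features.append(sum(values))
               features.append(sum(values) / len(values))
           # 20 amino acid composition counts.
           features.extend(peptide.count(aa) for aa in AMINO_ACIDS)
           # Peptide length and (approximate) mass.
           features.append(len(peptide))
           features.append(sum(RESIDUE_MASS[aa] for aa in peptide) + WATER_MASS)
           # Binary class label: 1 for proteotypic, 0 otherwise.
           return features, (1 if is_proteotypic else 0)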

    e. Feature selection: after creating the training file, a description file needs to be created with the same basename as the
       training file. For example, if you name your training file test.train, then the description file should be test.defaults.
       The description file contains information about the database type, the number of inputs, the number of outputs, the class
       labels, and the input feature labels; the last two are optional.

       The LNKnet user guide states that "For many classifiers, classification results are improved when the data has been
       normalized in some way." So I chose simple normalization for the training input, test.train. You can do this either through
       the graphical interface or from the command line. Here is an example:

       #!/bin/csh -ef
       # ./test.norm.simple.run
       # $path should point to the data directory that holds test.train and test.defaults
       norm_lnk -pathdata $path \
        -finput test.train -fdescribe test.defaults -normalization 1\
        -fparam test.norm.simple -debug 0 -verbose 3 \
        |& nn_tee -h test.norm.simple.log
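
       For reference, simple normalization is generally understood to scale each input feature to zero mean and unit variance
       (norm_lnk writes the actual statistics to test.norm.simple); a rough illustration of the idea:

       def simple_normalize(patterns):
           """patterns: list of equal-length feature vectors (lists of floats)."""
           n = len(patterns)
           dim = len(patterns[0])
           means = [sum(p[i] for p in patterns) / n for i in range(dim)]
           stds = []
           for i in range(dim):
               var = sum((p[i] - means[i]) ** 2 for p in patterns) / n
               stds.append(var ** 0.5 if var > 0 else 1.0)   # avoid division by zero
           return [[(p[i] - means[i]) / stds[i] for i in range(dim)] for p in patterns]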

       For feature selection I use forward and backward searches. These searches select features one at a time based on the increase
       or decrease in the error rate, measured using cross-validation with any classifier. Once the feature selection search has
       completed, a subset of features can be selected for use in classification. This subset can be the first, and presumably most
       important, features. Here is the shell script that runs the forward and backward feature selection:

       #!/bin/csh -ef
       # ./test.for_bk.S.run
       set loc=`pwd`

       # feature selection: forward/backward search over the 98 inputs with an MLP (98-25-2 nodes) and 4-fold cross-validation
       mlp_feat_sel -pathexp $loc -ferror X1mlp.err.cv -fparam X1mlp.param\
        -pathdata $path\
        -finput test.train -fdescribe test.defaults -npatterns 3084 -ninputs 98\
        -normalize -fnorm test.norm.simple -cross_valid 4\
        -fcross_valid vowel.test.cv -random_cv -random -seed 0\
        -priors_npatterns 0 -debug 0 -verbose 3 -verror 0 \
        -nodes 98,25,2 -alpha 0.6 -etta 0.1 -etta_change_type 0 -epsilon 0.1\
        -kappa 0.01 -etta_nepochs 0 -decay 0 -tolerance 0.01 -hfunction 0\
        -ofunction 0 -sigmoid_param 1 -cost_func 0 -cost_param 1 -epochs 30\
        -batch 1,1,0 -init_mag 0.1 \
        -search_type 2 -search_n 0 -search_verbosity 3\
        -search_fparam test.for_bk.S.param -roc_target 0 -roc_lower 0 -roc_upper 1 \
        |& nn_tee -h test.for_bk.S.log

    f. Training: after feature selection you have a list of selected features; now it is time for training.

       #!/bin/csh -ef
       # ./X1mlp.run
       set loc=`pwd`

       # train the final MLP (13-25-2 nodes) on the 13 features selected in the previous step, with 10-fold cross-validation
       (time mlp_lnk\
        -create -pathexp $loc -ferror X1mlp.err.train -fparam X1mlp.param\
        -pathdata $path\
        -finput test.train -fdescribe test.defaults -npatterns 3084 -ninputs 13\
        -features 70,3,71,60,8,26,45,97,61,22,46,14,81 -normalize -fnorm test.norm.simple\
        -cross_valid 10 -fcross_valid vowel.test.cv -random -seed 0 -priors_npatterns 0 -debug 0 -verbose 3\
        -verror 0 \
        -nodes 13,25,2 -alpha 0.6 -etta 0.04 -etta_change_type 0 -epsilon 0.1\
        -kappa 0.01 -etta_nepochs 0 -decay 0 -tolerance 0.01 -hfunction 0\
        -ofunction 0 -sig_param_list 1,1 -sigmoid_param 1 -cost_func 0 -cost_param 1 -epochs 30\
        -batch 1,1,0 -init_mag 0.1 \
        )|& nn_tee -h X1mlp.log
       # record a summary of the run
       echo -n "X1mlp.run" >> $path/LNKnet.note
       grep "LAST TRAIN EPOCH" X1mlp.log | tail -1 >> $path/LNKnet.note
       echo "current directory:" >> X1mlp.log
       echo $loc >> X1mlp.log

ESPPredictor:
 Reference: V. A. Fusaro, D. R. Mani, J. P. Mesirov, S. A. Carr. Prediction of high-responding peptides for targeted protein assays
            by mass spectrometry. Nature Biotechnology (2009) 27:190-198.
 Classifier: random forest
 How to run the module:
    a. Run it through the GenePattern web service: detailed instructions are at
       http://www.broadinstitute.org/cancer/software/genepattern/modules/ESPPredictor.html
    b. Run it from the command line:
       System requirements: R, MATLAB, and Java JDK 1.5.0 installed. We have found that JDK 1.6 does not work.

       1. Follow the first two steps of "How to run the module" on the page
          http://www.broadinstitute.org/cancer/software/genepattern/modules/ESPPredictor.html
       2. Click "export" (to the right of the "reset" button, between "properties" and "help") to export a zip file, which contains
          the program source files.

       You need to modify the ESPPredictor.java file slightly so that it parses the command-line parameters correctly, since the
       class it relies on, CmdSplitter, does not exist. After a simple modification, my local ESPPredictor can be run with the
       following command line. The "zzz" token is the separator between the input parameters for the MATLAB and R programs.

       java -classpath <libdir>/../ ESPPredictor.ESPPredictor \
        <libdir> \
        peptideFeatureSet \
        <input.file> zzz \
        <R2.5_HOME> \
        <libdir>/ESP_Predictor.R \
        Predict \
        <libdir>PeptideFeatureSet.csv \
        <libdir>ESP_Predictor_Model_020708.RData

       NOTE: the *.java files in the exported package contain code that can be used to combine all of the above steps, as in
       ESPPredictor.java.

Detectability Predictor:
 Reference: H. Tang, R. J. Arnold, P. Alves, Z. Xun, D. E. Clemmer, M. V. Novotny, J. P. Reilly, P. Radivojac. A computational
            approach toward label-free protein quantification using predicted peptide detectability. Bioinformatics (2006)
            22(14): e481-e488. http://www.iub.edu/~clemmer/Publications/pub%20113.pdf
 Classifier: 30 two-layer feed-forward neural networks trained using the resilient backpropagation algorithm
 How to run:
    1. Through the web service: http://darwin.informatics.indiana.edu/applications/PeptideDetectabilityPredictor/
    2. From the command line: request the standalone program from hatang@indiana.edu.

APEX:
 This is still under development. If you want more information, please contact Lars.

Combine Predictors:
 Use the same set of peptides that was used to retrain PeptideSieve to train a combination of the scores produced by the four
 predictors above. The training procedure is simple: we use the predictor scores as features, so there are four features per peptide,
 and the class label is inherited from the PeptideSieve retraining file, test.train. The feature selection step is skipped since we
 want to include all of the predictors. The training step is the same as above; a minimal sketch of assembling the combined training
 data is shown below.
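
 A minimal sketch of assembling the combined training data, assuming each predictor's scores have already been collected into a
 dictionary keyed by peptide sequence (all names are illustrative):

 # scores_by_predictor: e.g. {"peptidesieve": {pep: score, ...}, "esp": {...}, "detectability": {...}, "apex": {...}}
 # labels: {pep: 0 or 1}, inherited from the PeptideSieve retraining test.train file.
 PREDICTORS = ["peptidesieve", "esp", "detectability", "apex"]

 def build_combiner_patterns(scores_by_predictor, labels):
     patterns = []
     for peptide, label in labels.items():
         # Keep only peptides that received a score from every predictor.
         if all(peptide in scores_by_predictor[p] for p in PREDICTORS):
             features = [scores_by_predictor[p][peptide] for p in PREDICTORS]
             patterns.append((features, label))
     return patterns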
