Software:PeptideSieveRetrainingTurtorial
PeptideSieve Retraining:

1. Get the PeptideSieve program:
   Source code: http://proteowizard.svn.sourceforge.net/viewvc/proteowizard/trunk/pwiz/pwiz_aux/sfcap/peptideSieve/
   Binary executable: http://proteowizard.svn.sourceforge.net/viewvc/proteowizard/trunk/pwiz/pwiz_aux/sfcap/peptideSieve/bin/
   Files required to run the program: properties.txt, MUDPIT_ESI.txt, MUDPIT_ICAT.txt, PAGE_ESI.txt, and PAGE_MALDI.txt

2. Get the training program: download LNKnet from http://www.ll.mit.edu/mission/communications/ist/lnknet/index.html

3. Install LNKnet:
   Unzip the package and install LNKnet on your computer according to the INSTALL.src file.
   It is a good idea to work through the LNKnet quick-start guide: http://www.ll.mit.edu/mission/communications/ist/lnknet/quickstart.pdf
   When you perform an action in the LNKnet windowing interface, a C-shell script is written to the
   working directory; running that script repeats exactly what you just did.
4. Training steps:

   a. Choose proteins: from the "Yeast non-ICAT PeptideAtlas 2009-12" build, select proteins with a
      minimum length of 200 AAs and peptide coverage of 60-90%.
      451 proteins selected.
   b. Choose peptides (a selection sketch follows this step):
      Proteotypic peptides: observed tryptic peptides with
         n_sample >= 4 (observed in at least 4 samples) and
         Empirical Proteotypic Score (EPS) >= 0.5, where EPS = Nsamples(peptide) / Nsamples(parent protein).
         1439 peptides selected.
      Non-proteotypic peptides: all unobserved tryptic peptides from the selected proteins.
         1645 peptides selected.
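
The criteria above are easy to state in code. This is a minimal Python sketch, not part of the pipeline;
the dict fields (length, coverage, n_samples, sequence) are hypothetical placeholders for however your
PeptideAtlas export is structured. Note that 1439 + 1645 = 3084, which matches the -npatterns 3084
argument passed to LNKnet below.

# Minimal sketch of the protein and peptide selection above (field names are hypothetical).

def select_proteins(proteins):
    """Keep proteins of at least 200 AAs with 60-90% peptide coverage."""
    return [p for p in proteins
            if p["length"] >= 200 and 0.60 <= p["coverage"] <= 0.90]

def label_peptides(protein, tryptic_peptides):
    """Split one protein's tryptic peptides into proteotypic (1) and non-proteotypic (0)."""
    labeled = []
    for pep in tryptic_peptides:
        eps = pep["n_samples"] / protein["n_samples"]  # Empirical Proteotypic Score
        if pep["n_samples"] >= 4 and eps >= 0.5:
            labeled.append((pep["sequence"], 1))       # proteotypic
        elif pep["n_samples"] == 0:
            labeled.append((pep["sequence"], 0))       # non-proteotypic: never observed
        # observed peptides that miss the thresholds are excluded from training
    return labeled
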
   c. Choose features to start with: Parag started his feature selection with ~1100 features (numeric
      physicochemical property scales for amino acids, peptide length, amino acid counts, and peptide
      mass). Feature selection over that many features is too slow, so I instead chose a starting set
      from the features PeptideSieve itself finally selected plus the top 35 features of the ESP
      predictor. This gave 98 starting features: peptide mass, peptide length, the 20 amino acid counts,
      and 38 numeric physicochemical property scales for amino acids (each used as both the mean and the
      sum of the per-residue scores).

   d. Convert the peptide list to property vectors (the training file): to do the training, each peptide
      must be represented by a fixed-length vector of real- or discrete-valued features; here, each
      peptide is represented by the 98 features above. Each amino acid of a given peptide is replaced by
      its numeric value under each selected property scale, and the per-residue values are summed and
      averaged for each peptide, giving a 76-dimensional property vector, plus the 20 amino acid
      composition counts and the peptide's length and mass. Finally, a binary class label is appended to
      each feature vector: 1 (positive) for proteotypic peptides and 0 otherwise. A sketch of this
      conversion follows.
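
As an illustration of step d, here is a minimal Python sketch of the conversion. Only one property scale
(Kyte-Doolittle hydropathy) is shown as a stand-in for the 38 actually used; with all 38, the vector is
2 + 20 + 76 = 98 features plus the label. Writing one such vector per line, whitespace-separated,
produces the training file (consult the LNKnet user guide for the exact file layout it expects).

from collections import Counter

AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"

# Monoisotopic residue masses (Da); adding 18.01056 for water gives the peptide mass.
MONO_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "I": 113.08406, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

# One illustrative scale standing in for the 38 physicochemical property scales.
SCALES = {
    "kyte_doolittle": {
        "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
        "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
        "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
    },
}

def feature_vector(peptide, label):
    """Mass, length, 20 AA counts, then sum and mean of each property scale,
    with the binary class label (1 = proteotypic, 0 otherwise) appended."""
    counts = Counter(peptide)
    mass = sum(MONO_MASS[aa] for aa in peptide) + 18.01056
    vec = [mass, float(len(peptide))]
    vec += [float(counts[aa]) for aa in AA_ORDER]
    for scale in SCALES.values():
        values = [scale[aa] for aa in peptide]
        vec += [sum(values), sum(values) / len(values)]
    return vec + [label]
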

   e. Feature selection: after creating the training file, a description file must also be created, with
      the same basename as the training file; for example, if the training file is named test.train, the
      description file should be test.defaults. The description file contains the database type, the
      number of inputs, the number of outputs, the class labels, and the input feature labels (the last
      two are optional).

      The LNKnet user guide states that "For many classifiers, classification results are improved when
      the data has been normalized in some way." I therefore applied simple normalization to the training
      input, test.train. You can do this either through the windowing interface or on the command line;
      here is an example:

#!/bin/csh -ef
# ./test.norm.simple.run
set datadir=`pwd`   # directory holding test.train and test.defaults; adjust as needed
norm_lnk -pathdata $datadir \
 -finput test.train -fdescribe test.defaults -normalization 1 \
 -fparam test.norm.simple -debug 0 -verbose 3 \
 |& nn_tee -h test.norm.simple.log

For feature selection, I use forward and backward searches. These searches add or remove features one at
a time, based on the increase or decrease in the error rate measured using cross-validation and any
chosen classifier. Once the search has completed, a subset of features can be selected for use in
classification; this subset contains the first, and presumably most important, features. Here is the
shell script to run forward-and-backward feature selection (a conceptual sketch of the forward search
follows the script):

#!/bin/csh -ef
# ./test.for_bk.S.run
set loc=`pwd`
set datadir=`pwd`   # directory holding test.train and test.defaults; adjust as needed

# Feature selection: forward/backward search over the 98 inputs,
# using a 98-25-2 MLP and 4-fold cross-validation.
mlp_feat_sel -pathexp $loc -ferror X1mlp.err.cv -fparam X1mlp.param \
 -pathdata $datadir \
 -finput test.train -fdescribe test.defaults -npatterns 3084 -ninputs 98 \
 -normalize -fnorm test.norm.simple -cross_valid 4 \
 -fcross_valid vowel.test.cv -random_cv -random -seed 0 \
 -priors_npatterns 0 -debug 0 -verbose 3 -verror 0 \
 -nodes 98,25,2 -alpha 0.6 -etta 0.1 -etta_change_type 0 -epsilon 0.1 \
 -kappa 0.01 -etta_nepochs 0 -decay 0 -tolerance 0.01 -hfunction 0 \
 -ofunction 0 -sigmoid_param 1 -cost_func 0 -cost_param 1 -epochs 30 \
 -batch 1,1,0 -init_mag 0.1 \
 -search_type 2 -search_n 0 -search_verbosity 3 \
 -search_fparam test.for_bk.S.param -roc_target 0 -roc_lower 0 -roc_upper 1 \
 |& nn_tee -h test.for_bk.S.log
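
LNKnet performs this search itself; purely to illustrate the idea behind the forward search, here is a
short Python sketch using scikit-learn (not part of the LNKnet workflow). A simple logistic-regression
classifier stands in for the MLP:

# Conceptual sketch of greedy forward feature selection with cross-validation.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, n_keep, folds=4):
    """Repeatedly add the feature whose addition gives the best CV accuracy."""
    selected, remaining = [], list(range(X.shape[1]))

    def cv_accuracy(feature):
        cols = selected + [feature]
        clf = LogisticRegression(max_iter=1000)
        return cross_val_score(clf, X[:, cols], y, cv=folds).mean()

    while len(selected) < n_keep and remaining:
        best = max(remaining, key=cv_accuracy)
        selected.append(best)
        remaining.remove(best)
    return selected  # indices ordered by presumed importance
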

   f. Training: after feature selection you have a list of selected features; now it is time to train.

#!/bin/csh -ef
# ./X1mlp.run
set loc=`pwd`
set datadir=`pwd`   # directory holding test.train and test.defaults; adjust as needed

# Train a 13-25-2 MLP on the 13 selected features, with 10-fold cross-validation.
(time mlp_lnk \
 -create -pathexp $loc -ferror X1mlp.err.train -fparam X1mlp.param \
 -pathdata $datadir \
 -finput test.train -fdescribe test.defaults -npatterns 3084 -ninputs 13 \
 -features 70,3,71,60,8,26,45,97,61,22,46,14,81 -normalize -fnorm test.norm.simple \
 -cross_valid 10 -fcross_valid vowel.test.cv -random -seed 0 -priors_npatterns 0 -debug 0 -verbose 3 \
 -verror 0 \
 -nodes 13,25,2 -alpha 0.6 -etta 0.04 -etta_change_type 0 -epsilon 0.1 \
 -kappa 0.01 -etta_nepochs 0 -decay 0 -tolerance 0.01 -hfunction 0 \
 -ofunction 0 -sig_param_list 1,1 -sigmoid_param 1 -cost_func 0 -cost_param 1 -epochs 30 \
 -batch 1,1,0 -init_mag 0.1 \
 ) |& nn_tee -h X1mlp.log
echo -n "X1mlp.run" >> $datadir/LNKnet.note
grep "LAST TRAIN EPOCH" X1mlp.log | tail -1 >> $datadir/LNKnet.note
echo "current directory:" >> X1mlp.log
echo $loc >> X1mlp.log


ESPPredictor:
Reference: V. A. Fusaro, D. R. Mani, J. P. Mesirov, S. A. Carr. Prediction of high-responding peptides
for targeted protein assays by mass spectrometry. Nature Biotechnology (2009) 27:190-198.
Classifier: random forest
How to run the module:
a. Through the GenePattern web service: detailed instructions are at
   http://www.broadinstitute.org/cancer/software/genepattern/modules/ESPPredictor.html
b. Through the command line:
   System requirements: R, MATLAB, and Java JDK 1.5.0 installed. We have found that JDK 1.6 does not work.

   1. Follow the first two steps of "How to run the module" on
      http://www.broadinstitute.org/cancer/software/genepattern/modules/ESPPredictor.html
   2. Click "export" (to the right of the reset button, between "properties" and "help") to export a zip
      file, which contains the program source files.

   You need to modify ESPPredictor.java slightly so that it parses the command-line parameters correctly,
   since the class it references, CmdSplitter, does not exist. After this simple modification, my local
   ESPPredictor runs with the following command line. The "zzz" token separates the input parameters for
   the MATLAB and R programs.

java -classpath <libdir>/../ ESPPredictor.ESPPredictor \
 <libdir> \
 peptideFeatureSet \
 <input.file> zzz \
 <R2.5_HOME> \
 <libdir>/ESP_Predictor.R \
 Predict \
 <libdir>/PeptideFeatureSet.csv \
 <libdir>/ESP_Predictor_Model_020708.RData

NOTE: the *.java files in the downloaded package contain code that can be used to combine all of the
above steps, as in ESPPredictor.java.

Detectability Predictor:
Reference: H. Tang, R. J. Arnold, P. Alves, Z. Xun, D. E. Clemmer, M. V. Novotny, J. P. Reilly,
P. Radivojac. A computational approach toward label-free protein quantification using predicted peptide
detectability. Bioinformatics (2006) 22(14):e481-e488.
PDF: http://www.iub.edu/~clemmer/Publications/pub%20113.pdf
Classifier: 30 two-layer feed-forward neural networks trained using the resilient backpropagation algorithm
How to run:
1. Through the web service: http://darwin.informatics.indiana.edu/applications/PeptideDetectabilityPredictor/
2. Through the command line: request the standalone program from hatang@indiana.edu.

APEX:
This predictor is still under development. For more information, please contact Lars.

Combine Predictors:
Use the same set of peptides used to retrain PeptideSieve to train a combination of the scores produced
by the four predictors above. The training procedure is simple: the predictor scores serve as the
features, so each peptide has four features, and the class label is inherited from the PeptideSieve
retraining file, test.train. The feature selection step is skipped, since we want to include all the
predictors; the training step is the same as above. A sketch of assembling the combiner's training data
follows.