Software:PeptideSieveRetrainingTutorial

From SPCTools


PeptideSieve Retraining:

1. Get the PeptideSieve program:
Get the PeptideSieve source code from <http://proteowizard.svn.sourceforge.net/viewvc/proteowizard/trunk/pwiz/pwiz_aux/sfcap/peptideSieve/>
Binary executable: in http://proteowizard.svn.sourceforge.net/viewvc/proteowizard/trunk/pwiz/pwiz_aux/sfcap/peptideSieve/bin/
Files required to run the program: properties.txt, MUDPIT_ESI.txt, MUDPIT_ICAT.txt, PAGE_ESI.txt, PAGE_MALDI.txt

2. Get the training program: download the LNKnet package from http://www.ll.mit.edu/mission/communications/ist/lnknet/index.html

3. Install LNKnet:

   Unzip the package and install LNKnet on your computer according to the INSTALL.src file.
   It is a good idea to go through the LNKnet quick start guide, http://www.ll.mit.edu/mission/communications/ist/lnknet/quickstart.pdf.
   Whenever you perform an action in the LNKnet window interface, a C-shell script file is created in the
   working directory. If you run this script, it will do the exact same thing you just did, as in the example below.
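   For example, assuming the normalization action of step 4 has already generated the script
   test.norm.simple.run in a working directory (the path here is hypothetical), the recorded
   action can be replayed at any time:

       #!/bin/csh -ef
       # replay_norm.csh -- re-run a GUI-generated LNKnet script.
       # ~/lnknet_work is an assumed working directory; adjust to yours.
       cd ~/lnknet_work
       ./test.norm.simple.run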

4. Training steps:

   a. Choose proteins: use the build "Yeast non-ICAT PeptideAtlas 2009-12"; choose proteins with a minimum
      length of 200 AAs and a peptide coverage of 60-90%.
      451 proteins were selected.
   b. Choose peptides:
      Proteotypic peptides: observed tryptic peptides with
          n_sample >= 4 (observed in at least 4 samples) and
          Empirical Proteotypic Score (EPS) >= 0.5, where EPS = Nsamples(peptide)/Nsamples(parent protein).
          1439 peptides were selected.
      Non-proteotypic peptides: all non-observed peptides of the selected proteins.
          1645 peptides were selected.
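      Below is a minimal filtering sketch (an illustration only, not part of PeptideSieve). It assumes a
      hypothetical tab-delimited file peptide_obs.tsv with one row per observed tryptic peptide and three
      columns: peptide sequence, Nsamples(peptide), and Nsamples(parent protein):

          #!/bin/csh -ef
          # select_proteotypic.csh -- hypothetical sketch; the input format is assumed.
          # Keep peptides seen in >= 4 samples with EPS = Nsamples(peptide) / Nsamples(protein) >= 0.5.
          awk -F '\t' '$2 >= 4 && $3 > 0 && ($2 / $3) >= 0.5 {print $1}' \
              peptide_obs.tsv > proteotypic_peptides.txt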
   c. Choose features to start with: Parag started his feature selection with 1100 features (numeric
      physicochemical property scales for amino acids, peptide length, amino acid counts, and peptide mass).
      Doing feature selection with so many features is too slow, so I chose a starting set based on the
      feature list finally selected by PeptideSieve and the top 35 features from the ESP predictor. In this
      way I selected 98 features to start with: peptide mass, peptide length, the 20 amino acid counts, and
      38 numeric physicochemical property scales for AAs (mean and sum of the scores).
   d. Convert the peptide list to property vectors (the training file):
      In order to do the training, each peptide needs to be represented by a fixed-length vector of real- or
      discrete-valued features; in this case, 98 features. Each amino acid of a given peptide was replaced by
      a numerical value for each selected property, and each property's values were summed and averaged over
      the peptide, resulting in a 76-dimensional property vector, plus the 20 amino acid composition counts
      and the length and mass of the peptide. Finally, a binary output class label was added to each feature
      vector: 1 (positive) for a proteotypic peptide, 0 otherwise.
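      Here is a reduced sketch of this conversion (an illustration only). It assumes a hypothetical input
      file peptides.txt (tab-separated: sequence, class label) and a label-first ASCII pattern format (one
      peptide per line: class label, then the feature values). Only three features are computed here,
      peptide length plus the sum and mean of one property scale (Kyte-Doolittle hydrophobicity); the real
      training file carries all 98:

          #!/bin/csh -ef
          # make_train.csh -- reduced sketch, not the full 98-feature pipeline.
          # Output per peptide: <label> <length> <hydrophobicity sum> <hydrophobicity mean>
          awk -F '\t' 'BEGIN { split("A 1.8 C 2.5 D -3.5 E -3.5 F 2.8 G -0.4 H -3.2 I 4.5 K -3.9 L 3.8 M 1.9 N -3.5 P -1.6 Q -3.5 R -4.5 S -0.8 T -0.7 V 4.2 W -0.9 Y -1.3", t, " "); for (i = 1; i <= 40; i += 2) h[t[i]] = t[i + 1] } { s = 0; n = length($1); for (i = 1; i <= n; i++) s += h[substr($1, i, 1)]; printf "%s %d %.4f %.4f\n", $2, n, s, s / n }' \
              peptides.txt > test.train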
   e. Feature selection: after creating the training file, a description file needs to be created. It should
      have the same basename as the training file; for example, if you name your training file test.train,
      then the description file should be test.defaults. The description file contains information about the
      database type, the number of inputs, the number of outputs, the class labels, and the input feature
      labels. The last two are optional.

      It is stated in the LNKnet user guide that "For many classifiers, classification results are improved
      when the data has been normalized in some way." So I chose simple normalization to normalize the
      training input, test.train. You can do this either through the window interface or through the command
      line. Here is an example:
          #!/bin/csh -ef
          # ./test.norm.simple.run
          set datadir=`pwd`   # directory containing test.train and test.defaults
          norm_lnk -pathdata $datadir \
          -finput test.train  -fdescribe test.defaults  -normalization 1 \
          -fparam test.norm.simple  -debug 0  -verbose 3 \
          |& nn_tee -h test.norm.simple.log
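      Conceptually, simple normalization rescales each input feature; norm_lnk stores the parameters in
      test.norm.simple for later use via the -normalize option. The sketch below is an illustration only,
      assuming simple normalization standardizes each feature to zero mean and unit variance, applied
      directly to a label-first training file:

          #!/bin/csh -ef
          # zscore.csh -- conceptual illustration of simple normalization; not a
          # replacement for norm_lnk. Pass 1 accumulates per-column sums, pass 2
          # rewrites each feature (columns 2..NF) as a z-score.
          awk 'NR == FNR { for (i = 2; i <= NF; i++) { s[i] += $i; q[i] += $i * $i } n++; next } { printf "%s", $1; for (i = 2; i <= NF; i++) { m = s[i] / n; v = q[i] / n - m * m; sd = (v > 0) ? sqrt(v) : 0; printf " %.4f", (sd > 0 ? ($i - m) / sd : 0) } printf "\n" }' \
              test.train test.train > test.train.zscore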


      For feature selection, I used forward and backward searches. These searches select features one at a
      time based on the increase or decrease in the error rate, measured using cross-validation and any
      classifier. Once the feature selection search has completed, a subset of features can be selected for
      use in classification. This subset can be the first, and presumably most important, features. Here is
      the shell script to run forward and backward feature selection:
          #!/bin/csh -ef
          # ./test.for_bk.S.run
          set loc=`pwd`
          set datadir=`pwd`   # directory containing test.train and test.defaults
          # feature selection (forward and backward search with an MLP classifier)
          mlp_feat_sel -pathexp $loc  -ferror X1mlp.err.cv  -fparam X1mlp.param \
          -pathdata $datadir \
          -finput test.train  -fdescribe test.defaults  -npatterns 3084 -ninputs 98 \
          -normalize  -fnorm test.norm.simple  -cross_valid 4 \
          -fcross_valid test.cv  -random_cv  -random  -seed 0 \
          -priors_npatterns 0  -debug 0  -verbose 3  -verror 0 \
          -nodes 98,25,2  -alpha 0.6  -etta 0.1  -etta_change_type 0  -epsilon 0.1 \
          -kappa 0.01  -etta_nepochs 0  -decay 0  -tolerance 0.01  -hfunction 0 \
          -ofunction 0  -sigmoid_param 1  -cost_func 0  -cost_param 1  -epochs 30 \
          -batch 1,1,0  -init_mag 0.1 \
          -search_type 2  -search_n 0  -search_verbosity 3 \
          -search_fparam test.for_bk.S.param  -roc_target 0  -roc_lower 0  -roc_upper 1 \
          |& nn_tee -h test.for_bk.S.log
                
   f. Training: after the feature selection you have a list of selected features; now it is time for training.
          #!/bin/csh -ef
          # ./X1mlp.run
          set loc=`pwd`
          set datadir=`pwd`   # directory containing test.train and test.defaults
          # train an MLP on the 13 features chosen by the feature selection step
          (time mlp_lnk \
          -create  -pathexp $loc  -ferror X1mlp.err.train  -fparam X1mlp.param \
          -pathdata $datadir \
          -finput test.train  -fdescribe test.defaults  -npatterns 3084 -ninputs 13 \
          -features 70,3,71,60,8,26,45,97,61,22,46,14,81 -normalize  -fnorm test.norm.simple \
          -cross_valid 10 -fcross_valid test.cv  -random  -seed 0  -priors_npatterns 0  -debug 0  -verbose 3 \
          -verror 0 \
          -nodes 13,25,2  -alpha 0.6  -etta 0.04  -etta_change_type 0  -epsilon 0.1 \
          -kappa 0.01  -etta_nepochs 0  -decay 0  -tolerance 0.01  -hfunction 0 \
          -ofunction 0  -sig_param_list 1,1  -sigmoid_param 1  -cost_func 0  -cost_param 1  -epochs 30 \
          -batch 1,1,0  -init_mag 0.1 \
          ) |& nn_tee -h X1mlp.log
          # record the final training error in a shared notes file
          echo -n "X1mlp.run " >> $datadir/LNKnet.note
          grep "LAST TRAIN EPOCH" X1mlp.log | tail -1 >> $datadir/LNKnet.note
          echo "current directory:" >> X1mlp.log
          echo $loc >> X1mlp.log


ESPPredictor:

    Reference: Fusaro, V. A., Mani, D. R., Mesirov, J. P. & Carr, S. A. Prediction of high-responding peptides
               for targeted protein assays by mass spectrometry. Nature Biotechnology (2009) 27:190-198.
    Classifier: random forest
    How to run the module:
       a. Run it using the GenePattern web service; there are detailed instructions at
          http://www.broadinstitute.org/cancer/software/genepattern/modules/ESPPredictor.html
       b. Run it through the command line:
          System requirements: R, MATLAB, and Java JDK 1.5.0 installed. We have found that JDK 1.6 does not work.
          1. Follow the first two steps of "How to run the module" on the page
             http://www.broadinstitute.org/cancer/software/genepattern/modules/ESPPredictor.html
          2. Click "export" (on the right-hand side of the reset button, between "properties" and "help") to
             export a zip file, which contains the program source files. You need to modify the
             ESPPredictor.java file a little so that it parses the command-line parameters correctly, since
             the class CmdSplitter does not exist. After a simple modification, my local ESPPredictor can run
             using the following command line. The "zzz" token is the separator between the input parameters
             for the MATLAB and R programs.
                            
             java -classpath <libdir>/../ ESPPredictor.ESPPredictor \
             <libdir> \
             peptideFeatureSet \
             <input.file> zzz \
             <R2.5_HOME> \
             <libdir>/ESP_Predictor.R \
             Predict \
             <libdir>/PeptideFeatureSet.csv \
             <libdir>/ESP_Predictor_Model_020708.RData
             NOTE: the *.java files in the exported package contain code that can be used to combine all of the above steps, as in ESPPredictor.java.
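             For convenience, the exported module can be driven from a small csh wrapper. Everything below
             (install paths, R location, input file name) is an assumption to adapt to your local setup:

                 #!/bin/csh -ef
                 # run_esp.csh -- hypothetical wrapper around the exported ESPPredictor module.
                 set libdir = $HOME/ESPPredictor/libdir    # assumed location of the exported <libdir>
                 set r_home = /usr/local/R-2.5.0           # assumed R 2.5 installation
                 java -classpath $libdir/../ ESPPredictor.ESPPredictor \
                     $libdir \
                     peptideFeatureSet \
                     my_peptides.txt zzz \
                     $r_home \
                     $libdir/ESP_Predictor.R \
                     Predict \
                     $libdir/PeptideFeatureSet.csv \
                     $libdir/ESP_Predictor_Model_020708.RData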

Detectability Predictor:

        Reference: H. Tang, R. J. Arnold, P. Alves, Z. Xun, D. E. Clemmer, M. V. Novotny, J. P. Reilly, P. Radivojac.
                A computational approach toward label-free protein quantification using predicted peptide detectability. Bioinformatics (2006) 22(14):e481-e488.
                http://www.iub.edu/~clemmer/Publications/pub%20113.pdf
        Classifier: 30 two-layer feed-forward neural networks trained with the resilient back-propagation algorithm
        How to run: 1. Run it through the web service tool: http://darwin.informatics.indiana.edu/applications/PeptideDetectabilityPredictor/
                    2. Run it through the command line: you need to send a request to hatang@indiana.edu to download the standalone program.

APEX:

        This is still under development. If you want more information, please contact Lars.

Combine Predictors:

       Use the same set of peptides that was used to retrain PeptideSieve to train a combination of the scores
       produced by the four predictors above. The training steps are simple: we use the predictor scores as
       features, so there are four features for each peptide, and inherit the class labels from the test.train
       file of the PeptideSieve retraining. The feature selection step is skipped since we want to include all
       the predictors. The training step is the same as above.