Software:PeptideSieveRetrainingTutorial
PeptideSieve Retraining:

1. Get the PeptideSieve program:
Source code: http://proteowizard.svn.sourceforge.net/viewvc/proteowizard/trunk/pwiz/pwiz_aux/sfcap/peptideSieve/
Binary executable: http://proteowizard.svn.sourceforge.net/viewvc/proteowizard/trunk/pwiz/pwiz_aux/sfcap/peptideSieve/bin/
Files required to run the program: properties.txt, MUDPIT_ESI.txt, MUDPIT_ICAT.txt, PAGE_ESI.txt, PAGE_MALDI.txt
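
If you have a Subversion client, you can also check out the source tree directly. A minimal sketch; the svnroot URL is an assumption inferred from the viewvc URL above:

 # check out the PeptideSieve source (svnroot path inferred from the viewvc link)
 svn checkout http://proteowizard.svn.sourceforge.net/svnroot/proteowizard/trunk/pwiz/pwiz_aux/sfcap/peptideSieve peptideSieve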
| - | |||
| - | 2. Get training program: download LNKnet program from http://www.ll.mit.edu/mission/communications/ist/lnknet/index.html | ||
| - | |||
| - | 3. INSTALL LNKnet: | ||
| - | unzip the package, and install the LNKnet on your computer according to INSTALL.src file | ||
| - | It is good idea to go through the LNKnet quick start guide, http://www.ll.mit.edu/mission/communications/ist/lnknet/quickstart.pdf. | ||
| - | When you perform an action on the window API, a cshell script file will be created in the working directory. If you run this script, it will do the | ||
| - | exact same thing as you just did. | ||
| - | |||
| - | 4. Training steps: | ||
| - | |||
| - | a. choose Proteins: use build "Yeast non-ICAT PeptideAtlas 2009-12", choose proteins with minium length of 200AAs and have | ||
| - | peptide coverage of 60-90%. | ||
| - | 451 protein selected. | ||
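
The filter itself is simple; here is a sketch, in which the input file name and column layout are hypothetical rather than an actual PeptideAtlas export format:

 #!/bin/csh -ef
 # hypothetical input: proteins.txt with columns
 #   accession  length_in_AAs  percent_peptide_coverage
 # keep proteins of at least 200 AAs with 60-90% peptide coverage
 awk '$2 >= 200 && $3 >= 60 && $3 <= 90 {print $1}' proteins.txt > selected_proteins.txt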

b. Choose peptides: proteotypic peptides are the observed tryptic peptides with n_sample >= 4 (observed in at least 4 samples) and an Empirical Proteotypic Score (EPS) >= 0.5, where EPS = Nsamples(peptide) / Nsamples(parent protein); 1439 peptides were selected. Non-proteotypic peptides are all unobserved peptides from the same proteins; 1645 peptides were selected. (A sketch of the EPS filter follows.)
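
To make the cutoffs concrete: a peptide observed in 5 samples whose parent protein was observed in 8 has EPS = 5/8 = 0.625 and passes both tests. A sketch of the filter, with a hypothetical input file name and column order:

 #!/bin/csh -ef
 # hypothetical input: peptide_counts.txt with columns
 #   peptide  Nsamples_peptide  Nsamples_parent_protein
 # keep peptides observed in >= 4 samples with EPS >= 0.5
 awk '$2 >= 4 && $2 / $3 >= 0.5 {print $1, $2 / $3}' peptide_counts.txt > proteotypic_peptides.txt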

c. Choose features to start with: Parag started his feature selection from about 1100 features (numeric physicochemical property scales for amino acids, peptide length, amino acid counts, and peptide mass). Feature selection over that many features is too slow, so I instead chose features based on PeptideSieve's final selected feature list plus the top 35 features from the ESP predictor. In this way I selected 98 features to start with: peptide mass, peptide length, the 20 amino acid counts, and 38 numeric physicochemical property scales for AAs (each scale contributing the mean and the sum of its scores).
| - | |||
| - | b. convert the peptide list to property vector (training file). | ||
| - | in order to do the training, each peptide need to be represented by a fixed-length vector of real or discrete valued features. | ||
| - | In this case, each peptide will be represented by 98 features. each amino acid of a given peptide was replaced by a numerical value for each | ||
| - | selected property and each properity value was summed and average for each peptide result in a 76 dimensional property vector | ||
| - | plus the 20 amino acid composition, and the length and mass of the peptide. Finally, A binary output class label was added to each | ||
| - | feature vector. 1 (positive) Proteotyptic peptide and 0 otherwise. | ||
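
Here is an awk sketch of the sum/mean computation for a single property scale; the property values shown are placeholders, not the scales actually used in the training:

 # property_features.awk -- sum and mean of one property scale per peptide;
 # input: one peptide sequence per line
 BEGIN {
     # placeholder values for a few residues; the real training used
     # 38 published physicochemical scales covering all 20 AAs
     p["A"] = 1.8; p["C"] = 2.5; p["G"] = -0.4; p["K"] = -3.9
 }
 {
     sum = 0
     len = length($1)
     for (i = 1; i <= len; i++) sum += p[substr($1, i, 1)]
     # each scale contributes two features to the vector: its sum and its mean
     print $1, sum, sum / len
 }

Run it as, for example: awk -f property_features.awk peptides.txt > property_vectors.txt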
| - | |||
| - | e. feature selection: after creating the training file, a description file need to be created. it should have the same basename as the training file. For example, | ||
| - | if you name your training file as test.train, then the description file should be test.default. | ||
| - | The description file contain the information about database type, number of input, number of output, and class labels, and input | ||
| - | features label. The last two are optional. | ||
| - | |||
| - | it is stated in the LNKnet user guide that "For many classifiers, classification results are improved when the data has been | ||
| - | normalized in some way." So I choose simple normalization to normalize the training input, test.train. You can do this either | ||
| - | through window API or through command line. Here is an example: | ||
| - | |||
| - | #!/bin/csh -ef | ||
| - | # ./test.norm.simple.run | ||
| - | norm_lnk -pathdata $path \ | ||
| - | -finput test.train -fdescribe test.defaults -normalization 1\ | ||
| - | -fparam test.norm.simple -debug 0 -verbose 3 \ | ||
| - | |& nn_tee -h test.norm.simple.log | ||
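
These .run files are ordinary C-shell scripts, in the same form LNKnet writes out when you drive it from the graphical interface, so they can be executed directly:

 chmod +x test.norm.simple.run
 ./test.norm.simple.run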
| - | |||
| - | |||
| - | For feature selection, I use forward and backward searches. These searches select features one at a time based on the increase or | ||
| - | decrease in the error rate measured using cross validation and any classifier. | ||
| - | Once the feature selection search has complete, a subset of features can be selected for use in classification. | ||
| - | This subset can be the first, and presumably most important features. Here is the shell script to run forward and backward feature | ||
| - | selection. | ||
| - | |||
| - | #!/bin/csh -ef | ||
| - | # ./test.for_bk.S.run | ||
| - | set loc=`pwd` | ||
| - | |||
| - | #feature selection | ||
| - | mlp_feat_sel -pathexp $loc -ferror X1mlp.err.cv -fparam X1mlp.param\ | ||
| - | -pathdata $path\ | ||
| - | -finput test.train -fdescribe test.defaults -npatterns 3084 -ninputs 98\ | ||
| - | -normalize -fnorm test.norm.simple -cross_valid 4\ | ||
| - | -fcross_valid vowel.test.cv -random_cv -random -seed 0\ | ||
| - | -priors_npatterns 0 -debug 0 -verbose 3 -verror 0 \ | ||
| - | -nodes 98,25,2 -alpha 0.6 -etta 0.1 -etta_change_type 0 -epsilon 0.1\ | ||
| - | -kappa 0.01 -etta_nepochs 0 -decay 0 -tolerance 0.01 -hfunction 0\ | ||
| - | -ofunction 0 -sigmoid_param 1 -cost_func 0 -cost_param 1 -epochs 30\ | ||
| - | -batch 1,1,0 -init_mag 0.1 \ | ||
| - | -search_type 2 -search_n 0 -search_verbosity 3\ | ||
| - | -search_fparam test.for_bk.S.param -roc_target 0 -roc_lower 0 -roc_upper 1 \ | ||
| - | |& nn_tee -h test.for_bk.S.log | ||
| - | |||
| - | d. training: After the feature selection, you get a list of features, then is the time for training. | ||
 #!/bin/csh -ef
 # ./X1mlp.run
 set loc=`pwd`

 # train an MLP on the 13 features chosen by the feature selection step
 (time mlp_lnk \
 -create -pathexp $loc -ferror X1mlp.err.train -fparam X1mlp.param \
 -pathdata $path \
 -finput test.train -fdescribe test.defaults -npatterns 3084 -ninputs 13 \
 -features 70,3,71,60,8,26,45,97,61,22,46,14,81 -normalize -fnorm test.norm.simple \
 -cross_valid 10 -fcross_valid vowel.test.cv -random -seed 0 -priors_npatterns 0 -debug 0 -verbose 3 \
 -verror 0 \
 -nodes 13,25,2 -alpha 0.6 -etta 0.04 -etta_change_type 0 -epsilon 0.1 \
 -kappa 0.01 -etta_nepochs 0 -decay 0 -tolerance 0.01 -hfunction 0 \
 -ofunction 0 -sig_param_list 1,1 -sigmoid_param 1 -cost_func 0 -cost_param 1 -epochs 30 \
 -batch 1,1,0 -init_mag 0.1 \
 ) |& nn_tee -h X1mlp.log
 # record the final training epoch in the experiment notebook
 echo -n "X1mlp.run" >> $path/LNKnet.note
 grep "LAST TRAIN EPOCH" X1mlp.log | tail -1 >> $path/LNKnet.note
 echo "current directory:" >> X1mlp.log
 echo $loc >> X1mlp.log
| - | |||
| - | |||
| - | ESPPredictor: | ||
| - | Reference: Prediction of high-responding peptides for targeted protein assays by mass spectrometry | ||
| - | Vincent A. Fusaro, D. R. Mani, Jill P. Mesirov & Steven A. Carr | ||
| - | Nature Biotechnology (2009) 27:190-198. | ||
| - | Classfier: random forest | ||
| - | How to run module: | ||
| - | a. run it using genepattern web service: there is a detailed instruction on how to run it in: | ||
| - | http://www.broadinstitute.org/cancer/software/genepattern/modules/ESPPredictor.html | ||
| - | b. run it through command line: | ||
| - | SYSTEM requirement: R, matlab, java jdk1.5.0 installed. We have found that jdk 1.6 is not working. | ||
| - | |||
| - | 1. follow the first two steps of "How to run the module" in page | ||
| - | http://www.broadinstitute.org/cancer/software/genepattern/modules/ESPPredictor.html | ||
| - | 2. click export on the right hand side of reset button and between properties and help to export a zip file, which contains the program source files. | ||
| - | |||
| - | you need to do a little bit of modification on ESPPredictor.java file to let it parse the command line parameter correctly, | ||
| - | since the class, CmdSplitter, is not exist. After a simple modification , my local ESPPredictor | ||
| - | can run using the following command line. The "zzz" phrase is the separater for the input parameters for the matlab and R program. | ||
| - | |||
| - | java -classpath <libdir>/../ ESPPredictor.ESPPredictor \ | ||
| - | <libdir> \ | ||
| - | peptideFeatureSet \ | ||
| - | <input.file> zzz \ | ||
| - | <R2.5_HOME> \ | ||
| - | <libdir>/ESP_Predictor.R \ | ||
| - | Predict \ | ||
| - | <libdir>PeptideFeatureSet.csv \ | ||
| - | <libdir>ESP_Predictor_Model_020708.RData | ||
| - | |||
| - | NOTES:The *.java files in the package you download contain code that can be used to combine all the above steps, as in ESPPredictor.java. | ||
| - | |||
Detectability Predictor:
Reference: H. Tang, R. J. Arnold, P. Alves, Z. Xun, D. E. Clemmer, M. V. Novotny, J. P. Reilly, P. Radivojac. A computational approach toward label-free protein quantification using predicted peptide detectability. Bioinformatics (2006) 22(14):e481-e488. http://www.iub.edu/~clemmer/Publications/pub%20113.pdf
Classifier: 30 two-layer feed-forward neural networks trained with the resilient back-propagation algorithm
How to run:
1. Through the web service: http://darwin.informatics.indiana.edu/applications/PeptideDetectabilityPredictor/
2. From the command line: request the standalone program from hatang@indiana.edu.
| - | |||
| - | APEX: | ||
| - | This is still under development. If want more information, please contact lars | ||
| - | |||
| - | Combine Predictors: | ||
| - | use same set of peptides used to retrained PeptideSieve to do the training of combining the scores produced by above four different predictors. The steps of doing | ||
| - | the training is simple. We use the predictor scores as features, so there are four features for each peptide, and inherent the class label from the PeptideSieve | ||
| - | retraining test.train file. The feature selection step is skipped since we want to include all the predictors. The train step is same as above. | ||

