Software:PeptideSieveRetrainingTutorial
PeptideSieve Retraining:

1. Get the PeptideSieve program:
   Source code: http://proteowizard.svn.sourceforge.net/viewvc/proteowizard/trunk/pwiz/pwiz_aux/sfcap/peptideSieve/
   Binary executable: http://proteowizard.svn.sourceforge.net/viewvc/proteowizard/trunk/pwiz/pwiz_aux/sfcap/peptideSieve/bin/
   Files required to run the program: properties.txt, plus MUDPIT_ESI.txt, MUDPIT_ICAT.txt, PAGE_ESI.txt, and PAGE_MALDI.txt

2. Get the training program: download the LNKnet package from http://www.ll.mit.edu/mission/communications/ist/lnknet/index.html

3. Install LNKnet:
   Unzip the package and install LNKnet on your computer according to the INSTALL.src file.
   It is a good idea to go through the LNKnet quick-start guide, http://www.ll.mit.edu/mission/communications/ist/lnknet/quickstart.pdf.
   When you perform an action in the graphical interface, a C-shell script is created in the working directory; running this script repeats exactly what you just did.

4. Training steps:

a. Choose proteins: use the "Yeast non-ICAT PeptideAtlas 2009-12" build, and choose proteins with a minimum length of 200 amino acids and a peptide coverage of 60-90%.
   451 proteins were selected.
b. Choose peptides:
   Proteotypic peptides: observed tryptic peptides with
      n_sample >= 4 (observed in at least 4 samples) and
      Empirical Proteotypic Score (EPS) >= 0.5, where EPS = Nsamples(peptide) / Nsamples(parent protein).
   1439 peptides were selected.
   Non-proteotypic peptides: all non-observed peptides from the same proteins.
   1645 peptides were selected.
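
For concreteness, the selection rule in step b can be expressed in a few lines of code. The following is a minimal Python sketch, not part of the original pipeline; the input layout and column names (n_samples_peptide, n_samples_protein) are assumptions made for illustration.

# eps_select.py -- hypothetical sketch of the step-b peptide selection.
# Assumes a tab-delimited table with columns: peptide, n_samples_peptide,
# n_samples_protein (column names invented for this sketch).
import csv

def select_peptides(path, min_samples=4, min_eps=0.5):
    proteotypic, non_proteotypic = [], []
    with open(path) as f:
        for row in csv.DictReader(f, delimiter="\t"):
            n_pep = int(row["n_samples_peptide"])
            n_prot = int(row["n_samples_protein"])
            # EPS = Nsamples(peptide) / Nsamples(parent protein)
            eps = n_pep / float(n_prot)
            if n_pep >= min_samples and eps >= min_eps:
                proteotypic.append(row["peptide"])
            elif n_pep == 0:
                # a tryptic peptide of the protein that was never observed
                non_proteotypic.append(row["peptide"])
    return proteotypic, non_proteotypic
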
c. Choose features to start with: Parag started his feature selection with 1100 features (numeric physicochemical property scales for amino acids, peptide length, amino acid counts, and peptide mass). Feature selection is too slow with that many features, so I instead chose features based on the feature list finally selected by PeptideSieve and the top 35 features from the ESP predictor. In this way I selected 98 features to start with: peptide mass, peptide length, the 20 amino acid counts, and 38 numeric physicochemical property scales for amino acids (mean and sum of the scores).

d. Convert the peptide list to property vectors (the training file).
   In order to do the training, each peptide needs to be represented by a fixed-length vector of real- or discrete-valued features; in this case, each peptide is represented by 98 features. Each amino acid of a given peptide is replaced by a numerical value for each selected property, and each property's values are summed and averaged over the peptide, resulting in a 76-dimensional property vector (38 scales x 2), plus the 20 amino acid composition counts and the length and mass of the peptide. Finally, a binary output class label is added to each feature vector: 1 (positive) for proteotypic peptides and 0 otherwise.

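To make the conversion concrete, here is a minimal Python sketch of building one such feature vector. The property scales and mass table shown are placeholders with only a few amino acids filled in, not the 38 scales actually used.

# make_features.py -- hypothetical sketch of the peptide -> vector step.
# PROPERTY_SCALES maps each scale name to a per-amino-acid value table;
# the two scales below are placeholders for the 38 scales actually used.
PROPERTY_SCALES = {
    "hydrophobicity": {"A": 1.8, "R": -4.5, "N": -3.5},   # ... all 20 AAs
    "bulkiness":      {"A": 11.5, "R": 14.3, "N": 12.8},  # ... all 20 AAs
}
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_MASS = {"A": 71.04, "R": 156.10, "N": 114.04}  # truncated mass table

def peptide_to_vector(peptide, is_proteotypic):
    vec = []
    # sum and mean of each property scale over the peptide (2 values/scale)
    for table in PROPERTY_SCALES.values():
        vals = [table.get(aa, 0.0) for aa in peptide]
        vec.append(sum(vals))
        vec.append(sum(vals) / len(peptide))
    # 20 amino acid composition counts
    vec.extend(peptide.count(aa) for aa in AMINO_ACIDS)
    # peptide length and mass (residue masses plus one water)
    vec.append(len(peptide))
    vec.append(sum(AA_MASS.get(aa, 0.0) for aa in peptide) + 18.01)
    # binary class label: 1 = proteotypic, 0 = otherwise
    vec.append(1 if is_proteotypic else 0)
    return vec
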
e. Feature selection: after creating the training file, a description file needs to be created with the same basename as the training file. For example, if you name your training file test.train, then the description file should be test.defaults.
   The description file contains information about the database type, the number of inputs, the number of outputs, the class labels, and the input feature labels (the last two are optional).

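The description file (and the -npatterns/-ninputs flags in the scripts below) needs the pattern and input counts of the training file. Here is a hypothetical little helper for obtaining them, assuming one whitespace-separated pattern per line with the class label in the last column:

# count_patterns.py -- hypothetical helper: report the number of patterns
# and inputs in a training file (one pattern per line, whitespace-separated
# features, class label assumed to be the last column).
import sys

def describe(path):
    npatterns, ninputs = 0, 0
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            npatterns += 1
            ninputs = len(fields) - 1  # last column is the class label
    return npatterns, ninputs

if __name__ == "__main__":
    n, d = describe(sys.argv[1])
    print("npatterns =", n, " ninputs =", d)
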
The LNKnet user guide states that "For many classifiers, classification results are improved when the data has been normalized in some way," so I chose simple normalization to normalize the training input, test.train. You can do this either through the graphical interface or on the command line. Here is an example:

#!/bin/csh -ef
# ./test.norm.simple.run
norm_lnk -pathdata $path \
-finput test.train -fdescribe test.defaults -normalization 1\
-fparam test.norm.simple -debug 0 -verbose 3 \
|& nn_tee -h test.norm.simple.log

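Simple normalization shifts and scales each input feature to zero mean and unit variance. A minimal Python sketch of the same idea follows (an illustration of the technique, not the norm_lnk implementation):

# simple_norm.py -- sketch of simple (zero-mean, unit-variance)
# normalization, the idea behind norm_lnk -normalization 1.
def simple_normalize(patterns):
    """patterns: list of feature lists (class labels already removed)."""
    ncols = len(patterns[0])
    n = float(len(patterns))
    means = [sum(p[j] for p in patterns) / n for j in range(ncols)]
    sds = []
    for j in range(ncols):
        var = sum((p[j] - means[j]) ** 2 for p in patterns) / n
        sd = var ** 0.5
        sds.append(sd if sd > 0 else 1.0)  # guard against constant features
    return [[(p[j] - means[j]) / sds[j] for j in range(ncols)]
            for p in patterns]
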
For feature selection, I use forward and backward searches. These searches select features one at a time, based on the increase or decrease in the error rate measured using cross-validation and any classifier. Once the feature-selection search has completed, a subset of features can be selected for use in classification. This subset can be the first, and presumably most important, features. Here is the shell script to run forward and backward feature selection (a schematic sketch of the greedy search itself follows the script):

#!/bin/csh -ef
# ./test.for_bk.S.run
set loc=`pwd`

#feature selection
mlp_feat_sel -pathexp $loc -ferror X1mlp.err.cv -fparam X1mlp.param\
-pathdata $path\
-finput test.train -fdescribe test.defaults -npatterns 3084 -ninputs 98\
-normalize -fnorm test.norm.simple -cross_valid 4\
-fcross_valid vowel.test.cv -random_cv -random -seed 0\
-priors_npatterns 0 -debug 0 -verbose 3 -verror 0 \
-nodes 98,25,2 -alpha 0.6 -etta 0.1 -etta_change_type 0 -epsilon 0.1\
-kappa 0.01 -etta_nepochs 0 -decay 0 -tolerance 0.01 -hfunction 0\
-ofunction 0 -sigmoid_param 1 -cost_func 0 -cost_param 1 -epochs 30\
-batch 1,1,0 -init_mag 0.1 \
-search_type 2 -search_n 0 -search_verbosity 3\
-search_fparam test.for_bk.S.param -roc_target 0 -roc_lower 0 -roc_upper 1 \
|& nn_tee -h test.for_bk.S.log

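The forward half of the search boils down to a greedy loop. Here is a schematic Python sketch, where cv_error() stands in for LNKnet's cross-validated error estimate; mlp_feat_sel performs this internally with the MLP and cross-validation settings given above:

# forward_select.py -- schematic sketch of greedy forward feature
# selection; cv_error(features) stands in for a cross-validated
# classifier error estimate.
def forward_select(all_features, cv_error, n_keep):
    """Greedily add the feature that most reduces the CV error."""
    selected = []
    remaining = list(all_features)
    while remaining and len(selected) < n_keep:
        best = min(remaining, key=lambda f: cv_error(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected  # ordered by presumed importance
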
f. Training: after feature selection you have a list of features; now it is time for training.

#!/bin/csh -ef
# ./X1mlp.run
set loc=`pwd`

#train
(time mlp_lnk\
-create -pathexp $loc -ferror X1mlp.err.train -fparam X1mlp.param\
-pathdata $path\
-finput test.train -fdescribe test.defaults -npatterns 3084 -ninputs 13\
-features 70,3,71,60,8,26,45,97,61,22,46,14,81 -normalize -fnorm test.norm.simple\
-cross_valid 10 -fcross_valid vowel.test.cv -random -seed 0 -priors_npatterns 0 -debug 0 -verbose 3\
-verror 0 \
-nodes 13,25,2 -alpha 0.6 -etta 0.04 -etta_change_type 0 -epsilon 0.1\
-kappa 0.01 -etta_nepochs 0 -decay 0 -tolerance 0.01 -hfunction 0\
-ofunction 0 -sig_param_list 1,1 -sigmoid_param 1 -cost_func 0 -cost_param 1 -epochs 30\
-batch 1,1,0 -init_mag 0.1 \
)|& nn_tee -h X1mlp.log
echo -n "X1mlp.run" >> $path/LNKnet.note
grep "LAST TRAIN EPOCH" X1mlp.log | tail -1 >> $path/LNKnet.note
echo "current directory:" >> X1mlp.log
echo $loc >> X1mlp.log

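For readers who want to see what mlp_lnk is doing conceptually, here is a schematic numpy sketch of training a 13-25-2 sigmoid MLP with batch backpropagation. It assumes etta is the step size and alpha the momentum term, mirroring the flags above; everything else is simplified and should not be taken as LNKnet's exact algorithm.

# mlp_train.py -- schematic sketch of a 13-25-2 sigmoid MLP trained with
# batch backpropagation and momentum (a simplification, not mlp_lnk).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_mlp(X, y, nhidden=25, etta=0.04, alpha=0.6, epochs=30, seed=0):
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-0.1, 0.1, (X.shape[1], nhidden))  # init_mag 0.1
    W2 = rng.uniform(-0.1, 0.1, (nhidden, 2))           # 2 output classes
    dW1, dW2 = np.zeros_like(W1), np.zeros_like(W2)
    T = np.eye(2)[y]                                    # one-hot targets
    for _ in range(epochs):
        H = sigmoid(X @ W1)                             # hidden layer
        O = sigmoid(H @ W2)                             # output layer
        # backpropagate the squared-error gradient
        dO = (O - T) * O * (1 - O)
        dH = (dO @ W2.T) * H * (1 - H)
        dW2 = -etta * (H.T @ dO) + alpha * dW2          # momentum update
        dW1 = -etta * (X.T @ dH) + alpha * dW1
        W2 += dW2
        W1 += dW1
    return W1, W2
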
ESPPredictor:
Reference: Prediction of high-responding peptides for targeted protein assays by mass spectrometry.
Vincent A. Fusaro, D. R. Mani, Jill P. Mesirov & Steven A. Carr.
Nature Biotechnology (2009) 27:190-198.
Classifier: random forest
How to run the module:
a. Run it through the GenePattern web service: there are detailed instructions at
   http://www.broadinstitute.org/cancer/software/genepattern/modules/ESPPredictor.html
b. Run it through the command line:
   System requirements: R, MATLAB, and Java JDK 1.5.0 installed. We have found that JDK 1.6 does not work.

1. Follow the first two steps of "How to run the module" on the page
   http://www.broadinstitute.org/cancer/software/genepattern/modules/ESPPredictor.html
2. Click "export" (on the right-hand side of the reset button, between "properties" and "help") to export a zip file containing the program source files.

You need to modify the ESPPredictor.java file slightly so that it parses the command-line parameters correctly, since the CmdSplitter class does not exist. After this simple modification, my local ESPPredictor can run using the following command line. The "zzz" token is the separator between the input parameters for the MATLAB and R programs.

java -classpath <libdir>/../ ESPPredictor.ESPPredictor \
<libdir> \
peptideFeatureSet \
<input.file> zzz \
<R2.5_HOME> \
<libdir>/ESP_Predictor.R \
Predict \
<libdir>PeptideFeatureSet.csv \
<libdir>ESP_Predictor_Model_020708.RData

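The splitting job that the missing CmdSplitter class was meant to do amounts to partitioning one argument list at the "zzz" separator. Here is a Python sketch of the idea (the actual fix, of course, goes into ESPPredictor.java):

# split_args.py -- sketch of splitting an argument list into per-program
# groups at the "zzz" separator (the job CmdSplitter was meant to do).
def split_on_separator(args, sep="zzz"):
    groups, current = [], []
    for a in args:
        if a == sep:
            groups.append(current)  # close the group for one program
            current = []
        else:
            current.append(a)
    groups.append(current)
    return groups  # e.g. [matlab_args, r_args]
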
NOTES: The *.java files in the package you download contain code that can be used to combine all the above steps, as in ESPPredictor.java.

Detectability Predictor:
Reference: http://www.iub.edu/~clemmer/Publications/pub%20113.pdf
H. Tang, R. J. Arnold, P. Alves, Z. Xun, D. E. Clemmer, M. V. Novotny, J. P. Reilly, P. Radivojac.
A computational approach toward label-free protein quantification using predicted peptide detectability. Bioinformatics (2006) 22(14): e481-e488.
Classifier: 30 two-layer feed-forward neural networks trained using the resilient backpropagation algorithm
How to run:
1. Run it through the web service tool: http://darwin.informatics.indiana.edu/applications/PeptideDetectabilityPredictor/
2. Run it through the command line: you need to send a request to hatang@indiana.edu to download the standalone program.

APEX:
This is still under development. If you want more information, please contact Lars.

Combine Predictors:
Use the same set of peptides used to retrain PeptideSieve to train a combination of the scores produced by the four predictors above. The training steps are simple: we use the predictor scores as features, so there are four features per peptide, and each peptide inherits its class label from the PeptideSieve retraining file, test.train. The feature-selection step is skipped, since we want to include all the predictors. The training step is the same as above; a minimal sketch of assembling the combined training file follows.
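
Here is that sketch in Python; the dictionary-based inputs and the output layout (four scores per line, class label last) are assumptions made for illustration:

# combine_predictors.py -- hypothetical sketch: build a 4-feature training
# file from the four predictor scores, reusing the class labels from the
# PeptideSieve retraining set.
def write_combined(peptide_labels, score_tables, out_path):
    """peptide_labels: {peptide: 0 or 1}; score_tables: list of four
    {peptide: score} dicts, one per predictor."""
    with open(out_path, "w") as out:
        for pep, label in peptide_labels.items():
            scores = [t.get(pep, 0.0) for t in score_tables]
            # four predictor scores as inputs, class label in last column
            out.write(" ".join("%g" % s for s in scores) + " %d\n" % label)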