Software:PeptideSieveRetrainingTutorial
PeptideSieve Retraining:
1. Get the PeptideSieve program: get the PeptideSieve source code from http://proteowizard.svn.sourceforge.net/viewvc/proteowizard/trunk/pwiz/pwiz_aux/sfcap/peptideSieve/. A binary executable is available in http://proteowizard.svn.sourceforge.net/viewvc/proteowizard/trunk/pwiz/pwiz_aux/sfcap/peptideSieve/bin/. Files required to run the program: properties.txt, plus MUDPIT_ESI.txt, MUDPIT_ICAT.txt, PAGE_ESI.txt, and PAGE_MALDI.txt.
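For example, assuming the standard SourceForge Subversion layout (the svnroot URL below is inferred from the viewvc link above and may need adjusting), the source tree can be checked out with:
svn checkout http://proteowizard.svn.sourceforge.net/svnroot/proteowizard/trunk/pwiz/pwiz_aux/sfcap/peptideSieve/ peptideSieve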
2. Get the training program: download the LNKnet package from http://www.ll.mit.edu/mission/communications/ist/lnknet/index.html
3. Install LNKnet:
Unzip the package and install LNKnet on your computer according to the INSTALL.src file. It is a good idea to go through the LNKnet quick-start guide: http://www.ll.mit.edu/mission/communications/ist/lnknet/quickstart.pdf. When you perform an action in the LNKnet graphical interface, a C-shell script is created in the working directory; running this script repeats exactly what you just did, as in the example below.
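For example, if an action in the interface generated a script named test.norm.simple.run (a name that appears later in this tutorial), you can repeat that action from the shell:
chmod +x test.norm.simple.run   # the generated scripts are C-shell scripts
./test.norm.simple.run          # repeats exactly what was done in the interface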
4. Training steps:
a. Choose proteins: using the build "Yeast non-ICAT PeptideAtlas 2009-12", choose proteins with a minimum length of 200 AAs and a peptide coverage of 60-90%. 451 proteins were selected.
b. Choose peptides:
Proteotypic peptides: observed tryptic peptides with n_samples >= 4 (observed in at least 4 samples) and an Empirical Proteotypic Score (EPS) >= 0.5, where EPS = Nsamples(peptide) / Nsamples(parent protein). For example, a peptide observed in 4 samples whose parent protein was observed in 8 samples has EPS = 4/8 = 0.5 and just passes the cutoff. 1439 peptides were selected. (An illustrative filtering sketch is given after step c.)
Non-proteotypic peptides: all non-observed peptides for the selected proteins. 1645 peptides were selected.
c. Choose features to start with: Parag started his feature selection with some 1100 features (numeric physicochemical property scales for amino acids, peptide length, amino acid counts, and peptide mass), but feature selection with that many features is too slow. I instead chose features based on the feature list finally selected by PeptideSieve and the top 35 features from the ESP predictor. In this way I selected 98 features to start with: peptide mass, peptide length, the 20 amino acid counts, and 38 numeric physicochemical property scales for amino acids (mean and sum of the scores).
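To illustrate the peptide filter in step b (this sketch is not part of the original tutorial, and the column layout of the input table is an assumption), the following one-liner keeps peptides passing both cutoffs:
#!/bin/csh -ef
# select_proteotypic.run -- illustrative sketch; assumed columns:
# 1 = peptide sequence, 2 = n_samples(peptide), 3 = n_samples(parent protein)
awk '$2 >= 4 && $2/$3 >= 0.5 {print $1}' peptides.tsv > proteotypic.txt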
d. Convert the peptide list to property vectors (the training file): in order to do the training, each peptide needs to be represented by a fixed-length vector of real- or discrete-valued features; in this case, 98 features per peptide. Each amino acid of a given peptide was replaced by its numerical value on each selected property scale, and each property's values were summed and averaged over the peptide, resulting in a 76-dimensional property vector (38 scales, sum and mean of each), plus the 20 amino acid counts and the length and mass of the peptide. Finally, a binary output class label was added to each feature vector: 1 (positive) for proteotypic peptides and 0 otherwise. A sketch of this conversion for a single property scale is given below.
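To make the conversion concrete, here is a minimal sketch (not from the original tutorial) covering a single property scale. It assumes a hypothetical input file peptides.txt with one peptide sequence per line and uses the Kyte-Doolittle hydropathy scale as the example property; the real training file repeats this for all 38 scales and appends the 20 amino acid counts, mass, length, and the class label in the column layout LNKnet expects. Save it as kd_features.awk and run it with: awk -f kd_features.awk peptides.txt > kd.features
# kd_features.awk -- sum and mean of one property scale, plus peptide length
BEGIN {
    # Kyte-Doolittle hydropathy value for each amino acid
    scale["A"]=1.8;  scale["R"]=-4.5; scale["N"]=-3.5; scale["D"]=-3.5;
    scale["C"]=2.5;  scale["Q"]=-3.5; scale["E"]=-3.5; scale["G"]=-0.4;
    scale["H"]=-3.2; scale["I"]=4.5;  scale["L"]=3.8;  scale["K"]=-3.9;
    scale["M"]=1.9;  scale["F"]=2.8;  scale["P"]=-1.6; scale["S"]=-0.8;
    scale["T"]=-0.7; scale["W"]=-0.9; scale["Y"]=-1.3; scale["V"]=4.2;
}
{
    sum = 0
    len = length($1)
    for (i = 1; i <= len; i++) sum += scale[substr($1, i, 1)]
    # emit: property sum, property mean, peptide length
    printf "%.3f %.3f %d\n", sum, sum/len, len
}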
e. Feature selection: after creating the training file, a description file needs to be created with the same basename as the training file. For example, if you name your training file test.train, the description file should be test.defaults. The description file contains information about the database type, the number of inputs, the number of outputs, the class labels, and the input feature labels (the last two are optional). The LNKnet user guide states that "For many classifiers, classification results are improved when the data has been normalized in some way," so I chose simple normalization for the training input, test.train. You can do this either through the graphical interface or through the command line. Here is an example:
#!/bin/csh -ef
# ./test.norm.simple.run
norm_lnk -pathdata $path \
 -finput test.train -fdescribe test.defaults -normalization 1 \
 -fparam test.norm.simple -debug 0 -verbose 3 \
 |& nn_tee -h test.norm.simple.log
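Running this script writes the normalization parameters to test.norm.simple; the feature-selection and training runs below pick them up through their -fnorm test.norm.simple arguments.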
For feature selection, I used forward and backward searches. These searches select features one at a time based on the increase or decrease in the error rate, measured using cross-validation and any classifier. Once the feature selection search is complete, a subset of features can be selected for use in classification; this subset can be the first, and presumably most important, features. Here is the shell script to run forward and backward feature selection:
#!/bin/csh -ef
# ./test.for_bk.S.run
set loc=`pwd`

# feature selection
mlp_feat_sel -pathexp $loc -ferror X1mlp.err.cv -fparam X1mlp.param \
 -pathdata $path \
 -finput test.train -fdescribe test.defaults -npatterns 3084 -ninputs 98 \
 -normalize -fnorm test.norm.simple -cross_valid 4 \
 -fcross_valid vowel.test.cv -random_cv -random -seed 0 \
 -priors_npatterns 0 -debug 0 -verbose 3 -verror 0 \
 -nodes 98,25,2 -alpha 0.6 -etta 0.1 -etta_change_type 0 -epsilon 0.1 \
 -kappa 0.01 -etta_nepochs 0 -decay 0 -tolerance 0.01 -hfunction 0 \
 -ofunction 0 -sigmoid_param 1 -cost_func 0 -cost_param 1 -epochs 30 \
 -batch 1,1,0 -init_mag 0.1 \
 -search_type 2 -search_n 0 -search_verbosity 3 \
 -search_fparam test.for_bk.S.param -roc_target 0 -roc_lower 0 -roc_upper 1 \
 |& nn_tee -h test.for_bk.S.log
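The selected features are reported in the search output (test.for_bk.S.param and test.for_bk.S.log); the ordered feature list found there is what goes into the -features argument of the training run below.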
f. Training: after feature selection you have a list of selected features, and it is time to train the final classifier. Note that the 13 selected features are passed via -features, and the input layer is resized to match (-ninputs 13, -nodes 13,25,2).
#!/bin/csh -ef
# ./X1mlp.run
set loc=`pwd`

# train
(time mlp_lnk \
 -create -pathexp $loc -ferror X1mlp.err.train -fparam X1mlp.param \
 -pathdata $path \
 -finput test.train -fdescribe test.defaults -npatterns 3084 -ninputs 13 \
 -features 70,3,71,60,8,26,45,97,61,22,46,14,81 -normalize -fnorm test.norm.simple \
 -cross_valid 10 -fcross_valid vowel.test.cv -random -seed 0 -priors_npatterns 0 -debug 0 -verbose 3 \
 -verror 0 \
 -nodes 13,25,2 -alpha 0.6 -etta 0.04 -etta_change_type 0 -epsilon 0.1 \
 -kappa 0.01 -etta_nepochs 0 -decay 0 -tolerance 0.01 -hfunction 0 \
 -ofunction 0 -sig_param_list 1,1 -sigmoid_param 1 -cost_func 0 -cost_param 1 -epochs 30 \
 -batch 1,1,0 -init_mag 0.1 \
 ) |& nn_tee -h X1mlp.log
echo -n "X1mlp.run" >> $path/LNKnet.note
grep "LAST TRAIN EPOCH" X1mlp.log | tail -1 >> $path/LNKnet.note
echo "current directory:" >> X1mlp.log
echo $loc >> X1mlp.log
ESPPredictor:
Reference: Vincent A. Fusaro, D. R. Mani, Jill P. Mesirov & Steven A. Carr. Prediction of high-responding peptides for targeted protein assays by mass spectrometry. Nature Biotechnology (2009) 27:190-198.
Classifier: random forest.
How to run the module:
a. Run it using the GenePattern web service: there are detailed instructions at http://www.broadinstitute.org/cancer/software/genepattern/modules/ESPPredictor.html
b. Run it through the command line. System requirements: R, MATLAB, and Java JDK 1.5.0 installed (we have found that JDK 1.6 does not work).
1. Follow the first two steps of "How to run the module" on http://www.broadinstitute.org/cancer/software/genepattern/modules/ESPPredictor.html
2. Click Export (on the right-hand side of the Reset button, between Properties and Help) to export a zip file containing the program source files.
3. You need to modify the ESPPredictor.java file slightly so that it parses the command-line parameters correctly, since the CmdSplitter class it references does not exist. After this simple modification, my local ESPPredictor runs with the following command line. The "zzz" token is the separator between the input parameters for the MATLAB and R programs.
java -classpath <libdir>/../ ESPPredictor.ESPPredictor \
 <libdir> \
 peptideFeatureSet \
 <input.file> zzz \
 <R2.5_HOME> \
 <libdir>/ESP_Predictor.R \
 Predict \
 <libdir>PeptideFeatureSet.csv \
 <libdir>ESP_Predictor_Model_020708.RData
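For example, if the exported module is unpacked under /tools/ESPPredictor and R 2.5 lives in /usr/local/R-2.5.0 (all paths here are hypothetical; substitute your own), the call would look like:
java -classpath /tools/ESPPredictor/../ ESPPredictor.ESPPredictor \
 /tools/ESPPredictor/ \
 peptideFeatureSet \
 my_peptides.txt zzz \
 /usr/local/R-2.5.0 \
 /tools/ESPPredictor/ESP_Predictor.R \
 Predict \
 /tools/ESPPredictor/PeptideFeatureSet.csv \
 /tools/ESPPredictor/ESP_Predictor_Model_020708.RData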
NOTES: the *.java files in the package you download contain code that can be used to combine all the above steps, as in ESPPredictor.java.
Detectability Predictor:
Reference: H. Tang, R. J. Arnold, P. Alves, Z. Xun, D. E. Clemmer, M. V. Novotny, J. P. Reilly, P. Radivojac. A computational approach toward label-free protein quantification using predicted peptide detectability. Bioinformatics (2006) 22(14): e481-e488. http://www.iub.edu/~clemmer/Publications/pub%20113.pdf
Classifier: 30 two-layer feed-forward neural networks trained using the resilient back-propagation algorithm.
How to run:
1. Run it through the web service tool: http://darwin.informatics.indiana.edu/applications/PeptideDetectabilityPredictor/
2. Run it through the command line: you need to send a request to hatang@indiana.edu to download the standalone program.
APEX:
This is still under development. If you want more information, please contact Lars.
Combine Predictors:
Use the same set of peptides used to retrain PeptideSieve to train a combiner of the scores produced by the four predictors above. The training procedure is simple: the predictor scores are used as features, so there are four features per peptide, and the class labels are inherited from the PeptideSieve retraining file, test.train. The feature-selection step is skipped, since we want to include all the predictors. The training step is the same as above; a sketch is given below.
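As a sketch (the score and label file names here are hypothetical, and the column layout must match your description file), the four score columns can be pasted together with the class label into a four-feature LNKnet training file, and the earlier mlp_lnk call reused with the input layer resized to four:
#!/bin/csh -ef
# ./combine.run -- sketch; score/label file names are hypothetical
set loc=`pwd`

# one predictor score per column (one peptide per line), class label last
paste -d' ' peptidesieve.score esp.score detectability.score apex.score class.label > combine.train

# same mlp_lnk call as X1mlp.run, resized to the 4 predictor-score features
# (run norm_lnk on combine.train first to create combine.norm.simple)
(time mlp_lnk \
 -create -pathexp $loc -ferror combine.err.train -fparam combine.param \
 -pathdata $path \
 -finput combine.train -fdescribe combine.defaults -npatterns 3084 -ninputs 4 \
 -normalize -fnorm combine.norm.simple \
 -cross_valid 10 -random -seed 0 -priors_npatterns 0 -debug 0 -verbose 3 -verror 0 \
 -nodes 4,25,2 -alpha 0.6 -etta 0.04 -etta_change_type 0 -epsilon 0.1 \
 -kappa 0.01 -etta_nepochs 0 -decay 0 -tolerance 0.01 -hfunction 0 \
 -ofunction 0 -sigmoid_param 1 -cost_func 0 -cost_param 1 -epochs 30 \
 -batch 1,1,0 -init_mag 0.1 \
 ) |& nn_tee -h combine.log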