PeptideSieve Retraining Tutorial
Reference
Mallick P, Schirle M, Chen SS, Flory MR, Lee H, Martin D, Ranish J, Raught B, Schmitt R, Werner T, Kuster B, Aebersold R. Computational prediction of proteotypic peptides for quantitative proteomics. Nat Biotechnol. 2007 Jan;25(1):125-31.
Getting the PeptideSieve Software
The source code is available from PeptideSieve.
The binary executable can be downloaded from PeptideSieve.
Files required to run the program: properties.txt and one or more of the experiment-design files MUDPIT_ESI.txt, MUDPIT_ICAT.txt, PAGE_ESI.txt, and PAGE_MALDI.txt.
Getting LNKnet
LNKnet is public-domain software made available by MIT Lincoln Laboratory. It can be downloaded from LNKnet.
Installing LNKnet
Unzip the package and install LNKnet on your computer according to the INSTALL.src file. It is a good idea to go through the LNKnet quick-start guide using the graphical user interface. When you perform an action in the GUI, it offers to store a shell script; this script can then be edited and run from a shell window or called from another script.
Training Steps
Protein Selection
Use the PeptideAtlas build "Yeast non-ICAT PeptideAtlas 2009-12" and select proteins with a minimum sequence length of 200 AA and a protein coverage of 60-90%. 451 proteins were selected using these criteria. The protein list can be downloaded from [here].
Peptide Selection
- Proteotypic peptides: 1439 peptides were selected using the following three criteria (see the sketch after this list).
- Peptide length >= 7
- Observed in at least 4 samples
- Empirical Proteotypic Score (EPS) >= 0.5, where EPS = Nsamples(peptide) / Nsamples(parent protein)
- Non-proteotypic peptides: all unobserved peptides from the selected proteins are grouped into this category; there are 1645 peptides in this group.
The peptide list can be downloaded from [here].
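
To make the criteria concrete, here is a minimal C sketch of the EPS computation and the resulting filter. The struct, the field names, and the toy observation counts are illustrative only; in practice these numbers come from the PeptideAtlas build.

#include <stdio.h>
#include <string.h>

typedef struct {
    const char *sequence;
    int n_samples_peptide;   /* samples in which this peptide was observed */
    int n_samples_protein;   /* samples in which its parent protein was observed */
} Peptide;

/* Empirical Proteotypic Score: EPS = Nsamples(peptide) / Nsamples(parent protein) */
static double eps(const Peptide *p)
{
    return (double)p->n_samples_peptide / (double)p->n_samples_protein;
}

/* The three criteria: length >= 7, observed in >= 4 samples, EPS >= 0.5 */
static int is_proteotypic(const Peptide *p)
{
    return strlen(p->sequence) >= 7
        && p->n_samples_peptide >= 4
        && eps(p) >= 0.5;
}

int main(void)
{
    Peptide p = { "LVNELTEFAK", 8, 10 };   /* toy values: EPS = 8/10 = 0.8 */
    printf("EPS = %.2f  proteotypic = %d\n", eps(&p), is_proteotypic(&p));
    return 0;
}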
Selecting a Starting Feature Set
Parag Mallick started his feature selection with 1010 features (numeric physicochemical property scales for amino acids, peptide length, amino acid counts, and peptide mass). Feature selection over all of these is too slow, so I chose a subset based on the features PeptideSieve finally selected for the four experiment designs, plus the top 35 features from the ESP predictor. In this way I selected 98 features to start with: peptide mass, peptide sequence length, the 20 amino acid compositions, and 38 numeric physicochemical property scales for amino acids (mean and sum of the scores per peptide).
Converting the Peptide List to Property Vectors (the Training File)
In order to do the training, each peptide needs to be represented by a fixed-length vector of real- or discrete-valued features; in this case, each peptide is represented by 98 features. Each amino acid of a given peptide is replaced by a numerical value for each selected property, and the property values are summed and averaged over the peptide, resulting in a 76-dimensional property vector (38 scales × 2), plus the 20 amino acid compositions and the length and mass of the peptide. Finally, a binary output class label is added to each feature vector: 1 (positive) for proteotypic peptides and 0 otherwise.
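
Below is a minimal C sketch of this conversion for a single property scale, using the Kyte-Doolittle hydropathy scale as a stand-in for one of the 38 scales. The column order and the omission of peptide mass are simplifications for illustration; the real training file carries all 98 features per pattern.

#include <stdio.h>
#include <string.h>

#define NAA 20
static const char AA[NAA + 1] = "ACDEFGHIKLMNPQRSTVWY";

/* One stand-in property scale (Kyte-Doolittle hydropathy), in AA[] order.
   The real vector repeats the sum/mean step for all 38 scales. */
static const double HYDRO[NAA] = {
     1.8,  2.5, -3.5, -3.5,  2.8, -0.4, -3.2,  4.5, -3.9,  3.8,
     1.9, -3.5, -1.6, -3.5, -4.5, -0.8, -0.7,  4.2, -0.9, -1.3
};

/* Print one training pattern: length, 20 compositions, property sum and
   mean, then the class label (1 = proteotypic, 0 = not). Peptide mass is
   omitted here but is computed the same way, as a sum over residues. */
static void featurize(const char *pep, int label)
{
    int len = (int)strlen(pep), comp[NAA] = {0};
    double sum = 0.0;
    for (int i = 0; i < len; i++) {
        const char *hit = strchr(AA, pep[i]);
        if (!hit) continue;                 /* skip non-standard residues */
        int k = (int)(hit - AA);
        comp[k]++;
        sum += HYDRO[k];
    }
    printf("%d ", len);
    for (int k = 0; k < NAA; k++) printf("%d ", comp[k]);
    printf("%.3f %.3f %d\n", sum, sum / len, label);
}

int main(void)
{
    featurize("LVNELTEFAK", 1);   /* toy proteotypic example */
    return 0;
}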
Feature Selection
After creating the training file, a description file needs to be created with the same base name as the training file. For example, if you name your training file test.train, the description file should be test.defaults (the name the scripts below pass to -fdescribe). The description file contains information about the database type, the number of inputs, the number of outputs, the class labels, and the input feature labels; the last two are optional.
The LNKnet user guide states that "For many classifiers, classification results are improved when the data has been normalized in some way." I therefore chose the simple normalization method to normalize the training input, test.train. You can do this either through the graphical user interface or from the command line. Here is an example shell script (a sketch of what this normalization computes follows it):
#!/bin/csh -ef
# ./test.norm.simple.run
# $path should point to the directory containing the data files
norm_lnk -pathdata $path \
    -finput test.train -fdescribe test.defaults -normalization 1 \
    -fparam test.norm.simple -debug 0 -verbose 3 \
    |& nn_tee -h test.norm.simple.log
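
For reference, the sketch below shows the arithmetic that simple normalization applies, assuming the usual per-feature standardization (each input feature shifted and scaled to zero mean and unit variance over the training patterns); norm_lnk stores the per-feature statistics in test.norm.simple so that the same transform can be reapplied later.

#include <math.h>
#include <stdio.h>

/* Standardize one feature column to zero mean and unit variance. */
static void normalize_feature(double *col, int npatterns)
{
    double mean = 0.0, var = 0.0;
    for (int i = 0; i < npatterns; i++) mean += col[i];
    mean /= npatterns;
    for (int i = 0; i < npatterns; i++) var += (col[i] - mean) * (col[i] - mean);
    double sd = sqrt(var / npatterns);
    if (sd > 0.0)
        for (int i = 0; i < npatterns; i++) col[i] = (col[i] - mean) / sd;
}

int main(void)
{
    double mass[4] = { 900.5, 1450.7, 2100.9, 1230.6 };   /* toy peptide masses */
    normalize_feature(mass, 4);
    for (int i = 0; i < 4; i++) printf("%.3f\n", mass[i]);
    return 0;
}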
For feature selection, I used forward and backward searches. These searches select features one at a time, based on the increase or decrease in the error rate measured using cross-validation and any classifier. Once the feature selection search has completed, a subset of features can be selected for use in classification; this subset can be the first, and presumably most important, features. Here is the shell script to run forward and backward feature selection:
#!/bin/csh -ef
# ./test.for_bk.S.run
set loc=`pwd`
# feature selection
mlp_feat_sel -pathexp $loc -ferror X1mlp.err.cv -fparam X1mlp.param \
    -pathdata $path \
    -finput test.train -fdescribe test.defaults -npatterns 3084 -ninputs 98 \
    -normalize -fnorm test.norm.simple -cross_valid 4 \
    -fcross_valid vowel.test.cv -random_cv -random -seed 0 \
    -priors_npatterns 0 -debug 0 -verbose 3 -verror 0 \
    -nodes 98,25,2 -alpha 0.6 -etta 0.1 -etta_change_type 0 -epsilon 0.1 \
    -kappa 0.01 -etta_nepochs 0 -decay 0 -tolerance 0.01 -hfunction 0 \
    -ofunction 0 -sigmoid_param 1 -cost_func 0 -cost_param 1 -epochs 30 \
    -batch 1,1,0 -init_mag 0.1 \
    -search_type 2 -search_n 0 -search_verbosity 3 \
    -search_fparam test.for_bk.S.param -roc_target 0 -roc_lower 0 -roc_upper 1 \
    |& nn_tee -h test.for_bk.S.log
Training
After feature selection you get a list of selected features, and then it is time for training. We use the MLP (multi-layer perceptron) classifier for the training:
#!/bin/csh -ef
# ./X1mlp.run
set loc=`pwd`
# train
(time mlp_lnk \
    -create -pathexp $loc -ferror X1mlp.err.train -fparam X1mlp.param \
    -pathdata $path \
    -finput test.train -fdescribe test.defaults -npatterns 3084 -ninputs 13 \
    -features 70,3,71,60,8,26,45,97,61,22,46,14,81 -normalize -fnorm test.norm.simple \
    -cross_valid 10 -fcross_valid vowel.test.cv -random -seed 0 -priors_npatterns 0 \
    -debug 0 -verbose 3 -verror 0 \
    -nodes 13,25,2 -alpha 0.6 -etta 0.01 -etta_change_type 0 -epsilon 0.1 \
    -kappa 0.01 -etta_nepochs 0 -decay 0 -tolerance 0.01 -hfunction 0 \
    -ofunction 0 -sig_param_list 1,1 -sigmoid_param 1 -cost_func 0 -cost_param 1 \
    -epochs 30 -batch 1,1,0 -init_mag 0.1 \
) |& nn_tee -h X1mlp.log
echo -n "X1mlp.run" >> $path/LNKnet.note
grep "LAST TRAIN EPOCH" X1mlp.log | tail -1 >> $path/LNKnet.note
echo "current directory:" >> X1mlp.log
echo $loc >> X1mlp.log
As you can see from the script above, 13 features were selected from the 98 input features by the feature selection program and used for the final training. You may get slightly different error rates when you vary the number of epochs, the etta value, and the number of hidden nodes. The training file is available [here].
Code Generation Using an LNKnet Parameter File
LNKnet has a filter program for each classification algorithm that generates C subroutines for pattern classification. Each filter program takes an algorithm parameter file as an argument; in this case it is the X1mlp.param file. The program prints a subroutine, classify(), to the UNIX standard output stream. This subroutine can be called from a C program to classify patterns (see the sketch after the script below). The command to run the code generation:
#!/bin/csh -ef
# ./X1mlp.c.run
mlp2c -model_file X1mlp.param -suffix X1mlp >! X1mlp.c
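
A caller for the generated subroutine might look like the hypothetical sketch below. The classify() prototype shown is an assumption for illustration; check the top of the generated X1mlp.c for the actual signature, and remember that the inputs must be the same 13 selected features, normalized the same way as the training data.

#include <stdio.h>

/* Assumed prototype -- verify it against the generated X1mlp.c. */
extern int classify(float *inputs, float *outputs);

int main(void)
{
    float inputs[13] = {0};   /* the 13 selected features, in -features order */
    float outputs[2];         /* one score per class */

    /* ... fill inputs[] with a peptide's normalized feature values ... */

    int winner = classify(inputs, outputs);   /* assumed to return the winning class */
    printf("predicted class %d (score %f)\n", winner, outputs[winner]);
    return 0;
}

Compile it together with the generated file (for example, cc main.c X1mlp.c).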
Creating a New MUDPIT_ESI.txt File for PeptideSieve
Finally, we have reached our goal of creating a new parameter file for PeptideSieve. In this retraining, we retrained only the parameter file for the experiment type MUDPIT_ESI.
To use the retrained PeptideSieve, you only need to replace the old MUDPIT_ESI.txt file with this new [file].