Recipe for generating proteotypic peptides for example database

From SPCTools

(Difference between revisions)
Revision as of 19:12, 24 September 2009
Dcampbel (Talk | contribs)

Revision as of 19:15, 24 September 2009
Dcampbel (Talk | contribs)


Revision as of 19:15, 24 September 2009


Notes on creating proteotypic peptide information, based on the Human 2009-04 build and modified to include the changes needed to show (almost) all peptides. Updated based on the 2009-08 mouse build; see the build-specific notes further down for additional info and examples. The root directory for this work is /net/db/projects/PeptideAtlas/species


1) Set up reference database file

Add a couple of directories to your PATH (bash syntax):

    export PATH=/regis/sbeams/bin/:/package/genome/tmhmm_sigp_wrapper/:/net/db/projects/PeptideAtlas/species/bin/:$PATH
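Before starting, it can help to confirm that the scripts used in this recipe resolve on the PATH. The following is a minimal sketch; check_tools is a hypothetical helper name, not part of SBEAMS or PeptideAtlas.

```shell
# Hypothetical helper: report which of the named commands cannot be
# found on the current PATH. Returns 0 when all are found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not on PATH: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_tools processFasta.pl split_fasta.pl sortEnsFirst.pl calculateNGenomeMappings.pl lists any of those scripts that the shell cannot find.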

Make the processing directory, cd there, and assemble the source data. Here organism and date are placeholders for the species and the build date:

    cd /net/db/projects/PeptideAtlas/species
    mkdir -p organism/date

Get the database file to work on; see the mouse build instructions in the second half of these notes if it needs to be assembled:

    cd /net/db/projects/PeptideAtlas/species/organism/date
    cp reference_db.fasta .

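Assuming the accessions are correct, filter out decoys and trim proteins longer than 8999 AA (which choke Peptide Sieve). The -t 8999 option does the trimming, as in the mouse build command later in these notes:

    processFasta.pl -f reference_db.fasta -t 8999 -r 'DECOY_' -e -o reference_db_no-decoys.fasta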
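A quick way to confirm that the filtered file really has no decoys and no over-long proteins is to scan it with awk. This is a sketch only; fasta_sanity is a hypothetical helper and it assumes decoy entries carry DECOY_ somewhere in the header line.

```shell
# Hypothetical helper: print the number of headers containing DECOY_ and
# the longest sequence length in a FASTA file. Works whether sequences
# are on one line or wrapped across several.
fasta_sanity() {
    awk '
        /^>/ { if (len > max) max = len; len = 0; if ($0 ~ /DECOY_/) decoys++; next }
             { len += length($0) }
        END  { if (len > max) max = len; printf "decoys=%d max_len=%d\n", decoys, max }
    ' "$1"
}
```

Running fasta_sanity reference_db_no-decoys.fasta should then report decoys=0 and a max_len of at most 8999 if the filtering and trimming worked as intended.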

Break the file into bite-sized chunks:

    split_fasta.pl --entries 10000 --filename_root input_split reference_db_no-decoys.fasta
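After splitting, the chunks together should hold the same number of entries as the input. A minimal sketch of that check follows; count_entries is a hypothetical helper, and the names of the chunk files written by split_fasta.pl are not given in these notes, so the glob in the usage note is an assumption.

```shell
# Hypothetical helper: count FASTA entries (lines starting with ">")
# across one or more files.
count_entries() {
    awk '/^>/ { n++ } END { print n + 0 }' "$@"
}
```

Compare count_entries reference_db_no-decoys.fasta with count_entries input_split* (assuming the chunk names start with the --filename_root value); the two numbers should match.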


2) Run predictor algorithms.

Symlink the binaries:

    ln -s /net/db/src/DetectabilityPredictor/Standalone/PeptideDetectabilityPredictor
    ln -s /net/db/src/DetectabilityPredictor/Standalone/stand.bin

Run the predictor, then merge into one results file. These scripts automate the searching and are located in /net/db/projects/PeptideAtlas/species/bin:

    run_PDP.csh    # runs predictor on each sub-file

Once the run is complete:

    mk_PDP.csh     # merges results files into results.PDP


Peptide Sieve run

Symlink the binary and its properties file:

    ln -s /net/db/projects/PeptideAtlas/ExternalData/proteotypic/bin/PepSieve_20080530/peptideSieve_080527/peptideSieve/bin/PeptideSieve .
    ln -s /net/db/projects/PeptideAtlas/ExternalData/proteotypic/bin/PepSieve_20080530/peptideSieve_080527/peptideSieve/properties.txt .

Run Peptide Sieve, then merge into one results file. These scripts automate the searching and are located in /net/db/projects/PeptideAtlas/species/bin:

    run_PS.csh     # runs predictor on each sub-file

Once the run is complete:

    mk_PS.csh      # merges results files into results.PS


Merge the results from the two prediction engines:

    /regis/sbeams/bin/mergeProteotypicScores.pl -f reference_db_no-decoys.fasta -p results.PS -d results.PDP -o proteotypic_merged.tsv

Sort with the ENS (Ensembl) entries first for mapping, so that the same peptides from proteins without a mapping can then borrow the mapping info. For yeast use the -y flag; other non-ENS organisms may need their own flags.

    sortEnsFirst.pl proteotypic_merged.tsv > proteotypic_merged-sorted.tsv
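To illustrate what "ENS entries first" means, the sketch below moves rows whose accession starts with ENS to the top while keeping the original order inside each group. It is not the sortEnsFirst.pl implementation; ens_first is a hypothetical name, and the assumption that the protein accession is the first tab-separated column is mine, since the column layout of proteotypic_merged.tsv is not described here.

```shell
# Hypothetical illustration (not sortEnsFirst.pl): print rows whose first
# tab-separated field starts with "ENS" before all other rows, preserving
# the original order within each group.
# ASSUMPTION: the protein accession is in column 1.
ens_first() {
    awk -F '\t' '$1 ~ /^ENS/' "$1"
    awk -F '\t' '$1 !~ /^ENS/' "$1"
}
```

Because the file is read twice instead of being sorted, rows keep their relative order, which matters if a later step takes the first mapping it sees for a peptide.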

Finally, calculate the genome mappings. If the calc script is invoked with no arguments, it prints a usage statement that includes the current (2009-09) ENS mapping options ( -d species_core_52_37e ).

    nohup calculateNGenomeMappings.pl -f reference_db_no-decoys.fasta -p proteotypic_merged-sorted.tsv -o proteotypic_merged-sorted-mapped.tsv -d species_core_52_37e &


End general notes section


This section outlines the steps taken to process the new mouse reference database 2009-08.

Add /regis/sbeams/bin to PATH:

    export PATH=/regis/sbeams/bin/:$PATH

1: Fetch up-to-date data sources, do some light processing.

IPI - version 3.62 (mouse 3.62 56733)

    wget ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.MOUSE.fasta.gz
    wget ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/README -O readme.ipi
    gunzip ipi.MOUSE.fasta.gz

Fix the IPI fasta accession line; this also forces sequences onto one line:

    processFasta.pl -f ipi.MOUSE.fasta -i -v -o mouse_ipi_fixed-acc.fasta

Ensembl (ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/pep/)

    wget ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/pep/Mus_musculus.NCBIM37.55.pep.all.fa.gz
    gunzip Mus_musculus.NCBIM37.55.pep.all.fa.gz
    wget ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/pep/README -O readme.ens

Streamline each sequence to a single line:

    processFasta.pl -f Mus_musculus.NCBIM37.55.pep.all.fa -v -o mouse_ensembl.fasta
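For readers without access to processFasta.pl, the "one sequence per line" step can be approximated with awk. This is a minimal sketch of the unwrapping only (fasta_one_line is a hypothetical name); it does not rewrite accessions or do anything else processFasta.pl may do.

```shell
# Hypothetical helper: join wrapped sequence lines so that every FASTA
# entry is exactly one header line followed by one sequence line.
fasta_one_line() {
    awk '
        /^>/ { if (seq != "") print seq; print; seq = ""; next }
             { seq = seq $0 }
        END  { if (seq != "") print seq }
    ' "$1"
}
```

Single-line sequences make later line-based tools (grep, wc, sort) much easier to apply, since each entry then occupies exactly two lines.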

Swiss Prot

Fetch and then filter with processFasta, extracting the MOUSE entries and fixing the accessions.

Main sp file (the download is gzipped, so uncompress it before processing):

    wget ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.fasta.gz
    gunzip uniprot_sprot.fasta.gz
    processFasta.pl -f uniprot_sprot.fasta -v -s -r '_MOUSE' -o mouse_sprot_main.fasta

Isoforms file (again, uncompress after download):

    wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot_varsplic.fasta.gz
    gunzip uniprot_sprot_varsplic.fasta.gz
    /regis/sbeams/bin/extractFasta.pl -f uniprot_sprot_varsplic.fasta -r '_MOUSE' -s -o swiss-prot_varsplice_mouse.fasta
    processFasta.pl -f uniprot_sprot_varsplic.fasta -v -s -r '_MOUSE' -o mouse_sprot_isoforms.fasta
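The '_MOUSE' filtering in the Swiss Prot steps keeps only entries whose header matches a pattern. A rough stand-alone equivalent is sketched here; fasta_filter is a hypothetical name, and the sketch only selects entries, it does not reproduce the accession fixing that the -s option appears to perform.

```shell
# Hypothetical helper: print only the FASTA entries (header plus sequence
# lines) whose header line matches the given regular expression.
# Usage: fasta_filter PATTERN FILE
fasta_filter() {
    awk -v pat="$1" '
        /^>/ { keep = ($0 ~ pat) }
        keep
    ' "$2"
}
```

For instance, fasta_filter '_MOUSE' uniprot_sprot.fasta prints just the mouse entries, which is handy for spot-checking the counts reported by processFasta.pl.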

Concatenate, using processFasta to eliminate redundancy:

    processFasta.pl -f mouse_sprot_main.fasta -f mouse_sprot_isoforms.fasta -m -o mouse_sprot_merged.fasta -v

cRAP

    cp /regis/dbase/users/sbeams/cRAP/crap.fasta .
    processFasta.pl -f crap.fasta -s -o crap_clean.fasta

Decoys from original search/reference database (not common?)

    cp /net/db/projects/PeptideAtlas/pipeline/output/Mouse_2008-12_Ens47_P0.9/DATA_FILES/Mus_musculus.fasta ./Original_Mouse_reference_db.fasta
    processFasta.pl -f Original_Mouse_reference_db.fasta -r 'DECOY_' -v -o mouse_ipi_decoys.fasta


Concatenate all together!

    cat mouse_ensembl.fasta mouse_sprot_merged.fasta mouse_ipi_fixed-acc.fasta mouse_ipi_decoys.fasta crap_clean.fasta > mouse_reference_2009-08.fasta

Count unique/redundant seqs by 'merging' the file to itself!

    processFasta.pl -f mouse_reference_2009-08.fasta -m -v -o mouse_reference_non-redundant_2009-08.fasta

    total_files => 1
    total_seqs => 133420
    unique => 76830
    redundant => 56590
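The counts reported here are consistent with redundant = total_seqs - unique (133420 - 76830 = 56590). The sketch below computes the same three figures independently, assuming that "unique" means distinct amino-acid sequences regardless of header; that definition is my assumption about processFasta.pl, and fasta_redundancy is a hypothetical name.

```shell
# Hypothetical helper: count total, unique, and redundant sequences in one
# or more FASTA files. ASSUMPTION: two entries are duplicates when their
# sequences are identical, whatever their headers say.
fasta_redundancy() {
    awk '
        function tally() {
            if (have) {
                total++
                if (!(seq in seen)) { seen[seq] = 1; unique++ }
            }
        }
        /^>/ { tally(); have = 1; seq = ""; next }
             { seq = seq $0 }
        END  {
            tally()
            printf "total_seqs => %d\nunique => %d\nredundant => %d\n", total, unique, total - unique
        }
    ' "$@"
}
```

Running fasta_redundancy mouse_reference_2009-08.fasta gives a cross-check on the processFasta.pl summary; a mismatch would suggest the two tools define redundancy differently.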

Finally, for the proteotypic peptide work, remove the DECOY seqs and trim any sequences longer than 8999 AA:

    processFasta.pl -f mouse_reference_2009-08.fasta -t 8999 -r DECOY_ -e -v -o mouse_reference_trimmed_no-decoy.fasta

2: Run the proteotypic scripts.

Summary counts reported for the mouse run:

    kin => 3238189
    no_pdp_prot => 488
    no_ps_prot => 819640
    orphan => 956
    pdp_nan => 3086
    pdp_no => 10123
    pdp_ok => 3229022
    prots => 121195
    psieve_cterm_no => 17555
    psieve_cterm_ok => 44824
    psieve_has_first => 73870
    psieve_has_last => 62903
    psieve_no => 1091825
    psieve_nterm_no => 31861
    psieve_nterm_ok => 41755
    psieve_ok => 2011325

wc output for the merged file:

    3239145 merged_proteotypic.tsv
    3239145 total
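The counters can be cross-checked against the wc figure with simple shell arithmetic. The meaning of each counter is not documented in these notes, so the two relationships tested here are observations from the numbers themselves, not documented invariants.

```shell
# Values copied from the run summary and the wc output in these notes.
kin=3238189
orphan=956
pdp_ok=3229022
pdp_no=10123
merged_lines=3239145

# Observation (not a documented rule): both sums equal the line count.
for pair in "kin + orphan:$((kin + orphan))" "pdp_ok + pdp_no:$((pdp_ok + pdp_no))"; do
    label=${pair%%:*}
    sum=${pair##*:}
    if [ "$sum" -eq "$merged_lines" ]; then
        echo "$label = $sum (matches merged line count)"
    else
        echo "$label = $sum (merged file has $merged_lines lines)"
    fi
done
```

Both sums come to 3239145, so each row of the merged file appears to be classified as either kin or orphan, and as either pdp_ok or pdp_no.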

    mergeProteotypicScores.pl -f mouse_reference_nodecoys_2009-08.fasta -p results.PS -d results.PDP -o proteotypic_merged.tsv
    sortEnsFirst.pl proteotypic_merged.tsv > proteotypic_merged-sorted.tsv
    nohup calculateNGenomeMappings.pl -f mouse_reference_nodecoys_2009-08.fasta -p proteotypic_merged-sorted.tsv -o proteotypic_merged-sorted-mapped.tsv -d mus_musculus_core_52_37e &
