Recipe for generating proteotypic peptides for example database

From SPCTools

Revision as of 19:15, 24 September 2009
Dcampbel (Talk | contribs)


Notes on creating proteotypic peptide information, based on the Human 2009-04 build and modified to include the changes needed to show (almost) all peptides. Updated based on the 2009-08 mouse build; see the specific build README below for additional info and examples. The root directory for this work is /net/db/projects/PeptideAtlas/species


1. Set up reference database file

# Add a couple of dirs to your PATH (bash syntax)
export PATH=/regis/sbeams/bin/:/package/genome/tmhmm_sigp_wrapper/:/net/db/projects/PeptideAtlas/species/bin/:$PATH
# Make the processing dir (organism and date are placeholders), cd there, and assemble source data.
cd /net/db/projects/PeptideAtlas/species
mkdir -p organism/date
# Get the database file to work on (copy it here from wherever it lives) - see the mouse build instructions below if this needs to be assembled.
cd /net/db/projects/PeptideAtlas/species/organism/date
cp reference_db.fasta .
# Assuming accessions are correct, filter out decoys and trim proteins longer than 8999 AA (which choke Peptide Sieve)
processFasta.pl -f reference_db.fasta -t 8999 -r 'DECOY_' -e -o reference_db_no-decoys.fasta
# Break the file into bite-sized chunks!
split_fasta.pl --entries 10000 --filename_root input_split reference_db_no-decoys.fasta
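
A quick sanity check after filtering and splitting can catch problems before the predictors run. The following is only a sketch and is not part of the SPCTools scripts: the function names are made up for illustration, and the `input_split*` glob in the usage note assumes the chunk files start with the `--filename_root` value.

```shell
# Sketch only: helper functions for checking FASTA files by hand.

# Count the entries (deflines) in one or more FASTA files.
count_fasta_entries() {
  awk '/^>/ { n++ } END { print n + 0 }' "$@"
}

# Print accession and length (tab-separated) of every sequence longer
# than max_len. Sequences wrapped over several lines are handled.
report_long_seqs() {
  local max_len
  max_len="$1"
  shift
  awk -v max="$max_len" '
    function report() { if (name != "" && len > max) print name "\t" len }
    /^>/ { report(); name = substr($1, 2); len = 0; next }
    { len += length($0) }
    END { report() }
  ' "$@"
}
```

For example, `count_fasta_entries reference_db_no-decoys.fasta` and `count_fasta_entries input_split*` should print the same number, and `report_long_seqs 8999 reference_db_no-decoys.fasta` should print nothing if the trim was applied.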


2. Run predictor algorithms.

A) Peptide Detectability Predictor

- symlink binaries.
ln -s /net/db/src/DetectabilityPredictor/Standalone/PeptideDetectabilityPredictor
ln -s /net/db/src/DetectabilityPredictor/Standalone/stand.bin
- run wrapper scripts that automate the searching, located in /net/db/projects/PeptideAtlas/species/bin (already on the PATH set in step 1)
run_PDP.csh    runs predictor on each sub-file
- Once the run is complete...
mk_PDP.csh     merges results files into results.PDP

B) Run Peptide Sieve

- symlink binaries.
ln -s /net/db/projects/PeptideAtlas/ExternalData/proteotypic/bin/PepSieve_20080530/peptideSieve_080527/peptideSieve/bin/PeptideSieve .
ln -s /net/db/projects/PeptideAtlas/ExternalData/proteotypic/bin/PepSieve_20080530/peptideSieve_080527/peptideSieve/properties.txt .
- run wrapper scripts
run_PS.csh    runs predictor on each sub-file
- Once the run is complete...
mk_PS.csh     merges results files into results.PS
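
Before running either merge script it can help to confirm that every chunk actually produced output. The sketch below assumes each result file sits next to its chunk and is named by appending a suffix to the chunk name. That naming is hypothetical, since the wrapper scripts' real output names are not documented here, so adjust the suffix to whatever run_PDP.csh and run_PS.csh write.

```shell
# Sketch only: list chunk files that have no non-empty result file.
# Usage: missing_results SUFFIX CHUNK...
# SUFFIX is a hypothetical naming convention (result = chunk name + suffix).
missing_results() {
  local suffix
  local chunk
  suffix="$1"
  shift
  for chunk in "$@"; do
    # -s is true only if the file exists and has a size greater than zero
    [ -s "${chunk}${suffix}" ] || echo "$chunk"
  done
}
```

Any chunk names it prints are candidates for a rerun before merging.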


3. Merge and process results from the two prediction engines.

- Merge predictions
/regis/sbeams/bin/mergeProteotypicScores.pl -f reference_db_no-decoys.fasta -p results.PS -d results.PDP -o proteotypic_merged.tsv
- Sort with ENS entries first for mapping, since the same peptides from proteins without mapping can then borrow the mapping info. 
- (For yeast use the -y flag, other non-ENS organism flags may be needed. )
sortEnsFirst.pl proteotypic_merged.tsv > proteotypic_merged-sorted.tsv
- Calculate genome mappings.  If the calc script is invoked with no args, it prints a usage statement that includes the
- current (2009-09) ENS mapping database options ( e.g. -d mus_musculus_core_52_37e )
nohup calculateNGenomeMappings.pl -f reference_db_no-decoys.fasta -p proteotypic_merged-sorted.tsv -o proteotypic_merged-sorted-mapped.tsv -d species_core_52_37e &
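
To illustrate the idea behind the ENS-first sort in this step, here is a minimal sketch. It is not the logic of sortEnsFirst.pl, which is not shown in these notes. It assumes, purely for illustration, a tab-separated file with the protein accession in the first column and no header line.

```shell
# Sketch only: write rows whose first column starts with ENS before all
# other rows, keeping the original order within each group.
# Assumes the accession is in column 1 (hypothetical layout).
ens_first() {
  local file
  file="$1"
  awk -F'\t' '$1 ~ /^ENS/' "$file"
  awk -F'\t' '$1 !~ /^ENS/' "$file"
}
```

With ENS rows first, a peptide is seen on its Ensembl protein before it is seen on any unmapped protein, which is what lets the later rows borrow the mapping.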
  • End general build notes section





The next step is to load these into the database for later use.  This requires that the reference database used is already loaded as a biosequence set. The load_theo_proteotypic_scores.pl script is located in the SBEAMS/PeptideAtlas codebase in $sbeams/lib/scripts/PeptideAtlas.

nohup load_theo_proteotypic_scores.pl -i merged_results_sorted_mapped_final.tsv -s Hs_ENSP_IPI_SPvarspl_decoy &
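
Since the load writes to the database, a dry run first may be worthwhile. The wrapper below is a sketch, not part of SBEAMS: the function name and the LOADER variable are made up here, while the `--set_tag`, `--input_file`, and `--testonly` options are taken from the usage statement below.

```shell
# Sketch only: refuse to start a load if the input file is missing or
# empty, then pass any extra options (e.g. --testonly) to the loader.
# LOADER is a hypothetical override, handy for testing the wrapper itself.
LOADER="${LOADER:-load_theo_proteotypic_scores.pl}"

load_scores() {
  local input
  local set_tag
  input="$1"
  set_tag="$2"
  shift 2
  if [ ! -s "$input" ]; then
    echo "input file missing or empty: $input" >&2
    return 1
  fi
  "$LOADER" --set_tag "$set_tag" --input_file "$input" "$@"
}
```

A dry run would then look like `load_scores merged_results_sorted_mapped_final.tsv Hs_ENSP_IPI_SPvarspl_decoy --testonly`, followed by the same call without `--testonly` once the output looks sane.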





Usage statements for various scripts

 Usage: load_theo_proteotypic_scores.pl [OPTIONS]
 Options:
   --verbose n            Set verbosity level.  default is 0
   --quiet                Set flag to print nothing at all except errors
   --debug n              Set debug flag
   --testonly             If set, rows in the database are not changed or added
   --list                 If set, list the available builds and exit
   --help                 print this usage and exit.
   --purge_mappings       Delete peptide mappings pertaining to this set
   --set_tag              Name of the biosequence set tag
   --update_peptide_info  will update info in proteotypic_peptide table, e.g.
                          pI, mw, SSRCalc, Peptide Sieve.  Does *not* currently
                          update info in proteotypic_peptide_mapping table, so
                          one should run purge_mappings first and then update.
   --input_file           Name of file with PepSeive and Indiana scores, as
                          well as n_mapping info.


  e.g.: load_theo_proteotypic_scores.pl --list
        load_theo_proteotypic_scores.pl --set_tag 'YeastCombNR_20070207_ForwDecoy' --input_file 'proteotypic_peptide.txt'
        load_theo_proteotypic_scores.pl --delete_set 'YeastCombNR_20070207_ForwDecoy'


 Usage: processFasta.pl -f fasta_file -r regex [ -s (-i) -t 9000 -o output_file ]
 Usage: processFasta.pl -f fasta one -f fasta two -f fasta three -m -v -o output_file
 
 -f, --fasta_file    Name of fasta input file, required
 -o, --output_file   Name of output file, defaults to STDOUT
 -r, --regex         Regular expression applied to defline to define matching
                     subset
 -e, --exclude       'Invert' regex, ie exclude matches instead of non-matches
 -m, --merge_files   Merges two or more fasta files to a sequence unique
                     combined file.  The first accession encountered for a
                     given sequence is kept.  Does not honor -t, -s, or -r
                     options.
 -v, --verbose       Verbose output
 -t, --trim_seq      Trims the sequence to the specified number of characters.
 -h, --help          Print this usage and exit
 -s, --sprot_extract If set, run sprot extraction to pull accession from pipe
                     delimited descriptor.
 -i, --ipi_extract   If set, perform standard 'fix' to descriptor line

  
Usage: calculateNGenomeMappings.pl -p peptide_file -f protein_fasta_file -o output_file -m map_adaptor

     -f, --fasta_file    Reference fasta file of proteins for mapping (req)
     -p, --peptide_file  Merged and sorted file of proteotypic scored
                         peptides (req)
     -o, --output_file   Output file name for proteotypic peptides with
                         mapping (req)
     -d, --dbname        Ensembl mapping database, see below for version
                         52 values (req)
     -y, --yeast         Use Yeast SGD accessions for Ensembl mapping

DB names of 2009-08-15:

bos_taurus_core_52_4b
caenorhabditis_elegans_core_52_190
drosophila_melanogaster_core_52_54a
homo_sapiens_core_52_36n
mus_musculus_core_52_37e
pan_troglodytes_core_52_21j
rattus_norvegicus_core_52_34u
saccharomyces_cerevisiae_core_52_1i



This section outlines the steps taken to create and process the new mouse reference database 2009-08.

# Add /regis/sbeams/bin to PATH
export PATH=/regis/sbeams/bin/:$PATH

1: Fetch up-to-date data sources, do some light processing.

IPI - version 3.62 (mouse 3.62 56733)
wget ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.MOUSE.fasta.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/README -O readme.ipi
gunzip ipi.MOUSE.fasta.gz
# Fix IPI fasta accession line; this also forces seqs onto one line.
processFasta.pl -f ipi.MOUSE.fasta -i -v -o mouse_ipi_fixed-acc.fasta

Ensembl - ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/pep/
wget ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/pep/Mus_musculus.NCBIM37.55.pep.all.fa.gz
gunzip Mus_musculus.NCBIM37.55.pep.all.fa.gz
wget ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/pep/README -O readme.ens
# Streamline each sequence to a single line
processFasta.pl -f Mus_musculus.NCBIM37.55.pep.all.fa -v -o mouse_ensembl.fasta

Swiss-Prot
# Fetch and then filter with processFasta, extracting MOUSE entries and fixing accessions.
# Main Swiss-Prot file
wget ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.fasta.gz
gunzip uniprot_sprot.fasta.gz
processFasta.pl -f uniprot_sprot.fasta -v -s -r '_MOUSE' -o mouse_sprot_main.fasta

# Isoforms file
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot_varsplic.fasta.gz
gunzip uniprot_sprot_varsplic.fasta.gz
/regis/sbeams/bin/extractFasta.pl -f uniprot_sprot_varsplic.fasta -r '_MOUSE' -s -o swiss-prot_varsplice_mouse.fasta
processFasta.pl -f uniprot_sprot_varsplic.fasta -v -s -r '_MOUSE' -o mouse_sprot_isoforms.fasta

# Concatenate - use processFasta to eliminate redundancy...
processFasta.pl -f mouse_sprot_main.fasta -f mouse_sprot_isoforms.fasta -m -o mouse_sprot_merged.fasta -v

cRAP
cp /regis/dbase/users/sbeams/cRAP/crap.fasta .
processFasta.pl -f crap.fasta -s -o crap_clean.fasta

Decoys from original search/reference database (not common?)
cp /net/db/projects/PeptideAtlas/pipeline/output/Mouse_2008-12_Ens47_P0.9/DATA_FILES/Mus_musculus.fasta ./Original_Mouse_reference_db.fasta
processFasta.pl -f Original_Mouse_reference_db.fasta -r 'DECOY_' -v -o mouse_ipi_decoys.fasta


# Concatenate all together!
cat mouse_ensembl.fasta mouse_sprot_merged.fasta mouse_ipi_fixed-acc.fasta mouse_ipi_decoys.fasta crap_clean.fasta > mouse_reference_2009-08.fasta

# Count unique/redundant seqs by 'merging' the file to itself!
processFasta.pl -f mouse_reference_2009-08.fasta -m -v -o mouse_reference_non-redundant_2009-08.fasta

total_files => 1
total_seqs => 133420
unique => 76830
redundant => 56590
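
The counts reported above are consistent with each other (76830 unique + 56590 redundant = 133420 total). To cross-check numbers like these independently of processFasta.pl, a small awk pass is enough. The function below is a sketch with a made-up name. It treats two entries as redundant when their full sequences are identical, which is an assumption about how the merge option compares them.

```shell
# Sketch only: count total, unique, and redundant sequences in FASTA
# input. Wrapped sequences are joined before comparison.
fasta_redundancy() {
  awk '
    function store() {
      if (have) {
        total++
        if (!(seq in seen)) { seen[seq] = 1; unique++ }
      }
    }
    /^>/ { store(); have = 1; seq = ""; next }
    { seq = seq $0 }
    END {
      store()
      printf "total_seqs => %d unique => %d redundant => %d\n", total, unique, total - unique
    }
  ' "$@"
}
```

Run as `fasta_redundancy mouse_reference_2009-08.fasta`. Every sequence is held in memory, so expect noticeable memory use on large files.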

# Finally, for the proteotypic peptide stuff, remove DECOY seqs and trim any long (> 9000) sequences.
processFasta.pl -f mouse_reference_2009-08.fasta -t 8999 -r DECOY_ -e -v -o mouse_reference_trimmed_no-decoy.fasta
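
For readers without access to processFasta.pl, the effect of the `-r`, `-e`, and `-t` options as described in the usage statement (drop entries whose defline matches a regex, trim sequences to a maximum length) can be approximated with awk. This is a sketch under that reading of the usage text, not the script's actual implementation, and the function name is hypothetical.

```shell
# Sketch only: drop entries whose defline matches PATTERN, trim the
# remaining sequences to MAX_LEN characters, and write each sequence on
# a single line.
# Usage: filter_and_trim PATTERN MAX_LEN FASTA...
filter_and_trim() {
  local pattern
  local max_len
  pattern="$1"
  max_len="$2"
  shift 2
  awk -v pat="$pattern" -v max="$max_len" '
    function emit() {
      if (name != "" && name !~ pat) { print name; print substr(seq, 1, max) }
    }
    /^>/ { emit(); name = $0; seq = ""; next }
    { seq = seq $0 }
    END { emit() }
  ' "$@"
}
```

For example, `filter_and_trim 'DECOY_' 8999 mouse_reference_2009-08.fasta > trimmed.fasta`, where the output name is arbitrary.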

2: Run the proteotypic scripts.

kin => 3238189
no_pdp_prot => 488
no_ps_prot => 819640
orphan => 956
pdp_nan => 3086
pdp_no => 10123
pdp_ok => 3229022
prots => 121195
psieve_cterm_no => 17555
psieve_cterm_ok => 44824
psieve_has_first => 73870
psieve_has_last => 62903
psieve_no => 1091825
psieve_nterm_no => 31861
psieve_nterm_ok => 41755
psieve_ok => 2011325

Count (from wc) for the merged file:

3239145 merged_proteotypic.tsv

mergeProteotypicScores.pl -f mouse_reference_nodecoys_2009-08.fasta -p results.PS -d results.PDP -o proteotypic_merged.tsv
sortEnsFirst.pl proteotypic_merged.tsv > proteotypic_merged-sorted.tsv

nohup calculateNGenomeMappings.pl -f mouse_reference_nodecoys_2009-08.fasta -p proteotypic_merged-sorted.tsv -o proteotypic_merged-sorted-mapped.tsv -d mus_musculus_core_52_37e &
