Recipe for generating proteotypic peptides for example database

From SPCTools

Previous revision: 20:52, 24 September 2009, by Dcampbel. Current revision also by Dcampbel.

Current revision

Notes on creating proteotypic peptide information, based on the Human 2009-04 build and modified to include the changes needed to show (almost) all peptides. Updated based on the 2009-08 mouse build; see the specific build README below for additional info and examples. The root directory for this work is /net/db/projects/PeptideAtlas/species


1. Set up reference database file

# Add a couple of dirs to your PATH (bash syntax)
export PATH=/regis/sbeams/bin/:/package/genome/tmhmm_sigp_wrapper/:/net/db/projects/PeptideAtlas/species/bin/:$PATH
# Make processing dir, cd there, and assemble source data.
cd /net/db/projects/PeptideAtlas/species
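# 'organism' and 'date' are placeholders for the species and build date.
mkdir -p organism/date
# Get database file to work on - see the mouse build instructions below if this needs to be assembled.
cd /net/db/projects/PeptideAtlas/species/organism/date
cp /path/to/reference_db.fasta .   # /path/to is a placeholder for wherever the reference file lives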
# Assuming accessions are correct, filter out decoys and trim long proteins to 8999 AA (longer sequences choke Peptide Sieve)
processFasta.pl -f reference_db.fasta -t 8999 -r 'DECOY_' -e -o reference_db_no-decoys.fasta
# Break files into bite-sized chunks!
split_fasta.pl --entries 10000 --filename_root input_split reference_db_no-decoys.fasta
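
The preparation steps above (drop decoys, trim long sequences, split into chunks) can be illustrated with a minimal Python sketch. It is not the code of processFasta.pl or split_fasta.pl; the function names, the in-memory record format, and the tiny demo data are assumptions made only for illustration.

```python
import re
from itertools import islice


def read_fasta(lines):
    """Yield (defline, sequence) pairs from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line.strip())
    if header is not None:
        yield header, "".join(parts)


def filter_and_trim(records, exclude_regex="DECOY_", max_len=8999):
    """Skip records whose defline matches exclude_regex; trim the rest."""
    pattern = re.compile(exclude_regex)
    for header, seq in records:
        if pattern.search(header):
            continue  # same idea as '-r DECOY_ -e': matches are excluded
        yield header, seq[:max_len]  # same idea as '-t 8999'


def split_records(records, entries=10000):
    """Group records into lists holding at most `entries` records each."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, entries))
        if not chunk:
            return
        yield chunk


# Hypothetical demo data, with a short trim length so the effect is visible.
demo = [">P1 protein one", "MKT", "AAA",
        ">DECOY_P1 reversed", "AAATKM",
        ">P2 protein two", "MKTLLV"]
kept = list(filter_and_trim(read_fasta(demo), max_len=4))
chunks = list(split_records(kept, entries=1))
print(kept)
print(len(chunks))
```

Working on (defline, sequence) pairs keeps the three steps independent, so any of them can be skipped or reordered.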


2. Run predictor algorithms.

A) Peptide Detectability Predictor

- symlink binaries.
ln -s /net/db/src/DetectabilityPredictor/Standalone/PeptideDetectabilityPredictor
ln -s /net/db/src/DetectabilityPredictor/Standalone/stand.bin
- run wrapper scripts that automate the searching, located in /net/db/projects/PeptideAtlas/species/bin
run_PDP.csh    runs predictor on each sub-file
- Once the run is complete...
mk_PDP.csh     merges results files into results.PDP

B) Run Peptide Sieve

- symlink binaries.
ln -s /net/db/projects/PeptideAtlas/ExternalData/proteotypic/bin/PepSieve_20080530/peptideSieve_080527/peptideSieve/bin/PeptideSieve .
ln -s /net/db/projects/PeptideAtlas/ExternalData/proteotypic/bin/PepSieve_20080530/peptideSieve_080527/peptideSieve/properties.txt .
- run wrapper scripts
run_PS.csh    runs predictor on each sub-file
- Once the run is complete...
mk_PS.csh     merges results files into results.PS


3. Merge and process results from the two prediction engines.

- Merge predictions
/regis/sbeams/bin/mergeProteotypicScores.pl -f reference_db_no-decoys.fasta -p results.PS -d results.PDP -o proteotypic_merged.tsv
- Sort with ENS entries first for mapping, so that the same peptides from proteins without mapping can borrow the mapping info.
- (For yeast use the -y flag; other non-ENS organism flags may be needed.)
sortEnsFirst.pl proteotypic_merged.tsv > proteotypic_merged-sorted.tsv
- Calculate genome mappings.  If the calc script is invoked with no args, it outputs a usage statement that includes
- the current (2009-09) ENS mapping database options (e.g. -d mus_musculus_core_52_37e)
nohup calculateNGenomeMappings.pl -f reference_db_no-decoys.fasta -p proteotypic_merged-sorted.tsv -o proteotypic_merged-sorted-mapped.tsv -d species_core_52_37e &
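
The idea behind sorting ENS entries first can be sketched in Python as below. This is only an illustration, not the logic of sortEnsFirst.pl: it assumes tab-separated rows with the protein accession in the first column and Ensembl accessions that start with "ENS", and the function name and demo rows are hypothetical.

```python
def sort_ens_first(rows, accession_column=0, prefix="ENS"):
    """Return rows with Ensembl-accession rows first.

    Python's sort is stable, so the original order is preserved within
    the ENS group and within the non-ENS group.
    """
    def rank(row):
        fields = row.split("\t")
        return 0 if fields[accession_column].startswith(prefix) else 1

    return sorted(rows, key=rank)


# Hypothetical rows: accession, peptide, score.
rows = [
    "IPI00001\tPEPTIDEK\t0.9",
    "ENSMUSP0001\tPEPTIDEK\t0.9",
    "Q9XYZ1\tANOTHERR\t0.4",
    "ENSMUSP0002\tANOTHERR\t0.4",
]
ens_first = sort_ens_first(rows)
for row in ens_first:
    print(row)
```

With ENS rows at the top, a mapping step that reads the file in order sees each peptide's Ensembl protein before any non-Ensembl protein carrying the same peptide.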
  • End general build notes section





The next step is to load these into the database for later use.  This requires that the reference database used is already loaded as a biosequence set. The load_theo_proteotypic_scores.pl script is located in the SBEAMS/PeptideAtlas codebase in $sbeams/lib/scripts/PeptideAtlas.

nohup load_theo_proteotypic_scores.pl -i merged_results_sorted_mapped_final.tsv -s Hs_ENSP_IPI_SPvarspl_decoy &





Usage statements for various scripts

 Usage: load_theo_proteotypic_scores.pl [OPTIONS]
 Options:
   --verbose n            Set verbosity level.  default is 0
   --quiet                Set flag to print nothing at all except errors
   --debug n              Set debug flag
   --testonly             If set, rows in the database are not changed or added
   --list                 If set, list the available builds and exit
   --help                 print this usage and exit.
   --purge_mappings       Delete peptide mappings pertaining to this set
   --set_tag              Name of the biosequence set tag
   --update_peptide_info  will update info in proteotypic_peptide table, e.g.
                          pI, mw, SSRCalc, Peptide Sieve.  Does *not* currently
                          update info in proteotypic_peptide_mapping table, so
                          one should run purge_mappings first and then update.
   --input_file           Name of file with PepSeive and Indiana scores, as
                          well as n_mapping info.


  e.g.: load_theo_proteotypic_scores.pl --list
        load_theo_proteotypic_scores.pl --set_tag 'YeastCombNR_20070207_ForwDecoy' --input_file 'proteotypic_peptide.txt'
        load_theo_proteotypic_scores.pl --delete_set 'YeastCombNR_20070207_ForwDecoy'


 Usage: processFasta.pl -f fasta_file -r regex [ -s (-i) -t 9000 -o output_file ]
 Usage: processFasta.pl -f fasta one -f fasta two -f fasta three -m -v -o output_file
 
 -f, --fasta_file    Name of fasta input file, required
 -o, --output_file   Name of output file, defaults to STDOUT
 -r, --regex         Regular expression applied to defline to define matching
                     subset
 -e, --exclude       'Invert' regex, ie exclude matches instead of non-matches
 -m, --merge_files   Merges two or more fasta files to a sequence unique
                     combined file.  The first accession encountered for a
                     given sequence is kept.  Does not honor -t, -s, or -r
                     options.
 -v, --verbose       Verbose output
 -t, --trim_seq      Trims the sequence to the specified number of characters.
 -h, --help          Print this usage and exit
 -s, --sprot_extract If set, run sprot extraction to pull accession from pipe
                     delimited descriptor.
 -i, --ipi_extract   If set, perform standard 'fix' to descriptor line

  
Usage: calculateNGenomeMappings.pl -p peptide_file -f protein_fasta_file -o output_file -m map_adaptor

     -f, --fasta_file    Reference fasta file of proteins for mapping (req)
     -p, --peptide_file  Merged and sorted file of proteotypic scored
                         peptides (req)
     -o, --output_file   Output file name for proteotypic peptides with
                         mapping (req)
     -d, --dbname        Ensembl mapping database, see below for version
                         52 values (req)
     -y, --yeast         Use Yeast SGD accessions for Ensembl mapping

DB names of 2009-08-15:

bos_taurus_core_52_4b
caenorhabditis_elegans_core_52_190
drosophila_melanogaster_core_52_54a
homo_sapiens_core_52_36n
mus_musculus_core_52_37e
pan_troglodytes_core_52_21j
rattus_norvegicus_core_52_34u
saccharomyces_cerevisiae_core_52_1i



This section outlines the steps taken to create and process the new mouse reference database 2009-08.

# Add /regis/sbeams/bin to PATH
export PATH=/regis/sbeams/bin/:$PATH

1: Fetch up-to-date data sources, do some light processing.

IPI - version 3.62 (mouse 3.62 56733)
wget ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.MOUSE.fasta.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/README -O readme.ipi
gunzip ipi.MOUSE.fasta.gz
# Fix IPI fasta accession line; this also forces seqs onto one line.
processFasta.pl -f ipi.MOUSE.fasta -i -v -o mouse_ipi_fixed-acc.fasta

Ensembl - ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/pep/
wget ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/pep/Mus_musculus.NCBIM37.55.pep.all.fa.gz
gunzip Mus_musculus.NCBIM37.55.pep.all.fa.gz
wget ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/pep/README -O readme.ens
# Streamline the sequence to a single line
processFasta.pl -f Mus_musculus.NCBIM37.55.pep.all.fa -v -o mouse_ensembl.fasta
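
Putting each sequence on a single line, as described for the IPI and Ensembl files above, amounts to joining the wrapped sequence lines under each defline. The Python sketch below illustrates that idea only; it is not taken from processFasta.pl, and the function name is hypothetical.

```python
def unwrap_fasta(lines):
    """Return FASTA lines with every sequence joined onto one line."""
    out, seq = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if seq:
                out.append("".join(seq))  # flush the previous sequence
                seq = []
            out.append(line)
        else:
            seq.append(line)
    if seq:
        out.append("".join(seq))  # flush the final sequence
    return out


wrapped = [">a desc", "MKT", "LLV", "", ">b", "AAA"]
print(unwrap_fasta(wrapped))
```

One sequence per line makes later steps such as trimming or comparing sequences simple line-level operations.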

Swiss Prot
# Fetch and then filter with processFasta, extracting MOUSE entries and fixing accessions.
# main sp
wget ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.fasta.gz
gunzip uniprot_sprot.fasta.gz
processFasta.pl -f uniprot_sprot.fasta -v -s -r '_MOUSE' -o mouse_sprot_main.fasta

# isoforms file
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot_varsplic.fasta.gz
gunzip uniprot_sprot_varsplic.fasta.gz
/regis/sbeams/bin/extractFasta.pl -f uniprot_sprot_varsplic.fasta -r '_MOUSE' -s -o swiss-prot_varsplice_mouse.fasta
processFasta.pl -f uniprot_sprot_varsplic.fasta -v -s -r '_MOUSE' -o mouse_sprot_isoforms.fasta

# concatenate - use processFasta to eliminate redundancy...
processFasta.pl -f mouse_sprot_main.fasta -f mouse_sprot_isoforms.fasta -m -o mouse_sprot_merged.fasta -v
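
The Swiss-Prot filtering above selects deflines matching '_MOUSE' and, per the usage statement, pulls the accession from the pipe-delimited descriptor. A minimal Python sketch of that idea follows. It assumes UniProt-style deflines of the form db|ACCESSION|ENTRY_NAME description; the function name and demo records are hypothetical, and the exact output of processFasta.pl may differ.

```python
import re


def extract_sprot(records, regex="_MOUSE"):
    """Yield (accession, sequence) for records whose defline matches regex.

    Assumes deflines such as 'sp|P00001|ABC_MOUSE Protein A'; the
    accession is the second pipe-delimited field. Deflines without that
    layout fall back to their first whitespace-delimited token.
    """
    pattern = re.compile(regex)
    for header, seq in records:
        if not pattern.search(header):
            continue
        parts = header.split("|")
        accession = parts[1] if len(parts) >= 3 else header.split()[0]
        yield accession, seq


# Hypothetical records, including an isoform-style accession.
records = [
    ("sp|P00001|ABC_MOUSE Protein A", "MKT"),
    ("sp|P00002|ABC_HUMAN Protein A", "MKT"),
    ("sp|P00003-2|XYZ_MOUSE Isoform 2 of protein X", "AAA"),
]
mouse = list(extract_sprot(records))
print(mouse)
```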

cRAP
cp /regis/dbase/users/sbeams/cRAP/crap.fasta .
processFasta.pl -f crap.fasta -s -o crap_clean.fasta

Decoys from original search/reference database (not common?)
cp /net/db/projects/PeptideAtlas/pipeline/output/Mouse_2008-12_Ens47_P0.9/DATA_FILES/Mus_musculus.fasta ./Original_Mouse_reference_db.fasta
processFasta.pl -f Original_Mouse_reference_db.fasta -r 'DECOY_' -v -o mouse_ipi_decoys.fasta


# Concatenate all together!
cat mouse_ensembl.fasta mouse_sprot_merged.fasta mouse_ipi_fixed-acc.fasta mouse_ipi_decoys.fasta crap_clean.fasta > mouse_reference_2009-08.fasta

# Count unique/redundant seqs by 'merging' the file with itself!
processFasta.pl -f mouse_reference_2009-08.fasta -m -v -o mouse_reference_non-redundant_2009-08.fasta

total_files => 1
total_seqs => 133420
unique => 76830
redundant => 56590
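
The counts above are consistent with a sequence-unique merge: unique + redundant = total_seqs (76830 + 56590 = 133420). The Python sketch below shows one way such a merge could work, keeping the first accession seen for each sequence as the -m option describes. It is an illustration under that assumption, not the code of processFasta.pl, and the function name and demo records are hypothetical.

```python
def merge_unique(records):
    """Keep the first accession seen for each distinct sequence.

    Returns the non-redundant records plus counts named after the keys
    reported in the notes (total_seqs, unique, redundant).
    """
    first_accession = {}  # dicts preserve insertion order
    total = 0
    for accession, seq in records:
        total += 1
        first_accession.setdefault(seq, accession)
    unique = [(acc, seq) for seq, acc in first_accession.items()]
    stats = {
        "total_seqs": total,
        "unique": len(unique),
        "redundant": total - len(unique),
    }
    return unique, stats


# Hypothetical demo: 'C' repeats the sequence of 'A' and is dropped.
demo_records = [("A", "MKT"), ("B", "AAA"), ("C", "MKT")]
unique_records, stats = merge_unique(demo_records)
print(unique_records)
print(stats)
```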

# Finally, for the proteotypic peptide stuff, remove DECOY seqs and trim any long (> 9000) sequences.
processFasta.pl -f mouse_reference_2009-08.fasta -t 8999 -r DECOY_ -e -v -o mouse_reference_trimmed_no-decoy.fasta

2: Run the proteotypic scripts.

kin => 3238189
no_pdp_prot => 488
no_ps_prot => 819640
orphan => 956
pdp_nan => 3086
pdp_no => 10123
pdp_ok => 3229022
prots => 121195
psieve_cterm_no => 17555
psieve_cterm_ok => 44824
psieve_has_first => 73870
psieve_has_last => 62903
psieve_no => 1091825
psieve_nterm_no => 31861
psieve_nterm_ok => 41755
psieve_ok => 2011325

3239145 merged_proteotypic.tsv

mergeProteotypicScores.pl -f mouse_reference_nodecoys_2009-08.fasta -p results.PS -d results.PDP -o proteotypic_merged.tsv
sortEnsFirst.pl proteotypic_merged.tsv > proteotypic_merged-sorted.tsv

nohup calculateNGenomeMappings.pl -f mouse_reference_nodecoys_2009-08.fasta -p proteotypic_merged-sorted.tsv -o proteotypic_merged-sorted-mapped.tsv -d mus_musculus_core_52_37e &
