Recipe for generating proteotypic peptides for example database
From SPCTools
Revision as of 19:12, 24 September 2009, Dcampbel
Revision as of 19:15, 24 September 2009, Dcampbel
Notes on creating proteotypic peptide information, based on the Human 2009-04 build and modified
to include the changes needed to show (almost) all peptides. Updated based on the 2009-08
mouse build; see the build-specific README below for additional info and examples.
The root directory for this is /net/db/projects/PeptideAtlas/species
1) Set up reference database file
Add a couple of dirs to your PATH (bash syntax):
export PATH=/regis/sbeams/bin/:/package/genome/tmhmm_sigp_wrapper/:/net/db/projects/PeptideAtlas/species/bin/:$PATH
Make the processing dir (organism and date are placeholders) and assemble source data:
cd /net/db/projects/PeptideAtlas/species
mkdir -p organism/date
Get the database file to work on (see the mouse build instructions below if it needs to be assembled):
cd /net/db/projects/PeptideAtlas/species/organism/date
cp reference_db.fasta .
Assuming accessions are correct, filter out decoys and trim proteins longer than 8999 AA (which choke Peptide Sieve). The -t 8999 flag is the one used for trimming in the mouse build below:
processFasta.pl -f reference_db.fasta -t 8999 -r 'DECOY_' -e -o reference_db_no-decoys.fasta
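To sanity-check the filtered file before moving on, something like the following could be used. This is a minimal sketch, not part of the SBEAMS tools: `check_fasta` is a hypothetical helper name, and it assumes that decoy headers contain the literal string DECOY_ and that the length limit is 8999 residues.

```shell
# Hypothetical helper (not an SBEAMS script): summarize a FASTA file.
# Usage: check_fasta FILE [MAX_LEN]
# Prints the number of headers containing DECOY_, the longest sequence length,
# and the number of sequences longer than MAX_LEN (default 8999).
check_fasta() {
    awk -v max_len="${2:-8999}" '
        # Close out the entry that just ended and reset the running length.
        function finish_entry() {
            if (len > longest) longest = len
            if (len > max_len) too_long++
            len = 0
        }
        /^>/ {
            finish_entry()
            if (index($0, "DECOY_")) decoys++
            next
        }
        { len += length($0) }   # sequences may span several lines
        END {
            finish_entry()
            printf "decoys=%d longest=%d over_limit=%d\n", decoys, longest, too_long
        }
    ' "$1"
}

# Example:
#   check_fasta reference_db_no-decoys.fasta
```

For a correctly filtered file the report should show decoys=0 and over_limit=0.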
Break the file into bite-sized chunks:
split_fasta.pl --entries 10000 --filename_root input_split reference_db_no-decoys.fasta
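For anyone without access to split_fasta.pl, the chunking idea can be sketched with awk. This is an illustration only: `split_fasta_sketch` and the `<root>_NNN.fasta` output naming are assumptions and may not match what split_fasta.pl actually produces.

```shell
# Hypothetical stand-in for split_fasta.pl (output naming is an assumption).
# Usage: split_fasta_sketch FILE [ENTRIES_PER_CHUNK] [FILENAME_ROOT]
split_fasta_sketch() {
    awk -v n="${2:-10000}" -v root="${3:-input_split}" '
        /^>/ {
            # Start a new chunk file every n entries.
            if (count % n == 0) {
                if (out != "") close(out)
                out = sprintf("%s_%03d.fasta", root, ++chunk)
            }
            count++
        }
        # Lines before the first header are skipped; everything else goes
        # to the currently open chunk.
        out != "" { print > out }
    ' "$1"
}

# Example:
#   split_fasta_sketch reference_db_no-decoys.fasta 10000 input_split
```

Splitting on entry boundaries rather than line counts keeps each header together with its sequence, whether or not sequences have been forced onto a single line.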
2) Run predictor algorithms.
Symlink the binaries:
ln -s /net/db/src/DetectabilityPredictor/Standalone/PeptideDetectabilityPredictor
ln -s /net/db/src/DetectabilityPredictor/Standalone/stand.bin
Run the predictor, then merge into one results file. These scripts automate the searching and are located in /net/db/projects/PeptideAtlas/species/bin:
run_PDP.csh   # runs predictor on each sub-file
Once the run is complete:
mk_PDP.csh    # merges results files into results.PDP
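The run/merge pattern behind these wrapper scripts can be sketched in bash as below. This is not the content of the real csh scripts: `run_on_chunks` and `merge_chunk_results` are hypothetical names, the `.out` suffix is an assumption, and the predictor command line is deliberately left to the caller because its arguments are not documented here.

```shell
# Hypothetical sketch of the run-per-chunk and merge steps.
# Usage: run_on_chunks CMD [ARGS...]
# Runs CMD once per input_split* file in the current directory, passing the
# chunk as the last argument and capturing stdout in <chunk>.out.
run_on_chunks() {
    local chunk
    for chunk in input_split*; do
        [ -f "$chunk" ] || continue
        case "$chunk" in *.out) continue ;; esac   # skip results of earlier runs
        "$@" "$chunk" > "$chunk.out" || {
            echo "failed on $chunk" >&2
            return 1
        }
    done
}

# Usage: merge_chunk_results OUTPUT_FILE
# Concatenates the per-chunk outputs in glob (alphabetical) order.
merge_chunk_results() {
    cat input_split*.out > "$1"
}

# Example (any command that takes the chunk as its last argument):
#   run_on_chunks grep -c '^>'
#   merge_chunk_results entry_counts.txt
```

Stopping at the first failed chunk makes it harder to merge an incomplete result set by accident.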
Peptide Sieve run
ln -s /net/db/projects/PeptideAtlas/ExternalData/proteotypic/bin/PepSieve_20080530/peptideSieve_080527/peptideSieve/bin/PeptideSieve .
ln -s /net/db/projects/PeptideAtlas/ExternalData/proteotypic/bin/PepSieve_20080530/peptideSieve_080527/peptideSieve/properties.txt .
Run Peptide Sieve, then merge into one results file. These scripts automate the searching and are located in /net/db/projects/PeptideAtlas/species/bin:
run_PS.csh   # runs predictor on each sub-file
Once the run is complete:
mk_PS.csh    # merges results files into results.PS
Merge the results from the two prediction engines.
/regis/sbeams/bin/mergeProteotypicScores.pl -f reference_db_no-decoys.fasta -p results.PS -d results.PDP -o proteotypic_merged.tsv
Sort with ENS entries first for mapping, since the same peptides from proteins without a mapping can then borrow the mapping info. For yeast use the -y flag; other non-ENS organisms may need other flags.
sortEnsFirst.pl proteotypic_merged.tsv > proteotypic_merged-sorted.tsv
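The ENS-first ordering can be illustrated with two awk passes. This is a sketch under assumptions, not the logic of sortEnsFirst.pl: `ens_first` is a hypothetical name, the protein accession is assumed to sit in the first tab-separated column, and rows are assumed to keep their original relative order within each group.

```shell
# Hypothetical illustration of ENS-first ordering.
# Usage: ens_first FILE.tsv [ACCESSION_COLUMN]
ens_first() {
    local tsv=$1 col=${2:-1}
    # Pass 1: rows whose accession starts with ENS.
    awk -F'\t' -v c="$col" '$c ~ /^ENS/' "$tsv"
    # Pass 2: everything else, in original order.
    awk -F'\t' -v c="$col" '$c !~ /^ENS/' "$tsv"
}

# Example:
#   ens_first proteotypic_merged.tsv > proteotypic_merged-sorted.tsv
```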
Finally, calculate genome mappings. If the calc script is invoked with no args, it prints a usage statement that includes the current (2009-09) ENS mapping options ( -d species_core_52_37e ).
nohup calculateNGenomeMappings.pl -f reference_db_no-decoys.fasta -p proteotypic_merged-sorted.tsv -o proteotypic_merged-sorted-mapped.tsv -d species_core_52_37e &
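Since the mapping job runs in the background under nohup, it can help to send its output to a named log and keep the process ID around. A minimal sketch, where `run_in_background` and the log file name are hypothetical:

```shell
# Hypothetical wrapper: run a long job under nohup with its own log file.
# Usage: run_in_background LOGFILE CMD [ARGS...]
# Prints the PID of the background job so it can be checked later.
run_in_background() {
    local log=$1
    shift
    nohup "$@" > "$log" 2>&1 &
    echo $!
}

# Example:
#   run_in_background mapping.log calculateNGenomeMappings.pl \
#       -f reference_db_no-decoys.fasta -p proteotypic_merged-sorted.tsv \
#       -o proteotypic_merged-sorted-mapped.tsv -d species_core_52_37e
#   tail -f mapping.log        # follow progress
#   kill -0 PID && echo "still running"
```

Without the explicit redirect, nohup writes to nohup.out in the current directory, which gets confusing when several builds share a directory.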
End general notes section
This section outlines the steps taken to process the new mouse reference database 2009-08.
Add /regis/sbeams/bin to PATH:
export PATH=/regis/sbeams/bin/:$PATH
1: Fetch up-to-date data sources, do some light processing.
IPI - version 3.62 (mouse 3.62 56733)
wget ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.MOUSE.fasta.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/README -O readme.ipi
gunzip ipi.MOUSE.fasta.gz
Fix the IPI fasta accession line; this also forces seqs to one line.
processFasta.pl -f ipi.MOUSE.fasta -i -v -o mouse_ipi_fixed-acc.fasta
Ensembl
ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/pep/
wget ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/pep/Mus_musculus.NCBIM37.55.pep.all.fa.gz
gunzip Mus_musculus.NCBIM37.55.pep.all.fa.gz
wget ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/pep/README -O readme.ens
Streamline each sequence to a single line:
processFasta.pl -f Mus_musculus.NCBIM37.55.pep.all.fa -v -o mouse_ensembl.fasta
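Putting each sequence on a single line is a common FASTA normalization, and the idea can be shown with plain awk. This sketch only illustrates that one step; `linearize_fasta` is a hypothetical name and it does none of the accession fixing that processFasta.pl performs.

```shell
# Hypothetical illustration: join wrapped sequence lines so every entry is
# exactly one header line followed by one sequence line.
# Usage: linearize_fasta FILE
linearize_fasta() {
    awk '
        /^>/ {
            if (seq != "") print seq   # emit the previous entry
            seq = ""
            print                      # emit the header unchanged
            next
        }
        { seq = seq $0 }               # accumulate wrapped sequence lines
        END { if (seq != "") print seq }
    ' "$1"
}

# Example:
#   linearize_fasta Mus_musculus.NCBIM37.55.pep.all.fa > one_line.fasta
```

Single-line sequences make later steps such as duplicate detection a matter of comparing whole lines.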
Swiss Prot
Fetch and then filter with processFasta, extracting MOUSE entries and fixing accessions.
Main sp file:
wget ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.fasta.gz
gunzip uniprot_sprot.fasta.gz
processFasta.pl -f uniprot_sprot.fasta -v -s -r '_MOUSE' -o mouse_sprot_main.fasta
Isoforms file:
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot_varsplic.fasta.gz
gunzip uniprot_sprot_varsplic.fasta.gz
/regis/sbeams/bin/extractFasta.pl -f uniprot_sprot_varsplic.fasta -r '_MOUSE' -s -o swiss-prot_varsplice_mouse.fasta
processFasta.pl -f uniprot_sprot_varsplic.fasta -v -s -r '_MOUSE' -o mouse_sprot_isoforms.fasta
Concatenate, using processFasta to eliminate redundancy:
processFasta.pl -f mouse_sprot_main.fasta -f mouse_sprot_isoforms.fasta -m -o mouse_sprot_merged.fasta -v
cRAP
cp /regis/dbase/users/sbeams/cRAP/crap.fasta .
processFasta.pl -f crap.fasta -s -o crap_clean.fasta
Decoys from original search/reference database (not common?)
cp /net/db/projects/PeptideAtlas/pipeline/output/Mouse_2008-12_Ens47_P0.9/DATA_FILES/Mus_musculus.fasta ./Original_Mouse_reference_db.fasta
processFasta.pl -f Original_Mouse_reference_db.fasta -r 'DECOY_' -v -o mouse_ipi_decoys.fasta
Concatenate all together!
cat mouse_ensembl.fasta mouse_sprot_merged.fasta mouse_ipi_fixed-acc.fasta mouse_ipi_decoys.fasta crap_clean.fasta > mouse_reference_2009-08.fasta
Count unique/redundant seqs by 'merging' the file with itself:
processFasta.pl -f mouse_reference_2009-08.fasta -m -v -o mouse_reference_non-redundant_2009-08.fasta
total_files => 1
total_seqs => 133420
unique => 76830
redundant => 56590
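The reported numbers are consistent with redundant = total_seqs - unique (133420 - 76830 = 56590). A rough way to reproduce that kind of count without processFasta.pl is sketched below; `count_unique_seqs` is a hypothetical helper, and it assumes redundancy means exact sequence identity regardless of accession.

```shell
# Hypothetical helper: count total, unique, and redundant sequences.
# Usage: count_unique_seqs FILE [FILE...]
# Two entries count as the same sequence only if their residues match exactly.
count_unique_seqs() {
    awk '
        # Record the entry that just ended, if any.
        function finish_entry() {
            if (seq != "") {
                total++
                if (!seen[seq]++) unique++
            }
            seq = ""
        }
        /^>/ { finish_entry(); next }
        { seq = seq $0 }               # handles wrapped sequences too
        END {
            finish_entry()
            printf "total_seqs => %d\nunique => %d\nredundant => %d\n", total, unique, total - unique
        }
    ' "$@"
}

# Example:
#   count_unique_seqs mouse_reference_2009-08.fasta
```

Holding every distinct sequence in memory is fine for a file of this size (about 133k entries) but may need rethinking for much larger databases.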
Finally, for the proteotypic peptide stuff, remove DECOY seqs and trim any long (> 8999 AA) sequences:
processFasta.pl -f mouse_reference_2009-08.fasta -t 8999 -r DECOY_ -e -v -o mouse_reference_trimmed_no-decoy.fasta
2: Run the proteotypic scripts.
kin => 3238189
no_pdp_prot => 488
no_ps_prot => 819640
orphan => 956
pdp_nan => 3086
pdp_no => 10123
pdp_ok => 3229022
prots => 121195
psieve_cterm_no => 17555
psieve_cterm_ok => 44824
psieve_has_first => 73870
psieve_has_last => 62903
psieve_no => 1091825
psieve_nterm_no => 31861
psieve_nterm_ok => 41755
psieve_ok => 2011325
3239145 merged_proteotypic.tsv
wc: wc: No such file or directory
3239145 total
mergeProteotypicScores.pl -f mouse_reference_nodecoys_2009-08.fasta -p results.PS -d results.PDP -o proteotypic_merged.tsv
sortEnsFirst.pl proteotypic_merged.tsv > proteotypic_merged-sorted.tsv
nohup calculateNGenomeMappings.pl -f mouse_reference_nodecoys_2009-08.fasta -p proteotypic_merged-sorted.tsv -o proteotypic_merged-sorted-mapped.tsv -d mus_musculus_core_52_37e &