Building Peptide Atlas
[http://www.PeptideAtlas.org PeptideAtlas] is an online database of peptides identified in tandem MS experiments. Raw MS data are gathered from labs around the world and processed uniformly as described below. Data from a single species, or sometimes a subproteome such as plasma or liver, are combined into subsections of PeptideAtlas called ''builds''. PeptideAtlas is implemented in the [[Using SBEAMS|SBEAMS]] database. Another page outlines the body of [[PeptideAtlas software|software associated with PeptideAtlas]].

The atlas build process is wild and woolly and unstable. There are several caveats having to do with special cases, and several steps that must be run manually. The whole process is outlined below. Your other steadfast companion on the atlas build journey will be the ''atlas build recipe'', which describes the same steps as below but provides actual unix command-line invocations that you can cut and paste. When these documents fail you, as they will, feel free to consult the PeptideAtlas team at ISB, currently comprising Zhi Sun, Dave Campbell, team leader and PeptideAtlas creator Eric Deutsch, and the author of this document, Terry Farrah.

October 2012: We are re-tooling the pipeline. Notes are [[PeptideAtlas Pipeline Retool 2012/13|here]]!
===Copy build recipe and follow it===
The unix commands needed for each step below are given in mimas.systemsbiology.net:/net/db/projects/PeptideAtlas/pipeline/recipes/master_recipe.notes. Copy this file to <your_build>_recipe.notes, follow along, and edit as needed for your build.
Most work takes place on mimas (a.k.a. db) or dione at /net/db/projects/PeptideAtlas.
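For example, to start a recipe for a new build (the build name here is hypothetical):
 $ cd /net/db/projects/PeptideAtlas/pipeline/recipes
 $ cp master_recipe.notes HsUrine_2009-01_recipe.notes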

The order of the steps in the recipe may differ from the order below. In such cases, either order may be followed.
===Start with one or more PeptideProphet output files (pepXML) for each experiment in each project.===
A project is a set of related experiments. For example, a project may study proteins found in normal and diseased liver, and may include 4 experiments: tissue from two normal patients and from two diseased patients. The pepXML files should be created by searching the spectra with a database search engine such as SEQUEST, X!Tandem, or SpectraST, then validating the hits using PeptideProphet. If you are going to combine search results using iProphet (see below), iProphet should '''not''' be run on each set of search results individually; to avoid running iProphet you will need to run xinteract manually, because scripts such as runtandemsearch automatically run iProphet after PeptideProphet.

Searching databases containing decoys is recommended to provide a reference point for evaluating the FDR (false discovery rate) of the final Atlas. As of fall 2008, spectral libraries containing decoys are available for SpectraST searching. When searching with SpectraST, set the database to the biosequence set flatfile in spectrast.params. Also, currently, for SpectraST, RefreshParser is run after PeptideProphet by finishspectrastsearch. If this is not necessary -- and I don't think it is -- omit it, because successive refreshes to different DBs may result in a proliferation of UNMAPPED protIDs (not sure about this).

It is helpful when referencing files using wildcards if the pepXML files all reside at the same level in the directory tree. If you have to move files to achieve this, adjust the paths within them using
 $ /sbeams/bin/updateAllPaths.pl *.xml *.xls *.shtml
===Post-process any glyco search results.===
See [[Processing glycopeptide data]].

===Register projects and experiments using SBEAMS interface.===
*Go to db.systemsbiology.net.
*Login to SBEAMS.
*Click tab "My Projects" or "Accessible Projects" and click "Add new project" at bottom.
*Fill out fields. Owner of the project should be the experimenter who created the data. Project tag should match the name of the subdirectory in /sbeams/archive/<project_owner> that contains the data.
*To register experiments, go to SBEAMS Home > Proteomics > Manage Data > Experiments
**Data Path: full path name up to the Experiment_Tag directory -- or, if in /regis/sbeams/archive, you may omit that prefix.
===Run iProphet===
====If you ran multiple search engines for each experiment, combine ''per experiment'' using iProphet====
Create a directory, parallel to the search results directories, named iProphet. Then, for example,
 $ iProphet ../{XTK,SPC,SEQ}*/interact-prob.pep.xml interact-combined.pep.xml
 $ ProteinProphet <options> interact-combined.pep.xml interact-combined.prot.xml
Be sure that the input pepXML files were not already processed by iProphet -- you don't want to run iProphet twice. Caution: if you ran an automated post-processing script such as runtandemsearch (which calls finishtandemsearch), iProphet may already have been run automatically. Conversely, if you are using only one search engine and did not run iProphet immediately after running PeptideProphet, run it now.
The resulting pepXML files will be used to generate final peptide probabilities in the "gather all peptides" step of the Atlas build process below.
===Build a biosequence set for your species if necessary===
The biosequence_set typically contains all target and decoy sequences from the search database, plus additional sequences from other databases such as Ensembl and Swiss-Prot. Mapping peptides to this larger set allows us to take advantage of Swiss-Prot annotations and Ensembl genome mappings. The biosequence set may also contain decoys from old target/decoy databases to allow mapping of old search results. If a current bioseq set does not yet exist for your species, create one, then come back to this step. As of April 2010, a recipe for creating a bioseq set resides in /regis/dbase/users/sbeams/DecoyDictionaries/bss_build_recipe.notes

===Refresh all iProphet output files to biosequence set===
To refresh, run RefreshParser. The pepXML then has protIDs from the bioseq set.
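For example (a sketch only -- check RefreshParser's usage message for your TPP version; the bioseq set path here is hypothetical):
 $ RefreshParser interact-ipro.pep.xml /path/to/<bioseq_set>.fasta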
===Run ProteinProphet on the iProphet file for each experiment.===
The protXML inherits protIDs from the biosequence set.

The purpose of the per-experiment ProteinProphet run is to adjust the probabilities of all the peptides according to NSP (number of sibling peptides). The adjusted probabilities are not used directly, but are used to generate a multiplicative factor for each peptide which is then applied to the iProphet probability for each observation of that peptide. ProteinProphet protein probabilities are not used during the build process because, for large datasets, they are overly optimistic.
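For example, in each experiment directory (a sketch; the IPROPHET option tells ProteinProphet that its input comes from iProphet, and the file names follow those used elsewhere on this page):
 $ ProteinProphet interact-ipro.pep.xml interact-ipro.prot.xml IPROPHET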
===Obtain search batch IDs for each experiment.===
===Ensure that the file $PIPELINE/etc/protid_priorities.csv contains protein identifier prioritization for your species===
Assign higher priorities to identifiers that you prefer to see in the PeptideAtlas. For human and mouse, we prefer Swiss-Prot because of its nice description lines, and also because the human Swiss-Prot, as of 2009, describes a rough draft of the human proteome.
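To see what is already there for your species, something like the following works; then add or edit lines following the existing format in the file (the file's own layout is authoritative -- nothing is assumed about its columns here):
 $ grep -i <species> $PIPELINE/etc/protid_priorities.csv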
===Run step01 of the PeptideAtlas build pipeline===
 $PIPELINE=/net/db/projects/PeptideAtlas/pipeline

A general, customizable script for running the pipeline can be found in $PIPELINE/run_scripts/run_myBuild.csh. Copy this script to $PIPELINE/run_scripts/run_<name-of-build>.csh and edit. This script ultimately calls $PIPELINE/run_scripts/run_Master.csh, which then calls the ''pipeline script'' with various options for each of the pipeline steps (except the last two). The pipeline script is $PIPELINE/bin/$PIPELINE_SCRIPT; $PIPELINE_SCRIPT is defined in $PIPELINE/run_scripts/run_myBuild.csh and is often peptideAtlasPipeline_Ensembl.pl.

First, run step01 ONLY. You will then have to do some steps manually before proceeding to the other steps.
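For example (build name hypothetical; csh, as used by the run scripts):
 $ cd $PIPELINE/run_scripts
 $ cp run_myBuild.csh run_HsUrine.csh
 $ ./run_HsUrine.csh >& run_HsUrine.out &
Edit the copied script first so that only step01 is executed.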
====Step 01: Gather peptides, load into DB, and update probabilities====
Step01 looks at the results of the TPP, adjusts pepXML probabilities based on protXML info, gathers all PSMs (peptide-spectrum matches) that are wanted for this build based on a probability or FDR threshold, and stores all relevant info for each in a combined PA identlist file. Calls createPipelineInput.pl via pipeline/bin/PeptideFilesGenerator.pm. Creates an identlist template file and an identlist file for each pepXML in Experiments.list, then the combined file. The identlist template file contains only the unadjusted pepXML info for PSMs with P>=0.4; it is cached in the same dir as each pepXML file to speed future builds.
At the core of this step is the script createPipelineInput.pl.
Files created in the build directory, all but the last created by createPipelineInput.pl:
* '''PeptideAtlasInput_concat.PAidentlist'''
* '''PeptideAtlasInput_sorted.PAidentlist'''<br>These two files are identical except for the sorting of the lines. They are in tsv format and have the following fields:
# search batch ID
# spectrum query
# Peptide Atlas accession
# peptide sequence
# preceding residue
# modified peptide sequence (includes modification & charge info)
# following residue
# charge
# peptide probability
# massdiff
# protein name
# nsp adjusted probability from protXML
# n_adj_obs from protXML
# n_sibling_peps from protXML
* '''APD_<speciesAbbrev>_all.tsv'''
* '''APD_<speciesAbbrev>_all.PAxml'''<br>These two files have the same info with different formatting. There is one record per '''unique''' peptide, with the following fields:
# Peptide Atlas accession
# biosequence name (same as above, maybe? then, not nec. in biosequence set) '''X 3'''
# peptide sequence
# n_peptides
# maximum_probability
# n_experiments
# observed_experiment_list
# biosequence_desc
# searched_experiment_list
* '''APD_<speciesAbbrev>_all.fasta''' -- fasta format file of all peptides in our atlas; PeptideAtlas accessions used. (Created by generateFiles() in pipeline/bin/PeptideFilesGenerator.pm ''after'' the call to createPipelineInput.pl)
Files created in the directory for each experiment:
* '''interact-prob.pep.PAidentlist-template''' -- first 11 fields of the PAidentlist files above
* '''interact-prob.pep.PAidentlist''' -- same fields as the PAidentlist files above
New in May 2009: createPipelineInput.pl now chooses a preferred protID from among the indistinguishables, so protIDs in PAidentlist are now biased toward the preferred DB for each species (Swiss-Prot for human and mouse).
===Do some steps manually===

====If any experiments were contaminated with proteins from another organism====
If new template files were created, stop after step01, decontaminate any PAidentlist-template files as necessary, then run step01 again to create the correct PAidentlist files and combined PAidentlist. Decontaminate using $PIPELINE/bin/filter_Atlas_files.pl; refer to /regis/sbeams/archive/tconrads/NCIHumanSerum/NCIHumanSerum/SPC_HsNIST2.0/=notes.

====Optional: Extract a preliminary covering protein list from combined PAidentlist====
This is useful if you want to estimate the protein FDR:
 awk '{print $11}' PeptideAtlasInput_sorted.PAidentlist | sort | uniq > protid.list
Estimate the protein FDR by dividing the number of decoys in the list by the number of non-decoys (a sketch follows below). For the human plasma atlas this gave an approximately 10% overestimate, due to less-than-parsimonious protein ID mapping, which will be corrected in step02a below. If your protein FDR is not what you desire, adjust the PSM_FDR in the run script and run step01 again.
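For example, assuming decoy accessions begin with DECOY_ (the convention used elsewhere on this page):
 $ grep -c '^DECOY_' protid.list
 $ grep -vc '^DECOY_' protid.list
The first count divided by the second is the estimated protein FDR.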
====Create a special filtered pepXML for each experiment====
This pepXML omits all PSMs for peptides that didn't pass threshold. It is used to create the master ProteinProphet file (next step). (Note: this pepXML has weird properties that make it unsuitable for most other purposes you might think of. It contains at most one PSM for each peptide -- the highest probability one. And it contains many low-probability PSMs that correspond to peptides that passed threshold in another experiment.)

The following script automates this step, but is not robust (it will break if createPipelineInput.pl's informational output changes):
 $PIPELINE/bin/filterPepXMLsForAtlas.sh
... or do it manually, getting the minprob for each experiment from the step01 output:
 /regis/sbeams/bin/filterPepXML.pl -p <minprob> DATA_FILES/PeptideAtlasInput_concat.PAidentlist $exp_dir/interact-ipro.pep.xml > $exp_dir/interact-ipro-filtered.pep.xml
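A sketch of the manual loop in csh (the experiment directory layout is hypothetical, and each experiment gets its own minprob from the step01 output):
 foreach exp_dir ( /regis/sbeams/archive/<owner>/<project>/* )
   /regis/sbeams/bin/filterPepXML.pl -p <minprob> DATA_FILES/PeptideAtlasInput_concat.PAidentlist $exp_dir/interact-ipro.pep.xml > $exp_dir/interact-ipro-filtered.pep.xml
 end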
====Create Master ProteinProphet file====
Using the script created by filterPepXMLsForAtlas.sh above, run ProteinProphet on regis9, then copy the resulting interact-combined.prot.xml to the analysis subdir.

====If any protein IDs are overly long (SpectraST builds only)====
Truncate overly long protein IDs, most of which begin with DECOY_[01], in interact-combined.prot.xml. See the recipe for details.
Truncate them from the initial | through to the _UNMAPPED. Names with 181 chars are OK; I've been truncating those longer than 181.
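An untested Perl sketch of the truncation idea (the recipe has the authoritative command); it assumes the overlong names sit in protein_name attributes and end in _UNMAPPED:
 perl -i.bak -pe 's{(protein_name=")([^"]{182,})(")}{my ($pre,$id,$post) = ($1,$2,$3); $id =~ s/\|.*_UNMAPPED//; "$pre$id$post"}ge' interact-combined.prot.xml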
===Run steps 02 through 08 of the build pipeline===
Edit the run script to execute steps 02 through 08, then run the script.

====Step02: Download latest fasta files from web for reference DB (also called biosequence set)====
Step02 calls the pipeline script with --getEnsembl and executes getEnsembl(), which gets the Ensembl fasta file via FTP unless it is stored locally. Merges in any supplemental protein file specified, by calling $PIPELINE/bin/mergeEnsemblAndIPI.pl, which creates the duplicate* and protein2gene.txt files. Files created:
* '''<species>.pep.fa''' -- Ensembl file as retrieved via FTP.
* '''<species>.fasta''' -- Ensembl plus supplemental sequences. This is the file that is stored as a Biosequence Set in PeptideAtlas.
* '''duplicate_groups.txt''' -- each line lists a primary (Ensembl) identifier from the ref DB, then one or more non-primary identifiers. Has a header line.
* '''duplicate_mapping.txt''' -- created from duplicate_groups.txt. Each line lists a non-primary, primary identifier pair. Has a header line.
* '''duplicate_entries.txt''' -- non-primary identifiers
* '''protein2gene.txt''' -- list of protein/gene identifier pairs, looked up in the Ensembl download.

05/18/10 TMF suggested overhaul of this step:
* Don't download the latest Ensembl. Instead, use whatever version of Ensembl was used for the biosequence set. We are trying to recreate this every 6 months, so it should be pretty current.
* Don't create the .pep.fa and .fasta files.
* Copy mergeEnsemblAndIPI.pl into a new script, createBSSauxiliaryFiles.pl. This script should be run only once, when creating the biosequence set, and should create the last four files listed above. These should be stored in /regis/dbase/users/sbeams/BSS_XX, where XX is the biosequence set ID. Then, step02 will consist ONLY of symlinking those files to the load directory.
* Caution: I just noticed that the file protein2gene.txt is of drastically different size for two different builds from the same BSS. So maybe it is build dependent and needs to be created anew for each build.
====Step 02a: Compile protein identifications====
Runs createPipelineInput.pl --protlist_only. Produces a new file, PeptideAtlasInput.PAprotlist. Each line of the file represents a set of protein identifiers that all have exactly the same peptide set.
* '''PeptideAtlasInput.PAprotlist''' -- for each <protein> selected from protXML for inclusion in the atlas, includes one line listing:
# protein_group
# all protIDs (primary, then indistinguishables; indistinguishables are other protIDs from the bioseq set to which an identical peptide set maps and may be sequence-identical; primary is Swiss-Prot whenever possible)
# probability
# confidence
# presence level (canonical, possibly distinguished, or subsumed)
# protID for the group's canonical representative (the canonical for the group with the highest probability)
# subsumed by
# n_observations
# estimated ng/ml

Also changes field 11 (protein identification) of '''PeptideAtlasInput_{concat,sorted}.PAidentlist'''. This field is updated to match the near-minimal covering set calculated by createPipelineInput during this step.

Mayu is run automatically in step07. However, it may be run at any time after step02a. It makes use of PeptideAtlasInput_concat.PAidentlist with the updated protid fields.
====Step02sc: Estimate protein concentrations====
Runs estimateConcentrations.pl. This step was added in January 2011. It can be run only if a biosequence_set for the build is already loaded. Sometimes the biosequence_set isn't created until step03 and isn't loaded until after the build pipeline is complete. In that case, step02sc and step02b can be run after load_biosequence_set (but before load_atlas_build --prot_info).

Files used:
* '''PeptideAtlasInput.PAprotlist''', created in step02a
* '''sc_calibration.tsv''', optional file of calibration proteins and concentrations

Creates:
* '''PeptideAtlasInput.PAprotlist''', with fields added for estimated concentrations, uncertainty factors, and normalized PSMs per 100K. If no sc_calibration.tsv file was provided, only the last of the three can be calculated.
* '''PeptideAtlasInput.PAprotlist_prelim''', a copy of PeptideAtlasInput.PAprotlist as it was before step02sc.
* '''sc_calibration.png''', a calibration graph. Examine this to be sure your calibration proteins are reasonably well correlated with their spectral counts in your build. Otherwise, the resulting concentration estimates will be unreliable.
====Step02b: Post-process protein identification information====
Files used:
* '''PeptideAtlasInput.PAprotlist''', created in step02a
* '''duplicate_<something>''', created in step02
Creates two files:
* '''PeptideAtlasInput.PAprotIdentlist''' -- lists protIDs for the Atlas, labeling each with a presence_level of canonical, possibly_distinguished, or subsumed. Among all protIDs sharing the same set of peptides, only one protID is listed; the others are listed in PeptideAtlasInput.PAprotRelationships. Biased to include Swiss-Prot identifiers.
* '''PeptideAtlasInput.PAprotRelationships''' -- lists protIDs from the bioseq set that are identical to (identical sequence) or indistinguishable from (same peptide set) protIDs in PeptideAtlasInput.PAprotIdentlist.
Together, the above lists include all protIDs from the bioseq set that include any peptide in the Atlas.
====Step03: Map peptides to reference DB====
Step03 finds all possible matches for each peptide in the reference DB and gets the start and end coordinates of each aligned region. Calls the pipeline script with --BLASTP. The script then calls matchPeptides(), which does a system call of $PIPELINE/bin/PeptidePositionLocator.pl, which searches the DB without calling BLAST and, at the end, prints "N peptides with/without a match".

Files used:
* '''APD_<speciesAbbrev>_all.fasta''' -- lists peptides
* '''<species>.fasta''' -- reference DB from step02
Files created:
* '''peptide_mapping.tsv''' -- for each peptide, one line per reference DB hit, or a single truncated line if no hit:
** PA accession
** pep seq
** protein identifier
** start of aligned region
** end of aligned region
The BLAST program used to be used here, but is no longer.
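A hypothetical example line (tab-separated; the accession, sequence, and protein identifier are invented for illustration):
 PAp00000040	AAAAAAALQAK	ENSP00000360532	1	11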
====Step04: Count mapped peptides; print in different format====
Step04 calls the pipeline script with --BLASTParse. The script then calls BLAST_APD_ENSEMBL(), which reads peptide_mapping.tsv, counts the number of entries for each peptide, and prints out another file with the same number of lines. Counts the number of unique peptide accessions with hits in the ref DB and reports them as "hits" and "perfect hits" (identical as of March 2009). File created:
* '''APD_ensembl_hits.tsv''' -- for each peptide, one line per reference DB hit:
** PA peptide accession
** peptide length
** protein sequence ID
** peptide length
** 100
** start position
** end position
** 0
This file omits those peptide accessions that don't have hits in the reference DB, so it will have the same number of lines as peptide_mapping.tsv, or fewer. The filename is misleading, as the file contains hits to the entire ref DB, not just Ensembl.
====Step 05: Calculate chromosomal coordinates====
Step05 again reads peptide_mapping.tsv, looks up chromosomal coordinates for each entry, and prints them into a new file. Calls the pipeline script with --BLASTParse --getCoordinates. Calls BLAST_APD_ENSEMBL() again, which reads any previous coordinate cache file (.enscache). Uses packages provided by EnsEMBL and kept in $PIPELINE/lib/ensembl-52/modules. Files created:
* '''coordinate_mapping.txt''' -- lines have the same format as peptide_mapping.tsv, plus:
** gene ID
** strand
** start
** end
** transcript stable ID
** gene stable ID<br>
If not an Ensembl mapping, then these fields are filled with zero or UNKNOWN.
* $PIPELINE/new_cache/'''$dbname.enscache''' -- this file has one token per line. Data is organized into six-line chunks:
** gene ID
** strand
** gene ID
** chromosome
** start
** end
As of March 2009 I see such cache files for only 4 species -- human, mouse, yeast, and drosophila -- and they are all dated 2004-2007.

March 2010: It appears that APD_ensembl_hits.tsv is also opened for writing in this step.
====Step 06: Make a list of unmappable peptides====
Step06 calls the pipeline script with --lostAndFound. Files needed:
* '''APD_<speciesAbbrev>_all.fasta''' (peptides for our atlas)
* '''APD_ensembl_hits.tsv''' (mapping)
Files created:
* '''APD_ensembl_lost_queries.dat''' -- a list of PA peptide accessions that are not mappable.
====Step 07: Compile statistics on the peptides and proteins in the build====
Step07 calls the following (the first two directly, the rest via analysis.sh):
* $PIPELINE/bin/calc_all_experiment_models (creates prophet_model.sts)
* $PIPELINE/bin/calc_all_experiment_stats_EWD.pl (calls /net/db/projects/PeptideAtlas/pipeline/bin/calcxmlstats.pl and creates search_dir_stats.txt)
* /regis/sbeams/bin/Mayu/Mayu.pl
* $PIPELINE/bin/fasta_stat.pl
* $PIPELINE/bin/peptide_stats_from_step04.pl
* $SBEAMS/lib/scripts/PeptideAtlas/calcProteinStatistics.pl
* $SBEAMS/lib/scripts/PeptideAtlas/statistics/calcPeptideListStatistics.pl (takes as input PeptideAtlasInput_concat.PAidentlist; creates experiment_contribution_summary.out, PPvsDECOY.dat, and out.2tonsequences)
* $SBEAMS/bin/protein_chance_hits.pl
Files created in analysis directory:
* '''Mayu_out.csv''' -- table of FDRs at spectrum, peptide, and protein levels
* '''prophet_model.sts'''
* '''search_dir_stats.txt''' -- for each experiment, fraction of hits included in Atlas?
* '''ncumpep_vs_nspec-multobs.gif''' -- experiment contribution summary plot
* '''analysis.out''' -- a summary of the results, plus instructions to create an amino acid abundance plot (untested by Terry)

Files created in DATA_FILES directory:
* '''PPvsDECOY.dat''' -- brief stats on number of decoys found in 3 probability ranges
* '''experiment_contribution_summary.out''' -- table used for the PeptideAtlas summary page and creation of the experiment contribution summary plot
* '''out.2tonsequences''' -- multiply observed: "2 to N sequences"
* '''protein_chance_hits.out''' -- used for the States et al. 95% confidence calculation
* '''simplereducedproteins.txt''' -- a rather minimal list of proteins in the sample
* '''msruncounts.txt''' -- number of MS runs per experiment
====Step 08: Build a SpectraST library from the build====
Step08 does not depend on any previous steps and can be executed at any point in the pipeline. Files created:
* <build-name>_all.splib
* <build-name>_all.sptxt
===Define Atlas build via SBEAMS===
===Load data into build===
The commands below are entered on the command line on mimas. Full usage, including desired options, is found in the recipe.
====Load peptides====
$SBEAMS/lib/scripts/PeptideAtlas/load_atlas_build.pl. January 22, 2009: using the --purge and --load options together seems to reload the previous build. Instead, call first with --purge, then again with --load.
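For example (a sketch; only --purge and --load are confirmed above -- the --atlas_build_name option is an assumption, so check the recipe for full usage):
 $ $SBEAMS/lib/scripts/PeptideAtlas/load_atlas_build.pl --atlas_build_name '<build name>' --purge
 $ $SBEAMS/lib/scripts/PeptideAtlas/load_atlas_build.pl --atlas_build_name '<build name>' --load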

Loads information from the following files into SBEAMS:
* '''APD_<organism_abbrev>_all.PAxml''' -- info on each unique peptide
* '''coordinate_mapping.txt'''

====Load protein identifications====
Run load_atlas_build.pl --prot_info.

If you need to purge just the prot_info, do load_atlas_build.pl --prot_info --purge.
====Build search key====
$SBEAMS/lib/scripts/PeptideAtlas/rebuildKeySearch.pl
====Load spectra and spectrum IDs====
====Update statistics====
===Some other step that Eric wants us to add===

==To Do==
* Make it routine to build the biosequence set (reference database) before the atlas build process. Remove this task from step03. Get rid of the SUPPLEMENTAL_PROTEINS file requirement.
* Remove references to multiply-observed peptides on the summary page. Make the plot look at all peptides.
===Compiling protein information for existing atlases===
Protein information for existing atlases can be compiled by executing the following steps, each of which is detailed above:
* Refresh all pepXML to the biosequence set
* Run filterPepXML.pl on each experiment
* Create the master ProteinProphet file
* createPipelineInput.pl --protlist_only
* processPAprotlist.pl
* load_atlas_build --prot_info
==Deprecated stuff==

===Deprecated functionality that may be useful again someday===
Prior to 2009, createPipelineInput did not select preferred protIDs, and thus the following steps used to be necessary after step01:
* step02, to get duplicate_groups.txt, which tells us what the identicals are for the next step
* "refresh" PAidentlist and APD files to Swiss-Prot. These files will then tend to have Swiss-Prot protIDs, but will also have some other protIDs (for human, from IPI, Ensembl, and cRAP) for sequences not identical to anything in Swiss-Prot.
 $SBEAMS/lib/scripts/PeptideAtlas/processPAprotlist.pl --PAidentlist PeptideAtlasInput_concat.PAidentlist
 $SBEAMS/lib/scripts/PeptideAtlas/processPAprotlist.pl --PAidentlist PeptideAtlasInput_sorted.PAidentlist
 $SBEAMS/lib/scripts/PeptideAtlas/createPipelineInput.pl --APD_only --FDR 0.0001 --biosequence_set_id 60 --output_file DATA_FILES/APD_Hs_all

====Combine ''all'' pepXML files for project using iProphet, then run ProteinProphet====
Combining all experiments using iProphet was the pipeline in use Nov. '08 through Feb. '09. We decided in approx. February 2009 that combining experiments using iProphet was too tedious and time-consuming, so this pipeline is no longer in favor, although it probably produces slightly more accurate final peptide probabilities.

First create a directory for your project in your data area, for ample disk space. Run on regis9 for ample memory.
 $ ssh regis9
 $ cd /regis/data3/tfarrah/search
 $ mkdir HsUrine; cd HsUrine; mkdir MultipleExps; cd MultipleExps; mkdir iProphet; cd iProphet
 $ iProphet /regis/sbeams/archive/{phaller,youngah}/*Urine*/*/{XTK,SPC,SEQ}*/interact-prob.pep.xml interact-combined.pep.xml
 $ ProteinProphet interact-combined.pep.xml interact-combined.prot.xml UNMAPPED NORMPROTLEN PROTLEN MININDEP0.2 IPROPHET >& ProteinProphet.out
Combining all pepXML files may not be feasible with many and/or large files. In that case, you will need to run iProphet on the experiments in batches, then run ProteinProphet on all the resulting pepXML files combined. Consult David S. for advice.
Current revision
PeptideAtlas is an online database of peptides identified in tandem MS experiments. Raw MS data are gathered from labs around the world and processed uniformly as described below. Data from a single species, or sometimes a subproteome such as plasma or liver, are combined into subsections of PeptideAtlas called builds. PeptideAtlas is implemented in the SBEAMS database. Another page outlines the body of software associated with PeptideAtlas.
The atlas build process is wild and woolly and unstable. There are several caveats having to do with special cases, and several steps that must be run manually. The whole process is outlined below. Your other steadfast companion on the atlas build journey will be the atlas build recipe, which describes the same steps below but provides actual unix command line invocations that you can cut and paste. When these documents fail you, as they will, feel free to consult the PeptideAtlas team at ISB, currently comprising Zhi Sun, Dave Campbell, team leader and PeptideAtlas creator Eric Deutsch, and the author of this document, Terry Farrah.
October, 2012: We are re-tooling the pipeline. Notes are here!
Copy build recipe and follow it
The unix commands needed to do each step below are given in mimas.systemsbiology.net:/net/db/projects/PeptideAtlas/pipeline/recipes/master_recipe.notes. Copy this file to <your_build>_recipe.notes, follow along, and edit as needed for your build. Most stuff takes place via mimas (a.k.a. db) or dione at /net/db/projects/PeptideAtlas.
The order of the steps in the recipe may differ from the order below. In such cases, either order may be followed.
Start with one or more PeptideProphet output files (pepXML) for each experiment in each project.
A project is a set of related experiments. For example, a project may study proteins found in normal and diseased liver, and may include 4 experiments: tissue from two normal patients and from two diseased patients. The pepXML files should be created by searching the spectra with a database search engine such as SEQUEST, X!Tandem, or SpectraST, then validating the hits using PeptideProphet. If you are going to combine search results using iProphet (see below), iProphet should not be run on each set of search results individually; to avoid running iProphet you will need to run xinteract manually because scripts such as runtandemsearch automatically run iProphet after PeptideProphet.
Searching databases containing decoys is recommended to provide a reference point for evaluating the FDR (false discovery rate) of the final Atlas. As of fall 2008, spectral libraries containing decoys are available for SpectraST searching. When searching with SpectraST, set the database to the biosequence set flatfile in spectrast.params. Also, currently, for SpectraST, RefreshParser is run after PeptideProphet by finishspectrastsearch. If this is not necessary -- and I don't think it is -- omit, because successive refreshes to different DBs may result in proliferation of UNMAPPED protIDs (not sure about this).
It is helpful when referencing files using wildcards if the pepXML files all reside at the same level in the directory tree. If you have to move files to achieve this, adjust the paths within them using
$ /sbeams/bin/updateAllPaths.pl *.xml *.xls *.shtml.
Post-process any glyco search results.
See Processing glycopeptide data
Register projects and experiments using SBEAMS interface.
- Go to db.systemsbiology.net.
- Login to SBEAMS.
- Click tab "My Projects" or "Accessible Projects" and click "Add new project" at bottom.
- Fill out fields. Owner of project should be the experimenter who created the data. Project tag should match name of subdirectory in /sbeams/archive/<project_owner> that contains the data.
- To register experiments, go to SBEAMS Home > Proteomics > Manage Data > Experiments
- Data Path: full path name up to Experiment_Tag directory -- or, if in /regis/sbeams/archive, may omit that prefix.
Run iProphet
If you ran multiple search engines for each experiment, combine per experiment using iProphet
Create a directory, parallel to the search results directories, named iProphet. Then, for example,
$ iProphet ../{XTK,SPC,SEQ}*/interact-prob.pep.xml interact-combined.pep.xml $ ProteinProphet <options> interact-combined.pep.xml interact-combined.prot.xml
Be sure that the input pepXML files were not already processed by iProphet -- you don't want to run iProphet twice. Caution: if you ran an automated post-processing script such as runtandemsearch (which calls finishtandemsearch), iProphet may already have been run automatically. Conversely, if you are using only one search engine and did not run iProphet immediately after running PeptideProphet, run it now.
The resulting pepXML files will be used to generate final peptide probabilities in the "gather all peptides" step of the Atlas build process below.
Build a biosequence set for your species if necessary
The biosequence_set typically contains all target and decoy sequences from the search database, plus additional sequences from other databases such as Ensembl and Swiss-Prot. Mapping peptides to this larger set allows us to take advantage of Swiss-Prot annotations and Ensembl genome mappings. The biosequence set may also contain decoys from old target/decoy databases to allow mapping of old search results. If a current bioseq set does not yet exist for your species, create, then come back to this step. As of April 2010, a recipe for creating a bioseq set resides in /regis/dbase/users/sbeams/DecoyDictionaries/bss_build_recipe.notes
Refresh all iProphet output files to biosequence set
To refresh, run RefreshParser. PepXML now has protIDs from bioseq set.
Run ProteinProphet on the iProphet file for each experiment.
ProtXML inherits protIDs from biosequence set.
The purpose of the per-experiment ProteinProphet run is to adjust the probabilities of all the peptides according to NSP (number of sibling peptides). The adjusted probabilities are not used directly, but are used to generate a multiplicative factor for each peptide which is then applied to the iProphet probability for each observation of that peptide. ProteinProphet protein probabilities are not used during the build process because, for large datasets, they are overly optimistic.
Obtain search batch IDs for each experiment.
Ensure that the file $PIPELINE/etc/protid_priorities.csv contains protein identifier prioritization for your species
Assign higher priorities to identifiers that you prefer to see in the PeptideAtlas. For human and mouse, we prefer Swiss-Prot because of its nice description lines, and also because the human Swiss-Prot, as of 2009, describes a rough draft of the human proteome.
Run step01 of the PeptideAtlas build pipeline
$PIPELINE=/net/db/projects/PeptideAtlas/pipeline
A general, customizable script for running the pipeline can be found in $PIPELINE/run_scripts/run_myBuild.csh. Copy this script to $PIPELINE/run_scripts/run_<name-of-build>.csh and edit. This script ultimately calls $PIPELINE/run_scripts/run_Master.csh, which then calls the pipeline script with various options for each of the various pipeline steps (except the last two steps). The pipeline script is $PIPELINE/bin/$PIPELINE_SCRIPT. $PIPELINE_SCRIPT is defined in $PIPELINE/run_scripts/run_myBuild.csh, and is often peptideAtlasPipeline_Ensembl.pl.
First, run step01 ONLY. You will then have to do some steps manually before proceeding to the other steps.
Step 01: Gather peptides, load into DB, and update probabilities
Step01. Looks at the results of the TPP, adjusts pepXML probabilities based on protXML info, gathers all PSMs (peptide-spectrum matches) that are wanted for this build based on a probability or FDR threshold, and stores all relevant info for each in a combined PA identlist file. Calls createPipelineInput.pl, via pipeline/bin/PeptideFilesGenerator.pm. Creates identlist template file and identlist file for each pepXML in Experiments.list, then the combined file. The identlist template file contains only the unadjusted pepXML info for PSMs with P>=0.4; it is cached in the same dir as each pepXML file to speed future builds.
At the core of this step is the script createPipelineInput.pl.
Files created in build directory, all but last created by createPipelineInput.pl:
- PeptideAtlasInput_concat.PAidentlist
- PeptideAtlasInput_sorted.PAidentlist
These two files are identical except for the sorting of the lines. They are in tsv format and have the following fields:
- search batch ID
- spectrum query
- Peptide Atlas accession
- peptide sequence
- preceding residue
- modified peptide sequence (includes modification & charge info)
- following residue
- charge
- peptide probability
- massdiff
- protein name
- nsp adjusted probability from protXML
- n_adj_obs from protXML
- n_sibling_peps from protXML
- APD_<speciesAbbrev>_all.tsv
- APD_<speciesAbbrev>_all.PAxml
These two files have the same info with different formatting. There is one record per unique peptide, with the following fields:
- Peptide Atlas accession
- biosequence name (same as above, maybe? then, not nec. in biosequence set) X 3
- peptide sequence
- n_peptides
- maximum_probability
- n_experiments
- observed_experiment_list
- biosequence_desc
- searched_experiment_list
- APD_<speciesAbbrev>_all.fasta -- fasta format file of all peptides in our atlas; PeptideAtlas accessions used. (Created by generateFiles() in pipeline/bin/PeptideFilesGenerator.pm script after call to createPipelineInput.pl)
Files created in the directory for each experiment:
- interact-prob.pep.PAidentlist-template -- first 11 fields of PAidentlist files above
- interact-prob.pep.PAidentlist -- same fields as PAidentlist files above
New in May 2009: cPI.pl now chooses a preferred protID from among the indistinguishables, so protIDs in PAidentlist are now biased toward the preferred DB for each species (Swiss-Prot for human and mouse)
Do some steps manually
If any experiments were contaminated with proteins from another organism
If new template files were created, stop after step01, decontaminate any PAidentlist-template files necessary, then run step01 again to create correct PAidentlist and combined PAidentlist. Decontaminate using $PIPELINE/bin/filter_Atlas_files.pl; refer to /regis/sbeams/archive/tconrads/NCIHumanSerum/NCIHumanSerum/SPC_HsNIST2.0/=notes.
Optional: Extract a preliminary covering protein list from combined PAidentlist
This is useful if you want to estimate the protein FDR:
awk '{print $11}' PeptideAtlasInput_sorted.PAidentlist | sort | uniq > protid.list
Estimate the protein FDR by dividing the number of decoys in the list by the number of non-decoys. For the human plasma atlas this gave an approximately 10% overestimate, due to less-than-parsimonious protein ID mapping, which will be corrected in step02a below. If your protein FDR is not what you desire, adjust the PSM_FDR in the run script and run step01 again.
Create a special filtered pepXML for each experiment
This pepXML omits all PSMs for peptides that didn't pass threshold. It is to be used to create the master ProteinProphet file (next step). (Note: this pepXML has weird properties that make it unsuitable for most other purposes you might think of. It contains exactly at most one PSM for each peptide -- the highest probability one. And it contains many low-probability PSMs that correspond to peptides that passed threshold in another experiment.)
The following script automates this step, but is not robust (will break if createPipelineInput.pl's informational output changes):
$PIPELINE/bin/filterPepXMLsForAtlas.sh
... or do manually, getting minprob for each experiment from step01 output:
/regis/sbeams/bin/filterPepXML.pl -p <minprob> DATA_FILES/PeptideAtlasInput_concat.PAidentlist $exp_dir/interact-ipro.pep.xml > $exp_dir interact-ipro-filtered.pep.xml
Create Master ProteinProphet file
Using script created by filterPepXMLsForAtlas.sh above, run ProteinProphet on regis9, then copy the resulting interact-combined.prot.xml to the analysis subdir.
If any protein IDS are overly long (SpectraST builds only)
Truncate overly long protein IDs, most of which begin with DECOY_[01], in interact-combined.prot.xml See recipe for details. Truncate them from the initial | through to the _UNMAPPED. Names with 181 chars are OK; I've been truncating those longer than 181.
Run steps 02 through 08 of the build pipeline
Edit the run script to execute steps 02 through 08, then run the script.
Step02: Download latest fasta files from web for reference DB (also called biosequence set)
Step02. Calls pipeline script with --getEnsembl and executes getEnsembl(). Gets Ensembl fasta file via FTP unless stored locally. Merges in any supplemental protein file specified by calling $PIPELINE/bin/mergeEnsemblAndIPI.pl, which creates the duplicate* and protein2gene.txt files. Files created:
- <species>.pep.fa -- Ensembl file as retrieved via FTP.
- <species>.fasta -- Ensembl plus supplemental sequences. This is the file that is stored as a Biosequence Set in PeptideAtlas.
- duplicate_groups.txt -- each line lists a primary (Ensembl) identifier from the ref DB, then one or more non-primary identifiers. Has header line.
- duplicate_mapping.txt -- created from duplicate_groups.txt. Each line lists a non-primary, primary identifier pair. Has header line.
- duplicate_entries.txt -- non-primary identifiers
- protein2gene.txt -- List of protein/gene identifier pairs, looked up in Ensembl download.
05/18/10 TMF suggested overhaul of this step:
- don't download latest Ensembl. Instead, use whatever version of Ensembl was used for the biosequence set. We are trying to recreate this every 6 months so it should be pretty current.
- don't create the .pep.fa and .fasta files
- Copy mergeEnsemblAndIPI.pl into a new script, createBSSauxiliaryFiles.pl. This script should be run only once, when creating the biosequence set, and should create the last four files listed above. These should be stored in /regis/dbase/users/sbeams/BSS_XX, where XX is the biosequence set ID. Then, step02 will consist ONLY of symlinking those files to the load directory.
- Caution: I just noticed that the file protein2gene.txt is of drastically different size for two different builds from the same BSS. So maybe it is build dependent and needs to be created anew for each build.
Step 02a: Compile protein identifications
Runs createPipelineInput.pl --protlist_only. Produces a new file, PeptideAtlas.PAprotlist. Each line of the file represents a set of protein identifiers that all have exactly the same peptide set.
- PeptideAtlasInput.PAprotlist -- for each <protein> selected from protXML for inclusion in the atlas, includes one line listing
1. protein_group 2. all protIDs (primary, then indistinguishables; indistinguishables are other protIDs from bioseq set to which identical peptide set maps and may be sequence-identical; primary is Swiss-Prot whenever possible) 3. probability 4. confidence 5. presence level (canonical, possibly distinguished, or subsumed) 6. protID for group's canonical representative (canonical for group with highest probability). 7. subsumed by 8. n_observations 9. estimated ng/ml
Also changes field 11 (protein identification) of PeptideAtlasInput_{concat,sorted}.PAidentlist. This field is updated to match the near-minimal covering set calculated by createPipelineInput during this step.
Mayu is run automatically in step07. However, it may be run at any time after step02a. It makes use of PeptideAtlasInput_concat.PAidentlist with the updated protid fields.
Step02sc: Estimate protein concentrations
Runs estimateConcentrations.pl. This step added January 2011. This step can be run only if a biosequence_set for the build is already loaded. Sometimes the biosequence_set isn't created until step03 and isn't loaded until after the build pipeline is complete. In that case, step02sc and step02b can be run after load_biosequence_set (but before load_atlas_build --protinfo).
Files used:
- PeptideAtlasInput.PAprotlist, created in step02a
- sc_calibration.tsv, optional file of calibration proteins and concentrations
Creates:
- PeptideAtlasInput.PAprotlist, with fields added for estimated concentrations, uncertainty factors, and normalized PSMs per 100K. If no sc_calibration.tsv file was provided, only the last of the three can be calculated.
- PeptideAtlasInput.PAprotlist_prelim, a copy of PeptideAtlasInput.PAprotlist as it was before step 02sc.
- sc_calibration.png, a calibration graph. Examine this to be sure your calibration proteins are reasonably well correlated with their spectral counts in your build. Otherwise, the resulting concentration estimates will be unreliable.
Step02b: Post process protein identification information
Files used:
- PeptideAtlasInput.PAprotlist, created in step 02a
- duplicate_<something>, created in step01
Creates two files:
- PeptideAtlasInput.PAprotIdentlist -- lists protIDs for the Atlas, labeling each with a presence_level of canonical, possibly_distinguished, or subsumed. Among all protIDs sharing the same set of peptides, only one protID is listed; the others are listed in PeptideAtlasInput.PAprotRelationships. Biased to include Swiss-Prot identifiers.
- PeptideAtlasInput.PAprotRelationships -- lists protIDs from the bioseq set that are identical to (identical sequence) or indistinguishable from (same peptide set) protIDs in PeptideAtlasInput.PAprotIdentlist.
Together, the above lists include all protIDs from the bioseq set that contain any peptide in the Atlas.
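This step is performed by processPAprotlist.pl (see "Compiling protein information for existing atlases" below); a hedged sketch, assuming it is run in the directory containing PeptideAtlasInput.PAprotlist:
 $ $SBEAMS/lib/scripts/PeptideAtlas/processPAprotlist.pl [options per recipe]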
===Step 03: Map peptides to reference DB===
Finds all possible matches for each peptide in the reference DB and gets the start and end coordinates of each aligned region. Calls the pipeline script with --BLASTP. The script then calls matchPeptides(), which does a system call of $PIPELINE/bin/PeptidePositionLocator.pl; this searches the DB without calling BLAST and, at the end, prints "N peptides with/without a match".
Files used:
- APD_<speciesAbbrev>_all.fasta -- lists peptides
- <species>.fasta -- reference DB from step02
Files created:
- peptide_mapping.tsv -- for each peptide, one line per reference DB hit, or a single truncated line if no hit
- PA accession
- pep seq
- protein identifier
- start of aligned region
- end of aligned region
BLAST was formerly used for this step, but is no longer.
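Because unmapped peptides appear as truncated lines (no protein identifier in the third column), here is a quick illustrative tally of mapping coverage, assuming the column layout documented above:
 $ awk -F'\t' 'NF >= 3 {print $1}' peptide_mapping.tsv | sort -u | wc -l   # peptides with a match
 $ awk -F'\t' 'NF < 3 {print $1}' peptide_mapping.tsv | sort -u | wc -l    # peptides without a match
The two counts should agree with the "N peptides with/without a match" message printed by PeptidePositionLocator.pl.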
===Step 04: Count mapped peptides; print in different format===
Calls the pipeline script with --BLASTParse. The script then calls BLAST_APD_ENSEMBL(), which reads peptide_mapping.tsv, counts the number of entries for each peptide, and prints the results to another file in a different format. It also counts the number of unique peptide accessions with hits in the ref DB and reports them as "hits" and "perfect hits" (identical as of March 2009). File created:
- APD_ensembl_hits.tsv -- for each peptide, one line per reference DB hit:
- PA peptide accession
- peptide length
- protein sequence ID
- peptide length
- 100
- start position
- end position
- 0
This file omits those peptide accessions that don't have hits in the reference DB, so it will have the same or a smaller number of lines than peptide_mapping.tsv. The filename is misleading: it contains hits to the entire ref DB, not just Ensembl.
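As an illustrative one-liner, the number of reference-DB hits per peptide accession can be tallied from the first column of the file:
 $ cut -f1 APD_ensembl_hits.tsv | sort | uniq -c | sort -rn | head   # most-hit peptides first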
===Step 05: Calculate chromosomal coordinates===
Reads peptide_mapping.tsv again, looks up chromosomal coordinates for each entry, and prints them into a new file. Calls the pipeline script with --BLASTParse --getCoordinates, which calls BLAST_APD_ENSEMBL() again; this reads any previous coordinate cache file (.enscache). Uses packages provided by EnsEMBL, kept in $PIPELINE/lib/ensembl-52/modules. Files created:
- coordinate_mapping.txt: lines have same format as peptide_mapping.tsv, plus:
- gene ID
- strand
- start
- end
- transcript stable ID
- gene stable ID
If not an Ensembl mapping, then these fields are filled with zero or UNKNOWN.
- $PIPELINE/new_cache/$dbname.enscache -- this file has one token per line. Data is organized into six-line chunks:
- gene ID
- strand
- gene ID
- chromosome
- start
- end
As of March 2009 I see such cache files for only 4 species: human, mouse, yeast, and drosophila, and they are all dated 2004-2007.
March 2010: It appears that APD_ensembl_hits.tsv is also opened for writing in this step.
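Since the cache holds one token per line in six-line chunks, a quick sanity check of how many entries a cache file contains (assuming a well-formed file):
 $ echo $(( $(wc -l < $PIPELINE/new_cache/$dbname.enscache) / 6 ))   # number of cached entries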
===Step 06: Make a list of unmappable peptides===
Calls the pipeline script with --lostAndFound. Files needed:
- APD_<speciesAbbrev>_all.fasta (peptides for our atlas)
- APD_ensembl_hits.tsv (mapping)
Files created:
- APD_ensembl_lost_queries.dat -- a list of PA peptide accessions that are not mappable.
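An illustrative cross-check of this list, assuming each header in the peptide FASTA begins with the PA accession:
 $ grep '^>' APD_Hs_all.fasta | cut -c2- | cut -d' ' -f1 | sort -u > all_accs.txt
 $ cut -f1 APD_ensembl_hits.tsv | sort -u > mapped_accs.txt
 $ comm -23 all_accs.txt mapped_accs.txt   # should match APD_ensembl_lost_queries.dat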
===Step 07: Compile statistics on the peptides and proteins in the build===
Calls the following (the first two directly, the rest via analysis.sh):
- $PIPELINE/bin/calc_all_experiment_models (creates prophet_model.sts)
- $PIPELINE/bin/calc_all_experiment_stats_EWD.pl (calls /net/db/projects/PeptideAtlas/pipeline/bin/calcxmlstats.pl and creates search_dir_stats.txt)
- /regis/sbeams/bin/Mayu/Mayu.pl
- $PIPELINE/bin/fasta_stat.pl
- $PIPELINE/bin/peptide_stats_from_step04.pl
- $SBEAMS/lib/scripts/PeptideAtlas/calcProteinStatistics.pl
- $SBEAMS/lib/scripts/PeptideAtlas/statistics/calcPeptideListStatistics.pl (takes as input PeptideAtlasInput_concat.PAidentlist; creates experiment_contribution_summary.out, PPvsDECOY.dat, and out.2tonsequences)
- $SBEAMS/bin/protein_chance_hits.pl
Files created in analysis directory:
- Mayu_out.csv -- table of FDRs at spectrum, peptide, and protein levels
- prophet_model.sts
- search_dir_stats.txt -- for each experiment, fraction of hits included in Atlas?
- ncumpep_vs_nspec-multobs.gif -- experiment contribution summary plot
- analysis.out -- a summary of the results, plus instructions to create an amino acid abundance plot (untested by Terry)
Files created in DATA_FILES directory:
- PPvsDECOY.dat -- brief stats on number of decoys found in 3 probability ranges
- experiment_contribution_summary.out -- table used for PeptideAtlas summary page and creation of experiment contribution summary plot
- out.2tonsequences -- multiply observed: "2 to N sequences".
- protein_chance_hits.out -- used for States et al. 95% confidence calculation
- simplereducedproteins.txt -- a rather minimal list of proteins in sample
- msruncounts.txt -- number of MS runs per experiment
===Step 08: Build a SpectraST library from the build===
This step does not depend on any previous steps and can be executed at any point in the pipeline. Files created:
- <build-name>_all.splib
- <build-name>_all.sptxt
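The actual command is in the recipe; as a hedged sketch, SpectraST library creation from the build's pepXML typically looks something like this (the option values here are assumptions, not the documented invocation):
 $ spectrast -cN<build-name>_all -cP0.9 interact*.pep.xml   # -cN: library name; -cP: minimum probability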
===Define Atlas build via SBEAMS===
===Load data into build===
The commands below are entered on the command line on mimas. Full usage, including desired options, is found in the recipe.
====Load peptides====
Run $SBEAMS/lib/scripts/PeptideAtlas/load_atlas_build.pl. January 22, 2009: using the --purge and --load options together seems to reload the previous build. Instead, call first with --purge, then again with --load, as sketched after the file list below.
Loads information from the following files into SBEAMS:
- APD_<organism_abbrev>_all.PAxml -- info on each unique peptide
- coordinate_mapping.txt
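Per the January 2009 note above, a hedged sketch of the purge-then-load sequence (the options identifying the build are placeholders; see the recipe for actual usage):
 $ $SBEAMS/lib/scripts/PeptideAtlas/load_atlas_build.pl --purge [build-identifying options per recipe]
 $ $SBEAMS/lib/scripts/PeptideAtlas/load_atlas_build.pl --load [build-identifying options per recipe]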
====Load protein identifications====
Run load_atlas_build.pl --prot_info.
If you need to purge just the prot_info, run load_atlas_build.pl --prot_info --purge.
====Build search key====
$SBEAMS/lib/scripts/PeptideAtlas/rebuildKeySearch.pl
====Update empirical proteotypic scores====
$SBEAMS/lib/scripts/PeptideAtlas/updateProteotypicScores.pl
====Load spectra and spectrum IDs====
====Update statistics====
====Some other step that Eric wants us to add====
===To Do===
- Make it routine to build the biosequence set (reference database) before the atlas build process. Remove this task from step03. Get rid of the SUPPLEMENTAL_PROTEINS file requirement.
- Remove references to multiply observed peptides on the summary page. Make the plot consider all peptides.
===Compiling protein information for existing atlases===
Protein information for existing atlases can be compiled by executing the following steps, each of which is detailed above:
- Refresh all pepXML to biosequence set
- filterPepXML.pl on each experiment
- Create master ProteinProphet file
- createPipelineInput.pl --protlist_only
- processPAprotlist.pl
- load_atlas_build --prot_info
===Deprecated stuff===
Deprecated functionality that may be useful again someday.
Prior to 2009, createPipelineInput did not select preferred protIDs, so the following steps used to be necessary after step01:
- step02, to get duplicate_groups.txt, which tells us which protIDs are identical, for use in the next step
- "refresh" the PAidentlist and APD files to Swiss-Prot. These files will then tend to have Swiss-Prot protIDs, but will also have some other protIDs (for human, from IPI, Ensembl, and cRAP) for sequences not identical to anything in Swiss-Prot.
 $SBEAMS/lib/scripts/PeptideAtlas/processPAprotlist.pl --PAidentlist PeptideAtlasInput_concat.PAidentlist
 $SBEAMS/lib/scripts/PeptideAtlas/processPAprotlist.pl --PAidentlist PeptideAtlasInput_sorted.PAidentlist
 $SBEAMS/lib/scripts/PeptideAtlas/createPipelineInput.pl --APD_only --FDR 0.0001 -biosequence_set_id 60 --output_file DATA_FILES/APD_Hs_all
====Combine all pepXML files for project using iProphet, then run ProteinProphet====
Combining all experiments using iProphet was the pipeline in use from Nov. 2008 through Feb. 2009. We decided in approximately February 2009 that combining experiments using iProphet was too tedious and time-consuming, so this pipeline is no longer in favor, although it probably produces slightly more accurate final peptide probabilities. First create a directory for your project in your data area, for ample disk space. Run on regis9 for ample memory.
 $ ssh regis9
 $ cd /regis/data3/tfarrah/search
 $ mkdir HsUrine; cd HsUrine; mkdir MultipleExps; cd MultipleExps; mkdir iProphet; cd iProphet
 $ iProphet /regis/sbeams/archive/{phaller,youngah}/*Urine*/*/{XTK,SPC,SEQ}*/interact-prob.pep.xml
 $ ProteinProphet interact-combined.pep.xml interact-combined.prot.xml UNMAPPED NORMPROTLEN PROTLEN MININDEP0.2 IPROPHET >& ProteinProphet.out
Combining all pepXML files at once may not be feasible with many and/or large files. In that case, you will need to run iProphet on the experiments in batches, then run ProteinProphet on all the resulting pepXML files combined. Consult David S. for advice.