Building Peptide Atlas
Three PeptideAtlas build pipelines are described below:
- Cutting-edge, pre-release pipeline which allows incorporation of more useful protein info into the Atlas
- Standard pipeline, in use when Terry arrived August 2008
- Pipeline which uses iProphet to combine experiments. Slightly better results, but cumbersome, especially for large numbers of experiments.
May 2009 pipeline: generate more informative protein lists
This pipeline is under development, but is currently functional. In this pipeline, multiple searches may be combined using iProphet, but experiments are NOT combined using iProphet. (We decided in approx. February 2009 that combining experiments using iProphet was too tedious and time-consuming.) ProteinProphet is run on each experiment AND on all experiments combined. For details on specific steps, refer to the description of the November 2008 pipeline below.
- Post-process any glyco search results
- PeptideProphet, RefreshParser (for SpectraST searches and any searches against a database different from the primary search DB), and iProphet. All pepXML files now have protIDs from the primary search DB -- for human, usually IPI. NOTE: if RefreshParser is necessary anyway, it may be best to RefreshParse all PeptideProphet pepXML to the bioseq set here, from the get-go, so there aren't two layers of RefreshParser, each contributing more UNMAPPED entries.
- RefreshParser to refresh all interact-ipro.pep.xml to bioseq set. If bioseq set does not yet exist, need to build through step03 first. PepXML now has protIDs from bioseq set.
- ProteinProphet per expt. ProtXML inherits protIDs from bioseq set.
- Mayu & calctppstat
- prophetAll, then copy master protXML to analysis dir. This protXML inherits protIDs from bioseq set.
- step01 to get *.PAidentlist-template, *.PAidentlist, PeptideAtlas.PAprotlist, and APD* files and load any new peptides into the Atlas. These files all inherit protIDs from bioseq set.
- PeptideAtlas.PAprotlist -- for each <protein> selected from protXML for inclusion in the atlas, includes one line listing
- protein_group
- all protIDs (primary plus indistinguishables; indistinguishables are other protIDs from bioseq set to which identical peptide set maps; primary seems to be the first alphabetically)
- probability
- confidence
- presence level (canonical, possibly distinguished, or subsumed)
- protID for group's canonical representative (canonical for group with highest probability).
- if new template files were created, decontaminate any PAidentlist-template files as necessary, then run step01 again to create the correct per-experiment PAidentlist files and the combined PAidentlist
- step02 to get duplicate_groups.txt, which tells us what the identicals are for the next step
- "refresh" PAidentlist and APD files to Swiss-Prot. These files will now tend to have Swiss-Prot protIDs, but will also have some other protIDs (for human, they will be from IPI, Ensembl, and cRAP) for sequences not identical to anything in Swiss-Prot.
$PIPELINE/bin/processPAprotlist.pl --PAidentlist PeptideAtlasInput_concat.PAidentlist
$PIPELINE/bin/processPAprotlist.pl --PAidentlist PeptideAtlasInput_sorted.PAidentlist
$SBEAMS/lib/scripts/PeptideAtlas/createPipelineInput.pl --APD_only --FDR 0.0001 -biosequence_set_id 60 --output_file DATA_FILES/APD_Hs_all
Now a very nice minimal protein list can be gleaned from the combined PAidentlist:
awk '{print $11}' PeptideAtlasInput_sorted.PAidentlist | sort | uniq
This is the "covering" protein list for the Atlas -- smallest list to explain all peptides, with mostly Swiss-Prot IDs.
- Create protein identification lists:
- PeptideAtlas.PAprotIdentlist -- lists protIDs for Atlas, labeling each with a presence_level of either canonical, possibly_distinguished, or subsumed. Only one protID listed per peptide set. Biased to include Swiss-Prot identifiers.
- PeptideAtlas.PAprotRelationships -- lists protIDs from bioseq set that are identical to (identical sequence) or indistinguishable from (same peptide set) protIDs in PeptideAtlas.PAprotIdentlist
$PIPELINE/bin/processPAprotlist.pl (run from DATA_FILES)
Can do this any time before load. Together, the above lists include all protIDs from bioseq set that include any peptide in Atlas. As of 5/19/09, usefulness of the presence_level assignments is suspect -- there are more canonicals than necessary.
- steps 03-08, and load
Standard Pipeline: run iProphet/ProteinProphet on each expt. individually
The current (as of June 2009) standard build pipeline is the same as that described below, except:
- Run iProphet/ProteinProphet on each experiment individually. Do not combine all experiments with iProphet.
- Run ProteinProphet again on all experiments combined, creating a MASTER_PROTEINPROPHET_FILE.
Feel free to contact Terry for more details.
Combining all experiments using iProphet: pipeline in use Nov. '08 through Feb. '09
Start with one or more PeptideProphet output files (pepXML) for each experiment in each project.
A project is a set of related experiments. For example, a project may study proteins found in normal and diseased liver, and may include 4 experiments: tissue from two normal patients and from two diseased patients. The pepXML files should be created by searching the spectra with a database search engine such as SEQUEST, X!Tandem, or SpectraST, then validating the hits using PeptideProphet. If you are going to combine search results using iProphet (see below), do not run iProphet on each set of search results individually. Because scripts such as runtandemsearch automatically run iProphet after PeptideProphet, you will need to run xinteract manually to avoid this.
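If you do need to run xinteract by hand, a minimal sketch looks like the following. The working directory, input file pattern, and probability cutoff are illustrative; check your TPP version's xinteract usage message for the full option set, and in particular do not pass the option that chains iProphet (-ip in recent TPP releases).
$ cd <experiment>/XTK
$ xinteract -Ninteract-prob.pep.xml -p0.05 *.tandem.pep.xml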
Searching databases containing decoys is recommended to provide a reference point for evaluating the FDR (false discovery rate) of the final Atlas. As of fall 2008, spectral libraries containing decoys are available for SpectraST searching.
It is helpful when referencing files using wildcards if the pepXML files all reside at the same level in the directory tree. If you have to move files to achieve this, adjust the paths within them using
$ /sbeams/bin/updateAllPaths.pl *.xml *.xls *.shtml
Copy build recipe and follow it
The unix commands needed to do each step below are given in mimas.systemsbiology.net:/net/db/projects/PeptideAtlas/pipeline/recipes/master_recipe.notes. Copy this file to <your_build>_recipe.notes, follow along, and edit as needed for your build. Most stuff takes place via mimas (a.k.a. db) at /net/db/projects/PeptideAtlas.
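For example, assuming a hypothetical build named HsUrine_200906:
$ cd /net/db/projects/PeptideAtlas/pipeline/recipes
$ cp master_recipe.notes HsUrine_200906_recipe.notes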
Register projects and experiments using SBEAMS interface.
- Go to db.systemsbiology.net.
- Login to SBEAMS.
- Click tab "My Projects" or "Accessible Projects" and click "Add new project" at bottom.
- Fill out fields. Owner of project should be the experimenter who created the data. Project tag should match name of subdirectory in /sbeams/archive/<project_owner> that contains the data.
- To register experiments, go to SBEAMS Home > Proteomics > Manage Data > Experiments
- Data Path: full path name up to Experiment_Tag directory -- or, if in /regis/sbeams/archive, may omit that prefix.
Run iProphet and ProteinProphet
If you ran multiple search engines for each experiment, combine per experiment using iProphet.
Create a directory, parallel to the search results directories, named iProphet. Then, for example,
$ iProphet ../{XTK,SPC,SEQ}*/interact-prob.pep.xml interact-combined.pep.xml
Be sure that the input pepXML files were not already processed by iProphet -- you don't want to run iProphet twice. Caution: if you ran an automated post-processing script such as runtandemsearch (which calls finishtandemsearch), iProphet may already have been run automatically. On the other hand, if you are using only one search engine and did not run iProphet immediately after running PeptideProphet, run it now.
The resulting pepXML files will be used to generate final peptide probabilities in the "gather all peptides" step of the Atlas build process below.
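One way to check whether a pepXML file has already been through iProphet is to look for the analysis_summary that iProphet writes (attribute value assumed from standard TPP output; adjust the path pattern to your directory layout):
$ grep -l 'analysis="interprophet"' ../{XTK,SPC,SEQ}*/interact-prob.pep.xml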
Combine all pepXML files for project using iProphet, then run ProteinProphet
First create a directory for your project in your data area, where there is ample disk space. Run on regis9 for ample memory.
$ ssh regis9
$ cd /regis/data3/tfarrah/search
$ mkdir HsUrine; cd HsUrine
$ mkdir MultipleExps; cd MultipleExps
$ mkdir iProphet; cd iProphet
$ iProphet /regis/sbeams/archive/{phaller,youngah}/*Urine*/*/{XTK,SPC,SEQ}*/interact-prob.pep.xml
$ ProteinProphet interact-combined.pep.xml interact-combined.prot.xml UNMAPPED NORMPROTLEN PROTLEN MININDEP0.2 IPROPHET >& ProteinProphet.out
Combining all pepXML files may not be feasible with many and/or large files. In that case, you will need to run iProphet on the experiments in batches, then run ProteinProphet on all the resulting pepXML files combined. Consult David S. for advice.
The purpose of the ProteinProphet run is to adjust the probabilities of all the peptides according to NSP (number of sibling peptides). The adjusted probabilities are not used directly, but are used to generate a multiplicative factor for each peptide which is then applied to the iProphet probability for each observation of that peptide. In particular, it is important to note that, as of January 2009, ProteinProphet protein probabilities are not displayed or used in any way in the PeptideAtlas. This is because these probabilities are, for large datasets, overly optimistic.
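The NSP-adjusted values produced by this ProteinProphet run can be spot-checked directly in the protXML; the sketch below assumes the standard protXML attribute names (initial_probability, nsp_adjusted_probability) and GNU grep:
$ grep -o 'nsp_adjusted_probability="[^"]*"' interact-combined.prot.xml | head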
Obtain search batch IDs for each experiment.
Run PeptideAtlas build "pipeline".
Scripts can be found in /net/db/projects/PeptideAtlas/pipeline/run_scripts. Each script ultimately calls /net/db/projects/PeptideAtlas/pipeline/run_scripts/run_Master_current.csh, and this is where the meat of the pipeline resides.
Gather peptides and update probabilities
Step01. Looks at the results of the TPP, adjusts pepXML probabilities based on protXML info, gathers all PSMs (peptide-spectrum matches) that are wanted for this build based on a probability or FDR threshold, and stores all relevant info for each in a combined PA identlist file. Calls createPipelineInput.pl, via pipeline/bin/PeptideFilesGenerator.pm. Creates identlist template file and identlist file for each pepXML in Experiments.list, then the combined file. The identlist template file contains only the unadjusted pepXML info for PSMs with P>=0.4; it is cached in the same dir as each pepXML file to speed future builds.
At the core of this step is the script createPipelineInput.pl.
Files created in build directory, all but last created by createPipelineInput.pl:
- PeptideAtlasInput_concat.PAidentlist
- PeptideAtlasInput_sorted.PAidentlist
These two files are identical except for the sorting of the lines. They are in tsv format and have the following fields:
- search batch ID
- spectrum query
- Peptide Atlas accession
- peptide sequence
- preceding residue
- modified peptide sequence (includes modification & charge info)
- following residue
- charge
- peptide probability
- massdiff
- protein name
- nsp adjusted probability from protXML
- n_adj_obs from protXML
- n_sibling_peps from protXML
- APD_Sc_all.tsv
- APD_Sc_all.PAxml
These two files have the same info with different formatting. There is one record per unique peptide, with the following fields:
- Peptide Atlas accession
- biosequence name (same as above, maybe? then, not nec. in biosequence set) X 3
- peptide sequence
- n_peptides
- maximum_probability
- n_experiments
- observed_experiment_list
- biosequence_desc
- searched_experiment_list
- APD_Sc_all.fasta -- fasta format file of all peptides in our atlas; PeptideAtlas accessions used. (Created in pipeline script after call to createPipelineInput.pl)
Files created in the directory for each experiment:
- interact-prob.pep.PAidentlist-template -- first 11 fields of PAidentlist files above
- interact-prob.pep.PAidentlist -- same fields as PAidentlist files above
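A quick sanity check on the combined identlist is to confirm that every search batch contributed PSMs; this minimal sketch relies on the column layout listed above (field 1 is the search batch ID):
$ awk '{n[$1]++} END {for (b in n) print b, n[b]}' PeptideAtlasInput_concat.PAidentlist | sort -n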
Download latest fasta files from web for reference DB (also called biosequence set)
Step02. Calls pipeline script with --getEnsembl and executes getEnsembl(). Gets Ensembl fasta file via FTP unless stored locally. Merges in any supplemental protein file specified by calling $PIPELINE/bin/mergeEnsemblAndIPI.pl, which creates the duplicate* and protein2gene.txt files. Files created:
- <species>.pep.fa -- Ensembl file as retrieved via FTP.
- <species>.fasta -- Ensembl plus supplemental sequences. This is the file that is stored as a Biosequence Set in PeptideAtlas.
- duplicate_groups.txt -- each line lists a primary (Ensembl) identifier from the ref DB, then one or more non-primary identifiers. Has header line.
- duplicate_mapping.txt -- created from duplicate_groups.txt. Each line lists a non-primary, primary identifier pair. Has header line.
- duplicate_entries.txt -- non-primary identifiers
- protein2gene.txt -- List of protein/gene identifier pairs, looked up in Ensembl download.
Seems to me that this step can be skipped if we are using a reference DB / biosequence set that already exists in SBEAMS -- often the case. Instead, we can copy the above files from another build that uses the same biosequence set. Also seems that this step, like the SpectraST library building step, is independent of all the others.
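A sketch of that shortcut, assuming the other build's DATA_FILES directory is laid out the same way (the <other_build_dir> path is a placeholder; copy the <species>.fasta as well if it is not already present):
$ cd DATA_FILES
$ cp <other_build_dir>/DATA_FILES/{duplicate_groups.txt,duplicate_mapping.txt,duplicate_entries.txt,protein2gene.txt} .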
Map peptides to reference DB
Step03. Finds all possible matches for each peptide in the reference DB and gets the start and end coordinates of each aligned region. Calls pipeline script with --BLASTP. The script then calls matchPeptides(), which does a system call of $PIPELINE/bin/PeptidePositionLocator.pl; this seems to search the DB without calling BLAST and, at the end, prints "N peptides with/without a match". Files created:
- peptide_mapping.tsv -- for each peptide, one line per reference DB hit, or a single truncated line if no hit
- PA accession
- pep seq
- protein identifier
- start of aligned region
- end of aligned region
The BLAST program was formerly used for this step, but is no longer.
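Since unmapped peptides show up as truncated lines, a rough count of them is possible with awk (assuming, per the description above, that truncated lines carry fewer than the five fields):
$ awk -F'\t' 'NF < 5' peptide_mapping.tsv | wc -l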
Count mapped peptides; print in different format
Step04. Calls pipeline script with --BLASTParse. Script then calls BLAST_APD_ENSEMBL(), which reads peptide_mapping.tsv, counts the number of entries for each peptide, and prints another file in a different format (see below). It also counts the number of unique peptide accessions with hits in the ref DB and reports them as "hits" and "perfect hits" (identical as of March 2009). File created:
- APD_ensembl_hits.tsv -- for each peptide, one line per reference DB hit:
- PA peptide accession
- peptide length
- protein sequence ID
- peptide length
- 100
- start position
- end position
- 0
This file omits those peptide accessions that don't have hits in the reference DB, so it will have the same number of lines as peptide_mapping.tsv, or fewer. The filename is misleading: it contains hits to the entire ref DB, not just Ensembl.
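Both claims are easy to spot-check (the cut assumes tab-separated columns with the PA accession first, as listed above):
$ wc -l peptide_mapping.tsv APD_ensembl_hits.tsv
$ cut -f1 APD_ensembl_hits.tsv | sort -u | wc -l   # the number reported as "hits"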
Calculate chromosomal coordinates
Step05. Again reads peptide_mapping.tsv, looks up chromosomal coordinates for each entry, and prints them into a new file. Calls pipeline script with --BLASTParse --getCoordinates. Calls BLAST_APD_ENSEMBL() again, which reads any previous coordinate cache file (.enscache). Files created:
- coordinate_mapping.txt: lines have same format as peptide_mapping.tsv, plus:
- gene ID
- strand
- start
- end
- transcript stable ID
- gene stable ID
If not an Ensembl mapping, then these fields are filled with zero or UNKNOWN.
- $PIPELINE/new_cache/$dbname.enscache -- this file has one token per line. Data is organized into six line chunks:
- gene ID
- strand
- gene ID
- chromosome
- start
- end
As of March 2009 I see such cache files for only 4 species: human, mouse, yeast, and drosophila, and they are all dated 2004-2007.
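A hedged spot-check of the extra columns in coordinate_mapping.txt (column positions assume the five peptide_mapping.tsv fields come first, putting the gene ID in column 6; the zero/UNKNOWN values are as described above):
$ awk -F'\t' '$6 == "0" || $6 == "UNKNOWN"' coordinate_mapping.txt | wc -l   # entries without an Ensembl mapping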
Make a list of unmappable peptides
Step06. Calls pipeline script with --lostAndFound. Files needed:
- APD_<organism>_all.fasta (peptides for our atlas)
- APD_ensembl_hits.tsv (mapping)
Files created:
- APD_ensembl_lost_queries.dat -- a list of PA peptide accessions that are not mappable.
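A rough consistency check, assuming one accession per line in the lost-queries file (human file names shown; substitute your organism tag): the number of peptides in the atlas fasta minus the unique-accession count from Step04 should equal the number of lost queries.
$ grep -c '^>' APD_Hs_all.fasta
$ wc -l < APD_ensembl_lost_queries.dat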
Compile statistics on the peptides and proteins in the build
Step07. Calls the following:
- /regis/sbeams/bin/Mayu/Mayu.pl
- $PIPELINE/bin/fasta_stat.pl
- $PIPELINE/bin/peptide_stats_from_step04.pl
- $SBEAMS/lib/scripts/PeptideAtlas/calcProteinStatistics.pl
- $SBEAMS/lib/scripts/PeptideAtlas/statistics/calcPeptideListStatistics.pl
- $SBEAMS/bin/protein_chance_hits.pl
Files created in analysis directory:
- Mayu_out.csv -- table of FDRs at spectrum, peptide, and protein levels
- prophet_model.sts
- search_dir_stats.txt -- for each experiment, fraction of hits included in Atlas?
- analysis.out -- a summary of the results, plus instructions to do the following manually:
- creating an experiment contribution plot. A plot for the web Atlas is generated automatically, but these instructions tell you how to create a better one.
- creating an amino acid abundance plot. Untested by Terry.
Files created in DATA_FILES directory:
- out.2tonsequences -- multiply observed: "2 to N sequences". Who creates this?
- PPvsDECOY.dat -- brief stats on number of decoys found in 3 probability ranges
- experiment_contribution_summary.out -- table used for PeptideAtlas summary page
- protein_chance_hits.out -- used for States et al. 95% confidence calculation
- simplereducedproteins.txt -- a rather minimal list of proteins in sample
- msruncounts.txt -- number of MS runs per experiment
Build a SpectraST library from the build
Step08. Does not depend on any previous steps and can be executed at any point in the pipeline. Files created:
- <build-name>_all.splib
- <build-name>_all.sptxt
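Step08 ultimately runs SpectraST in library-creation mode; a minimal hand-run equivalent might look like the following (the probability cutoff is illustrative and the pipeline's actual options live in the run script; -cN sets the output library name, -cP the minimum probability):
$ spectrast -cN<build-name>_all -cP0.9 interact-combined.pep.xml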
Load the reference DB (biosequence set) if a new one is needed
Define Atlas build via SBEAMS
Load data into build
Commands below are entered on the command line on mimas. Full usage, including desired options, is found in the recipe.
Load data
$SBEAMS/lib/scripts/PeptideAtlas/load_atlas_build.pl. January 22, 2009: using --purge and --load options together seems to reload previous build. Instead, call first with --purge, then again with --load.
Loads information from the following files into SBEAMS:
- APD_<organism_abbrev>_all.PAxml -- info on each unique peptide
- coordinate_mapping.txt
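Given the January 2009 note above, the safer pattern is two separate invocations (all other options come from the recipe):
$ $SBEAMS/lib/scripts/PeptideAtlas/load_atlas_build.pl --purge [options from recipe]
$ $SBEAMS/lib/scripts/PeptideAtlas/load_atlas_build.pl --load [options from recipe]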
Build search key
$SBEAMS/lib/scripts/PeptideAtlas/rebuildKeySearch.pl
Update empirical proteotypic scores
$SBEAMS/lib/scripts/PeptideAtlas/updateProteotypicScores.pl