Building Peptide Atlas

From SPCTools

Revision as of 00:24, 13 March 2009

Here is how I built a human urine PeptideAtlas in December 2008.

1 Start with one or more PeptideProphet output files (pepXML) for each experiment in each project.
2 Copy build recipe and follow it
3 Register projects and experiments using SBEAMS interface.
4 Run iProphet and ProteinProphet
- 4.1 If you ran multiple search engines for each experiment, combine per experiment using iProphet.
- 4.2 Combine all pepXML files for project using iProphet, then run ProteinProphet
5 Obtain search batch IDs for each experiment.
6 Run PeptideAtlas build "pipeline".
7 Load the reference DB (biosequence set) if new one is needed
8 Define Atlas build via SBEAMS
9 Load data into build

Start with one or more PeptideProphet output files (pepXML) for each experiment in each project.

A project is a set of related experiments. For example, a project may study proteins found in normal and diseased liver, and may include 4 experiments: tissue from two normal patients and from two diseased patients. The pepXML files should be created by searching the spectra with a database search engine such as SEQUEST, X!Tandem, or SpectraST, then validating the hits using PeptideProphet. If you are going to combine search results using iProphet (see below), do not run iProphet on each set of search results individually. Scripts such as runtandemsearch automatically run iProphet after PeptideProphet, so to avoid that you will need to run xinteract manually.

Searching databases containing decoys is recommended to provide a reference point for evaluating the FDR (false discovery rate) of the final Atlas. As of fall 2008, spectral libraries containing decoys are available for SpectraST searching.

Referencing files with wildcards is easier if the pepXML files all reside at the same level in the directory tree. If you have to move files to achieve this, adjust the paths within them using

$ /sbeams/bin/updateAllPaths.pl *.xml *.xls *.shtml
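To check whether the pepXML files really do sit at the same level before relying on a wildcard, something like the following sketch may help. It is not part of the recipe; the function name pepxml_depths and the demo directory names are made up for illustration, and it assumes the files are named interact-prob.pep.xml as in the examples further down.

```shell
#!/bin/sh
# Sketch: report how many pepXML files sit at each directory depth below a root.
# Output lines are "<depth> <count>"; a single line means one wildcard pattern
# such as */*/interact-prob.pep.xml can reach all of the files.
pepxml_depths() {
    root=$1
    # List matching paths relative to root, count the directories in each
    # path, and tally how many files share each depth.
    (cd "$root" && find . -name 'interact-prob.pep.xml' -type f) |
        awk -F/ '{ depth[NF - 2]++ } END { for (d in depth) print d, depth[d] }' |
        sort -n
}

# Demo on a throwaway tree (hypothetical experiment names).
demo=$(mktemp -d)
mkdir -p "$demo/exp1/XTK" "$demo/exp2/XTK" "$demo/exp3/deeper/XTK"
touch "$demo/exp1/XTK/interact-prob.pep.xml" \
      "$demo/exp2/XTK/interact-prob.pep.xml" \
      "$demo/exp3/deeper/XTK/interact-prob.pep.xml"
pepxml_depths "$demo"
```

In the demo, two files are two directories deep and one is three deep, so the output has two lines and the third experiment would need to be moved.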

Copy build recipe and follow it

The Unix commands needed for each step below are given in mimas.systemsbiology.net:/net/db/projects/PeptideAtlas/pipeline/recipes/master_recipe.notes. Copy this file to <your_build>_recipe.notes, follow along, and edit as needed for your build. Most of the work takes place on mimas (a.k.a. db) under /net/db/projects/PeptideAtlas.

Register projects and experiments using SBEAMS interface.

Go to db.systemsbiology.net.
Login to SBEAMS.
Click tab "My Projects" or "Accessible Projects" and click "Add new project" at bottom.
Fill out fields. Owner of project should be the experimenter who created the data. Project tag should match name of subdirectory in /sbeams/archive/<project_owner> that contains the data.
To register experiments, go to SBEAMS Home > Proteomics > Manage Data > Experiments

Run iProphet and ProteinProphet

If you ran multiple search engines for each experiment, combine per experiment using iProphet.

Create a directory, parallel to the search results directories, named iProphet. Then, for example,

$ iProphet ../{XTK,SPC,SEQ}*/interact-prob.pep.xml interact-combined.pep.xml

Be sure that the input pepXML files were not already processed by iProphet; you don't want to run iProphet twice. Caution: if you ran an automated post-processing script such as runtandemsearch (which calls finishtandemsearch), iProphet may already have been run automatically. Conversely, if you are using only one search engine and did not run iProphet immediately after running PeptideProphet, run it now.
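One way to check whether a pepXML file has already been through iProphet is to look for its marker in the file. The sketch below assumes that iProphet output is tagged with analysis="interprophet"; verify this against one of your own files before trusting it. The function name has_iprophet and the demo files are made up for illustration.

```shell
#!/bin/sh
# Sketch: exit status 0 if a pepXML file already carries iProphet results.
# ASSUMPTION: iProphet marks its output with analysis="interprophet".
has_iprophet() {
    grep -q 'analysis="interprophet"' "$1"
}

# Demo with two tiny stand-in files (not real pepXML).
demo=$(mktemp -d)
printf '<analysis_summary analysis="peptideprophet"/>\n' > "$demo/pp_only.pep.xml"
printf '<analysis_summary analysis="peptideprophet"/>\n<analysis_summary analysis="interprophet"/>\n' > "$demo/ipro.pep.xml"

for f in "$demo"/*.pep.xml; do
    if has_iprophet "$f"; then
        echo "$(basename "$f"): iProphet already run, do not run it again"
    else
        echo "$(basename "$f"): no iProphet results found"
    fi
done
```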

The resulting pepXML files will be used to generate final peptide probabilities in the "gather all peptides" step of the Atlas build process below.

Combine all pepXML files for project using iProphet, then run ProteinProphet

First create a directory for your project in your data area, for ample disk space. Run on regis9 for ample memory.

$ ssh regis9
$ cd /regis/data3/tfarrah/search
$ mkdir HsUrine; cd HsUrine; mkdir MultipleExps; cd MultipleExps; mkdir iProphet; cd iProphet
$ iProphet /regis/sbeams/archive/{phaller,youngah}/*Urine*/*/{XTK,SPC,SEQ}*/interact-prob.pep.xml interact-combined.pep.xml
$ ProteinProphet interact-combined.pep.xml interact-combined.prot.xml UNMAPPED NORMPROTLEN PROTLEN MININDEP0.2 IPROPHET >& ProteinProphet.out

Combining all pepXML files may not be feasible with many and/or large files. In that case, you will need to run iProphet on the experiments in batches, then run ProteinProphet on all the resulting pepXML files combined. Consult David S. for advice.
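As a rough illustration of batching, the sketch below prints (but does not run) one iProphet command per batch, following the "inputs, then output" argument order used in the example above. The function name, the batch size, and the batchN.pep.xml output names are all hypothetical, and paths containing spaces are not handled.

```shell
#!/bin/sh
# Sketch: print (not run) one iProphet command per batch of pepXML files.
# Reads file paths on stdin, one per line; $1 is the batch size.
# The batch output names (batchN.pep.xml) are hypothetical.
print_iprophet_batches() {
    awk -v size="$1" '
        { files = files " " $0; n++ }
        n == size {
            printf "iProphet%s batch%d.pep.xml\n", files, ++b
            files = ""; n = 0
        }
        END { if (n > 0) printf "iProphet%s batch%d.pep.xml\n", files, ++b }
    '
}

# Demo: five hypothetical experiments, batches of two.
printf '%s\n' exp1/interact-prob.pep.xml exp2/interact-prob.pep.xml \
    exp3/interact-prob.pep.xml exp4/interact-prob.pep.xml \
    exp5/interact-prob.pep.xml | print_iprophet_batches 2
```

Review the printed commands, run them, and then give ProteinProphet all of the resulting batch files together.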

The purpose of the ProteinProphet run is to adjust the probabilities of all the peptides according to NSP (number of sibling peptides). The adjusted probabilities are not used directly, but are used to generate a multiplicative factor for each peptide which is then applied to the iProphet probability for each observation of that peptide. In particular, it is important to note that, as of January 2009, ProteinProphet protein probabilities are not displayed or used in any way in the PeptideAtlas. This is because these probabilities are, for large datasets, overly optimistic.

Obtain search batch IDs for each experiment.

Run PeptideAtlas build "pipeline".

Scripts can be found in /net/db/projects/PeptideAtlas/pipeline/run_scripts. Each script ultimately calls /net/db/projects/PeptideAtlas/pipeline/run_scripts/run_Master_current.csh, and this is where the meat of the pipeline resides.

Gather all peptides and update probabilities

Step01. Calls createPipelineInput.pl via pipeline/bin/PeptideFilesGenerator.pm. Creates an "identlist" file for each pepXML file in Experiments.list, plus a combined file. An identlist is a simple text file (one record per line) containing all the relevant pepXML and protXML info for each peptide identification with P >= threshold (usually 0.9), with peptide probabilities adjusted using the protXML probabilities as a guide. An identlist template file is also created, containing only the unadjusted pepXML info for peptides with P >= (threshold - 0.4); it is cached in the same directory as each pepXML file for use in future builds.
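To make the two cutoffs concrete, here is a small sketch of probability filtering. It only illustrates the thresholds; it does not reproduce the probability adjustment that createPipelineInput.pl performs. The two-column layout (peptide, probability) and the peptide names are assumptions for the demo, since the real identlist columns are not described here.

```shell
#!/bin/sh
# Sketch: keep records whose probability meets a cutoff.
# ASSUMPTION: tab-separated input with the probability in column 2; the real
# identlist column layout is not described on this page.
filter_by_prob() {
    awk -F '\t' -v min="$1" '$2 + 0 >= min + 0'
}

threshold=0.9
# Cutoff for the cached template file: threshold - 0.4
template_cutoff=$(awk -v t="$threshold" 'BEGIN { printf "%.1f", t - 0.4 }')

# Three made-up identifications.
demo=$(mktemp)
printf 'PEPTIDEA\t0.95\nPEPTIDEB\t0.60\nPEPTIDEC\t0.30\n' > "$demo"

echo "Build identlist (P >= $threshold):"
filter_by_prob "$threshold" < "$demo"
echo "Template (P >= $template_cutoff):"
filter_by_prob "$template_cutoff" < "$demo"
```

With the usual threshold of 0.9, the template cutoff works out to 0.5, so the template keeps lower-scoring identifications that a later build with a different threshold might still need.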

At the core of this step is the script createPipelineInput.pl.

Files created in build directory, all but last created by createPipelineInput.pl:

PeptideAtlasInput_concat.PAidentlist
PeptideAtlasInput_sorted.PAidentlist
APD_Sc_all.tsv
APD_Sc_all.PAxml
APD_Sc_all.fasta (created in pipeline script after call to createPipelineInput.pl)
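After a step finishes, it can be worth confirming that the files listed for it exist and are non-empty before moving on. The helper below is a sketch, not part of the pipeline; check_outputs is a made-up name, and the same function could be pointed at the file lists of later steps.

```shell
#!/bin/sh
# Sketch: report which expected files are missing from a build directory.
# Prints one line per missing or empty file; returns non-zero if any are missing.
check_outputs() {
    dir=$1; shift
    missing=0
    for f in "$@"; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Demo on a throwaway directory that holds only two of the five files.
demo=$(mktemp -d)
echo data > "$demo/PeptideAtlasInput_concat.PAidentlist"
echo data > "$demo/PeptideAtlasInput_sorted.PAidentlist"
check_outputs "$demo" \
    PeptideAtlasInput_concat.PAidentlist \
    PeptideAtlasInput_sorted.PAidentlist \
    APD_Sc_all.tsv APD_Sc_all.PAxml APD_Sc_all.fasta ||
    echo "Step01 outputs incomplete"
```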

Download latest fasta files from web for reference DB (also called biosequence set)

Step02. Calls pipeline script with --getEnsembl and executes getEnsembl(). Gets Ensembl fasta file via FTP unless stored locally. Merges in any supplemental protein file specified by calling $PIPELINE/bin/mergeEnsembleAndIPI.pl. Files created:

<species>.pep.fa -- Ensembl file as retrieved via FTP.
<species>.fasta

It seems to me that this step can be skipped if we are using a reference DB (biosequence set) that already exists in SBEAMS, which is often the case. It also seems that this step, like the SpectraST library building step, is independent of all the others.

BLAST peptides against reference DB

Step03. Calls pipeline script with --BLASTP. Script then calls matchPeptides(), which makes a system call to $PIPELINE/bin/PeptidePositionLocator.pl. Despite the step name, this script seems to search the DB without calling BLAST, and prints "N peptides with/without a match". Files created:

peptide_mapping.tsv
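For intuition about what mapping peptides without BLAST can look like, here is a sketch of exact substring matching. It is not the logic of PeptidePositionLocator.pl, which is not documented on this page. The input layouts (proteins as name, tab, sequence; peptides one per line) and the sequences are made up, and only the first occurrence of a peptide in each protein is reported.

```shell
#!/bin/sh
# Sketch: exact substring search of peptides against protein sequences.
# ASSUMPTIONS: the proteins file is "name<TAB>sequence" and the peptides file
# has one peptide per line. Both layouts are hypothetical.
# Prints peptide, protein, start, end (1-based), then a two-line summary.
match_peptides() {
    # $1 = proteins file, $2 = peptides file
    awk -F '\t' '
        NR == FNR { seq[$1] = $2; next }        # first file: load proteins
        {
            found = 0
            for (name in seq) {
                pos = index(seq[name], $1)      # 0 when the peptide is absent
                if (pos > 0) {
                    printf "%s\t%s\t%d\t%d\n", $1, name, pos, pos + length($1) - 1
                    found = 1
                }
            }
            if (found) hit++; else miss++
        }
        END {
            printf "%d peptides with a match\n", hit
            printf "%d peptides without a match\n", miss
        }
    ' "$1" "$2"
}

# Demo with two made-up proteins and three made-up peptides.
demo=$(mktemp -d)
printf 'P1\tMKTAYIAKQR\nP2\tGSHMLEDPVK\n' > "$demo/proteins.tsv"
printf 'AYIAK\nLEDPVK\nWWWW\n' > "$demo/peptides.txt"
match_peptides "$demo/proteins.tsv" "$demo/peptides.txt"
```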

?Align peptides with reference proteins and calculate coordinates (start/end points of aligned region?)

Step04. Calls pipeline script with --BLASTParse. Script then calls BLAST_APD_ENSEMBL(). Files created:

APD_ensembl_hits.tsv
APD_ensembl_lost_queries.dat

Parse the mapping results and calculate chromosomal coordinates

Step05. Calls pipeline script with --BLASTParse --getCoordinates. Calls BLAST_APD_ENSEMBL() again, which reads any previous coordinate cache file (.enscache). Files created:

coordinate_mapping.txt
$PIPELINE/new_cache/$dbname.enscache

Make a list of unmappable peptides

Step06. Calls pipeline script with --lostAndFound. Files needed:

APD_<organism>_all.fasta
APD_ensembl_hits.tsv

Files created:

APD_ensembl_lost_queries.dat
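Conceptually, the unmappable list is the set of peptides in the FASTA file that never appear in the hits file. The sketch below shows that set difference under two assumptions that are not confirmed here: that each FASTA header is ">" followed by the peptide identifier, and that the first tab-separated column of the hits file holds the same identifier. The identifiers in the demo are made up.

```shell
#!/bin/sh
# Sketch: list peptide IDs present in a FASTA file but absent from a hits file.
# ASSUMPTIONS: FASTA headers are ">ID", and column 1 of the hits TSV is that ID.
list_unmapped() {
    # $1 = hits TSV, $2 = peptide FASTA
    awk -F '\t' '
        NR == FNR { mapped[$1] = 1; next }      # first file: remember mapped IDs
        /^>/ {
            id = substr($0, 2)                  # drop the leading ">"
            if (!(id in mapped)) print id
        }
    ' "$1" "$2"
}

# Demo with made-up identifiers: one peptide mapped, one not.
demo=$(mktemp -d)
printf 'PAp001\tENSP1\n' > "$demo/hits.tsv"
printf '>PAp001\nAYIAK\n>PAp002\nWWWW\n' > "$demo/peptides.fasta"
list_unmapped "$demo/hits.tsv" "$demo/peptides.fasta"
```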

Compile statistics on the peptides and proteins in the build

Step07. Results in /net/db/projects/PeptideAtlas/pipeline/output/HumanUrine_2008-09_Ens49/analysis/analysis.out. This file contains instructions for doing the following manually:

creating an experiment contribution plot. A plot for the web Atlas is generated automatically, but these instructions tell you how to create a better one.
creating an amino acid abundance plot. Untested by Terry.

Calls the following:

/regis/sbeams/bin/Mayu/Mayu.pl
$PIPELINE/bin/fasta_stat.pl
$PIPELINE/bin/peptide_stats_from_step04.pl
$SBEAMS/lib/scripts/PeptideAtlas/calcProteinStatistics.pl
$SBEAMS/lib/scripts/PeptideAtlas/statistics/calcPeptideListStatistics.pl
$SBEAMS/bin/protein_chance_hits.pl

Files created automatically:

In analysis directory:

prophet_model.sts
search_dir_stats.txt
analysis.out

In DATA_FILES directory:

protein2gene.txt
duplicate_groups.txt
duplicate_mapping.txt
duplicate_entries.txt
out.2tonsequences
PPvsDECOY.dat
experiment_contribution_summary.out
protein_chance_hits.out
simplereducedproteins.txt -- a rather minimal list of proteins in sample
msruncounts.txt

Build a SpectraST library from the build

Step08. Does not depend on any previous steps and can be executed at any point in the pipeline. Files created:

<build-name>_all.splib
<build-name>_all.sptxt

Load the reference DB (biosequence set) if new one is needed

Define Atlas build via SBEAMS

Load data into build

Enter the commands below on the command line on mimas. Full usage, including the desired options, is given in the recipe.

Load data

Use the script $SBEAMS/lib/scripts/PeptideAtlas/load_atlas_build.pl for this step. As of January 22, 2009, using the --purge and --load options together seems to reload the previous build. Instead, call the script first with --purge, then again with --load.
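The two-call sequence can be wrapped so that the purge and the load always run as separate invocations. This is only a sketch: the wrapper name purge_then_load and the RUN variable are made up, /path/to/sbeams is a placeholder used when $SBEAMS is unset, and the remaining options your build needs must still come from the recipe.

```shell
#!/bin/sh
# Sketch: purge and load as two separate invocations of load_atlas_build.pl.
# RUN=echo prints the commands instead of running them (dry run);
# set RUN to an empty string to actually execute them.
purge_then_load() {
    # "$@" holds the remaining options for your build, as listed in the recipe.
    loader="$SBEAMS/lib/scripts/PeptideAtlas/load_atlas_build.pl"
    $RUN "$loader" --purge "$@" || return 1     # stop if the purge fails
    $RUN "$loader" --load "$@"
}

# Dry run: print the two commands rather than executing them.
RUN=echo
SBEAMS=${SBEAMS:-/path/to/sbeams}               # placeholder when unset
purge_then_load
```

Stopping after a failed purge avoids loading on top of a partially purged build.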

Build search key

$SBEAMS/lib/scripts/PeptideAtlas/rebuildKeySearch.pl

Update empirical proteotypic scores

$SBEAMS/lib/scripts/PeptideAtlas/updateProteotypicScores.pl

Load spectra and spectrum IDs

Update statistics

Retrieved from "http://tools.proteomecenter.org/wiki/index.php?title=Building_Peptide_Atlas"
