PeptideAtlas Pipeline Retool 2012/13

From SPCTools

Revision as of 23:51, 29 October 2012
Tfarrah (Talk | contribs)

Current revision
Tfarrah (Talk | contribs)


Current revision

10/29/12:

Why is the build pipeline so complex and time-consuming? What are we trying to do?

00. Start with iPro and ProtPro results, refreshed to ref DB

=> We can skip refresh if ref DB same as search DB

01. For each expt, create PAidentlist template file with all PSMs above P=0.4; can be used for multiple builds on same data.

=> Why does this take so long?

NEW: Pre-run steps 01 and 01a below using a roughly estimated PSM FDR. Then calculate the PSM FDRs needed for the Silver and Gold builds, using David's formula.

NEW: Perform steps 01a through 06 in parallel for Gold (prot FDR=1%) and Silver (prot FDR=5%) builds.

01a. Make combined, sorted PAidentlist file, filtering using a roughly estimated PSM FDR.

NEW: Create another file, peptide_probs.tsv, that saves the highest probability for each stripped peptide, sorted by descending probability.
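
A minimal sketch of how peptide_probs.tsv could be produced. It assumes PSMs are available as (modified peptide, probability) pairs and that "stripped" means keeping only the uppercase residue letters, so that AC[160]DEK becomes ACDEK. The function names and the two-column layout are illustrative assumptions, not taken from the existing pipeline.

```python
import csv
import re
from typing import Dict, Iterable, List, TextIO, Tuple


def strip_peptide(modified: str) -> str:
    """Drop modification masses and terminal markers, keeping residue letters only (assumed convention)."""
    return re.sub(r"[^A-Z]", "", modified)


def best_peptide_probs(psms: Iterable[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Return (stripped peptide, highest probability) rows, sorted by descending probability."""
    best: Dict[str, float] = {}
    for modified, prob in psms:
        key = strip_peptide(modified)
        if prob > best.get(key, -1.0):
            best[key] = prob
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(best.items(), key=lambda row: (-row[1], row[0]))


def write_peptide_probs(rows: Iterable[Tuple[str, float]], handle: TextIO) -> None:
    """Write the rows as tab-separated text (hypothetical two-column layout)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for peptide, prob in rows:
        writer.writerow([peptide, prob])
```

Only one dictionary entry is held per distinct stripped peptide, so memory grows with the number of peptides rather than the number of PSMs.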

=> Can we speed the sorting? Currently we use unix sort.
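
One possible answer, sketched under assumptions: if each experiment's PSM list is already sorted, the combined file can be produced by a streaming k-way merge instead of a full re-sort. The sketch below assumes the sort key is descending probability and that PSMs are (peptide, probability) pairs; the key actually used for the PAidentlist file is not stated here, so treat both as placeholders.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

Psm = Tuple[str, float]  # (peptide, probability); hypothetical record layout


def merge_sorted_psms(per_expt: List[Iterable[Psm]]) -> Iterator[Psm]:
    """Lazily merge per-experiment PSM streams, each already sorted by descending probability."""
    # With reverse=True, heapq.merge expects every input to be sorted largest first.
    return heapq.merge(*per_expt, key=lambda psm: psm[1], reverse=True)
```

The merge reads each input once and keeps only one pending record per experiment in memory. GNU sort offers a comparable merge-only mode through its -m option.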

=> What do we need APD files for?

01b. Create a special filtered pepXML for each expt, then run ProtPro on all expts combined.

Remove step 02 of the pipeline (creating the biosequence set).

02a,b,sc. Compile protein identifications and estimate protein concentrations

NEW: Using peptide_probs.tsv, gather peptides mapping to identified proteins down to 1% peptide FDR.

NEW: Using the PAidentlist file, gather PSMs matching those peps down to 1% PSM FDR.
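
The two NEW steps above can be sketched as follows. This is an illustration under assumptions, not David's formula and not the pipeline's actual method: FDR is estimated from the probabilities themselves, as the mean of (1 - p) over the accepted entries, and both inputs are assumed to be lists of (stripped peptide, probability) pairs sorted by descending probability. Restricting peptides to those mapping to identified proteins is assumed to have been done before this point.

```python
from typing import List, Sequence, Set, Tuple

Row = Tuple[str, float]  # (stripped peptide, probability); hypothetical layout


def count_within_fdr(probs_desc: Sequence[float], max_fdr: float) -> int:
    """Return how many top entries can be kept while the estimated FDR stays <= max_fdr."""
    expected_false = 0.0
    keep = 0
    for i, prob in enumerate(probs_desc, start=1):
        expected_false += 1.0 - prob  # expected number of incorrect entries so far
        if expected_false / i <= max_fdr:
            keep = i
    return keep


def gather(peptide_probs: List[Row], psms: List[Row],
           max_fdr: float = 0.01) -> Tuple[Set[str], List[Row]]:
    """Pick peptides down to max_fdr, then PSMs of those peptides down to max_fdr."""
    n_pep = count_within_fdr([p for _, p in peptide_probs], max_fdr)
    accepted = {pep for pep, _ in peptide_probs[:n_pep]}
    matching = [psm for psm in psms if psm[0] in accepted]
    n_psm = count_within_fdr([p for _, p in matching], max_fdr)
    return accepted, matching[:n_psm]
```

For example, with peptide probabilities 1.0, 0.99 and 0.5, the first two peptides pass at 1% (estimated FDR 0.5%), while adding the third would raise the estimate to 17%.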

03. Map peptides to reference DB => peptide_mapping.tsv

05. Calculate chromosomal coordinates => coordinate_mapping.txt

06. Make a list of unmappable peptides

07. Statistics on peps, prots in build.

08. Build SpectraST library from build
