PeptideAtlas Pipeline Retool 2012/13

10/29/12:

Why is the build pipeline so complex and time-consuming? What are we trying to do?

00. Start with iProphet (iPro) and ProteinProphet (ProtPro) results, refreshed to the ref DB

=> We can skip refresh if ref DB same as search DB

01. For each expt, create PAidentlist template file with all PSMs above P=0.4; can be used for multiple builds on same data.

=> Why does this take so long?
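
The threshold filter in step 01 can be sketched as below. This is only an illustration: the three-column layout and the column names (spectrum, peptide, probability) are assumptions, not the real PAidentlist format, and treating P=0.4 as an inclusive bound is also an assumption.

```python
import csv
import io

MIN_PROB = 0.4  # threshold named in step 01


def filter_psms(rows, min_prob=MIN_PROB):
    """Yield only the rows whose probability is at least min_prob."""
    for row in rows:
        if float(row["probability"]) >= min_prob:
            yield row


# Hypothetical tab-separated input; the real file has more columns.
sample = (
    "spectrum\tpeptide\tprobability\n"
    "s1\tPEPTIDER\t0.95\n"
    "s2\tELVISK\t0.12\n"
    "s3\tSAMPLEK\t0.40\n"
)
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
kept = list(filter_psms(rows))
kept_spectra = [row["spectrum"] for row in kept]
```

Because the filter streams row by row, it needs no more memory than one PSM at a time, so the filter itself is unlikely to be the slow part.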

NEW: Pre-run steps 01 and 01a (below) using a roughly estimated PSM FDR. Then calculate the PSM FDRs needed for the Silver and Gold builds, using David's formula.

NEW: Perform steps 01a through 06 in parallel for Gold (prot FDR=1%) and Silver (prot FDR=5%) builds.

01a. Make combined, sorted PAidentlist file, filtering using a roughly estimated PSM FDR.

NEW: create another file, peptide_probs.tsv, saving the highest probability for each stripped peptide. Sort by descending probability.
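
A minimal sketch of building the peptide_probs.tsv content, assuming PSMs arrive as (modified peptide, probability) pairs and that modifications are written TPP-style, e.g. C[160] or n[43]. The function names and the tie-break on peptide sequence are hypothetical.

```python
import re


def strip_peptide(modified):
    """Drop bracketed masses and terminal markers, leaving residues only."""
    no_mass = re.sub(r"\[[^\]]*\]", "", modified)
    return re.sub(r"[^A-Z]", "", no_mass)


def best_peptide_probs(psms):
    """Return (stripped peptide, best probability), highest probability first."""
    best = {}
    for modified, prob in psms:
        pep = strip_peptide(modified)
        if prob > best.get(pep, -1.0):
            best[pep] = prob
    # Sort by descending probability; break ties by sequence for stable output.
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))


psms = [
    ("PEPC[160]TIDER", 0.91),
    ("n[43]PEPCTIDER", 0.99),
    ("ELVISK", 0.75),
    ("ELVISK", 0.60),
]
table = best_peptide_probs(psms)
```

Only one entry per stripped peptide is held in memory, so this table is far smaller than the PSM list it is derived from.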

=> Can we speed the sorting? Currently we use unix sort.
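
One possible answer, sketched under assumptions: if each per-experiment list is already sorted on the final sort key, the combined file can be produced by a k-way merge instead of a full re-sort. The sort key used here (descending probability) and the record layout are assumptions; unix sort offers the same idea through its -m option.

```python
import heapq


def merge_sorted_runs(runs):
    """Merge runs that are each sorted by descending probability."""
    return list(heapq.merge(*runs, key=lambda rec: rec[1], reverse=True))


# Hypothetical per-experiment records: (spectrum, probability).
expt_a = [("s1", 0.99), ("s2", 0.80), ("s3", 0.45)]
expt_b = [("t1", 0.95), ("t2", 0.50)]
merged = merge_sorted_runs([expt_a, expt_b])
```

heapq.merge is lazy, so with file iterators in place of lists it would hold only one record per input in memory. Whether it beats unix sort in practice would need measuring.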

=> What do we need APD files for?

01b. Create special filtered pepXML for each expt, then run ProtPro on all expts combined

Remove step 02 of the pipeline (creating the biosequence set)

02a,b,sc. Compile protein identifications and estimate protein concentrations

NEW: Using peptide_probs.tsv, gather peptides mapping to identified proteins down to 1% peptide FDR.

NEW: Using the PAidentlist file, gather PSMs matching those peps down to 1% PSM FDR.
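
The "down to 1% FDR" cutoffs above could be found with a running estimate, sketched here. It assumes the FDR at a given rank is estimated as the mean of (1 - probability) over all entries accepted so far, which is a common probability-based estimate and not necessarily the formula this pipeline uses. The function name is hypothetical.

```python
def count_within_fdr(sorted_probs, target_fdr):
    """Return how many top entries can be kept without exceeding target_fdr.

    sorted_probs must be in descending order, so the running estimate
    never decreases and the scan can stop at the first violation.
    """
    expected_false = 0.0
    kept = 0
    for rank, p in enumerate(sorted_probs, start=1):
        expected_false += 1.0 - p
        if expected_false / rank > target_fdr:
            break
        kept = rank
    return kept


probs = [1.0, 1.0, 0.99, 0.98, 0.90, 0.50]
n_at_1pct = count_within_fdr(probs, 0.01)
n_at_5pct = count_within_fdr(probs, 0.05)
```

The same function applies to either list: peptide_probs.tsv for the peptide-level cutoff, or the sorted PAidentlist for the PSM-level cutoff.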

03. Map peptides to reference DB => peptide_mapping.tsv

05. Calculate chromosomal coordinates => coordinate_mapping.txt

06. Make a list of unmappable peptides

07. Compute statistics on peps and prots in the build.

08. Build SpectraST library from build