PeptideAtlas Pipeline Retool 2012/13
From SPCTools
10/29/12:
Why is the build pipeline so complex and time consuming? What are we trying to do?
00. Start with iProphet (iPro) and ProteinProphet (ProtPro) results, refreshed to the ref DB
=> We can skip the refresh if the ref DB is the same as the search DB
01. For each expt, create PAidentlist template file with all PSMs above P=0.4; can be used for multiple builds on same data.
=> Why does this take so long?
NEW: pre-run steps 01 and 01a (below) using a roughly estimated PSM FDR. Then calculate the PSM FDRs needed for the Silver and Gold builds, using David's formula.
NEW: Perform steps 01a through 06 in parallel for Gold (prot FDR=1%) and Silver (prot FDR=5%) builds.
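SKETCH: one way the parallel Gold/Silver runs could be driven. This is only an illustration, not existing pipeline code. `run_build` is a hypothetical stand-in for "steps 01a through 06"; in practice it would launch the real step scripts as external processes, which is why a thread pool (threads just waiting on subprocesses) is assumed to be enough.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict

# Protein FDR targets taken from the note above.
BUILDS: Dict[str, float] = {"Gold": 0.01, "Silver": 0.05}


def run_build(name: str, protein_fdr: float) -> str:
    """Hypothetical placeholder for running steps 01a-06 for one build."""
    # A real version would call the step scripts here (assumption).
    return f"{name}: steps 01a-06 at protein FDR {protein_fdr:.0%}"


def run_all(builds: Dict[str, float] = BUILDS) -> Dict[str, str]:
    """Start every build at once and collect each build's result by name."""
    with ThreadPoolExecutor(max_workers=len(builds)) as pool:
        futures = {
            name: pool.submit(run_build, name, fdr)
            for name, fdr in builds.items()
        }
        return {name: fut.result() for name, fut in futures.items()}


if __name__ == "__main__":
    for line in run_all().values():
        print(line)
```

The two builds share the step 01 template files and differ only in the FDR target, so nothing here needs to be serialized between them.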
01a. Make combined, sorted PAidentlist file, filtering using a roughly estimated PSM FDR.
NEW: create another file, peptide_probs.tsv, saving the highest probability for each stripped peptide. Sort by descending probability.
=> Can we speed the sorting? Currently we use unix sort.
=> What do we need APD files for?
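SKETCH: building the peptide_probs.tsv content in one pass over the PSMs, keeping only the best probability per stripped peptide. Assumptions, not taken from the pipeline: each PSM is reduced to a (modified peptide, probability) pair, modifications are written as bracketed masses (e.g. C[160]), and terminal markers are lowercase letters (e.g. n[145]). The real PAidentlist column layout is not modeled here.

```python
import re
from typing import Dict, Iterable, List, Tuple

_BRACKETED_MASS = re.compile(r"\[[^\]]*\]")


def strip_peptide(modified: str) -> str:
    """Remove bracketed masses and lowercase terminal markers (assumed notation)."""
    without_masses = _BRACKETED_MASS.sub("", modified)
    return "".join(ch for ch in without_masses if ch.isupper())


def best_probs(psms: Iterable[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Return (stripped peptide, highest probability), sorted by descending probability."""
    best: Dict[str, float] = {}
    for modified, prob in psms:
        peptide = strip_peptide(modified)
        if prob > best.get(peptide, -1.0):
            best[peptide] = prob
    # Ties are broken alphabetically so the output order is reproducible.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    example = [("AC[160]DK", 0.91), ("ACDK", 0.99), ("n[145]PEPTIDER", 0.80)]
    for peptide, prob in best_probs(example):
        print(f"{peptide}\t{prob}")
```

Because only one entry per distinct peptide is held in memory, the final sort runs over the distinct peptides rather than over all PSMs.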
01b. Create special filtered pepXML for each expt., then run ProtPro on all expts combined
Remove step 02 of the pipeline (creating the biosequence set)
02a,b,sc. Compile protein identifications and estimate protein concentrations
NEW: Using peptide_probs.tsv, gather peptides mapping to identified proteins down to 1% peptide FDR.
NEW: Using the PAidentlist file, gather PSMs matching those peps down to 1% PSM FDR.
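SKETCH: how "gather down to 1% FDR" could work on a list already sorted by descending probability. The FDR estimate used here (expected false hits = sum of 1 - P over the accepted entries, divided by the number accepted) is a common probability-based estimate and is an assumption; it is not necessarily the formula the pipeline uses. The function names and the (stripped peptide, probability) PSM layout are hypothetical.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def count_within_fdr(probs_desc: Sequence[float], target_fdr: float) -> int:
    """Number of leading entries whose estimated FDR stays at or below the target.

    probs_desc must be sorted by descending probability, so the running
    FDR estimate never decreases and we can stop at the first failure.
    """
    expected_false = 0.0
    accepted = 0
    for k, prob in enumerate(probs_desc, start=1):
        expected_false += 1.0 - prob
        if expected_false / k > target_fdr:
            break
        accepted = k
    return accepted


def psms_for_peptides(
    psms: Iterable[Tuple[str, float]], peptides: Set[str]
) -> List[Tuple[str, float]]:
    """Keep PSMs whose stripped peptide was accepted, best probability first."""
    kept = [(pep, prob) for pep, prob in psms if pep in peptides]
    return sorted(kept, key=lambda item: -item[1])


if __name__ == "__main__":
    peptide_probs = [("AAK", 1.0), ("CCK", 1.0), ("DDK", 0.9)]
    n_peps = count_within_fdr([p for _, p in peptide_probs], 0.01)
    accepted = {pep for pep, _ in peptide_probs[:n_peps]}

    psms = [("AAK", 1.0), ("DDK", 0.9), ("CCK", 1.0), ("AAK", 0.6)]
    candidates = psms_for_peptides(psms, accepted)
    n_psms = count_within_fdr([p for _, p in candidates], 0.01)
    print(candidates[:n_psms])
```

The same cutoff routine is applied twice, first to peptides and then to the PSMs of the accepted peptides, which mirrors the two NEW lines above.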
03. Map peptides to reference DB => peptide_mapping.tsv
05. Calculate chromosomal coordinates => coordinate_mapping.txt
06. Make a list of unmappable peptides
07. Statistics on peps, prots in build.
08. Build SpectraST library from build