PeptideAtlas Pipeline Retool 2012/13
From SPCTools
Previous revision: 23:51, 29 October 2012 (Tfarrah). Current revision: Tfarrah.
Current revision
10/29/12:
Why is the build pipeline so complex and time-consuming? What are we trying to do?
00. Start with iPro and ProtPro results, refreshed to ref DB
=> We can skip refresh if ref DB same as search DB
01. For each expt, create PAidentlist template file with all PSMs above P=0.4; can be used for multiple builds on same data.
=> Why does this take so long?
NEW: Pre-run steps 01 and 01a below using a roughly estimated PSM FDR. Then calculate the PSM FDRs needed for the Silver and Gold builds using David's formula.
NEW: Perform steps 01a through 06 in parallel for Gold (prot FDR=1%) and Silver (prot FDR=5%) builds.
01a. Make combined, sorted PAidentlist file, filtering using a roughly estimated PSM FDR.
NEW: Create another file, peptide_probs.tsv, saving the highest probability for each stripped peptide, sorted by descending probability.
=> Can we speed the sorting? Currently we use unix sort.
=> What do we need APD files for?
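The peptide_probs.tsv idea in step 01a can be sketched roughly as below. This is only an illustration, not the pipeline's actual code: it assumes PSMs are already available as (modified peptide, probability) pairs, that modifications are written in square brackets (e.g. C[160]) with lowercase n/c terminus markers, and the helper names strip_peptide, best_probs, and write_peptide_probs are hypothetical. The PAidentlist column layout is deliberately not assumed here.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumption: modifications appear as bracketed masses, e.g. "C[160]" or "n[43]".
_MOD = re.compile(r"\[[^\]]*\]")


def strip_peptide(modified: str) -> str:
    """Drop bracketed modification masses and lowercase terminus markers."""
    bare = _MOD.sub("", modified)
    return "".join(ch for ch in bare if ch.isupper())


def best_probs(psms: Iterable[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Keep the highest probability seen for each stripped peptide.

    Returns (stripped_peptide, probability) rows sorted by descending
    probability, with ties broken alphabetically for a stable order.
    """
    best: Dict[str, float] = {}
    for peptide, prob in psms:
        key = strip_peptide(peptide)
        if prob > best.get(key, -1.0):
            best[key] = prob
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))


def write_peptide_probs(rows: Iterable[Tuple[str, float]], path: str) -> None:
    """Write the rows as tab-separated text, one peptide per line."""
    with open(path, "w") as fh:
        for peptide, prob in rows:
            fh.write(f"{peptide}\t{prob:.4f}\n")
```

Because only one value per stripped peptide is kept in memory, this needs a single pass over the PSMs and sorts the (much smaller) peptide table rather than the full PSM list.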
01b. Create special filtered pepXML for each expt, then run ProtPro on all expts combined.
Remove step02 of the pipeline (creating biosequence set).
02a,b,sc. Compile protein identifications and estimate protein concentrations
NEW: Using peptide_probs.tsv, gather peptides mapping to identified proteins down to 1% peptide FDR.
NEW: Using the PAidentlist file, gather PSMs matching those peps down to 1% PSM FDR.
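One way to read "down to 1% FDR" for a list sorted by descending probability is sketched below. The notes do not say how the FDR is estimated, so this assumes the common model-based estimate in which each accepted entry contributes (1 - probability) expected false positives; it is not David's formula, and count_within_fdr is a hypothetical name.

```python
from typing import Iterable


def count_within_fdr(probs_desc: Iterable[float], max_fdr: float = 0.01) -> int:
    """Return how many leading entries can be accepted within max_fdr.

    probs_desc must be sorted by descending probability. The estimated FDR
    of the top n entries is sum(1 - p) / n. Since the input is descending,
    that running estimate never decreases, so we can stop at the first
    entry that pushes it over the limit.
    """
    expected_false = 0.0
    accepted = 0
    for n, p in enumerate(probs_desc, start=1):
        expected_false += 1.0 - p
        if expected_false / n > max_fdr:
            break
        accepted = n
    return accepted
```

Applied to the probability column of peptide_probs.tsv this would give the peptide-level cutoff; applied to the PSMs that match the accepted peptides (sorted the same way) it would give the PSM-level cutoff.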
03. Map peptides to reference DB => peptide_mapping.tsv
05. Calculate chromosomal coordinates => coordinate_mapping.txt
06. Make a list of unmappable peptides
07. Compute statistics on peps and prots in build.
08. Build SpectraST library from build