PRIDE: Quality control in a proteomics
data repository
Attila Csordas
Proteomics Services Team
Biocuration Conference
April 2nd, 2012



1/23
Overview

  who are we?
  what are we dealing with?
  manual curation and submission
  quick detour: ProteomeXchange
  automated curation & submission pipeline
  conclusion

       April 2, 2012
2/23
PRIDE: http://www.ebi.ac.uk/pride

The PRoteomics IDEntifications database is a centralised, primary, archival,
public data repository for MS/MS proteomics data. It contains peptide IDs,
protein IDs, mass spectra, protein expression values and metadata.

Acknowledgements

  colleagues at the PRIDE team

  @pride_ebi
  pride-ebi@ebi.ac.uk
  pride-support@ebi.ac.uk

  http://code.google.com/p/pride-toolsuite/
  http://code.google.com/p/pride-converter-2/

Mass spectrometry

An analytical technique that measures the mass-to-charge (m/z) ratio of charged
particles to determine the masses of particles, the composition of samples and
molecules, and the chemical structures of molecules.
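
As a rough illustration of what an m/z value means for a peptide, the sketch below computes the m/z of an ion formed by adding protons to a neutral molecule. The function name `mz` and the 1000 Da example mass are assumptions made for illustration; they do not come from the slides.

```python
# Minimal sketch: m/z of a positively charged ion formed by protonation.
# PROTON_MASS is the (rounded) mass of a proton in daltons.
PROTON_MASS = 1.007276


def mz(neutral_mass: float, charge: int) -> float:
    """Return the m/z of a molecule of `neutral_mass` carrying `charge` protons."""
    if charge <= 0:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


# The same hypothetical 1000 Da peptide is observed at a different m/z
# for each charge state.
for z in (1, 2, 3):
    print(z, round(mz(1000.0, z), 4))
```

This is why a wrongly assigned charge state shifts the expected m/z of a precursor so strongly.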




Shotgun/bottom-up proteomics

[Figure: protocol diagram. Labels: proteins, peptides, MS analysis,
fragmentation, MS/MS analysis, sequence database]

What is a PRIDE submission?




Growth of core data types

[Chart: three series reaching 130 million, 23 million and 4.6 million]

Manual curation and submission process

Search engine output + spectra: Mascot (.dat), X!Tandem (.xml) + mgf
  -> PRIDE Converter
  -> PRIDE XML

PRIDE Inspector

  initial assessment of data quality
  visualise/check data
  summary charts
  support for submitters & reviewers/editors
  more flexible than web interface

Frequent Data Quality Issues

  1.  syntactic problems: <SearchEngine>PeptideShaker</SearchEngine>, <PeptideItem>
  2a. core data missing: no protein/peptide identifications
  2b. or metadata missing: no species
  3.  inconsistent/incorrect data: protein modifications
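
The first three kinds of issue listed above could be screened automatically along the following lines. This is only a sketch under stated assumptions: the `summary` dictionary and its keys (`proteins`, `peptides`, `species`) are hypothetical stand-ins for values extracted from a submission, the XML fragments are illustrative rather than real PRIDE XML, and category 3 (inconsistent data) is not covered because it needs domain checks such as delta m/z analysis.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def find_issues(xml_text: str, summary: dict) -> list:
    """Collect issue labels; `summary` uses hypothetical keys."""
    issues = []
    if not is_well_formed(xml_text):
        issues.append("1: syntactic problem (XML is not well-formed)")
    if summary.get("proteins", 0) == 0 or summary.get("peptides", 0) == 0:
        issues.append("2a: core data missing (no identifications)")
    if not summary.get("species"):
        issues.append("2b: metadata missing (no species)")
    return issues


# Illustrative fragments only: the first one never closes <PeptideItem>.
broken = "<PeptideItem><Sequence>PEPTIDE</Sequence>"
fixed = "<PeptideItem><Sequence>PEPTIDE</Sequence></PeptideItem>"

print(find_issues(broken, {"proteins": 0, "peptides": 0, "species": ""}))
print(find_issues(fixed, {"proteins": 5, "peptides": 12, "species": "Homo sapiens"}))
```

Well-formedness is only the weakest syntactic check; validating against the repository's actual schema would catch more problems.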




Delta m/z of detected peptide precursors

delta m/z = experimental precursor ion m/z - theoretical precursor ion m/z

Sources of delta m/z outliers: incorrect or missing protein modifications
and charge state misassignments.
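
The definition above can be sketched in a few lines. Everything named here is an assumption for illustration: the `PrecursorMatch` record, the 1000 Da example peptide and the 0.1 m/z tolerance are hypothetical, and the slides do not say which threshold PRIDE uses to call an outlier. The example shows how an unannotated oxidation (+15.9949 Da) on a doubly charged precursor appears as a delta m/z of roughly half that mass.

```python
from dataclasses import dataclass

PROTON_MASS = 1.007276  # Da, rounded
OXIDATION = 15.9949     # Da, monoisotopic mass shift of an oxidation


@dataclass
class PrecursorMatch:
    experimental_mz: float   # measured precursor m/z
    theoretical_mass: float  # neutral peptide mass incl. annotated modifications
    charge: int              # reported charge state


def delta_mz(m: PrecursorMatch) -> float:
    """Experimental minus theoretical precursor m/z."""
    theoretical_mz = (m.theoretical_mass + m.charge * PROTON_MASS) / m.charge
    return m.experimental_mz - theoretical_mz


def outlier_percentage(matches, tolerance: float) -> float:
    """Percentage of matches whose |delta m/z| exceeds `tolerance`."""
    if not matches:
        return 0.0
    outliers = sum(1 for m in matches if abs(delta_mz(m)) > tolerance)
    return 100.0 * outliers / len(matches)


# One consistent match, and one where the oxidation was not annotated.
consistent = PrecursorMatch((1000.0 + 2 * PROTON_MASS) / 2, 1000.0, 2)
missing_mod = PrecursorMatch((1000.0 + OXIDATION + 2 * PROTON_MASS) / 2, 1000.0, 2)
matches = [consistent, missing_mod]

print(round(delta_mz(missing_mod), 4))
print(outlier_percentage(matches, tolerance=0.1))  # tolerance is arbitrary
```

Because the shift equals the missing modification mass divided by the charge, the size of an outlier can hint at which modification to look for.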




Fixing modifications based on delta m/z outliers




but the manual approach does not scale!




10 times as many, and big, submissions per day?




ProteomeXchange: a single point of submission of data to the main repositories,
to encourage data exchange

[Diagram labels: Published, Raw, Reprocessed; Individual submissions,
Large-scale submissions; EBI PRIDE; Raw files archive; PeptideAtlas; UniProt;
Other DBs (GPMDB, …); Users]

PX submission pipeline

PX Tool -> Validation -> Submission -> Publication -> ProteomeCentral

Files: raw files, PRIDE XML, summary

Automated regular submission pipeline

curation-submission time is ~1/6th of manual time

actionable curation summary

  number of files: 3
  Project: Combined personal saliva proteome and microbioproteome
  XML generator software: PRIDE Converter Toolsuite 2.0-SNAPSHOT

Filename   Size    Species       #Proteins  #Peptides  #Spectra        #Unidentified spectra  PTMs  % delta m/z outlier
22143.xml  3.3 GB  Homo sapiens  4128       60544      184209 spectra  123665 spectra         3     0.0
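
A summary row like the one above lends itself to simple derived checks. The sketch below uses only the numbers in that row; the dictionary keys and the 50% review threshold are hypothetical. In this row the identified spectra (total minus unidentified) equal the #Peptides count, but treating that as a general rule is an assumption and not something the slides state.

```python
# Values copied from the summary row for 22143.xml; key names are hypothetical.
row = {
    "filename": "22143.xml",
    "proteins": 4128,
    "peptides": 60544,
    "spectra": 184209,
    "unidentified_spectra": 123665,
    "ptms": 3,
    "delta_mz_outlier_pct": 0.0,
}

identified_spectra = row["spectra"] - row["unidentified_spectra"]
unidentified_pct = 100.0 * row["unidentified_spectra"] / row["spectra"]

# Consistency check that happens to hold for this row.
counts_agree = identified_spectra == row["peptides"]

# Hypothetical rule: ask a curator to look if most spectra are unidentified.
REVIEW_THRESHOLD_PCT = 50.0
needs_review = unidentified_pct > REVIEW_THRESHOLD_PCT

print(identified_spectra, round(unidentified_pct, 1), counts_agree, needs_review)
```

Checks of this kind are what make a summary "actionable": the curator reads a few flags instead of opening a 3.3 GB file.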




Conclusion

  growing amount of data
  increasingly complex data
  scalability issues
  overcoming them by automation and new, smarter curation strategies

Thanks for your attention!

Q&A

  acsordas@ebi.ac.uk
  @attilacsordas
