Data Quality
Not your Typical Database Problem

Ahmed Elmagarmid

Executive Director
Qatar Computing Research Institute




                                     2011 © Copyright QCRI. Confidential document.
Where are we located?




Qatar Foundation




SCIENCE &    COMMUNITY
    EDUCATION           RESEARCH    DEVELOPMENT




2.8 percent of GDP to
be spent on research
annually by 2015




Qatar Foundation Research Division


   Qatar       Qatar Energy &       Qatar
Computing       Environment      Biomedical
 Research         Research        Research
 Institute        Institute       Institute

  QCRI             QEERI           QBRI




QCRI Overview




QCRI Vision



    To make Qatar a global center for
    computing research by becoming the
    world’s recognized leader in Arabic
    language technologies and in key areas
    vital to the global growth of Qatari
    business and entrepreneurial activity.



QCRI Model

Axes: Project-based → Grand Challenges; Basic Research → Applied Research

Academia
     Individual projects
     Students move on
     Theoretical & basic research

National Institutions (QCRI)
     Grand practical challenges
     National and global impact
     Localized skills & knowledge
     Large teams and long term
     Example peers: INRIA, MPI

Research Parks
     Commercialization
     Entrepreneurship
     Incubation
QCRI Ecosystem

[Figure: QCRI at the center, surrounded by QU, MIT, HKU, QSTP, Sidra, QBRI, QEERI, WikiMedia, Aljazeera, QP, ALTIS, Boeing, Energy Co., Google, MEEZA, Yahoo, QSA, IBM, Microsoft]
QCRI Research Centers

•    Arabic Language Technologies
•    Social Computing
•    Scientific Computing
•    Data Analytics
•    Cloud Computing
QCRI Scientific Advisory Council

     Prof. Rich DeMillo, Georgia Tech (Chair)
     Lord Rupert Redesdale, UK House of Lords
     Prof. Joichi Ito, MIT Media Lab Director
     Prof. Ruzena Bajcsy, University of California – Berkeley
     Lew Tucker, Vice President, Cisco
     Prof. Alfred V. Aho, Columbia University
     Prof. Dick Lipton, Georgia Tech
     Yousef Khalidi, Vice President, Microsoft
The 60 Doers!

[Figure: first names of QCRI staff arranged around their groups: Management and Support Team, Data Analytics, Arabic Language Technologies, Cloud Computing, Scientific Computing, Social Computing]
Strategic Partnerships




Strategic Partnerships: Agenda




5-YEAR QCRI MANPOWER PLAN

     Year      Headcount   Increase
     10-11        21
     11-12        34         +13
     12-13        82         +48
     13-14       102         +20
     14-15       110          +8
This Talk
     Data Quality




Data Quality

     Enhancing the usability of the acquired data and
     increasing the confidence of query results
     "Poor data quality is the norm rather than the exception, but most organizations are in a
     state of denial about this issue." - Gartner Group
Dirty Data is Expensive

• Real-life data is often dirty: data error rates in industry range
  from 1% to 30% (Redman, 1998)
• Erroneously priced data in retail databases costs US customers
  $2.5 billion each year
• In 2009, the Obama administration offered $19 billion in grants for
  health IT, i.e. to improve EMRs
• The Data Warehousing Institute estimates that data quality problems
  cost U.S. businesses more than $600 billion a year (2002)
Where to start? Data Quality everywhere!
•    Data Entry
•    Information Extraction
•    Integration from multiple sources
•    Standardization and transformation
•    Business rules compliance




“Academic” Data Cleaning
● Pick a well-understood data problem under some scoping
  assumptions and solve it independently
     – Duplicates
     – Functional dependency violations
     – Matching dependency violations
     – Missing value imputation


● Piecemeal approach to tackle the complexity and sometimes the
  intractability of the problem
     – Repairing violations of FD constraints in special cases (no deletion,
       left-hand-side changes only, allowing variables, etc.)
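
To make one of these scoped problems concrete, here is a minimal Python sketch that detects violations of a functional dependency. The relation, the attribute names, and the FD zip → city are hypothetical and chosen only for illustration; this is not code from the talk.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return groups of rows that agree on `lhs` but disagree on `rhs`.

    rows: list of dicts; lhs, rhs: tuples of attribute names.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key].append(row)

    violations = {}
    for key, members in groups.items():
        # The FD holds for this group only if all rhs values coincide.
        rhs_values = {tuple(r[a] for a in rhs) for r in members}
        if len(rhs_values) > 1:
            violations[key] = members
    return violations

# Hypothetical data: the FD zip -> city is expected to hold.
people = [
    {"id": "t1", "zip": "51519", "city": "Springfield"},
    {"id": "t2", "zip": "51519", "city": "Springfeild"},
    {"id": "t3", "zip": "30528", "city": "Shelbyville"},
]
bad = fd_violations(people, lhs=("zip",), rhs=("city",))
```

Detection alone says nothing about which of the conflicting cells is wrong; choosing the repair is where the scoping assumptions listed above come in.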


“Academic” Data Cleaning

• Despite their theoretical and algorithmic beauty, these techniques are rarely used

     –   Problems never exist in isolation
     –   Fixes to one problem often introduce “other” problems
     –   Data usually not accessible to mess with
     –   Integrity constraints!... What integrity constraints?!!




“Practitioner” Data Cleaning

• Will share some scary stories

     –   “post-it notes” as an expert messaging system
     –   “written permission” to change the value of a record
     –   Default values and best practices
     –   “Call John.. He will know what to do”




This Talk


● A few data quality challenges and (hopefully) research
  directions

● Summary of recent efforts at QCRI




10 Data Quality Issues




Issue 1: The data trio




[Figure: the data trio; surviving labels: DATA, Quality]
Extraction remains a key source of data errors
     Acquiring the semantics/schema of the underlying unstructured data
     sources (documents, emails, related Web info, click traces, profiles,
     interests, etc.)




Integration aggravates the problem

 Linked data as an attempt to live with errors: link as you go




Issue 2: Data level or application level
• Cleaning data tables by simply trusting the table schema is rarely useful
• Will share a story
   – Bellcore with 1800 inter-linked databases
   – Rule-based logic for sanity checking
   – Post-it messages to communicate between data quality officers,
     who work in shifts!
   – A data cleaning action is meaningless if not tied to business
     logic or to a process. Should never be against FDs




Issue 3: Protect your gain: DQ Dashboard
● How to protect against going backwards

● How to protect your gains during the cleansing process

● Metrics:
    Minimality principle: the metric most widely used in academic
    cleaning
    Value of information: to spot the most important problem to fix



Issue 3: Protect your gain - Ideas

• Root-cause analysis for data cleaning

• Chase problems to the source to reason about “progress”

• Leveraging “Provenance” to design progress meters




Issue 4: Data is not an orphan!

● Data Stewards are not imaginary characters! Important data
  has stewards and custodians

● Need to go through these guardians first
     Some health care organizations require a signed form per changed cell
     stating the reasons for the change


● Possible approaches:
     How to avoid stewards?
     How to integrate them in the process or minimize their involvement?


Issue 5: How clean is clean?

• Quality awareness eats up 10% of the budget [Telecom
  Experience]

• How to avoid over-cleaning

• Example: “Bill Forgiveness”, a real-life experience: roaming
  charges and cross-carrier calls have a very complicated
  business model

• Possible approaches
     – Measure cleaning progress
     – Clean only to satisfy some application needs

Issue 6: Online cleaning: a necessity, not a feature
● We live in a complex world → complex applications with 100s
  and 1000s of components and parameters

● Clean as you go, clean on demand, clean opportunistically:
  this can be the only hope

● New concepts:
      Iterative cleaning
      Cleaning dynamic and evolving data

● Off-line cleaning can still benefit historical data but is
  becoming less and less important

Issue 7: Application quality

• Data Quality → Information Quality → Application quality

• Recognize the levels of complexity in current BI apps

• Data usage should influence data cleaning
   – “Usage-based” data cleaning




Issue 8: SW engineering DQ

• Current focus on discrete values with simple integrity constraints
  (FD, uniqueness…)

• We are good at checking if data complies with rules

• Real business rules are often “assertions” expressed in
  “Turing-complete” languages

• Checking “did we write the assertions right?” becomes a lot harder

• But we also need to ask whether we wrote the right assertions!
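
The contrast between a simple integrity constraint and a business assertion can be sketched in Python. Both the billing records and the roaming rule below are hypothetical, invented only to show the difference in kind; they are not rules from the talk.

```python
def check_unique(rows, attr):
    """Simple integrity constraint: values of `attr` must be unique."""
    values = [row[attr] for row in rows]
    return len(values) == len(set(values))

def roaming_charge_ok(row):
    """Hypothetical business assertion written as arbitrary code.

    Nothing guarantees that this is the *right* rule; it can only be
    tested against examples and reviewed by domain experts.
    """
    if not row["roaming"]:
        return row["roaming_fee"] == 0
    # Assumed rule: a roaming fee is positive and at most half the bill.
    return 0 < row["roaming_fee"] <= 0.5 * row["total"]

bills = [
    {"id": 1, "roaming": False, "roaming_fee": 0, "total": 40},
    {"id": 2, "roaming": True, "roaming_fee": 30, "total": 50},
]
unique_ids = check_unique(bills, "id")
failing = [b["id"] for b in bills if not roaming_charge_ok(b)]
```

The uniqueness check can be verified against its declarative definition; the roaming rule has no such reference, so a failing record may point to dirty data or to a wrong assertion.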




Issue 9: DQ Theory?


• The ACID properties in transaction management were not only sensible
  requirements; they also came with algorithms and methods to enforce them
  during transaction processing

• Does it make sense to do the same for Quality? Plausible properties along
  with actions for maintaining acceptable quality during data manipulation

• Some of these already exist: Timeliness, Currency, Consistency, etc. but
  lack methods of enforcement




Issue 10: Scale .. Scale

• Terabytes and petabytes of data require new ways to
  enforce data quality

• Which ball to drop

• Leveraging application semantics and data usage

• Sampling to learn from the few and apply on the masses

• Active learning to replace human feedback (GDR as a
  solution)
Sample QCRI Projects




GDR – Guided Data Repair

     • Scalable ways to involve experts
     • Repurposing destructive automatic techniques to guide repairs
     • Value of Information measures to generate the most important
       questions
     • Judicious use of active learning from user feedback

     Pipeline: Input Database Instance → Detect Errors and Violations →
     Learn and Repair Database → Clean Database Instance
     A User Query runs against the Clean Database Instance and returns Results
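
The idea of asking the expert only the most important questions can be sketched as follows. This is a simplified stand-in, not the GDR algorithm itself: the `SuggestedRepair` record, the `p_correct` estimate, and the expected-benefit score are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class SuggestedRepair:
    cell: tuple            # (row id, attribute); hypothetical addressing scheme
    new_value: str
    p_correct: float       # assumed estimate that the suggestion is right
    violations_fixed: int  # violations resolved if the suggestion is applied

def expected_benefit(s):
    """Crude proxy for value of information (an assumption of this sketch)."""
    return s.p_correct * s.violations_fixed

def questions_for_expert(suggestions, budget):
    """Return the `budget` suggestions with the highest expected benefit."""
    ranked = sorted(suggestions, key=expected_benefit, reverse=True)
    return ranked[:budget]

# Hypothetical suggestions produced by an automatic repair technique.
suggestions = [
    SuggestedRepair(("t2", "city"), "Springfield", 0.9, 1),
    SuggestedRepair(("t5", "zip"), "51519", 0.6, 4),
    SuggestedRepair(("t7", "name"), "Green", 0.3, 2),
]
top = questions_for_expert(suggestions, budget=2)
```

In a feedback loop, the expert's answers to these questions would become training examples, so that later suggestions of the same kind could be decided without asking.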




GDR Architecture




Probabilistic Data Cleaning

     Pipeline: Input Database Instance → Uncertain Error Detection →
     Possible Repair Generation → Possible Clean Database Instances
     A User Query runs against the possible clean instances and returns
     Probabilistic Results
Possible Repairs

A possible repair is a clustering of the input tuples

     Person
     ID   Name    ZIP     Income
     P1   Green   51519   30k
     P2   Green   51518   32k
     P3   Peter   30528   40k
     P4   Peter   30528   40k
     P5   Gree    51519   55k
     P6   Chuck   51519   30k

     Possible Repairs (produced by uncertain clustering)
     X1: {P1}, {P2}, {P3,P4}, {P5}, {P6}
     X2: {P1,P2}, {P3,P4}, {P5}, {P6}
     X3: {P1,P2,P5}, {P3,P4}, {P6}
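
The three possible repairs on this slide can be written down directly as clusterings, and a query such as "do P1 and P2 refer to the same entity?" can then be answered with a probability. The slide gives no probabilities for X1, X2, and X3, so this Python sketch assumes they are equally likely; that uniform weighting is an assumption of the example only.

```python
from fractions import Fraction

# The possible repairs X1, X2, X3 from the slide, as clusterings of tuple IDs.
possible_repairs = {
    "X1": [{"P1"}, {"P2"}, {"P3", "P4"}, {"P5"}, {"P6"}],
    "X2": [{"P1", "P2"}, {"P3", "P4"}, {"P5"}, {"P6"}],
    "X3": [{"P1", "P2", "P5"}, {"P3", "P4"}, {"P6"}],
}

def same_cluster(clustering, a, b):
    """True if tuples a and b fall in the same cluster of this repair."""
    return any(a in cluster and b in cluster for cluster in clustering)

def prob_same_entity(repairs, a, b):
    """Fraction of possible repairs in which a and b are merged.

    Assumes every possible repair is equally likely (not stated on the slide).
    """
    hits = sum(same_cluster(c, a, b) for c in repairs.values())
    return Fraction(hits, len(repairs))

p_p1_p2 = prob_same_entity(possible_repairs, "P1", "P2")
p_p3_p4 = prob_same_entity(possible_repairs, "P3", "P4")
p_p1_p5 = prob_same_entity(possible_repairs, "P1", "P5")
```

Under the uniform assumption, P3 and P4 are merged in every repair, P1 and P2 in two of three, and P1 and P5 in one of three.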




Thank You
www.qcri.qa




              2011 © Copyright QCRI. Confidential document.

More Related Content

What's hot

Innovation, community, sustainability
Innovation, community, sustainabilityInnovation, community, sustainability
Innovation, community, sustainabilityPaul Walk
 
Semiconductor Hubs for Research & Innovation
Semiconductor Hubs for Research & InnovationSemiconductor Hubs for Research & Innovation
Semiconductor Hubs for Research & InnovationZinnov
 
George Brown College: Leadership in the innovation economy
George Brown College: Leadership in the innovation economyGeorge Brown College: Leadership in the innovation economy
George Brown College: Leadership in the innovation economyCisco Canada
 
Dr K Kamal's slide on TePP
Dr K Kamal's slide on TePPDr K Kamal's slide on TePP
Dr K Kamal's slide on TePPDr_K_Kamal
 
IET Chair All Changed Changed Utterly Wireless To Digital Home
IET Chair   All Changed Changed Utterly   Wireless To Digital HomeIET Chair   All Changed Changed Utterly   Wireless To Digital Home
IET Chair All Changed Changed Utterly Wireless To Digital Homepatkidney
 
“7 Core skills of Innovators” by Aditya Bhalla (Innovation Practice Head, QAI...
“7 Core skills of Innovators” by Aditya Bhalla (Innovation Practice Head, QAI...“7 Core skills of Innovators” by Aditya Bhalla (Innovation Practice Head, QAI...
“7 Core skills of Innovators” by Aditya Bhalla (Innovation Practice Head, QAI...Dubai Quality Group
 
Keeping up with the Pace, Innotribe at LARC
Keeping up with the Pace, Innotribe at LARCKeeping up with the Pace, Innotribe at LARC
Keeping up with the Pace, Innotribe at LARCHeather Vescent
 
Innovation for Country Transformation
Innovation for Country TransformationInnovation for Country Transformation
Innovation for Country TransformationCisco Canada
 
The Secret Sauce for Innovation (longform)
The Secret Sauce for Innovation (longform) The Secret Sauce for Innovation (longform)
The Secret Sauce for Innovation (longform) Laszlo Szalvay
 
Irene ngalmaden2013
Irene ngalmaden2013Irene ngalmaden2013
Irene ngalmaden2013ISSIP
 
Connectovate 2011 Conference Agenda 2011
Connectovate 2011 Conference Agenda  2011Connectovate 2011 Conference Agenda  2011
Connectovate 2011 Conference Agenda 2011kateshore
 
Connecting education tech society laura erickson
Connecting education tech society laura ericksonConnecting education tech society laura erickson
Connecting education tech society laura erickson3helix
 
Irene ngobc2013 final
Irene ngobc2013 finalIrene ngobc2013 final
Irene ngobc2013 finalISSIP
 
Carestream Health's Global Product Level Information Deployment with Aras
Carestream Health's Global Product Level Information Deployment with ArasCarestream Health's Global Product Level Information Deployment with Aras
Carestream Health's Global Product Level Information Deployment with ArasAras
 

What's hot (20)

Innovation, community, sustainability
Innovation, community, sustainabilityInnovation, community, sustainability
Innovation, community, sustainability
 
Tide 123
Tide 123Tide 123
Tide 123
 
Tide 123
Tide 123Tide 123
Tide 123
 
Semiconductor Hubs for Research & Innovation
Semiconductor Hubs for Research & InnovationSemiconductor Hubs for Research & Innovation
Semiconductor Hubs for Research & Innovation
 
George Brown College: Leadership in the innovation economy
George Brown College: Leadership in the innovation economyGeorge Brown College: Leadership in the innovation economy
George Brown College: Leadership in the innovation economy
 
Dr K Kamal's slide on TePP
Dr K Kamal's slide on TePPDr K Kamal's slide on TePP
Dr K Kamal's slide on TePP
 
IET Chair All Changed Changed Utterly Wireless To Digital Home
IET Chair   All Changed Changed Utterly   Wireless To Digital HomeIET Chair   All Changed Changed Utterly   Wireless To Digital Home
IET Chair All Changed Changed Utterly Wireless To Digital Home
 
“7 Core skills of Innovators” by Aditya Bhalla (Innovation Practice Head, QAI...
“7 Core skills of Innovators” by Aditya Bhalla (Innovation Practice Head, QAI...“7 Core skills of Innovators” by Aditya Bhalla (Innovation Practice Head, QAI...
“7 Core skills of Innovators” by Aditya Bhalla (Innovation Practice Head, QAI...
 
Keeping up with the Pace, Innotribe at LARC
Keeping up with the Pace, Innotribe at LARCKeeping up with the Pace, Innotribe at LARC
Keeping up with the Pace, Innotribe at LARC
 
Innovation for Country Transformation
Innovation for Country TransformationInnovation for Country Transformation
Innovation for Country Transformation
 
The Secret Sauce for Innovation (longform)
The Secret Sauce for Innovation (longform) The Secret Sauce for Innovation (longform)
The Secret Sauce for Innovation (longform)
 
MEIC AGM 2012
MEIC AGM 2012MEIC AGM 2012
MEIC AGM 2012
 
Irene ngalmaden2013
Irene ngalmaden2013Irene ngalmaden2013
Irene ngalmaden2013
 
Connectovate 2011 Conference Agenda 2011
Connectovate 2011 Conference Agenda  2011Connectovate 2011 Conference Agenda  2011
Connectovate 2011 Conference Agenda 2011
 
Professional Career Development for IC Engineers
Professional Career Development for IC EngineersProfessional Career Development for IC Engineers
Professional Career Development for IC Engineers
 
Living%20 Labs E Almirall
Living%20 Labs E AlmirallLiving%20 Labs E Almirall
Living%20 Labs E Almirall
 
Connecting education tech society laura erickson
Connecting education tech society laura ericksonConnecting education tech society laura erickson
Connecting education tech society laura erickson
 
Innovation Clusters - Pilot Update & Scale Up Plan
Innovation Clusters - Pilot Update & Scale Up Plan  Innovation Clusters - Pilot Update & Scale Up Plan
Innovation Clusters - Pilot Update & Scale Up Plan
 
Irene ngobc2013 final
Irene ngobc2013 finalIrene ngobc2013 final
Irene ngobc2013 final
 
Carestream Health's Global Product Level Information Deployment with Aras
Carestream Health's Global Product Level Information Deployment with ArasCarestream Health's Global Product Level Information Deployment with Aras
Carestream Health's Global Product Level Information Deployment with Aras
 

Viewers also liked

Viewers also liked (20)

From Programs to Systems – Building a Smarter World
From Programs to Systems – Building a Smarter WorldFrom Programs to Systems – Building a Smarter World
From Programs to Systems – Building a Smarter World
 
Sparse and Low Rank Representations in Music Signal Analysis
 Sparse and Low Rank Representations in Music Signal  Analysis Sparse and Low Rank Representations in Music Signal  Analysis
Sparse and Low Rank Representations in Music Signal Analysis
 
Influence Propagation in Large Graphs - Theorems and Algorithms
Influence Propagation in Large Graphs - Theorems and AlgorithmsInfluence Propagation in Large Graphs - Theorems and Algorithms
Influence Propagation in Large Graphs - Theorems and Algorithms
 
Web Usage Miningand Using Ontology for Capturing Web Usage Semantic
Web Usage Miningand Using Ontology for Capturing Web Usage SemanticWeb Usage Miningand Using Ontology for Capturing Web Usage Semantic
Web Usage Miningand Using Ontology for Capturing Web Usage Semantic
 
Tribute to Nicolas Galatsanos
Tribute to Nicolas GalatsanosTribute to Nicolas Galatsanos
Tribute to Nicolas Galatsanos
 
Sparsity Control for Robustness and Social Data Analysis
Sparsity Control for Robustness and Social Data AnalysisSparsity Control for Robustness and Social Data Analysis
Sparsity Control for Robustness and Social Data Analysis
 
A Classification Framework For Component Models
 A Classification Framework For Component Models A Classification Framework For Component Models
A Classification Framework For Component Models
 
Co-evolution, Games, and Social Behaviors
Co-evolution, Games, and Social BehaviorsCo-evolution, Games, and Social Behaviors






Data Quality: Not Your Typical Database Problem

  • 1. Data Quality: Not Your Typical Database Problem. Ahmed Elmagarmid, Executive Director, Qatar Computing Research Institute. 2011 © Copyright QCRI. Confidential document.
  • 2. Where are we located?
  • 5. Qatar Foundation
  • 6. Education; Science & Research; Community Development. 2.8 percent of GDP to be spent on research annually by 2015.
  • 7. Qatar Foundation Research Division: Qatar Computing Research Institute (QCRI); Qatar Energy & Environment Research Institute (QEERI); Qatar Biomedical Research Institute (QBRI).
  • 8. QCRI Overview
  • 9. QCRI Vision: To make Qatar a global center for computing research by becoming the world’s recognized leader in Arabic language technologies and in key areas vital to the global growth of Qatari business and entrepreneurial activity.
  • 10. QCRI Model (axes: Grand Challenges; Basic Research to Applied Research). National Institutions (QCRI): grand practical challenges; national and global impact; localized skills & knowledge; large teams and long term; example peers: INRIA, MPI. Academia: individual projects; students move on; theoretical & basic, project-based research. Research Parks: commercialization; entrepreneurship; incubation.
  • 11. QCRI Ecosystem: QU Sidra QBRI MIT HKU QEERI QCRI WikiMedia QSTP Aljazeera QP ALTIS Boeing Energy Google MEEZA Yahoo Co. QSA IBM Microsoft
  • 12. QCRI Research Centers: Arabic Language Technologies; Social Computing; Scientific Computing; Data Analytics; Cloud Computing.
  • 13. QCRI Scientific Advisory Council: Prof. Rich DeMillo (Georgia Tech, Chair); Lord Rupert Redesdale (UK House of Lords); Prof. Joichi Ito (MIT Media Lab Director); Prof. Ruzena Bajcsy (University of California, Berkeley); Lew Tucker (Vice President, Cisco); Prof. Alfred V. Aho (Columbia University); Prof. Dick Lipton (Georgia Tech); Yousef Khalidi (Vice President, Microsoft).
  • 14. The 60 Doers! Abdellatif Ahmed Richard Jill Management Ihab Nan Mourad and Support Team Richard P. Paolo Melissa Data Analytics Amr Kamal Halima Amal John Rashid Nada Agathe Scientific Michele Hend Chu ElKindi Computing Kulood Samreen Mohamed Simon P. Mustafa Tarek Preslav Othmane Kareem Stephan Ahmed A. Wei William Arabic Cloud Ahmed T. Language ThuyLinh Computing Sihem Maged Gautam Khaled Aysha Ahmed M. Technologies Sofiane Social Ahmed A. Gokop Computing Ahmed T. Lolwa Safdar Amira Aybuke Shameem Francisco Simon G. Walid Peng Mikalai Khulood Ruth
  • 15. Strategic Partnerships
  • 16. Agenda: Strategic Partnerships
  • 17. 5-Year QCRI Manpower Plan: 10-11: 21; 11-12: 34 (+13); 12-13: 82 (+48); 13-14: 102 (+20); 14-15: 110 (+8).
  • 18. This Talk: Data Quality
  • 19. Data Quality: enhancing the usability of the acquired data and increasing the confidence of query results. "Poor data quality is the norm rather than the exception, but most organizations are in a state of denial about this issue." (Gartner Group)
  • 20. Dirty Data is Expensive. Real-life data is often dirty: data error rates in industry range from 1% to 30% (Redman, 1998). The Obama administration offered $19 billion in grants for health IT, i.e. to improve EMRs, in 2009. The Data Warehousing Institute estimates that data quality problems cost U.S. businesses more than $600 billion a year (2002). Erroneously priced data in retail databases costs US customers $2.5 billion each year.
  • 21. Where to start? Data quality everywhere! Data entry; information extraction; integration from multiple sources; standardization and transformation; business rules compliance.
  • 22. “Academic” Data Cleaning. Pick a well-understood data problem under some scoping assumptions and solve it independently: duplicates; functional dependency (FD) violations; matching dependency violations; missing value imputation. A piecemeal approach to tackle the complexity, and sometimes the intractability, of the problem: for example, repairing violations of FD constraints in special cases (no deletion, left-hand-side changes only, allowing variables, etc.).
  • 23. “Academic” Data Cleaning. Despite their theoretical and algorithmic beauty, these techniques are rarely used: problems never exist in isolation; fixes to one problem often introduce “other” problems; data is usually not accessible to mess with; integrity constraints! What integrity constraints?!
  • 24. “Practitioner” Data Cleaning. Will share some scary stories: “post-it notes” as an expert messaging system; “written permission” to change the value of a record; default values and best practices; “Call John. He will know what to do.”
  • 25. This Talk. A few data quality challenges and (hopefully) research directions; a summary of recent efforts at QCRI.
  • 26. 10 Data Quality Issues
  • 27. Issue 1: The data trio. Data Quality
  • 28. Extraction remains a key source of data errors: acquiring the semantics/schema of the underlying unstructured data sources (documents, emails, related Web info, click traces, profiles, interests, etc.).
  • 29. Integration aggravates the problem. Linked data as an attempt to live with errors: link as you go.
  • 31. Issue 2: Data level or application level. Cleaning data tables by trusting the schema table is rarely useful. Will share a story: Bellcore, with 1800 inter-linked databases; rule-based logic for sanity checking; post-it messages to communicate between data quality officers, who work in shifts! A data cleaning action is meaningless if not tied to business logic or to a process. Should never be against FDs.
  • 32. Issue 3: Protect your gain: DQ Dashboard. How to protect against going backwards; how to protect your gains during the cleansing process. Metrics: the minimality principle, the most widely used in academic cleaning; value of information, to spot the most important problem to fix.
  • 33. Issue 3: Protect your gain: ideas. Root-cause analysis for data cleaning; chase problems to the source to reason about “progress”; leverage “provenance” to design progress meters.
  • 34. Issue 4: Data is not an orphan! Data stewards are not imaginary characters! Important data has stewards and custodians, and changes need to go through these guardians first: in some health care settings, a signed form is required per changed cell, stating the reasons for the change. Possible approaches: how to avoid stewards? How to integrate them in the process or minimize their involvement?
  • 35. Issue 5: How clean is clean? Quality awareness eats up 10% of the budget [Telecom Experience]. How to avoid over-cleaning? Example: “Bill Forgiveness”, a real-life experience: roaming charges and cross-carrier calls have a very complicated business model. Possible approaches: measure cleaning progress; clean only to satisfy some application needs.
  • 36. Issue 6: Online cleaning is a necessity, not a feature. We live in a complex world → complex applications with 100s and 1000s of components and parameters. Clean as you go, clean on demand, clean opportunistically: this can be the only hope. New concepts: iterative cleaning; cleaning dynamic and evolving data. Off-line cleaning can still benefit historical data but is becoming less and less important.
  • 37. Issue 7: Application quality. Data quality → information quality → application quality. Realize the levels of complexity in current BI apps. Data usage should influence data cleaning: “usage-based” data cleaning.
  • 38. Issue 8: SW engineering DQ. The current focus is on discrete values with simple integrity constraints (FD, uniqueness, …). We are good at checking whether data complies with rules. Real business rules are often “assertions” expressed in “Turing-complete” languages. Checking “did we write the assertions right?” becomes a lot harder. But we also need to think about whether we wrote the right assertions!
  • 39. Issue 9: DQ Theory? The ACID properties in transaction management were not only sensible requirements; they also had algorithms and methods to enforce them during transaction processing. Does it make sense to do the same for quality? Plausible properties, along with actions for maintaining acceptable quality during data manipulation. Some of these already exist (timeliness, currency, consistency, etc.) but lack methods of enforcement.
  • 40. Issue 10: Scale .. Scale. Terabytes and petabytes of data require new ways to enforce data quality. Which ball to drop? Leveraging application semantics and data usage. Sampling to learn from the few and apply to the masses. Active learning to replace human feedback (GDR as a solution).
  • 41. Sample QCRI Projects
  • 42. GDR: Guided Data Repair. Scalable ways to involve experts; repurposing destructive automatic techniques to guide repairs; value-of-information measures to generate the most important questions; judicious use of active learning from user feedback. [Diagram labels: Input Database Instance; Detect Errors and Violations; Learn and Repair; Clean Database Instance; User Query; Results.]
  • 43. GDR Architecture
  • 44. Probabilistic Data Cleaning. [Diagram labels: Input Database Instance; Error Detection; Uncertain Repair Generation; Possible Clean Database Instances; User Query; Probabilistic Results.]
  • 45. Possible Repairs. A possible repair is a clustering of the input tuples.
      Person table:
        ID  Name   ZIP    Income
        P1  Green  51519  30k
        P2  Green  51518  32k
        P3  Peter  30528  40k
        P4  Peter  30528  40k
        P5  Gree   51519  55k
        P6  Chuck  51519  30k
      Possible repairs (produced by uncertain clustering):
        X1: {P1} {P2} {P3,P4} {P5} {P6}
        X2: {P1,P2} {P3,P4} {P5} {P6}
        X3: {P1,P2,P5} {P3,P4} {P6}
  • 46. Thank You. www.qcri.qa
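
As an illustration of the functional dependency (FD) violations listed on the “Academic” Data Cleaning slide, the following minimal Python sketch checks the Person table from the Possible Repairs slide against a hypothetical rule Name → ZIP. The rule, the function name `fd_violations`, and the tuple layout are assumptions made for this example; the slides do not state which constraints hold on that table.

```python
from collections import defaultdict

# Rows copied from the Person table on the Possible Repairs slide:
# (ID, Name, ZIP, Income).
PERSON = [
    ("P1", "Green", "51519", "30k"),
    ("P2", "Green", "51518", "32k"),
    ("P3", "Peter", "30528", "40k"),
    ("P4", "Peter", "30528", "40k"),
    ("P5", "Gree", "51519", "55k"),
    ("P6", "Chuck", "51519", "30k"),
]


def fd_violations(rows, lhs, rhs):
    """Return {lhs_value: sorted row IDs} for groups that break lhs -> rhs.

    lhs and rhs are column positions. A group violates the dependency when
    rows sharing the same lhs value carry more than one distinct rhs value.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[lhs]].append(row)

    violations = {}
    for key, members in groups.items():
        if len({member[rhs] for member in members}) > 1:
            violations[key] = sorted(member[0] for member in members)
    return violations


# Hypothetical rule (an assumption, not from the slides): Name -> ZIP.
violations = fd_violations(PERSON, lhs=1, rhs=2)
print(violations)  # {'Green': ['P1', 'P2']}
```

Under this assumed rule only the two Green tuples conflict. The misspelled name Gree (P5) is not flagged, because exact-match grouping treats it as a different person; that is the kind of case that duplicate detection, or the clustering-based repairs on the Possible Repairs slide, would have to handle.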