Digital Enterprise Research Institute                                         www.deri.ie




                                                  Data Curation at the
                                                   New York Times
                      Edward Curry, Andre Freitas, Seán O'Riain




 ed.curry@deri.org
 http://www.deri.org/
 http://www.EdwardCurry.org/
 Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
Speaker Profile
            Research Scientist at the Digital Enterprise Research
             Institute (DERI)
                   Leading international web science research organization
            Researching how the web of data is changing the way businesses
             work and interact with information
                   Projects include studies of enterprise linked data,
                    community-based data curation, semantic data analytics,
                    and semantic search
                   Investigating utilization within the pharmaceutical, oil & gas,
                    financial, advertising, media, manufacturing, health care, ICT,
                    and automotive industries
            Invited speaker at the 2010 MIT Sloan CIO Symposium
             to an audience of more than 600 CIOs
Overview
            Curation Background
                   The Business Need for Curated Data
                   What is Data Curation?
                   Data Quality and Curation
                   How to Curate Data


            New York Times Case Study

            Best Practices from Case Study Learning
The Business Need
               Knowledge workers need:
                   Access to the right information
                   Confidence in that information

               Working with incomplete, inaccurate, or wrong
                information can have disastrous consequences
The Problems with Data
          Flawed Data
             Affects 25% of critical data in the world's top companies
                 (Gartner)

          Data Quality
             Recent banking crisis (Economist, Dec '09)
             Inaccurate figures made it difficult to manage operations
                 (investment exposure and risk)
                    – “assets are defined differently in different programs”
                    – “numbers did not always add up”
                    – “departments do not trust each other's figures”
                    – “figures … not worth the pixels they were made of”
What is Data Curation?
        Digital Curation
            Selection, preservation, maintenance, collection, and
                archiving of digital assets

        Data Curation
            Active management of data over its life-cycle

        Data Curators
            Ensure data is trustworthy, discoverable, accessible,
                reusable, and fit for use
                   – Museum cataloguers of the Internet age
What is Data Curation?
            Data Governance
                Convergence of data quality, data management,
                    business process management, and risk
                    management

            Data Curation is a complementary activity
                Part of the overall data governance strategy for
                    an organization

            Data Curator = Data Steward?
                   Overlapping terms between communities
Data Quality and Curation
            What is Data Quality?
                Desirable characteristics for an information resource
                Described as a series of quality dimensions
                       – Discoverability, Accessibility, Timeliness, Completeness,
                         Interpretation, Accuracy, Consistency, Provenance &
                         Reputation

            Data curation can be used to improve these
             quality dimensions
Data Quality and Curation
            Discoverability & Accessibility
                Curate to streamline search by storing and classifying
                    data in an appropriate and consistent manner

            Accuracy
                Curate to ensure data correctly represents the “real-world”
                    values it models

            Consistency
                Curate to ensure data is created and maintained using
                    standardized definitions, calculations, terms, and
                    identifiers
Data Quality and Curation
            Provenance & Reputation
                Curate to track the source of data and determine its reputation
                Curate to include the objectivity of the source/producer
                       – Is the information unbiased, unprejudiced, and impartial?
                       – Or does it come from a reputable but partisan source?

                       Other dimensions are discussed in the chapter
How to Curate Data
            Data Curation is a large field with sophisticated
             techniques and processes

            This section provides a high-level overview of:
                Should you curate data?
                Types of curation
                Setting up a curation process

               Additional detail and references are available in the book
               chapter
Should You Curate Data?
            Curation can have multiple motivations
                Improving accessibility, quality, consistency, …

            Will the data benefit from curation?
                Identify the business case
                Determine whether the potential return supports the investment

            Not all enterprise data should be curated
                Suits knowledge-centric data rather than transactional
                    operations data
Types of Data Curation
            Multiple approaches to curate data, no single
             correct way
                Who?
                       – Individual Curators
                       – Curation Departments
                       – Community-based Curation
                How?
                       – Manual Curation
                       – (Semi-)Automated
                       – Sheer Curation
Types of Data Curation – Who?
            Individual Data Curators
                Suitable for a small quantity of infrequently changing
                    data
                       – (<1,000 records)
                       – Minimal curation effort (minutes per record)
Types of Data Curation – Who?
            Curation Departments
                Curation experts working with subject matter experts
                    to curate data within a formal process
                       – Can deal with large curation effort (thousands of records)

            Limitations
                Scalability: Can struggle with large quantities of
                    dynamic data (>1 million records)
                Availability: Post-hoc nature creates a delay in curated
                    data availability
Types of Data Curation – Who?
            Community-Based Data Curation
                Decentralized approach to data curation
                Crowd-sourcing the curation process
                       – Leverages community of users to curate data
                Wisdom of the community (crowd)
                Can scale to millions of records
Types of Data Curation – How?
            Manual Curation
                Curators directly manipulate data
                Can tie users up with low-value-add activities

            (Semi-)Automated Curation
                Algorithms can (semi-)automate curation activities
                    such as data cleansing, record de-duplication, and
                    classification
                Can be supervised or approved by human curators
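A minimal sketch of what semi-automated curation with human approval could look like, assuming a toy record type and a simple string-similarity rule. The `Record` class, the 0.9 threshold, and the `approve` callback are illustrative assumptions, not part of the talk: the algorithm only proposes duplicate pairs, and nothing is removed until a curator approves.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations


@dataclass(frozen=True)
class Record:
    record_id: int
    name: str


def normalize(name: str) -> str:
    """Basic cleansing: trim, lowercase, collapse repeated whitespace."""
    return " ".join(name.lower().split())


def suggest_duplicates(records, threshold=0.9):
    """Automated step: propose candidate duplicate pairs for human review."""
    candidates = []
    for a, b in combinations(records, 2):
        score = SequenceMatcher(None, normalize(a.name), normalize(b.name)).ratio()
        if score >= threshold:
            candidates.append((a.record_id, b.record_id, round(score, 2)))
    return candidates


def apply_approved(records, candidates, approve):
    """Human step: drop the second record of each pair the curator approves."""
    to_drop = {b for a, b, _ in candidates if approve(a, b)}
    return [r for r in records if r.record_id not in to_drop]


# Hypothetical sample data.
records = [
    Record(1, "New York Times"),
    Record(2, "  new york  times "),
    Record(3, "Thomson Reuters"),
]
candidates = suggest_duplicates(records)
# Stand-in for a curator who approves every suggestion.
curated = apply_approved(records, candidates, approve=lambda a, b: True)
```

In practice the `approve` callback would be a review screen or queue; the point of the sketch is only the split between algorithmic suggestion and human sign-off.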
Types of Data Curation – How?
            Sheer Curation, or Curation at Source
                Curation activities integrated into the normal workflow of
                    those creating and managing data
                Can be as simple as vetting or “rating” the results of a
                    curation algorithm
                Results can be available immediately

            Blended Approaches: Best of Both
                Sheer curation + post-hoc curation department
                Allows immediate access to curated data
                Ensures quality control with expert curation
Setting up a Curation Process
            5 steps to set up a curation process:
               1 - Identify what data you need to curate
               2 - Identify who will curate the data
               3 - Define the curation workflow
               4 - Identify appropriate data-in & data-out formats
               5 - Identify the artifacts, tools, and processes needed to
                   support the curation process
The New York Times
                             100 Years of Expert Data Curation
The New York Times
            Largest metropolitan and third largest
             newspaper in the United States

            nytimes.com
                    Most popular newspaper
                     website in the US

            100-year-old curated
             repository defining its
             participation in the
             emerging Web of Data
The New York Times
       Data curation dates back to 1913
           Publisher/owner Adolph S. Ochs decided to provide a
               set of additions to the newspaper
       New York Times Index
           Organized catalog of article titles and summaries
                  – Containing issue, date and column of article
                  – Categorized by subject and names
                  – Introduced on a quarterly, then annual, basis
       Transitory content of the newspaper became an
        important source of searchable historical data
           Often used to settle historical debates
The New York Times
            Index Department was created in 1913
                Curation and cataloguing of NYT resources
                       – Since 1851 NYT had a low-quality index for internal use

            Developed a comprehensive catalog using a
             controlled vocabulary
                Covering subjects, personal names, organizations,
                    geographic locations, and titles of creative works
                    (books, movies, etc.), linked to articles and their
                    summaries

            Current Index Department has ~15 people
The New York Times
            Challenges with consistently and accurately
             classifying news articles over time
                Keywords expressing subjects may show some
                    variance due to cultural or legal constraints
                Identities of some entities, such as organizations and
                    places, changed over time

            Controlled vocabulary grew to hundreds of
             thousands of categories
                Adding complexity to the classification process
The New York Times
            Increased importance of the Web drove the need to
             improve categorization of online content

            Curation carried out by Index Department
                Library-time (days to weeks)
                Print edition can handle next-day index

            Not suitable for real-time online publishing
                nytimes.com needed a same-day index
The New York Times
            Introduced a two-stage curation process
                Editorial staff perform best-effort semi-automated
                    sheer curation at the point of online publication
                       – Several hundred journalists
                Index Department follows up with long-term accurate
                    classification and archiving

            Benefits:
                Non-expert journalist curators provide instant
                    accessibility to online users
                Index Department provides long-term high-quality
                    curation in a “trust but verify” approach
NYT Curation Workflow
               1 - Curation starts with the article leaving the newsroom
               2 - A member of the editorial staff submits the article to a web-based,
                   rule-based information extraction system (SAS Teragram)
               3 - Teragram uses linguistic extraction rules based on a subset of
                   the Index Department's controlled vocabulary
               4 - Teragram suggests tags based on the Index vocabulary that
                   can potentially describe the content of the article
               5 - The editorial staff member selects terms that best describe the
                   contents and inserts new tags if necessary
               6 - Taxonomy managers review the tags, with feedback to
                   editorial staff on the classification process
               7 - The article is published online at nytimes.com
               8 - At a later stage, the article receives second-level curation by the
                   Index Department: additional Index tags and a summary
               9 - The article is submitted to the NYT Index
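The two-stage workflow can be sketched roughly as follows. This is an illustration under stated assumptions, not the NYT or Teragram implementation: the vocabulary entries, the keyword matching (a stand-in for linguistic extraction rules), and all function names are hypothetical, and the taxonomy-manager review step is left out for brevity.

```python
from dataclasses import dataclass, field

# Hypothetical subset of a controlled vocabulary: tag -> trigger phrases.
VOCABULARY = {
    "Banking and Financial Institutions": ["bank", "lender"],
    "Motion Pictures": ["film", "movie"],
}


@dataclass
class Article:
    text: str
    tags: list = field(default_factory=list)
    summary: str = ""
    published: bool = False
    indexed: bool = False


def suggest_tags(article, vocabulary=VOCABULARY):
    """Stage 1, automated: suggest vocabulary tags (keyword stand-in for rules)."""
    text = article.text.lower()
    return [tag for tag, phrases in vocabulary.items()
            if any(phrase in text for phrase in phrases)]


def editorial_curation(article, selected, new_tags=()):
    """Stage 1, human: editor keeps chosen suggestions, adds tags, publishes."""
    article.tags = list(selected) + list(new_tags)
    article.published = True
    return article


def index_curation(article, extra_tags, summary):
    """Stage 2: expert curators add tags and a summary, then index the article."""
    article.tags += [t for t in extra_tags if t not in article.tags]
    article.summary = summary
    article.indexed = True
    return article


# Hypothetical walk-through of one article.
article = Article("The bank reported losses after the film deal collapsed.")
suggested = suggest_tags(article)
editorial_curation(article, selected=suggested[:1], new_tags=["Mergers"])
index_curation(article, extra_tags=["Motion Pictures"],
               summary="Bank losses tied to a film deal.")
```

The article is already published after `editorial_curation`, which is what gives online readers same-day access; `index_curation` then improves the record later without blocking publication.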
The New York Times
           Early adopter of Linked Open Data (June '09)
The New York Times
    Linked Open Data @ data.nytimes.com
        Subset of 10,000 tags from the Index vocabulary
        Dataset of people, organizations & locations
               – Complemented by search services to consume data
                 about articles, movies, best sellers, Congress votes,
                 real estate, …
    Benefits
        Improves traffic through third-party data usage
        Lowers development cost of new applications for
            different verticals inside the website
               – E.g. movies, travel, sports, books
Best Practices from Case Study Learning
            Social Best Practices
                Participation
                Engagement
                Incentives
                Community Governance Models

            Technical Best Practices
                Data Representation
                Human and Automated Curation
                Track Provenance
Social Best Practices
            Participation
                Stakeholder involvement for data producers and
                    consumers must occur early in the project
                       – Provides insight into basic questions of what they want
                         to do, for whom, and what it will provide
                White papers are an effective means to present these
                    ideas and solicit opinion from the community
                       – Can be used to establish an informal 'social contract' for
                         the community
Social Best Practices
            Engagement
                Outreach activities essential for promotion and
                    feedback
                Typical contributor-to-consumer ratios of less than
                    5%
                Social communication and networking forums are
                    useful
                       – Majority of community may not communicate using
                         these media
                       – Communication by email still remains important
Social Best Practices
            Incentives
                Sheer curation needs a line of sight from the data curation
                    activity to tangible exploitation benefits
                Lack of awareness of the value proposition will slow
                    emergence of collaborative contributions
                Recognizing contributing curators through a formal
                    feedback mechanism
                       – Reinforces contribution culture
                       – Directly increases output quality
Social Best Practices
            Community Governance Models
                Effective governance structure is vital to ensure
                    success of the community
                Internal communities and consortia perform well
                    when they leverage traditional corporate and
                    democratic governance models
                Open communities need to engage the community
                    within the governance process
                       – Follow less orthodox approaches using meritocratic
                         and autocratic principles
Technical Best Practices
            Data Representation
                Must be robust and standardized to encourage
                    community usage and tools development
                Support for legacy data formats and the ability to
                    translate data forward to support new technology and
                    standards
            Human & Automated Curation
                Balancing the two will improve data quality
                Automated curation should always defer to, and never
                    override, human curation edits
                       – Automate validation of data deposition and entry
                       – Target the community at focused curation tasks
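The rule that automated curation defers to human edits can be expressed as a small merge function. This is a sketch under assumptions: records are plain dictionaries, and `merge_curation` and the sample values are hypothetical names chosen for illustration.

```python
def merge_curation(record, human_edits, automated_suggestions):
    """Apply automated suggestions only to fields no human curator has edited."""
    merged = dict(record)
    for field_name, value in automated_suggestions.items():
        if field_name not in human_edits:
            merged[field_name] = value
    merged.update(human_edits)  # human edits always take precedence
    return merged


# Hypothetical sample data.
record = {"name": "NY Times", "city": "new york", "founded": "1851"}
human_edits = {"name": "The New York Times"}
automated = {"name": "Ny Times", "city": "New York"}
result = merge_curation(record, human_edits, automated)
```

Here the automated suggestion for `name` is discarded because a curator already edited that field, while the suggestion for `city` is accepted because no human has touched it.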
Technical Best Practices
            Track Provenance
                All curation activities should be recorded and
                    maintained as part of the data provenance effort
                       – Especially where human curators are involved
                Users can have different perspectives of provenance
                       – A scientist may need to evaluate the fine-grained
                         experiment description behind the data
                       – For a business analyst the 'brand' of the data provider can
                         be sufficient for determining quality
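One way to record every curation activity is an append-only log kept alongside the data. The sketch below assumes an in-memory store; `CuratedStore`, `ProvenanceEntry`, and the agent names are hypothetical. It offers two views of the same log, reflecting the point that different users need different levels of provenance detail.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceEntry:
    record_id: str
    field_name: str
    old_value: str
    new_value: str
    agent: str        # who or what made the change
    agent_type: str   # "human" or "automated"
    timestamp: str


class CuratedStore:
    def __init__(self):
        self.records = {}
        self.log = []  # append-only provenance log

    def curate(self, record_id, field_name, new_value, agent, agent_type):
        """Apply a change and record who made it, when, and what it replaced."""
        record = self.records.setdefault(record_id, {})
        entry = ProvenanceEntry(
            record_id, field_name, record.get(field_name, ""), new_value,
            agent, agent_type, datetime.now(timezone.utc).isoformat(),
        )
        record[field_name] = new_value
        self.log.append(entry)
        return entry

    def history(self, record_id, agent_type=None):
        """Fine-grained view: every change, optionally filtered by agent type."""
        return [e for e in self.log
                if e.record_id == record_id
                and (agent_type is None or e.agent_type == agent_type)]

    def sources(self, record_id):
        """Coarse view: just the distinct agents that touched a record."""
        return sorted({e.agent for e in self.log if e.record_id == record_id})


# Hypothetical usage.
store = CuratedStore()
store.curate("a1", "tag", "Banks", agent="teragram-rules", agent_type="automated")
store.curate("a1", "tag", "Banking", agent="j.smith", agent_type="human")
```

A scientist-style consumer would read `history("a1")`, while an analyst who only cares about who stands behind the data could rely on `sources("a1")`.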
Conclusions
        Data curation can ensure the quality of data and
         its fitness for use
        Pre-competitive data can be shared without
         conferring a commercial advantage
        Pre-competitive data communities
                Common curation tasks carried out once in the public
                    domain
                Reduces cost, increases quantity and quality
Acknowledgements
        Collaborators: Andre Freitas & Seán O'Riain

        Insight from Thought Leaders
               Evan Sandhaus (Semantic Technologist), Rob Larson (Vice President, Product
                Development and Management), and Gregg Fenton (Director, Emerging Platforms)
                from the New York Times
               Krista Thomas (Vice President, Marketing & Communications) and Tom Tague
                (OpenCalais Initiative Lead) from Thomson Reuters
               Antony Williams (VP of Strategic Development) from ChemSpider
               Helen Berman (Director) and John Westbrook (Product Development) from the Protein
                Data Bank
               Nick Lynch (Architect with AstraZeneca) from the Pistoia Alliance

        The work presented has been funded by Science
         Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
Further Information
The Role of Community-Driven
Data Curation for Enterprises
Edward Curry, Andre Freitas, & Seán O'Riain

  In David Wood (ed.),
  Linking Enterprise Data, Springer, 2010.
  Available free at:
  http://3roundstones.com/led_book/led-curry-et-al.html

From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
 
Crowdsourcing Approaches for Smart City Open Data Management
Crowdsourcing Approaches for Smart City Open Data ManagementCrowdsourcing Approaches for Smart City Open Data Management
Crowdsourcing Approaches for Smart City Open Data Management
 
Big Data and Big Data Management (BDM) with current Technologies –Review
Big Data and Big Data Management (BDM) with current Technologies –ReviewBig Data and Big Data Management (BDM) with current Technologies –Review
Big Data and Big Data Management (BDM) with current Technologies –Review
 

Data Curation at the New York Times

  • 1. Data Curation at the New York Times
      Digital Enterprise Research Institute, www.deri.ie
      Edward Curry, Andre Freitas, Seán O'Riain
      ed.curry@deri.org
      http://www.deri.org/
      http://www.EdwardCurry.org/
      Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
  • 2. Speaker Profile
      - Research Scientist at the Digital Enterprise Research Institute (DERI)
          - Leading international web science research organization
      - Researching how the web of data is changing the way businesses work and interact with information
          - Projects include studies of enterprise linked data, community-based data curation, semantic data analytics, and semantic search
          - Investigate utilization within the pharmaceutical, oil & gas, financial, advertising, media, manufacturing, health care, ICT, and automotive industries
      - Invited speaker at the 2010 MIT Sloan CIO Symposium to an audience of more than 600 CIOs
  • 3. Overview
      - Curation Background
          - The Business Need for Curated Data
          - What is Data Curation?
          - Data Quality and Curation
          - How to Curate Data
      - New York Times Case Study
      - Best Practices from Case Study Learning
  • 4. The Business Need
      - Knowledge workers need:
          - Access to the right information
          - Confidence in that information
      - Working with incomplete, inaccurate, or wrong information can have disastrous consequences
  • 5. The Problems with Data
      - Flawed Data
          - Affects 25% of critical data in the world's top companies (Gartner)
      - Data Quality
          - Recent banking crisis (Economist, Dec '09)
          - Inaccurate figures made it difficult to manage operations (investments exposure and risk)
              - “asset are defined differently in different programs”
              - “numbers did not always add up”
              - “departments do not trust each other's figures”
              - “figures … not worth the pixels they were made of”
  • 6. What is Data Curation?
      - Digital Curation
          - Selection, preservation, maintenance, collection, and archiving of digital assets
      - Data Curation
          - Active management of data over its life-cycle
      - Data Curators
          - Ensure data is trustworthy, discoverable, accessible, reusable, and fit for use
              - Museum cataloguers of the Internet age
  • 7. What is Data Curation?
      - Data Governance
          - Convergence of data quality, data management, business process management, and risk management
      - Data Curation is a complementary activity
          - Part of the overall data governance strategy for an organization
      - Data Curator = Data Steward ??
          - Overlapping terms between communities
  • 8. Data Quality and Curation
      - What is Data Quality?
          - Desirable characteristics for information resource
          - Described as a series of quality dimensions
              - Discoverability, Accessibility, Timeliness, Completeness, Interpretation, Accuracy, Consistency, Provenance & Reputation
      - Data curation can be used to improve these quality dimensions
  • 9. Data Quality and Curation
      - Discoverability & Accessibility
          - Curate to streamline search by storing and classifying in appropriate and consistent manner
      - Accuracy
          - Curate to ensure data correctly represents the “real-world” values it models
      - Consistency
          - Curate to ensure data created and maintained using standardized definitions, calculations, terms, and identifiers
  • 10. Data Quality and Curation
      - Provenance & Reputation
          - Curate to track source of data and determine reputation
          - Curate to include the objectivity of the source/producer
              - Is the information unbiased, unprejudiced, and impartial?
              - Or does it come from a reputable but partisan source?
      Other dimensions discussed in chapter
  • 11. How to Curate Data
      - Data Curation is a large field with sophisticated techniques and processes
      - Section provides high-level overview on:
          - Should you curate data?
          - Types of Curation
          - Setting up a curation process
      Additional detail and references available in book chapter
  • 12. Should You Curate Data?
      - Curation can have multiple motivations
          - Improving accessibility, quality, consistency, …
      - Will the data benefit from curation?
          - Identify business case
          - Determine if potential return supports investment
      - Not all enterprise data should be curated
          - Suits knowledge-centric data rather than transactional operations data
  • 13. Types of Data Curation
      - Multiple approaches to curate data, no single correct way
          - Who?
              - Individual Curators
              - Curation Departments
              - Community-based Curation
          - How?
              - Manual Curation
              - (Semi-)Automated
              - Sheer Curation
  • 14. Types of Data Curation – Who?
      - Individual Data Curators
          - Suitable for infrequently changing small quantity of data
              - (<1,000 records)
              - Minimal curation effort (minutes per record)
  • 15. Types of Data Curation – Who?
      - Curation Departments
          - Curation experts working with subject matter experts to curate data within formal process
              - Can deal with large curation effort (thousands of records)
      - Limitations
          - Scalability: Can struggle with large quantities of dynamic data (>1 million records)
          - Availability: Post-hoc nature creates delay in curated data availability
  • 16. Types of Data Curation - Who?
      - Community-Based Data Curation
          - Decentralized approach to data curation
          - Crowd-sourcing the curation process
              - Leverages community of users to curate data
          - Wisdom of the community (crowd)
          - Can scale to millions of records
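The "Who?" options on slides 14 to 16 can be read as a rough decision rule keyed on data volume and volatility. The sketch below is illustrative only: the function name `suggest_curation_model` is hypothetical, and treating the slide figures (<1,000 records, thousands of records, more than a million records) as hard cut-offs is an assumption, not something the slides state.

```python
def suggest_curation_model(record_count: int, changes_frequently: bool) -> str:
    """Map rough data volume and volatility to one of the 'Who?' options.

    The thresholds mirror the figures quoted on the slides; using them as
    strict boundaries is an assumption made for illustration.
    """
    # Small, infrequently changing data: a single curator can keep up.
    if record_count < 1_000 and not changes_frequently:
        return "individual curator"
    # Up to roughly a million records: a formal curation department.
    if record_count < 1_000_000:
        return "curation department"
    # Beyond that, the slides suggest only community curation scales.
    return "community-based curation"
```

For example, `suggest_curation_model(50_000, True)` returns `"curation department"`. In practice the choice would also depend on the business case and the expertise needed, which this rule ignores.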
  • 17. Types of Data Curation – How?
      - Manual Curation
          - Curators directly manipulate data
          - Can tie users up with low-value-add activities
      - (Semi-)Automated Curation
          - Algorithms can (semi-)automate curation activities such as data cleansing, record deduplication and classification
          - Can be supervised or approved by human curators
  • 18. Types of Data Curation – How?
      - Sheer curation, or Curation at Source
          - Curation activities integrated in normal workflow of those creating and managing data
          - Can be as simple as vetting or “rating” the results of a curation algorithm
          - Results can be available immediately
      - Blended Approaches: Best of Both
          - Sheer curation + post hoc curation department
          - Allows immediate access to curated data
          - Ensures quality control with expert curation
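A minimal sketch of (semi-)automated curation with human approval, as described on slides 17 and 18: an algorithm proposes likely duplicate records, and a curator accepts or rejects each proposal. The names `MergeProposal`, `propose_duplicates`, and `apply_approved`, the 0.85 similarity threshold, and the use of Python's `difflib.SequenceMatcher` as the matching algorithm are all assumptions chosen for illustration; the slides do not name a specific algorithm.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List


@dataclass
class MergeProposal:
    """A suggested duplicate pair, awaiting a curator's decision."""
    keep: str
    drop: str
    score: float


def propose_duplicates(names: List[str], threshold: float = 0.85) -> List[MergeProposal]:
    """Automated step: flag pairs of names whose similarity meets the threshold."""
    proposals = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            # Case-insensitive character similarity in the range [0, 1].
            score = SequenceMatcher(None, first.lower(), second.lower()).ratio()
            if score >= threshold:
                proposals.append(MergeProposal(keep=first, drop=second, score=score))
    return proposals


def apply_approved(
    names: List[str],
    proposals: List[MergeProposal],
    approve: Callable[[MergeProposal], bool],
) -> List[str]:
    """Human step: drop only the records whose merge a curator approved."""
    dropped = {p.drop for p in proposals if approve(p)}
    return [name for name in names if name not in dropped]
```

Here `approve` stands in for the curator: it could be a review screen, or in a sheer-curation setting a simple accept/reject rating made by the person entering the data. The pairwise comparison is quadratic in the number of records, so it only suits small batches.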
  • 19. Setting up a Curation Process
      - 5 steps to set up a curation process:
          1 - Identify what data you need to curate
          2 - Identify who will curate the data
          3 - Define the curation workflow
          4 - Identify appropriate data-in & data-out formats
          5 - Identify the artifacts, tools, and processes needed to support the curation process
  • 20. The New York Times
      100 Years of Expert Data Curation
  • 21. The New York Times  Largest metropolitan and third largest newspaper in the United States  nytimes.com  Most popular newspaper website in US  100-year-old curated repository defining its participation in the emerging Web of Data
  • 22. The New York Times  Data curation dates back to 1913  Publisher/owner Adolph S. Ochs decided to provide a set of additions to the newspaper  New York Times Index  Organized catalog of article titles and summaries – Containing issue, date and column of article – Categorized by subject and names – Introduced on a quarterly, then annual, basis  Transitory content of newspaper became important source of searchable historical data  Often used to settle historical debates
  • 23. The New York Times  Index Department was created in 1913  Curation and cataloguing of NYT resources – Since 1851 NYT had a low-quality index for internal use  Developed a comprehensive catalog using a controlled vocabulary  Covering subjects, personal names, organizations, geographic locations and titles of creative works (books, movies, etc.), linked to articles and their summaries  Current Index Dept. has ~15 people
  • 24. The New York Times  Challenges with consistently and accurately classifying news articles over time  Keywords expressing subjects may show some variance due to cultural or legal constraints  Identities of some entities, such as organizations and places, changed over time  Controlled vocabulary grew to hundreds of thousands of categories  Adding complexity to classification process
  • 25. The New York Times  Increased importance of Web drove need to improve categorization of online content  Curation carried out by Index Department  Library-time (days to weeks)  Print edition can handle next-day index  Not suitable for real-time online publishing  nytimes.com needed a same-day index
  • 26. The New York Times  Introduced two-stage curation process  Editorial staff perform best-effort semi-automated sheer curation at point of online publication – Several hundred journalists  Index Department follows up with long-term accurate classification and archiving  Benefits:  Non-expert journalist curators provide instant accessibility to online users  Index Department provides long-term high-quality curation in a “trust but verify” approach
  • 27. NYT Curation Workflow  Curation starts with article getting out of the newsroom
  • 28. NYT Curation Workflow  Member of editorial staff submits article to web-based, rule-based information extraction system (SAS Teragram)
  • 29. NYT Curation Workflow  Teragram uses linguistic extraction rules based on subset of Index Dept's controlled vocabulary
  • 30. NYT Curation Workflow  Teragram suggests tags based on the Index vocabulary that can potentially describe the content of article
  • 31. NYT Curation Workflow  Editorial staff member selects terms that best describe the contents and inserts new tags if necessary
  • 32. NYT Curation Workflow  Reviewed by the taxonomy managers with feedback to editorial staff on classification process
  • 33. NYT Curation Workflow  Article is published online at nytimes.com
  • 34. NYT Curation Workflow  At a later stage, article receives second-level curation by Index Dept.: additional Index tags and a summary
  • 35. NYT Curation Workflow  Article is submitted to NYT Index
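The two-stage workflow in slides 27 to 35 can be sketched as a small Python pipeline. This is an illustrative sketch only: it does not reproduce SAS Teragram or any New York Times system. The tiny `VOCABULARY`, the substring matching, and all function and class names are assumptions made for the example. It shows rule-based tag suggestion from a controlled vocabulary, editorial selection at publication time, and later enrichment by the index department.

```python
from dataclasses import dataclass, field

# Hypothetical miniature controlled vocabulary: tag -> trigger phrases.
VOCABULARY = {
    "Baseball": ["baseball", "world series"],
    "Elections": ["election", "ballot"],
    "New York City": ["new york city", "manhattan"],
}


@dataclass
class Article:
    title: str
    body: str
    editorial_tags: set = field(default_factory=set)
    index_tags: set = field(default_factory=set)
    summary: str = ""


def suggest_tags(article, vocabulary=VOCABULARY):
    """Stage 1a: suggest every vocabulary tag with a trigger phrase in the text."""
    text = f"{article.title} {article.body}".lower()
    return {
        tag
        for tag, phrases in vocabulary.items()
        if any(phrase in text for phrase in phrases)
    }


def editorial_curation(article, accepted, new_tags=()):
    """Stage 1b: the editor keeps only the accepted suggestions and may add new tags."""
    suggestions = suggest_tags(article)
    article.editorial_tags = (suggestions & set(accepted)) | set(new_tags)
    return article


def index_curation(article, extra_tags, summary):
    """Stage 2: after publication, the index department adds tags and a summary."""
    article.index_tags = article.editorial_tags | set(extra_tags)
    article.summary = summary
    return article
```

Keeping `editorial_tags` and `index_tags` as separate fields mirrors the "trust but verify" split: the fast editorial result is available immediately, and the slower expert pass builds on it without erasing it.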
  • 36. The New York Times  Early adopter of Linked Open Data (June '09)
  • 37. The New York Times  Linked Open Data @ data.nytimes.com  Subset of 10,000 tags from index vocabulary  Dataset of people, organizations & locations – Complemented by search services to consume data about articles, movies, best sellers, Congress votes, real estate,…  Benefits  Improves traffic by third party data usage  Lowers development cost of new applications for different verticals inside the website – E.g. movies, travel, sports, books
  • 38. Overview  Curation Background  The Business Need for Curated Data  What is Data Curation?  Data Quality and Curation  How to Curate Data  Case Study New York Times  Best Practices from Case Study Learning
  • 39. Best Practices from Case Study Learning  Social Best Practices  Participation  Engagement  Incentives  Community Governance Models  Technical Best Practices  Data Representation  Human and Automated Curation  Track Provenance
  • 40. Social Best Practices  Participation  Stakeholder involvement for data producers and consumers must occur early in project – Provides insight into basic questions of what they want to do, for whom, and what it will provide  White papers are effective means to present these ideas, and solicit opinion from community – Can be used to establish informal 'social contract' for community
  • 41. Social Best Practices  Engagement  Outreach activities essential for promotion and feedback  Typical contributor-to-consumer ratios of less than 5%  Social communication and networking forums are useful – Majority of community may not communicate using these media – Communication by email still remains important
  • 42. Social Best Practices  Incentives  Sheer curation needs line of sight from data curation activity to tangible exploitation benefits  Lack of awareness of value proposition will slow emergence of collaborative contributions  Recognizing contributing curators through a formal feedback mechanism – Reinforces contribution culture – Directly increases output quality
  • 43. Social Best Practices  Community Governance Models  Effective governance structure is vital to ensure success of community  Internal communities and consortia perform well when they leverage traditional corporate and democratic governance models  Open communities need to engage the community within the governance process – Follow less orthodox approaches using meritocratic and autocratic principles
  • 44. Technical Best Practices  Data Representation  Must be robust and standardized to encourage community usage and tools development  Support for legacy data formats and ability to translate data forward to support new technology and standards  Human & Automated Curation  Balancing will improve data quality  Automated curation should always defer to, and never override, human curation edits – Automate validating data deposition and entry – Target community at focused curation tasks
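One way to enforce the rule that automated curation never overrides human edits is to remember which fields a human has touched. The Python sketch below is an assumption-laden illustration, not a method described in the presentation: `CuratedRecord` and its method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CuratedRecord:
    values: dict = field(default_factory=dict)
    human_fields: set = field(default_factory=set)  # fields edited by a human curator

    def apply_human(self, name, value):
        """Human edits always apply and mark the field as human-curated."""
        self.values[name] = value
        self.human_fields.add(name)

    def apply_automated(self, name, value):
        """Automated edits apply only to fields no human has edited.

        Returns True if the edit was applied, False if it was skipped.
        """
        if name in self.human_fields:
            return False
        self.values[name] = value
        return True
```

Returning a flag rather than raising an error lets a batch cleansing job continue while still reporting which suggestions were skipped, so they can be routed to a curator for review.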
  • 45. Technical Best Practices  Track Provenance  All curation activities should be recorded and maintained as part of the data provenance effort – Especially where human curators are involved  Users can have different perspectives of provenance – A scientist may need to evaluate the fine-grained experiment description behind the data – For a business analyst the 'brand' of data provider can be sufficient for determining quality
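Provenance tracking as described on this slide can be sketched as an append-only log of curation activities. The Python below is a minimal sketch with hypothetical names (`ProvenanceEntry`, `ProvenanceLog`); it follows no particular provenance standard. The two query methods illustrate the two perspectives mentioned: a fine-grained history of every change, and a coarse list of who contributed.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceEntry:
    record_id: str
    field_name: str
    old_value: object
    new_value: object
    agent: str        # who or what made the change
    agent_type: str   # "human" or "automated" (convention assumed for this sketch)
    timestamp: datetime


class ProvenanceLog:
    def __init__(self):
        self._entries = []

    def record(self, record_id, field_name, old_value, new_value, agent, agent_type):
        """Append one curation activity to the log and return the stored entry."""
        entry = ProvenanceEntry(
            record_id, field_name, old_value, new_value,
            agent, agent_type, datetime.now(timezone.utc),
        )
        self._entries.append(entry)
        return entry

    def history(self, record_id):
        """Fine-grained view: every change to a record, in the order it was made."""
        return [e for e in self._entries if e.record_id == record_id]

    def sources(self, record_id):
        """Coarse view: distinct agents for a record, in order of first contribution."""
        agents = []
        for entry in self.history(record_id):
            if entry.agent not in agents:
                agents.append(entry.agent)
        return agents
```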
  • 46. Conclusions  Data curation can ensure the quality of data and its fitness for use  Pre-competitive data can be shared without conferring a commercial advantage  Pre-competitive data communities  Common curation tasks carried out once in public domain  Reduces cost, increases quantity and quality
  • 47. Acknowledgements  Collaborators Andre Freitas & Seán O'Riain  Insight from Thought Leaders  Evan Sandhaus (Semantic Technologist), Rob Larson (Vice President Product Development and Management), and Gregg Fenton (Director Emerging Platforms) from the New York Times  Krista Thomas (Vice President, Marketing & Communications), Tom Tague (OpenCalais initiative Lead) from Thomson Reuters  Antony Williams (VP of Strategic Development) from ChemSpider  Helen Berman (Director), John Westbrook (Product Development) from the Protein Data Bank  Nick Lynch (Architect with AstraZeneca) from the Pistoia Alliance.  The work presented has been funded by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
  • 48. Further Information  The Role of Community-Driven Data Curation for Enterprises. Edward Curry, Andre Freitas, & Seán O'Riain. In David Wood (ed.), Linking Enterprise Data, Springer, 2010. Available free at: http://3roundstones.com/led_book/led-curry-et-al.html