The New York Times is the largest metropolitan and the third largest newspaper in the United States. The Times website, nytimes.com, is ranked as the most
popular newspaper website in the United States and is an important source of advertisement revenue for the company. The NYT has a rich history of curating its articles, and its 100-year-old curated repository has ultimately defined its participation as one of the first players in the emerging Web of Data.
Data curation is a process that can ensure the quality of data and its fitness for use. Traditional approaches to curation are struggling with increased data volumes and near real-time demands for curated data. In response, curation teams have turned to community crowd-sourcing and semi-automated metadata tools for assistance.
E. Curry, A. Freitas, and S. O’Riáin, “The Role of Community-Driven Data Curation for Enterprises,” in Linking Enterprise Data, D. Wood, Ed. Boston, MA: Springer US, 2010, pp. 25-47.
Data Curation at the New York Times
1. Digital Enterprise Research Institute www.deri.ie
Data Curation at the
New York Times
Edward Curry, Andre Freitas, Seán O'Riain
ed.curry@deri.org
http://www.deri.org/
http://www.EdwardCurry.org/
Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
2. Speaker Profile
Research Scientist at the Digital Enterprise Research
Institute (DERI)
Leading international web science research organization
Researching how the Web of Data is changing the way businesses
work and interact with information
Projects include studies of enterprise linked data, community-
based data curation, semantic data analytics, and semantic
search
Investigate utilization within the pharmaceutical, oil & gas,
financial, advertising, media, manufacturing, health care, ICT,
and automotive industries
Invited speaker at the 2010 MIT Sloan CIO Symposium
to an audience of more than 600 CIOs
3. Overview
Curation Background
The Business Need for Curated Data
What is Data Curation?
Data Quality and Curation
How to Curate Data
New York Times Case Study
Best Practices from Case Study Learning
4. The Business Need
Knowledge workers need:
Access to the right information
Confidence in that information
Working with incomplete,
inaccurate, or wrong
information can have
disastrous consequences
5. The Problems with Data
Flawed Data
Affects 25% of critical data in the world's top companies
(Gartner)
Data Quality
Recent banking crisis (Economist, Dec '09)
Inaccurate figures made it difficult to manage operations
(investments exposure and risk)
– “assets are defined differently in different programs”
– “numbers did not always add up”
– “departments do not trust each other's figures”
– “figures … not worth the pixels they were made of”
6. What is Data Curation?
Digital Curation
Selection, preservation, maintenance, collection, and
archiving of digital assets
Data Curation
Active management of data over its life-cycle
Data Curators
Ensure data is trustworthy, discoverable, accessible,
reusable, and fit for use
– Museum cataloguers of the Internet age
7. What is Data Curation?
Data Governance
Convergence of data quality, data management,
business process management, and risk
management
Data Curation is a complementary activity
Part of overall data governance strategy for
organization
Data Curator = Data Steward ??
Overlapping terms between communities
8. Data Quality and Curation
What is Data Quality?
Desirable characteristics for information resource
Described as a series of quality dimensions
– Discoverability, Accessibility, Timeliness, Completeness,
Interpretation, Accuracy, Consistency, Provenance &
Reputation
Data curation can be used to improve these
quality dimensions
9. Data Quality and Curation
Discoverability & Accessibility
Curate to streamline search by storing and classifying
in appropriate and consistent manner
Accuracy
Curate to ensure data correctly represents the “real-
world” values it models
Consistency
Curate to ensure data created and maintained using
standardized definitions, calculations, terms, and
identifiers
10. Data Quality and Curation
Provenance & Reputation
Curate to track source of data and determine reputation
Curate to include the objectivity of the source/producer
– Is the information unbiased, unprejudiced, and impartial?
– Or does it come from a reputable but partisan source?
Other dimensions discussed in chapter
11. How to Curate Data
Data Curation is a large field with sophisticated
techniques and processes
Section provides high-level overview on:
Should you curate data?
Types of Curation
Setting up a curation process
Additional detail and references available in book
chapter
12. Should You Curate Data?
Curation can have multiple motivations
Improving accessibility, quality, consistency,…
Will the data benefit from curation?
Identify business case
Determine whether the potential return supports the investment
Not all enterprise data should be curated
Suits knowledge-centric data rather than transactional
operations data
13. Types of Data Curation
Multiple approaches to curate data, no single
correct way
Who?
– Individual Curators
– Curation Departments
– Community-based Curation
How?
– Manual Curation
– (Semi-)Automated
– Sheer Curation
14. Types of Data Curation – Who?
Individual Data Curators
Suitable for infrequently changing small quantity of
data
– (<1,000 records)
– Minimal curation effort (minutes per record)
15. Types of Data Curation – Who?
Curation Departments
Curation experts working with subject matter experts
to curate data within formal process
– Can deal with large curation effort (thousands of records)
Limitations
Scalability: Can struggle with large quantities of
dynamic data (>million records)
Availability: Post-hoc nature creates delay in curated
data availability
16. Types of Data Curation - Who?
Community-Based Data Curation
Decentralized approach to data curation
Crowd-sourcing the curation process
– Leverages community of users to curate data
Wisdom of the community (crowd)
Can scale to millions of records
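The size guidance on the three "Who?" slides can be read as a rough rule of thumb. The sketch below is only an illustration: the function name and the `changes_frequently` flag are hypothetical, and the cut-offs are taken directly from the record counts quoted on the slides rather than from any formal method.

```python
def suggest_curation_model(record_count: int, changes_frequently: bool) -> str:
    """Map dataset size to one of the curation models named on the slides.

    Assumed thresholds (from the slide text, not a formal rule):
      - individual curators: fewer than 1,000 infrequently changing records
      - curation departments: thousands of records, up to about a million
      - community-based curation: millions of records
    """
    if record_count < 1_000 and not changes_frequently:
        return "individual curator"
    if record_count <= 1_000_000:
        return "curation department"
    return "community-based curation"


print(suggest_curation_model(500, changes_frequently=False))
print(suggest_curation_model(50_000, changes_frequently=True))
print(suggest_curation_model(5_000_000, changes_frequently=True))
```

In practice the choice also depends on the curation effort per record and on how quickly curated data must be available, which this sketch ignores.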
17. Types of Data Curation – How?
Manual Curation
Curators directly manipulate data
Can tie users up with low-value add activities
(Semi-)Automated Curation
Algorithms can (semi-)automate curation activities
such as data cleansing, record de-duplication, and
classification
Can be supervised or approved by human curators
18. Types of Data Curation – How?
Sheer curation, or Curation at Source
Curation activities integrated in normal workflow of
those creating and managing data
Can be as simple as vetting or “rating” the results of a
curation algorithm
Results can be available immediately
Blended Approaches: Best of Both
Sheer curation + post hoc curation department
Allows immediate access to curated data
Ensures quality control with expert curation
19. Setting up a Curation Process
5 Steps to set up a curation process:
1 - Identify what data you need to curate
2 - Identify who will curate the data
3 - Define the curation workflow
4 - Identify appropriate data-in & data-out formats
5 - Identify the artifacts, tools, and processes needed to
support the curation process
20. The New York Times
100 Years of Expert Data Curation
21. The New York Times
Largest metropolitan and third largest
newspaper in the United States
nytimes.com
Most popular newspaper
website in US
100 year old curated
repository defining its
participation in the
emerging Web of Data
22. The New York Times
Data curation dates back to 1913
Publisher/owner Adolph S. Ochs decided to provide a
set of additions to the newspaper
New York Times Index
Organized catalog of article titles and summaries
– Containing issue, date and column of article
– Categorized by subject and names
– Introduced on quarterly then annual basis
Transitory content of newspaper became
important source of searchable historical data
Often used to settle historical debates
23. The New York Times
Index Department was created in 1913
Curation and cataloguing of NYT resources
– Since 1851 NYT had low quality index for internal use
Developed a comprehensive catalog using a
controlled vocabulary
Covering subjects, personal names, organizations,
geographic locations and titles of creative works
(books, movies, etc), linked to articles and their
summaries
Current Index Dept. has ~15 people
24. The New York Times
Challenges with consistently and accurately
classifying news articles over time
Keywords expressing subjects may show some
variance due to cultural or legal constraints
Identities of some entities, such as organizations and
places, changed over time
Controlled vocabulary grew to hundreds of
thousands of categories
Adding complexity to classification process
25. The New York Times
Increased importance of Web drove need to
improve categorization of online content
Curation carried out by Index Department
Library-time (days to weeks)
Print edition can handle next-day index
Not suitable for real-time online publishing
nytimes.com needed a same-day index
26. The New York Times
Introduced two stage curation process
Editorial staff performed best-effort semi-automated
sheer curation at point of online publication
– Several hundred journalists
Index Department follows up with long-term accurate
classification and archiving
Benefits:
Non-expert journalist curators provide instant
accessibility to online users
Index Department provides long-term high-quality
curation in a “trust but verify” approach
27. NYT Curation Workflow
Curation starts with article getting out of the newsroom
28. NYT Curation Workflow
Member of editorial staff submits article to web-based, rule-based
information extraction system (SAS Teragram)
29. NYT Curation Workflow
Teragram uses linguistic extraction rules based on subset of
Index Dept.'s controlled vocabulary
30. NYT Curation Workflow
Teragram suggests tags based on the Index vocabulary that
can potentially describe the content of article
31. NYT Curation Workflow
Editorial staff member selects terms that best describe the
contents and inserts new tags if necessary
32. NYT Curation Workflow
Reviewed by the taxonomy managers with feedback to
editorial staff on classification process
33. NYT Curation Workflow
Article is published online at nytimes.com
34. NYT Curation Workflow
At a later stage, the article receives second-level curation by the Index
Dept.: additional Index tags and a summary
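The workflow slides above can be sketched as a small two-stage pipeline. This is a minimal illustration under stated assumptions, not the NYT or SAS Teragram implementation: the vocabulary entries, class, and function names are hypothetical, plain substring rules stand in for the linguistic extraction rules, and the taxonomy managers' review step is omitted.

```python
from dataclasses import dataclass, field

# Hypothetical slice of a controlled vocabulary: index tag -> trigger phrases.
VOCABULARY = {
    "Banking and Financial Institutions": ["bank", "lender"],
    "Elections": ["election", "ballot"],
    "Motion Pictures": ["film", "movie"],
}


@dataclass
class Article:
    title: str
    body: str
    tags: set = field(default_factory=set)
    summary: str = ""
    published: bool = False


def suggest_tags(article: Article) -> list:
    """Stage 1a: suggest vocabulary tags whose trigger phrases occur in the text."""
    text = f"{article.title} {article.body}".lower()
    return sorted(
        tag
        for tag, phrases in VOCABULARY.items()
        if any(phrase in text for phrase in phrases)
    )


def editorial_curation(article: Article, accepted, new_tags=()) -> Article:
    """Stage 1b: the editor keeps fitting suggestions, adds new tags, and publishes."""
    article.tags = set(accepted) | set(new_tags)
    article.published = True
    return article


def index_curation(article: Article, extra_tags, summary: str) -> Article:
    """Stage 2: a later expert pass adds further index tags and a summary."""
    article.tags |= set(extra_tags)
    article.summary = summary
    return article


article = Article("Voters Head to the Polls",
                  "The election follows a year of bank failures.")
print(suggest_tags(article))  # suggestions only; the editor decides what to keep
editorial_curation(article, accepted=["Elections"], new_tags=["Voting"])
index_curation(article, ["Banking and Financial Institutions"],
               "Election held after a year of bank failures.")
```

The point of the split is that the article is already published after the editorial step, while the second pass only adds to what is there.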
36. The New York Times
Early adopter of Linked Open Data (June '09)
37. The New York Times
Linked Open Data @ data.nytimes.com
Subset of 10,000 tags from index vocabulary
Dataset of people, organizations & locations
– Complemented by search services to consume data
about articles, movies, best sellers, Congress votes,
real estate,…
Benefits
Improves traffic by third party data usage
Lowers development cost of new applications for
different verticals inside the website
– E.g. movies, travel, sports, books
38. Overview
Curation Background
The Business Need for Curated Data
What is Data Curation?
Data Quality and Curation
How to Curate Data
Case Study New York Times
Best Practices from Case Study Learning
39. Best Practices from Case Study Learning
Social Best Practices
Participation
Engagement
Incentives
Community Governance Models
Technical Best Practices
Data Representation
Human and Automated Curation
Track Provenance
40. Social Best Practices
Participation
Involvement of stakeholders, both data producers and
consumers, must occur early in the project
– Provides insight into basic questions of what they want
to do, for whom, and what it will provide
White papers are effective means to present these
ideas, and solicit opinion from community
– Can be used to establish informal 'social contract' for
community
41. Social Best Practices
Engagement
Outreach activities essential for promotion and
feedback
Contributors are typically fewer than 5% of
consumers
Social communication and networking forums are
useful
– Majority of community may not communicate using
these media
– Communication by email still remains important
42. Social Best Practices
Incentives
Sheer curation needs line of sight from data curating
activity, to tangible exploitation benefits
Lack of awareness of value proposition will slow
emergence of collaborative contributions
Recognizing contributing curators through a formal
feedback mechanism
– Reinforces contribution culture
– Directly increases output quality
43. Social Best Practices
Community Governance Models
Effective governance structure is vital to ensure
success of community
Internal communities and consortia perform well
when they leverage traditional corporate and
democratic governance models
Open communities need to engage the community
within the governance process
– Follow less orthodox approaches using meritocratic
and autocratic principles
44. Technical Best Practices
Data Representation
Must be robust and standardized to encourage
community usage and tools development
Support for legacy data formats and ability to
translate data forward to support new technology and
standards
Human & Automated Curation
Balancing will improve data quality
Automated curation should always defer to, and never
override, human curation edits
– Automate validating data deposition and entry
– Target community at focused curation tasks
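The rule that automated curation defers to human edits can be sketched as a simple merge. This is an illustration only: the function name and the field names in the example are hypothetical, and records are assumed to be flat dictionaries.

```python
def merge_curation(human_edits: dict, automated: dict) -> dict:
    """Combine curation values for one record.

    Automated values fill fields that no human has curated;
    any field a human has edited keeps the human value.
    """
    merged = dict(automated)    # start from the automated suggestions
    merged.update(human_edits)  # human edits overwrite, never the reverse
    return merged


record = merge_curation(
    human_edits={"subject": "Elections"},
    automated={"subject": "Politics", "location": "Ohio"},
)
print(record)
```

Keeping the two sources in separate inputs, rather than letting an algorithm write into the curated record directly, is one way to make the precedence explicit.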
45. Technical Best Practices
Track Provenance
All curation activities should be recorded and
maintained as part of the data provenance effort
– Especially where human curators are involved
Users can have different perspectives of provenance
– A scientist may need to evaluate the fine grained
experiment description behind the data
– For a business analyst the 'brand' of data provider can
be sufficient for determining quality
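Recording curation activities could look something like the append-only log below. It is a minimal sketch under assumptions: the class names, fields, and the in-memory list are hypothetical choices for illustration, and a real system would persist events and likely follow an established provenance vocabulary.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CurationEvent:
    record_id: str
    curator: str
    curator_kind: str  # "human" or "automated"
    action: str
    timestamp: datetime


class ProvenanceLog:
    """Append-only record of who did what to which record, and when."""

    def __init__(self):
        self._events = []

    def record(self, record_id, curator, curator_kind, action):
        event = CurationEvent(record_id, curator, curator_kind, action,
                              datetime.now(timezone.utc))
        self._events.append(event)
        return event

    def history(self, record_id):
        """Return all events for one record, in the order they were logged."""
        return [e for e in self._events if e.record_id == record_id]


log = ProvenanceLog()
log.record("article-1", "tagger", "automated", "suggested tag: Elections")
log.record("article-1", "j.doe", "human", "accepted tag: Elections")
for event in log.history("article-1"):
    print(event.curator_kind, event.curator, event.action)
```

A fine-grained view (every event) and a coarse view (only the provider name) can both be derived from the same log, which matches the two user perspectives on the slide.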
46. Conclusions
Data curation can ensure the quality of data and
its fitness for use
Pre-competitive data can be shared without
conferring a commercial advantage
Pre-competitive data communities
Common curation tasks carried out once in public
domain
Reduces cost, increases quantity and quality
47. Acknowledgements
Collaborators Andre Freitas & Seán O'Riain
Insight from Thought Leaders
Evan Sandhaus (Semantic Technologist), Rob Larson (Vice President Product
Development and Management), and Gregg Fenton (Director Emerging Platforms)
from the New York Times
Krista Thomas (Vice President, Marketing & Communications), Tom Tague
(OpenCalais initiative Lead) from Thomson Reuters
Antony Williams (VP of Strategic Development) from ChemSpider
Helen Berman (Director), John Westbrook (Product Development) from the Protein
Data Bank
Nick Lynch (Architect with AstraZeneca) from the Pistoia Alliance.
The work presented has been funded by Science
Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
48. Further Information
The Role of Community-Driven
Data Curation for Enterprises
Edward Curry, Andre Freitas, & Seán O'Riain
In David Wood (ed.),
Linking Enterprise Data, Springer, 2010.
Available Free at:
http://3roundstones.com/led_book/led-curry-et-al.html