EarthBiAs2014
Global NEST
University of the Aegean

Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Insight Centre for Data Analytics,
National University of Ireland Galway
Take Home
Algorithms + Humans: Data → Better Data
Talk Overview
•  Part I: Motivation
•  Part II: Data Quality and Data Curation
•  Part III: Crowdsourcing
•  Part IV: Case Studies on Crowdsourced Data Curation
•  Part V: Setting up a Crowdsourced Data Curation Process
•  Part VI: Linked Open Data Example
•  Part VII: Future Research Challenges
MOTIVATION
PART I
BIG – Big Data Public Private Forum
THE BIG PROJECT
Overall objective
Bring the necessary stakeholders into a self-sustainable, industry-led initiative that will greatly contribute to enhancing EU competitiveness by taking full advantage of Big Data technologies.
Work at the technical, business, and policy levels, shaping the future through the positioning of IIM and Big Data, specifically in Horizon 2020.
SITUATING BIG DATA IN INDUSTRY
Industry Driven Sectorial Forums (needs): Health; Public Sector; Finance & Insurance; Telco, Media & Entertainment; Manufacturing, Retail, Energy, Transport
Technical Working Groups (offerings) along the Big Data Value Chain:

Data Acquisition
•  Structured data
•  Unstructured data
•  Event processing
•  Sensor networks
•  Protocols
•  Real-time
•  Data streams
•  Multimodality

Data Analysis
•  Stream mining
•  Semantic analysis
•  Machine learning
•  Information extraction
•  Linked Data
•  Data discovery
•  ‘Whole world’ semantics
•  Ecosystems
•  Community data analysis
•  Cross-sectorial data analysis

Data Curation
•  Data Quality
•  Trust / Provenance
•  Annotation
•  Data validation
•  Human-Data Interaction
•  Top-down/Bottom-up
•  Community / Crowd
•  Human Computation
•  Curation at scale
•  Incentivisation
•  Automation
•  Interoperability

Data Storage
•  In-Memory DBs
•  NoSQL DBs
•  NewSQL DBs
•  Cloud storage
•  Query Interfaces
•  Scalability and Performance
•  Data Models
•  Consistency, Availability, Partition-tolerance
•  Security and Privacy
•  Standardization

Data Usage
•  Decision support
•  Prediction
•  In-use analytics
•  Simulation
•  Exploration
•  Visualisation
•  Modeling
•  Control
•  Domain-specific usage
SUBJECT MATTER EXPERT INTERVIEWS
KEY INSIGHTS
Key Trends
▶  Lower usability barrier for data tools
▶  Blended human and algorithmic data processing for coping with data quality
▶  Leveraging large communities (crowds)
▶  Need for standardized semantic data representation
▶  Significant increase in the use of new data models (e.g. graph), for expressivity and flexibility

The Data Landscape
▶  Much (Big Data) technology is evolving in an evolutionary fashion
▶  But the accompanying business process change must be revolutionary
▶  Data variety and verifiability are key opportunities
▶  The long tail of data variety is a major shift in the data landscape

Biggest Blockers
▶  Lack of business-driven Big Data strategies
▶  Need for format and data storage technology standards
▶  Data exchange between companies, institutions, individuals, etc.
▶  Regulations & markets for data access
▶  Human resources: lack of skilled data scientists

Technical White Papers available on:
http://www.big-project.eu
The Internet of Everything:
Connecting the Unconnected
Earth Science – Systems of Systems
Citizen Sensors
“…humans as citizens on the ubiquitous Web, acting as sensors and sharing their observations and views…”
¨  Sheth, A. (2009). Citizen sensing, social signals, and enriching human experience. Internet Computing, IEEE, 13(4), 87-92.
Air Pollution
Citizens as Sensors
Haklay, M., 2013, Citizen Science and Volunteered Geographic Information – overview and typology of participation, in Sui, D.Z., Elwood, S. and M.F. Goodchild (eds.), 2013. Crowdsourcing Geographic Knowledge: Volunteered Geographic Information (VGI) in Theory and Practice. Berlin: Springer.
DATA QUALITY AND DATA CURATION
PART II
The Problems with Data
Knowledge workers need:
¨  Access to the right data
¨  Confidence in that data
Flawed data affects 25% of critical data in the world’s top companies
Data quality played a role in the recent financial crisis:
¨  “Assets are defined differently in different programs”
¨  “Numbers did not always add up”
¨  “Departments do not trust each other’s figures”
¨  “Figures … not worth the pixels they were made of”
What is Data Quality?
“Desirable characteristics for an information resource”
Described as a series of quality dimensions:
n  Discoverability & Accessibility: stored and classified in an appropriate and consistent manner
n  Accuracy: correctly represents the “real-world” values it models
n  Consistency: created and maintained using standardized definitions, calculations, terms, and identifiers
n  Provenance & Reputation: track the source & determine its reputation
¨  Includes the objectivity of the source/producer
¨  Is the information unbiased, unprejudiced, and impartial?
¨  Or does it come from a reputable but partisan source?
Wang, R. and D. Strong, Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 1996. 12(4): p. 5-33.
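
These dimensions can be operationalised as simple ratio metrics in the spirit of Pipino, Lee & Wang (2002), cited in the references. A minimal sketch in Python (the record fields and the consistency rule are illustrative assumptions, not from the slides):

products = [
    {"id": "APNR", "name": "iPod Nano", "color": "Red", "price": 150},
    {"id": "APNS", "name": "iPod Nano", "color": "Silver", "price": None},  # missing price
    {"id": "apnx", "name": "iPod Nano", "color": "Black", "price": 180},    # non-standard id casing
]

def completeness(records, fields):
    # 1 - (missing values / total values)
    total = len(records) * len(fields)
    missing = sum(1 for r in records for f in fields if r.get(f) is None)
    return 1 - missing / total

def consistency(records, field, rule):
    # 1 - (rule violations / total records)
    violations = sum(1 for r in records if r[field] is not None and not rule(r[field]))
    return 1 - violations / len(records)

print(completeness(products, ["id", "name", "color", "price"]))  # ~0.92
print(consistency(products, "id", str.isupper))                  # ~0.67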
Data Quality

Source A (relational, maintained by a data developer):

ID    PNAME       PCOLOR   PRICE
APNR  iPod Nano   Red      150
APNS  iPod Nano   Silver   160

Source B (XML):

<Product name="iPod Nano">
    <Items>
        <Item code="IPN890">
            <price>150</price>
            <generation>5</generation>
        </Item>
    </Items>
</Product>

[Figure: integrating the two sources raises the questions that data developers, data stewards, and business users must answer: Schema differences? Value conflicts? Entity duplication?]
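
Before a human curator is involved, an algorithmic pass can surface these questions automatically. A minimal sketch (illustrative only; the common record shape is an assumption) that flags candidate duplicates and value conflicts once both sources are mapped:

source_a = [
    {"id": "APNR", "name": "iPod Nano", "color": "Red", "price": 150},
    {"id": "APNS", "name": "iPod Nano", "color": "Silver", "price": 160},
]
source_b = [
    {"code": "IPN890", "name": "iPod Nano", "price": 150, "generation": 5},
]

# Crude entity matching on the product name; a real system would use
# richer similarity measures and curated mappings.
for b in source_b:
    for a in source_a:
        if a["name"].lower() == b["name"].lower():
            if a["price"] == b["price"]:
                print("possible duplicate entity:", a["id"], "~", b["code"])
            else:
                print("value conflict:", a["id"], "vs", b["code"],
                      a["price"], "!=", b["price"])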
What is Data Curation?
n  Digital Curation
¨ Selection, preservation, maintenance, collection,
and archiving of digital assets
n  Data Curation
¨ Active management of data over its life-cycle
n  Data Curators
¨ Ensure data is trustworthy, discoverable, accessible,
reusable, and fit for use
– Museum cataloguers of the Internet age
Related Activities
n  Data Governance/ Master Data Management
¨ Convergence of data quality, data management,
business process management, and risk
management
¨ Part of overall data governance strategy for
organization
n  Data Curator = Data Steward
Types of Data Curation
n  Multiple approaches to curate data, no
single correct way
¨ Who?
– Individual Curators
– Curation Departments
– Community-based Curation
¨ How?
– Manual Curation
– (Semi-)Automated
– Sheer Curation
Types of Data Curation – Who?
n  Individual Data Curators
¨ Suitable for infrequently changing small quantity
of data
–  (<1,000 records)
–  Minimal curation effort (minutes per record)
Types of Data Curation – Who?
n  Curation Departments
¨ Curation experts working with subject matter
experts to curate data within formal process
–  Can deal with a large curation effort (thousands of records)
n  Limitations
¨ Scalability: Can struggle with large quantities of
dynamic data (>million records)
¨ Availability: Post-hoc nature creates delay in
curated data availability
Types of Data Curation - Who?
n  Community-Based Data Curation
¨ Decentralized approach to data curation
¨ Crowd-sourcing the curation process
– Leverages community of users to curate data
¨ Wisdom of the community (crowd)
¨ Can scale to millions of records
Types of Data Curation – How?
n  Manual Curation
¨ Curators directly manipulate data
¨ Can tie users up with low-value add activities
n  (Sem-)Automated Curation
¨ Algorithms can (semi-)automate curation
activities such as data cleansing, record
duplication and classification
¨ Can be supervised or approved by human
curators
Types of Data Curation – How?
n  Sheer curation, or Curation at Source
¨ Curation activities integrated in normal workflow
of those creating and managing data
¨ Can be as simple as vetting or “rating” the
results of a curation algorithm
¨ Results can be available immediately
n  Blended Approaches: Best of Both
¨ Sheer curation + post hoc curation department
¨ Allows immediate access to curated data
¨ Ensures quality control with expert curation
Data Curation Example
[Figure: a product-data curation pipeline – profile sources, define mappings, define rules, then cleanse, enrich, and de-duplicate; the data developer, data curator, and data governance roles deliver curated product data to business users and applications]
Data Curation
n  Pros
¨  Can create a single version of truth
¨  Standardized information creation and management
¨  Improves data quality
n  Cons
¨  Significant upfront costs and efforts
¨  Participation limited to few (mostly) technical experts
¨  Difficult to scale for large data sources
–  Extended Enterprise e.g. partner, data vendors
¨  Small % of data under management (i.e. CRM, Product, …)
The New York Times
100 Years of Expert Data Curation
The New York Times
n  Largest metropolitan and third largest
newspaper in the United States
n  nytimes.com
q  Most popular newspaper
website in US
n  100 year old curated
repository defining its
participation in the
emerging Web of Data
The New York Times
n  Data curation dates back to 1913
¨ Publisher/owner Adolph S. Ochs decided to
provide a set of additions to the newspaper
n  New York Times Index
¨ Organized catalog of articles titles and summaries
–  Containing issue, date and column of article
–  Categorized by subject and names
–  Introduced on quarterly then annual basis
n  Transitory content of newspaper became
important source of searchable historical data
¨ Often used to settle historical debates
The New York Times
n   Index Department was created in 1913
¨ Curation and cataloguing of NYT resources
–  Since 1851 NYT had low quality index for internal use
n  Developed a comprehensive catalog using a
controlled vocabulary
¨ Covering subjects, personal names,
organizations, geographic locations and titles of
creative works (books, movies, etc), linked to
articles and their summaries
n  Current Index Dept. has ~15 people
The New York Times
n  Challenges with consistently and accurately
classifying news articles over time
¨ Keywords expressing subjects may show some
variance due to cultural or legal constraints
¨ Identities of some entities, such as organizations
and places, changed over time
n  Controlled vocabulary grew to hundreds of
thousands of categories
¨ Adding complexity to classification process
The New York Times
n  Increased importance of Web drove need to
improve categorization of online content
n  Curation carried out by Index Department
¨ Library-time (days to weeks)
¨ Print edition can handle next-day index
n  Not suitable for real-time online publishing
¨ nytimes.com needed a same-day index
The New York Times
n  Introduced two stage curation process
¨ Editorial staff performed best-effort semi-
automated sheer curation at point of online pub.
–  Several hundreds journalists
¨ Index Department follow up with long-term
accurate classification and archiving
n  Benefits:
¨ Non-expert journalist curators provide instant
accessibility to online users
¨ Index Department provides long-term high-
quality curation in a “trust but verify” approach
NYT Curation Workflow
¨ Curation starts with the article leaving the newsroom
NYT Curation Workflow
¨ A member of the editorial staff submits the article to a web-based, rule-based information extraction system (SAS Teragram)
NYT Curation Workflow
¨ Teragram uses linguistic extraction rules based on a subset of the Index Department’s controlled vocabulary
NYT Curation Workflow
¨ Teragram suggests tags, based on the Index vocabulary, that can potentially describe the content of the article
NYT Curation Workflow
¨ Editorial staff member selects terms that best describe
the contents and inserts new tags if necessary
NYT Curation Workflow
¨ Reviewed by the taxonomy managers, with feedback to the editorial staff on the classification process
NYT Curation Workflow
¨ Article is published online at nytimes.com
NYT Curation Workflow
¨ At a later stage the article receives second-level curation by the Index Dept.: additional Index tags and a summary
NYT Curation Workflow
¨ Article is submitted to NYT Index
The New York Times
n  Early adopter of Linked Open Data (June ‘09)
The New York Times
n  Linked Open Data @ data.nytimes.com
¨ Subset of 10,000 tags from index vocabulary
¨ Dataset of people, organizations & locations
– Complemented by search services to consume
data about articles, movies, best sellers,
Congress votes, real estate,…
n  Benefits
¨ Improves traffic by third party data usage
¨ Lowers development cost of new applications
for different verticals inside the website
–  E.g. movies, travel, sports, books
CROWDSOURCING
PART III
Crowdsourcing Landscape
Introduction to Crowdsourcing
n  Coordinating a crowd (a large group of workers)to
do micro-work (small tasks) that solves problems
(that computers or a single user can’t)
n  A collection of mechanisms and associated
methodologies for scaling and directing
crowd activities to achieve goals
n  Related Areas
¨  Collective Intelligence
¨  Social Computing
¨  Human Computation
¨  Data Mining
A. J. Quinn and B. B. Bederson, “Human computation: a survey and taxonomy of a growing field,” in
Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems, 2011, pp.
1403–1412.
When Computers Were Human
n  Maskelyne 1760
¨ Used human computers
to created almanac of
moon positions
– Used for shipping/
navigation
¨ Quality assurance
– Do calculations twice
– Compare to third verifier
D. A. Grier, When Computers Were Human,
vol. 13. Princeton University Press, 2005.
20th Century
1926: Teleautomation
“…when wireless is perfectly applied the whole earth will be converted
into a huge brain.”
1948: Cybernetics
“…communication and control theory that is concerned especially with
the comparative study of automatic control systems.”
Credits: Thierry Ehrmann (Flickr), Dr. Sabina Jeschke, Wikimedia Foundation
1961: Embedded systems
“A system with a dedicated function within a larger mechanical or electrical
system, often with real-time computing constraints.”
1988: Ubiquitous computing
“…advanced computing concept where computing is made to appear
everywhere and anywhere.”
21st Century
Credits: Kevin Ashton, Amit Sheth, Helen Gill, Wikimedia Foundation
1999: Internet of Things
“…to uniquely identifiable objects and their virtual representations in
an Internet-like structure.”
2006: Cyber-physical systems
“…communication and control theory that is concerned especially
with the comparative study of automatic control systems.”
2008: Web of Things
“A set of blueprints to make every-day physical objects first class
citizens of the World Wide Web by giving them an API.”
2012: Physical-Cyber-Social computing
“a holistic treatment of data, information, and knowledge from the PCS
worlds to integrate, correlate, interpret, and provide contextually relevant
abstractions to humans.”
Human-Powered CPS: Sensing, Computation, Actuation
Leverages human capabilities in conjunction with machine capabilities for optimizing processes in cyber-physical-social environments
Credits: Albany Associates, stuartpilrow, Mike_n (Flickr)
Human vs Machine Affordances
Human
ü Visual perception
ü Visuospatial thinking
ü Audiolinguistic ability
ü Sociocultural awareness
ü Creativity
ü Domain knowledge
Machine
ü Large-scale data manipulation
ü Collecting and storing large amounts of data
ü Efficient data movement
ü Bias-free analysis
R. J. Crouser and R. Chang, “An affordance-based framework for human computation and human-computer collaboration,” IEEE Trans. Vis. Comput. Graph., vol. 18, pp. 2859–2868, 2012.
When to Crowdsource a Task?
n  Computers cannot do the task
n  Single person cannot do the task
n  Work can be split into smaller tasks
Platforms and Marketplaces
Types of Crowds
n  Internal corporate communities
¨ Taps the potential of the internal workforce
¨ Curate competitive enterprise data that will remain internal to the company
– May not always be the case, e.g. product technical support and marketing data
n  External communities
¨  Public crowd-sourcing marketplaces
¨  Pre-competitive communities
Generic Architecture
[Figure: (1) requestors publish tasks to a platform/marketplace, which handles task management; (2) the platform distributes tasks to workers; (3) workers submit results; (4) results are returned to the requestors]
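
As a rough illustration of this loop, the sketch below models a toy in-memory platform in Python; all names are hypothetical and no real marketplace API is implied:

from collections import deque

class Platform:
    def __init__(self):
        self.open_tasks = deque()
        self.results = {}

    def publish(self, task_id, payload):          # 1. requestor publishes a task
        self.open_tasks.append((task_id, payload))

    def fetch_task(self):                         # 2. a worker pulls a task
        return self.open_tasks.popleft() if self.open_tasks else None

    def submit(self, task_id, answer):            # 3. the worker submits a result
        self.results.setdefault(task_id, []).append(answer)

    def collect(self, task_id):                   # 4. the requestor collects results
        return self.results.get(task_id, [])

platform = Platform()
platform.publish("t1", "Is this tweet about air pollution? <text>")
task = platform.fetch_task()
platform.submit(task[0], "yes")
print(platform.collect("t1"))  # ['yes']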
Mturk Workflow
Spatial Crowdsourcing
n  Crowdsourcing that requires a person to travel to a location to perform a spatial task
¨  Helps non-local requesters through workers in a targeted spatial locality
¨  Used for data collection, package routing, citizen actuation
¨  Usually based on mobile applications
¨  Closely related to social sensing, participatory sensing, etc.
¨  Early example: the Aardvark social search engine
n  Example systems
CASE STUDIES ON CROWDSOURCED DATA CURATION
PART IV
Crowdsourced Data Curation
[Figure: raw sources (Web of Data, databases, textual content) are processed by DQ rules & algorithms (entity linking, data fusion, relation extraction) combined with human computation (relevance judgment, data verification, disambiguation), overseen by programmers and managers, to produce clean data]
Internal Community: domain knowledge, high-quality responses, trustable
External Crowd: high availability, large scale, expertise variety
Examples of CDM Tasks
n  Understanding customer sentiment for the launch of a new product around the world
¨  Implemented a 24/7 sentiment analysis system with workers from around the world
¨  90% accuracy on 95% of content
n  Categorizing millions of products in eBay’s catalog with accurate and complete attributes
¨  Combined the crowd with machine learning to create an affordable and flexible catalog quality system
Examples of CDM Tasks
n  Natural Language Processing
¨  Dialect Identification, Spelling Correction, Machine
Translation, Word Similarity
n  Computer Vision
¨  Image Similarity, Image Annotation/Analysis
n  Classification
¨  Data attributes, Improving taxonomy, search results
n  Verification
¨  Entity consolidation, de-duplicate, cross-check, validate
data
n  Enrichment
¨  Judgments, annotation
Wikipedia
n  Collaboratively built by a large community
¨  More than 19,000,000 articles, 270+ languages, 3,200,000+ articles in English
¨  More than 157,000 active contributors
n  Accuracy and stylistic formality are equivalent to expert-based resources
¨  i.e. the Columbia and Britannica encyclopedias
n  MediaWiki
¨  Software behind Wikipedia
¨  Widely used inside organizations
¨  Intellipedia: 16 U.S. intelligence agencies
¨  WikiProteins: curated protein data for knowledge discovery
Wikipedia
n  Decentralized environment supports creation
of high quality information with:
¨ Social organization
¨ Artifacts, tools & processes for cooperative work
coordination
n  Wikipedia collaboration dynamics highlight
good practices
Wikipedia – Social Organization
n  Any user can edit its contents
¨ Without prior registration
n  Does not lead to a chaotic scenario
¨ In practice highly scalable approach for high
quality content creation on the Web
n  Relies on simple but highly effective way to
coordinate its curation process
n  Curation is activity of Wikipedia admins
¨ Responsibility for information quality standards
Wikipedia – Social Organization
n  Incentives
¨ Improvement of one’s reputation
¨ Sense of efficacy
–  Contributing effectively to a meaningful project
¨ Over time the focus of editors typically changes
–  From curators of a few articles on specific topics
–  To a more global curation perspective
–  Enforcing quality assessment of Wikipedia as a whole
Wikipedia – Artifacts, Tools & Processes
n  Wiki Article Editor (Tool)
¨  WYSIWYG or markup text editor
n  Talk Pages (Tool)
¨  Public arena for discussions
around Wikipedia resources
n  Watchlists (Tool)
¨  Helps curators to actively
monitor the integrity and quality
of resources they contribute
n  Permission Mechanisms (Tool)
¨  Users with administrator status
can perform critical actions such
as remove pages and grant
administrative permissions to
new users
n  Automated Edition (Tool)
¨  Bots are automated or semi-automated
tools that perform repetitive tasks over
content
n  Page History and Restore (Tool)
¨  Historical trail of changes to a Wikipedia
Resource
n  Guidelines, Policies & Templates
(Artifact)
¨  Defines curation guidelines for editors to
assess article quality
n  Dispute Resolution (Process)
¨  Dispute mechanism between editors
over the article contents
n  Article Edition, Deletion, Merging,
Redirection, Transwiking, Archival
(Process)
¨  Describe the curation actions over
Wikipedia resources
DBpedia Knowledge Base
n  DBpedia provides direct access to data
¨ Indirectly uses the wiki as a data curation platform
¨ Inherits the massive volume of curated Wikipedia data
¨ 3.4 million entities and 1 billion RDF triples
¨ Comprehensive data infrastructure
– Concept URIs
– Definitions
– Basic types
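
Since DBpedia exposes a public SPARQL endpoint, this curated data can be queried programmatically. A minimal sketch using the SPARQLWrapper Python library (the query itself is an illustrative example):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?abstract WHERE {
        <http://dbpedia.org/resource/Galway> dbo:abstract ?abstract .
        FILTER (lang(?abstract) = "en")
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["abstract"]["value"][:120])  # first 120 chars of the English abstract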
Wikipedia - DBPedia
n  Collaborative knowledge
base maintained by
community of web users
n  Users create entity types
and their meta-data
according to guidelines
n  Requires administrative
approvals for schema
changes by end users
Audio Tagging - Tag a Tune
Image Tagging - Peekaboom
Protein Folding - Fold.it
ReCaptcha
n  OCR
¨  ~1% error rate
¨  20%–30% for 18th- and 19th-century books
n  “40 million reCAPTCHAs every day” (2008)
¨  Fixing 40,000 books a day
ChemSpider
n  Structure-centric chemical community
¨ Over 300 data sources with 25 million records
¨ Provided by chemical vendors, government databases, private laboratories and individuals
n  Pharma realizing the benefits of open data
¨ Heavily leveraged by pharmaceutical companies as a pre-competitive resource for experimental and clinical trial investigation
¨ GlaxoSmithKline made its proprietary malaria dataset of 13,500 compounds available
n  Dedicated to improving understanding of
biological systems functions with 3-D structure
of macromolecules
¨ Started in 1971 with 3 core members
¨ Originally offered 7 crystal structures
¨ Grown to 63,000 structures
¨ Over 300 million dataset downloads
n  Expanded beyond curated data download
service to include complex molecular
visualized, search, and analysis capabilities
SETTING UP A CROWDSOURCED DATA CURATION PROCESS
PART V
Core Design Questions
What – Goal
Who – Workers
Why – Incentives
How – Process
Malone, T. W., Laubacher, R., & Dellarocas, C. N. Harnessing crowds: Mapping the genome of collective intelligence. MIT Sloan Research Paper 4732-09, (2009).
Setting up a Curation Process
1 – Who is doing it?
2 – Why are they doing it?
3 – What is being done?
4 – How is it being done?
1) Who is doing it? (Workers)
n  Hierarchy (Assignment)
¨ Someone in authority assigns a particular person or group of people to perform the task
¨ Within the enterprise (e.g. individuals, specialised departments)
¨ Within a structured community (e.g. a pre-competitive community)
n  Crowd (Choice)
¨ Anyone in a large group who chooses to do so
¨ Internal or external crowds
2) Why are they doing it? (Incentives)
n  Motivation
¨  Money ($$££)
¨  Glory (reputation/prestige)
¨  Love (altruism, socialize, enjoyment)
¨  Unintended by-product (e.g. re-Captcha, captured in workflow)
¨  Self-serving resources (e.g. Wikipedia, product/customer data)
¨  Part of their job description (e.g. Data curation as part of role)
n  Determine pay and time for each task
¨  Marketplace: Delicate balance
–  Money does not improve quality but can increase participation
¨  Internal Hierarchy: Engineering opportunities for recognition
–  Performance review, prizes for top contributors, badges,
leaderboards, etc.
Effect of Payment on Quality
n  Cost does not affect quality
n  Similar results for bigger tasks [Ariely et al, 2009]
Mason, W. A., & Watts, D. J. (2009). Financial incentives and the ‘‘performance of crowds.’’
Proceedings of the Human Computation Workshop. Paris: ACM, June 28, 2009.
[Panos Ipeirotis. WWW2011 tutorial]
3) What is being done? (Goal)
3.1 Identify the Data
¨ Newly created data and/or legacy data?
¨ How is new data created?
– Do users create the data, or is it imported from an
external source?
¨ How frequently is new data created/updated?
¨ What quantity of data is created?
¨ How much legacy data exists?
¨ Is it stored within a single source, or scattered
across multiple sources?
3) What is being done? (Goal)
3.2 Identify the Tasks
¨ Creation Tasks
– Create/Generate
– Find
– Improve/ Edit / Fix
¨ Decision (Vote) Tasks
– Accept / Reject
– Thumbs up / Thumbs Down
– Vote for Best
4) How is it being done? (How)
n  Identify the workflow
¨ Tasks integrated into the normal workflow of those creating and managing data
¨ As simple as vetting or “rating” the results of an algorithm
n  Identify the platform
¨  Internal/community collaboration platforms
¨  Public crowdsourcing platform
–  Consider the availability of appropriate workers (i.e. experts)
n  Identify the algorithm
¨  Data quality
¨  Image recognition
¨  etc.
4) How is it being done? (How)
n  Task Design
¨ Task Interface
¨ Task Assignment/Routing
¨ Task Quality Assurance
Task Design
Input → Task Router (before computation) → Task Interface (during computation) → Output Aggregation (after computation) → Output
* Edith Law and Luis von Ahn, Human Computation – Core Research Questions and State of the Art
Pull Routing
n  Workers seek tasks and assign them to themselves
¨  Search and discovery of tasks supported by the platform
¨  Task recommendation
¨  Peer routing
[Figure: the requesting algorithm posts tasks; workers select them through a search & browse interface and return results]
Push Routing
n  The system assigns tasks to workers based on:
¨  Past performance
¨  Expertise
¨  Cost
¨  Latency
[Figure: the algorithm assigns tasks to workers through a task interface; workers return results. Example platform: www.mobileworks.com]
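
A minimal sketch of the assignment step (the weights, fields, and eligibility rules are assumptions for illustration, not from the slides):

workers = [
    {"id": "w1", "accuracy": 0.95, "skills": {"geo", "en"}, "cost": 0.10},
    {"id": "w2", "accuracy": 0.80, "skills": {"geo"},       "cost": 0.02},
]
task = {"required_skills": {"geo"}, "budget": 0.10}

def score(worker, task):
    if not task["required_skills"] <= worker["skills"]:
        return -1.0                      # lacks the required expertise
    if worker["cost"] > task["budget"]:
        return -1.0                      # over budget
    # trade off past performance against cost
    return 0.7 * worker["accuracy"] + 0.3 * (1 - worker["cost"] / task["budget"])

best = max(workers, key=lambda w: score(w, task))
print(best["id"])  # w2 – the much cheaper worker wins under these weights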
Managing Task Quality Assurance
n  Redundancy: Quorum Votes
¨  Replicate the task (e.g. 3 times)
¨  Use majority voting to determine the right value (% agreement)
¨  Weighted majority vote
n  Gold Data / Honey Pots
¨  Inject trap questions to test quality
¨  Worker fatigue check (habit of saying no all the time)
n  Estimation of Worker Quality
¨  Redundancy plus gold data
n  Qualification Test
¨  Use test tasks to determine users’ ability for such tasks
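
A minimal sketch of two of these mechanisms, majority voting and gold-data worker scoring (the data and field names are illustrative):

from collections import Counter

answers = {  # task -> {worker: answer}; 3-way redundancy
    "t1": {"w1": "yes", "w2": "yes", "w3": "no"},
    "t2": {"w1": "no",  "w2": "no",  "w3": "no"},
}
gold = {"t2": "no"}  # t2 is a honey pot with a known answer

def majority(votes):
    # most common answer plus its % agreement
    answer, count = Counter(votes.values()).most_common(1)[0]
    return answer, count / len(votes)

def worker_quality(worker):
    # fraction of gold questions the worker answered correctly
    scored = [t for t in gold if worker in answers[t]]
    correct = sum(answers[t][worker] == gold[t] for t in scored)
    return correct / len(scored) if scored else None

print(majority(answers["t1"]))   # ('yes', 0.666...)
print(worker_quality("w1"))      # 1.0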
Social Best Practices
n  Participation
¨  Stakeholder involvement of data producers and consumers must occur early in the project
–  Provides insight into the basic questions of what they want to do, for whom, and what it will provide
n  Incentives
¨  Sheer curation needs a line of sight from the data curating activity to tangible exploitation benefits
¨  Recognizing contributing curators through a formal feedback mechanism
n  Engagement
¨  Outreach essential for promotion and feedback
¨  Typical consumers-to-contributors ratios < 5%
Technical Best Practices
n  Human & Automated Curation
¨ Automated curation should always defer to, and never override, human curation edits
– Automate validation of data deposition and entry
– Target the community at focused curation tasks
n  Track Provenance
¨ Curation activities should be recorded
–  Especially where human curators are involved
¨ Different perspectives on provenance
– A scientist may need to evaluate the fine-grained experiment description behind the data
– For a business analyst, the ’brand’ of the data provider can be sufficient for determining quality
LINKED OPEN DATA EXAMPLE
PART VI
Linked Open Data (LOD)
n  Expose and interlink datasets on the Web
n  Using URIs to identify “things” in your data
n  Using a graph representation (RDF) to describe URIs
n  Vision: the Web as a huge graph database
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
Linked Data Example
[Figure: the same real-world entity carries multiple identifiers across datasets, connected by identity resolution links]
Identity Resolution in LOD
n  Quality issues with Linked Data
¨  Imprecise, outdated, or wrong
n  Uncertainty of identity resolution links
¨  Due to multiple interpretations of identity equivalence
¨  Due to characteristics of link generation algorithms (similarity based)
n  User feedback for uncertain links
¨  Verify uncertain identity resolution links with users/experts
¨  Improve the quality of entity consolidation
Identity Resolution in LOD
Multiple identifiers for the ‘Galway’ entity in the Linked Open Data Cloud:
<http://www.freebase.com/view/en/galway>
<http://dbpedia.org/resource/Galway>
<http://sws.geonames.org/2964180/>
The identifiers are connected by owl:sameAs links coming from different sources of identity resolution links – some asserted by the data publisher, some on the consumer side.
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
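
A minimal sketch of how similarity-based link generation produces certain and uncertain owl:sameAs candidates, routing the uncertain ones to users/experts for verification (the thresholds and labels are illustrative assumptions):

from difflib import SequenceMatcher

candidates = [
    ("http://dbpedia.org/resource/Galway", "Galway",
     "http://sws.geonames.org/2964180/", "Galway"),
    ("http://dbpedia.org/resource/Galway", "Galway",
     "http://www.freebase.com/view/en/galway", "Galway City"),
]

for uri_a, label_a, uri_b, label_b in candidates:
    sim = SequenceMatcher(None, label_a.lower(), label_b.lower()).ratio()
    if sim > 0.95:
        print("accept owl:sameAs:", uri_a, "<->", uri_b, round(sim, 2))
    elif sim > 0.6:
        print("ask user (uncertain link):", uri_a, "<->", uri_b, round(sim, 2))
    else:
        print("reject:", uri_a, "<->", uri_b, round(sim, 2))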
LOD Application Architecture
[Figure: an architecture comprising a Utility Module, a Feedback Module, and a Consolidation Module, connected by rules, matching dependencies, candidate links, questions, ranked feedback tasks, feedback, and data improvements]
Tom Heath and Christian Bizer (2011) Linked Data: Evolving the Web into a Global Data Space (1st edition), 1-136. Morgan & Claypool.
FUTURE RESEARCH CHALLENGES
PART VII
Future Research Directions
n  Incentives and social engagement
¨  Better recognition of the data curation role
¨  Understanding of social engagement mechanisms
n  Economic Models
¨  Pre-competitive and public-private partnerships
n  Curation at Scale
¨  Evolution of human computation and crowdsourcing
¨  Instrumenting popular apps for data curation
¨  General-purpose data curation pipelines
¨  Human-data interaction
n  Trust
¨  Capture of data curation decisions & provenance management
¨  Fine-grained permission management models and tools
n  Data Curation Models
¨  Nanopulications
¨  Theoretical principles and domain-specific model
Future Research Directions
n  Spatial Crowdsourcing
¨  Matching tasks with workers at right time and location
¨  Balancing workload among workers
¨  Tasks at remote locations
¨  Chaining tasks in same vicinity
¨  Preserving worker privacy
n  Interoperability
¨  Finding semantic similarity of tasks across systems
¨  Defining and measuring worker capability across
heterogeneous systems
¨  Enabling routing middleware for multiple systems
¨  Compatibility of reputation systems
¨  Defining standards for task exchange
Heterogeneous Crowds
n  Multiple requesters, tasks, workers, and platforms
[Figure: tasks from collaborative data curation and cyber-physical social systems are routed across multiple platforms to workers]
SLUA Ontology
[Figure: a User performs Actions, possesses Capabilities, and earns Rewards; a Task includes Actions, requires Capabilities, and offers Rewards. Subclasses of Capability: Location, Skill, Knowledge, Ability, Availability. Subclasses of Reward: Reputation, Money, Fun, Altruism, Learning.]
U. ul Hassan, S. O’Riain, E. Curry, “SLUA: Towards Semantic Linking of Users with Actions in Crowdsourcing,” in International Workshop on Crowdsourcing the Semantic Web, 2013.
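
A minimal sketch of the class hierarchy in RDF using the rdflib Python library; the namespace URI is a placeholder assumption, not the ontology’s real one:

from rdflib import Graph, Namespace, RDF, RDFS

SLUA = Namespace("http://example.org/slua#")  # hypothetical namespace
g = Graph()
g.bind("slua", SLUA)

# Reward and Capability hierarchies from the figure
for r in ("Reputation", "Money", "Fun", "Altruism", "Learning"):
    g.add((SLUA[r], RDFS.subClassOf, SLUA.Reward))
for c in ("Location", "Skill", "Knowledge", "Ability", "Availability"):
    g.add((SLUA[c], RDFS.subClassOf, SLUA.Capability))

# An example worker who possesses a skill and earns reputation
g.add((SLUA.alice, RDF.type, SLUA.User))
g.add((SLUA.alice, SLUA.possesses, SLUA.Skill))
g.add((SLUA.alice, SLUA.earns, SLUA.Reputation))

print(g.serialize(format="turtle"))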
Future Research Directions
n  Task Routing
¨  Optimizing task completion, quality, and latency
¨  Inferring worker preferences, skills, and knowledge
¨  Balancing exploration-exploitation trade-off between
inference and optimization
¨  Cold-start problem for new workers or tasks
¨  Ensuring worker satisfaction via load balancing & rewards
n  Human–Computer Interaction
¨  Reducing search friction through good browsing interfaces
¨  Presenting requisite information nothing more
¨  Choosing the level of task granularity for complex tasks
¨  Ensuring worker engagement
¨  Designing games with a purpose to crowd source with fun
Summary
Algorithms + Humans: Data → Better Data
Selected References
n  Big Data & Data Quality
¨  S. Lavalle, E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz, “Big Data, Analytics and the
Path from Insights to Value,” MIT Sloan Management Review, vol. 52, no. 2, pp. 21–32, 2011.
¨  A. Haug and J. S. Arlbjørn, “Barriers to master data quality,” Journal of Enterprise Information
Management, vol. 24, no. 3, pp. 288–303, 2011.
¨  R. Silvola, O. Jaaskelainen, H. Kropsu-Vehkapera, and H. Haapasalo, “Managing one master data –
challenges and preconditions,” Industrial Management & Data Systems, vol. 111, no. 1, pp. 146–
162, 2011.
¨  E. Curry, S. Hasan, and S. O’Riain, “Enterprise Energy Management using a Linked Dataspace for
Energy Intelligence,” in Second IFIP Conference on Sustainable Internet and ICT for Sustainability,
2012.
¨  D. Loshin, Master Data Management. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.,
2008.
¨  Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4), 211-2
¨  Batini, C., Cappiello, C., Francalanci, C., & Maurino, A. (2009). Methodologies for data quality
assessment and improvement. ACM Computing Surveys (CSUR), 41(3), 16.
¨  B. Otto and A. Reichert, “Organizing Master Data Management: Findings from an Expert Survey,” in
Proceedings of the 2010 ACM Symposium on Applied Computing - SAC ’10, 2010, pp. 106–110.
¨  Wang, R. and D. Strong, Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of
Management Information Systems, 1996. 12(4): p. 5-33
¨  Ul Hassan, U., O’Riain, S., and Curry, E. 2012. “Leveraging Matching Dependencies for Guided User
Feedback in Linked Data Applications,” In 9th International Workshop on Information Integration
on the Web (IIWeb2012) Scottsdale, Arizona,: ACM.
Selected References
n  Collective Intelligence, Crowdsourcing & Human Computation
¨  Malone, Thomas W., Robert Laubacher, and Chrysanthos Dellarocas. "Harnessing Crowds: Mapping the
Genome of Collective Intelligence." (2009).
¨  A. Doan, R. Ramakrishnan, and A. Y. Halevy, “Crowdsourcing systems on the World-Wide Web,”
Communications of the ACM, vol. 54, no. 4, p. 86, Apr. 2011.
¨  A. J. Quinn and B. B. Bederson, “Human computation: a survey and taxonomy of a growing field,” in
Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems, 2011, pp. 1403–
1412.
¨  Mason, W. A., & Watts, D. J. (2009). Financial incentives and the ‘‘performance of crowds.’’ Proceedings of
the Human Computation Workshop. Paris: ACM, June 28, 2009.
¨  E. Law and L. von Ahn, “Human Computation,” Synthesis Lectures on Artificial Intelligence and Machine
Learning, vol. 5, no. 3, pp. 1–121, Jun. 2011.
¨  M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin, “CrowdDB : Answering Queries with
Crowdsourcing,” in Proceedings of the 2011 international conference on Management of data - SIGMOD
’11, 2011, p. 61.
¨  P. Wichmann, A. Borek, R. Kern, P. Woodall, A. K. Parlikad, and G. Satzger, “Exploring the ‘Crowd’ as
Enabler of Better Information Quality,” in Proceedings of the 16th International Conference on Information
Quality, 2011, pp. 302–312.
¨  Winter A. Mason, Duncan J. Watts: Financial incentives and the "performance of crowds". SIGKDD
Explorations (SIGKDD) 11(2):100-108 (2009)
¨  Panos Ipeirotis. Managing Crowdsourced Human Computation, WWW2011 Tutorial
¨  O. Alonso & M. Lease. Crowdsourcing 101: Putting the WSDM of Crowds to Work for You, WSDM Hong Kong
2011.
¨  D. A. Grier, When Computers Were Human, vol. 13. Princeton University Press, 2005.
–  http://www.youtube.com/watch?v=YwqltwvPnkw
¨  Ul Hassan, U., & Curry, E. (2013, October). A capability requirements approach for predicting worker performance in crowdsourcing. In Collaborative Computing: Networking, Applications and Worksharing (Collaboratecom), 2013 9th International Conference on (pp. 429-437). IEEE.
Selected References
n  Collaborative Data Management
¨  E. Curry, A. Freitas, and S. O. Riain, “The Role of Community-Driven Data Curation for Enterprises,” in
Linking Enterprise Data, D. Wood, Ed. Boston, MA: Springer US, 2010, pp. 25–47.
¨  Ul Hassan, U., O’Riain, S., and Curry, E. 2012. “Towards Expertise Modelling for Routing Data Cleaning
Tasks within a Community of Knowledge Workers,” In 17th International Conference on Information Quality
(ICIQ 2012), Paris, France.
¨  Ul Hassan, U., O’Riain, S., and Curry, E. 2013. “Effects of Expertise Assessment on the Quality of Task
Routing in Human Computation,” In 2nd International Workshop on Social Media for Crowdsourcing and
Human Computation, Paris, France.
¨  Stonebraker, M., Bruckner, D., Ilyas, I. F., Beskales, G., Cherniack, M., Zdonik, S. B., ... & Xu, S. (2013). Data
Curation at Scale: The Data Tamer System. In CIDR.
¨  Parameswaran, A. G., Park, H., Garcia-Molina, H., Polyzotis, N., & Widom, J. (2012, October). Deco:
declarative crowdsourcing. In Proceedings of the 21st ACM international conference on Information and
knowledge management (pp. 1203-1212). ACM.
¨  Parameswaran, A., Boyd, S., Garcia-Molina, H., Gupta, A., Polyzotis, N., & Widom, J. (2014). Optimal crowd-
powered rating and filtering algorithms.Proceedings Very Large Data Bases (VLDB).
¨  Marcus, A., Wu, E., Karger, D., Madden, S., & Miller, R. (2011). Human-powered sorts and joins. Proceedings
of the VLDB Endowment, 5(1), 13-24.
¨  Guo, S., Parameswaran, A., & Garcia-Molina, H. (2012, May). So who won?: dynamic max discovery with the
crowd. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (pp.
385-396). ACM.
¨  Davidson, S. B., Khanna, S., Milo, T., & Roy, S. (2013, March). Using the crowd for top-k and group-by
queries. In Proceedings of the 16th International Conference on Database Theory (pp. 225-236). ACM.
¨  Chai, X., Vuong, B. Q., Doan, A., & Naughton, J. F. (2009, June). Efficiently incorporating user feedback into
information extraction and integration programs. In Proceedings of the 2009 ACM SIGMOD International
Conference on Management of data (pp. 87-100). ACM.
Selected References
n  Spatial Crowdsourcing
¨  Kazemi, L., & Shahabi, C. (2012, November). Geocrowd: enabling query answering with spatial
crowdsourcing. In Proceedings of the 20th International Conference on Advances in Geographic
Information Systems (pp. 189-198). ACM.
¨  Benouaret, K., Valliyur-Ramalingam, R., & Charoy, F. (2013). CrowdSC: Building Smart Cities with
Large Scale Citizen Participation. IEEE Internet Computing, 1.
¨  Musthag, M., & Ganesan, D. (2013, April). Labor dynamics in a mobile micro-task market. In
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 641-650).
ACM.
¨  Deng, Dingxiong, Cyrus Shahabi, and Ugur Demiryurek. "Maximizing the number of worker's self-
selected tasks in spatial crowdsourcing." Proceedings of the 21st ACM SIGSPATIAL International
Conference on Advances in Geographic Information Systems. ACM, 2013.
¨  To, H., Ghinita, G., & Shahabi, C. (2014). A Framework for Protecting Worker Location Privacy in
Spatial Crowdsourcing. Proceedings of the VLDB Endowment, 7(10).
¨  Goncalves, J., Ferreira, D., Hosio, S., Liu, Y., Rogstadius, J., Kukka, H., & Kostakos, V. (2013,
September). Crowdsourcing on the spot: altruistic use of public displays, feasibility, performance,
and behaviours. In Proceedings of the 2013 ACM international joint conference on Pervasive and
ubiquitous computing(pp. 753-762). ACM.
¨  Cardone, G., Foschini, L., Bellavista, P., Corradi, A., Borcea, C., Talasila, M., & Curtmola, R. (2013).
Fostering participaction in smart cities: a geo-social crowdsensing platform. Communications
Magazine, IEEE, 51(6).
Books
n  Surowiecki, J. (2005). The wisdom of crowds. Random House LLC.
n  Batini, C., & Scannapieco, M. (2006). Data quality: concepts, methodologies
and techniques. Springer.
n  Michelucci, P. (2013). Handbook of human computation. Springer.
n  Law, E., & Ahn, L. V. (2011). Human computation. Synthesis Lectures on
Artificial Intelligence and Machine Learning, 5(3), 1-121.
n  Heath, T., & Bizer, C. (2011). Linked data: Evolving the web into a global data
space. Synthesis lectures on the semantic web: theory and technology, 1(1),
1-136.
n  Grier, D. A. (2013). When computers were human. Princeton University Press.
n  Easley, D., & Kleinberg, J. Networks, Crowds, and Markets. Cambridge
University.
n  Sheth, A., & Thirunarayan, K. (2012). Semantics Empowered Web 3.0:
Managing Enterprise, Social, Sensor, and Cloud-based Data and Services for
Advanced Applications. Synthesis Lectures on Data Management, 4(6), 1-175.
Tutorials
n  Human Computation and Crowdsourcing
¨  http://research.microsoft.com/apps/video/default.aspx?id=169834
¨  http://www.youtube.com/watch?v=tx082gDwGcM
n  Human-Powered Data Management
¨  http://research.microsoft.com/apps/video/default.aspx?id=185336
n  Crowdsourcing Applications and Platforms: A Data Management
Perspective
¨  http://www.vldb.org/pvldb/vol4/p1508-doan-tutorial4.pdf
n  Human Computation: Core Research Questions and State of the Art
¨  http://www.humancomputation.com/Tutorial.html
n  Crowdsourcing & Machine Learning
¨  http://www.cs.rutgers.edu/~hirsh/icml-2011-tutorial/
n  Data quality and data cleaning: an overview
¨  http://dl.acm.org/citation.cfm?id=872875
Datasets
n  TREC Crowdsourcing Track
¨  https://sites.google.com/site/treccrowd/
n  2010 Crowdsourced Web Relevance Judgments Data
¨  https://docs.google.com/document/d/
1J9H7UIqTGzTO3mArkOYaTaQPibqOTYb_LwpCpu2qFCU/edit
n  Statistical QUality Assurance Robustness Evaluation Data
¨  http://ir.ischool.utexas.edu/square/data.html
n  Crowdsourcing at Scale 2013
¨  http://www.crowdscale.org/
n  USEWOD - Usage Analysis and the Web of Data
¨  http://usewod.org/usewodorg-2.html
n  NAACL 2010 Workshop
¨  https://sites.google.com/site/amtworkshop2010/data-1
n  mturk-tracker.com
n  GalaxyZoo.com
n  CrowdCrafting.com
Credits
Special thanks to Umair ul Hassan for his assistance
with the Tutorial
EarthBiAs2014

Contenu connexe

Tendances

Challenges Ahead for Converging Financial Data
Challenges Ahead for Converging Financial DataChallenges Ahead for Converging Financial Data
Challenges Ahead for Converging Financial Data
Edward Curry
 
Approximate Semantic Matching of Heterogeneous Events
Approximate Semantic Matching of Heterogeneous EventsApproximate Semantic Matching of Heterogeneous Events
Approximate Semantic Matching of Heterogeneous Events
Edward Curry
 

Tendances (20)

Challenges Ahead for Converging Financial Data
Challenges Ahead for Converging Financial DataChallenges Ahead for Converging Financial Data
Challenges Ahead for Converging Financial Data
 
Towards Lightweight Cyber-Physical Energy Systems using Linked Data, the Web ...
Towards Lightweight Cyber-Physical Energy Systems using Linked Data, the Web ...Towards Lightweight Cyber-Physical Energy Systems using Linked Data, the Web ...
Towards Lightweight Cyber-Physical Energy Systems using Linked Data, the Web ...
 
Approximate Semantic Matching of Heterogeneous Events
Approximate Semantic Matching of Heterogeneous EventsApproximate Semantic Matching of Heterogeneous Events
Approximate Semantic Matching of Heterogeneous Events
 
An Environmental Chargeback for Data Center and Cloud Computing Consumers
An Environmental Chargeback for Data Center and Cloud Computing ConsumersAn Environmental Chargeback for Data Center and Cloud Computing Consumers
An Environmental Chargeback for Data Center and Cloud Computing Consumers
 
Linked Water Data For Water Information Management
Linked Water Data For Water Information ManagementLinked Water Data For Water Information Management
Linked Water Data For Water Information Management
 
Big Data Public Private Forum (BIG) @ European Data Forum 2013
Big Data Public Private Forum (BIG) @ European Data Forum 2013Big Data Public Private Forum (BIG) @ European Data Forum 2013
Big Data Public Private Forum (BIG) @ European Data Forum 2013
 
Towards a BIG Data Public Private Partnership
Towards a BIG Data Public Private PartnershipTowards a BIG Data Public Private Partnership
Towards a BIG Data Public Private Partnership
 
Developing an Sustainable IT Capability: Lessons From Intel's Journey
Developing an Sustainable IT Capability: Lessons From Intel's JourneyDeveloping an Sustainable IT Capability: Lessons From Intel's Journey
Developing an Sustainable IT Capability: Lessons From Intel's Journey
 
Linked Building (Energy) Data
Linked Building (Energy) DataLinked Building (Energy) Data
Linked Building (Energy) Data
 
Using Linked Data and the Internet of Things for Energy Management
Using Linked Data and the Internet of Things for Energy ManagementUsing Linked Data and the Internet of Things for Energy Management
Using Linked Data and the Internet of Things for Energy Management
 
Data Curation at the New York Times
Data Curation at the New York TimesData Curation at the New York Times
Data Curation at the New York Times
 
Transforming the European Data Economy: A Strategic Research and Innovation A...
Transforming the European Data Economy: A Strategic Research and Innovation A...Transforming the European Data Economy: A Strategic Research and Innovation A...
Transforming the European Data Economy: A Strategic Research and Innovation A...
 
System of Systems Information Interoperability using a Linked Dataspace
System of Systems Information Interoperability using a Linked DataspaceSystem of Systems Information Interoperability using a Linked Dataspace
System of Systems Information Interoperability using a Linked Dataspace
 
Key Technology Trends for Big Data in Europe
Key Technology Trends for Big Data in EuropeKey Technology Trends for Big Data in Europe
Key Technology Trends for Big Data in Europe
 
Crowdsourcing Approaches for Smart City Open Data Management
Crowdsourcing Approaches for Smart City Open Data ManagementCrowdsourcing Approaches for Smart City Open Data Management
Crowdsourcing Approaches for Smart City Open Data Management
 
Interactive Water Services: The Waternomics Approach
Interactive Water Services: The Waternomics ApproachInteractive Water Services: The Waternomics Approach
Interactive Water Services: The Waternomics Approach
 
Collaborative Data Management: How Crowdsourcing Can Help To Manage Data
Collaborative Data Management: How Crowdsourcing Can Help To Manage DataCollaborative Data Management: How Crowdsourcing Can Help To Manage Data
Collaborative Data Management: How Crowdsourcing Can Help To Manage Data
 
Sustainable IT for Energy Management: Approaches, Challenges, and Trends
Sustainable IT for Energy Management: Approaches, Challenges, and TrendsSustainable IT for Energy Management: Approaches, Challenges, and Trends
Sustainable IT for Energy Management: Approaches, Challenges, and Trends
 
Open Data Innovation in Smart Cities: Challenges and Trends
Open Data Innovation in Smart Cities: Challenges and TrendsOpen Data Innovation in Smart Cities: Challenges and Trends
Open Data Innovation in Smart Cities: Challenges and Trends
 
Towards Unified and Native Enrichment in Event Processing Systems
Towards Unified and Native Enrichment in Event Processing SystemsTowards Unified and Native Enrichment in Event Processing Systems
Towards Unified and Native Enrichment in Event Processing Systems
 

Similaire à Crowdsourcing Approaches to Big Data Curation for Earth Sciences

Similaire à Crowdsourcing Approaches to Big Data Curation for Earth Sciences (20)

Maurice Bouwhuis (SARA/Vancis) - Hoe big data te begrijpen door ze te visuali...
Do It Yourself (DIY) Earth Science Collaboratories Using Best Practices and B...
Critique and Reflections on Open Data Initiatives
Data Mining in the World of BIG Data-A Survey
R A Longhorn Presentation at Taiwan Open Data Forum, Taipei, 9 July 2014
RDA Update
Big data
Big data
Big data
Big data
Big data
Big data
Big data
David Coleman: Challenging Traditional Models, Roles and Responsibilities in ...
David Coleman presentation at SDI Summit 2014, Calgary, Canada, 17-19 Sept 2014
Dart ord the citizen's persepctive-20141107
SC6 Workshop 1: What can big data do for you?
New Horizons for a Data-Driven Economy – A Roadmap for Big Data in Europe
Open Data is not Enough
INSPIRE - ensuring access or continuity of access?
 


Crowdsourcing Approaches to Big Data Curation for Earth Sciences

  • 1. EarthBiAs2014 Global NEST University of the Aegean Crowdsourcing Approaches to Big Data Curation for Earth Sciences Insight Centre for Data Analytics, National University of Ireland Galway EarthBiAs2014 1
  • 2. Take Home: Data + Algorithms + Humans → Better Data
  • 3. Talk Overview • Part I: Motivation • Part II: Data Quality And Data Curation • Part III: Crowdsourcing • Part IV: Case Studies on Crowdsourced Data Curation • Part V: Setting up a Crowdsourced Data Curation Process • Part VI: Linked Open Data Example • Part VII: Future Research Challenges 7-11 July 2014, Rhodes, Greece EarthBiAs2014
  • 4. MOTIVATION PART I 7-11 July 2014, Rhodes, Greece EarthBiAs2014
  • 5. BIG Big Data Public Private Forum THE BIG PROJECT Overall objective Bringing the necessary stakeholders into a self-sustainable industry-led initiative, which will greatly contribute to enhance the EU competitiveness taking full advantage of Big Data technologies. Work at technical, business and policy levels, shaping the future through the positioning of IIM and Big Data specifically in Horizon 2020. BIGBig Data Public Private Forum
  • 6. BIG Big Data Public Private Forum SITUATING BIG DATA IN INDUSTRY Health Public Sector Finance & Insurance Telco, Media& Entertainment Manufacturing, Retail, Energy, Transport Needs Offerings Value Chain Technical Working Groups Industry Driven Sectorial Forums Data Acquisition Data Analysis Data Curation Data Storage Data Usage •  Structured data •  Unstructured data •  Event processing •  Sensor networks •  Protocols •  Real-time •  Data streams •  Multimodality •  Stream mining •  Semantic analysis •  Machine learning •  Information extraction •  Linked Data •  Data discovery •  ‘Whole world’ semantics •  Ecosystems •  Community data analysis •  Cross-sectorial data analysis •  Data Quality •  Trust / Provenance •  Annotation •  Data validation •  Human-Data Interaction •  Top-down/Bottom-up •  Community / Crowd •  Human Computation •  Curation at scale •  Incentivisation •  Automation •  Interoperability •  In-Memory DBs •  NoSQL DBs •  NewSQL DBs •  Cloud storage •  Query Interfaces •  Scalability and Performance •  Data Models •  Consistency, Availability, Partition- tolerance •  Security and Privacy •  Standardization •  Decision support •  Prediction •  In-use analytics •  Simulation •  Exploration •  Visualisation •  Modeling •  Control •  Domain-specific usage
  • 7. BIG Big Data Public Private Forum SUBJECT MATTER EXPERT INTERVIEWS
  • 8. BIG Big Data Public Private Forum KEY INSIGHTS Key Trends ▶ Lower usability barrier for data tools ▶ Blended human and algorithmic data processing for coping with data quality ▶ Leveraging large communities (crowds) ▶ Need for semantic standardized data representation ▶ Significant increase in use of new data models (i.e. graph) (expressivity and flexibility) ▶ Much of (Big Data) technology is evolving in an evolutionary way ▶ But business process change must be revolutionary ▶ Data variety and verifiability are key opportunities ▶ Long tail of data variety is a major shift in the data landscape The Data Landscape ▶ Lack of Business-driven Big Data strategies ▶ Need for format and data storage technology standards ▶ Data exchange between companies, institutions, individuals, etc. ▶ Regulations & markets for data access ▶ Human resources: Lack of skilled data scientists Biggest Blockers Technical White Papers available on: http://www.big-project.eu
  • 9. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   The Internet of Everything: Connecting the Unconnected
  • 10. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Earth Science – Systems of Systems
  • 11. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece  
  • 12. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Citizen Sensors “…humans as citizens on the ubiquitous Web, acting as sensors and sharing their observations and views…” ¨ Sheth, A. (2009). Citizen sensing, social signals, and enriching human experience. Internet Computing, IEEE, 13(4), 87-92. Air Pollution
  • 13. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece  
  • 14. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece  
  • 15. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Citizens as Sensors
  • 16. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Haklay, M., 2013, Citizen Science and Volunteered Geographic Information – overview and typology of participation in Sui, D.Z., Elwood, S. and M.F. Goodchild (eds.), 2013. Crowdsourcing Geographic Knowledge: Volunteered Geographic Information (VGI) in Theory and Practice. Berlin: Springer.
  • 17. DATA QUALITY AND DATA CURATION PART II 7-11 July 2014, Rhodes, Greece EarthBiAs2014
  • 18. EarthBiAs2014 7-11 July 2014, Rhodes, Greece The Problems with Data Knowledge Workers need: ¨ Access to the right data ¨ Confidence in that data Flawed data affects 25% of critical data in world’s top companies Data quality role in recent financial crisis: ¨ “Assets are defined differently in different programs” ¨ “Numbers did not always add up” ¨ “Departments do not trust each other’s figures” ¨ “Figures … not worth the pixels they were made of”
  • 19. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   What is Data Quality? “Desirable characteristics for information resource” Described as a series of quality dimensions: n  Discoverability & Accessibility: storing and classifying in appropriate and consistent manner n  Accuracy: Correctly represents the “real-world” values it models n  Consistency: Created and maintained using standardized definitions, calculations, terms, and identifiers n  Provenance & Reputation: Track source & determine reputation ¨  Includes the objectivity of the source/producer ¨  Is the information unbiased, unprejudiced, and impartial? ¨  Or does it come from a reputable but partisan source? Wang, R. and D. Strong, Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 1996. 12(4): p. 5-33.
  • 20. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Data Quality ID PNAME PCOLOR PRICE APNR iPod Nano Red 150 APNS iPod Nano Silver 160 <Product name=“iPod Nano”> <Items> <Item code=“IPN890”> <price>150</price> <generation>5</generation> </Item> </Items> </Product> Source A Source B Schema Difference? Data Developer APNR iPod Nano Red 150 APNR iPod Nano Silver 160 iPod Nano IPN890 150 5 Value Conflicts? Entity Duplication? Data Steward Business Users ? Technical Domain (Technical) Domain
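The schema differences, value conflicts, and duplicate entities illustrated on this slide are exactly the issues a curation algorithm can flag before a human steward decides. A minimal sketch, assuming two hypothetical product feeds shaped like the iPod Nano example (the field names and the 0.9 similarity threshold are illustrative, not from the slides):

```python
from difflib import SequenceMatcher

# Source A: relational rows; Source B: parsed XML items (shapes mirror the slide).
source_a = [
    {"id": "APNR", "name": "iPod Nano", "color": "Red", "price": 150},
    {"id": "APNS", "name": "iPod Nano", "color": "Silver", "price": 160},
]
source_b = [{"code": "IPN890", "name": "iPod Nano", "price": 150, "generation": 5}]

def similar(x: str, y: str) -> float:
    """Cheap string similarity in [0, 1]."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

# Flag candidate cross-source duplicates for a human data steward to review.
for a in source_a:
    for b in source_b:
        if similar(a["name"], b["name"]) > 0.9:
            conflict = "price conflict" if a["price"] != b["price"] else "prices agree"
            print(f"candidate match: {a['id']} ~ {b['code']} ({conflict})")
```

In this division of labour the algorithm only proposes matches; as the roles on the slide suggest, the data steward resolves value conflicts and the data developer reconciles schemas.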
  • 21. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   What is Data Curation? n  Digital Curation ¨ Selection, preservation, maintenance, collection, and archiving of digital assets n  Data Curation ¨ Active management of data over its life-cycle n  Data Curators ¨ Ensure data is trustworthy, discoverable, accessible, reusable, and fit for use – Museum cataloguers of the Internet age
  • 22. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Related Activities n  Data Governance/ Master Data Management ¨ Convergence of data quality, data management, business process management, and risk management ¨ Part of overall data governance strategy for organization n  Data Curator = Data Steward
  • 23. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Types of Data Curation n  Multiple approaches to curate data, no single correct way ¨ Who? – Individual Curators – Curation Departments – Community-based Curation ¨ How? – Manual Curation – (Semi-)Automated – Sheer Curation
  • 24. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Types of Data Curation – Who? n  Individual Data Curators ¨ Suitable for infrequently changing small quantity of data –  (<1,000 records) –  Minimal curation effort (minutes per record)
  • 25. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Types of Data Curation – Who? n  Curation Departments ¨ Curation experts working with subject matter experts to curate data within formal process –  Can deal with large curation effort (000’s of records) n  Limitations ¨ Scalability: Can struggle with large quantities of dynamic data (>million records) ¨ Availability: Post-hoc nature creates delay in curated data availability
  • 26. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Types of Data Curation - Who? n  Community-Based Data Curation ¨ Decentralized approach to data curation ¨ Crowd-sourcing the curation process – Leverages community of users to curate data ¨ Wisdom of the community (crowd) ¨ Can scale to millions of records
  • 27. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Types of Data Curation – How? n Manual Curation ¨ Curators directly manipulate data ¨ Can tie users up with low-value add activities n (Semi-)Automated Curation ¨ Algorithms can (semi-)automate curation activities such as data cleansing, record de-duplication and classification ¨ Can be supervised or approved by human curators
  • 28. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Types of Data Curation – How? n  Sheer curation, or Curation at Source ¨ Curation activities integrated in normal workflow of those creating and managing data ¨ Can be as simple as vetting or “rating” the results of a curation algorithm ¨ Results can be available immediately n  Blended Approaches: Best of Both ¨ Sheer curation + post hoc curation department ¨ Allows immediate access to curated data ¨ Ensures quality control with expert curation
  • 29. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Data Quality Data Curation Example Profile Sources Define Mappings Cleanse Enrich De-duplicate Define Rules Curated Data Data Developer Data Curator Data Governance Business Users Applications Product Data
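The profile → map → cleanse → enrich → de-duplicate flow on this slide can be read as a pipeline of small composable steps. A minimal sketch with toy records and rules invented purely for illustration:

```python
from functools import reduce

def cleanse(recs):
    # Normalize whitespace and casing (a stand-in for real cleansing rules).
    return [{**r, "name": r["name"].strip().title()} for r in recs]

def enrich(recs):
    # Add a derived field; real enrichment might consult a reference dataset.
    return [{**r, "price_band": "high" if r["price"] > 155 else "low"} for r in recs]

def dedupe(recs):
    # Keep the first record per (name, price) key.
    seen, out = set(), []
    for r in recs:
        key = (r["name"], r["price"])
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

pipeline = [cleanse, enrich, dedupe]
records = [
    {"name": " ipod nano ", "price": 150},
    {"name": "iPod Nano", "price": 150},
    {"name": "iPod Nano", "price": 160},
]
curated = reduce(lambda recs, step: step(recs), pipeline, records)
print(curated)  # two records survive: (Ipod Nano, 150) and (Ipod Nano, 160)
```

Keeping each step a plain function makes it easy to slot a human-approval step (the Data Curator role above) between any two stages.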
  • 30. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Data Curation n  Pros ¨  Can create a single version of truth ¨  Standardized information creation and management ¨  Improves data quality n  Cons ¨  Significant upfront costs and efforts ¨  Participation limited to few (mostly) technical experts ¨  Difficult to scale for large data sources –  Extended Enterprise e.g. partner, data vendors ¨  Small % of data under management (i.e. CRM, Product, …)
  • 31. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   The New York Times 100 Years of Expert Data Curation
  • 32. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   The New York Times n  Largest metropolitan and third largest newspaper in the United States n  nytimes.com q  Most popular newspaper website in US n  100 year old curated repository defining its participation in the emerging Web of Data
  • 33. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   The New York Times n  Data curation dates back to 1913 ¨ Publisher/owner Adolph S. Ochs decided to provide a set of additions to the newspaper n  New York Times Index ¨ Organized catalog of articles titles and summaries –  Containing issue, date and column of article –  Categorized by subject and names –  Introduced on quarterly then annual basis n  Transitory content of newspaper became important source of searchable historical data ¨ Often used to settle historical debates
  • 34. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   The New York Times n   Index Department was created in 1913 ¨ Curation and cataloguing of NYT resources –  Since 1851 NYT had low quality index for internal use n  Developed a comprehensive catalog using a controlled vocabulary ¨ Covering subjects, personal names, organizations, geographic locations and titles of creative works (books, movies, etc), linked to articles and their summaries n  Current Index Dept. has ~15 people
  • 35. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   The New York Times n  Challenges with consistently and accurately classifying news articles over time ¨ Keywords expressing subjects may show some variance due to cultural or legal constraints ¨ Identities of some entities, such as organizations and places, changed over time n  Controlled vocabulary grew to hundreds of thousands of categories ¨ Adding complexity to classification process
  • 36. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   The New York Times n  Increased importance of Web drove need to improve categorization of online content n  Curation carried out by Index Department ¨ Library-time (days to weeks) ¨ Print edition can handle next-day index n  Not suitable for real-time online publishing ¨ nytimes.com needed a same-day index
  • 37. EarthBiAs2014 7-11 July 2014, Rhodes, Greece The New York Times n Introduced two stage curation process ¨ Editorial staff performed best-effort semi-automated sheer curation at point of online pub. – Several hundred journalists ¨ Index Department follow up with long-term accurate classification and archiving n Benefits: ¨ Non-expert journalist curators provide instant accessibility to online users ¨ Index Department provides long-term high-quality curation in a “trust but verify” approach
  • 38. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   NYT Curation Workflow ¨ Curation starts with article getting out of the newsroom
  • 39. EarthBiAs2014 7-11 July 2014, Rhodes, Greece NYT Curation Workflow ¨ Member of editorial staff submits article to a web-based, rule-based information extraction system (SAS Teragram)
  • 40. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   NYT Curation Workflow ¨ Teragram uses linguistic extraction rules based on subset of Index Dept’s controlled vocab.
  • 41. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   NYT Curation Workflow ¨ Teragram suggests tags based on the Index vocabulary that can potentially describe the content of article
  • 42. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   NYT Curation Workflow ¨ Editorial staff member selects terms that best describe the contents and inserts new tags if necessary
  • 43. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   NYT Curation Workflow ¨ Reviewed by the taxonomy managers with feedback to editorial staff on classification process
  • 44. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   NYT Curation Workflow ¨ Article is published online at nytimes.com
  • 45. EarthBiAs2014 7-11 July 2014, Rhodes, Greece NYT Curation Workflow ¨ At a later stage the article receives second-level curation by the Index Dept.: additional Index tags and a summary
  • 46. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   NYT Curation Workflow ¨ Article is submitted to NYT Index
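The Teragram step in the workflow above is proprietary, but the core idea, suggesting tags from a controlled vocabulary, can be sketched with simple rule matching. A toy illustration, not the NYT system; the vocabulary and article text are invented, and real linguistic extraction rules are far richer than substring checks:

```python
# Hypothetical controlled vocabulary: tag -> trigger phrases.
VOCAB = {
    "Elections": ["ballot", "polling", "primary"],
    "Climate": ["emissions", "warming", "drought"],
    "Galway": ["galway"],
}

def suggest_tags(article: str) -> list[str]:
    """Return vocabulary tags whose trigger phrases occur in the article."""
    text = article.lower()
    return [tag for tag, cues in VOCAB.items()
            if any(cue in text for cue in cues)]

article = "Drought conditions around Galway worsened as emissions rose."
print(suggest_tags(article))  # ['Climate', 'Galway']
```

As in the workflow above, the suggestions are only a starting point: the editor accepts, rejects, or adds tags, and the Index Department verifies later.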
  • 47. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   The New York Times n  Early adopter of Linked Open Data (June ‘09)
  • 48. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   The New York Times n  Linked Open Data @ data.nytimes.com ¨ Subset of 10,000 tags from index vocabulary ¨ Dataset of people, organizations & locations – Complemented by search services to consume data about articles, movies, best sellers, Congress votes, real estate,… n  Benefits ¨ Improves traffic by third party data usage ¨ Lowers development cost of new applications for different verticals inside the website –  E.g. movies, travel, sports, books
  • 49. CROWDSOURCING PART III 7-11 July 2014, Rhodes, Greece EarthBiAs2014
  • 50. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   50 Crowdsourcing Landscape
  • 51. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Crowdsourcing Landscape 51
  • 52. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Introduction to Crowdsourcing n Coordinating a crowd (a large group of workers) to do micro-work (small tasks) that solves problems (that computers or a single user can’t) n A collection of mechanisms and associated methodologies for scaling and directing crowd activities to achieve goals n Related Areas ¨ Collective Intelligence ¨ Social Computing ¨ Human Computation ¨ Data Mining A. J. Quinn and B. B. Bederson, “Human computation: a survey and taxonomy of a growing field,” in Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems, 2011, pp. 1403–1412.
  • 53. EarthBiAs2014 7-11 July 2014, Rhodes, Greece When Computers Were Human n Maskelyne 1760 ¨ Used human computers to create an almanac of moon positions – Used for shipping/navigation ¨ Quality assurance – Do calculations twice – Compare to third verifier D. A. Grier, When Computers Were Human, vol. 13. Princeton University Press, 2005.
  • 54. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   When Computers Were Human
  • 55. EarthBiAs2014 7-11 July 2014, Rhodes, Greece 20th Century 1926: Teleautomation “…when wireless is perfectly applied the whole earth will be converted into a huge brain.” 1948: Cybernetics “…communication and control theory that is concerned especially with the comparative study of automatic control systems.” Credits: Thierry Ehrmann (Flickr), Dr. Sabina Jeschke, Wikimedia Foundation 1961: Embedded systems “A system with a dedicated function within a larger mechanical or electrical system, often with real-time computing constraints.” 1988: Ubiquitous computing “…advanced computing concept where computing is made to appear everywhere and anywhere.”
  • 56. EarthBiAs2014 7-11 July 2014, Rhodes, Greece 21st Century Credits: Kevin Ashton, Amit Sheth, Helen Gill, Wikimedia Foundation 1999: Internet of Things “…to uniquely identifiable objects and their virtual representations in an Internet-like structure.” 2006: Cyber-physical systems “…engineered systems that are built from, and depend upon, the seamless integration of computational algorithms and physical components.” 2008: Web of Things “A set of blueprints to make every-day physical objects first class citizens of the World Wide Web by giving them an API.” 2012: Physical-Cyber-Social computing “a holistic treatment of data, information, and knowledge from the PCS worlds to integrate, correlate, interpret, and provide contextually relevant abstractions to humans.”
  • 57. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Sensing Credits: Albany Associates, stuartpilrow, Mike_n (Flickr) Computation Actuation Human Powered CPS Leverages human capabilities in conjunction with machine capabilities for optimizing processes in the cyber-physical-social environments
  • 58. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Human ü Visual perception ü Visuospatial thinking ü Audiolinguistic ability ü Sociocultural awareness ü Creativity ü Domain knowledge Machine ü Large-scale data manipulation ü Collecting and storing large amounts of data ü Efficient data movement ü Bias-free analysis Human vs Machine Affordances R. J. Crouser and R. Chang, “An affordance-based framework for human computation and human-computer collaboration,” IEEE Trans. Vis. Comput. Graph., vol. 18, pp. 2859–2868, 2012.
  • 59. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   When to Crowdsource a Task? n  Computers cannot do the task n  Single person cannot do the task n  Work can be split into smaller tasks
  • 60. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Platforms and Marketplaces
  • 61. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Types of Crowds n Internal corporate communities ¨ Taps potential of internal workforce ¨ Curate competitive enterprise data that will remain internal to the company – May not always be the case e.g. product technical support and marketing data n External communities ¨ Public crowd-sourcing marketplaces ¨ Pre-competitive communities
  • 62. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Generic Architecture: Requestors ↔ Platform/Marketplace (Publish Task, Task Management) ↔ Workers
  • 63. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Mturk Workflow
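The MTurk workflow screenshots do not survive in this transcript, so here is a minimal sketch of the requester side: publishing a HIT with the AWS boto3 MTurk client. It targets the sandbox endpoint so no real money is spent, and the title, reward, and task URL are placeholders, not values from the talk:

```python
import boto3

# Sandbox endpoint so experiments do not spend real money.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# ExternalQuestion: the task UI is hosted at our own (placeholder) URL.
question_xml = """<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.org/tasks/verify-record?id=42</ExternalURL>
  <FrameHeight>400</FrameHeight>
</ExternalQuestion>"""

hit = mturk.create_hit(
    Title="Verify a product record",          # placeholder task
    Description="Check whether two records describe the same product.",
    Keywords="data curation, verification",
    Reward="0.05",                             # USD, passed as a string
    MaxAssignments=3,                          # redundancy for quality control
    LifetimeInSeconds=24 * 60 * 60,
    AssignmentDurationInSeconds=5 * 60,
    Question=question_xml,
)
print("HITId:", hit["HIT"]["HITId"])
```

MaxAssignments=3 is what enables the quorum-vote quality assurance discussed later in Part V.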
  • 64. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Spatial Crowdsourcing n Crowdsourcing that requires a person to travel to a location to perform a spatial task ¨ Helps non-local requesters through workers in targeted spatial locality ¨ Used for data collection, package routing, citizen actuation ¨ Usually based on mobile applications ¨ Closely related to social sensing, participatory sensing, etc. ¨ Early example: Aardvark social search engine n Example systems
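Matching a spatial task to the nearest available worker is the core primitive here. A minimal sketch using the haversine great-circle distance; the worker names, coordinates, and task are made up:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

workers = {  # worker -> current location (illustrative)
    "w1": (53.2707, -9.0568),   # Galway
    "w2": (36.4341, 28.2176),   # Rhodes
}
task_location = (36.40, 28.20)  # e.g. a rain gauge to photograph

nearest = min(workers, key=lambda w: haversine_km(*workers[w], *task_location))
print(nearest)  # w2: push the task to the closest worker's mobile app
```

A production system would also weigh worker load, privacy, and travel cost, which is exactly the research agenda listed in Part VII.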
  • 65. CASE STUDIES ON CROWDSOURCED DATA CURATION PART IV 7-11 July 2014, Rhodes, Greece EarthBiAs2014
  • 66. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Crowdsourced Data Curation DQ Rules & Algorithms Entity Linking Data Fusion Relation Extraction Human Computation Relevance Judgment Data Verification Disambiguation Clean Data Internal Community - Domain Knowledge - High Quality Responses - Trustable Web of Data Databases Textual Content Programmers Managers External Crowd - High Availability - Large Scale - Expertise Variety
  • 67. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Examples of CDM Tasks n Understanding customer sentiment for launch of new product around the world. n Implemented 24/7 sentiment analysis system with workers from around the world. n 90% accuracy on 95% of content n Categorize millions of products on eBay’s catalog with accurate and complete attributes n Combine the crowd with machine learning to create an affordable and flexible catalog quality system
  • 68. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Examples of CDM Tasks n  Natural Language Processing ¨  Dialect Identification, Spelling Correction, Machine Translation, Word Similarity n  Computer Vision ¨  Image Similarity, Image Annotation/Analysis n  Classification ¨  Data attributes, Improving taxonomy, search results n  Verification ¨  Entity consolidation, de-duplicate, cross-check, validate data n  Enrichment ¨  Judgments, annotation
  • 69. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Wikipedia n Collaboratively built by large community ¨ More than 19,000,000 articles, 270+ languages, 3,200,000+ articles in English ¨ More than 157,000 active contributors n Accuracy and stylistic formality are equivalent to expert-based resources ¨ i.e. Columbia and Britannica encyclopedias n MediaWiki ¨ Software behind Wikipedia ¨ Widely used inside organizations ¨ Intellipedia: 16 U.S. Intelligence agencies ¨ Wiki Proteins: curated Protein data for knowledge discovery
  • 70. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Wikipedia n  Decentralized environment supports creation of high quality information with: ¨ Social organization ¨ Artifacts, tools & processes for cooperative work coordination n  Wikipedia collaboration dynamics highlight good practices
  • 71. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Wikipedia – Social Organization n  Any user can edit its contents ¨ Without prior registration n  Does not lead to a chaotic scenario ¨ In practice highly scalable approach for high quality content creation on the Web n  Relies on simple but highly effective way to coordinate its curation process n  Curation is activity of Wikipedia admins ¨ Responsibility for information quality standards
  • 72. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Wikipedia – Social Organization
  • 73. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Wikipedia – Social Organization n Incentives ¨ Improvement of one’s reputation ¨ Sense of efficacy – Contributing effectively to a meaningful project ¨ Over time the focus of editors typically changes – From curators of a few articles in specific topics – To a more global curation perspective – Enforcing quality assessment of Wikipedia as a whole
  • 74. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Wikipedia – Artifacts, Tools & Processes n  Wiki Article Editor (Tool) ¨  WYSIWYG or markup text editor n  Talk Pages (Tool) ¨  Public arena for discussions around Wikipedia resources n  Watchlists (Tool) ¨  Helps curators to actively monitor the integrity and quality of resources they contribute n  Permission Mechanisms (Tool) ¨  Users with administrator status can perform critical actions such as remove pages and grant administrative permissions to new users n  Automated Edition (Tool) ¨  Bots are automated or semi-automated tools that perform repetitive tasks over content n  Page History and Restore (Tool) ¨  Historical trail of changes to a Wikipedia Resource n  Guidelines, Policies & Templates (Artifact) ¨  Defines curation guidelines for editors to assess article quality n  Dispute Resolution (Process) ¨  Dispute mechanism between editors over the article contents n  Article Edition, Deletion, Merging, Redirection, Transwiking, Archival (Process) ¨  Describe the curation actions over Wikipedia resources
  • 75. EarthBiAs2014 7-11 July 2014, Rhodes, Greece DBpedia Knowledge Base n DBpedia provides direct access to data ¨ Indirectly uses wiki as data curation platform ¨ Inherits massive volume of curated Wikipedia data ¨ 3.4 million entities and 1 billion RDF triples ¨ Comprehensive data infrastructure – Concept URIs – Definitions – Basic types
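DBpedia's curated data is typically consumed through its public SPARQL endpoint. A small sketch using the SPARQLWrapper library (pip install sparqlwrapper); the query simply fetches the English abstract for the Galway resource that reappears in the Linked Open Data example later in this talk:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?abstract WHERE {
      <http://dbpedia.org/resource/Galway> dbo:abstract ?abstract .
      FILTER (lang(?abstract) = "en")
    }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["abstract"]["value"][:120], "...")
```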
  • 76. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece  
  • 77. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Wikipedia - DBPedia
  • 78. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   n  Collaborative knowledge base maintained by community of web users n  Users create entity types and their meta-data according to guidelines n  Requires administrative approvals for schema changes by end users
  • 79. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece  
  • 80. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece  
  • 81. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece  
  • 82. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Audio Tagging - Tag a Tune
  • 83. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Image Tagging - Peekaboom
  • 84. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Protein Folding - Fold.it/
  • 85. EarthBiAs2014 7-11 July 2014, Rhodes, Greece ReCaptcha n OCR ¨ ~1% error rate ¨ 20%-30% for 18th and 19th century books n 40 million ReCAPTCHAs every day (2008) ¨ Fixing 40,000 books a day
  • 86. EarthBiAs2014 7-11 July 2014, Rhodes, Greece ChemSpider n Structure centric chemical community ¨ Over 300 data sources with 25 million records ¨ Provided by chemical vendors, government databases, private laboratories and individuals n Pharma realizing benefits of open data ¨ Heavily leveraged by pharmaceutical companies as pre-competitive resources for experimental and clinical trial investigation ¨ GlaxoSmithKline made its proprietary malaria dataset of 13,500 compounds available
  • 87. EarthBiAs2014 7-11 July 2014, Rhodes, Greece n Dedicated to improving understanding of biological systems functions with 3-D structure of macromolecules ¨ Started in 1971 with 3 core members ¨ Originally offered 7 crystal structures ¨ Grown to 63,000 structures ¨ Over 300 million dataset downloads n Expanded beyond curated data download service to include complex molecular visualization, search, and analysis capabilities
  • 88. SETTING UP A CROWDSOURCED DATA CURATION PROCESS PART V 7-11 July 2014, Rhodes, Greece EarthBiAs2014
  • 89. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Core Design Questions: Goal (What), Incentives (Why), Workers (Who), Process (How) Malone, T. W., Laubacher, R., & Dellarocas, C. N. Harnessing crowds: Mapping the genome of collective intelligence. MIT Sloan Research Paper 4732-09, (2009).
  • 90. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Setting up a Curation Process 1 – Who is doing it? 2 – Why are they doing it? 3 – What is being done? 4 – How is it being done?
  • 91. EarthBiAs2014 7-11 July 2014, Rhodes, Greece 1) Who is doing it? (Workers) n Hierarchy (Assignment) ¨ Someone in authority assigns a particular person or group of people to perform the task ¨ Within the Enterprise (i.e. Individuals, specialised departments) ¨ Within a structured community (i.e. pre-competitive community) n Crowd (Choice) ¨ Anyone in a large group who chooses to do so ¨ Internal or External Crowds
  • 92. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   2) Why are they doing it? (Incentives) n  Motivation ¨  Money ($$££) ¨  Glory (reputation/prestige) ¨  Love (altruism, socialize, enjoyment) ¨  Unintended by-product (e.g. re-Captcha, captured in workflow) ¨  Self-serving resources (e.g. Wikipedia, product/customer data) ¨  Part of their job description (e.g. Data curation as part of role) n  Determine pay and time for each task ¨  Marketplace: Delicate balance –  Money does not improve quality but can increase participation ¨  Internal Hierarchy: Engineering opportunities for recognition –  Performance review, prizes for top contributors, badges, leaderboards, etc.
  • 93. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Effect of Payment on Quality n  Cost does not affect quality n  Similar results for bigger tasks [Ariely et al, 2009] Mason, W. A., & Watts, D. J. (2009). Financial incentives and the ‘‘performance of crowds.’’ Proceedings of the Human Computation Workshop. Paris: ACM, June 28, 2009. [Panos Ipeirotis. WWW2011 tutorial]
  • 94. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   3) What is being done? (Goal) 3.1 Identify the Data ¨ Newly created data and/or legacy data? ¨ How is new data created? – Do users create the data, or is it imported from an external source? ¨ How frequently is new data created/updated? ¨ What quantity of data is created? ¨ How much legacy data exists? ¨ Is it stored within a single source, or scattered across multiple sources?
  • 95. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   3) What is being done? (Goal) 3.2 Identify the Tasks ¨ Creation Tasks – Create/Generate – Find – Improve/ Edit / Fix ¨ Decision (Vote) Tasks – Accept / Reject – Thumbs up / Thumbs Down – Vote for Best
  • 96. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   4) How is it being done? (How) n  Identify the workflow ¨ Tasks integrated in normal workflow of those creating and managing data ¨ Simple as vetting or “rating” results of algorithm n  Identify the platform ¨  Internal/Community collaboration platforms ¨  Public crowdsourcing platform –  Consider the availability of appropriate workers (i.e. experts) n  Identify the Algorithm ¨  Data quality ¨  Image recognition ¨  etc
  • 97. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   4) How is it being done? (How) n  Task Design ¨ Task Interface ¨ Task Assignment/Routing ¨ Task Quality Assurance
  • 98. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Task Design: Input → Task Router (before computation) → Task Interface (during computation) → Output Aggregation (after computation) → Output * Edith Law and Luis von Ahn, Human Computation - Core Research Questions and State of the Art
  • 99. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Pull Routing n Workers seek tasks and assign them to themselves ¨ Search and Discovery of tasks supported by platform ¨ Task Recommendation ¨ Peer Routing (Diagram: Workers select Tasks via a Search & Browse Interface; an Algorithm collects the Results)
  • 100. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Push Routing n System assigns tasks to workers based on: ¨ Past performance ¨ Expertise ¨ Cost ¨ Latency (Diagram: an Algorithm assigns Tasks to Workers via a Task Interface and collects the Results) * www.mobileworks.com
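A push router can be sketched as a scoring function over the worker pool. Assuming toy worker profiles with the four attributes listed above (past performance, expertise, cost, latency) and made-up weights:

```python
workers = [  # illustrative profiles; scores in [0, 1], cost in dollars, latency in seconds
    {"id": "w1", "accuracy": 0.95, "expertise": 0.4, "cost": 0.10, "latency": 120},
    {"id": "w2", "accuracy": 0.80, "expertise": 0.9, "cost": 0.05, "latency": 300},
    {"id": "w3", "accuracy": 0.90, "expertise": 0.7, "cost": 0.20, "latency": 60},
]

def score(w, max_cost=0.25, max_latency=600):
    # Higher is better: reward quality and expertise, penalize cost and slowness.
    return (0.4 * w["accuracy"] + 0.3 * w["expertise"]
            + 0.2 * (1 - w["cost"] / max_cost)
            + 0.1 * (1 - w["latency"] / max_latency))

def push_route(task, workers):
    best = max(workers, key=score)
    print(f"assigning {task!r} to {best['id']}")
    return best

push_route("verify owl:sameAs link", workers)  # w2 wins under these weights
```

The weights encode the requester's trade-off; tuning them (or learning them from past assignments) is the task-routing research challenge revisited in Part VII.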
  • 101. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Managing Task Quality Assurance n  Redundancy: Quorum Votes ¨  Replicate the task (i.e. 3 times) ¨  Use majority voting to determine right value (% agreement) ¨  Weighted majority vote n  Gold Data / Honey Pots ¨  Inject trap question to test quality ¨  Worker fatigue check (habit of saying no all the time) n  Estimation of Worker Quality ¨  Redundancy plus gold data n  Qualification Test ¨  Use test tasks to determine users ability for such tasks
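A minimal sketch of two of these mechanisms, majority voting over redundant answers and gold questions for estimating worker quality; all of the answer data is fabricated:

```python
from collections import Counter

# Three workers answer the same task (redundancy = 3).
answers = {"w1": "duplicate", "w2": "duplicate", "w3": "distinct"}

def majority_vote(answers: dict) -> tuple[str, float]:
    """Return the most common answer and the fraction of workers agreeing."""
    counts = Counter(answers.values())
    value, votes = counts.most_common(1)[0]
    return value, votes / len(answers)

print(majority_vote(answers))  # ('duplicate', 0.666...)

# Gold data: tasks with known answers, injected to score each worker.
gold = {"t1": "yes", "t2": "no", "t3": "yes"}
worker_answers = {"t1": "yes", "t2": "no", "t3": "no"}  # one mistake

accuracy = sum(worker_answers[t] == gold[t] for t in gold) / len(gold)
print(f"worker accuracy on gold: {accuracy:.2f}")  # 0.67
```

The weighted-majority variant mentioned above simply weights each vote by the voter's gold-estimated accuracy instead of counting votes equally.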
  • 102. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Social Best Practices n Participation ¨ Stakeholder involvement for data producers and consumers must occur early in the project – Provides insight into basic questions of what they want to do, for whom, and what it will provide n Incentives ¨ Sheer curation needs line of sight from data curating activity to tangible exploitation benefits ¨ Recognizing contributing curators through a formal feedback mechanism n Engagement ¨ Outreach essential for promotion and feedback ¨ Typical consumers-to-contributors ratios < 5%
  • 103. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Technical Best Practices n  Human & Automated Curation ¨ Automated curation should always defer to, and never override, human curation edits – Automate validating data deposition and entry – Target community at focused curation tasks n  Track Provenance ¨ Curation activities should be recorded –  Especially where human curators are involved ¨ Different perspectives of provenance – A scientist may need to evaluate the fine grained experiment description behind the data – For a business analyst the ’brand’ of data provider can be sufficient for determining quality
  • 104. LINKED OPEN DATA EXAMPLE PART VI 7-11 July 2014, Rhodes, Greece EarthBiAs2014
  • 105. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Linked Open Data (LOD) n Expose and interlink datasets on the Web n Using URIs to identify “things” in your data n Using a graph representation (RDF) to describe URIs n Vision: The Web as a huge graph database Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
  • 106. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Linked Data Example Multiple Identifiers Identity resolution links
  • 107. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Identity Resolution in LOD n  Quality issues with Linked Data ¨  Imprecise or Outdated or Wrong n  Uncertainty of identity resolution links ¨  Due to multiple identity equivalence interpretations ¨  Due to characteristics of link generation algorithms (similarity based) n  User feedback for uncertain links ¨  Verify uncertain identity resolution links from users/ experts ¨  Improve quality of entity consolidation
  • 108. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Identity Resolution in LOD <http://www.freebase.com/view/en/galway> <http://dbpedia.org/resource/Galway> <http://sws.geonames.org/2964180/> owl:sameAs Publisher owl:sameAs Consumer Multiple Identifiers for ‘Galway’ entity in Linked Open Data Cloud Different sources of identity resolution links Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
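The three 'Galway' identifiers above can be consolidated programmatically once the owl:sameAs links are asserted. A small sketch with the rdflib library (pip install rdflib):

```python
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

g = Graph()
dbpedia = URIRef("http://dbpedia.org/resource/Galway")
geonames = URIRef("http://sws.geonames.org/2964180/")
freebase = URIRef("http://www.freebase.com/view/en/galway")

# Identity resolution links, as published (or as proposed by a matching algorithm).
g.add((dbpedia, OWL.sameAs, geonames))
g.add((dbpedia, OWL.sameAs, freebase))

# A consumer can now gather every identifier for the same real-world entity.
aliases = {dbpedia} | set(g.objects(dbpedia, OWL.sameAs))
for uri in aliases:
    print(uri)
```

Since such links are often produced by similarity-based algorithms, the uncertain ones are exactly the candidates to route to human feedback, which is what the application architecture on the next slide does.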
  • 109. EarthBiAs2014 7-11 July 2014, Rhodes, Greece LOD Application Architecture: Utility Module, Feedback Module, Consolidation Module (Questions, Feedback, Rules, Matching Dependencies, Ranked Feedback Tasks, Data Improvement, Candidate Links) Tom Heath and Christian Bizer (2011) Linked Data: Evolving the Web into a Global Data Space (1st edition), 1-136. Morgan & Claypool.
  • 110. FUTURE RESEARCH CHALLENGES PART VII 7-11 July 2014, Rhodes, Greece EarthBiAs2014
  • 111. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Future Research Directions n Incentives and social engagement ¨ Better recognition of the data curation role ¨ Understanding of social engagement mechanisms n Economic Models ¨ Pre-competitive and public-private partnerships n Curation at Scale ¨ Evolution of human computation and crowdsourcing ¨ Instrumenting popular apps for data curation ¨ General-purpose data curation pipelines ¨ Human-data interaction n Trust ¨ Capture of data curation decisions & provenance management ¨ Fine-grained permission management models and tools n Data Curation Models ¨ Nanopublications ¨ Theoretical principles and domain-specific models
  • 112. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Future Research Directions n  Spatial Crowdsourcing ¨  Matching tasks with workers at right time and location ¨  Balancing workload among workers ¨  Tasks at remote locations ¨  Chaining tasks in same vicinity ¨  Preserving worker privacy n  Interoperability ¨  Finding semantic similarity of tasks across systems ¨  Defining and measuring worker capability across heterogeneous systems ¨  Enabling routing middleware for multiple systems ¨  Compatibility of reputation systems ¨  Defining standards for task exchange
  • 113. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Heterogeneous Crowds n  Multiple requesters, tasks, workers, platform Collaborative Data Curation Tasks Workers Cyber Physical Social System Platforms
  • 114. EarthBiAs2014 7-11 July 2014, Rhodes, Greece SLUA Ontology: a User possesses Capabilities, performs Actions, and earns Rewards; a Task requires Capabilities, includes Actions, and offers Rewards. Capability subclasses: Location, Skill, Knowledge, Ability, Availability, Reputation. Reward subclasses: Money, Fun, Altruism, Learning. U. ul Hassan, S. O’Riain, E. Curry, “SLUA: Towards Semantic Linking of Users with Actions in Crowdsourcing,” in International Workshop on Crowdsourcing the Semantic Web, 2013.
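The ontology's core idea, matching what a Task requires against what a User possesses, can be illustrated with plain data structures. A sketch, not the SLUA vocabulary itself; the class and field names merely echo the diagram:

```python
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    capabilities: set[str] = field(default_factory=set)  # Skill, Knowledge, Location, ...

@dataclass
class Task:
    title: str
    requires: set[str] = field(default_factory=set)
    offers: set[str] = field(default_factory=set)        # Money, Fun, Altruism, Learning

def eligible(user: User, task: Task) -> bool:
    """A user can perform a task if they possess every required capability."""
    return task.requires <= user.capabilities

alice = User("alice", {"skill:annotation", "knowledge:hydrology", "location:Rhodes"})
task = Task("label rainfall sensor data",
            requires={"skill:annotation", "location:Rhodes"},
            offers={"reward:learning"})

print(eligible(alice, task))  # True: route the task to alice
```

Describing both sides in one shared vocabulary is what lets a router match users with actions across heterogeneous crowdsourcing systems, the interoperability challenge listed above.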
  • 115. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Future Research Directions n Task Routing ¨ Optimizing task completion, quality, and latency ¨ Inferring worker preferences, skills, and knowledge ¨ Balancing exploration-exploitation trade-off between inference and optimization ¨ Cold-start problem for new workers or tasks ¨ Ensuring worker satisfaction via load balancing & rewards n Human–Computer Interaction ¨ Reducing search friction through good browsing interfaces ¨ Presenting requisite information and nothing more ¨ Choosing the level of task granularity for complex tasks ¨ Ensuring worker engagement ¨ Designing games with a purpose to crowdsource with fun
  • 116. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Summary: Data + Algorithms + Humans → Better Data
  • 117. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Selected References n Big Data & Data Quality ¨ S. Lavalle, E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz, “Big Data, Analytics and the Path from Insights to Value,” MIT Sloan Management Review, vol. 52, no. 2, pp. 21–32, 2011. ¨ A. Haug and J. S. Arlbjørn, “Barriers to master data quality,” Journal of Enterprise Information Management, vol. 24, no. 3, pp. 288–303, 2011. ¨ R. Silvola, O. Jaaskelainen, H. Kropsu-Vehkapera, and H. Haapasalo, “Managing one master data – challenges and preconditions,” Industrial Management & Data Systems, vol. 111, no. 1, pp. 146–162, 2011. ¨ E. Curry, S. Hasan, and S. O’Riain, “Enterprise Energy Management using a Linked Dataspace for Energy Intelligence,” in Second IFIP Conference on Sustainable Internet and ICT for Sustainability, 2012. ¨ D. Loshin, Master Data Management. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2008. ¨ Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4), 211-2 ¨ Batini, C., Cappiello, C., Francalanci, C., & Maurino, A. (2009). Methodologies for data quality assessment and improvement. ACM Computing Surveys (CSUR), 41(3), 16. ¨ B. Otto and A. Reichert, “Organizing Master Data Management: Findings from an Expert Survey,” in Proceedings of the 2010 ACM Symposium on Applied Computing - SAC ’10, 2010, pp. 106–110. ¨ Wang, R. and D. Strong, Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 1996. 12(4): p. 5-33 ¨ Ul Hassan, U., O’Riain, S., and Curry, E. 2012. “Leveraging Matching Dependencies for Guided User Feedback in Linked Data Applications,” In 9th International Workshop on Information Integration on the Web (IIWeb2012), Scottsdale, Arizona: ACM.
  • 118. EarthBiAs2014 7-11 July 2014, Rhodes, Greece Selected References n Collective Intelligence, Crowdsourcing & Human Computation ¨ Malone, Thomas W., Robert Laubacher, and Chrysanthos Dellarocas. "Harnessing Crowds: Mapping the Genome of Collective Intelligence." (2009). ¨ A. Doan, R. Ramakrishnan, and A. Y. Halevy, “Crowdsourcing systems on the World-Wide Web,” Communications of the ACM, vol. 54, no. 4, p. 86, Apr. 2011. ¨ A. J. Quinn and B. B. Bederson, “Human computation: a survey and taxonomy of a growing field,” in Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems, 2011, pp. 1403–1412. ¨ Mason, W. A., & Watts, D. J. (2009). Financial incentives and the ‘‘performance of crowds.’’ Proceedings of the Human Computation Workshop. Paris: ACM, June 28, 2009. ¨ E. Law and L. von Ahn, “Human Computation,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 5, no. 3, pp. 1–121, Jun. 2011. ¨ M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin, “CrowdDB: Answering Queries with Crowdsourcing,” in Proceedings of the 2011 international conference on Management of data - SIGMOD ’11, 2011, p. 61. ¨ P. Wichmann, A. Borek, R. Kern, P. Woodall, A. K. Parlikad, and G. Satzger, “Exploring the ‘Crowd’ as Enabler of Better Information Quality,” in Proceedings of the 16th International Conference on Information Quality, 2011, pp. 302–312. ¨ Winter A. Mason, Duncan J. Watts: Financial incentives and the "performance of crowds". SIGKDD Explorations (SIGKDD) 11(2):100-108 (2009) ¨ Panos Ipeirotis. Managing Crowdsourced Human Computation, WWW2011 Tutorial ¨ O. Alonso & M. Lease. Crowdsourcing 101: Putting the WSDM of Crowds to Work for You, WSDM Hong Kong 2011. ¨ D. A. Grier, When Computers Were Human, vol. 13. Princeton University Press, 2005. – http://www.youtube.com/watch?v=YwqltwvPnkw ¨ Ul Hassan, U., & Curry, E. (2013, October). A capability requirements approach for predicting worker performance in crowdsourcing. In Collaborative Computing: Networking, Applications and Worksharing (Collaboratecom), 2013 9th International Conference on (pp. 429-437). IEEE.
  • 119. EarthBiAs2014  7-­‐11  July  2014,  Rhodes,  Greece   Selected References n  Collaborative Data Management ¨  E. Curry, A. Freitas, and S. O. Riain, “The Role of Community-Driven Data Curation for Enterprises,” in Linking Enterprise Data, D. Wood, Ed. Boston, MA: Springer US, 2010, pp. 25–47. ¨  Ul Hassan, U., O’Riain, S., and Curry, E. 2012. “Towards Expertise Modelling for Routing Data Cleaning Tasks within a Community of Knowledge Workers,” In 17th International Conference on Information Quality (ICIQ 2012), Paris, France. ¨  Ul Hassan, U., O’Riain, S., and Curry, E. 2013. “Effects of Expertise Assessment on the Quality of Task Routing in Human Computation,” In 2nd International Workshop on Social Media for Crowdsourcing and Human Computation, Paris, France. ¨  Stonebraker, M., Bruckner, D., Ilyas, I. F., Beskales, G., Cherniack, M., Zdonik, S. B., ... & Xu, S. (2013). Data Curation at Scale: The Data Tamer System. In CIDR. ¨  Parameswaran, A. G., Park, H., Garcia-Molina, H., Polyzotis, N., & Widom, J. (2012, October). Deco: declarative crowdsourcing. In Proceedings of the 21st ACM international conference on Information and knowledge management (pp. 1203-1212). ACM. ¨  Parameswaran, A., Boyd, S., Garcia-Molina, H., Gupta, A., Polyzotis, N., & Widom, J. (2014). Optimal crowd- powered rating and filtering algorithms.Proceedings Very Large Data Bases (VLDB). ¨  Marcus, A., Wu, E., Karger, D., Madden, S., & Miller, R. (2011). Human-powered sorts and joins. Proceedings of the VLDB Endowment, 5(1), 13-24. ¨  Guo, S., Parameswaran, A., & Garcia-Molina, H. (2012, May). So who won?: dynamic max discovery with the crowd. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (pp. 385-396). ACM. ¨  Davidson, S. B., Khanna, S., Milo, T., & Roy, S. (2013, March). Using the crowd for top-k and group-by queries. In Proceedings of the 16th International Conference on Database Theory (pp. 225-236). ACM. ¨  Chai, X., Vuong, B. Q., Doan, A., & Naughton, J. F. (2009, June). Efficiently incorporating user feedback into information extraction and integration programs. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data (pp. 87-100). ACM.
Selected References
•  Spatial Crowdsourcing
–  Kazemi, L., & Shahabi, C. (2012). GeoCrowd: Enabling query answering with spatial crowdsourcing. In Proceedings of the 20th International Conference on Advances in Geographic Information Systems (pp. 189–198). ACM.
–  Benouaret, K., Valliyur-Ramalingam, R., & Charoy, F. (2013). CrowdSC: Building smart cities with large-scale citizen participation. IEEE Internet Computing.
–  Musthag, M., & Ganesan, D. (2013). Labor dynamics in a mobile micro-task market. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 641–650). ACM.
–  Deng, D., Shahabi, C., & Demiryurek, U. (2013). Maximizing the number of worker's self-selected tasks in spatial crowdsourcing. In Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM.
–  To, H., Ghinita, G., & Shahabi, C. (2014). A framework for protecting worker location privacy in spatial crowdsourcing. Proceedings of the VLDB Endowment, 7(10).
–  Goncalves, J., Ferreira, D., Hosio, S., Liu, Y., Rogstadius, J., Kukka, H., & Kostakos, V. (2013). Crowdsourcing on the spot: Altruistic use of public displays, feasibility, performance, and behaviours. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing (pp. 753–762). ACM.
–  Cardone, G., Foschini, L., Bellavista, P., Corradi, A., Borcea, C., Talasila, M., & Curtmola, R. (2013). Fostering participaction in smart cities: A geo-social crowdsensing platform. IEEE Communications Magazine, 51(6).
Books
•  Surowiecki, J. (2005). The Wisdom of Crowds. Random House LLC.
•  Batini, C., & Scannapieco, M. (2006). Data Quality: Concepts, Methodologies and Techniques. Springer.
•  Michelucci, P. (Ed.). (2013). Handbook of Human Computation. Springer.
•  Law, E., & von Ahn, L. (2011). Human Computation. Synthesis Lectures on Artificial Intelligence and Machine Learning, 5(3), 1–121.
•  Heath, T., & Bizer, C. (2011). Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1(1), 1–136.
•  Grier, D. A. (2005). When Computers Were Human. Princeton University Press.
•  Easley, D., & Kleinberg, J. (2010). Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press.
•  Sheth, A., & Thirunarayan, K. (2012). Semantics Empowered Web 3.0: Managing Enterprise, Social, Sensor, and Cloud-based Data and Services for Advanced Applications. Synthesis Lectures on Data Management, 4(6), 1–175.
Tutorials
•  Human Computation and Crowdsourcing
–  http://research.microsoft.com/apps/video/default.aspx?id=169834
–  http://www.youtube.com/watch?v=tx082gDwGcM
•  Human-Powered Data Management
–  http://research.microsoft.com/apps/video/default.aspx?id=185336
•  Crowdsourcing Applications and Platforms: A Data Management Perspective
–  http://www.vldb.org/pvldb/vol4/p1508-doan-tutorial4.pdf
•  Human Computation: Core Research Questions and State of the Art
–  http://www.humancomputation.com/Tutorial.html
•  Crowdsourcing & Machine Learning
–  http://www.cs.rutgers.edu/~hirsh/icml-2011-tutorial/
•  Data Quality and Data Cleaning: An Overview
–  http://dl.acm.org/citation.cfm?id=872875
Datasets
•  TREC Crowdsourcing Track
–  https://sites.google.com/site/treccrowd/
•  2010 Crowdsourced Web Relevance Judgments Data
–  https://docs.google.com/document/d/1J9H7UIqTGzTO3mArkOYaTaQPibqOTYb_LwpCpu2qFCU/edit
•  SQUARE: Statistical QUality Assurance Robustness Evaluation Data
–  http://ir.ischool.utexas.edu/square/data.html
•  Crowdsourcing at Scale 2013
–  http://www.crowdscale.org/
•  USEWOD - Usage Analysis and the Web of Data
–  http://usewod.org/usewodorg-2.html
•  NAACL 2010 Workshop
–  https://sites.google.com/site/amtworkshop2010/data-1
•  mturk-tracker.com
•  GalaxyZoo.com
•  CrowdCrafting.com
Credits
Special thanks to Umair ul Hassan for his assistance with this tutorial.