https://www.insight-centre.org/content/towards-expertise-modelling-routing-data-cleaning-tasks-within-community-knowledge-workers
Presented at the ICIQ 2012
ABSTRACT:
Applications that consume data have to deal with a variety of data quality issues such as missing values, duplication, and incorrect values. Although automatic approaches can be used for data cleaning, the results can remain uncertain. Therefore, updates suggested by automatic data cleaning algorithms require further human verification. This paper presents an approach for generating tasks for uncertain updates and routing these tasks to appropriate workers based on their expertise. Specifically, the paper tackles the problem of modelling the expertise of knowledge workers for the purpose of routing tasks within collaborative data quality management. The proposed expertise model represents the profile of a worker against a set of concepts describing the data. A simple routing algorithm leverages the expertise profiles to match data cleaning tasks with workers. The proposed approach is evaluated on a real-world dataset using human workers. The results demonstrate the effectiveness of using concepts for modelling expertise, in terms of the likelihood of receiving responses to tasks routed to workers.
Towards Expertise Modelling for Routing Data Cleaning Tasks within a Community of Knowledge Workers
1. Digital Enterprise Research Institute www.deri.ie
TOWARDS EXPERTISE MODELLING FOR ROUTING DATA
CLEANING TASKS WITHIN A COMMUNITY OF
KNOWLEDGE WORKERS
Umair ul Hassan, Sean O’Riain, Edward Curry
Digital Enterprise Research Institute
National University of Ireland, Galway
17th International Conference on Information
Quality (ICIQ 2012), Paris, France
Copyright 2012 Digital Enterprise Research Institute. All rights reserved.
2. Agenda
Paper Overview
Motivation
Enterprise Data Landscape
Collaborative Data Quality
Human Computation
Problem Space
Task Routing
Challenges of Push Routing
CAMEE Prototype
DBPedia.org & SKOS
Expertise Assessment
Task Routing
Experiments
Summary
3. Paper Overview
Motivation
Data quality management is limited to a few individuals (e.g. MDM)
Involve a community of users in data quality tasks
Tasks require expertise and domain knowledge
Problem
How to assess and model human expertise
How to effectively route tasks to appropriate workers
Contribution
Concept-based approach for modelling and assessing knowledge workers’ expertise
Concept matching approach for routing data quality tasks
Prototype implementation using SKOS vocabulary
5. Enterprise Data Landscape
Enterprises will have to deal with much more data in the future
[Figure: nested layers of the enterprise data landscape. Layer labels: The Reality, The Relevant, The Known, The Managed, MDM, Collaboratively Managed. Other labels: External Data, Enterprise Data]
All data relevant to the enterprise and its operations
Data directly managed by the enterprise and its departments
Reference data managed through well-defined policies and a governance council
* “The data deluge,” The Economist, Feb-2010
6. Collaborative Data Quality
[Figure: components of collaborative data quality. Labels: Developers, Data Governance, Data Sources, Data Quality Algorithms, Human Computation, External Crowd, Internal Community, Clean Data]
7. Human Computation
Solve computationally hard problems with the help of humans
Algorithms control human workers
Computation is carried out by humans
[Figure: Developer, Algorithm, Workers; edge labels: Define, Compute]
* Barowy et al, “AutoMan: a platform for integrating human-based and digital computation,” OOPSLA ’12
8. Human Computation
[Figure: stages between Input and Output]
Task Router: before computation (our focus)
Task Design: during computation
Output Aggregation: after computation
* Edith Law and Luis von Ahn, Human Computation - Core Research Questions and State of the Art
9. PROBLEM SPACE
Challenges of Task Routing in Collaborative Data Quality
10. Task Routing
Pull Routing
System provides an interface to support workers
Workers actively seek tasks and assign them to themselves
[Figure: pull routing. Elements: Algorithm, Tasks, Search & Browse Interface, Workers; edge labels: Select, Result]
* www.mturk.com
11. Task Routing
Push Routing
System has complete control over assignment of tasks
– Based on criteria such as expertise, cost, and latency
Workers passively receive tasks
[Figure: push routing. Elements: Algorithm, Tasks, Task Interface, Workers; edge labels: Assign, Result]
* www.mobileworks.com
12. Challenges of Push Routing
Workers have different domain knowledge and expertise
1. How to represent the expertise required for a task?
2. How to assess and represent the expertise of workers?
3. How to match a task with the expertise of workers?
13. CAMEE: Collaborative Management of Enterprise Entities
Leverages concepts from data to build expertise profiles
Data Concepts: use concepts from the data sources
Task: associate data concepts with tasks
Expertise: profile worker expertise against concepts
Routing: leverage profiles for making routing decisions
14. CAMEE
[Figure: CAMEE architecture. Components: Data Quality Algorithms, Task Manager, Task Model, Crowd Manager, Expertise Model, Routing Model, Feedback Manager, Crowd. Numbered flows: 1) Update & Concepts, 2) Assessment, 3) Expertise, 4) Task, 5) Task UI, 6) Response]
Example entity, shown in the slide as both Input (Dirty Dataset) and Output (Clean Dataset):
dbp-res:X-Men:_First_Class
rdfs:type dbp-owl:Film .
foaf:name "X-Men: First Class"@en .
dbp-prop:budget "9600.0" .
dbp-owl:distributor dbp-res:20th_Century_Fox
dbp-prop:released "25-05-2011"
Crowd:
Worker1 (Sci-Fi, Action, Adventure)
Worker2 (Drama, Action, Thriller)
Task: Was the film “X-Men: First Class” released on 25 May 2011?
Response: True
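The step from an uncertain update to a worker-facing question can be sketched as follows. This is a minimal illustration, not the CAMEE implementation: the `Update` and `Task` types, the `make_task` helper, the question template, and the concept labels are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    """An uncertain update proposed by a data quality algorithm (hypothetical type)."""
    entity_label: str    # human-readable entity name, e.g. taken from foaf:name
    property_label: str  # human-readable property name
    value_label: str     # proposed value, formatted for display


@dataclass(frozen=True)
class Task:
    """A verification task shown to a worker (hypothetical type)."""
    question: str
    concepts: tuple  # concepts describing the entity, used later for routing


def make_task(update: Update, concepts) -> Task:
    """Turn an uncertain update into a yes/no verification question."""
    question = (
        f'Is the {update.property_label} of "{update.entity_label}" '
        f"{update.value_label}?"
    )
    return Task(question=question, concepts=tuple(concepts))


update = Update("X-Men: First Class", "release date", "25 May 2011")
# The concept labels below are illustrative, not taken from the dataset.
task = make_task(update, ["Sci-Fi", "Action"])
print(task.question)
```

Keeping the concepts on the task object means the router never has to look at the underlying triples; it only compares concept labels.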
16. Challenges of Push Routing
Workers have different domain knowledge and expertise
1. How to represent the expertise required for a task?
– DBpedia & SKOS Concepts
2. How to assess and represent the expertise of workers?
– Expertise based on Self/Task Assessment
3. How to match a task with the expertise of workers?
– Task Routing based on Matching
17. DBpedia & SKOS
DBpedia.org
Structured Database from Wikipedia Facts
Simple Knowledge Organization System
Common model for knowledge organization
– Facilitate interoperability
– Machine readability
“Concept” is basic element
– Identified by URI and represented with RDF
Defines concept schemes
– Hierarchical and associative relationships
* www.dbpedia.org, www.w3.org/2004/02/skos/
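To make the "concept" element concrete, here is a minimal in-memory sketch of a SKOS-style concept scheme. SKOS itself defines `skos:Concept`, `skos:prefLabel`, and `skos:broader`; everything else here (the `Concept` class, the `ancestors` helper, the `ex:` identifiers, and the small hierarchy) is hypothetical and only illustrates the hierarchical relationship.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A SKOS-style concept: a URI, a preferred label, and broader concepts."""
    uri: str
    pref_label: str
    broader: list = field(default_factory=list)  # URIs of broader concepts


def ancestors(uri, scheme):
    """Collect all transitively broader concepts of `uri`, nearest first."""
    seen = []
    stack = list(scheme[uri].broader)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(scheme[current].broader)
    return seen


# Hypothetical mini hierarchy; the "ex:" identifiers are placeholders.
scheme = {
    "ex:Films": Concept("ex:Films", "Films"),
    "ex:Biographical_films": Concept(
        "ex:Biographical_films", "Biographical films", ["ex:Films"]
    ),
    "ex:American_biographical_films": Concept(
        "ex:American_biographical_films",
        "American biographical films",
        ["ex:Biographical_films"],
    ),
}
print(ancestors("ex:American_biographical_films", scheme))
```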
19. CAMEE with SKOS
Source Data
Entity: A Beautiful Mind
Property & Values:
dbpedia-owl:Work/runtime 135.0
dbpedia-owl:director dbpedia:Ron_Howard
dbpedia-owl:producer dbpedia:Ron_Howard, dbpedia:Brian_Graze
dbpedia-owl:starring dbpedia:Ed_Harris, dbpedia:Russell_Crowe
Data Quality Algorithm
Update: Missing Value
dbpedia-owl:writer = dbpedia:Akiva_Goldsman
SKOS Concepts: American_biographical_films, Films_set_in_the_1950s
Task Model
Task: Confirm Missing Value
Did Akiva Goldsman write the movie "A Beautiful Mind"?
SKOS Concepts: American_biographical_films, Films_set_in_the_1950s
Worker Expertise (Workers & Expertise Model)
SKOS Concepts: Films_set_in_the_1950s (Good), Films_about_psychiatry (Poor), American_drama_films (Fair)
Task Routing (Routing Model)
Match the task's SKOS concepts (American_biographical_films, Films_set_in_the_1950s) against the worker's SKOS concepts
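The matching step in this example can be sketched as a simple scoring function. This is an assumption-laden illustration rather than the routing algorithm from the paper: the numeric values in `RATING`, the averaging rule in `match_score`, and the second worker are all hypothetical.

```python
# Assumed numeric values for the self-rating labels; the slides do not give them.
RATING = {"Poor": 0.25, "Fair": 0.5, "Good": 0.75, "Excellent": 1.0}


def match_score(task_concepts, profile):
    """Average expertise of one worker over the concepts attached to a task.

    Concepts missing from the worker's profile contribute 0.
    """
    if not task_concepts:
        return 0.0
    total = sum(RATING.get(profile.get(c), 0.0) for c in task_concepts)
    return total / len(task_concepts)


def rank_workers(task_concepts, profiles):
    """Return (worker, score) pairs, best match first; ties broken by name."""
    scored = [(w, match_score(task_concepts, p)) for w, p in profiles.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


task_concepts = ["American_biographical_films", "Films_set_in_the_1950s"]
profiles = {
    # Profile taken from the slide.
    "worker_a": {
        "Films_set_in_the_1950s": "Good",
        "Films_about_psychiatry": "Poor",
        "American_drama_films": "Fair",
    },
    # Hypothetical second worker, added only to make the ranking non-trivial.
    "worker_b": {"American_biographical_films": "Fair"},
}
ranking = rank_workers(task_concepts, profiles)
print(ranking)
```

In a push-routing setting, the task would be assigned to the first worker in `ranking`; other criteria such as cost or latency could be folded into the sort key.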
20. Expertise Assessment
Build profiles of workers
To quantify expertise or knowledge levels of workers against concepts
Two Approaches
Self-Assessment: Workers provide self-assessment of knowledge for each concept
Task Assessment: Workers provide responses to assessment tasks
Expertise profiles take the form of a matrix E(C,W),
where C is the set of concepts and W is the set of workers
Concept Worker 1 Worker 2 Worker 3
1990s_comedy-drama_films 0.6 0.2 0.2
Films_about_psychiatry 0.6 0.2 0.6
American_biographical_films 0.8 0.4 0.4
American_comedy-drama_films 0.8 0.6 0.6
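One way such a matrix could be filled from task assessment is sketched below. The scoring rule (fraction of correct assessment responses per concept and worker) is an assumption for illustration; the slides do not state how the scores were computed, and the sample responses are invented.

```python
from collections import defaultdict


def build_profiles(responses):
    """Build E as a dict keyed by (concept, worker).

    `responses` is an iterable of (worker, task_concepts, is_correct) tuples
    collected during the assessment stage. Each score is the fraction of
    correct responses the worker gave on tasks tagged with that concept.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for worker, concepts, is_correct in responses:
        for concept in concepts:
            total[(concept, worker)] += 1
            correct[(concept, worker)] += int(is_correct)
    return {key: correct[key] / total[key] for key in total}


# Hypothetical assessment responses, not data from the experiment.
responses = [
    ("worker_1", ["Films_about_psychiatry", "American_biographical_films"], True),
    ("worker_1", ["Films_about_psychiatry"], False),
    ("worker_2", ["American_biographical_films"], True),
]
E = build_profiles(responses)
print(E[("Films_about_psychiatry", "worker_1")])
```

A sparse dict keyed by (concept, worker) avoids storing entries for concepts a worker was never assessed on.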
24. EXPERIMENTS
Leveraging Expertise Profiles for Task Routing
25. Experiment
Hypothesis
Data quality tasks routed using concept-based expertise profiles
have higher response rates if the expertise model is built using a
task-assessment approach as compared to a self-assessment
based approach.
Two stages of experiment
Assessment Stage (build profiles)
Routing Stage (leverage profiles)
26. Dataset
Popular Movies in DBpedia
Top 100 grossing movies in Hollywood and Bollywood
Characteristic Value
Number of entities (dbp:Film) 724
No. of concepts (film genres) 42
No. of data quality tasks 230
Knowledge Workers
Characteristic Value
No. of Workers 11
Tasks for Assessment Stage 100
Tasks for Routing Stage 130
27. Response Rate
Hypothesis
Data quality tasks routed using concept-based expertise profiles
have higher response rates if the expertise model is built using a
task-assessment approach as compared to a self-assessment
based approach.
Data
Routing (Assessment)   Random    Matching (Self-Assessment)   Matching (Task Assessment)
Don't know             71.54%    58.46%                       10.00%
Strongly Disagree      5.38%     16.92%                       29.23%
Disagree               6.92%     2.31%                        13.08%
Neutral                2.31%     2.31%                        8.46%
Agree                  3.85%     4.62%                        12.31%
Strongly Agree         10.00%    15.38%                       26.92%
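The response rates implied by these figures can be checked with a few lines of arithmetic. The sketch assumes that a "Don't know" answer counts as no response, so the response rate is 100% minus the "Don't know" share; that definition is an assumption, not something stated on the slide. The three columns are read as random routing, matching on self-assessment profiles, and matching on task-assessment profiles.

```python
from decimal import Decimal

options = ["Don't know", "Strongly Disagree", "Disagree",
           "Neutral", "Agree", "Strongly Agree"]

# Percentages of routed tasks per response option, copied from the table.
results = {
    "Random": ["71.54", "5.38", "6.92", "2.31", "3.85", "10.00"],
    "Matching (Self-Assessment)": ["58.46", "16.92", "2.31", "2.31", "4.62", "15.38"],
    "Matching (Task Assessment)": ["10.00", "29.23", "13.08", "8.46", "12.31", "26.92"],
}

response_rate = {}
for method, column in results.items():
    shares = dict(zip(options, (Decimal(v) for v in column)))
    # Each column should account for all routed tasks.
    assert sum(shares.values()) == Decimal("100.00")
    # Assumed definition: every answer other than "Don't know" is a response.
    response_rate[method] = Decimal("100") - shares["Don't know"]

for method, rate in response_rate.items():
    print(f"{method}: {rate}%")
```

Under this reading, task-assessment profiles give a response rate of 90.00%, against 41.54% for self-assessment profiles and 28.46% for random routing.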
28. Assessment Effort
Combine self-assessment with task assessment
Filtering assessment tasks based on self-rated concepts reduces effort
required during assessment
[Figure: effort (average decisions per worker, axis from 0 to 150) by assessment method for expertise profiling. Categories: RND, SA, TA, and five SA&TA variants (surviving threshold labels: Poor+, Fair+, Good+, Excellent). Annotation: for example, filter tasks with concepts of Good or higher self-rating]
SA: Self-Assessment
TA: Task Assessment
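The combined approach (self-assessment used to filter assessment tasks) can be sketched as a threshold filter. This is an illustration under assumptions: the ordered scale in `LEVELS` uses only the labels visible on the slide, and the `filter_tasks` helper, the task identifiers, and the sample ratings are hypothetical.

```python
# Assumed ordering of self-rating labels, lowest to highest.
LEVELS = ["Poor", "Fair", "Good", "Excellent"]


def filter_tasks(tasks, self_ratings, threshold="Good"):
    """Keep assessment tasks with at least one concept self-rated >= threshold.

    `tasks` is a list of (task_id, concepts) pairs; `self_ratings` maps a
    concept to the worker's self-rating label.
    """
    min_rank = LEVELS.index(threshold)
    kept = []
    for task_id, concepts in tasks:
        if any(
            c in self_ratings and LEVELS.index(self_ratings[c]) >= min_rank
            for c in concepts
        ):
            kept.append(task_id)
    return kept


# Hypothetical worker self-ratings and assessment tasks.
self_ratings = {
    "Films_set_in_the_1950s": "Good",
    "Films_about_psychiatry": "Poor",
    "American_drama_films": "Fair",
}
tasks = [
    ("t1", ["Films_set_in_the_1950s", "American_biographical_films"]),
    ("t2", ["Films_about_psychiatry"]),
    ("t3", ["American_drama_films", "Films_about_psychiatry"]),
]
print(filter_tasks(tasks, self_ratings, threshold="Good"))
print(filter_tasks(tasks, self_ratings, threshold="Fair"))
```

Raising the threshold shrinks the set of assessment tasks a worker sees, which is the effort reduction the slide describes.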
29. Task Routing
Likelihood of response and quality of response remain near
maximum during the routing stage
[Figure: response rate and accuracy (axis from 0% to 100%) by assessment method for expertise profiling. Categories: RND, SA, TA, and five SA&TA variants (surviving threshold labels: Poor+, Fair+, Good+, Excellent). Annotation: for example, filter tasks with concepts of Good or higher self-rating]
SA: Self-Assessment
TA: Task Assessment
30. Summary
Conclusion
Effective task routing is a fundamental aspect of collaborative data
quality management
Concepts are effective for expertise assessment and modelling
Task routing that leverages task-assessment-based profiles has a better
likelihood of response from workers
Future Directions
Load balancing under constraints
– Cost, Latency, Motivation, Expertise, Utility
Trade-off between assessment for profiling and exploitation
31. Further Reading
17th International Conference on Information Quality (ICIQ 2012)
Paris, 16-17 November 2012
U. Ul Hassan, S. O’Riain, and E. Curry, “Towards Expertise Modelling for Routing Data
Cleaning Tasks within a Community of Knowledge Workers,” in 17th International
Conference on Information Quality - ICIQ’12, 2012.
http://www.deri.ie/about/team/member/umair_ul_hassan/
32. Selected References
Big Data & Data Quality
S. Lavalle, E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz, “Big Data, Analytics and the
Path from Insights to Value,” MIT Sloan Management Review, vol. 52, no. 2, pp. 21–32, 2011.
A. Haug and J. S. Arlbjørn, “Barriers to master data quality,” Journal of Enterprise Information
Management, vol. 24, no. 3, pp. 288–303, 2011.
R. Silvola, O. Jaaskelainen, H. Kropsu-Vehkapera, and H. Haapasalo, “Managing one master data –
challenges and preconditions,” Industrial Management & Data Systems, vol. 111, no. 1, pp. 146–
162, 2011.
E. Curry, S. Hasan, and S. O’Riain, “Enterprise Energy Management using a Linked Dataspace for
Energy Intelligence,” in Second IFIP Conference on Sustainable Internet and ICT for
Sustainability, 2012.
D. Loshin, Master Data Management. San Francisco, CA, USA: Morgan Kaufmann Publishers
Inc., 2008.
B. Otto and A. Reichert, “Organizing Master Data Management: Findings from an Expert Survey,” in
Proceedings of the 2010 ACM Symposium on Applied Computing - SAC ’10, 2010, pp. 106–110.
33. Selected References
Collective Intelligence, Crowdsourcing & Human Computation
E. Curry, A. Freitas, and S. O’Riain, “The Role of Community-Driven Data Curation for Enterprises,”
in Linking Enterprise Data, D. Wood, Ed. Boston, MA: Springer US, 2010, pp. 25–47.
A. Doan, R. Ramakrishnan, and A. Y. Halevy, “Crowdsourcing systems on the World-Wide Web,”
Communications of the ACM, vol. 54, no. 4, p. 86, Apr. 2011.
E. Law and L. von Ahn, “Human Computation,” Synthesis Lectures on Artificial Intelligence and
Machine Learning, vol. 5, no. 3, pp. 1–121, Jun. 2011.
M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin, “CrowdDB: Answering Queries with
Crowdsourcing,” in Proceedings of the 2011 international conference on Management of data -
SIGMOD ’11, 2011, p. 61.
P. Wichmann, A. Borek, R. Kern, P. Woodall, A. K. Parlikad, and G. Satzger, “Exploring the ‘Crowd’
as Enabler of Better Information Quality,” in Proceedings of the 16th International Conference on
Information Quality, 2011, pp. 302–312.
34. Selected References
Expert Finding
K. Balog, L. Azzopardi, and M. de Rijke, “Formal models for expert finding in enterprise corpora,” in
Proceedings of the 29th annual international ACM SIGIR conference on Research and development
in information retrieval - SIGIR ’06, 2006, p. 43.
K. Balog, T. Bogers, L. Azzopardi, M. de Rijke, and A. van den Bosch, “Broad expertise retrieval in
sparse data environments,” in Proceedings of the 30th annual international ACM SIGIR conference
on Research and development in information retrieval - SIGIR ’07, 2007, p. 551.
K. Balog and M. de Rijke, “Determining expert profiles (with an application to expert finding),” in
Proceedings of the 20th international joint conference on Artificial intelligence, 2007, pp. 2657–2662.
35. Selected References
Linked Data & User Feedback
S. O’Riain, E. Curry, and A. Harth, “XBRL and open data for global financial ecosystems: A linked
data approach,” International Journal of Accounting Information Systems, Mar. 2012.
U. Ul Hassan, S. O’Riain, and E. Curry, “Leveraging Matching Dependencies for Guided User
Feedback in Linked Data Applications,” in 9th International Workshop on Information Integration on
the Web IIWeb2012, 2012.
A. Miles and J. R. Pérez-Agüera, “SKOS: Simple Knowledge Organisation for the Web,” Cataloging &
Classification Quarterly, vol. 43, no. 3–4, pp. 69–83, Apr. 2007.
C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann, “DBpedia - A
crystallization point for the Web of Data,” Web Semantics: Science, Services and Agents on the
World Wide Web, vol. 7, no. 3, pp. 154–165, Sep. 2009.
S. R. Jeffery, M. J. Franklin, and A. Y. Halevy, “Pay-as-you-go user feedback for dataspace systems,”
in Proceedings of the 2008 ACM SIGMOD international conference on Management of data -
SIGMOD ’08, 2008, pp. 847–860.
Editor's notes
Personal background
Show start and the end before
Break the builds
The ranking allows us to select the best worker based on the scoring method