This document discusses managing uncertainty in data and data quality problems. It describes how most data quality problems can be modeled as uncertainty in the data. Probabilistic databases can store, query, and analyze data while accounting for this uncertainty. This allows for scalable and "good enough" initial data integration that can improve over time, avoiding excessive "data fiddling". Measuring expected precision and recall provides a way to quantitatively assess quality and know when cleaning efforts should stop.
Managing uncertainty in data - Presentation at Data Science Northeast Netherlands Meetup 14 Jan 2016
1. MANAGING UNCERTAINTY IN DATA
THE KEY TO EFFECTIVE MANAGEMENT OF DATA QUALITY PROBLEMS
MAURICE VAN KEULEN
2. Paradigms of scientific method
Empiricism
Mathematical modeling
Simulation
A new paradigm: Data-intensive Scientific Discovery
Combining and analyzing data in novel ways is
capable of tackling research questions that could not
be answered before
14 Jan 2016 Managing uncertainty in data: the key to effective management of data quality problems
REVOLUTION IN SCIENTIFIC METHOD
Bio-Informatics professor:
“ PhD of 4 years, 3 years
devoted to ‘data fiddling’ ”
3. Research on pregnancy processes based on Electronic
Patient Dossiers (EPDs) of some population of women
Select consult & treatment records from their EPDs
from multiple sources
After first analysis one discovers many records not
related to pregnancy (e.g., dermatologist consult)
Assumption that all records that belong to a pregnant
woman are related to pregnancy is wrong, hence also
the selection criterion!
There is no objective means to ascertain this such as a
field ‘related to pregnancy’
A FIRST STORY: PREGNANCY RESEARCH
4. A painstaking process follows with specifying filter
rules and manually inspecting samples of results
Imperfect process so noisy records remain!
Wrong diagnoses cause more records to be erroneously in or out, hence more noisy records
Then, one looks at a sample and notices something
strange in the times of consults: many appear close to
each other and in the evening
Modification time of EPD record (what is recorded) does not reflect actual moment of activity (semantics), hence sequence and duration noise
A FIRST STORY: PREGNANCY RESEARCH
5. GEO-SOCIAL RECOMMENDATION: GPS TRAJECTORIES
• Detect visits from trajectories
• GPS traces from mobile phones
• Point-Of-Interest (POI) data
harvested from the internet
• Purpose: construct profiles of
• Customers
• Products
• for recommendation
• Holiday homes
• Greeting cards
6. Substantial amount of money involved in fraud. Ease of
committing fraud incites otherwise decent people to do it
as well. Danger to society
Inspect where there is a high risk of fraud
Example ISZW: labor market, labor circumstances, etc.
But: government data represents paper reality!
Include traces from the internet (social media, web
forums): Customers, employees, and by-standers
leave behind observations and opinions
But natural language: about which company do they
talk?
DATA-DRIVEN FRAUD RISK ANALYSIS
7. Paris Hilton stayed in the Paris Hilton
Lady Gaga - Speechless live @ Helsinki 10/13/2010
http://www.youtube.com/watch?v=yREociHyijk . . .
@ladygaga also talks about her Grampa who died
recently
Laelith Demonia has just defeated liwanu Hird.
Career wins is 575, career losses is 966.
Adding Win7Beta, Win2008, and Vista x64 and x86
images to munin. #wds
history should show that bush jr should be in jail or at
least never should have been president
NATURAL LANGUAGE PROCESSING: AMBIGUITY ABOUNDS
8. Search (finding the needle in the haystack)
Information extraction from unstructured sources
Natural language processing
Web harvesting
(both produce lower quality structured data)
Data quality management
Responsible analytics is (among other things)
“Knowing how data quality problems in the source
data affect the analytical results”
TECHNOLOGY WE WORK ON
WE = DATABASES GROUP FROM UNIVERSITY OF TWENTE
Equally true for Business Analytics
9. CustID Sales Name
1234 6000 John
2345 5000 Mary
3456 12000 Bart
… … …
IMPACT OF DATA IMPERFECTIONS
SELECT SUM(Sales)
FROM CustSales
3423000
Sample of data looks fine
Result of analysis looks perfectly reasonable
If you don’t look hard enough, if you don’t properly pay attention to it … you will be unaware … that you are possibly looking at significantly erroneous figures!!!
10. CustID Sales Name
1234 6000 John
2345 5000 Mary
3456 12000 Bart
… … …
IMPACT OF DATA IMPERFECTIONS
SELECT SUM(Sales)
FROM CustSales
3423000
CustID Sales Name
6789 2 Tom
4567 6000 Jon
5678 NULL Nina
… … …
????
Wrong figures included
Missing figures
Double counting
etc.
Many more problems at value, record, schema, source, trust levels
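The effect of such imperfections on an aggregate can be sketched in a few lines. This is only an illustration using the six rows visible on the slides (the elided "…" rows are not reproduced, so the total differs from 3423000), and reading Tom's value as a wrong figure or Jon as a duplicate of John is an assumption, not something the slides state.

```python
import sqlite3

# In-memory table holding only the six rows visible on the slides.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CustSales (CustID INTEGER, Sales INTEGER, Name TEXT)")
rows = [
    (1234, 6000, "John"),
    (2345, 5000, "Mary"),
    (3456, 12000, "Bart"),
    (6789, 2, "Tom"),       # possibly a wrong figure (assumption)
    (4567, 6000, "Jon"),    # possibly a duplicate of John (assumption)
    (5678, None, "Nina"),   # missing figure
]
conn.executemany("INSERT INTO CustSales VALUES (?, ?, ?)", rows)

# SUM silently skips NULL and happily adds suspicious or duplicated values.
total, = conn.execute("SELECT SUM(Sales) FROM CustSales").fetchone()
missing, = conn.execute(
    "SELECT COUNT(*) FROM CustSales WHERE Sales IS NULL"
).fetchone()
print(total, missing)
```

The query runs without any warning, which is the point of the slide: nothing in the result hints at the missing, suspicious, or possibly double-counted rows.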
11. Probabilistic database technology can store, query,
analyze, reason with data taking into account possible
influence on the results
Treats data quality problems as a fact of life
Responsible analytics: know deficiencies of results
Generic and scalable approach and technology
Nice properties for application: postpone resolution/cleaning; pay-as-you-go; good-is-good-enough; human-in-the-loop
PROBABILISTIC DATABASES TO THE RESCUE
12. Let’s go for an initial
integration that can readily
and meaningfully be used
“Good is good enough” for
meaningful use in many
applications
(can be achieved 10x
earlier)
Let it improve during use
PROBABILISTIC DATA INTEGRATION
[Diagram. Initial integration: partial data integration; enumerate cases for remaining problems; store data with uncertainty in UDBMS. Continuous improvement: use (analytics); measure quality; improve data quality. Annotations: postpone problems; stop earlier; pay as you go; human in the loop.]
13. COMBINING DATA …
Keulen, M. (2012) Managing Uncertainty: The Road
Towards Better Data Interoperability. IT - Information
Technology, 54 (3). pp. 138-146. ISSN 1611-2776
Source 1:
Car brand                  Sales
B.M.W.                     25
Mercedes                   32
Renault                    10
Source 2:
Car brand                  Sales
BMW                        72
Mercedes-Benz              39
Renault                    20
Source 3:
Car brand                  Sales
Bayerische Motoren Werke   8
Mercedes                   35
Renault                    15
Combined:
Car brand                  Sales
B.M.W.                     25
Bayerische Motoren Werke   8
BMW                        72
Mercedes                   67
Mercedes-Benz              39
Renault                    45
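A minimal sketch of how the combined table arises when the three sources are merged on the exact brand string. The variable names are illustrative; the figures are those on the slide.

```python
from collections import Counter

# The three source tables from the slide, keyed by the brand name as written.
source_1 = {"B.M.W.": 25, "Mercedes": 32, "Renault": 10}
source_2 = {"BMW": 72, "Mercedes-Benz": 39, "Renault": 20}
source_3 = {"Bayerische Motoren Werke": 8, "Mercedes": 35, "Renault": 15}

# Naive integration: sum sales per exact string, so differently spelled
# names for the same brand stay separate rows.
combined = Counter()
for source in (source_1, source_2, source_3):
    combined.update(source)

for brand in sorted(combined):
    print(brand, combined[brand])
```

Only identically spelled names are merged (Mercedes 32 + 35, Renault 10 + 20 + 15); the three spellings of BMW remain three rows.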
14. … AND THE PROBLEM OF SEMANTIC DUPLICATES
Car brand Sales
B.M.W. 25
Bayerische Motoren Werke 8
BMW 72
Mercedes 67
Mercedes-Benz 39
Renault 45
Preferred customers …
SELECT SUM(Sales)
FROM CarSales
WHERE Sales>100
0
‘No preferred customers’
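The query above can be tried against the combined table as a quick check. This sketch assumes SQLite and a column named Brand (the slide only shows "Car brand"). Note that in SQLite, SUM over zero matching rows yields NULL rather than 0, so the 0 on the slide corresponds to wrapping the sum in COALESCE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CarSales (Brand TEXT, Sales INTEGER)")
conn.executemany(
    "INSERT INTO CarSales VALUES (?, ?)",
    [
        ("B.M.W.", 25),
        ("Bayerische Motoren Werke", 8),
        ("BMW", 72),
        ("Mercedes", 67),
        ("Mercedes-Benz", 39),
        ("Renault", 45),
    ],
)

# No single row exceeds 100, so the plain SUM has nothing to add up.
raw, = conn.execute(
    "SELECT SUM(Sales) FROM CarSales WHERE Sales > 100"
).fetchone()

# COALESCE turns the empty-result NULL into the 0 shown on the slide.
shown, = conn.execute(
    "SELECT COALESCE(SUM(Sales), 0) FROM CarSales WHERE Sales > 100"
).fetchone()
print(raw, shown)
```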
15. [Figure: a mapping ω from the database to the real world (of car brands). Database records d1 to d6 (car brand, sales): B.M.W. 25; Bayerische Motoren Werke 8; BMW 72; Mercedes 67; Mercedes-Benz 39; Renault 45. Real-world objects: o1 to o4.]
SEMANTIC DUPLICATES
16. MOST DATA QUALITY PROBLEMS CAN BE MODELED AS UNCERTAINTY IN DATA
Row  Car brand                  Sales  Condition
1    B.M.W.                     25
2    Bayerische Motoren Werke   8
3    BMW                        72
4    Mercedes                   67     X=0
5    Mercedes-Benz              39     X=0
6    Renault                    45
     Mercedes                   106    X=1 Y=0
     Mercedes-Benz              106    X=1 Y=1
Variable  Meaning                         Probability
X=0       4 and 5 different               0.2
X=1       4 and 5 the same                0.8
Y=0       “Mercedes” correct name         0.5
Y=1       “Mercedes-Benz” correct name    0.5
B.M.W. / BMW / Bayerische Motoren Werke analogously
Run some duplicate detection tool
17. Looks like ordinary database
Several “possible” answers or approximate answers to
queries
Important: Scalability (big data!)
Sales of “preferred customers”
SELECT SUM(sales)
FROM carsales
WHERE sales >= 100
IMPORTANT TOOL: PROBABILISTIC DATABASE
SUM(sales) P
0 14%
105 6%
106 56%
211 24%
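The distribution above can be reproduced by enumerating possible worlds. This is a hedged sketch, not how a probabilistic database evaluates the query internally. The 0.8 for the Mercedes merge comes from the slides; the 0.3 for merging all three BMW spellings is an assumption, chosen because it is the value implied by the 6% and 24% rows if the two merges are independent. Partial BMW merges are ignored because none of them reaches 100, and Y (which name is correct) does not affect the sum.

```python
from collections import defaultdict
from itertools import product

P_MERCEDES_SAME = 0.8  # from the slides: X=1, rows 4 and 5 are the same brand
P_BMW_SAME = 0.3       # ASSUMPTION: not on the slides, implied by the result table


def world_sales(mercedes_same, bmw_same):
    """Sales figures per brand in one possible world."""
    sales = [45]  # Renault is certain
    sales += [67 + 39] if mercedes_same else [67, 39]
    sales += [25 + 8 + 72] if bmw_same else [25, 8, 72]
    return sales


def preferred_sum_distribution(threshold=100):
    """Probability of each answer to SUM(sales) WHERE sales >= threshold."""
    dist = defaultdict(float)
    for mercedes_same, bmw_same in product([False, True], repeat=2):
        p_m = P_MERCEDES_SAME if mercedes_same else 1 - P_MERCEDES_SAME
        p_b = P_BMW_SAME if bmw_same else 1 - P_BMW_SAME
        answer = sum(s for s in world_sales(mercedes_same, bmw_same) if s >= threshold)
        dist[answer] += p_m * p_b  # independence of the two merges is assumed
    return dict(dist)


distribution = preferred_sum_distribution()
for answer in sorted(distribution):
    print(answer, f"{distribution[answer]:.0%}")
```

Enumerating worlds grows exponentially with the number of uncertain decisions, which is why the slides stress scalability as a requirement for the database itself.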
18. Sales of “preferred customers”
SELECT SUM(sales)
FROM carsales
WHERE sales >= 100
Answer: 106
Risk = Probability * Impact
Analyst only bothered with
problems that matter
QUERYING AND RELIABILITY ASSESSMENT
SUM(sales) P
0 14%
105 6%
106 56%
211 24%
Second most likely answer at 24% with impact factor 2 in sales (211 vs 106)
Risk of substantially wrong answer
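One way to turn "Risk = Probability * Impact" into numbers is sketched below. The slides do not define impact precisely; here it is assumed to be the relative deviation of an alternative answer from the most likely one, so treat the measure as illustrative only.

```python
# Answer distribution from the slide.
distribution = {0: 0.14, 105: 0.06, 106: 0.56, 211: 0.24}

# The answer reported to the analyst: the most probable one.
best = max(distribution, key=distribution.get)


def risk(answer):
    """Probability times impact, with impact as relative deviation (assumption)."""
    impact = abs(answer - best) / best
    return distribution[answer] * impact


risks = {answer: risk(answer) for answer in distribution if answer != best}
worst = max(risks, key=risks.get)
print(best, worst, round(risks[worst], 3))
```

Under this measure the alternative 211 carries the highest risk, while 105 is almost harmless even though it is also "wrong": it is both unlikely and close to 106. That matches the idea that the analyst is only bothered with problems that matter.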
19. BACK TO GEO-SOCIAL RECOMMENDATION
HOW TO MODEL THE GPS TRAJECTORY PROBLEM?
Smoothing: any jumps and/or sudden sharp angles
are suspicious and probably wrong
Points become estimated points
Some points are possibly suspicious
Some are more suspicious than others
Model the uncertainty explicitly in the data
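As a rough illustration of "some are more suspicious than others", the sketch below scores each point by how sharply the track turns at it. The scoring rule (turning angle divided by π) and the sample coordinates are assumptions made for this example; the slides do not specify a smoothing or scoring method.

```python
import math


def turning_suspicion(points):
    """Score in [0, 1] per point: 0 = straight on, 1 = full reversal.

    Endpoints get 0 because no turning angle is defined there.
    """
    scores = [0.0] * len(points)
    for i in range(1, len(points) - 1):
        (x0, y0), (x1, y1), (x2, y2) = points[i - 1], points[i], points[i + 1]
        heading_in = math.atan2(y1 - y0, x1 - x0)
        heading_out = math.atan2(y2 - y1, x2 - x1)
        diff = heading_out - heading_in
        # Wrap the heading change into [-pi, pi] before taking its size.
        turn = abs(math.atan2(math.sin(diff), math.cos(diff)))
        scores[i] = turn / math.pi
    return scores


# Hypothetical track (local x/y in metres) with one spike at index 3.
track = [(0, 0), (10, 0), (20, 0), (30, 40), (40, 0), (50, 0)]
scores = turning_suspicion(track)
for point, score in zip(track, scores):
    print(point, round(score, 2))
```

Instead of deleting the spike, the score can be stored next to the point, so that later queries can weigh the point accordingly.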
20. Fraud risk analysis
about which company do they talk?
Indicators become possible indicators
Fraud risk analysis is statistics / probability theory!
Reasoning with possible indicators is very easy. It’s just more data
AMBIGUITY IN NATURAL LANGUAGE PROCESSING
AND ITS CONSEQUENCES FOR FRAUD RISK ANALYSIS
Paris Hilton stayed in the Paris Hilton
Phrase        begin  end  type       ref
Paris         1      1    City       sws.geonames.org/2988507
Paris         1      1    Firstname
Hilton        1      1    Lastname
Paris Hilton  1      2    Person     https://en.wikipedia.org/wiki/Paris_Hilton
Paris Hilton  1      2    Hotel      www.hilton.com/Paris
…             …      …    …
“belong together”
Inspired by information retrieval (search engine evaluation)
Precision = fraction of the returned answers that are correct (3/5 = 60%)
Recall = fraction of the correct answers that are returned (3/4 = 75%)
Expected precision and recall
A correct answer is better if the system dares to
claim that it is correct with a higher probability
Analogously, incorrect answers with a high
probability are worse than incorrect answers
with a low probability
Expected precision = (0.8+0.7+0.2) / 2.3 = 74%
Expected recall = (0.8+0.7+0.2) / 4 = 43%
KNOW WHEN TO STOP CLEANING: MEASURING QUALITY
[Figure: candidate answers A to G; the probabilities shown are 80%, 70%, 50%, 20%, and 10%.]
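The expected precision and recall figures on this slide can be recomputed as follows. The probabilities and the count of four correct answers are from the slide; which letter carries which probability, and which letters are correct, is a hypothetical assignment chosen only so that the correct returned answers have probabilities 0.8, 0.7, and 0.2.

```python
def expected_precision_recall(returned, correct):
    """Probability-weighted precision and recall.

    returned: answer -> probability claimed by the system
    correct:  set of answers that are actually correct
    """
    hit_mass = sum(p for answer, p in returned.items() if answer in correct)
    total_mass = sum(returned.values())
    return hit_mass / total_mass, hit_mass / len(correct)


# Hypothetical labelling of the five returned answers and four correct ones.
returned = {"A": 0.8, "B": 0.7, "C": 0.5, "D": 0.2, "E": 0.1}
correct = {"A", "B", "D", "F"}

expected_precision, expected_recall = expected_precision_recall(returned, correct)

# Classic (unweighted) precision and recall for comparison.
hits = len(set(returned) & correct)
precision, recall = hits / len(returned), hits / len(correct)

print(f"{precision:.0%} {recall:.0%}")
print(f"{expected_precision:.1%} {expected_recall:.1%}")
```

Expected recall comes out at 42.5%, which the slide rounds to 43%. Tracking these two numbers over time is what allows a decision on when further cleaning stops paying off.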
22. Data quality: intangible problem with unknown impact
The key to effective management of DQ problems
Model DQ problems as uncertainty *in* the data
Probabilistic database technology for scalability
Postpone resolution/cleaning: pay-as-you-go
Measure and know when to stop:
good-is-good-enough; human-in-the-loop
CONCLUSIONS
Bio-Informatics professor:
“ PhD of 4 years, 3 years
devoted to ‘data fiddling’ ”
If we can reduce the data fiddling by 1 year (33%), we make the scientist twice as productive!
Editor's notes
Abstract
Business analytics and data science are significantly impaired by a wide variety of 'data handling' issues, especially when data from different sources are combined and when unstructured data is involved. The root cause of many such problems centers around data semantics and data quality. We have developed a generic method which is based on modeling such problems as uncertainty *in* the data. A recently conceived new kind of DBMS can store, manage, and query large volumes of uncertain data: the UDBMS or "Uncertain Database". Together, they allow one to, e.g., postpone the resolution of data problems, assess what their influence is on analytical results, etc. We furthermore develop technology for data cleansing, web harvesting, and natural language processing which uses this method to deal with ambiguity of natural language and many other problems encountered when using unstructured data
Explain Bio-Informatics
Of course there are others, e.g., BioInformatics
First two examples showed data quality and semantical problems, if you do NLP you are faced with the same!
Refer back to pregnancy and movie examples: all those issues can be modeled as uncertainty in data. Queries and analytics then return all possible results, which gives a handle on how the issues influence the results
With OSINT data, this problem of semantic duplicates is enormous ...
Notice that all these are “tables”
TODO: make this slide somewhat more explicit / concrete