Managing uncertainty in data - Presentation at Data Science Northeast Netherlands Meetup 14 Jan 2016

MANAGING UNCERTAINTY IN DATA
THE KEY TO EFFECTIVE MANAGEMENT OF DATA QUALITY
PROBLEMS
MAURICE VAN KEULEN

REVOLUTION IN SCIENTIFIC METHOD
Paradigms of scientific method
• Empiricism
• Mathematical modeling
• Simulation
A new paradigm: Data-intensive Scientific Discovery
• Combining and analyzing data in novel ways is capable of tackling research questions that could not be answered before
Bio-Informatics professor:
“PhD of 4 years, 3 years devoted to ‘data fiddling’”
14 Jan 2016, Managing uncertainty in data: the key to effective management of data quality problems

A FIRST STORY: PREGNANCY RESEARCH
Research on pregnancy processes based on Electronic Patient Dossiers (EPDs) of some population of women
• Select consult & treatment records from their EPDs from multiple sources
• After a first analysis one discovers many records not related to pregnancy (e.g., dermatologist consult)
• The assumption that all records that belong to a pregnant woman are related to pregnancy is wrong, and hence so is the selection criterion!
• There is no objective means to ascertain this, such as a field ‘related to pregnancy’

A FIRST STORY: PREGNANCY RESEARCH
• A painstaking process follows with specifying filter rules and manually inspecting samples of results
• Imperfect process so noisy records remain!
• Wrong diagnoses cause more records to be erroneously in or out → more noisy records
• Then, one looks at a sample and notices something strange in the times of consults: many appear close to each other and in the evening
• Modification time of EPD record (what is recorded) does not reflect actual moment of activity (semantics) → sequence and duration noise

GEO-SOCIAL RECOMMENDATION: GPS TRAJECTORIES
• Detect visits from trajectories
  • GPS traces from mobile phones
  • Point-Of-Interest (POI) data harvested from the internet
• Purpose: construct profiles of
  • Customers
  • Products
• for recommendation
  • Holiday homes
  • Greeting cards

DATA-DRIVEN FRAUD RISK ANALYSIS
Substantial amounts of money are involved in fraud. Ease of committing fraud incites otherwise decent people to do it as well. Danger to society
• Inspect where there is a high risk of fraud
• Example ISZW: labor market, labor circumstances, etc.
• But: government data represents paper reality!
Include traces from the internet (social media, web forums): customers, employees, and bystanders leave behind observations and opinions
• But natural language: about which company do they talk?

NATURAL LANGUAGE PROCESSING: AMBIGUITY ABOUNDS
• Paris Hilton stayed in the Paris Hilton
• Lady Gaga - Speechless live @ Helsinki 10/13/2010 http://www.youtube.com/watch?v=yREociHyijk . . . @ladygaga also talks about her Grampa who died recently
• Laelith Demonia has just defeated liwanu Hird. Career wins is 575, career losses is 966.
• Adding Win7Beta, Win2008, and Vista x64 and x86 images to munin. #wds
• history should show that bush jr should be in jail or at least never should have been president

TECHNOLOGY WE WORK ON
WE = DATABASES GROUP FROM UNIVERSITY OF TWENTE
• Search (finding the needle in the haystack)
• Information extraction from unstructured sources
• Natural language processing
• Web harvesting
(both produce lower quality structured data)
• Data quality management
• Responsible analytics is (among other things) “Knowing how data quality problems in the source data affect the analytical results”
Equally true for Business Analytics

IMPACT OF DATA IMPERFECTIONS
CustID  Sales  Name
1234    6000   John
2345    5000   Mary
3456    12000  Bart
…       …      …
SELECT SUM(Sales)
FROM CustSales
→ 3423000
• Sample of data looks fine
• Result of analysis looks perfectly reasonable
If you don’t look hard enough, if you don’t properly pay attention to it
… you will be unaware
… that you are possibly looking at significantly erroneous figures!!!

IMPACT OF DATA IMPERFECTIONS
CustID  Sales  Name
1234    6000   John
2345    5000   Mary
3456    12000  Bart
…       …      …
SELECT SUM(Sales)
FROM CustSales
→ 3423000 ????
CustID  Sales  Name
6789    2      Tom
4567    6000   Jon
5678    NULL   Nina
…       …      …
• Wrong figures included
• Missing figures
• Double counting
• etc.
Many more problems at value, record, schema, source, trust levels
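
To make the point concrete, here is a minimal sketch using Python's built-in sqlite3 module. It assumes the rows from both table fragments belong to one CustSales table, and the comments on individual rows are guesses at what the slide hints at, not confirmed facts. The query runs without any warning even though one figure is missing and others are doubtful.

```python
import sqlite3

# Rows taken from the two table fragments on the slide; treating them as one
# CustSales table is an assumption made for this sketch.
rows = [
    (1234, 6000, "John"),
    (2345, 5000, "Mary"),
    (3456, 12000, "Bart"),
    (6789, 2, "Tom"),      # suspiciously small figure (possibly wrong)
    (4567, 6000, "Jon"),   # possibly the same customer as John (double counting)
    (5678, None, "Nina"),  # missing figure, stored as NULL
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CustSales (CustID INTEGER, Sales INTEGER, Name TEXT)")
conn.executemany("INSERT INTO CustSales VALUES (?, ?, ?)", rows)

# SUM skips NULL values silently; COUNT(Sales) counts only non-NULL values.
total, counted, missing = conn.execute(
    "SELECT SUM(Sales), COUNT(Sales), COUNT(*) - COUNT(Sales) FROM CustSales"
).fetchone()

print("SUM(Sales):", total)
print("rows counted:", counted, "rows with missing Sales:", missing)
```

The total looks like an ordinary number; only the extra COUNT columns reveal that a row was left out.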

PROBABILISTIC DATABASES TO THE RESCUE
Probabilistic database technology can store, query, analyze, and reason with data while taking into account the possible influence of data quality problems on the results
• Treats data quality problems as a fact of life
• Responsible analytics: know deficiencies of results
• Generic and scalable approach and technology
• Nice properties for application: postpone-resolution/cleaning; pay-as-you-go; good-is-good-enough; human-in-the-loop

PROBABILISTIC DATA INTEGRATION
Let’s go for an initial integration that can readily and meaningfully be used
“Good is good enough” for meaningful use in many applications (can be achieved 10x earlier)
Let it improve during use
Figure (process diagram):
Initial integration: partial data integration; enumerate cases for remaining problems; store data with uncertainty in UDBMS
Continuous improvement: use (analytics); measure quality; improve data quality
Callouts: postpone problems; stop earlier; pay as you go; human in the loop

COMBINING DATA …
Keulen, M. (2012) Managing Uncertainty: The Road Towards Better Data Interoperability. IT - Information Technology, 54 (3). pp. 138-146. ISSN 1611-2776
Source 1:
Car brand                 Sales
B.M.W.                    25
Mercedes                  32
Renault                   10
Source 2:
Car brand                 Sales
BMW                       72
Mercedes-Benz             39
Renault                   20
Source 3:
Car brand                 Sales
Bayerische Motoren Werke  8
Mercedes                  35
Renault                   15
Combined:
Car brand                 Sales
B.M.W.                    25
BMW                       72
Mercedes                  67
Mercedes-Benz             39
Renault                   45
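
The combination step shown in these tables can be sketched as a naive merge that adds up sales per exact brand string. The variable names are hypothetical. Note that a merge on exact strings also yields a separate row for "Bayerische Motoren Werke", which the combined table above does not list.

```python
from collections import Counter

# The three source tables from the slide, keyed by the brand name as written.
source_1 = {"B.M.W.": 25, "Mercedes": 32, "Renault": 10}
source_2 = {"BMW": 72, "Mercedes-Benz": 39, "Renault": 20}
source_3 = {"Bayerische Motoren Werke": 8, "Mercedes": 35, "Renault": 15}

combined = Counter()
for source in (source_1, source_2, source_3):
    # Counter.update adds the sales of identical strings; spelling variants
    # of the same brand stay separate rows.
    combined.update(source)

for brand, sales in sorted(combined.items()):
    print(f"{brand:<26}{sales}")
```

Only rows with identical spelling are added together (Mercedes 32 + 35 = 67, Renault 10 + 20 + 15 = 45), so the BMW and Mercedes variants remain as separate entries.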

… AND THE PROBLEM OF SEMANTIC DUPLICATES
Car brand      Sales
B.M.W.         25
BMW            72
Mercedes       67
Mercedes-Benz  39
Renault        45
Preferred customers …
SELECT SUM(Sales)
FROM CarSales
WHERE Sales>100
→ 0
‘No preferred customers’
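
A small sketch of this query with Python's sqlite3 module, assuming the five-row table above. Two details are assumptions of the sketch: COALESCE is added because plain SUM over zero matching rows returns NULL in SQL rather than 0, and the "resolution" step simply merges the two Mercedes rows by hand to show how the answer changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CarSales (brand TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO CarSales VALUES (?, ?)",
    [("B.M.W.", 25), ("BMW", 72), ("Mercedes", 67),
     ("Mercedes-Benz", 39), ("Renault", 45)],
)

# COALESCE turns the NULL that SUM returns for an empty selection into 0.
query = "SELECT COALESCE(SUM(sales), 0) FROM CarSales WHERE sales > 100"
before = conn.execute(query).fetchone()[0]

# Hypothetical manual resolution: treat "Mercedes-Benz" as a duplicate of "Mercedes".
conn.execute("DELETE FROM CarSales WHERE brand = 'Mercedes-Benz'")
conn.execute("UPDATE CarSales SET sales = 67 + 39 WHERE brand = 'Mercedes'")
after = conn.execute(query).fetchone()[0]

print("with semantic duplicates:", before)
print("after merging Mercedes rows:", after)
```

With the duplicates in place no row exceeds 100, so the preferred customer stays hidden; after the merge the same query finds it.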

SEMANTIC DUPLICATES
Figure: a mapping ω from database records (d1 … d6) to real-world car brands (o1 … o4)
Database:
Car brand                 Sales
B.M.W.                    25
Bayerische Motoren Werke  8
Mercedes                  67
Renault                   45
BMW                       72
Mercedes-Benz             39

MOST DATA QUALITY PROBLEMS
CAN BE MODELED AS UNCERTAINTY IN DATA
Run some duplicate detection tool
Car brand      Sales  Condition
B.M.W.         25
BMW            72
Mercedes       67     X=0
Mercedes-Benz  39     X=0
Renault        45
Mercedes       106    X=1 Y=0
Mercedes-Benz  106    X=1 Y=1
Choice  Meaning                                                             Probability
X=0     4 and 5 different (the “Mercedes” and “Mercedes-Benz” records)      0.2
X=1     4 and 5 the same                                                    0.8
Y=0     “Mercedes” correct name                                             0.5
Y=1     “Mercedes-Benz” correct name                                        0.5
B.M.W. / BMW / Bayerische Motoren Werke analogously
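
The X and Y choices can be read as defining a set of possible worlds. The sketch below enumerates them for the two Mercedes records only. It assumes X and Y are independent, which the slide does not state, and the function name is hypothetical.

```python
from itertools import product

# Probabilities from the slide.
P_X = {0: 0.2, 1: 0.8}  # X=1: "Mercedes" and "Mercedes-Benz" are the same brand
P_Y = {0: 0.5, 1: 0.5}  # Y=0: "Mercedes" is the correct name, Y=1: "Mercedes-Benz"

def mercedes_rows(x, y):
    """Rows present in the world selected by the choices X=x, Y=y."""
    if x == 0:
        # Different brands: both original records exist, Y is irrelevant.
        return (("Mercedes", 67), ("Mercedes-Benz", 39))
    name = "Mercedes" if y == 0 else "Mercedes-Benz"
    return ((name, 67 + 39),)

# ASSUMPTION: X and Y are independent, so world probability = P(X) * P(Y).
worlds = {}
for x, y in product(P_X, P_Y):
    rows = mercedes_rows(x, y)
    worlds[rows] = worlds.get(rows, 0.0) + P_X[x] * P_Y[y]

for rows, p in worlds.items():
    print(f"{p:.1f}", rows)
```

Three distinct worlds remain: both records kept (0.2), one merged record named "Mercedes" (0.4), and one merged record named "Mercedes-Benz" (0.4).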

IMPORTANT TOOL: PROBABILISTIC DATABASE
• Looks like an ordinary database
• Several “possible” answers or approximate answers to queries
• Important: Scalability (big data!)
Sales of “preferred customers”
• SELECT SUM(sales)
  FROM carsales
  WHERE sales >= 100
SUM(sales)  P
0           14%
105         6%
106         56%
211         24%
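
The answer table can be reproduced by brute-force enumeration of alternatives, which is only a sketch of the semantics and not how a scalable probabilistic database evaluates queries. The Mercedes probabilities come from the slides. The value 0.3 for "all three BMW records are the same brand" is an assumption: it is not stated on the slides and was chosen because it reproduces the table.

```python
from collections import defaultdict
from itertools import product

# Each brand: list of (probability, sales figures of the resulting records).
mercedes = [(0.2, [67, 39]), (0.8, [106])]  # from the slide: different vs. the same
# ASSUMPTION: 0.3 is not on the slides. Any grouping other than "all three merged"
# leaves every BMW record below 100 (97 at most), so those cases are lumped together.
bmw = [(0.7, [25, 72, 8]), (0.3, [105])]
renault = [(1.0, [45])]

answers = defaultdict(float)
for combo in product(mercedes, bmw, renault):
    probability = 1.0
    total = 0
    for p, sales in combo:
        probability *= p
        # SELECT SUM(sales) ... WHERE sales >= 100, evaluated in this world
        total += sum(s for s in sales if s >= 100)
    answers[total] += probability

for value in sorted(answers):
    print(f"{value:>4}  {answers[value]:.0%}")
```

Each combination of alternatives is one possible world; worlds that give the same query result have their probabilities added.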

QUERYING AND RELIABILITY ASSESSMENT
Sales of “preferred customers”
• SELECT SUM(sales)
  FROM carsales
  WHERE sales >= 100
• Answer: 106
• Risk = Probability * Impact
• Analyst only bothered with problems that matter
SUM(sales)  P
0           14%
105         6%
106         56%
211         24%
Second most likely answer at 24% with impact factor 2 in sales (211 vs 106)
Risk of substantially wrong answer
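
A minimal sketch of the Risk = Probability * Impact rule applied to the answer table. The slide does not define how impact is measured; the relative deviation from the most likely answer used here is an assumption, and the function name is hypothetical.

```python
# Answer distribution from the slide.
answers = {0: 0.14, 105: 0.06, 106: 0.56, 211: 0.24}

# The reported answer is the most likely one.
best = max(answers, key=answers.get)

def impact(value, reference):
    """ASSUMED impact measure: relative deviation from the reported answer."""
    return abs(value - reference) / reference

# Risk of each alternative answer = its probability * its impact.
risks = {v: p * impact(v, best) for v, p in answers.items() if v != best}
worst = max(risks, key=risks.get)

for value, risk in sorted(risks.items(), key=lambda item: -item[1]):
    print(f"alternative {value:>3}: risk {risk:.3f}")
print("reported answer:", best, "| riskiest alternative:", worst)
```

Under this measure the alternative 105 barely matters despite being a different answer, while 211 combines a sizeable probability with a large deviation, which matches the slide's focus on that case.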

BACK TO GEO-SOCIAL RECOMMENDATION
HOW TO MODEL THE GPS TRAJECTORY PROBLEM?
• Smoothing: any jumps and/or sudden sharp angles are suspicious and probably wrong
  • Points become estimated points
  • Some points are possibly suspicious
  • Some are more suspicious than others
Model the uncertainty explicitly in the data
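
One way to attach a degree of suspicion to each point, instead of deleting points outright, is sketched below. It is an illustration under stated assumptions and not the method used in the project: points are treated as planar (x, y) coordinates, only sharp angles are scored (not jumps in distance), and the score is simply the change of heading divided by 180 degrees. Function names and the sample track are hypothetical.

```python
import math

def turn_angle(a, b, c):
    """Change of heading in degrees at point b when travelling a -> b -> c."""
    heading_in = math.atan2(b[1] - a[1], b[0] - a[0])
    heading_out = math.atan2(c[1] - b[1], c[0] - b[0])
    diff = abs(heading_out - heading_in)
    if diff > math.pi:  # take the smaller of the two possible angles
        diff = 2 * math.pi - diff
    return math.degrees(diff)

def suspicion_scores(points):
    """Score in [0, 1] per point; end points get 0 because no angle is defined."""
    scores = [0.0] * len(points)
    for i in range(1, len(points) - 1):
        scores[i] = turn_angle(points[i - 1], points[i], points[i + 1]) / 180.0
    return scores

# Hypothetical track moving along the x-axis with one outlier at index 3.
track = [(0, 0), (1, 0), (2, 0), (2.5, 3), (3, 0), (4, 0)]
scores = suspicion_scores(track)
for point, score in zip(track, scores):
    print(point, f"{score:.2f}")
```

The scores can be stored alongside the points, so later analysis can weigh each point rather than rely on a hard keep-or-drop decision.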

AMBIGUITY IN NATURAL LANGUAGE PROCESSING
AND ITS CONSEQUENCES FOR FRAUD RISK ANALYSIS
Fraud risk analysis
• about which company do they talk?
• Indicators become possible indicators
• Fraud risk analysis is statistics / probability theory!
Reasoning with possible indicators is very easy. It’s just more data
Example: Paris Hilton stayed in the Paris Hilton
Phrase        begin  end  type       ref
Paris         1      1    City       sws.geonames.org/2988507
Paris         1      1    Firstname
Hilton        1      1    Lastname
Paris Hilton  1      2    Person     https://en.wikipedia.org/wiki/Paris_Hilton
Paris Hilton  1      2    Hotel      www.hilton.com/Paris
…             …      …    …
Callout on the table: “belong together”

KNOW WHEN TO STOP CLEANING: MEASURING QUALITY
• Inspired by information retrieval (search engine evaluation)
• Precision = ratio of answers that are correct (3/5 = 60%)
• Recall = ratio of correct answers given (3/4 = 75%)
• Expected precision and recall
  • A correct answer is better if the system dares to claim that it is correct with a higher probability
  • Analogously, incorrect answers with a high probability are worse than incorrect answers with a low probability
• Expected precision = (0.8+0.7+0.2) / 2.3 ≈ 74%
• Expected recall = (0.8+0.7+0.2) / 4 ≈ 43%
Figure labels: answers A, B, C, D, E, F, G; probabilities 80%, 70%, 50%, 20%, 10%
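
The arithmetic on this slide can be checked with a short sketch. The probabilities and the fact that the correct answers are those with 0.8, 0.7 and 0.2 follow from the formulas above; the labels a1 to a5 are hypothetical, since the mapping to the letters in the figure is not recoverable.

```python
# (label, probability claimed by the system, whether the answer is correct)
# ASSUMPTION: labels are illustrative; probabilities are taken from the slide.
answers = [
    ("a1", 0.8, True),
    ("a2", 0.7, True),
    ("a3", 0.5, False),
    ("a4", 0.2, True),
    ("a5", 0.1, False),
]
n_correct_total = 4  # correct answers that exist, whether returned or not

# Classic precision and recall: every returned answer counts equally.
returned_correct = sum(1 for _, _, ok in answers if ok)
precision = returned_correct / len(answers)
recall = returned_correct / n_correct_total

# Expected variants: every answer counts with its probability.
mass_correct = sum(p for _, p, ok in answers if ok)
mass_all = sum(p for _, p, _ in answers)
expected_precision = mass_correct / mass_all
expected_recall = mass_correct / n_correct_total

print(f"precision {precision:.1%}, recall {recall:.1%}")
print(f"expected precision {expected_precision:.1%}, expected recall {expected_recall:.1%}")
```

Expected recall comes out at 42.5%, which the slide rounds to 43%. It is much lower than the classic recall of 75% because the system gives one correct answer only 20% probability.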

CONCLUSIONS
Data quality: intangible problem with unknown impact
The key to effective management of DQ problems
• Model DQ problems as uncertainty *in* the data
• Probabilistic database technology for scalability
• Postpone resolution/cleaning: pay-as-you-go
• Measure and know when to stop: good-is-good-enough; human-in-the-loop
Bio-Informatics professor:
“PhD of 4 years, 3 years devoted to ‘data fiddling’”
If we can reduce the data fiddling by 1 year (33%), the scientist has 2 productive years instead of 1: twice as productive!
