Data Cleansing and Understanding Best Practices

Data Cleansing and Data Understanding
Best Practices and Lessons from the Field
Casey Stella
@casey_stella
2017

Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/practices-data-curate

Presented at QCon London
www.qconlondon.com

Hi, I’m Casey Stella!
• I’m a Principal Software Engineer at Hortonworks, a Hadoop vendor, writing open
source software
• I work on Apache Metron (Incubating), constructing a platform to do advanced
analytics and data science for cyber security at scale
• Prior to this, I was
• Doing data science consulting on the Hadoop ecosystem for Hortonworks
• Doing data mining on medical data at Explorys using the Hadoop ecosystem
• Doing signal processing on seismic data at Ion Geophysical
• A graduate student in the Math department at Texas A&M in algorithmic
complexity theory

Garbage In ⇒ Garbage Out
“80% of the work in any data project is in cleaning the data.”
— D.J. Patil in Data Jujitsu

Data Cleansing ⇒ Data Understanding
There are two ways to understand your data
• Syntactic Understanding
• Semantic Understanding
If you hope to get anything out of your data, you have to have a handle on both.

Syntactic Understanding: True Types
A true type is a label applied to data points x_i such that the x_i are mutually comparable.
• Schema type != true data type
• A specific column can have many different types
“735” has a true type of integer but could have a schema type of string or double
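
The gap between schema type and true type can be checked with a small profiling pass. The sketch below is illustrative only and is not from the talk: the function names and the four labels ("integer", "double", "string", "missing") are assumptions chosen for this example.

```python
import math
from collections import Counter


def true_type(value):
    """Guess the true type of one raw value, whatever its schema type is.

    Hypothetical labels: "integer", "double", "string", "missing".
    """
    if value is None:
        return "missing"
    text = str(value).strip()
    if text == "":
        return "missing"
    try:
        number = float(text)
    except ValueError:
        return "string"
    if not math.isfinite(number):
        # "nan" and "inf" parse as floats but are treated as text here.
        return "string"
    return "integer" if number.is_integer() else "double"


def true_type_counts(column):
    """Count how many values of each true type appear in a column."""
    return Counter(true_type(v) for v in column)


# A column whose schema type is "string" but which mixes several true types.
column = ["735", " 12 ", "3.5", "N/A", "", None]
print(true_type_counts(column))
```

A column that reports more than one true type (beyond "missing") is a candidate for closer inspection before any comparison or aggregation is run over it.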

Syntactic Understanding: Density
Data density is an indication of how data is clumped together.
• For numerical data, distributions and statistical characteristics are informative.
• For non-numeric data, counts and distinct counts of a canonical representation are
extremely useful.
• For ALL data, an indication of how “empty” the data is.
Canonical representations are representations which give you an idea at a glance of the
data format
• Replacing digits with the character ‘d’
• Stripping whitespace
• Normalizing punctuation
Data density is an assumption underlying any conclusions drawn from your data.
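
The canonical representation and the "emptiness" measure can be sketched in a few lines. This is a minimal illustration, not the code from the talk: the function names are hypothetical, and mapping every ASCII punctuation character to '.' is only one assumed way to normalize punctuation.

```python
import re
import string
from collections import Counter

# Assumption: normalize punctuation by mapping every ASCII punctuation mark to '.'.
_PUNCT = str.maketrans({c: "." for c in string.punctuation})


def canonical(value):
    """Reduce a raw string to its shape: no whitespace, digits -> 'd', punctuation -> '.'."""
    text = re.sub(r"\s+", "", value)
    text = re.sub(r"\d", "d", text)
    return text.translate(_PUNCT)


def density_summary(column):
    """Return the empty fraction and the counts of canonical shapes for a column."""
    present = [v for v in column if v is not None and v.strip() != ""]
    shapes = Counter(canonical(v) for v in present)
    empty_fraction = 1 - len(present) / len(column) if column else 0.0
    return {
        "empty_fraction": empty_fraction,
        "distinct_shapes": len(shapes),
        "shapes": shapes,
    }


# Hypothetical phone-number column with inconsistent formatting and gaps.
phones = ["555-1234", "555 1234", "(216) 555-1234", "", None, "555.1234"]
summary = density_summary(phones)
print(summary["empty_fraction"], summary["shapes"].most_common())
```

A handful of dominant shapes with a long tail of rare ones usually points at the rows worth inspecting first.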

Syntactic Understanding: Density over Time
∆Density/∆t is how data clumps change over time.
This kind of analysis can show
• Problems in the data pipeline
• Whether the assumptions of your analysis are violated
∆Density/∆t ⇒
• Automation
• Outlier Alerting
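
One way to automate alerting on density changes is to track a density metric per time bucket and flag buckets that deviate strongly from the rest. The sketch below is an assumption-laden illustration rather than the method used in the talk: it uses the modified z-score based on the median absolute deviation (MAD), and the function name, the 3.5 threshold, and the sample data are all hypothetical.

```python
from statistics import median


def density_outliers(series, threshold=3.5):
    """Flag points whose density metric deviates strongly from the rest.

    `series` is a list of (label, value) pairs, e.g. (day, empty_fraction).
    Uses the MAD-based modified z-score: 0.6745 * |x - median| / MAD.
    """
    values = [v for _, v in series]
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # No spread at all: anything different from the median stands out.
        return [label for label, v in series if v != center]
    return [
        label
        for label, v in series
        if 0.6745 * abs(v - center) / mad > threshold
    ]


# Hypothetical daily empty fractions for one column; day 5 looks like a pipeline problem.
daily_empty = [
    ("day1", 0.02), ("day2", 0.03), ("day3", 0.02),
    ("day4", 0.04), ("day5", 0.45), ("day6", 0.03),
]
print(density_outliers(daily_empty))
```

The same check applies to any per-bucket density metric, such as distinct counts of canonical shapes or the mean of a numeric column.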

Story Time: A Summation over Time Saves Face
The stage: An unnamed startup analyzing clinical effectiveness measurement for
hospitals.
• Think of it as a rules engine that takes medical data and outputs how well doctors
and departments are doing
• Insights aren’t trusted if they’re wrong.
• Correctness depends on good data

Semantic Understanding: “Do what I mean, not what I say”
Semantic understanding is understanding based on how the data is used rather than
how it is stored.
• Equivalences found through semantic understanding are often context sensitive.
• May come from humans (e.g. domain experience and ontologies)
• May come from machine learning (e.g. analyzing usage patterns to find synonyms)
Semantic understanding may require data science. At the same time, data science will
require semantic understanding.

Story Time: Very Busy Bags of Mostly Water; Very Confused Boxes of Silicon
The stage: An unnamed insurance company building models to predict disease
• Doctors and nurses are busy people
• Humans suffer from confirmation bias
• Machines can only interpret what they can see
• Together we can fill in the gaps

Implications for Team Structure
To be successful,
• Your data science teams have to be integrally involved in the data transformation
and understanding.
• Your data science teams have to be willing to get their hands dirty
• Your data science teams have to be allowed to get their hands dirty
• Your data science teams need software engineering chops.

Questions
Thanks for your attention! Questions?
• Code & scripts for this talk available on my GitHub presentation page:
http://github.com/cestella/presentations/
• Find me at http://caseystella.com
• Twitter handle: @casey_stella
• Email address: cstella@hortonworks.com
