Caserta Concepts and Databricks partner up to bring you this insightful webinar on how a business can choose from all of the emerging big data technologies to figure out which one best fits their needs.
2. @joe_Caserta
Joe Caserta Timeline
Launched Big Data practice
Co-author, with Ralph Kimball, The Data
Warehouse ETL Toolkit (Wiley)
Data Analysis, Data Warehousing and Business
Intelligence since 1996
Began consulting database programing and data
modeling 25+ years hands-on experience building database
solutions
Founded Caserta Concepts in NYC
Web log analytics solution published in Intelligent
Enterprise magazine
Launched Data Science, Data Interaction and Cloud
practices
Laser focus on extending Data Analytics with Big Data
solutions
1986
2004
1996
2009
2001
2013
2012
2014
Dedicated to Data Governance Techniques on Big
Data (Innovation)
Awarded Top 20 Big Data Companies 2016
Top 20 Most Powerful
Big Data consulting firms
Launched Big Data Warehousing (BDW) Meetup NYC:
2,000+ Members
2016 Awarded Fastest Growing Big Data Companies
2016
Established best practices for big data ecosystem
implementations
3. @joe_Caserta
About Caserta Concepts
– Consulting Data Innovation and Modern Data Engineering
– Award-winning company
– Internationally recognized work force
– Strategy, Architecture, Implementation, Governance
– Innovation Partner
– Strategic Consulting
– Advanced Architecture
– Build & Deploy
• Leader in Enterprise Data Solutions
– Big Data Analytics
– Data Warehousing
– Business Intelligence
Data Science
Cloud Computing
Data Governance
7. @joe_Caserta
Use Case Objectives
• Cross-Channel Behavior Tracking / Identity Resolution
• Access to Atomic Level Transaction Data
• Blend Data Assets from Multiple Marketing Channels
• Fast Data Onboarding and Discovery
• Ensure Data Quality
• Single platform / Landscape simplification
• Understand Path-to-Purchase Customer Journey
• Improve Customer Experience & Increase Sales
8. @joe_Caserta
The Recipe: Build a Dynamic Data Platform
OLD WAY:
• Structure Data Ingest Data Analyze Data
• Fixed Capacity
• Monolith
NEW WAY:
• Ingest Data Analyze Data Structure Data
• Dynamic Capacity
• Ecosystem
RECIPE:
• Cloud
• Data Lake
• Holistic Architecture & Framework
9. @joe_Caserta
Analytics: The Whole Brain Challenge
Front
Back
Analytics Oriented
• Data Science
• Research
Process Oriented
• Data Governance
• Compliance
Operations Oriented
• Shared Services
• Data Engineering
Revenue Oriented
• Revenue Goals
• Monetizing Data
10. @joe_Caserta
Chief Data Organization (Oversight)
Vertical Business Area
[Sales/Finance/Marketing/Operations/Customer Svc]
Product Owner
SCRUM Master
Development Team
Business Subject Matter Expertise
Data Librarian/Data Stewardship
Data Science/ Statistical Skills
Data Engineering / Architecture
Presentation/ BI Report Development Skills
Data Quality Assurance
DevOps
IT Organization
(Oversight)
Enterprise Data Architect
Solution Engineers
Data Integration Practice
User Experience Practice
QA Practice
Operations Practice
Advanced Analytics
Business Analysts
Data Analysts
Data Scientists
Statisticians
Data Engineers
Planning Organization
Project Managers
Data Organization
Data Gov Coordinator
Data Librarians
Data Stewards
It Takes a Village!
12. @joe_Caserta
Global economics
Intensity of competition
Reduce costs
Move to cross-functional teams
New executive leadership
Speed of technical change
Social trends and changes
Period of time in present role
Status & perks of office/dept under threat
No apparent reasons for proposed changes
Lack of understanding of proposed changes
Fear of inability to cope with new technology
Concern over job security
Forces for Change Forces Resisting Change
Status Quo
Moving the Status Quo
http://www.change-management-coach.com/force-field-analysis.html
13. @joe_Caserta
The Data Lake Paradigm
Technology:
• Scalable distributed storage S3
• Pluggable fit-for-purpose processing EMR
• Consistent extensible framework Spark
• Dimensional Data Warehouse Redshift
Functional Capabilities:
• Remove barriers between data ingestion and analysis
• Democratize Data with Just Enough Data Governance
15. @joe_Caserta
Why Spark?
“Big Box” tools vs ROI?
– Prohibitively expensive limited by licensing $$$
– Typically limited to the scalability of a single server
We Spark!
• Development local or distributed is identical
• Beautiful high level API’s
• Full universe of Python modules
• Open source and Free
• Blazing fast!
• Databricks cloud makes it easier
Spark has become our default processing engine for a
myriad of engineering & science problems
16. @joe_Caserta
Why we Databricks
• Interactive UI
• Includes a workspace with notebooks, dashboards, job scheduler,
point-and-click cluster management
• Cluster sharing
• Multiple users can connect to the same cluster, saving cost
• Security features
• Access controls to the whole workspace, clusters
• Collaboration
• Multi-user access to the same notebook, revision control, and IDE
and GitHub integration
• Data management
• Support for connecting different data sources to Spark, caching
service to speed up queries
17. @joe_Caserta
Ingest Raw
Data
Organize, Define,
Complete
Munging, Blending
Machine Learning
Data Quality and Monitoring
Metadata, ILM , Security
Data Catalog
Data Integration
Fully Governed ( trusted)
Arbitrary/Ad-hoc Queries
and Reporting
Big
Data
Ware
house
Data Science Workspace
Data Lake – Integrated Sandbox
Landing Area – Source Data in “Full Fidelity”
Usage Pattern Data Governance
Metadata, ILM,
Security
Corporate Data Pyramid (CDP)
18. @joe_Caserta
Data Integration
Identity Resolution
Data Quality
Discovery / Exploration
Machine Learning
Models Development
Reports / Dashboards
Applications
APIs
Structured Data
Unstructured Data
SQL, NoSQL, Object Store
Find Share Collaborate
Data Engineer
Data Scientist
Business Analyst
App Developer
Analyze
Persist
DeployIngest
Data Lake on the Cloud
20. @joe_Caserta
Type
Comments
Single Touch Rules-Based Statistically Driven
Assign the credit
to the first or last
exposure
Assign the credit to
each interaction
based on business
rules
Assign the credit to
interactions based
on data-driven
model
Ad-Click Mailing MailingE-mail E-mailAd-Click Ad-Click
100% 33% 33% 33% 27% 49% 24%
- Last touch only
- Ignores bulk of
customer journey
- Undervalues
other interactions
and influencers
- Subjective
- Assigns arbitrary
values to each
interaction
- Lacks analytics rigor
to determine weights
ü Looks at full behavior
patterns
ü Consider all touch points
ü Can apply different
models for best results
ü Use data to find
correlations between
touch points (winning
combinations)
Path-to-Purchase Methods
21. @joe_Caserta
Unifying the Customer Across Channels
Customer Data Integration (CDI):
Match and manage customer information from all available sources
Marketing channels: DMP, Salesforce, Adobe, Social, Direct Mail, Call Center, CRM
In other words…
We need to figure out how to
LINK people across systems!
23. @joe_Caserta
Standardization and Matching
Cleanse and Parse:
• Names
• Resolve nicknames
• Create deterministic hash, phonetic
representation
• Addresses
• Emails
• Phone Numbers
Matching:
Join based on combinations of cleansed
and standardized data to create match
results:
Spark map operations:
• Data cleansing, transformation, and
standardization
– Address Parsing: usaddress, postal-
address, etc
– Name Hashing: fuzzy, etc
– Genderization: sexmachine, etc
24. @joe_Caserta
Mastering Unmanageable Source Data
Reveal
• Wait for the customer to “reveal” themselves
• Create link between anonymous self and known profile
Vector
• May need behavioral statistical profiling
• Compare use vectors
Rebuild
• Recluster all prior activities
• Rebuild the Graph
25. @joe_Caserta
The Matching Process
The matching process output gives us the relationships between customers:
Great, but it’s not very useable, you need to traverse the dataset to find out 1234 and 1235 are the
same person (and this is a trivial case)
And we need to cluster and identify our survivors (vertex)
xid yid match_type
1234 4849 phone
4849 5499 email
5499 1235 address
4849 7788 cookie
5499 7788 cookie
4849 1234 phone
26. @joe_Caserta
Graph to the Rescue
1234 4849
5499
7788
We just need to import our edges into a graph
and “dump” out communities
Don’t think table…
think Graph! These matches are
actually communities
1235
27. @joe_Caserta
Connected Components algorithm labels each connected component of the
graph with the ID of its lowest-numbered vertex
This lowest number vertex can serve as our “survivor” (not field survivorship)
Connected Components
xid yid
1234 4849
1234 5499
1234 1235
1234 7788
1234 7788
1234 1234
34. @joe_Caserta
• Data Lake: S3 / NoSQL or SQL as Needed
• Identity Resolution: Spark / Graph Frames
• ETL: Spark/Python
• Path-to-Purchase: S3 / Spark / MLlib
• Data Warehouse: RedShift
• Shared Interface: Notebooks
• Business Intelligence: Tableau
Recap
35. @joe_Caserta
Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
@joe_Caserta
• Award-winning company
• Transformative Data Strategies
• Modern Data Engineering
• Advanced Architecture
• Innovation Partner
• Strategic Consulting
• Advanced Technical Design
• Build & Deploy Solutions
• BDW Meetup
• New York City
• 3,000+ members
• Knowledge sharing
Data is not important, it’s what you do with it that’s important!
Thank You