#DAMADay @joe_caserta
New York Chapter
Integrating Heterogeneous Data
DAMA Day Symposium
May 18, 2017
Presented by
Joe Caserta
#DAMADay @joe_caserta
Caserta Timeline
Launched Big Data practice
Co-author, with Ralph Kimball, The Data Warehouse ETL Toolkit (Wiley)
Data Analysis, Data Warehousing and Business Intelligence since 1996
Began consulting in database programming and data modeling
25+ years hands-on experience building database solutions
Founded Caserta Concepts in NYC
Web log analytics solution published in Intelligent Enterprise magazine
Launched Data Science, Data Interaction and Cloud practices
Laser focus on extending Data Analytics with Big Data solutions
1986
2004
1996
2009
2001
2013
2012
2014
Dedicated to Data Governance Techniques on Big Data (Innovation)
Awarded Top 20 Big Data Companies 2016
Top 20 Most Powerful Big Data consulting firms
Launched Big Data Warehousing (BDW) Meetup NYC: 4,000 Members
2016 Awarded Fastest Growing Big Data Companies 2016
Established best practices for big data ecosystem implementations
#DAMADay @joe_caserta
About Caserta Concepts
– Consulting Data Innovation and Modern Data Engineering
– Award-winning company
– Internationally recognized work force
– Strategy, Architecture, Implementation, Governance
– Innovation Partner
– Strategic Consulting
– Advanced Architecture
– Build & Deploy
• Leader in Enterprise Data Solutions
– Big Data Analytics
– Data Warehousing
– Business Intelligence
• Data Science
• Cloud Computing
• Data Governance
#DAMADay @joe_caserta
Caserta Client Portfolio
Retail/eCommerce & Manufacturing
Finance, Healthcare & Insurance
Digital Media/AdTech
Education & Services
#DAMADay @joe_caserta
Awards & Recognition
Top 10 Fastest Growing Big Data Companies 2016
#DAMADay @joe_caserta
Our Partners
#DAMADay @joe_caserta
Aligning Heterogeneous Data Sources
Customer journey stages: Awareness, Consideration, Purchase, Service, Loyalty, Expansion
Touchpoints (physical and digital): PR, Radio, TV, Print, Outdoor, Word of Mouth, Direct Mail, Customer Service, Search, Paid Content, email, Website/Landing Pages, Social Media, Community, Chat, Call Center, Offers, Mailings, Survey, Loyalty Programs, Agents, Partners, Ads, Website, Mobile, 3rd Party Sites, Web self-service
#DAMADay @joe_caserta
Why do we Care?
Attribution Type: Single Touch
• Assign the credit to the first or last exposure
• Example: Ad-Click 100%
Comments:
- Last touch only
- Ignores bulk of customer journey
- Undervalues other interactions and influencers
Attribution Type: Rules-Based
• Assign the credit to each interaction based on business rules
• Example: 33%, 33%, 33% across three touches (Mailing, E-mail, Ad-Click)
Comments:
- Subjective
- Assigns arbitrary values to each interaction
- Lacks analytics rigor to determine weights
Attribution Type: Statistically Driven
• Assign the credit to interactions based on data-driven model
• Example: 27%, 49%, 24% across three touches (Mailing, E-mail, Ad-Click)
Comments:
✓ Looks at full behavior patterns
✓ Considers all touch points
✓ Can apply different models for best results
✓ Uses data to find correlations between touch points (winning combinations)
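The three attribution types can be sketched in a few lines of Python. This is only an illustration, not the presenter's implementation: the function names are hypothetical, the rules-based rule is assumed to be an even split, and the data-driven weights are assumed to come from a separately fitted model.

```python
from typing import Dict, List


def single_touch(journey: List[str], use_last: bool = True) -> Dict[str, float]:
    """Give 100% of the credit to the last (or first) touch in the journey."""
    winner = journey[-1] if use_last else journey[0]
    return {winner: 1.0}


def rules_based(journey: List[str]) -> Dict[str, float]:
    """Assumed business rule: split the credit evenly across all touches."""
    share = 1.0 / len(journey)
    credit: Dict[str, float] = {}
    for touch in journey:
        credit[touch] = credit.get(touch, 0.0) + share
    return credit


def statistically_driven(journey: List[str], weights: Dict[str, float]) -> Dict[str, float]:
    """Normalize model-supplied weights over the touches in this journey."""
    total = sum(weights[touch] for touch in journey)
    credit: Dict[str, float] = {}
    for touch in journey:
        credit[touch] = credit.get(touch, 0.0) + weights[touch] / total
    return credit


journey = ["Mailing", "E-mail", "Ad-Click"]
# Hypothetical weights standing in for the output of a fitted model.
model_weights = {"Mailing": 1.0, "E-mail": 2.0, "Ad-Click": 1.0}

last_touch_credit = single_touch(journey)
even_credit = rules_based(journey)
model_credit = statistically_driven(journey, model_weights)
```

Only the third function depends on data, which is the point of the slide: the first two encode an opinion about the journey, while the last one lets measured weights decide.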
#DAMADay @joe_caserta
Onboarding New Data
Business: “I need to analyze some new data”
✓ IT collects requirements
✓ Creates normalized and/or dimensional data models
✓ Profiles and conforms the data
✓ Develops sophisticated ETL programs and quality standards
✓ Loads it into data models
✓ Builds a BI semantic layer
✓ Creates dashboards and reports
IT: “You’ll have your data in 3-6 months to see if it has value!”
– Onboarding new data is difficult!
– Rigid Structures and Data Governance
– Disconnected/removed from business
#DAMADay @joe_caserta
The New Data Paradigm
OLD WAY:
• Structure Data → Ingest Data → Analyze Data
• Fixed Capacity
• Monolith
NEW WAY:
• Ingest Data → Analyze Data → Structure Data
• Dynamic Capacity
• Ecosystem
RECIPE:
• Cloud
• Data Lake
• Holistic Architecture & Framework
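A toy sketch of the "new way" ordering (ingest, then analyze, then structure) in plain Python. The sample records and field names are invented for illustration; a real data lake would land files in cloud storage rather than in a list.

```python
import json
from collections import Counter

# Hypothetical raw events as they might arrive from a source system.
raw_lines = [
    '{"user": "a1", "event": "click", "channel": "email"}',
    '{"user": "b2", "event": "view"}',
    '{"user": "a1", "event": "purchase", "amount": 19.99}',
]

# 1. Ingest: land every record without imposing a schema first.
landing = [json.loads(line) for line in raw_lines]

# 2. Analyze: explore the data to see which fields actually occur.
field_counts = Counter(key for record in landing for key in record)

# 3. Structure: model only the fields that proved to be present everywhere.
columns = [key for key, count in field_counts.items() if count == len(landing)]
structured = [{col: record[col] for col in columns} for record in landing]
```

The structure falls out of the analysis instead of being designed up front, so a new source can be explored on the day it arrives.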
#DAMADay @joe_caserta
Corporate Data Pyramid (CDP)
Tiers (top to bottom):
• Big Data Warehouse
• Data Science Workspace
• Data Lake – Integrated Sandbox
• Landing Area – Source Data in “Full Fidelity”
Usage Pattern:
• Ingest Raw Data
• Organize, Define, Complete
• Munging, Blending, Machine Learning
• Arbitrary/Ad-hoc Queries and Reporting
Data Governance:
• Metadata, ILM, Security
• Data Catalog
• Data Quality and Monitoring
• Data Integration
• Fully Governed (trusted)
#DAMADay @joe_caserta
Why Use Spark?
• Development local or distributed is identical
• Beautiful high-level APIs
• Full universe of Python modules
• Open source and free
• Blazing fast!
Spark has become our default processing engine for data engineering & science
#DAMADay @joe_caserta
The Data Lake on the Cloud
Cloud Component                        AWS       Google    Microsoft
Scalable distributed storage           S3        GCS       Azure Storage
Pluggable fit-for-purpose processing   EMR       DataProc  HDInsight
Compute Services                       EC2       GCE       VMs
Consistent extensible framework        Spark     Spark     Spark
Dimensional MPP Data Warehouse         Redshift  BigQuery  Azure SQL Data Warehouse
Data Streaming                         Kinesis   PubSub    Azure Stream
Common Interface                       Jupyter   DataLab   Azure Notebook
• Remove barriers between data ingestion and analysis
• Democratize data with Just Enough Data Governance (JEDG)
#DAMADay @joe_caserta
The Notebook is the ETL Tool
#DAMADay @joe_caserta
Data Quality and Monitoring
• BUILD a robust data quality subsystem:
• Metadata and error event facts
• Orchestration
• Based on Data Warehouse ETL Toolkit
• Each error instance of each data quality
check is captured
• Implemented as sub-system after
ingestion
• Each fact stores unique identifier of the
defective source row
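A minimal sketch of the error event idea described above, in plain Python. The `ErrorEvent` fields, the check names, and the sample rows are assumptions made for illustration; the slide only states that each failed check is captured as a fact carrying the identifier of the defective source row.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, List


@dataclass
class ErrorEvent:
    """One fact row per failed data quality check."""
    check_name: str        # which quality check failed
    source_row_id: str     # unique identifier of the defective source row
    detected_at: datetime  # when the check ran


def run_checks(rows: List[dict], checks: Dict[str, Callable[[dict], bool]]) -> List[ErrorEvent]:
    """Run every check against every ingested row and record each failure."""
    events: List[ErrorEvent] = []
    for row in rows:
        for name, passes in checks.items():
            if not passes(row):
                events.append(ErrorEvent(name, row["row_id"], datetime.now(timezone.utc)))
    return events


# Hypothetical checks; each returns True when the row passes.
checks = {
    "email_present": lambda r: bool(r.get("email")),
    "phone_is_10_digits": lambda r: r.get("phone", "").isdigit() and len(r.get("phone", "")) == 10,
}

rows = [
    {"row_id": "r1", "email": "a@example.com", "phone": "2125550100"},
    {"row_id": "r2", "email": "", "phone": "555"},
]

events = run_checks(rows, checks)
```

Because failures are stored as facts instead of being rejected at load time, the source data stays in the lake untouched and quality can be reported and trended like any other measure.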
#DAMADay @joe_caserta
Unifying the Customer Across Channels
Customer Data Integration (CDI):
Match and manage customer information from all available sources
Marketing channels: DMP, Salesforce, Adobe, Social, Direct Mail, Call Center, CRM
In other words…
We need to figure out how to
LINK people across systems!
#DAMADay @joe_caserta
Mastering Master Data is Still MDM
Standardize
Match
Survivorship
Validate
#DAMADay @joe_caserta
Standardization and Matching
Cleanse and Parse:
• Names
• Resolve nicknames
• Create deterministic hash, phonetic
representation
• Addresses
• Emails
• Phone Numbers
Matching:
Join on combinations of cleansed and standardized data to create match results.
Spark map operations:
• Data cleansing, transformation, and
standardization
– Address Parsing: usaddress, postal-address,
etc
– Name Hashing: fuzzy, etc
– Genderization: sexmachine, etc
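The cleanse-then-match flow can be sketched with the Python standard library alone. This is an assumption-laden illustration: the nickname table, the 10-digit US phone rule, and the record layout are hypothetical, and the libraries named on the slide are not used here. In Spark, functions like these would typically be applied inside map operations.

```python
import hashlib
import re
from itertools import combinations

# Hypothetical nickname lookup; a production table would be far larger.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}


def standardize_name(name: str) -> str:
    """Lowercase, strip punctuation, and resolve nicknames token by token."""
    tokens = re.sub(r"[^a-z\s]", "", name.lower()).split()
    return " ".join(NICKNAMES.get(tok, tok) for tok in tokens)


def name_hash(name: str) -> str:
    """Deterministic hash of the standardized name."""
    return hashlib.sha256(standardize_name(name).encode("utf-8")).hexdigest()


def standardize_email(email: str) -> str:
    return email.strip().lower()


def standardize_phone(phone: str) -> str:
    """Keep digits only; assumes 10-digit US numbers."""
    return re.sub(r"\D", "", phone)[-10:]


# Each match type pairs a label with the standardized key it joins on.
MATCH_KEYS = (
    ("phone", lambda r: standardize_phone(r["phone"])),
    ("email", lambda r: standardize_email(r["email"])),
)


def match(records):
    """Return (xid, yid, match_type) for every pair sharing a standardized key."""
    results = []
    for match_type, key_fn in MATCH_KEYS:
        buckets = {}
        for rec in records:
            key = key_fn(rec)
            if key:  # skip records with no usable value for this key
                buckets.setdefault(key, []).append(rec["id"])
        for ids in buckets.values():
            for xid, yid in combinations(sorted(ids), 2):
                results.append((xid, yid, match_type))
    return results


records = [
    {"id": 1234, "name": "Bob Smith", "email": "bob@example.com", "phone": "(212) 555-0100"},
    {"id": 4849, "name": "Robert Smith", "email": "RSmith@Example.com ", "phone": "212-555-0100"},
    {"id": 5499, "name": "R. Smith", "email": "rsmith@example.com", "phone": ""},
]
match_results = match(records)
```

The output has the same shape as a match table keyed by `xid`, `yid`, and `match_type`, so it can be fed directly into a graph as edges.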
#DAMADay @joe_caserta
Mastering Unmanageable Source Data
Reveal
• Wait for the customer to “reveal” themselves
• Create link between anonymous self and known profile
Vector
• May need behavioral statistical profiling
• Compare use vectors
Rebuild
• Recluster all prior activities
• Rebuild the Graph
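One way to "compare use vectors" is cosine similarity between behavioral profiles. The slide does not name a measure, so cosine similarity, the 0.9 threshold, and the feature layout below are all assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length behavior vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_profile(anonymous: List[float],
                 known_profiles: Dict[str, List[float]],
                 threshold: float = 0.9) -> Optional[str]:
    """Return the known profile most similar to the anonymous vector, if any clears the threshold."""
    best_id, best_score = None, threshold
    for profile_id, vector in known_profiles.items():
        score = cosine_similarity(anonymous, vector)
        if score >= best_score:
            best_id, best_score = profile_id, score
    return best_id


# Hypothetical features: visits per category (news, sports, shopping).
known = {"1234": [5.0, 0.0, 1.0], "5499": [0.0, 4.0, 4.0]}
linked = best_profile([10.0, 0.0, 2.0], known)
unlinked = best_profile([1.0, 1.0, 0.0], known)
```

A link found this way is probabilistic, which is why the slide pairs it with a rebuild step: once a customer reveals themselves, prior activity can be reclustered with a deterministic edge.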
#DAMADay @joe_caserta
The Matching Process
The matching process output gives us the relationships between customers:
xid   yid   match_type
1234  4849  phone
4849  5499  email
5499  1235  address
4849  7788  cookie
5499  7788  cookie
4849  1234  phone
Great, but it’s not very usable: you need to traverse the dataset to find out that 1234 and 1235 are the same person (and this is a trivial case).
We also need to cluster the matches and identify our survivors (vertices).
#DAMADay @joe_caserta
Graph to the Rescue
Don’t think table… think Graph!
We just need to import our edges into a graph and “dump” out communities.
These matches are actually communities.
[Graph diagram: vertices 1234, 4849, 5499, 7788, 1235]
#DAMADay @joe_caserta
Connected Components
The Connected Components algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex.
This lowest-numbered vertex can serve as our “survivor” (not field survivorship).
xid   yid
1234  4849
1234  5499
1234  1235
1234  7788
1234  7788
1234  1234
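The labeling behavior can be reproduced with a small union-find in plain Python. This is a stand-in for a distributed graph library's connected components routine, not its implementation; the function name is hypothetical, and the edges are the match pairs from the matching process output.

```python
from typing import Dict, Iterable, Tuple


def connected_components(edges: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Map every vertex to the lowest vertex ID in its connected component."""
    parent: Dict[int, int] = {}

    def find(v: int) -> int:
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving keeps trees shallow
            v = parent[v]
        return v

    for x, y in edges:
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            # Always attach the larger root under the smaller one,
            # so each root stays the lowest ID in its component.
            if root_x < root_y:
                parent[root_y] = root_x
            else:
                parent[root_x] = root_y

    return {v: find(v) for v in list(parent)}


edges = [
    (1234, 4849),  # phone
    (4849, 5499),  # email
    (5499, 1235),  # address
    (4849, 7788),  # cookie
    (5499, 7788),  # cookie
    (4849, 1234),  # phone
]
survivors = connected_components(edges)
```

Every vertex in this example resolves to 1234, so 1234 and 1235 are identified as the same person without traversing the match table by hand.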
#DAMADay @joe_caserta
Identity Resolution Process
#DAMADay @joe_caserta
The BDW is still Dimensional
#DAMADay @joe_caserta
Use Graph for Data Lineage
#DAMADay @joe_caserta
Sample Solution Architecture
#DAMADay @joe_caserta
Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
@joe_Caserta
• Award-winning company
• Transformative Data Strategies
• Modern Data Engineering
• Advanced Architecture
• Innovation Partner
• Strategic Consulting
• Advanced Technical Design
• Build & Deploy Solutions
• BDW Meetup
• New York City
• 3,000+ members
• Knowledge sharing
Data is not important, it’s what you do with it that’s important!
Thank You
