Joe Caserta, President of Caserta Concepts, presented at the 3rd Annual Enterprise DATAVERSITY conference. The emphasis of this year's agenda is on the key strategies and architecture necessary to create a successful, modern data analytics organization.
Joe Caserta presented Incorporating the Data Lake into Your Analytics Architecture.
For more information on the services offered by Caserta Concepts, visit our website at http://casertaconcepts.com/.
2. @joe_Caserta
Caserta Timeline
• 1986–1996: OLTP data modeling and reporting
• 1996: Dedicated to dimensional data warehousing
• 2001–2015 milestones:
• Caserta Concepts founded
• Co-author, with Ralph Kimball, of The Data Warehouse ETL Toolkit (Wiley)
• Web log analytics solution published in Intelligent Enterprise
• Launched Training practice, teaching and mentoring data warehousing concepts worldwide
• Laser focus on extending Data Warehouses with Big Data solutions
• Launched Big Data practice
• Partnered with Big Data vendors Cloudera, Hortonworks, IBM, Cisco, Datameer, Basho, and more
• Launched the Big Data Warehousing (BDW) Meetup in NYC, now 3,000 members
• Established best practices for big data ecosystem implementation in Healthcare, Finance, and Insurance
• Dedicated to Data Governance techniques on Big Data (Innovation)
• Launched Data Science, Data Interaction, and Cloud practices
• Awarded for getting data out of SAP for enterprise data analytics
• Named one of the Top 20 Most Powerful Big Data Companies
• America's Fastest Growing Private Companies, ranked #740
3. @joe_Caserta
About Caserta Concepts
• Consulting firm focused on Data Innovation and a Modern Data Engineering approach to solving highly complex business data challenges
• Award-winning company
• Internationally recognized work force
• Mentoring, Training, Knowledge Transfer
• Strategy, Architecture, Implementation
• Innovation Partner
• Transformative Data Strategies
• Modern Data Engineering
• Advanced Architecture
• Leader in architecting and implementing enterprise data solutions
• Data Warehousing
• Business Intelligence
• Big Data Analytics
• Data Science
• Data on the Cloud
• Data Interaction & Visualization
• Strategic Consulting
• Technical Design
• Build & Deploy Solutions
5. @joe_Caserta
The Future is Today
As a "Mindful Cyborg," Chris Dancy utilizes up to 700 sensors, devices, applications, and services to track, analyze, and optimize as many areas of his existence as possible. Data quantification enables him to see the connections in otherwise invisible data, resulting in dramatic upgrades to his health, productivity, and quality of life.
6. @joe_Caserta
The Progression of Data Analytics
Source: Gartner
• Descriptive Analytics: What happened? (Reports)
• Diagnostic Analytics: Why did it happen? (Correlations)
• Predictive Analytics: What will happen? (Predictions)
• Prescriptive Analytics: How can we make it happen? (Recommendations)
Business value increases with data analytics sophistication.
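The four stages above can be illustrated on a toy, made-up monthly series. This is only a sketch: the data, the naive extrapolation, and the recommendation rule are illustrative assumptions, not anything from the presentation.

```python
# Toy illustration of the analytics progression on made-up monthly data.
from statistics import mean

months = [1, 2, 3, 4, 5, 6]
ad_spend = [10, 12, 15, 18, 22, 27]
sales = [100, 118, 149, 180, 221, 270]

def pearson(xs, ys):
    """Pearson correlation, computed by hand for portability."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Descriptive: what happened? (a report)
print("average monthly sales:", mean(sales))          # 173

# Diagnostic: why did it happen? (a correlation)
print("spend vs. sales correlation:", round(pearson(ad_spend, sales), 3))

# Predictive: what will happen? (a naive linear extrapolation)
slope = (sales[-1] - sales[0]) / (months[-1] - months[0])
forecast = sales[-1] + slope
print("month 7 forecast:", forecast)                  # 304.0

# Prescriptive: how can we make it happen? (a simple recommendation rule)
target = 300
needed_spend = ad_spend[-1] * target / forecast
print("recommended ad spend:", round(needed_spend, 1))
```

Each step reuses the same data but answers a progressively harder question, which is the point of the Gartner progression.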
7. @joe_Caserta
The Progression of Data Analytics
Source: Gartner
Cognitive Computing / Cognitive Data Analytics spans the full progression, from Reports and Correlations through Predictions and Recommendations.
8. @joe_Caserta
Innovation is the only sustainable competitive advantage a company can have
Innovations may fail, but companies that don’t innovate will fail
10. @joe_Caserta
The Evolution of Modern Data Engineering
[Diagram] Source systems (Enrollments, Claims, Finance, and others) feed two environments via ETL: a traditional EDW serving ad-hoc/canned reporting and traditional BI, and a Data Lake, a horizontally scalable environment optimized for analytics, built on the Hadoop Distributed File System (HDFS) across nodes N1–N5 with Spark, MapReduce, and Pig/Hive engines alongside NoSQL databases, serving ad-hoc query, canned reporting, Big Data Analytics, and Data Science.
11. @joe_Caserta
"…any decent sized enterprise will have a variety of different data technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it." - Martin Fowler
Think Ecosystem, Not Tech Stack
12. @joe_Caserta
Proven Methods for Building Analytics Platforms
• Requirements Gathering: Business Interviews
• Design: Top Down / Bottom Up
• Data Profiling: Data quality assessment
• Data Modeling: Create Facts and Dimensions
• Extract Transform Load: From source to a Data Warehouse
• BI Tool: Semantic Layer, Dashboards
• Reporting: Report development and distribution
• Data Governance: Mostly up front
• Analytics: Prepare data for SAS, predictive modeling
13. @joe_Caserta
The New Conversation
• Do we need a Data Warehouse at all?
• If we do, does it need to be relational?
• Should we leverage Hadoop or NoSQL?
• Can we get to Machine Learning faster?
• Which platform and language are we going to code in?
• Which Apache Project should we put in production?
14. @joe_Caserta
Why Change?
New technologies are great and all… but what drives our adoption of new technologies and techniques?
• Data has changed: semi-structured, unstructured, sparse, and evolving schemas
• Volumes have changed: GB to TB to PB workloads
• Cracks in the armor of the traditional Data Warehousing approach!
Most Importantly:
Companies that innovate to leverage their data win!
15. @joe_Caserta
Cracks in the Data Warehouse Armor
• Onboarding new data is difficult!
• Data structures are rigid!
• Data Governance is slow!
• Disconnected from business needs:
New Requirement:
"Hey, I need to munge some new data to see if it has value."
Wait! We have to…
• Profile, analyze, and conform the data
• Change data models and load it into dimensional models
• Build a semantic layer that nobody is going to use
• Create a dashboard we hope someone will notice
…and then you can have at it 3–6 months later to see if it has value!
16. @joe_Caserta
Is Traditional Data Warehousing All Wrong?
NO!
The concept of a Data Warehouse is sound:
• Consolidating data from disparate source systems
• Clean and conformed reference data
• Clean and integrated business facts
• Data governance (a more pragmatic version)
We can be more successful by acknowledging that the EDW can't solve all problems.
17. @joe_Caserta
So what’s missing?
The Data Lake
A storage and processing layer for all data
• Store anything: source data, semi-structured, unstructured, structured
• Keep it as long as needed
• Support a number of processing workloads
• Scale-out
…and here is where Hadoop can help us!
18. @joe_Caserta
Hadoop (Typically) Powers the Data Lake
Hadoop provides us:
• Distributed storage (HDFS)
• Resource management (YARN)
• Many workloads, not just MapReduce
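As a toy illustration of the MapReduce model that Hadoop generalized beyond, here is the canonical word count in the map/shuffle/reduce shape. This sketch runs locally in plain Python; on Hadoop the same shape runs distributed across HDFS blocks under YARN.

```python
# Word count in the MapReduce style: map emits (word, 1) pairs,
# the shuffle groups pairs by key, and reduce sums the counts.
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for each word in the line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by word and sum the counts."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["the data lake", "the data warehouse"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(pairs)
print(counts)  # {'the': 2, 'data': 2, 'lake': 1, 'warehouse': 1}
```

The same map and reduce functions, unchanged, are what a framework like Hadoop Streaming would fan out across a cluster; the engine handles distribution, not the logic.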
19. @joe_Caserta
Data Governance for the Data Lake
• Organization: this is the 'people' part; establishing an Enterprise Data Council, Data Stewards, etc.
• Metadata: definitions, lineage (where does this data come from?), business definitions, technical metadata
• Privacy/Security: identify and control sensitive data; regulatory compliance
• Data Quality and Monitoring: data must be complete and correct; measure, improve, certify
• Business Process Integration: policies around data frequency, source availability, etc.
• Master Data Management: ensure consistent business-critical data, i.e. Members, Providers, Agents, etc.
• Information Lifecycle Management (ILM): data retention, purge schedule, storage/archiving
20. @joe_Caserta
Data Governance for the Data Lake
Extending each governance discipline for Big Data:
• Organization: add Big Data to the overall framework and assign responsibility; add data scientists to the Stewardship program; assign stewards to new data sets (Twitter, call center logs, etc.)
• Metadata: larger scale and new datatypes; integrate with the Hive Metastore, HCatalog, and home-grown tables
• Privacy/Security: secure and mask multiple data types (not just tabular); data detection and masking on unstructured data upon ingest
• Data Quality and Monitoring: probably home grown (Drools?); quality checks not only in SQL but also machine learning, Pig, and MapReduce; acting on quality checks over large data sets may require distribution
• Business Process Integration: near-zero latency, DevOps, a core component of business operations
• Master Data Management: graph databases are more flexible than relational; a lower-latency service is required; distributed data quality and matching algorithms
• Information Lifecycle Management (ILM): deletes are more uncommon (unless there is a regulatory requirement); take advantage of compression and archiving (like AWS Glacier)
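As a concrete illustration of "measure, improve, certify," a completeness check might look like the sketch below. The field names and the 95% threshold are illustrative assumptions; a real implementation might be a home-grown Drools, Pig, or MapReduce job as the slide suggests.

```python
# Sketch of a simple data-quality completeness check for a member data set.
# Required fields and the certification threshold are illustrative assumptions.
REQUIRED_FIELDS = ["member_id", "plan", "effective_date"]

def completeness(records, field):
    """Fraction of records where the field is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def certify(records, threshold=0.95):
    """Measure completeness per required field and certify the data set."""
    scores = {f: completeness(records, f) for f in REQUIRED_FIELDS}
    return {"scores": scores,
            "certified": all(s >= threshold for s in scores.values())}

members = [
    {"member_id": "M1", "plan": "gold", "effective_date": "2015-01-01"},
    {"member_id": "M2", "plan": "", "effective_date": "2015-02-01"},
]
report = certify(members)
print(report["certified"])  # False: 'plan' is only 50% complete
```

On a real lake the per-field measurement would be distributed (the slide's point about large data sets), but the certify gate stays the same.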
21. @joe_Caserta
The Big Data Pyramid
[Diagram] A layered pyramid: raw data is ingested at the base; the next layer organizes, defines, and completes it (Data Catalog); above that, munging, blending, and machine learning integrate the data (Data Integration); the top layer holds fully governed (trusted) data serving arbitrary/ad-hoc queries and reporting. Data Quality and Monitoring plus Metadata, ILM, and Security span the layers, and both usage pattern and data governance rigor increase toward the top.
23. @joe_Caserta
Peeling back the layers… The Landing Area
• Source data in its full fidelity
• Programmatically Loaded
• Partitioned for data processing
• No governance other than catalog and ILM (Security and Retention)
• Consumers: Data Scientists, ETL Processes, Applications
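The landing-area pattern above (programmatic loading, partitioning, catalog plus ILM only) can be sketched in a few lines of Python. The path layout, field names, and retention values are illustrative assumptions, not part of the deck.

```python
# Minimal sketch of a programmatic landing-area load: raw files are kept in
# full fidelity and laid out under date partitions for downstream processing.
# Path layout and catalog fields are illustrative assumptions.
from datetime import date

def landing_path(source, ingest_date):
    """Build a date-partitioned landing path, e.g. /landing/claims/2015/06/01."""
    return "/landing/{}/{:%Y/%m/%d}".format(source, ingest_date)

def catalog_entry(source, ingest_date, retention_days, classification):
    """The only governance at this layer: a catalog record plus ILM
    attributes (security classification and retention)."""
    return {
        "path": landing_path(source, ingest_date),
        "source": source,
        "ingested": ingest_date.isoformat(),
        "retention_days": retention_days,
        "classification": classification,
    }

entry = catalog_entry("claims", date(2015, 6, 1),
                      retention_days=2555, classification="sensitive")
print(entry["path"])  # /landing/claims/2015/06/01
```

Note that nothing here profiles or conforms the data: by design, the landing area defers all of that to the layers above.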
24. @joe_Caserta
Data Lake
• Enriched, lightly integrated
• Data is accessible in the Hive Metastore
• Either processed into tabular relations
• Or exposed via Hive SerDes directly on the raw data
• Partitioned for data access
• Governance additionally includes a guarantee of completeness
• Consumers: Data Scientists, ETL Processes, Applications, Data Analysts
25. @joe_Caserta
Data Science Workspace
• No barrier for onboarding and analysis of new data
• Blending of new data with the entire Data Lake, including the Big Data Warehouse
• Data Scientists enrich data with insight
• Consumers: Data Scientists
26. @joe_Caserta
Big Data Warehouse
• Data is Fully Governed
• Data is Structured
• Partitioned/tuned for data access
• Governance includes a guarantee of completeness and accuracy
• Consumers: Data Scientists, ETL Processes, Applications, Data Analysts, and Business Users (the masses)
27. @joe_Caserta
Polyglot Warehouse
We promote the concept that the Big Data Warehouse may live in one or more platforms:
• Full Hadoop Solutions
• Hadoop plus MPP or Relational
Supplemental technologies:
• NoSQL: columnar, key-value, time-series, graph
• Search Technologies
28. @joe_Caserta
Hadoop is the Data Warehouse?
• Hadoop can be the entire data pyramid platform, including the landing area, the Data Lake, and the Big Data Warehouse
• It especially serves as the Data Lake and "Refinery"
• Query engines such as Hive and Impala provide SQL support
29. @joe_Caserta
The Refinery
• The feedback loop between Data Science and the Data Warehouse is critical
• Successful data science work products must graduate into the appropriate layers of the Data Lake
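The graduation step can be modeled as a gate: a work product is promoted into a governed layer only when it passes that layer's checks. A minimal sketch, where the layer names echo the pyramid but the check flags and rules are assumptions:

```python
# Sketch of the "graduation" feedback loop: a data science work product moves
# into a governed layer only when it satisfies that layer's governance bar.
# Which checks each layer requires is an illustrative assumption.
LAYER_CHECKS = {
    "data_lake": ["cataloged", "complete"],                       # completeness guaranteed
    "big_data_warehouse": ["cataloged", "complete", "accurate"],  # plus accuracy
}

def graduate(work_product, target_layer):
    """Promote a work product if it passes the target layer's checks."""
    missing = [c for c in LAYER_CHECKS[target_layer]
               if not work_product.get(c, False)]
    if missing:
        return "rejected: missing " + ", ".join(missing)
    return "promoted to " + target_layer

model_output = {"cataloged": True, "complete": True, "accurate": False}
print(graduate(model_output, "data_lake"))           # promoted to data_lake
print(graduate(model_output, "big_data_warehouse"))  # rejected: missing accurate
```

The asymmetry is the point: the warehouse demands more than the lake, so a work product can graduate partway up the pyramid while more governance work continues.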
30. @joe_Caserta
Data Analytics on the Cloud
AWS and other cloud providers present a very powerful design pattern:
• S3 serves as the storage layer for the Data Lake
• EMR (elastic Hadoop) provides the Refinery; most clusters can be ephemeral
• The active set is stored in Redshift, MPP, or relational platforms
This eliminates a massive on-premises appliance footprint.
31. @joe_Caserta
Summary
Data Warehousing is not dead for analytics
• The principles of Data Warehousing still make sense
• Recognize gaps in feature/functionality of the Relational
Database and traditional Data Warehousing
• Extend your data ecosystem with a Data Lake
• Accept Tunable Governance
• Think Polyglot and use the right tool for the job
32. @joe_Caserta
Thank You / Q&A
Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
(914) 261-3648
@joe_Caserta
Editor's notes
Inc. 5000 – Top 6% of all IT companies in US, and #5 of 42 IT companies in NYC
DG Pyramid introduced at Strata 2015
JOE
Throwing technology at it does not solve the problem. We need architecture, engineering, and innovation. In fact, previous attempts to forklift existing processes and thinking onto Hadoop did not improve things. They needed a new way of thinking about data. We needed to build a framework to dynamically ingest somewhat unstructured data and turn it into digestible information.
JOE
With the exception of Finance, no use case required a relational database. Hadoop and the various flavors of NoSQL satisfied all data needs except to "keep the books." The reality is, the data lake and its ecosystem are evolving to become the core data system of the enterprise. So data organization, data governance, data integrity, and data security are more important than ever, and these aspects of the big data paradigm are getting better every day, making adoption more attainable. The overall solution architecture that makes all the puzzle pieces fit together and work in unity is the key element that keeps the ecosystem alive.