Data Never Sleeps
Every minute:
Facebook users share 216,302 photos
Dropbox users upload 833,333 new files
YouTube users share 400 hours of new video
Twitter users send 350,000 tweets
A Boeing 737 aircraft in flight generates 40 TB of data
EDW vs Data Lake
Data Lake is built on the premise that every drop of
data is valuable
It's a place for capturing and exploring huge
volumes of raw data that a business generates
Explorers are diverse: business analysts, data
scientists, …
even business managers (using self-service)
Goals of exploration may be loosely defined
EDW vs Data Lake
EDW stores filtered and processed data
For predetermined usage scenarios
Traditionally structured in the form of ‘cubes’
Analogy
Difference between a college library (focused on
curriculum) and the US Library of Congress
EDW vs Data Lake
[Diagram] Schema-on-Read vs Schema-on-Write:
The Data Lake ingests raw data in many formats (XML, JSON, CSV, PDF) from sources such as trading partners, REST APIs, invoicing and orders databases; consumers such as CRM analytics, SCM analytics and a recommendation engine read / extract what they need, applying schema on read.
The Enterprise Data Warehouse takes the same sources through ETL into a schema-on-write store serving Sales, Operations and Marketing.
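To make the schema-on-read idea concrete, here is a minimal sketch, assuming Spark is available on the cluster and that raw JSON order events have already landed in HDFS at a hypothetical path (the field names customerId and amount are likewise assumptions). The schema is inferred at query time rather than imposed at load time.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SchemaOnReadExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("schema-on-read-demo")
                .getOrCreate();

        // Raw JSON files dropped into the lake as-is; no schema was imposed at write time.
        // The path is hypothetical; adjust to wherever raw events land in your HDFS.
        Dataset<Row> orders = spark.read().json("hdfs:///datalake/raw/orders/");

        // The schema is inferred now, at read time, from the data itself.
        orders.printSchema();

        // Analysts project only the fields relevant to this particular exploration.
        orders.createOrReplaceTempView("orders");
        spark.sql("SELECT customerId, SUM(amount) AS total FROM orders GROUP BY customerId")
             .show();

        spark.stop();
    }
}
```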
Why Think of Data Lake
Business Drivers
Diverse sources of data: transactions, interactions, human
and machine-generated
Routine analysis not enough – deeper insights lead to
differentiation
Agile and Adaptive Business Models
Technology Drivers
Fast, cheap and scalable storage (e.g. HDFS)
Diverse data-processing engines (e.g. NoSQL)
Infinitely elastic processing power (cluster of commodity
servers)
What Features Should It Support
Scalable Storage Layer
3 V’s of Data Inflow
Data Discovery
Data Governance
Pluggable and Extensible Analytics
Elastic Processing Power
Multi-stakeholder and Multi-tenant Access
Building It On Top Of Hadoop
Data Lake doesn’t have to be Hadoop
But Hadoop has proven its prowess on planet-scale
data, in terms of:
Data Volumes
Elastic Data Processing Power
The idea of a Data Lake was probably inspired by
Hadoop
Naturally, a Data Lake architecture is most often
built around Hadoop
Storage Capacity: Metrics
Normally HDFS scales even with one NameNode
Unless you have hundreds of petabytes of data
But you need to monitor the usage pattern
Are you creating too many small files (what’s the
average number of blocks per file)?
How much RAM would you need for the NameNode? (a
high value could mean larger GC pauses)
Internal Load (heartbeats and block reports) vs
External Get and Create Requests
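As a rough illustration of the NameNode-RAM question above, the sketch below estimates heap usage from file and block counts. It assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file, directory or block); the counts are hypothetical and the result is an order-of-magnitude guide, not a sizing formula.

```java
public class NameNodeHeapEstimate {
    // Commonly quoted rule of thumb: ~150 bytes of NameNode heap per namespace
    // object (file or block). This is an assumption, not an exact figure;
    // actual usage varies with Hadoop version and configuration.
    private static final long BYTES_PER_OBJECT = 150;

    public static void main(String[] args) {
        long files = 200_000_000L;       // hypothetical: 200 million files
        double blocksPerFile = 1.5;      // hypothetical average; near 1.0 hints at many small files
        long blocks = (long) (files * blocksPerFile);

        long heapBytes = (files + blocks) * BYTES_PER_OBJECT;
        System.out.printf("Estimated NameNode heap: ~%.1f GB%n",
                heapBytes / (1024.0 * 1024 * 1024));
        // ~200M files + ~300M blocks => ~500M objects * 150 B => ~70 GB of heap,
        // which is where long GC pauses (and Federation) become a real concern.
    }
}
```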
Storage Capacity: HDFS Federation
[Diagram] Single NameNode vs NameNode Federation:
With a single NameNode, all DataNodes (DataNode1 … DataNodeN) report to one NameNode, which also serves every client get / create request on top of the internal load of heartbeats and block reports.
With Federation, multiple NameNodes (NameNode1, NameNode2, …) each manage their own block pool (Block Pool1, Block Pool2, …) over the same set of DataNodes, splitting the namespace and the load.
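Clients of a federated cluster usually still want to see one logical namespace. Below is a hedged client-side sketch using ViewFs to map mount points onto the individual namespaces; the cluster name, hostnames and mount paths are assumptions for illustration, and in practice they would live in core-site.xml rather than be set in code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FederatedClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Present the federated namespaces as a single ViewFs namespace.
        // Cluster name and mount points below are hypothetical.
        conf.set("fs.defaultFS", "viewfs://dataLakeCluster/");
        conf.set("fs.viewfs.mounttable.dataLakeCluster.link./raw",
                 "hdfs://namenode1:8020/raw");
        conf.set("fs.viewfs.mounttable.dataLakeCluster.link./curated",
                 "hdfs://namenode2:8020/curated");

        FileSystem fs = FileSystem.get(conf);
        // Paths under /raw and /curated transparently resolve to different NameNodes.
        System.out.println(fs.exists(new Path("/raw")) + " " + fs.exists(new Path("/curated")));
    }
}
```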
Storage Capacity: Availability
NameNode Federation does not ensure HA
Even if you don’t go for Federation, configuring high
availability is recommended
Essentially set up a Standby NameNode
Active NameNode shares state with the Standby
Using the Quorum Journal Manager (shared edit log), or
Simply using an NFS-mounted shared directory
Synchronization frequency is configurable
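As an illustration of the standby setup described above, here is a hedged sketch of the properties involved, expressed through the Hadoop Configuration API; the nameservice id, hostnames and journal URI are placeholders, and a real deployment would put these in hdfs-site.xml rather than in code.

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsHaConfigSketch {
    public static Configuration haConfiguration() {
        Configuration conf = new Configuration();

        // One logical nameservice backed by an active and a standby NameNode.
        conf.set("dfs.nameservices", "lakecluster");                 // hypothetical id
        conf.set("dfs.ha.namenodes.lakecluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.lakecluster.nn1", "nn1-host:8020");
        conf.set("dfs.namenode.rpc-address.lakecluster.nn2", "nn2-host:8020");

        // Shared edit log: either a JournalNode quorum ...
        conf.set("dfs.namenode.shared.edits.dir",
                 "qjournal://jn1:8485;jn2:8485;jn3:8485/lakecluster");
        // ... or, more simply, an NFS-mounted directory such as file:///mnt/shared/edits

        // Clients fail over between the two NameNodes transparently.
        conf.set("dfs.client.failover.proxy.provider.lakecluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        conf.set("fs.defaultFS", "hdfs://lakecluster");
        return conf;
    }
}
```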
Compute Capacity
Hadoop 1.0 supported one type of job (MapReduce)
MR jobs were scheduled by a ‘JobTracker’ process
Hadoop 2.0 offers a Resource Manager (YARN)
It is intended to replace the JobTracker and raise the
practical cluster size limit from about 3,000 to 10,000 nodes
But more importantly: YARN lets different types of
jobs, not just MR, run on Hadoop
Hence Data Lake should preferably be built on YARN
Compute Capacity: YARN
[Diagram] YARN architecture: a cluster-wide ResourceManager accepts submissions from MR and Spark clients; on each node a NodeManager hosts per-application masters and tasks (for example, Node 1 running an MR ApplicationMaster and a Spark task, Node 2 running a Spark ApplicationMaster and an MR task).
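To see the "different job types on one cluster" point in practice, here is a minimal sketch using the YarnClient API, assuming yarn-site.xml on the classpath points at a reachable ResourceManager; it lists running applications and their types (e.g. MAPREDUCE vs SPARK).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        // Assumes yarn-site.xml on the classpath identifies the ResourceManager.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // One ResourceManager, many application types sharing the same cluster.
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.printf("%s  type=%s  state=%s%n",
                    app.getApplicationId(), app.getApplicationType(),
                    app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}
```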
Data Inflow
The goal is to build a pipeline into Hadoop-native
data stores
HDFS, mandatorily
Hive and HBase, preferably
Considering the variety of data formats that a Data
Lake must accommodate:
A general purpose Data Integration Tool must be chosen
For example, Pentaho Data Integration (PDI)
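Whatever integration tool is chosen, the end of a batch pipeline is a write into HDFS. As a minimal, hedged sketch of that last hop (not PDI itself; the local and HDFS paths are hypothetical), here is the Hadoop FileSystem API landing a local extract into the lake:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LandFileInHdfs {
    public static void main(String[] args) throws Exception {
        // Assumes core-site.xml / hdfs-site.xml on the classpath point at the cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths: a local CSV extract landing in a date-partitioned raw zone.
        Path source = new Path("file:///var/exports/orders-2016-05-01.csv");
        Path target = new Path("/datalake/raw/orders/dt=2016-05-01/");

        fs.mkdirs(target);
        fs.copyFromLocalFile(source, target);   // the actual ingest step
        System.out.println("Landed " + source + " under " + target);
    }
}
```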
Data Inflow
Pipelines specialized for specific data formats may
also be plugged in
[Diagram] Ingest paths into HDFS: a general-purpose flow in which flat-file (.txt) and web-service (.json) input connectors feed an HDFS output connector, alongside specialized pipelines such as Sqoop for database tables and Flume for log data.
Data Inflow: Streaming Data
Streaming Data may be processed in two ways
Simply store in the Data Lake for future analysis
Interesting tweets for building a sentiment analysis model
Store and Forward to a Real-time Analytics Engine
Even as real-time processing occurs, the raw source
data may be useful in the future
To build / update machine learning models, for example
in fraud analytics
[Diagram] Streaming data is either stored directly in HDFS, or stored in HDFS and forwarded to a real-time analytics engine.
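The sketch below illustrates the store-and-forward pattern with Spark Streaming (one possible engine, not prescribed by the slides): each micro-batch is persisted to HDFS in raw form and, in the same pass, handed to a downstream real-time step. The host, port, paths and the forwarding step are hypothetical placeholders.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StoreAndForward {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("store-and-forward");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Hypothetical source: raw events arriving on a socket (could equally be Kafka or Flume).
        JavaDStream<String> events = jssc.socketTextStream("ingest-host", 9999);

        // STORE: keep the raw events in HDFS for future model building.
        events.dstream().saveAsTextFiles("hdfs:///datalake/raw/events/batch", "txt");

        // FORWARD: in the same pass, push each micro-batch to a real-time step
        // (here it is only counted; in practice this would call e.g. a fraud-scoring service).
        events.foreachRDD(rdd -> System.out.println("forwarded " + rdd.count() + " events"));

        jssc.start();
        jssc.awaitTermination();
    }
}
```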
Data Analytics
A Data Lake built on HDFS will most likely use a
Hadoop cluster to analyze data
Sometimes the result of the analysis may be stored
back into HDFS (or possibly Hive / HBase)
But Data Visualization and Reporting / Dashboards
may work only on structured data cubes
Hence on the Analytics side, a Data Lake may need
outflow paths from HDFS into structured data stores
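As one hedged sketch of such an outflow path (Spark is assumed here only as a convenient vehicle; the results directory, reporting database and credentials are hypothetical, and the JDBC driver must be on the classpath), analysis results in HDFS can be pushed into a relational table that reporting and OLAP tools understand:

```java
import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class OutflowToReportingStore {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hdfs-to-reporting-store")
                .getOrCreate();

        // Analysis results previously written back into HDFS (hypothetical path and format).
        Dataset<Row> results = spark.read().parquet("hdfs:///datalake/analyzed/daily_sales/");

        // Push the structured results into the reporting database feeding dashboards.
        Properties jdbcProps = new Properties();
        jdbcProps.setProperty("user", "report_user");           // placeholder credentials
        jdbcProps.setProperty("password", "report_password");

        results.write()
               .mode(SaveMode.Overwrite)
               .jdbc("jdbc:postgresql://reporting-db:5432/marts", "daily_sales", jdbcProps);

        spark.stop();
    }
}
```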
Plugging In Data Analytics Engine
Jaspersoft Reporting with HDFS
[Diagram] Jaspersoft reporting with HDFS: analyzed data in HDFS is pulled through an HDFS input connector by Jaspersoft ETL into an OLAP cube, which the Jaspersoft reporting engine then queries.
Data Governance
Data Lake does not conform to a schema
Data Governance makes it possible to make sense
of the data
For both analysts and administrators
Data Governance is a fairly open-ended subject
Vendors offer different techniques to solve each
governance use case
But common patterns are emerging across the landscape
Data Governance: Analyst Use Cases
To search and retrieve ‘relevant’ data for analysis
Common Techniques
Metadata Management
Data tagging
Text Search
Data Classification
Metadata can include technical as well as business
information (linked to a Business Glossary)
Data tags are often created by users collaboratively
Data Governance: Admin Use Cases
Lineage: track data flow from source to end applications
Data Life-cycle Management: retain, replicate and archive based on usage
Auditing: track access and usage information for compliance
Automated Metadata Generation
As data is ingested, suitable attributes are extracted
and stored into a metadata repository
Data type (XML, PDF, text, etc.)
Data size
Creation and last access time, etc.
Even data tags can be inserted at the time of ingest
Unconditionally, e.g. 'sales'
Conditionally, e.g. 'holiday_sales'
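A hedged sketch of that idea: at ingest time the file's technical attributes are read from HDFS and a tag is attached conditionally. The metadata "repository" here is just an in-memory map and the holiday rule is a made-up example; a real implementation would push these attributes into a catalog such as Apache Atlas.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IngestMetadataSketch {
    public static Map<String, String> describe(Path ingested) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(ingested);

        Map<String, String> metadata = new HashMap<>();
        // Technical attributes extracted automatically at ingest time.
        String name = ingested.getName();
        metadata.put("dataType", name.substring(name.lastIndexOf('.') + 1)); // e.g. xml, pdf, csv
        metadata.put("sizeBytes", String.valueOf(status.getLen()));
        metadata.put("lastModified", String.valueOf(status.getModificationTime()));
        metadata.put("lastAccessed", String.valueOf(status.getAccessTime()));

        // Unconditional tag for everything arriving through this pipeline.
        metadata.put("tag", "sales");
        // Conditional tag: a made-up rule keyed off the landing path.
        if (ingested.toString().contains("/dt=2016-12-")) {
            metadata.put("tag", "holiday_sales");
        }
        return metadata;   // in practice, persist to the metadata repository instead
    }
}
```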
Apache Atlas For Data Governance
Source: http://atlas.incubator.apache.org/Architecture.html
Data Access And Security
By default HDFS is secured using
Kerberos for authentication, and
Unix-style file permissions for authorization
In a large data repository with diverse stakeholders
you may need more control
If so, a couple of products may be considered for
augmenting Data Security:
Apache Knox
Apache Ranger
Data Access And Security
[Diagram] Layered HDFS security: Apache Knox provides perimeter security at the cluster edge, Kerberos handles authentication, Unix-style (rwx) file permissions handle basic authorization, and Apache Ranger adds federated access control across the nodes.
Why Use Ranger
Supports Federated Access Control
Can fall back on default HDFS file permissions
Manages Access Control over several Hadoop-
based components, like Hive, Storm, etc.
Advanced fine-grained access control, like
Deny policies for user or group
Tag-based access control, where a collection of
resources share a common access tag
For example, a few columns in a Hive table and
certain files in HDFS could share the tag 'internal_audit'
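To make the tag-based idea concrete, here is a purely conceptual sketch, not Ranger's actual API (Ranger policies are defined through its admin UI or REST interface): resources carry tags, and a policy grants a group access to every resource sharing a tag.

```java
import java.util.Map;
import java.util.Set;

public class TagBasedAccessSketch {
    // Hypothetical model: each resource (HDFS path, Hive column, ...) carries tags.
    static final Map<String, Set<String>> RESOURCE_TAGS = Map.of(
            "hdfs:/datalake/finance/ledger-2016.csv", Set.of("internal_audit"),
            "hive:sales_db.orders.card_number",       Set.of("internal_audit", "pii"),
            "hdfs:/datalake/raw/tweets/",             Set.of("public"));

    // Hypothetical tag-based policy: which groups may read resources carrying a given tag.
    static final Map<String, Set<String>> TAG_READ_POLICY = Map.of(
            "internal_audit", Set.of("auditors"),
            "pii",            Set.of("compliance"),
            "public",         Set.of("analysts", "auditors", "compliance"));

    static boolean canRead(String group, String resource) {
        // Access is granted if any tag on the resource allows the caller's group.
        return RESOURCE_TAGS.getOrDefault(resource, Set.of()).stream()
                .anyMatch(tag -> TAG_READ_POLICY.getOrDefault(tag, Set.of()).contains(group));
    }

    public static void main(String[] args) {
        System.out.println(canRead("auditors", "hdfs:/datalake/finance/ledger-2016.csv")); // true
        System.out.println(canRead("analysts", "hive:sales_db.orders.card_number"));       // false
    }
}
```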
Steps To Build A Data Lake
Set up a scalable data storage layer
Set up a Compute Cluster capable of running a
diverse mix of Jobs
Create data flow pipeline(s) for batch jobs
Create data flow pipeline(s) for streaming data
Steps To Build A Data Lake
Plug in one or more Analytics Engine(s)
Set up mechanisms for efficient data discovery
and data governance
Implement Data Access Controls
Design a Monitoring Infrastructure for Jobs and
Resources (not covered today)
Building A Data Lake: Starting Points
Set up a scalable data storage layer: HDFS
Set up a Compute Cluster capable of running a
diverse mix of Jobs: YARN
Create data flow pipeline(s) for batch jobs:
Pentaho HDFS Connector
Create data flow pipeline(s) for streaming data:
Pentaho Messaging Connector
Building A Data Lake: Starting Points
Plug in one or more Analytics Engine(s): Pentaho
Reporting and Spark MLlib
Set up mechanisms for efficient data discovery
and data governance: Apache Atlas
Implement Data Access Controls: Apache Ranger
Design a Monitoring Infrastructure for Jobs and
Resources: Apache Ambari
Taking The Plunge
Do you need to plan for and build a Data Lake?
Ask yourself: what fraction of your data are you
analyzing today?
What value might the unused data offer?
For marketing campaigns
For product lifecycle management
For regulatory compliance, and so on …
Engage your stakeholders from different LoBs
Is decision making being hampered by a lack of data?
Taking The Plunge
Start small: There is a learning curve
Storing data is not enough – maintaining and
stewarding the data is all-important
Design for extensibility and pluggability
Minimize vendor lock-in
Be open to change as you scale your infrastructure