- Rick Stellwagen from Think Big, A Teradata Company, discussed best practices for implementing a data lake including establishing standards for data ingestion and metadata capture, developing a security plan, and planning for data discovery and reporting.
- Analyst Robin Bloor asked questions about metadata management, data governance, and security for data lakes. Bloor noted that while data lakes are a new concept, best practices are needed as organizations move analytics and BI capabilities to this model.
- Upcoming Briefing Room topics in 2015 will focus on big data, cloud computing, and innovators in technology.
3. Twitter Tag: #briefr The Briefing Room
Welcome
Host:
Eric Kavanagh
eric.kavanagh@bloorgroup.com
@eric_kavanagh
4. Twitter Tag: #briefr The Briefing Room
Reveal the essential characteristics of enterprise
software, good and bad
Provide a forum for detailed analysis of today's innovative technologies
Give vendors a chance to explain their product to savvy
analysts
Allow audience members to pose serious questions... and
get answers!
Mission
5. Twitter Tag: #briefr The Briefing Room
Topics
April: BIG DATA
May: CLOUD
June: INNOVATORS
6. Twitter Tag: #briefr The Briefing Room
Will History Repeat Itself Again?
• Partitioning matters
• File formats matter
• Metadata matters
• Access patterns matter
Hadoop may be schema-agnostic, but that doesn't mean you shouldn't carefully plan your implementation!
"I've always found that plans are useless, but planning is indispensable." – Dwight D. Eisenhower
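To make "partitioning and file formats matter" concrete, here is a minimal sketch of a lake-friendly layout. It is illustrative only: the paths and the event_date partition column are assumptions, not anything the deck prescribes.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-layout").getOrCreate()

# Raw landing data: schema-agnostic JSON, expensive to scan repeatedly.
raw = spark.read.json("/landing/events/")

# Plan the layout up front: a columnar file format (Parquet) plus a
# partition column that matches the dominant access pattern (by day).
(raw.write
    .mode("append")
    .partitionBy("event_date")
    .parquet("/lake/events/"))

# A query filtering on event_date now reads only the matching partitions.
spark.read.parquet("/lake/events/").filter("event_date = '2015-04-07'").count()
```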
7. Twitter Tag: #briefr The Briefing Room
Analyst: Robin Bloor
Robin Bloor is
Chief Analyst at
The Bloor Group
robin.bloor@bloorgroup.com
@robinbloor
8. Twitter Tag: #briefr The Briefing Room
Think Big, A Teradata Company
Last year Teradata acquired Think Big Analytics, Inc., a
consulting and solutions company focused on big data
solutions
Think Big has expertise in implementing a variety of open
source technologies, such as Hadoop, HBase, Cassandra,
MongoDB and Storm, as well as experience with
Hortonworks, Cloudera and MapR
Its consultants can assist with the planning, management
and deployment of big data implementations
9. Twitter Tag: #briefr The Briefing Room
Guest: Rick Stellwagen
Rick Stellwagen is Data Lake Program Director
at Think Big, A Teradata Company. Rick is
responsible for defining and rolling out a Data
Lake Solution portfolio, identifying and
integrating internal and external best-in-class
technologies. He is defining the deployment
model, offerings, skills, career path and
integrated capabilities required for data lake
construction and rollout. He also works with
product management, engineering, marketing
and external partner alliances to define
thought leadership positions and shape
product plans both internally and externally.
10. MAKING BIG DATA COME ALIVE
Data Lake Deployment Best Practices
Rick Stellwagen, Data Lake Program Director
April 7, 2015
11. CONFIDENTIAL | 11
What is a Data Lake?
A centralized repository of raw data into which all data-producing streams flow and from which downstream facilities may draw.
[Diagram: Information Sources → Data Lake → Downstream Facilities]
Data Variety is the driving factor in building a Data Lake.
13. CONFIDENTIAL | 13
Primary Data Lake Use Cases
• Corporate Data Sourcing – Repository – System of Record
  - Govern who, what and when data is accessed or provisioned
  - Track usage, resolve anomalies, visualize, optimize and clarify data lineage
• Historical Data Offload
  - Offload history of operational and analytical data platforms
  - Centralize control of restore capabilities and leverage deep data history
• Data Discovery, Organization and Identification
  - Gain ultimate flexibility in data use and access (schema on read; see the sketch below)
  - Lightly conditioned, un-modeled, flexible modeling
• ETL Offload
  - Foundation for data integration – push staging to Hadoop
  - Data quality and validation
• Business Reporting
  - OLAP analysis sourced and processed directly from the data lake
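The "schema on read" point above can be illustrated with a short sketch. Everything here (paths, column names) is hypothetical; the idea is simply that the lake keeps files raw and each consumer applies a schema at read time:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Files stay raw and un-modeled in the lake; the schema is applied only
# at read time, and different consumers may apply different schemas.
click_schema = StructType([
    StructField("user_id", LongType()),
    StructField("url", StringType()),
    StructField("ts", StringType()),
])

clicks = spark.read.schema(click_schema).json("/lake/raw/clickstream/")
clicks.createOrReplaceTempView("clicks")
spark.sql("SELECT url, COUNT(*) AS hits FROM clicks GROUP BY url").show()
```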
14. CONFIDENTIAL | 14
Data Lake: Swamp or Reservoir?
• A Data Reservoir is a managed Data Lake that seeks to guarantee quality, access, provenance, and governance.
• An important extra guarantee that makes a Data Reservoir is the presence of metadata that lets non-subject-matter experts easily find the various forms of stored data and know their entitlements to them.
• Schema metadata is always a given, but…
17. CONFIDENTIAL | 17
Operational
Operational metadata answers questions such as:
• Where did my data come from?
• What environmental context (landing zone, OS, origin) surrounded it?
• What processes touched my data?
• When did my data get ingested? ...get transformed? ...get exported?
• Under what identity?
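A minimal sketch of what capturing such operational metadata at ingest might look like; the field names are assumptions, not Think Big's schema:

```python
import getpass
import json
import socket
import time
import uuid

def operational_metadata(source_uri, landing_path, process_name):
    """Record who/what/when/where for one ingest event."""
    return {
        "event_id": str(uuid.uuid4()),
        "source_uri": source_uri,          # where did my data come from?
        "landing_path": landing_path,      # environmental context: landing zone
        "host": socket.gethostname(),      # environmental context: machine
        "process": process_name,           # what process touched my data?
        "ingested_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "identity": getpass.getuser(),     # under what identity?
    }

print(json.dumps(operational_metadata(
    "sftp://partner/feed.csv", "/lake/ingest/partner/", "nightly_ingest"),
    indent=2))
```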
18. CONFIDENTIAL | 18
Business-Index
Business-index metadata answers questions such as:
• What contents are in a file?
• What is the data serialization?
• Where can we find certain content in the file?
• What terms are in the contents?
Related tooling and techniques: e-Discovery, Solr, a lot of NoSQL, file magic numbers.
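Since the slide calls out file magic numbers as one way to index what a file contains, here is a small sketch of format sniffing; the table covers only a few well-known signatures:

```python
# A few well-known magic numbers; the Hadoop-adjacent entries (SequenceFile,
# Avro, Parquet) are the ones most likely to turn up in a data lake.
MAGIC_NUMBERS = {
    b"SEQ": "hadoop-sequencefile",
    b"Obj\x01": "avro-container",
    b"PAR1": "parquet",
    b"PK\x03\x04": "zip",
    b"\x1f\x8b": "gzip",
}

def sniff_format(path):
    """Return a best-guess format name based on the file's leading bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, fmt in MAGIC_NUMBERS.items():
        if head.startswith(magic):
            return fmt
    return "unknown"
```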
19. CONFIDENTIAL | 19
Business-Schema
Business-schema metadata answers questions such as:
• How does my data denormalize?
• How should I interpret my data?
• What are my column names?
• Are there any "important" dimensions?
Related tooling: metarepository, HCatalog.
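One common way to make column names and types discoverable is to register the data in the Hive metastore, which is what HCatalog exposes to other tools. A minimal sketch, assuming a hypothetical reservoir.clickstream table and an existing reservoir database:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() points Spark at the shared Hive metastore, so the
# schema registered here is visible to Pig, MapReduce, etc. via HCatalog.
spark = (SparkSession.builder
         .appName("register-schema")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS reservoir.clickstream (
        user_id BIGINT COMMENT 'customer surrogate key',
        url     STRING COMMENT 'requested resource',
        ts      STRING COMMENT 'event time, ISO-8601'
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION '/lake/reservoir/clickstream'
""")
```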
20. CONFIDENTIAL | 20
Assembling the Reservoir
[Diagram: Information Sources feed the Data Lake, which feeds Downstream Facilities, all behind a Perimeter-Authentication-Authorization boundary. Labeled steps: Evaluate; Prepare Data for Ingest; Prepare Source Metadata; Source Data Ingest; Collect & Manage Metadata; Profile – Structure – Sequence; Compress; Automate; Protect; Generate Reports; Discovery Signals. The lake acts as a Data Hub.]
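As a rough illustration of the assembly flow, here is a toy pipeline runner; every stage is a stub whose name mirrors a label on the slide, standing in for real tooling (movers, profilers, codecs, report jobs), not an actual implementation:

```python
def run_pipeline(source_uri, stages):
    """Run each assembly stage in order, threading a state dict through."""
    state = {"source": source_uri}
    for name, stage in stages:
        state = stage(state)
        print("completed stage:", name)
    return state

stages = [
    ("prepare_data_for_ingest",   lambda s: dict(s, staged=True)),
    ("prepare_source_metadata",   lambda s: dict(s, metadata={"source": s["source"]})),
    ("source_data_ingest",        lambda s: dict(s, landed="/lake/ingest/feed")),
    ("collect_manage_metadata",   lambda s: dict(s, metadata_stored=True)),
    ("profile_structure",         lambda s: dict(s, profile={"rows": 0})),
    ("compress_automate_protect", lambda s: dict(s, codec="snappy", acl="restricted")),
    ("generate_reports",          lambda s: s),
]

run_pipeline("sftp://partner/feed.csv", stages)
```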
21. CONFIDENTIAL | 21
Enterprise Data Lake Architecture
• Each Region has different "areas"
• Three areas for three types of usage:
  - Data Treatment
  - Data Reservoir
  - Data Lab
[Diagram: A Regional Data Treatment Facility (Collection Pools; Ingest, SOR and Export Zones; Op Metadata Index; Orchestration VM and DB; Monitoring; Master Compute Cluster) feeds a Regional Reservoir (Lake Master Data; <LOB> and Export Zones; Biz Metadata Index; Orchestration VM and DB; Monitoring; processes that correlate, co-locate, cleanse and de-identify) and a Regional Lab (Lake Master Data; a Virtual Compute Cluster per insight, e.g. <Insight A> and <Insight B>). Treatment processes include op-metadata indexing, HAR compaction, Ingestion/SOR reconciliation, de-duplication and key generation; metadata is captured continuously and in bulk at each hop.]
Key: Validate that Ingestion captures Metadata
22. CONFIDENTIAL | 22
Data Treatment
• Used by Operations only
• Restricted
• Non-business process
• Lowest-Common-Denominator Data Serialization
• The entry point for ALL your data
[Diagram: Regional Data Treatment Facility – Collection Pools; Ingest, SOR and Export Zones; Op Metadata Index; Orchestration VM and DB; Monitoring; Master Compute Cluster; continuous bulk metadata capture.]
Make sure you capture Metadata! Or you risk a swamp downstream (see the sketch below).
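The callout above can be enforced mechanically. A minimal sketch of an ingest gate that refuses to land data without accompanying metadata; the required-field list is an assumption, not Think Big's:

```python
import json
import os
import shutil

REQUIRED_FIELDS = {"source", "owner", "serialization", "ingested_at"}

def land_file(src_path, ingest_zone, metadata):
    """Copy a file into the Ingest Zone only if metadata accompanies it."""
    missing = REQUIRED_FIELDS - metadata.keys()
    if missing:
        raise ValueError("refusing ingest; missing metadata: %s" % sorted(missing))
    dest = os.path.join(ingest_zone, os.path.basename(src_path))
    shutil.copy2(src_path, dest)
    # Land the metadata next to the data so downstream zones inherit it.
    with open(dest + ".meta.json", "w") as f:
        json.dump(metadata, f, indent=2)
    return dest
```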
23. CONFIDENTIAL | 23
Data Reservoir
• Used by Business AND Operations
• Marting!
• Business processes
• DSS
• No Ad Hoc
• Business Restricted
• First introduction of SMEs
[Diagram: Regional Reservoir – Lake Master Data; <LOB> and Export Zones; Biz Metadata Index; Orchestration VM and DB; Monitoring; Master Compute Cluster; MPP fast analytics; processes that correlate, co-locate, cleanse and de-identify.]
Don't let in un-vetted data! (A vetting sketch follows.)
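A sketch of what vetting at the reservoir boundary might look like; the checks (schema match, no nulls in required columns) are illustrative stand-ins for real data-quality rules:

```python
def vet_rows(rows, required_columns):
    """Return (row_index, reason) pairs for rows that fail admission checks."""
    problems = []
    for i, row in enumerate(rows):
        if set(row) != set(required_columns):
            problems.append((i, "schema mismatch"))
        elif any(row[c] is None for c in required_columns):
            problems.append((i, "null in required column"))
    return problems

candidates = [{"id": 1, "region": "EMEA"}, {"id": 2, "region": None}]
issues = vet_rows(candidates, ["id", "region"])
if issues:
    print("held back in Data Treatment:", issues)  # un-vetted rows stay out
```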
24. CONFIDENTIAL | 24
Data Lab
• Used by business primarily
• "Un-Safe" Data
• Ephemeral (think virtualization)
• Highly experimental
• New technologies
• Ad Hoc
[Diagram: Regional Lab – Lake Master Data; a Virtual Compute Cluster per insight, e.g. <Insight A> on VCC X and <Insight B> on VCC Y.]
25. CONFIDENTIAL | 25
Data Lake Best Practices
• Know where you are headed – build on roadmap or optimizer planning
• Quickly put reference practices for company-wide Data Lake ingest into use
• Establish data lineage and governance tracking with metadata services (see the sketch below)
• Establish standards and practices to scale out your data ingest
• Develop standards for profiling and discovery
• Build out a pipeline framework for data transformations
• Develop a security plan (perimeter, authentication & authorization)
• Develop an archive and information-security approach
• Plan out next steps and an approach for discovery and reporting
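For the lineage bullet above, a minimal sketch of a record a metadata service could store per transformation; the shape is an assumption, not a Think Big format:

```python
import hashlib
import json
import time

def lineage_edge(inputs, process, output):
    """One lineage record: which inputs a process read to produce an output."""
    edge = {
        "inputs": sorted(inputs),
        "process": process,
        "output": output,
        "recorded_at": int(time.time()),
    }
    payload = json.dumps(edge, sort_keys=True).encode()
    edge["edge_id"] = hashlib.sha1(payload).hexdigest()
    return edge

# Appending one edge per transformation yields a queryable lineage graph.
print(lineage_edge(
    ["/lake/ingest/orders", "/lake/ingest/customers"],
    "join_orders_customers_v1",
    "/lake/reservoir/sales/orders_enriched"))
```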
26. Twitter Tag: #briefr The Briefing Room
Perceptions & Questions
Analyst:
Robin Bloor
28. There Has Been a Clear Shift
Analytics & BI were previously EDW-centric; they are becoming Data Lake-centric.
29. Hadoop vs Data Mgmt Engine

  Hadoop               DBMS/EDW
  -------------------  ---------------------
  Inexpensive (?)      Expensive
  Any data             Prepared data
  May have metadata    Will have metadata
  Poor performance     Optimized performance
  Weak scheduling      Optimized scheduling
  Weak data mgmt       Good data mgmt
  Security?            Secure
  Data Lake            Data workhorse
33. Straws in the Wind – Operational Concerns
• Multiple local instances of Hadoop
• Weak data placement
• Metadata chaos
• Lack of tuning capability
• Security (expense)
• User self-service becoming a file system nightmare
34. The Need for Best Practices
This is clear: Data Lake is a new idea
35. • Is a data lake really just a multiplicity of data marts growing wild?
• Aside from performance-critical workloads, what should Hadoop not be used for?
• Do you have any specific recommendations for metadata management in a data lake?
• Is there a need for enforced provenance & lineage?