- Rick Stellwagen from Think Big, A Teradata Company, discussed best practices for implementing a data lake including establishing standards for data ingestion and metadata capture, developing a security plan, and planning for data discovery and reporting.
- Analyst Robin Bloor asked questions about metadata management, data governance, and security for data lakes. Bloor noted that while data lakes are a new concept, best practices are needed as organizations move analytics and BI capabilities to this model.
- Upcoming Briefing Room topics in 2015 will focus on big data, cloud computing, and innovators in technology.
3. Twitter Tag: #briefr The Briefing Room
Welcome
Host:
Eric Kavanagh
eric.kavanagh@bloorgroup.com
@eric_kavanagh
4. Twitter Tag: #briefr The Briefing Room
Reveal the essential characteristics of enterprise
software, good and bad
Provide a forum for detailed analysis of today's innovative technologies
Give vendors a chance to explain their product to savvy
analysts
Allow audience members to pose serious questions... and
get answers!
Mission
5. Twitter Tag: #briefr The Briefing Room
Topics
April: BIG DATA
May: CLOUD
June: INNOVATORS
6. Twitter Tag: #briefr The Briefing Room
Will History Repeat Itself Again?
• Partitioning matters
• File formats matter
• Metadata matters
• Access patterns matter
Hadoop may be schema-agnostic, but that doesn't mean you shouldn't carefully plan your implementation!
"I've always found that plans are useless, but planning is indispensable." – Dwight D. Eisenhower
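To make "partitioning and file formats matter" concrete, here is a minimal sketch of a lake-friendly layout. It is illustrative only: the paths and the event_date partition column are assumptions, not anything the deck prescribes.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-layout").getOrCreate()

# Raw landing data: schema-agnostic JSON, expensive to scan repeatedly.
raw = spark.read.json("/landing/events/")

# Plan the layout up front: a columnar file format (Parquet) plus a
# partition column that matches the dominant access pattern (by day).
(raw.write
    .mode("append")
    .partitionBy("event_date")
    .parquet("/lake/events/"))

# A query filtering on event_date now reads only the matching partitions.
spark.read.parquet("/lake/events/").filter("event_date = '2015-04-07'").count()
```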
7. Twitter Tag: #briefr The Briefing Room
Analyst: Robin Bloor
Robin Bloor is
Chief Analyst at
The Bloor Group
robin.bloor@bloorgroup.com
@robinbloor
8. Twitter Tag: #briefr The Briefing Room
Think Big, A Teradata Company
Last year Teradata acquired Think Big Analytics, Inc., a
consulting and solutions company focused on big data
solutions
Think Big has expertise in implementing a variety of open
source technologies, such as Hadoop, HBase, Cassandra,
MongoDB and Storm, as well as experience with
Hortonworks, Cloudera and MapR
Its consultants can assist with the planning, management
and deployment of big data implementations
9. Twitter Tag: #briefr The Briefing Room
Guest: Rick Stellwagen
Rick Stellwagen is Data Lake Program Director
at Think Big, A Teradata Company. Rick is
responsible for defining and rolling out a Data
Lake Solution portfolio, identifying and
integrating internal and external best-in-class
technologies. He is defining the deployment
model, offerings, skills, career path and
integrated capabilities required for data lake
construction and rollout. He also works with
product management, engineering, marketing
and external partner alliances to define
thought leadership positions and shape
product plans both internally and externally.
10. MAKING BIG DATA COME ALIVE
Data Lake Deployment Best Practices
Rick Stellwagen, Data Lake Program Director
April 7, 2015
11. CONFIDENTIAL | 11
What is a Data Lake?
A centralized repository of raw data into which all data-producing streams flow and from which downstream facilities may draw.
[Diagram: Information Sources → Data Lake → Downstream Facilities]
Data Variety is the driving factor in building a Data Lake.
13. CONFIDENTIAL | 13
Primary Data Lake Use Cases
• Corporate Data Sourcing – Repository – System of Record
  - Govern who, what and when data is accessed or provisioned
  - Track usage, resolve anomalies, visualize, optimize and clarify data lineage
• Historical Data Offload
  - Offload history of operational and analytical data platforms
  - Centralize control of restore capabilities and leverage deep data history
• Data Discovery, Organization and Identification
  - Gain ultimate flexibility in data use and access (schema on read; see the sketch below)
  - Lightly conditioned, un-modeled, flexible modeling
• ETL Offload
  - Foundation for data integration – push staging to Hadoop
  - Data quality and validation
• Business Reporting
  - OLAP analysis sourced and processed directly from the data lake
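The "schema on read" point above can be illustrated with a short sketch. Everything here (paths, column names) is hypothetical; the idea is simply that the lake keeps files raw and each consumer applies a schema at read time:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Files stay raw and un-modeled in the lake; the schema is applied only
# at read time, and different consumers may apply different schemas.
click_schema = StructType([
    StructField("user_id", LongType()),
    StructField("url", StringType()),
    StructField("ts", StringType()),
])

clicks = spark.read.schema(click_schema).json("/lake/raw/clickstream/")
clicks.createOrReplaceTempView("clicks")
spark.sql("SELECT url, COUNT(*) AS hits FROM clicks GROUP BY url").show()
```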
14. CONFIDENTIAL | 14
Data Lake: Swamp or Reservoir?
• A Data Reservoir is a managed Data Lake that seeks to guarantee quality, access, provenance, and governance.
• An important extra guarantee that makes a Data Reservoir is the presence of metadata that lets non-subject-matter experts easily find the various forms of stored data and know their entitlements to them.
• Schema metadata is always a given, but…
17. CONFIDENTIAL | 17
Operational
Operational metadata answers questions such as:
• Where did my data come from?
• What environmental context (landing zone, OS, origin) surrounded it?
• What processes touched my data?
• When did my data get ingested? ...get transformed? ...get exported?
• Under what identity?
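A minimal sketch of what capturing such operational metadata at ingest might look like; the field names are assumptions, not Think Big's schema:

```python
import getpass
import json
import socket
import time
import uuid

def operational_metadata(source_uri, landing_path, process_name):
    """Record who/what/when/where for one ingest event."""
    return {
        "event_id": str(uuid.uuid4()),
        "source_uri": source_uri,          # where did my data come from?
        "landing_path": landing_path,      # environmental context: landing zone
        "host": socket.gethostname(),      # environmental context: machine
        "process": process_name,           # what process touched my data?
        "ingested_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "identity": getpass.getuser(),     # under what identity?
    }

print(json.dumps(operational_metadata(
    "sftp://partner/feed.csv", "/lake/ingest/partner/", "nightly_ingest"),
    indent=2))
```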
18. CONFIDENTIAL | 18
Business-Index
Business-index metadata answers questions such as:
• What contents are in a file?
• What is the data serialization?
• Where can we find certain content in the file?
• What terms are in the contents?
Related tooling and techniques: e-Discovery, Solr, a lot of NoSQL, file magic numbers.
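Since the slide calls out file magic numbers as one way to index what a file contains, here is a small sketch of format sniffing; the table covers only a few well-known signatures:

```python
# A few well-known magic numbers; the Hadoop-adjacent entries (SequenceFile,
# Avro, Parquet) are the ones most likely to turn up in a data lake.
MAGIC_NUMBERS = {
    b"SEQ": "hadoop-sequencefile",
    b"Obj\x01": "avro-container",
    b"PAR1": "parquet",
    b"PK\x03\x04": "zip",
    b"\x1f\x8b": "gzip",
}

def sniff_format(path):
    """Return a best-guess format name based on the file's leading bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, fmt in MAGIC_NUMBERS.items():
        if head.startswith(magic):
            return fmt
    return "unknown"
```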
19. CONFIDENTIAL | 19
Business-Schema
Business-schema metadata answers questions such as:
• How does my data denormalize?
• How should I interpret my data?
• What are my column names?
• Are there any "important" dimensions?
Related tooling: metarepository, HCatalog.
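One common way to make column names and types discoverable is to register the data in the Hive metastore, which is what HCatalog exposes to other tools. A minimal sketch, assuming a hypothetical reservoir.clickstream table and an existing reservoir database:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() points Spark at the shared Hive metastore, so the
# schema registered here is visible to Pig, MapReduce, etc. via HCatalog.
spark = (SparkSession.builder
         .appName("register-schema")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS reservoir.clickstream (
        user_id BIGINT COMMENT 'customer surrogate key',
        url     STRING COMMENT 'requested resource',
        ts      STRING COMMENT 'event time, ISO-8601'
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION '/lake/reservoir/clickstream'
""")
```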
20. CONFIDENTIAL | 20
Assembling the Reservoir
[Diagram: Information Sources feed the Data Lake, which feeds Downstream Facilities, all behind a Perimeter-Authentication-Authorization boundary. Labeled steps: Evaluate; Prepare Data for Ingest; Prepare Source Metadata; Source Data Ingest; Collect & Manage Metadata; Profile – Structure – Sequence; Compress; Automate; Protect; Generate Reports; Discovery Signals. The lake acts as a Data Hub.]
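As a rough illustration of the assembly flow, here is a toy pipeline runner; every stage is a stub whose name mirrors a label on the slide, standing in for real tooling (movers, profilers, codecs, report jobs), not an actual implementation:

```python
def run_pipeline(source_uri, stages):
    """Run each assembly stage in order, threading a state dict through."""
    state = {"source": source_uri}
    for name, stage in stages:
        state = stage(state)
        print("completed stage:", name)
    return state

stages = [
    ("prepare_data_for_ingest",   lambda s: dict(s, staged=True)),
    ("prepare_source_metadata",   lambda s: dict(s, metadata={"source": s["source"]})),
    ("source_data_ingest",        lambda s: dict(s, landed="/lake/ingest/feed")),
    ("collect_manage_metadata",   lambda s: dict(s, metadata_stored=True)),
    ("profile_structure",         lambda s: dict(s, profile={"rows": 0})),
    ("compress_automate_protect", lambda s: dict(s, codec="snappy", acl="restricted")),
    ("generate_reports",          lambda s: s),
]

run_pipeline("sftp://partner/feed.csv", stages)
```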
21. CONFIDENTIAL | 21
Enterprise Data Lake Architecture
• Each Region has different "areas"
• Three areas for three types of usage:
  - Data Treatment
  - Data Reservoir
  - Data Lab
[Diagram: A Regional Data Treatment Facility (Collection Pools; Ingest, SOR and Export Zones; Op Metadata Index; Orchestration VM and DB; Monitoring; Master Compute Cluster) feeds a Regional Reservoir (Lake Master Data; <LOB> and Export Zones; Biz Metadata Index; Orchestration VM and DB; Monitoring; processes that correlate, co-locate, cleanse and de-identify) and a Regional Lab (Lake Master Data; a Virtual Compute Cluster per insight, e.g. <Insight A> and <Insight B>). Treatment processes include op-metadata indexing, HAR compaction, Ingestion/SOR reconciliation, de-duplication and key generation; metadata is captured continuously and in bulk at each hop.]
Key: Validate that Ingestion captures Metadata
22. CONFIDENTIAL | 22
Data Treatment
• Used by Operations only
• Restricted
• Non-business process
• Lowest-Common-Denominator Data Serialization
• The entry point for ALL your data
[Diagram: Regional Data Treatment Facility – Collection Pools; Ingest, SOR and Export Zones; Op Metadata Index; Orchestration VM and DB; Monitoring; Master Compute Cluster; continuous bulk metadata capture.]
Make sure you capture Metadata! Or you risk a swamp downstream (see the sketch below).
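The callout above can be enforced mechanically. A minimal sketch of an ingest gate that refuses to land data without accompanying metadata; the required-field list is an assumption, not Think Big's:

```python
import json
import os
import shutil

REQUIRED_FIELDS = {"source", "owner", "serialization", "ingested_at"}

def land_file(src_path, ingest_zone, metadata):
    """Copy a file into the Ingest Zone only if metadata accompanies it."""
    missing = REQUIRED_FIELDS - metadata.keys()
    if missing:
        raise ValueError("refusing ingest; missing metadata: %s" % sorted(missing))
    dest = os.path.join(ingest_zone, os.path.basename(src_path))
    shutil.copy2(src_path, dest)
    # Land the metadata next to the data so downstream zones inherit it.
    with open(dest + ".meta.json", "w") as f:
        json.dump(metadata, f, indent=2)
    return dest
```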
23. CONFIDENTIAL | 23
Data Reservoir
• Used by Business AND Operations
• Marting!
• Business processes
• DSS
• No Ad Hoc
• Business Restricted
• First introduction of SMEs
[Diagram: Regional Reservoir – Lake Master Data; <LOB> and Export Zones; Biz Metadata Index; Orchestration VM and DB; Monitoring; Master Compute Cluster; MPP fast analytics; processes that correlate, co-locate, cleanse and de-identify.]
Don't let in un-vetted data! (A vetting sketch follows.)
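A sketch of what vetting at the reservoir boundary might look like; the checks (schema match, no nulls in required columns) are illustrative stand-ins for real data-quality rules:

```python
def vet_rows(rows, required_columns):
    """Return (row_index, reason) pairs for rows that fail admission checks."""
    problems = []
    for i, row in enumerate(rows):
        if set(row) != set(required_columns):
            problems.append((i, "schema mismatch"))
        elif any(row[c] is None for c in required_columns):
            problems.append((i, "null in required column"))
    return problems

candidates = [{"id": 1, "region": "EMEA"}, {"id": 2, "region": None}]
issues = vet_rows(candidates, ["id", "region"])
if issues:
    print("held back in Data Treatment:", issues)  # un-vetted rows stay out
```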
24. CONFIDENTIAL | 24
Data Lab
• Used by business primarily
• "Un-Safe" Data
• Ephemeral (think virtualization)
• Highly experimental
• New technologies
• Ad Hoc
[Diagram: Regional Lab – Lake Master Data; a Virtual Compute Cluster per insight, e.g. <Insight A> on VCC X and <Insight B> on VCC Y.]
25. CONFIDENTIAL | 25
Data Lake Best Practices
• Know where you are headed – build on roadmap or optimizer planning
• Quickly put reference practices for company-wide Data Lake ingest into use
• Establish data lineage and governance tracking with metadata services (see the sketch below)
• Establish standards and practices to scale out your data ingest
• Develop standards for profiling and discovery
• Build out a pipeline framework for data transformations
• Develop a security plan (perimeter, authentication & authorization)
• Develop an archive and information-security approach
• Plan out next steps and an approach for discovery and reporting
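For the lineage bullet above, a minimal sketch of a record a metadata service could store per transformation; the shape is an assumption, not a Think Big format:

```python
import hashlib
import json
import time

def lineage_edge(inputs, process, output):
    """One lineage record: which inputs a process read to produce an output."""
    edge = {
        "inputs": sorted(inputs),
        "process": process,
        "output": output,
        "recorded_at": int(time.time()),
    }
    payload = json.dumps(edge, sort_keys=True).encode()
    edge["edge_id"] = hashlib.sha1(payload).hexdigest()
    return edge

# Appending one edge per transformation yields a queryable lineage graph.
print(lineage_edge(
    ["/lake/ingest/orders", "/lake/ingest/customers"],
    "join_orders_customers_v1",
    "/lake/reservoir/sales/orders_enriched"))
```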
26. Twitter Tag: #briefr The Briefing Room
Perceptions & Questions
Analyst:
Robin Bloor
28. There Has Been a Clear Shift
Analytics & BI were previously EDW-centric; they are becoming Data Lake-centric.
29. Hadoop vs Data Mgmt Engine

  Hadoop               DBMS/EDW
  -------------------  ---------------------
  Inexpensive (?)      Expensive
  Any data             Prepared data
  May have metadata    Will have metadata
  Poor performance     Optimized performance
  Weak scheduling      Optimized scheduling
  Weak data mgmt       Good data mgmt
  Security?            Secure
  Data Lake            Data workhorse
33. Straws in the Wind – Operational Concerns
• Multiple local instances of Hadoop
• Weak data placement
• Metadata chaos
• Lack of tuning capability
• Security (expense)
• User self-service becoming a file system nightmare
34. The Need for Best Practices
This is clear: Data Lake is a new idea
35. • Is a data lake really just a multiplicity of data marts growing wild?
• Aside from performance-critical workloads, what should Hadoop not be used for?
• Do you have any specific recommendations for metadata management in a data lake?
• Is there a need for enforced provenance & lineage?