Watch full webinar here: https://bit.ly/3dmOHyQ
Historically, data lakes have been created as centralized physical data storage platforms for data scientists to analyze. But lately, the explosion of big data, data privacy rules, and departmental restrictions, among many other things, have made the centralized data repository approach less feasible. In this webinar, we discuss why decentralized, multi-purpose data lakes are the future of data analysis for a broad range of business users.
Watch this webinar on demand to learn:
- The restrictions of physical single-purpose data lakes
- How to build a logical multi-purpose data lake for business users
- The newer use cases that make multi-purpose data lakes a necessity
2. Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes
Chris Day
Director, APAC Sales Engineering, Denodo
Sushant Kumar
Product Marketing Manager, Denodo
3. Agenda
1. Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes
2. Customer story
3. Product Demo
4. Q&A
5. Next Steps
4. What are Data Lakes and why do we need them?
• A storage repository that holds a vast amount of raw data in its native format.
• Hadoop and its ecosystem provided the foundation that data lakes required: vast storage and processing muscle.
• Advanced analytics and data-mining tools ingest raw data from data lakes and transform it into useful insight.
5. Data Lakes – A Data Scientist’s Playground
• The early data scientists saw Hadoop as their personal supercomputer.
• Hadoop-based data lakes helped democratize access to state-of-the-art supercomputing with off-the-shelf hardware (and later the cloud).
• The industry push for BI made Hadoop-based solutions the standard way to bring modern analytics to any corporation.
6. Data Lakes – Not a Perfect World
Physical Nature
• Based on replication: data lakes require data to be copied into their physical storage.
• Replication extends development cycles and increases costs.
• Not all data is suitable for replication:
  • Real-time needs: cloud and SaaS APIs
  • Large volumes: the existing EDW
  • Laws and restrictions
Single Purpose
• Usage of the data lake is often monopolized by data scientists.
• It becomes a new data silo, with no clear path to share insights with business users.
• It lacks the governance, security, and quality that business users are used to (e.g. in the EDW).
8. “Multi-purpose data lakes are data delivery environments developed to support a broad range of users, from traditional self-service BI users (e.g. finance, marketing, human resources, transport) to sophisticated data scientists. Multi-purpose data lakes allow a broader and deeper use of the data lake investment without minimizing the potential value for data science and without making it an inflexible environment.”
Rick Van der Lans, R20 Consultancy
9. The Multipurpose Data Lake with Data Virtualization
Logical Nature
• Replication is an option, not a necessity.
• Broader data access, shorter development times, better insights.
• Tight integration with big data systems; fast execution with large data volumes.
Multi-purpose
• Curated access for non-technical users.
• Better governance and access control.
• Better ROI on the investment in the lake.
10. The Multipurpose Data Lake with Data Virtualization
“A multi-purpose data lake can become an organization’s universal data delivery system”
Architecting the Multi-Purpose Data Lake with Data Virtualization, Rick Van der Lans, April 2018
11. The Virtual Data Lake – Access to all Data Sources
Single access to all data assets, internal and external:
§ Physical data lake (usually based on SQL-on-Hadoop systems)
§ Other databases (EDW, ODS, applications, etc.)
§ SaaS APIs (Salesforce, Google, social media, etc.)
§ Files (local, S3, Azure, etc.)
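The single-access-point idea above can be sketched in miniature: one routing layer maps logical view names to whatever adapter fetches the data, be it a database or a SaaS API. This is an illustrative toy, not Denodo's API; all names here (`VirtualLayer`, `dw.customer`, `crm.accounts`, the sample rows) are invented for the example.

```python
import sqlite3

class VirtualLayer:
    """Routes logical view names to underlying source adapters."""
    def __init__(self):
        self.sources = {}

    def register(self, view_name, fetch_fn):
        self.sources[view_name] = fetch_fn

    def query(self, view_name):
        # One entry point, many heterogeneous sources behind it.
        return self.sources[view_name]()

# Source 1: a relational database (stands in for the EDW)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer (id INTEGER, zip TEXT)")
db.execute("INSERT INTO customer VALUES (1, '2000'), (2, '3000')")

# Source 2: a SaaS API response (stands in for Salesforce, etc.)
def fetch_api_accounts():
    return [{"id": 1, "name": "Acme"}]

layer = VirtualLayer()
layer.register("dw.customer",
               lambda: db.execute("SELECT * FROM customer").fetchall())
layer.register("crm.accounts", fetch_api_accounts)

print(layer.query("dw.customer"))   # rows fetched from the database
print(layer.query("crm.accounts"))  # rows fetched from the API adapter
```

The point of the sketch is that consumers only ever see logical view names; where the rows physically live is the layer's concern.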
12. The Virtual Data Lake – Ingesting and Caching
The physical data lake can also be used as Denodo’s cache.
This makes it possible to quickly load any data accessible through Denodo into the Hadoop cluster.
Caching becomes an alternative to ELT ingestion processes, one that preserves lineage and governance.
The load process is based on a direct load to HDFS:
1. Creation of the target table in the cache system
2. Generation of Parquet files (in chunks) with Snappy compression on the local machine
3. Parallel upload of the Parquet files to HDFS
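The three-step load process can be sketched as follows. This is a stdlib-only simulation of the orchestration, chunking locally and then uploading in parallel; the real Parquet/Snappy file generation and the HDFS put are stubbed out, and all function names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 3  # rows per file; real chunk sizes would be far larger

def create_target_table(name):
    # Step 1: in the real flow this issues a CREATE TABLE against
    # the SQL-on-Hadoop cache system.
    return {"table": name, "files": []}

def write_chunks(rows, chunk_size=CHUNK_SIZE):
    # Step 2: each chunk would become a Snappy-compressed Parquet
    # file on the local machine; here we just slice the rows.
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]

def upload(chunk):
    # Step 3: stands in for an HDFS put of one Parquet file.
    return len(chunk)

table = create_target_table("cache.sales")
chunks = write_chunks(list(range(10)))           # 10 rows -> 4 chunks
with ThreadPoolExecutor(max_workers=4) as pool:  # uploads run in parallel
    uploaded = list(pool.map(upload, chunks))

print(len(chunks), sum(uploaded))  # 4 chunks, all 10 rows uploaded
```

Chunking keeps local disk usage bounded while the parallel upload keeps the cluster's ingest bandwidth busy.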
13. The Virtual Data Lake – Using the Lake Processing Engine
Denodo’s optimizer provides native integration with MPP systems to deliver one extra key capability: query acceleration.
Denodo can move processing to the MPP on demand during the execution of a query:
• Parallel power for calculations in the virtual layer
• Avoids slow on-disk processing when buffers don’t fit into Denodo’s memory (swapped data)
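A toy version of the decision described above might look like this. The memory budget and the size heuristic are assumptions made for illustration, not Denodo's actual cost model.

```python
MEMORY_BUDGET_MB = 512  # assumed virtual-layer memory for intermediates

def choose_engine(estimated_rows, row_size_bytes):
    """Route a subtree to the MPP when its intermediate result
    would not fit in the virtual layer's memory."""
    estimated_mb = estimated_rows * row_size_bytes / 1024 / 1024
    # Swapping buffers to disk in the virtual layer is slow, so large
    # intermediates are delegated to the parallel engine instead.
    return "mpp" if estimated_mb > MEMORY_BUDGET_MB else "in-memory"

print(choose_engine(10_000, 200))       # small join -> in-memory
print(choose_engine(220_000_000, 200))  # 220 M rows -> mpp
```

The real optimizer weighs costs per operation rather than a single threshold, but the shape of the decision (estimate, compare, delegate) is the same.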
14. The Virtual Data Lake – Putting the Pieces Together
Example query: CurrentSales (68 M rows) and Hist.Sales (220 M rows) are joined with Customer (2 M rows, cached), grouped by customer ID and then by ZIP, producing 2 M rows of sales by customer.
1. Partial aggregation push-down: maximizes source-side processing and dramatically reduces network traffic.
2. Integration with the cost-based optimizer: based on data volume estimates and the cost of these particular operations, the CBO can decide to move all or part of the execution tree to the MPP.
3. On-demand data transfer: Denodo automatically generates and uploads Parquet files.
4. Integration with local data: the engine detects when data is cached or comes from a local table already in the MPP.
5. Fast parallel execution: support for Spark, Presto, and Impala for fast analytical processing in inexpensive Hadoop-based solutions.

System     Execution Time   Optimization Techniques
Others     ~ 10 min         Simple federation
No MPP     43 sec           Aggregation push-down
With MPP   11 sec           Aggregation push-down + MPP integration (Impala, 8 nodes)
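The effect of partial aggregation push-down (technique 1 above) can be shown with a tiny in-memory analogue: aggregating per customer inside the source means only one partial row per customer crosses the network before the final join and GROUP BY ZIP. The sample rows are invented; the real tables hold millions of rows.

```python
from collections import defaultdict

sales = [  # stands in for the 68 M-row CurrentSales table
    {"customer_id": 1, "amount": 10},
    {"customer_id": 1, "amount": 5},
    {"customer_id": 2, "amount": 7},
]
customers = {1: "2000", 2: "3000"}  # customer_id -> ZIP (cached table)

# Pushed down: the GROUP BY customer_id executes inside the source,
# so only one partial row per customer is transferred.
partials = defaultdict(int)
for row in sales:
    partials[row["customer_id"]] += row["amount"]

# Final step in the virtual layer / MPP: join with Customer, GROUP BY ZIP.
by_zip = defaultdict(int)
for cid, total in partials.items():
    by_zip[customers[cid]] += total

print(len(sales), len(partials))  # 3 source rows reduced to 2 transferred
print(dict(by_zip))               # {'2000': 15, '3000': 7}
```

At real scale this is the 68 M-to-2 M-row reduction the slide describes: the network moves aggregates, not raw detail rows.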
17. When designed properly, DV can speed data integration, lower data latency, offer flexibility and reuse, and reduce data sprawl across dispersed data sources. Due to its many benefits, DV is often the first step for organizations evolving a traditional, repository-style data warehouse into a logical architecture.
- Gartner, Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs, May 2018
18. The Logical Data Lake – Conclusions
§ A logical data lake improves decision making and shortens development cycles
• Surfaces all company data from multiple repositories without the need to replicate it all into the lake
• Eliminates data silos and allows for on-demand combination of data from multiple sources
§ A logical data lake broadens adoption of the lake and improves its ROI
• Improves governance and metadata management to avoid “data swamps”
• Allows controlled access to the lake for non-technical users
§ A logical data lake offers performance for the big data world
• Leverages the processing power of the existing cluster, controlled by Denodo’s optimizer
23. Next session | 20 May | 8.30am IST / 11.00am SGT / 1.00pm AEST
Simplifying Your Cloud Architecture with
a Logical Data Fabric
Katrina Briedis
Sales Engineering, Denodo
Sushant Kumar
Product Marketing Manager, Denodo
REGISTER NOW
bit.ly/APACWB2104