Watch full webinar here: https://buff.ly/2mHGaLA
Having started as the most agile, real-time approach to enterprise data integration, data virtualization is going beyond its initial promise and becoming one of the most important enterprise big data fabrics.
Attend this session to learn:
• What data virtualization really is
• How it differs from other enterprise data integration technologies
• Why data virtualization is finding enterprise-wide deployment inside some of the largest organizations
7. The Data Integration Challenge
[Diagram: business users (Marketing, Sales, Executive, Support) manually access different systems; IT responds with point-to-point data integration; it takes too long to get answers to business users. Sources include databases, apps, data warehouses, cloud, big data, documents, and NoSQL stores.]
“Data bottlenecks create business bottlenecks.”
– Create a Road Map For A Real-time, Agile, Self-Service Data
Platform, Forrester Research, Dec 16, 2015
8. The Data Integration Challenge
• It is difficult to integrate numerous on-premises and cloud data sources.
• Traditional tools cannot integrate streaming data and data-at-rest in real time.
• It is difficult to maintain consistent data access and governance policies across data silos.
• Traditional data integration is extremely resource intensive.
9. The Solution – A Data Abstraction Layer
• Abstracts access to disparate data sources
• Acts as a single (virtual) repository
• Makes data available in real time to consumers
“Enterprise architects must revise their data
architecture to meet the demand for fast data.”
– Create a Road Map For A Real-time, Agile, Self-Service Data
Platform, Forrester Research, Dec 16, 2015
10. Data Virtualization
“Data virtualization integrates disparate data sources in real time or near-real time
to meet demands for analytics and transactional data.”
– Create a Road Map For A Real-time, Agile, Self-Service Data Platform, Forrester Research, Dec 16, 2015
1. Connects to disparate data sources
2. Combines related data into views
3. Publishes the data to applications
12. “Data virtualization technology can be used to create virtualized and integrated views of data in memory (rather than executing data movement and physically storing integrated views in a target data structure), and provides a layer of abstraction above the physical implementation of data.”
– Gartner, Market Guide for Data Virtualization, 2016
13. What Data Virtualization is Not!
• It is not ETL
• If you want to replicate data from ‘A’ to ‘B’…use an ETL tool – it’s what they are designed for
• It is not Data Visualization (note the ‘s’)
• It complements visualization and reporting tools (e.g. Tableau)
• It is not a database
• Data Virtualization platforms don’t store the data…it’s retrieved from the data sources on demand
• It has many capabilities such as governance, metadata management, security, etc.
• It will work with specialized tools in these areas
• It’s great for service-based architectures
• But be wary of event-driven architectures…use an ESB (or similar) for this
14. “By 2018, organizations with data virtualization capabilities will spend 40% less on building and managing data integration processes for connecting distributed data assets.”
– Gartner, Predicts 2017: Data Distribution and Complexity Drive Information Infrastructure Modernization, Ted Friedman et al.
16. Five Essential Capabilities of Data Virtualization
1. Data abstraction
2. Zero replication, zero relocation
3. Real-time information
4. Self-service data services
5. Centralized metadata, security & governance
17. 1. Data abstraction
• Abstracts access to disparate data sources.
• Acts as a single virtual repository.
• Abstracts data complexities like location, format, and protocols.
…hides data complexity for ease of data access by the business
“Enterprise architects must revise their data architecture to meet the demand for fast data.”
– Create a Road Map For A Real-time, Agile, Self-Service Data Platform, Forrester Research
18. 2. Zero replication, zero relocation
• Leaves the data at its source; extracts only what is needed, on demand.
• Diminishes the need for effort-intensive ETL processes.
• Eliminates unnecessary data redundancy.
…reduces development time and overall TCO
“The Denodo Platform enables us to build and deliver data services, to our internal and external consumers, within a day instead of the 1–2 weeks it would take with ETL.”
– Manager, DrillingInfo
19. 3. Real-time information
• Provisions data in real time to consumers.
• Creates real-time logical views of data across many data sources.
• Supports transformations and quality functions without the latency, redundancy, and rigidity of legacy approaches.
…enables timely decision-making
“Data virtualization integrates disparate data sources in real time or near-real time to meet demands for analytics and transactional data.”
– Create a Road Map For A Real-time, Agile, Self-Service Data Platform, Forrester Research, Dec 16, 2015
20. 4. Self-service data services
• Facilitates access to all data, both internal and external.
• Enables creation of universal semantic models reflecting business taxonomy.
• Connects data silos to provide the best available information to drive business decisions.
…enables information discovery and self-service
“Impressively quick turnaround time to ‘unlock’ data from additional silos and from legacy systems. Few vendors (if any) can compete with Denodo’s support of the RESTful/OData standard, both to provide data (northbound) and to access data from the sources (southbound).”
– Business Analyst, Swiss Re
21. 5. Centralized metadata, security & governance
• Abstracts data source security models and enables single-point security and governance.
• Extends single-point control across cloud and on-premises architectures.
• Provides multiple forms of metadata (technical, business, operational) to facilitate understanding of data.
…simplifies data security, privacy, and audit
“Our Denodo rollout was one of the easiest and most successful rollouts of critical enterprise software I have seen. It was successful in handling our initial security use case immediately, and has since shown a strong ability to cover additional use cases, in particular acting as a Data Abstraction Layer via its web service functionality.”
– Enterprise Architect, Asurion
22. Denodo ‘Solution’ Categories
Customer Centricity / MDM
✓ Complete View of Customer
Data Services
✓ Data as a Service
✓ Data Marketplace
✓ Data Services
✓ Application and Data Migration
Cloud Solutions
✓ Cloud Modernization
✓ Cloud Analytics
✓ Hybrid Data Fabric
Data Governance
✓ GRC
✓ GDPR
✓ Data Privacy / Masking
BI and Analytics
✓ Self-Service Analytics
✓ Logical Data Warehouse
✓ Enterprise Data Fabric
Big Data
✓ Logical Data Lake
✓ Data Warehouse Offloading
✓ IoT Analytics
24. Demo Architecture
What’s the impact of a new marketing campaign for each country?
▪ Historical sales data offloaded to a Hadoop cluster for cheaper storage
▪ Marketing campaigns managed in an external cloud app
▪ Country is part of the customer details table, stored in the DW
[Diagram: the Source Abstraction layer exposes base views over the Sales, Campaign, and Customer sources; the Combine, Transform & Integrate layer joins them and groups by state; the result is published to the Consume layer.]
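The three-layer flow of the demo can be sketched in miniature. This is an illustrative toy only: the table contents and column names are invented, and plain Python stands in for the virtual layer.

```python
from collections import defaultdict

# Stand-in "base views", one per source; all names and rows are invented.
sales = [                         # Hadoop: (customer_id, campaign_id, amount)
    (1, "C1", 100.0), (1, "C2", 50.0), (2, "C1", 200.0), (3, "C2", 75.0),
]
campaigns = {"C1": "Spring Promo", "C2": "Summer Promo"}  # cloud app
customers = {1: "US", 2: "US", 3: "DE"}                   # DW: id -> country

# Combine/Transform/Integrate: join the views, then group by (campaign, country).
totals = defaultdict(float)
for cust_id, camp_id, amount in sales:
    totals[(campaigns[camp_id], customers[cust_id])] += amount

# Consume: campaign impact per country.
for (campaign, country), amount in sorted(totals.items()):
    print(campaign, country, amount)
```

The point of the layering is that consumers only see the final view; which source each column comes from is hidden behind the base views.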
26. What is the optimizer doing?
SELECT c.state, AVG(s.amount)
FROM customer c JOIN sales s
ON c.id = s.customer_id
GROUP BY c.state
[Diagram: three candidate execution strategies for this query over Sales (300M rows) and Customer (2M rows):
• Naïve strategy – join the full tables and then group by state; roughly 300M rows must be transferred to the virtualization layer.
• Temporary data movement – copy Customer (2M rows) into a temp table (Temp_Customer) next to Sales, then join and group by at that source.
• Partial aggregation pushdown – pre-aggregate Sales by customer ID at the source (2M rows transferred), join with Customer, then group by state (50 result rows).]
The pushed-down portion of the partial aggregation strategy:
SELECT c.id, amount
FROM (SELECT s.customer_id, SUM(amount) amount
      FROM sales s
      GROUP BY s.customer_id) s_agg
JOIN customer c
ON (c.id = s_agg.customer_id)
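The rewrite can be verified end to end on a toy dataset. The sketch below uses SQLite as a stand-in for both sources (the schema and rows are invented), and handles AVG correctly by pushing down SUM and COUNT rather than a pre-computed average, since an average of averages would be wrong:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, state TEXT);
    CREATE TABLE sales (customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1,'CA'),(2,'CA'),(3,'NY');
    INSERT INTO sales VALUES (1,10),(1,20),(2,30),(3,40),(3,50);
""")

# Naive strategy: join full detail rows, then aggregate.
naive = con.execute("""
    SELECT c.state, AVG(s.amount)
    FROM customer c JOIN sales s ON c.id = s.customer_id
    GROUP BY c.state ORDER BY c.state
""").fetchall()

# Partial aggregation pushdown: pre-aggregate sales by customer first
# (the piece that would run at the sales source), then join the much
# smaller result and finish the aggregation by state.
pushed = con.execute("""
    SELECT c.state, SUM(s_agg.total) / SUM(s_agg.cnt)
    FROM (SELECT customer_id, SUM(amount) AS total, COUNT(*) AS cnt
          FROM sales GROUP BY customer_id) s_agg
    JOIN customer c ON c.id = s_agg.customer_id
    GROUP BY c.state ORDER BY c.state
""").fetchall()

assert naive == pushed  # same answer, far fewer rows crossing the join
print(naive)  # [('CA', 20.0), ('NY', 45.0)]
```

With the real cardinalities from the slide, the pushed-down subquery is what shrinks the 300M-row fact table to 2M rows before anything leaves the source.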
27. Why is this so important?
SELECT c.state, AVG(s.amount)
FROM customer c JOIN sales s
ON c.id = s.customer_id
GROUP BY c.state
How Denodo works compared with other federation engines:
System | Execution Time | Data Transferred | Optimization Technique
Denodo | 9 sec.         | 4 M              | Aggregation push-down
Others | 125 sec.       | 302 M            | None: full scan
[Diagram: naïve plan (join 300M Sales rows with 2M Customer rows, then group by) vs. partial aggregation pushdown (group by customer ID at the source, transfer 2M rows, then group by state).]
To maximize pushdown to the EDW, the aggregation is split in two steps:
• 1st by customer ID
• 2nd by state
This significantly reduces network traffic and processing in Denodo.
28. Massively Parallel Processing: Example
SELECT c.name, AVG(s.amount)
FROM customer c JOIN sales s
ON c.id = s.customer_id
GROUP BY c.name
[Diagram: Sales (300M rows) → group by customer ID → 2M rows (sales by customer) → join with Customer (2M rows) → group by name.]
Similar to the previous query, but now aggregating by customer name. What changes?
• Partial aggregation pushdown: maximizes source processing and reduces network traffic.
• Swapping to disk: the aggregation by customer name produces a larger result set (2M rows) that exceeds the memory quota. Denodo will swap to disk to perform the intermediate calculation.
• Serial calculation: Denodo performs the aggregation in serial, one row after another. With a larger volume, this now becomes the execution bottleneck.
Before MPP Integration
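The "swapping to disk" step can be illustrated with a minimal hash-partitioned aggregation that spills to temp files. This is a toy sketch of the general external-aggregation technique, not Denodo's implementation; the function name and data are invented.

```python
import tempfile
from collections import defaultdict
from pathlib import Path

def spill_aggregate(rows, num_partitions=4):
    """Sum amount per key for (key, amount) rows, spilling partitions to disk
    so that no more than one partition's groups are held in memory at once."""
    result = {}
    with tempfile.TemporaryDirectory() as tmpdir:
        parts = [Path(tmpdir) / f"part{i}" for i in range(num_partitions)]
        files = [p.open("w") for p in parts]
        for key, amount in rows:              # phase 1: hash-partition to disk
            files[hash(key) % num_partitions].write(f"{key}\t{amount}\n")
        for f in files:
            f.close()
        for p in parts:                       # phase 2: aggregate each partition
            partial = defaultdict(float)
            for line in p.read_text().splitlines():
                key, amount = line.split("\t")
                partial[key] += float(amount)
            result.update(partial)            # partitions hold disjoint keys
    return result

rows = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5), ("carol", 1.0)]
print(sorted(spill_aggregate(rows).items()))
# [('alice', 12.5), ('bob', 5.0), ('carol', 1.0)]
```

Because each key hashes to exactly one partition, the partitions can be aggregated independently, which is also what makes this phase parallelizable on the next slide.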
29. Massively Parallel Processing: Example
[Diagram, with MPP integration: Sales (300M rows) is partially aggregated by customer ID into 2M rows, transferred to the MPP cluster, joined with Customer (2M rows), and grouped for the final result.]
System   | Execution Time | Optimization Techniques
Others   | ~10 min        | Basic
No MPP   | 43 sec         | Aggregation push-down
With MPP | 11 sec         | Aggregation push-down + MPP integration (Impala, 8 nodes)
1. Partial aggregation pushdown: maximizes source processing and reduces network traffic.
2. Integration with the cost-based optimizer: based on data volume estimation and the cost of these particular operations, the CBO can decide to move all or part of the execution tree to the MPP system.
3. On-demand data transfer: for SQL-on-Hadoop systems, Denodo automatically generates and uploads Parquet files.
4. Integration with local and pre-cached data: the engine detects when data is cached or is a native table in the MPP system.
5. Fast parallel execution: support for Spark, Presto, and Impala for fast analytical processing on inexpensive Hadoop-based systems.
With MPP Integration
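The split / partial-aggregate / merge pattern behind these numbers can be sketched as follows. A thread pool stands in for the MPP cluster's nodes (real engines such as Impala, Spark, or Presto run the partials on separate machines), and the data is invented:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partial_sum(partition):
    """Partial aggregation: sum amount per customer within one partition."""
    acc = Counter()
    for customer_id, amount in partition:
        acc[customer_id] += amount
    return acc

sales = [(1, 10), (2, 5), (1, 20), (3, 7), (2, 5), (1, 30)]
partitions = [sales[i::3] for i in range(3)]      # 3-way split, like 3 nodes

with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(partial_sum, partitions))

merged = Counter()
for p in partials:                                # merge phase (cheap: few groups)
    merged += p
print(sorted(merged.items()))  # [(1, 60), (2, 10), (3, 7)]
```

The merge phase only sees one row per (node, group) pair, so the expensive scan of the fact table is fully parallelized while the final combine stays small.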
31. Key Takeaways
1. Data Virtualization is a key technology when building a modern data architecture.
2. It provides flexibility and agility and reduces the time to deliver data to the business by up to 10X.
3. Data Virtualization hides the complexity of a constantly changing data infrastructure from the users.
4. In doing so, it allows you to introduce new technologies, formats, protocols, etc. without causing user disruption.
5. Beware! Not all Data Virtualization platforms are equal…compare them against the five essential capabilities above.
33. Next steps
Download Denodo Express:
www.denodoexpress.com
Access Denodo Platform in the Cloud!
30 day FREE trial available!
Denodo for Azure:
www.denodo.com/TrialAzure/PackedLunch
Denodo for AWS: www.denodo.com/TrialAWS/PackedLunch
34. Next session
From Single Purpose to Multi-Purpose Data Lakes – Broadening End Users
Thursday, August 16, 2018
Paul Moxon
VP Data Architectures & Chief Evangelist, Denodo