Watch full webinar here: https://buff.ly/2mHGaLA
Having started as the most agile, real-time approach to enterprise data integration, data virtualization is going beyond its initial promise and becoming one of the most important enterprise big data fabrics.
Attend this session to learn:
• What data virtualization really is
• How it differs from other enterprise data integration technologies
• Why data virtualization is finding enterprise-wide deployment inside some of the largest organizations
7. The Data Integration Challenge
[Diagram: business users (Marketing, Sales, Executive, Support) manually access different systems; IT responds with point-to-point data integration; it takes too long to get answers to business users. Sources include databases, apps, data warehouses, cloud, big data, documents, and NoSQL stores.]
“Data bottlenecks create business bottlenecks.”
– Create a Road Map For A Real-time, Agile, Self-Service Data
Platform, Forrester Research, Dec 16, 2015
8. The Data Integration Challenge
• It is difficult to integrate numerous on-premises and cloud data sources.
• Traditional tools cannot integrate streaming data and data-at-rest in real time.
• It is difficult to maintain consistent data access and governance policies across data silos.
• Traditional data integration is extremely resource intensive.
9. The Solution – A Data Abstraction Layer
• Abstracts access to disparate data sources
• Acts as a single (virtual) repository
• Makes data available in real time to consumers
“Enterprise architects must revise their data
architecture to meet the demand for fast data.”
– Create a Road Map For A Real-time, Agile, Self-Service Data
Platform, Forrester Research, Dec 16, 2015
10. Data Virtualization
“Data virtualization integrates disparate data sources in real time or near-real time
to meet demands for analytics and transactional data.”
– Create a Road Map For A Real-time, Agile, Self-Service Data Platform, Forrester Research, Dec 16, 2015
1. Connects to disparate data sources
2. Combines related data into views
3. Publishes the data to applications
12. “Data virtualization technology can be used to create virtualized and integrated views of data in memory (rather than executing data movement and physically storing integrated views in a target data structure), and provides a layer of abstraction above the physical implementation of data.”
– Gartner, Market Guide for Data Virtualization, 2016
13. What Data Virtualization is Not!
• It is not ETL
• If you want to replicate data from ‘A’ to ‘B’…use an ETL tool – it’s what they are designed for
• It is not Data Visualization (note the ‘s’)
• It complements visualization and reporting tools (e.g. Tableau)
• It is not a database
• Data Virtualization platforms don’t store the data…it’s retrieved from the data sources on demand
• It has many capabilities such as governance, metadata management, security, etc.
• It will work with specialized tools in these areas
• It’s great for service-based architectures
• But be wary of event-driven architectures…use an ESB (or similar) for this
14. “By 2018, organizations with data virtualization capabilities will spend 40% less on building and managing data integration processes for connecting distributed data assets.”
– Gartner, Predicts 2017: Data Distribution and Complexity Drive Information Infrastructure Modernization, Ted Friedman et al.
16. Five Essential Capabilities of Data Virtualization
1. Data abstraction
2. Zero replication, zero relocation
3. Real-time information
4. Self-service data services
5. Centralized metadata, security & governance
17. 1. Data abstraction
• Abstracts access to disparate data sources.
• Acts as a single virtual repository.
• Abstracts data complexities like location, format, and protocols.
…hides data complexity for ease of data access by the business
“Enterprise architects must revise their data architecture to meet the demand for fast data.”
– Create a Road Map For A Real-time, Agile, Self-Service Data Platform, Forrester Research
18. 2. Zero replication, zero relocation
• Leaves the data at its source; extracts only what is needed, on demand.
• Diminishes the need for effort-intensive ETL processes.
• Eliminates unnecessary data redundancy.
…reduces development time and overall TCO
“The Denodo Platform enables us to build and deliver data services, to our internal and external consumers, within a day instead of the 1–2 weeks it would take with ETL.”
– Manager, DrillingInfo
19. 3. Real-time information
• Provisions data in real time to consumers.
• Creates real-time logical views of data across many data sources.
• Supports transformations and quality functions without the latency, redundancy, and rigidity of legacy approaches.
…enables timely decision-making
“Data virtualization integrates disparate data sources in real time or near-real time to meet demands for analytics and transactional data.”
– Create a Road Map For A Real-time, Agile, Self-Service Data Platform, Forrester Research, Dec 16, 2015
20. 4. Self-service data services
• Facilitates access to all data, both internal and external.
• Enables creation of universal semantic models reflecting business taxonomy.
• Connects data silos to provide the best available information to drive business decisions.
…enables information discovery and self-service
“Impressively quick turnaround time to ‘unlock’ data from additional silos and from legacy systems. Few vendors (if any) can compete with Denodo’s support of the RESTful/OData standard, both to provide data (northbound) and to access data from the sources (southbound).”
– Business Analyst, Swiss Re
21. 5. Centralized metadata, security & governance
• Abstracts data source security models and enables single-point security and governance.
• Extends single-point control across cloud and on-premises architectures.
• Provides multiple forms of metadata (technical, business, operational) to facilitate understanding of data.
…simplifies data security, privacy, and audit
“Our Denodo rollout was one of the easiest and most successful rollouts of critical enterprise software I have seen. It was successful in handling our initial security use case immediately, and has since shown a strong ability to cover additional use cases, in particular acting as a Data Abstraction Layer via its web service functionality.”
– Enterprise Architect, Asurion
22. Denodo ‘Solution’ Categories
Customer Centricity / MDM
✓ Complete View of Customer
Data Services
✓ Data as a Service
✓ Data Marketplace
✓ Data Services
✓ Application and Data Migration
Cloud Solutions
✓ Cloud Modernization
✓ Cloud Analytics
✓ Hybrid Data Fabric
Data Governance
✓ GRC
✓ GDPR
✓ Data Privacy / Masking
BI and Analytics
✓ Self-Service Analytics
✓ Logical Data Warehouse
✓ Enterprise Data Fabric
Big Data
✓ Logical Data Lake
✓ Data Warehouse Offloading
✓ IoT Analytics
24. Demo Architecture
What’s the impact of a new marketing campaign for each country?
▪ Historical sales data offloaded to a Hadoop cluster for cheaper storage
▪ Marketing campaigns managed in an external cloud app
▪ Country is part of the customer details table, stored in the DW
[Diagram: the Source Abstraction layer exposes base views over the Sales, Campaign, and Customer sources; the Combine, Transform & Integrate layer joins them and groups by state; the result is published to the Consume layer.]
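The three-layer flow of the demo can be sketched in miniature. This is an illustrative toy only: the table contents and column names are invented, and plain Python stands in for the virtual layer.

```python
from collections import defaultdict

# Stand-in "base views", one per source; all names and rows are invented.
sales = [                         # Hadoop: (customer_id, campaign_id, amount)
    (1, "C1", 100.0), (1, "C2", 50.0), (2, "C1", 200.0), (3, "C2", 75.0),
]
campaigns = {"C1": "Spring Promo", "C2": "Summer Promo"}  # cloud app
customers = {1: "US", 2: "US", 3: "DE"}                   # DW: id -> country

# Combine/Transform/Integrate: join the views, then group by (campaign, country).
totals = defaultdict(float)
for cust_id, camp_id, amount in sales:
    totals[(campaigns[camp_id], customers[cust_id])] += amount

# Consume: campaign impact per country.
for (campaign, country), amount in sorted(totals.items()):
    print(campaign, country, amount)
```

The point of the layering is that consumers only see the final view; which source each column comes from is hidden behind the base views.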
26. What is the optimizer doing?
SELECT c.state, AVG(s.amount)
FROM customer c JOIN sales s
ON c.id = s.customer_id
GROUP BY c.state
[Diagram: three candidate execution strategies for this query over Sales (300M rows) and Customer (2M rows):
• Naïve strategy – join the full tables and then group by state; roughly 300M rows must be transferred to the virtualization layer.
• Temporary data movement – copy Customer (2M rows) into a temp table (Temp_Customer) next to Sales, then join and group by at that source.
• Partial aggregation pushdown – pre-aggregate Sales by customer ID at the source (2M rows transferred), join with Customer, then group by state (50 result rows).]
The pushed-down portion of the partial aggregation strategy:
SELECT c.id, amount
FROM (SELECT s.customer_id, SUM(amount) amount
      FROM sales s
      GROUP BY s.customer_id) s_agg
JOIN customer c
ON (c.id = s_agg.customer_id)
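The rewrite can be verified end to end on a toy dataset. The sketch below uses SQLite as a stand-in for both sources (the schema and rows are invented), and handles AVG correctly by pushing down SUM and COUNT rather than a pre-computed average, since an average of averages would be wrong:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, state TEXT);
    CREATE TABLE sales (customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1,'CA'),(2,'CA'),(3,'NY');
    INSERT INTO sales VALUES (1,10),(1,20),(2,30),(3,40),(3,50);
""")

# Naive strategy: join full detail rows, then aggregate.
naive = con.execute("""
    SELECT c.state, AVG(s.amount)
    FROM customer c JOIN sales s ON c.id = s.customer_id
    GROUP BY c.state ORDER BY c.state
""").fetchall()

# Partial aggregation pushdown: pre-aggregate sales by customer first
# (the piece that would run at the sales source), then join the much
# smaller result and finish the aggregation by state.
pushed = con.execute("""
    SELECT c.state, SUM(s_agg.total) / SUM(s_agg.cnt)
    FROM (SELECT customer_id, SUM(amount) AS total, COUNT(*) AS cnt
          FROM sales GROUP BY customer_id) s_agg
    JOIN customer c ON c.id = s_agg.customer_id
    GROUP BY c.state ORDER BY c.state
""").fetchall()

assert naive == pushed  # same answer, far fewer rows crossing the join
print(naive)  # [('CA', 20.0), ('NY', 45.0)]
```

With the real cardinalities from the slide, the pushed-down subquery is what shrinks the 300M-row fact table to 2M rows before anything leaves the source.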
27. Why is this so important?
SELECT c.state, AVG(s.amount)
FROM customer c JOIN sales s
ON c.id = s.customer_id
GROUP BY c.state
How Denodo works compared with other federation engines:
System | Execution Time | Data Transferred | Optimization Technique
Denodo | 9 sec.         | 4 M              | Aggregation push-down
Others | 125 sec.       | 302 M            | None: full scan
[Diagram: naïve plan (join 300M Sales rows with 2M Customer rows, then group by) vs. partial aggregation pushdown (group by customer ID at the source, transfer 2M rows, then group by state).]
To maximize pushdown to the EDW, the aggregation is split in two steps:
• 1st by customer ID
• 2nd by state
This significantly reduces network traffic and processing in Denodo.
28. Massively Parallel Processing: Example
SELECT c.name, AVG(s.amount)
FROM customer c JOIN sales s
ON c.id = s.customer_id
GROUP BY c.name
[Diagram: Sales (300M rows) → group by customer ID → 2M rows (sales by customer) → join with Customer (2M rows) → group by name.]
Similar to the previous query, but now aggregating by customer name. What changes?
• Partial aggregation pushdown: maximizes source processing and reduces network traffic.
• Swapping to disk: the aggregation by customer name produces a larger result set (2M rows) that exceeds the memory quota. Denodo will swap to disk to perform the intermediate calculation.
• Serial calculation: Denodo performs the aggregation in serial, one row after another. With a larger volume, this now becomes the execution bottleneck.
Before MPP Integration
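The "swapping to disk" step can be illustrated with a minimal hash-partitioned aggregation that spills to temp files. This is a toy sketch of the general external-aggregation technique, not Denodo's implementation; the function name and data are invented.

```python
import tempfile
from collections import defaultdict
from pathlib import Path

def spill_aggregate(rows, num_partitions=4):
    """Sum amount per key for (key, amount) rows, spilling partitions to disk
    so that no more than one partition's groups are held in memory at once."""
    result = {}
    with tempfile.TemporaryDirectory() as tmpdir:
        parts = [Path(tmpdir) / f"part{i}" for i in range(num_partitions)]
        files = [p.open("w") for p in parts]
        for key, amount in rows:              # phase 1: hash-partition to disk
            files[hash(key) % num_partitions].write(f"{key}\t{amount}\n")
        for f in files:
            f.close()
        for p in parts:                       # phase 2: aggregate each partition
            partial = defaultdict(float)
            for line in p.read_text().splitlines():
                key, amount = line.split("\t")
                partial[key] += float(amount)
            result.update(partial)            # partitions hold disjoint keys
    return result

rows = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5), ("carol", 1.0)]
print(sorted(spill_aggregate(rows).items()))
# [('alice', 12.5), ('bob', 5.0), ('carol', 1.0)]
```

Because each key hashes to exactly one partition, the partitions can be aggregated independently, which is also what makes this phase parallelizable on the next slide.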
29. Massively Parallel Processing: Example
[Diagram, with MPP integration: Sales (300M rows) is partially aggregated by customer ID into 2M rows, transferred to the MPP cluster, joined with Customer (2M rows), and grouped for the final result.]
System   | Execution Time | Optimization Techniques
Others   | ~10 min        | Basic
No MPP   | 43 sec         | Aggregation push-down
With MPP | 11 sec         | Aggregation push-down + MPP integration (Impala, 8 nodes)
1. Partial aggregation pushdown: maximizes source processing and reduces network traffic.
2. Integration with the cost-based optimizer: based on data volume estimation and the cost of these particular operations, the CBO can decide to move all or part of the execution tree to the MPP system.
3. On-demand data transfer: for SQL-on-Hadoop systems, Denodo automatically generates and uploads Parquet files.
4. Integration with local and pre-cached data: the engine detects when data is cached or is a native table in the MPP system.
5. Fast parallel execution: support for Spark, Presto, and Impala for fast analytical processing on inexpensive Hadoop-based systems.
With MPP Integration
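The split / partial-aggregate / merge pattern behind these numbers can be sketched as follows. A thread pool stands in for the MPP cluster's nodes (real engines such as Impala, Spark, or Presto run the partials on separate machines), and the data is invented:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partial_sum(partition):
    """Partial aggregation: sum amount per customer within one partition."""
    acc = Counter()
    for customer_id, amount in partition:
        acc[customer_id] += amount
    return acc

sales = [(1, 10), (2, 5), (1, 20), (3, 7), (2, 5), (1, 30)]
partitions = [sales[i::3] for i in range(3)]      # 3-way split, like 3 nodes

with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(partial_sum, partitions))

merged = Counter()
for p in partials:                                # merge phase (cheap: few groups)
    merged += p
print(sorted(merged.items()))  # [(1, 60), (2, 10), (3, 7)]
```

The merge phase only sees one row per (node, group) pair, so the expensive scan of the fact table is fully parallelized while the final combine stays small.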
31. Key Takeaways
1. Data Virtualization is a key technology when building a modern data architecture.
2. It provides flexibility and agility and reduces the time to deliver data to the business by up to 10X.
3. Data Virtualization hides the complexity of a constantly changing data infrastructure from the users.
4. In doing so, it allows you to introduce new technologies, formats, protocols, etc. without causing user disruption.
5. Beware! Not all Data Virtualization platforms are equal…compare them against the five essential capabilities above.
33. Next steps
Download Denodo Express:
www.denodoexpress.com
Access Denodo Platform in the Cloud!
30 day FREE trial available!
Denodo for Azure:
www.denodo.com/TrialAzure/PackedLunch
Denodo for AWS: www.denodo.com/TrialAWS/PackedLunch
34. Next session
From Single Purpose to Multi-Purpose Data Lakes – Broadening End Users
Thursday, August 16, 2018
Paul Moxon
VP Data Architectures & Chief Evangelist, Denodo