Watch full webinar here: https://bit.ly/32c6TnG
Advanced data science techniques, like machine learning, have proven to be extremely useful tools for deriving valuable insights from existing data. Platforms like Spark and complex libraries for R, Python, and Scala put advanced techniques at the fingertips of data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative for addressing these issues in a more efficient and agile way.
Attend this webinar and learn:
- How data virtualization can accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- How popular tools from the data science ecosystem (Spark, Python, Zeppelin, Jupyter, etc.) integrate with Denodo
- How you can use the Denodo Platform with large data volumes in an efficient way
- About the success McCormick has had as a result of seasoning the machine learning and blockchain landscape with data virtualization
AI and Machine Learning Need Data
- Predicting high-risk patients: data includes patient demographics, family history, patient vitals, lab test results, past medication history, visits to the hospital, and any claims data.
- Predicting equipment failure: data may include maintenance logs kept by the technicians, especially for older machines; for newer machines, data coming in from the machine's different sensors, including temperature, running time, power level durations, and error messages.
- Predicting default risks: data includes company or individual demographics, products purchased or used, past payment history, customer support logs, and any recent adverse events.
- Preventing fraudulent claims: data includes the location where the claim originated, time of day, claimant history, claim amount, and even public data such as the National Fraud Database.
- Predicting customer churn: data includes customer demographics, products purchased, product usage, customer calls, time since last contact, past transaction history, industry, company size, and revenue.
Confirmation of the Constraints on ML/AI…
Source: Machine learning in UK financial services, Bank of England and Financial Conduct Authority, October 2019
Typical Data Science Workflow
A typical workflow for a data scientist is:
1. Gather the requirements for the business problem
2. Identify data useful for the case
• Ingest data
3. Cleanse data into a useful format
4. Analyze data
5. Prepare input for your algorithms
6. Execute data science algorithms (ML, AI, etc.)
• Iterate steps 2-6 until valuable insights are produced
7. Visualize and share
Typical Data Science Workflow
80% of time – Finding and preparing the data
10% of time – Analysis
10% of time – Visualizing data
Where Does Your Time Go?
A large amount of time and effort goes into tasks not intrinsically related to data science:
• Finding where the right data might be
• Getting access to the data
• Bureaucracy
• Understanding access methods and technology (NoSQL, REST APIs, etc.)
• Transforming data into a format that is easy to work with
• Combining data originally available in different sources and formats
• Profiling and cleansing data to eliminate incomplete or inconsistent data points
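To make that cost concrete, here is a minimal Python sketch of the kind of hand-written acquisition and massaging these bullets describe: pulling from a CSV extract and a REST API, reconciling formats, and cleansing. The file name, URL, and column names are hypothetical.

import pandas as pd
import requests

# Source 1: a CSV extract from a data warehouse (hypothetical file)
trips = pd.read_csv("trips_2018.csv", parse_dates=["start_time"])

# Source 2: a REST API returning JSON (hypothetical endpoint)
resp = requests.get("https://api.example.com/stations")
stations = pd.json_normalize(resp.json())

# Reconcile key types before the two sources can be combined
trips["station_id"] = trips["station_id"].astype(str)
stations["station_id"] = stations["station_id"].astype(str)
combined = trips.merge(stations, on="station_id", how="left")

# Profile and cleanse: drop incomplete or inconsistent data points
combined = combined.dropna(subset=["start_time", "station_id"])
combined = combined[combined["trip_duration_sec"] > 0]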
Gartner – Logical Data Warehouse
"Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs", Henry Cook, Gartner, April 2018
Gartner writes ("Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs", May 2018):
"When designed properly, Data Virtualization can speed data integration, lower data latency, offer flexibility and reuse, and reduce data sprawl across dispersed data sources. Due to its many benefits, Data Virtualization is often the first step for organizations evolving a traditional, repository-style data warehouse into a Logical Architecture."
Benefits of a Virtual Data Layer
A virtual layer improves decision making and shortens development cycles:
• Surfaces all company data from multiple repositories without the need to replicate all data into a lake
• Eliminates data silos: allows for on-demand combination of data from multiple sources
A virtual layer broadens usage of data:
• Improves governance and metadata management to avoid "data swamps"
• Decouples data source technology: access is normalized via SQL or web services
• Allows controlled access to the data with fine-grained security controls
A virtual layer offers performant access:
• Leverages the processing power of the existing sources, controlled by Denodo's optimizer
• Processes data for sources with no processing capabilities (e.g. files)
• Provides a caching and ingestion engine to persist data when needed
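As a minimal sketch of what "access normalized via SQL" means for a data scientist in practice: one ODBC connection and one query, regardless of where the data physically lives. This assumes the Denodo ODBC driver is configured; the DSN, credentials, and view name are hypothetical.

import pandas as pd
import pyodbc

# One connection to the virtual layer (hypothetical DSN and credentials)
conn = pyodbc.connect("DSN=denodo;UID=data_scientist;PWD=secret")

# One SQL statement replaces per-source access code (NoSQL, REST APIs, files...)
df = pd.read_sql("SELECT * FROM customer_churn_features", conn)
print(df.head())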
Data Scientist Workflow Steps
Identify useful data → Modify data into a useful format → Analyze data → Prepare for ML algorithm → Execute data science algorithms (ML, AI, etc.) → Share with business users
Our Citibike Hypothesis
Predicting Citibike trips: data includes historical Citibike trip data, subscriptions and 24-hour passes purchased, historical hourly weather data, and date information (weekends, public holidays, etc.).
[Architecture diagram: a data source layer of base views abstracts the underlying sources (historical trip data and subscriber data in a data warehouse, the Citibike REST API, historical weather data from the National Weather Service, and date data in a data lake). A business layer applies transformation and cleansing through derived and unified views (Citibike trip view, weather and date view), and an application layer exposes mart views (Citibike data view) to analytics notebooks and Python via JDBC/ODBC/ADO.Net and SOAP/REST web services. Around this, the Denodo Platform provides the query optimizer, data caching, scheduled tasks, security, governance, monitoring and audit, lifecycle management, and development tools and SDK.]
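As a rough illustration of how a notebook consumes this architecture, the sketch below aggregates daily trips from the unified view over ODBC. The DSN, view, and column names are illustrative assumptions, not the webinar's actual model.

import pandas as pd
import pyodbc

conn = pyodbc.connect("DSN=denodo;UID=analyst;PWD=secret")

# Daily trip counts with weather and calendar features, served by the
# unified view in the business layer (hypothetical names)
query = """
    SELECT trip_date, is_holiday, is_weekend, avg_temp_f, precipitation_in,
           COUNT(*) AS trips
    FROM citibike_unified_view
    GROUP BY trip_date, is_holiday, is_weekend, avg_temp_f, precipitation_in
"""
daily = pd.read_sql(query, conn)
print(daily.describe())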
What We’re Going To Do…
1. Connect to data and have a look (Find Data)
2. Format the data (prep it) so that we can look for significant factors (Explore Data)
• e.g. bike trips on different days of the week, different months of the year, etc.
3. Once we've decided on the significant attributes, prepare that data for the ML algorithm (Prepare the Data)
4. Using Python, read the 2018 data and run it through our ML algorithm for training (Train the Model)
5. Read the 2019 data and test the algorithm (Test the Model)
6. Save the results and load them into the Denodo Platform (Save the Results)
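A minimal Python sketch of steps 4-6, under the same assumptions as above: a hypothetical daily-trips view with weather and date features, with scikit-learn standing in for "our ML algorithm", which the webinar does not pin down.

import pandas as pd
import pyodbc
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

conn = pyodbc.connect("DSN=denodo;UID=data_scientist;PWD=secret")
features = ["avg_temp_f", "precipitation_in", "is_weekend", "is_holiday"]

# Step 4: read the 2018 data and train (hypothetical view/column names)
train = pd.read_sql(
    "SELECT * FROM citibike_daily_trips WHERE trip_year = 2018", conn)
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(train[features], train["trips"])

# Step 5: read the 2019 data and test
test = pd.read_sql(
    "SELECT * FROM citibike_daily_trips WHERE trip_year = 2019", conn)
test["predicted_trips"] = model.predict(test[features])
print("MAE:", mean_absolute_error(test["trips"], test["predicted_trips"]))

# Step 6: save the results so they can be loaded back into the Denodo Platform
test[["trip_date", "trips", "predicted_trips"]].to_csv(
    "predicted_trips_2019.csv", index=False)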
Prologis – Operationalizing AI/ML
$1.5 trillion is the economic value of goods flowing through our distribution centers each year, representing:
• 2.8% of GDP for the 19 countries where we do business
• 2.0% of the world's GDP
Founded 1983 • Global 100 most sustainable corporations • 768 MSF • $87B assets under management on four continents • 1.0 million employees under Prologis' roofs
Prologis – Data Science Workflow
Step 1: Expose Data to Data Scientists
Prologis – Data Science Workflow
Step 2: Operationalization of model scoring: a web service (Python model scoring) running on AWS Lambda
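A minimal sketch of what such a Lambda-hosted scoring service might look like. The shape is an assumption (an API Gateway proxy event, and a pickled scikit-learn model shipped in a Lambda layer under /opt), not Prologis' actual code; the feature names are hypothetical.

import json
import pickle

# Load the trained model once per container, at cold start
# (hypothetical path: a model file shipped in a Lambda layer)
with open("/opt/model.pkl", "rb") as f:
    MODEL = pickle.load(f)

def handler(event, context):
    # Score a single record posted in the request body (hypothetical features)
    body = json.loads(event["body"])
    row = [[body["feature_a"], body["feature_b"], body["feature_c"]]]
    return {
        "statusCode": 200,
        "body": json.dumps({"score": float(MODEL.predict(row)[0])}),
    }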
Key Takeaways
• The Denodo Platform makes all kinds of data, from a variety of data sources, readily available to your data analysts and data scientists
• Data virtualization shortens the 'data wrangling' phases of analytics/ML projects, avoiding the need to write 'data prep' scripts in Python, R, etc.
• It's easy to access and analyze the data from analytics tools such as Zeppelin or Jupyter
• You can use the Denodo Platform to share the results of your analytics with others
• Finally… people don't like to ride their bikes in the snow
Next Steps
Access Denodo Platform in the Cloud!
Take a Test Drive today!
www.denodo.com/TestDrive
Get started today