Watch full webinar here: https://bit.ly/32c6TnG
Advanced data science techniques, like machine learning, have proven to be extremely useful tools for deriving valuable insights from existing data. Platforms like Spark and complex libraries for R, Python, and Scala put advanced techniques at the fingertips of data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative for addressing these issues in a more efficient and agile way.
Attend this webinar and learn:
- How data virtualization can accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- How popular tools from the data science ecosystem (Spark, Python, Zeppelin, Jupyter, etc.) integrate with Denodo
- How you can use the Denodo Platform with large data volumes in an efficient way
- About the success McCormick has had as a result of seasoning the machine learning and blockchain landscape with data virtualization
AI and Machine Learning Need Data
- Predicting high-risk patients: data includes patient demographics, family history, patient vitals, lab test results, past medication history, visits to the hospital, and any claims data.
- Predicting equipment failure: data may include maintenance logs kept by the technicians, especially for older machines; for newer machines, data coming in from the machine's different sensors, including temperature, running time, power level durations, and error messages.
- Predicting default risks: data includes company or individual demographics, products purchased or used, past payment history, customer support logs, and any recent adverse events.
- Preventing fraudulent claims: data includes the location where the claim originated, time of day, claimant history, claim amount, and even public data such as the National Fraud Database.
- Predicting customer churn: data includes customer demographics, products purchased, product usage, customer calls, time since last contact, past transaction history, industry, company size, and revenue.
Confirmation of the Constraints on ML/AI…
Source: Machine learning in UK financial services, Bank of England and Financial Conduct Authority, October 2019
Typical Data Science Workflow
A typical workflow for a data scientist is:
1. Gather the requirements for the business problem
2. Identify data useful for the case
• Ingest data
3. Cleanse data into a useful format
4. Analyze data
5. Prepare input for your algorithms
6. Execute data science algorithms (ML, AI, etc.)
• Iterate steps 2-6 until valuable insights are produced
7. Visualize and share
Typical Data Science Workflow
80% of time – Finding and preparing the data
10% of time – Analysis
10% of time – Visualizing data
Where Does Your Time Go?
A large amount of time and effort goes into tasks not intrinsically related to data science:
• Finding where the right data might be
• Getting access to the data
• Bureaucracy
• Understanding access methods and technology (NoSQL, REST APIs, etc.)
• Transforming data into a format that is easy to work with
• Combining data originally available in different sources and formats
• Profiling and cleansing data to eliminate incomplete or inconsistent data points
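To make that cost concrete, here is a minimal Python sketch of the kind of hand-written acquisition and massaging these bullets describe: pulling from a CSV extract and a REST API, reconciling formats, and cleansing. The file name, URL, and column names are hypothetical.

import pandas as pd
import requests

# Source 1: a CSV extract from a data warehouse (hypothetical file)
trips = pd.read_csv("trips_2018.csv", parse_dates=["start_time"])

# Source 2: a REST API returning JSON (hypothetical endpoint)
resp = requests.get("https://api.example.com/stations")
stations = pd.json_normalize(resp.json())

# Reconcile key types before the two sources can be combined
trips["station_id"] = trips["station_id"].astype(str)
stations["station_id"] = stations["station_id"].astype(str)
combined = trips.merge(stations, on="station_id", how="left")

# Profile and cleanse: drop incomplete or inconsistent data points
combined = combined.dropna(subset=["start_time", "station_id"])
combined = combined[combined["trip_duration_sec"] > 0]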
Gartner – Logical Data Warehouse
"Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs", Henry Cook, Gartner, April 2018
Gartner writes ("Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs", May 2018):
"When designed properly, Data Virtualization can speed data integration, lower data latency, offer flexibility and reuse, and reduce data sprawl across dispersed data sources. Due to its many benefits, Data Virtualization is often the first step for organizations evolving a traditional, repository-style data warehouse into a Logical Architecture."
Benefits of a Virtual Data Layer
A virtual layer improves decision making and shortens development cycles:
• Surfaces all company data from multiple repositories without the need to replicate all data into a lake
• Eliminates data silos: allows for on-demand combination of data from multiple sources
A virtual layer broadens usage of data:
• Improves governance and metadata management to avoid "data swamps"
• Decouples data source technology: access is normalized via SQL or web services
• Allows controlled access to the data with fine-grained security controls
A virtual layer offers performant access:
• Leverages the processing power of the existing sources, controlled by Denodo's optimizer
• Processes data for sources with no processing capabilities (e.g. files)
• Provides a caching and ingestion engine to persist data when needed
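As a minimal sketch of what "access normalized via SQL" means for a data scientist in practice: one ODBC connection and one query, regardless of where the data physically lives. This assumes the Denodo ODBC driver is configured; the DSN, credentials, and view name are hypothetical.

import pandas as pd
import pyodbc

# One connection to the virtual layer (hypothetical DSN and credentials)
conn = pyodbc.connect("DSN=denodo;UID=data_scientist;PWD=secret")

# One SQL statement replaces per-source access code (NoSQL, REST APIs, files...)
df = pd.read_sql("SELECT * FROM customer_churn_features", conn)
print(df.head())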
Data Scientist Workflow Steps
Identify useful data → Modify data into a useful format → Analyze data → Prepare for ML algorithm → Execute data science algorithms (ML, AI, etc.) → Share with business users
Our Citibike Hypothesis
Predicting Citibike trips: data includes historical Citibike trip data, subscriptions and 24-hour passes purchased, historical hourly weather data, and date information (weekends, public holidays, etc.).
[Architecture diagram: a data source layer of base views abstracts the underlying sources (historical trip data and subscriber data in a data warehouse, the Citibike REST API, historical weather data from the National Weather Service, and date data in a data lake). A business layer applies transformation and cleansing through derived and unified views (Citibike trip view, weather and date view), and an application layer exposes mart views (Citibike data view) to analytics notebooks and Python via JDBC/ODBC/ADO.Net and SOAP/REST web services. Around this, the Denodo Platform provides the query optimizer, data caching, scheduled tasks, security, governance, monitoring and audit, lifecycle management, and development tools and SDK.]
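As a rough illustration of how a notebook consumes this architecture, the sketch below aggregates daily trips from the unified view over ODBC. The DSN, view, and column names are illustrative assumptions, not the webinar's actual model.

import pandas as pd
import pyodbc

conn = pyodbc.connect("DSN=denodo;UID=analyst;PWD=secret")

# Daily trip counts with weather and calendar features, served by the
# unified view in the business layer (hypothetical names)
query = """
    SELECT trip_date, is_holiday, is_weekend, avg_temp_f, precipitation_in,
           COUNT(*) AS trips
    FROM citibike_unified_view
    GROUP BY trip_date, is_holiday, is_weekend, avg_temp_f, precipitation_in
"""
daily = pd.read_sql(query, conn)
print(daily.describe())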
What We’re Going To Do…
1. Connect to data and have a look (Find Data)
2. Format the data (prep it) so that we can look for significant factors (Explore Data)
• e.g. bike trips on different days of the week, different months of the year, etc.
3. Once we've decided on the significant attributes, prepare that data for the ML algorithm (Prepare the Data)
4. Using Python, read the 2018 data and run it through our ML algorithm for training (Train the Model)
5. Read the 2019 data and test the algorithm (Test the Model)
6. Save the results and load them into the Denodo Platform (Save the Results)
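A minimal Python sketch of steps 4-6, under the same assumptions as above: a hypothetical daily-trips view with weather and date features, with scikit-learn standing in for "our ML algorithm", which the webinar does not pin down.

import pandas as pd
import pyodbc
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

conn = pyodbc.connect("DSN=denodo;UID=data_scientist;PWD=secret")
features = ["avg_temp_f", "precipitation_in", "is_weekend", "is_holiday"]

# Step 4: read the 2018 data and train (hypothetical view/column names)
train = pd.read_sql(
    "SELECT * FROM citibike_daily_trips WHERE trip_year = 2018", conn)
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(train[features], train["trips"])

# Step 5: read the 2019 data and test
test = pd.read_sql(
    "SELECT * FROM citibike_daily_trips WHERE trip_year = 2019", conn)
test["predicted_trips"] = model.predict(test[features])
print("MAE:", mean_absolute_error(test["trips"], test["predicted_trips"]))

# Step 6: save the results so they can be loaded back into the Denodo Platform
test[["trip_date", "trips", "predicted_trips"]].to_csv(
    "predicted_trips_2019.csv", index=False)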
Prologis – Operationalizing AI/ML
$1.5 trillion is the economic value of goods flowing through our distribution centers each year, representing:
• 2.8% of GDP for the 19 countries where we do business
• 2.0% of the world's GDP
Founded 1983 • Global 100 most sustainable corporations • 768 MSF • $87B assets under management on four continents • 1.0 million employees under Prologis' roofs
Prologis – Data Science Workflow
Step 1: Expose Data to Data Scientists
Prologis – Data Science Workflow
Step 2: Operationalization of model scoring: a web service (Python model scoring) running on AWS Lambda
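A minimal sketch of what such a Lambda-hosted scoring service might look like. The shape is an assumption (an API Gateway proxy event, and a pickled scikit-learn model shipped in a Lambda layer under /opt), not Prologis' actual code; the feature names are hypothetical.

import json
import pickle

# Load the trained model once per container, at cold start
# (hypothetical path: a model file shipped in a Lambda layer)
with open("/opt/model.pkl", "rb") as f:
    MODEL = pickle.load(f)

def handler(event, context):
    # Score a single record posted in the request body (hypothetical features)
    body = json.loads(event["body"])
    row = [[body["feature_a"], body["feature_b"], body["feature_c"]]]
    return {
        "statusCode": 200,
        "body": json.dumps({"score": float(MODEL.predict(row)[0])}),
    }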
Key Takeaways
• The Denodo Platform makes all kinds of data, from a variety of data sources, readily available to your data analysts and data scientists
• Data virtualization shortens the 'data wrangling' phases of analytics/ML projects, avoiding the need to write 'data prep' scripts in Python, R, etc.
• It's easy to access and analyze the data from analytics tools such as Zeppelin or Jupyter
• You can use the Denodo Platform to share the results of your analytics with others
• Finally… people don't like to ride their bikes in the snow
Next Steps
Access Denodo Platform in the Cloud!
Take a Test Drive today!
www.denodo.com/TestDrive
Get started today