Advanced Analytics and Machine Learning with Data Virtualization

Denodo
21 Feb 2020
  1. DATA VIRTUALIZATION PACKED LUNCH WEBINAR SERIES Sessions Covering Key Data Integration Challenges Solved with Data Virtualization
  2. Advanced Analytics and Machine Learning with Data Virtualization Paul Moxon SVP Data Architectures & Chief Evangelist, Denodo
  3. 3 The Economist, May 2017 The world’s most valuable resource is no longer oil, but data.
  4. 4 Data – Like Oil – Is Not Easy To Extract and Use
  5. 5 AI and Machine Learning Needs Data
• Predicting high-risk patients: data includes patient demographics, family history, patient vitals, lab test results, past medication history, visits to the hospital, and any claims data.
• Predicting equipment failure: data may include maintenance logs kept by the technicians, especially for older machines. For newer machines, data comes in from the machine's sensors, including temperature, running time, power level durations, and error messages.
• Predicting default risks: data includes company or individual demographics, products purchased or used, past payment history, customer support logs, and any recent adverse events.
• Preventing fraudulent claims: data includes the location where the claim originated, time of day, claimant history, claim amount, and even public data such as the National Fraud Database.
• Predicting customer churn: data includes customer demographics, products purchased, product usage, customer calls, time since last contact, past transaction history, industry, company size, and revenue.
  6. 6 But the Data is Somewhere in Here…
  7. 7 Confirmation of the Constraints on ML/AI… Source: Machine learning in UK financial services, Bank of England and Financial Conduct Authority, October 2019
  8. 8 The Scale of the Problem…
  9. 9 Typical Data Science Workflow
A typical workflow for a data scientist is:
1. Gather the requirements for the business problem
2. Identify data useful for the case
• Ingest data
3. Cleanse data into a useful format
4. Analyze data
5. Prepare input for your algorithms
6. Execute data science algorithms (ML, AI, etc.)
• Iterate 2-6 until valuable insights are produced
7. Visualize and share
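The "cleanse" and "prepare" steps above can be sketched in a few lines of pandas. This is a minimal illustration on made-up trip records (the column names and bad-record rules are assumptions for the example, not part of the demo):

```python
import pandas as pd

# Hypothetical raw trip records as they might arrive from a source system
raw = pd.DataFrame({
    "trip_id": [1, 2, 3, 4],
    "start_time": ["2018-01-01 08:00", "2018-01-01 09:30", None, "2018-01-02 17:15"],
    "duration_sec": [600, 1200, 300, -50],  # a negative duration is a bad record
})

# Step 3: cleanse - drop rows with missing timestamps or impossible durations
clean = raw.dropna(subset=["start_time"])
clean = clean[clean["duration_sec"] > 0].copy()

# Step 5: prepare input for the algorithm - derive usable features
clean["start_time"] = pd.to_datetime(clean["start_time"])
clean["hour"] = clean["start_time"].dt.hour
clean["duration_min"] = clean["duration_sec"] / 60.0

features = clean[["hour", "duration_min"]]
print(features)
```

In a virtualized setup, the point of the deck is that much of this cleansing can live in the virtual layer instead of in per-project scripts.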
  10. 10 Typical Data Science Workflow
• 80% of time: finding and preparing the data
• 10% of time: analysis
• 10% of time: visualizing data
  11. 11 Where Does Your Time Go?
A large amount of time and effort goes into tasks not intrinsically related to data science:
• Finding where the right data might be
• Getting access to the data
• Bureaucracy
• Understanding access methods and technology (NoSQL, REST APIs, etc.)
• Transforming data into a format that is easy to work with
• Combining data originally available in different sources and formats
• Profiling and cleansing data to eliminate incomplete or inconsistent data points
  12. 12 Gartner – The Evolution of Analytical Environments
This is a second major cycle of analytical consolidation. [Gartner diagram, ID: 342254, ©2018 Gartner, Inc., showing four eras:]
• 1980s, pre-EDW: fragmented or nonexistent analysis across multiple structured operational sources
• 1990s, EDW: unified analysis over consolidated data ("collect the data"); a single server with multiple nodes delivers more analysis than any one server can provide
• 2000s, post-EDW: fragmented analysis returns as data is collected into different repositories (data warehouse, cubes, data lake) with new data types, new processing requirements, and uncoordinated views
• 2010s, LDW: unified analysis over a logically consolidated view of all data ("connect and collect"); multiple servers of multiple nodes (data warehouse, data lake, marts, ODS, staging/ingest) deliver more analysis than any one system can provide
  13. 13 Gartner – Logical Data Warehouse “Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs”. Henry Cook, Gartner April 2018 DATA VIRTUALIZATION
  14. Gartner, Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs, May 2018 “When designed properly, Data Virtualization can speed data integration, lower data latency, offer flexibility and reuse, and reduce data sprawl across dispersed data sources. Due to its many benefits, Data Virtualization is often the first step for organizations evolving a traditional, repository-style data warehouse into a Logical Architecture”
  15. 15 Benefits of a Virtual Data Layer
A Virtual Layer improves decision making and shortens development cycles:
• Surfaces all company data from multiple repositories without the need to replicate all data into a lake
• Eliminates data silos: allows for on-demand combination of data from multiple sources
A Virtual Layer broadens usage of data:
• Improves governance and metadata management to avoid "data swamps"
• Decouples data source technology; access is normalized via SQL or web services
• Allows controlled access to the data with fine-grained security controls
A Virtual Layer offers performant access:
• Leverages the processing power of the existing sources, controlled by Denodo's optimizer
• Processes data on behalf of sources with no processing capabilities of their own (e.g. files)
• Provides a caching and ingestion engine to persist data when needed
  16. 16 Data Scientist Workflow Steps
1. Identify useful data
2. Modify data into a useful format
3. Analyze data
4. Prepare for the ML algorithm
5. Execute data science algorithms (ML, AI, etc.)
6. Share with business users
  17. Demonstration Advanced Analytics and Machine Learning with Data Virtualization 17
  18. 18 https://flic.kr/p/x8HgrF Can we predict the usage of the NYC bike system based on data from previous years?
  19. 19 Data Sources – Citibike
  20. 20 There are external factors to consider. Which ones? https://flic.kr/p/CYT7SS
  21. 21 Data Sources – NWS Weather Data
  22. 22 Our Citibike Hypothesis
Predicting Citibike trips: data includes historical Citibike trip data, subscriptions and 24-hour passes purchased, historical hourly weather data, and date information (weekends, public holidays, etc.).
[Denodo reference architecture diagram:]
• Application Layer: Citibike Data View and Mart View, exposed via JDBC/ODBC/ADO.Net and SOAP/REST web services to analytics notebooks (Python)
• Business Layer: unified views (Citibike Trip View, Weather and Date View) built from derived views performing transformation and cleansing
• Data Source Layer: base views abstracting the sources - historical trip data and subscriber data (data warehouse), the Citibike REST API, date data, and historical National Weather Service weather data (data lake)
• Platform services: development tools and SDK, scheduled tasks, data caching, query optimizer, development lifecycle management, monitoring and audit, governance, security
  23. 23 What We’re Going To Do…
1. Connect to data and have a look (Find Data)
2. Format the data (prep it) so that we can look for significant factors, e.g. bike trips on different days of the week, different months of the year, etc. (Explore Data)
3. Once we’ve decided on the significant attributes, prepare that data for the ML algorithm (Prepare the Data)
4. Using Python, read the 2018 data and run it through our ML algorithm for training (Train the Model)
5. Read the 2019 data and test the algorithm (Test the Model)
6. Save the results and load them into the Denodo Platform (Save the Results)
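The train-then-test shape of steps 4-5 can be sketched as below. The data here is synthetic and the feature set (temperature, weekend flag) is an assumption for illustration; a plain least-squares fit stands in for whatever algorithm the demo actually uses:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical daily features for a "2018" training year: [avg temp C, is_weekend]
X_train = np.column_stack([
    rng.uniform(-5, 30, 365),                  # temperature
    (np.arange(365) % 7 >= 5).astype(float),   # weekend flag
])
# Synthetic target: trips rise with temperature, dip on weekends, plus noise
y_train = 5000 + 400 * X_train[:, 0] - 1500 * X_train[:, 1] + rng.normal(0, 500, 365)

# "Train the model": ordinary least squares with an intercept column
A = np.column_stack([np.ones(len(X_train)), X_train])
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)

# "Test the model" on "2019"-style inputs: a warm weekday vs. a cold weekend day
X_test = np.array([[20.0, 0.0], [2.0, 1.0]])
A_test = np.column_stack([np.ones(len(X_test)), X_test])
y_pred = A_test @ coef
print(y_pred)  # the warm weekday should predict far more trips
```

Step 6 would then write `y_pred` back to a table that the Denodo Platform exposes alongside the source views.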
  24. Demo 24
  25. 25 Prologis – Operationalizing AI/ML
$1.5 TRILLION is the economic value of goods flowing through our distribution centers each year, representing 2.8% of GDP for the 19 countries where we do business and 2.0% of the world’s GDP.
• Founded: 1983
• Global 100 Most Sustainable Corporations
• 768 MSF on four continents
• $87B assets under management
• 1.0 million employees under Prologis’ roofs
  26. 26 Prologis – Data Science Workflow Step 1: Expose Data to Data Scientists
  27. 27 Prologis – Data Science Workflow Step 2: Operationalization of Model Scoring Web Service (Python Model Scoring) AWS Lambda
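A model-scoring web service on AWS Lambda, as in Step 2 above, typically boils down to a handler that parses a JSON request, applies the model, and returns predictions. The sketch below is a minimal, hypothetical handler; the coefficients, field names, and request shape are invented for illustration (a real deployment would load a serialized model, e.g. from S3):

```python
import json

# Stand-in "model": in practice this would be deserialized from S3 or
# packaged with the function (e.g. a pickled scikit-learn estimator).
COEFFICIENTS = {"intercept": 5000.0, "temperature": 400.0, "is_weekend": -1500.0}

def score(features):
    """Apply the stand-in linear model to one feature dict."""
    return (COEFFICIENTS["intercept"]
            + COEFFICIENTS["temperature"] * features["temperature"]
            + COEFFICIENTS["is_weekend"] * features["is_weekend"])

def lambda_handler(event, context):
    """AWS Lambda entry point: expects a JSON body with a list of records."""
    body = json.loads(event["body"])
    predictions = [score(rec) for rec in body["records"]]
    return {
        "statusCode": 200,
        "body": json.dumps({"predictions": predictions}),
    }

# Local smoke test - no AWS needed
event = {"body": json.dumps({"records": [{"temperature": 20.0, "is_weekend": 0}]})}
response = lambda_handler(event, None)
print(response["body"])
```

Keeping the handler this thin is what makes the operationalization step cheap: the heavy lifting (data access, feature assembly) stays behind the virtual layer.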
  28. 28 Data Science – Developing for Operationalization
  29. 29 Data Science – Operationalization in Production
  30. 30 Data Science Toolkit
  31. Key Takeaways 31
  32. 32 Key Takeaways
• The Denodo Platform makes all kinds of data, from a variety of data sources, readily available to your data analysts and data scientists
• Data virtualization shortens the ‘data wrangling’ phases of analytics/ML projects, avoiding the need to write ‘data prep’ scripts in Python, R, etc.
• It’s easy to access and analyze the data from analytics tools such as Zeppelin or Jupyter
• You can use the Denodo Platform to share the results of your analytics with others
• Finally… people don’t like to ride their bikes in the snow
  33. 34 Next Steps
Access Denodo Platform in the Cloud! Take a Test Drive today! www.denodo.com/TestDrive
GET STARTED TODAY
  34. Thank you!
© Copyright Denodo Technologies. All rights reserved. Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization of Denodo Technologies.