AI meets Big Data

The starter guide for AI on your Hadoop data lake using Big Data with Spark and TensorFlow.

  1. 1. AI meets Big Data: How to cross the chasm
  2. 2. Google TPU
  3. 3. AI history → Perceptron. 1943: W. McCulloch, W. Pitts, “neuron” as logical element. 1958: F. Rosenblatt, “Perceptron” model, neural networks. 1969: M. Minsky, S. Papert, trigger first AI winter. Diagram labels: OR function, XOR function, feed forward.
  4. 4. AI history → AI winter. 1943: W. McCulloch, W. Pitts, neuron as logical element. 1958: F. Rosenblatt, Perceptron model, neural networks. 1969: M. Minsky, S. Papert, trigger first AI winter. 1980: boom of expert systems, Q&A using logical rules, Prolog. 1987-1993: the second AI winter, desktop computers, LISP machines expensive. 1993-2001: Moore’s law, Deep Blue chess-playing, Stanford DARPA challenge.
  5. 5. AI beats humans in games - 2016. Go: AlphaGo beats Lee Sedol 4:1 in 2016. Chess: Komodo beats H. Nakamura 2:1 in 2016.
  6. 6. Image Classification - 2016. The ability to understand the content of an image by using machine learning. Human performance: 95%. AI performance: 97%. Source: https://arxiv.org/pdf/1602.07261.pdf
  7. 7. Breast Cancer Diagnoses - 2017. Doctors often use additional tests to find or diagnose breast cancer. The pathologist ended up spending 30 hours on this task on 130 slides. Pathologist performance: 73%. AI performance: 92%. Image: a closeup of a lymph node biopsy. Source: https://research.googleblog.com/2017/03/assisting-pathologists-in-detecting.html
  8. 8. Machine Learning Problem Types
  9. 9. Structured data: 80% of the world’s data is unstructured.
  10. 10. Fishing in the sea versus fishing in the lake. Business Intelligence helps you find answers to questions you know. Data Science helps you find the question itself. Data Warehouse: structured data & schema-on-write; SQL-ish queries on database tables; Extract, Transform, Load; expensive for large data. Data Lake: any kind of data & schema-on-read; parallel processing on big data; Extract, Load, Transform-on-the-fly; low cost on commodity hardware.
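The schema-on-write versus schema-on-read distinction above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two-column `SCHEMA`, the `Warehouse` and `Lake` classes, and the JSON-lines input are hypothetical and do not come from the slides or from any particular product.

```python
import json
from typing import Any, Dict, List

# Hypothetical two-column schema used only for this illustration.
SCHEMA = {"user": str, "amount": float}


def conforms(row: Dict[str, Any]) -> bool:
    """True if the row has exactly the schema's columns with the right types."""
    return set(row) == set(SCHEMA) and all(
        isinstance(row[name], typ) for name, typ in SCHEMA.items()
    )


class Warehouse:
    """Schema-on-write: a row is validated before it is stored."""

    def __init__(self) -> None:
        self.rows: List[Dict[str, Any]] = []

    def load(self, row: Dict[str, Any]) -> None:
        if not conforms(row):
            raise ValueError(f"row rejected at write time: {row!r}")
        self.rows.append(row)


class Lake:
    """Schema-on-read: raw lines are stored as-is, the schema is applied per query."""

    def __init__(self) -> None:
        self.raw: List[str] = []

    def load(self, line: str) -> None:
        self.raw.append(line)  # no validation at write time

    def read(self) -> List[Dict[str, Any]]:
        out = []
        for line in self.raw:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # unparseable lines stay in the lake but are skipped here
            if not isinstance(record, dict):
                continue
            row = {name: record.get(name) for name in SCHEMA}  # project onto schema
            if conforms(row):
                out.append(row)
        return out


lake = Lake()
lake.load('{"user": "a", "amount": 9.5, "extra": 1}')
lake.load("not json at all")
rows = lake.read()
```

The lake keeps both input lines, including the malformed one, and only the read decides what counts as a valid row. The warehouse would have rejected the second line at load time.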
  11. 11. More Data + Bigger Models. Chart: accuracy versus scale (data size, model size) for neural networks and other approaches, 1990s. Source: https://www.scribd.com/document/355752799/Jeff-Dean-s-Lecture-for-YC-AI
  12. 12. More Data + Bigger Models + More Computation. Chart: accuracy versus scale (data size, model size) for neural networks and other approaches, now, with more compute. Source: https://www.scribd.com/document/355752799/Jeff-Dean-s-Lecture-for-YC-AI
  13. 13. More Data + Bigger Models + More Computation = Better Results in Machine Learning
  14. 14. Millions of “trip” events each day globally: routing and price optimization. 400+ billion viewing-related events per day: movie recommendation. Five billion data points for Price Tip feature: price optimization.
  15. 15. How to start?
  20. 20. Train and evaluate machine learning models at scale, from a single machine to a data center. How to run more experiments faster and in parallel? How to share and reproduce research? How to go from research to real products?
  21. 21. Distributed Machine Learning. Axes: data size and model size, from a single machine to a data center. Model parallelism: training very large models. Data parallelism: speeds up the training. In between: exploring several model architectures, hyper-parameter optimization, training several independent models.
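To make the data-parallelism idea concrete, here is a minimal sketch in plain Python. It assumes a toy linear model `y = w*x + b` and simulates the workers sequentially in one process. The function names, the four-way sharding, and the learning rate are illustrative choices, not anything taken from the slides or from a specific framework.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def shard_gradient(w: float, b: float, shard: Sequence[Point]) -> Tuple[float, float, int]:
    """Summed squared-error gradient for y = w*x + b on one worker's shard."""
    gw = sum(2.0 * (w * x + b - y) * x for x, y in shard)
    gb = sum(2.0 * (w * x + b - y) for x, y in shard)
    return gw, gb, len(shard)


def data_parallel_step(
    w: float, b: float, shards: Sequence[Sequence[Point]], lr: float
) -> Tuple[float, float]:
    """One synchronous step: every worker computes a gradient, then they are averaged."""
    partials = [shard_gradient(w, b, shard) for shard in shards]  # would run in parallel
    n = sum(count for _, _, count in partials)
    gw = sum(g for g, _, _ in partials) / n
    gb = sum(g for _, g, _ in partials) / n
    return w - lr * gw, b - lr * gb


def train(shards: Sequence[Sequence[Point]], steps: int = 2000, lr: float = 0.1) -> Tuple[float, float]:
    w, b = 0.0, 0.0
    for _ in range(steps):
        w, b = data_parallel_step(w, b, shards, lr)
    return w, b


# Noise-free toy data generated from y = 3x + 1, split round-robin over 4 workers.
data: List[Point] = [(i / 10.0, 3.0 * (i / 10.0) + 1.0) for i in range(20)]
shards = [data[i::4] for i in range(4)]
w, b = train(shards)
print(f"w={w:.4f} b={b:.4f}")
```

Because the per-shard gradients are summed and divided by the total number of samples, the update matches what a single machine would compute on the full data set. Only the gradient computation is spread out, which is why this form of parallelism speeds up training without changing the model.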
  22. 22. Compute Workload for Training and Evaluation. Chart axes: I/O intensive, compute intensive; single machine to data center.
  23. 23. I/O Workload for Simulation and Testing. Chart axes: I/O intensive, compute intensive; single machine to data center.
  24. 24. Distributed Machine Learning
  27. 27. Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark (11/24/17). Options: TensorFlow Standalone; TensorFlow on YARN; TensorFlow on multi-colored YARN; TensorFlow on Spark; TensorFrames; TensorFlow on Kubernetes; TensorFlow on Mesos. Source: https://www.slideshare.net/jwiegelmann/distributed-tensorflow-on-hadoop-mesos-kubernetes-spark
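Whatever scheduler launches the processes, each TensorFlow worker has to learn which peers exist and which role it plays. One common mechanism in TensorFlow's distributed APIs is the `TF_CONFIG` environment variable, a JSON document with a `cluster` and a `task` entry. The sketch below only builds that JSON with the standard library. The host names and port are placeholders, the helper `make_tf_config` is hypothetical, and the slides do not say which of the listed options rely on `TF_CONFIG`.

```python
import json
from typing import List


def make_tf_config(workers: List[str], index: int) -> str:
    """Build the TF_CONFIG JSON string for the worker at position `index`."""
    if not 0 <= index < len(workers):
        raise ValueError("index must refer to one of the listed workers")
    return json.dumps(
        {
            "cluster": {"worker": workers},              # every process sees the same cluster
            "task": {"type": "worker", "index": index},  # each process gets its own index
        }
    )


# Placeholder addresses; a real scheduler (YARN, Mesos, Kubernetes) would assign these.
workers = ["worker0.example.com:2222", "worker1.example.com:2222"]
configs = [make_tf_config(workers, i) for i in range(len(workers))]
for cfg in configs:
    print(cfg)
```

In a real job the launcher would export each string as `TF_CONFIG` in the environment of the matching process before TensorFlow starts.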
  28. 28. Hidden Technical Debt in Machine Learning Systems. Google, 2015. https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
  30. 30. TFX: A TensorFlow-Based Production-Scale Machine Learning Platform. Google, 2017. http://stevenwhang.com/tfx_paper.pdf
  31. 31. Michelangelo: Uber’s Machine Learning Platform. https://eng.uber.com/michelangelo/
  33. 33. Pricing for 890,000 real-time predictions without training (Q3 2017). AWS: compute fees + prediction fees = $8.40 + $96.44 = $104.84 per month. Google: prediction $0.10 per thousand predictions, plus $0.40 per hour = $377 per month. Azure: packages $0, $100.13, $1,000.06, $9,999.98 = $1,000 per month.
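The AWS and Google totals can be checked with a few lines of arithmetic. The slide does not say how many hours per month it assumes. A 30-day month (720 hours) is an assumption made here because it reproduces the $377 figure. The Azure figure is a package price rather than a formula, so it is left out.

```python
PREDICTIONS = 890_000
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day month, not stated on the slide

# AWS: the two fee components are taken directly from the slide.
aws_total = 8.40 + 96.44

# Google: $0.10 per thousand predictions plus $0.40 per hour.
google_prediction_fee = PREDICTIONS / 1000 * 0.10
google_hourly_fee = HOURS_PER_MONTH * 0.40
google_total = google_prediction_fee + google_hourly_fee

print(f"AWS:    ${aws_total:.2f} per month")
print(f"Google: ${google_total:.2f} per month")
```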
  35. 35. Who are you? Who do you know? What can you afford? Where are you? What have you purchased? What do you like? What content do you prefer? Why have you contacted us?
  36. 36. Capture Data across Digital Channels. Marketing tools, touchpoints, aftersales: each of these customer interactions produces data.
  37. 37. Breaking Down Data Silos. Connect all your data tools and other sources, and gain a 360 degree view on your data. Get actionable insights and serve customers personal, relevant content along their journey. Real-time processing and decision making. One Data Platform: Marketing Tools, Touchpoints, Historical, Aftersales; Data Analytics, Machine Learning, Data Apps.
  38. 38. 360 degree view of your customers. Actionable data insights that businesses can use to... ✓ Better understand and better engage your customers ✓ Respond to the convergence of customer expectations ✓ Drive brand perception. Customer 360 sources: Social, Apps, CRM, Billing, Channels, Service Call, Location, Devices, Network, Ordering.
  39. 39. Where AI can help… + Predicting lifetime value + Churn estimation + Customer segmentation + Cross/Upselling + Recommendations + Demand forecasting + Market Basket Analysis + Sentiment analysis + Loyalty programs + Reactivation likelihood + Discount targeting + Call to action + Risk analysis + In store traffic patterns
  41. 41. High-level Development Process for Autonomous Vehicles. Steps: 1 Collect sensor data, 2 Model engineering, 3 Autonomous driving. Diagram labels: Data Logger, Big Data, Data Center, Trained Model, Control Unit.
  42. 42. Sensors (Udacity Lincoln MKZ). Camera: 3x Blackfly GigE camera, 20 Hz. Lidar: Velodyne HDL-32E, 9.5 Hz. IMU: Xsens, 400 Hz. GPS: 2x fixed, 1 Hz. CAN bus: 1.1 kHz. Robot Operating System. Data: 3 GB per minute. Source: https://github.com/udacity/self-driving-car
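To get a feel for what 3 GB per minute means for storage and ingestion, the rate can be converted into other units. The 8-hour recording day below is a hypothetical figure chosen only for illustration, and decimal units (1 GB = 1000 MB) are assumed.

```python
GB_PER_MINUTE = 3.0   # from the slide
DRIVE_HOURS = 8       # hypothetical length of one recording day

mb_per_second = GB_PER_MINUTE * 1000 / 60    # sustained write rate
gb_per_hour = GB_PER_MINUTE * 60
tb_per_drive = gb_per_hour * DRIVE_HOURS / 1000

print(f"{mb_per_second:.0f} MB/s, {gb_per_hour:.0f} GB/hour, "
      f"{tb_per_drive:.2f} TB per {DRIVE_HOURS}-hour drive")
```

Under these assumptions a single vehicle produces more than a terabyte per recording day, which is the scale at which a data lake on commodity hardware becomes attractive.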
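  43. 43. Sensors Spec

      | Sensor | blinding, sunlight, darkness | rain, fog, snow | non-metal objects | wind / high velocity | resolution | range | data |
      | --- | --- | --- | --- | --- | --- | --- | --- |
      | Ultrasonic | yes | yes | yes | no | + | + | + |
      | Lidar | yes | no | yes | yes | +++ | ++ | + |
      | Radar | yes | yes | no | yes | ++ | +++ | + |
      | Camera | no | no | yes | yes | +++ | +++ | +++ |
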
  44. 44. Machine Learning 101. Diagram: Observations → State Estimation → Modeling & Prediction → Planning → Controls. f(x): Observations → Controls.
  45. 45. Machine Learning for Autonomous Driving + Sensor Fusion: clustering, segmentation, pattern recognition + Road: ego-motion, image processing and pattern recognition + Localization: simultaneous localization and mapping + Situation Understanding: detection and classification + Trajectory Planning: motion planning and control + Control Strategy: reinforcement and supervised learning + Driver Model: image processing and pattern recognition
  46. 46. Machine Learning Cycle. Stages: data collection for training/test; feature engineering (I/O workload); model development and architecture; training and evaluation (compute workload); re-simulation and testing (I/O workload); model tuning; model deployment and versioning; scaling and monitoring.
  47. 47. Flux – Open Machine Learning Stack. Layers: Training & Test data; Compute + Network + Storage; Deploy model; ML Development & Catalog & REST API. ML specialists: Feature Engineering, Training, Evaluation, Re-Simulation, Testing. Diagram labels: CaffeOnSpark, Sample, Model, Prediction, Batch, Regression, Cluster, Dataset, Correlation, Centroid, Anomaly, Test, Scores. ✓ Mainly open source ✓ No vendor lock-in ✓ Scale-out architecture ✓ Multi-user support ✓ Resource management ✓ Job scheduling ✓ Speed-up training ✓ Speed-up simulation
  48. 48. Feature Engineering + Hadoop InputFormat and Record Reader for Rosbag + Process Rosbag with Spark, Yarn, MapReduce, Hadoop Streaming API, … + Spark RDDs are cached and optimized for analysis. Diagram labels: Ros bag (Compute, Network, Storage); Processing Engine; Record Reader; RDD; DataFrame, DataSet, SQL, Spark APIs; NumPy; Ros Msg; Advanced Analytics.
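What a record reader has to do can be sketched without Hadoop. In the ROS bag format 2.0, each record is framed as a 4-byte little-endian header length, the header (a sequence of length-prefixed `name=value` fields), a 4-byte data length, and the data. The code below is a simplified, in-memory illustration of that framing only. It is not the Hadoop InputFormat from the slide, it skips the version line at the start of a real bag file, and it does not handle chunk compression or input splits. The function names and the sample field values are made up for the example.

```python
import struct
from typing import Dict, Iterator, Tuple

Record = Tuple[Dict[str, bytes], bytes]


def encode_record(header: Dict[str, bytes], data: bytes) -> bytes:
    """Serialize one record as <header_len><header><data_len><data>."""
    fields = b""
    for name, value in header.items():
        field = name.encode("ascii") + b"=" + value
        fields += struct.pack("<I", len(field)) + field
    return struct.pack("<I", len(fields)) + fields + struct.pack("<I", len(data)) + data


def read_records(buf: bytes) -> Iterator[Record]:
    """Walk a byte buffer and yield (header fields, data) for each record."""
    pos = 0
    while pos < len(buf):
        (header_len,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        header_end = pos + header_len
        header: Dict[str, bytes] = {}
        while pos < header_end:
            (field_len,) = struct.unpack_from("<I", buf, pos)
            pos += 4
            # Split on the first "=" only; the value may itself contain "=" bytes.
            name, _, value = buf[pos:pos + field_len].partition(b"=")
            header[name.decode("ascii")] = value
            pos += field_len
        (data_len,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        yield header, buf[pos:pos + data_len]
        pos += data_len


# Two synthetic records; the field values are illustrative only.
buf = encode_record({"op": b"\x02", "topic": b"/imu"}, b"abc")
buf += encode_record({"op": b"\x02"}, b"")
records = list(read_records(buf))
```

A distributed record reader does the same walk, but starting from the first record boundary inside its assigned split of the file, so that each Spark or MapReduce task can parse its part independently.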
  49. 49. Training & Evaluation + TensorFlow ROSRecordDataset + Protocol Buffers to serialize records + Save time because data conversion is not needed + Save storage because data duplication is not needed. Diagram labels: Ros bag (Compute, Network, Storage); ROS Dataset; Ros msg; Training Engine; Machine Learning.
  50. 50. Re-Simulation & Testing + Use Spark for preprocessing, transformation, cleansing, aggregation, time window selection before publishing to ROS topics + Use the Re-Simulation framework of choice to subscribe to the ROS topics. Diagram labels: Ros bag (Compute, Network, Storage); Engine; Ros topic; core; publish; subscribe; Re-Simulation with framework of choice.
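The time window selection step can be illustrated in plain Python. On a cluster this would be a filter and sort on a Spark RDD or DataFrame, and the result would be published to ROS topics. Both of those parts are omitted here. The `Msg` type, the topic names, and the timestamps are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Set


class Msg(NamedTuple):
    stamp: float    # recording time in seconds
    topic: str
    payload: bytes


def select_window(msgs: Iterable[Msg], start: float, end: float, topics: Set[str]) -> List[Msg]:
    """Keep messages with start <= stamp < end on the wanted topics, in time order."""
    kept = [m for m in msgs if start <= m.stamp < end and m.topic in topics]
    return sorted(kept, key=lambda m: m.stamp)


# Synthetic, out-of-order recording used only for this example.
recorded = [
    Msg(12.0, "/camera", b"c2"),
    Msg(10.5, "/imu", b"i1"),
    Msg(11.0, "/camera", b"c1"),
    Msg(15.0, "/imu", b"i2"),
    Msg(11.5, "/gps", b"g1"),
]
replay = select_window(recorded, start=10.0, end=13.0, topics={"/camera", "/imu"})
for m in replay:
    print(m.stamp, m.topic)
```

Sorting by timestamp matters because a re-simulation framework that subscribes to the topics expects messages in the order in which the sensors produced them.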
  51. 51. HOW TO START?
  52. 52. Machine Learning: + Classification, Regression, Clustering, Collaborative Filtering, Anomaly Detection + Supervised, Unsupervised, Reinforcement Learning, Deep Learning, CNN + Model Training, Evaluation, Testing, Simulation, Inference. Data Science: + Big Data Strategy, Consulting, Data Lab, Data Science as a Service + Data Collection, Cleaning, Analyzing, Modeling, Validation, Visualization + Business Case Validation, Prototyping, MVPs, Dashboards
  53. 53. Data Engineering and Data Operations: + Architecture, DevOps, Cloud Building + App. Management Hadoop Ecosystem + Managed Infrastructure Services + Compute, Network, Storage, Firewall, Load Balancer, DDoS Protection + Continuous Integration and Deployment + Data Pipelines (Acquisition, Ingestion, Analytics, Visualization) + Distributed Data Architectures + Data Processing Backend + Hadoop Ecosystem + Test Automation and Testing
  54. 54. Think Big: Business Strategy, Data Strategy, Technology Strategy, Agile Delivery Model. Start Small: Business Case Validation; Prototypes, MVPs; Data Exploration; Data Acquisition. Value Proposition.
