Automatically deploy a complete end-to-end machine learning pipeline on Kubernetes (K8s) using tools such as Spark, TensorFlow, and HDFS. It requires a running Kubernetes cluster in the cloud or on-premises.
9. Single machine | Data center
1 user | Many users
Megabytes of data | Petabytes of data
Local filesystem | Distributed filesystem
Exclusive use | Resource sharing, scheduling, queueing, resource isolation
Scale up | Scale out
pip install ... | Automating deployment
- | Operations, monitoring, ...
12. Development cycle for autonomous vehicles
1 Collect sensor data
2 Model Engineering
3 Autonomous Driving
[Diagram: Data Logger -> Big Data -> Data Center -> Trained Model -> Control Unit]
13. Sensors: Udacity Lincoln MKZ
Camera: 3x Blackfly GigE Camera, 20 Hz
Lidar: Velodyne HDL-32E, 9.5 Hz
IMU: Xsens, 400 Hz
GPS: 2x fixed, 1 Hz
CAN bus: 1.1 kHz
Software: Robot Operating System
Data: 3 GB per minute
https://github.com/udacity/self-driving-car
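As a rough worked example of what the 3 GB per minute figure implies for storage, the arithmetic can be sketched as below. The 8-hour recording day and the decimal conversion (1 TB = 1000 GB) are assumptions for illustration, not figures from the slides.

```python
GB_PER_MINUTE = 3  # recording rate quoted on the slide


def recorded_gb(minutes, gb_per_minute=GB_PER_MINUTE):
    """Return the recorded data volume in GB for a drive of the given length."""
    return minutes * gb_per_minute


# One hour of driving.
per_hour_gb = recorded_gb(60)

# Hypothetical 8-hour recording day for one vehicle, in decimal terabytes.
per_day_tb = recorded_gb(8 * 60) / 1000

print(f"{per_hour_gb} GB per hour, {per_day_tb:.2f} TB per 8-hour day")
```

Under these assumptions a single vehicle produces 180 GB per hour, which is why the data ends up in a data center rather than on a single machine.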
15. ROS bag data structure
https://github.com/valtech/ros_hadoop
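To make the record structure concrete, here is a minimal sketch of the record framing used by the ROS bag 2.0 format: a magic line, then records made of a length-prefixed header (itself a list of length-prefixed `name=value` fields) followed by length-prefixed data, with all lengths as little-endian 32-bit integers. The helper names `encode_record` and `read_records` and the synthetic bag bytes are hypothetical, and the sketch ignores chunks, compression, indexes, and the padding of the bag header record.

```python
import io
import struct

MAGIC = b"#ROSBAG V2.0\n"


def encode_record(header_fields, data):
    """Build one record: <header_len><header><data_len><data> (hypothetical helper)."""
    header = b""
    for name, value in header_fields.items():
        field = name.encode("ascii") + b"=" + value
        header += struct.pack("<I", len(field)) + field
    return struct.pack("<I", len(header)) + header + struct.pack("<I", len(data)) + data


def read_records(stream):
    """Yield (header_fields, data) for every record in a bag-like byte stream."""
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("not a ROS bag 2.0 stream")
    while True:
        raw_len = stream.read(4)
        if len(raw_len) < 4:  # end of stream
            return
        (header_len,) = struct.unpack("<I", raw_len)
        header = stream.read(header_len)
        fields = {}
        pos = 0
        while pos < header_len:
            (field_len,) = struct.unpack_from("<I", header, pos)
            pos += 4
            # Split on the first "=" only; the value is raw bytes.
            name, _, value = header[pos:pos + field_len].partition(b"=")
            fields[name.decode("ascii")] = value
            pos += field_len
        (data_len,) = struct.unpack("<I", stream.read(4))
        yield fields, stream.read(data_len)


# Synthetic one-record "bag" for illustration only (op 0x02 marks message data).
synthetic_bag = MAGIC + encode_record(
    {"op": b"\x02", "conn": struct.pack("<I", 0)}, b"hello"
)
records = list(read_records(io.BytesIO(synthetic_bag)))
print(records)
```

Because every record carries its own lengths, a reader can skip from record to record without decoding the payload, which is the property a splittable Hadoop record reader relies on.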
16. Robot Operating System
+ Popular open source robotics framework
+ Reliable distributed architecture
+ Wide use in the robotics research community
+ Huge selection of “off-the-shelf” software packages for hardware/algorithms/etc.
+ Used by Bosch, BMW, KUKA, Google, Siemens, etc.
https://roscon.ros.org/2015/presentations/ROSCon-Automated-Driving.pdf
21. Train and evaluate machine learning models at scale
Single machine -> Data center
How to run more experiments faster and in parallel?
How to share and reproduce research?
How to go from research to real products?
22. Distributed Machine Learning
[Figure: Data Size vs. Model Size, from single machine to data center]
+ Model parallelism: training very large models
+ Data parallelism: speeds up the training
+ Exploring several model architectures, hyperparameter optimization, training several independent models
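As an illustration of data parallelism, the following is a minimal single-process sketch, not the method used in the talk: each shard of the data would live on its own worker, every worker computes a gradient on its shard, and the averaged gradient updates one shared model. The toy model `y = w * x`, the shard layout, and the learning rate are assumptions chosen for the example.

```python
def gradient(w, shard):
    """Mean gradient of the squared error for y = w * x over one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, shards, lr):
    """One synchronous step: per-shard gradients are averaged, then applied once."""
    # In a real cluster each gradient() call would run on a different worker.
    grads = [gradient(w, shard) for shard in shards]
    return w - lr * sum(grads) / len(grads)


# Toy dataset generated from y = 3 * x, split into two equal-size shards.
data = [(x, 3.0 * x) for x in range(1, 9)]
shards = [data[0::2], data[1::2]]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, lr=0.01)

print(f"learned w = {w:.6f}")
```

With equal-size shards the averaged gradient equals the full-batch gradient, so the result matches single-machine training while the per-step work is divided across workers.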
23. Compute Workload for Training and Evaluation
[Figure: I/O workload vs. compute workload, from single machine to data center]
24. I/O Workload for Simulation and Testing
[Figure: I/O workload vs. compute workload, from single machine to data center]
25. Flux – Open Machine Learning Stack
[Diagram labels: Training & Test data; Compute + Network + Storage; Deploy model; ML Development & Catalog & REST API; ML-Heros; Feature Engineering, Training, Evaluation, Re-Simulation, Testing; CaffeOnSpark]
[Diagram labels: Sample, Model, Prediction, Batch, Regression, Cluster, Dataset, Correlation, Centroid, Anomaly, Test, Scores]
+ Mainly open source
+ No vendor lock-in
+ Scale-out architecture
+ Multi-user support
+ Resource management
+ Job scheduling
+ Speed up training
+ Speed up simulation
https://github.com/flux-project/flux
26. Feature Engineering
+ Hadoop InputFormat and Record Reader for Rosbag
+ Process Rosbag with Spark, Yarn, MapReduce, Hadoop Streaming API, …
+ Spark RDDs are cached and optimized for analysis
[Diagram: Rosbag -> Record Reader -> RDD of ROS messages in the Processing Engine (Computer, Network, Storage) -> Advanced Analytics with RDD, DataFrame, DataSet, SQL, Spark APIs, NumPy]
28. Training & Evaluation
+ TensorFlow ROSRecordDataset
+ Protocol Buffers to serialize records
+ Saves time because no data conversion is needed
+ Saves storage because no data duplication is needed
[Diagram: Rosbag -> ROS Dataset of ROS messages -> Training Engine for Machine Learning (Computer, Network, Storage)]
29. Re-Simulation & Testing
+ Use Spark for preprocessing, transformation, cleansing, aggregation, and time window selection before publishing to ROS topics
+ Use the Re-Simulation framework of choice to subscribe to the ROS topics
[Diagram: Rosbag -> Engine (Computer, Network, Storage) -> publish to ROS topics via the ROS core -> Re-Simulation with the framework of choice subscribes]
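The select-then-publish flow described on this slide can be sketched in plain Python as below. `TopicBus` is a hypothetical in-memory stand-in for ROS topics, not the ROS API, and the record layout (`t`, `topic`, `msg`) and topic names are assumptions made for the example. In the setup from the slides, the selection step would run in Spark and the subscriber would be the re-simulation framework.

```python
from collections import defaultdict


class TopicBus:
    """In-memory stand-in for ROS topics (hypothetical, not the ROS API)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(message)


def replay_window(records, bus, start, end):
    """Publish records with start <= t < end in timestamp order; return the count."""
    selected = sorted(
        (r for r in records if start <= r["t"] < end), key=lambda r: r["t"]
    )
    for record in selected:
        bus.publish(record["topic"], record["msg"])
    return len(selected)


# Illustrative records; timestamps are in seconds.
records = [
    {"t": 0.5, "topic": "/imu", "msg": "a"},
    {"t": 1.2, "topic": "/camera", "msg": "frame-1"},
    {"t": 1.0, "topic": "/imu", "msg": "b"},
    {"t": 2.5, "topic": "/imu", "msg": "c"},
]

bus = TopicBus()
received = []  # what a re-simulation subscriber on /imu would see
bus.subscribe("/imu", received.append)
published = replay_window(records, bus, start=1.0, end=2.0)
print(published, received)
```

Keeping selection separate from publishing means the subscriber only ever sees the chosen time window, regardless of which framework consumes the topics.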