SlideShare une entreprise Scribd logo
1  sur  86
Télécharger pour lire hors ligne
HIGH PERFORMANCE TENSORFLOW +
GPUS @ GPU TECH CONF, MAY 8, 2017
CHRIS FREGLY, RESEARCH ENGINEER
@ PIPELINE.IO
INTRODUCTIONS
INTRODUCTIONS: ME
§ Chris Fregly, Research Engineer @ in San Fran
§ Formerly Netflix and Databricks
§ Advanced Spark and TensorFlow Meetup (Global)
Please Join Our 7,000 Members!!
INTRODUCTIONS: YOU
§ Software Engineer or Data Scientist interested in optimizing
and deploying TensorFlow models to production
§ Assume you have a working knowledge of TensorFlow
NOTES ABOUT THIS MATERIAL
CONTENT BREAKDOWN
§ 50% Training Optimizations (TensorFlow, XLA, Tools)
§ 50% Deployment and Inference Optimizations (Serving)
§ Why Heavy Focus on Inference?
§ Training: boring batch, O(num_researchers)
§ Inference: exciting realtime, O(num_users_of_app)
§ We Use Simple Models to Highlight Optimizations
§
Warning: This is not introductory TensorFlow material!
100% OPEN SOURCE CODE
§ https://github.com/fluxcapacitor/pipeline/
§ Please Star this Repo! J
§ Slides, code, notebooks, Docker images available here:
https://github.com/fluxcapacitor/pipeline/
gpu.ml
HANDS-ON EXERCISES
§ Combo of Jupyter Notebooks and Command Line
§ Command Line through Jupyter Terminal
§ Some Exercises Based on Experimental Features
In Other Words, Some Code May Be Wonky!
YOU WILL LEARN…
§ TensorFlow Best Practices
§ To Inspect and Debug Models
§ To Distribute Training Across a Cluster
§ To Optimize Training with Queue Feeders
§ To Optimize Training with XLA JIT Compiler
§ To Optimize Inference with AOT and Graph Transform Tool (GTT)
§ Key Components of TensorFlow Serving
§ To Deploy Models with TensorFlow Serving
§ To Optimize Inference by Tuning TensorFlow Serving
AGENDA
§ Setup Environment
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model with XLA JIT Compiler
§ Optimize Model with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
EVERYBODY GETS A GPU!
SETUP ENVIRONMENT
§ Step 1: Browse to the following:
http://[REDACTED]
§ Step 2: Browse to the following:
http://<ip-address>
Need Help?
Use the Chat!
VERIFY SETUP
http://<ip-address>
Any username,
Any password
HOUSEKEEPING
§ Navigate to the following notebook:
00_Housekeeping
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/
notebooks/
PULSE CHECK
GPU HALF-PRECISION SUPPORT
§ New Pascal P100: FP16, INT8
§ Flexible FP32 GPU Cores
§ Process 2x FP16 on 2-element vectors simultaneously
§ Half-precision is OK for Approximate Deep Learning!
GPU CUDA PROGRAMMING
§ Barbaric At Best!
§ CUDA Developers Must Be Very Aware of Hardware Specs
§ As Such, There are Many Great Debuggers/Profilers
Warning: Nvidia Can Change Hardware Specs Anytime
CUDA STREAMS
§ Overlapping Kernel Execution and Data Transfer
LET’S SEE WHAT THIS THING CAN DO!
§ Navigate to the following notebook:
01_Explore_GPU
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/
notebooks/
PULSE CHECK
BREAK
§ https://github.com/fluxcapacitor/pipeline/
§ Slides, code, notebooks, Docker images available here:
https://github.com/fluxcapacitor/pipeline/
gpu.ml
Need Help?
Use the Chat!
AGENDA
§ Setup Environment
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model with XLA JIT Compiler
§ Optimize Model with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
TRAINING TERMINOLOGY
§ Tensors: N-Dimensional Arrays
§ ie. Scalar, Vector, Matrix
§ Operations: MatMul, Add, SummaryLog,…
§ Graph: Graph of Operations (DAG)
§ Session: ContainsGraph(s)
§ Feeds: Feed inputs into Operation
§ Fetches: Fetch output from Operation
§ Variables: What we learn through training
§ aka “weights”, “parameters”
§ Devices: Hardware device on which we train
-TensorFlow-
Trains
Variables
-User-
Fetches
Outputs
-User-
Feeds
Inputs
-TensorFlow-
Performs
Operations
-TensorFlow-
Flows
Tensors
with tf.device(“worker:0/device/gpu:0,worker:1/device/gpu:0”)
TRAINING DEVICES
§ cpu:0
§ By default, all CPUs
§ Requires extra config to target a CPU
§ gpu:0..n
§ Each GPU has a unique id
§ TF usually prefers a single GPU
§ xla_cpu:0, xla_gpu:0..n
§ “JIT Compiler Device”
§ Hints TF to attempt JIT Compile
with tf.device(“/cpu:0”):
with tf.device(“/gpu:0”):
with tf.device(“/gpu:1”):
TRAINING METRICS: TENSORBOARD
§ Summary Ops
§ Event Files
/root/tensorboard/linear/<version>/events…
§ Tags
§ Organize data within Tensorboard UI
loss_summary_op = tf.summary.scalar('loss',
loss)
merge_all_summary_op = tf.summary.merge_all()
summary_writer = tf.summary.FileWriter(
'/root/tensorboard/linear/<version>',
graph=sess.graph)
TRAINING ON EXISTING INFRASTRUCTURE
§ Data Processing
§ HDFS/Hadoop
§ Spark
§ Containers
§ Docker
§ Schedulers
§ Kubernetes
§ Mesos
<dependency>
<groupId>org.tensorflow</groupId>
<artifactId>tensorflow-hadoop</artifactId>
<version>1.0-SNAPSHOT</version>
</dependency>
https://github.com/tensorflow/ecosystem
FEED TRAINING DATA WITH QUEUES
§ Don’t Use feed_dict for Production Workloads!!
§ feed_dict Requires C++ <-> Python Serialization
§ Batch Retrieval is Single-threaded,Synchronous, SLOW!
§ Next batch can’t be retrieved until current batch is processed
§ CPUs and GPUs are Not Fully Utilized
§ Use Queues to Read and Pre-Process Data for GPUs!
§ Queues use CPUs for I/O, pre-processing, shuffling, …
§ GPUs stay saturated and focused on cool GPU stuff!!
DATA MOVEMENT WITH QUEUES
§ Queue Pulls Batch from Source (ie HDFS, Kafka)
§ Queue pre-processes and shuffles data
§ Queue should use only CPUs - no GPUs
§ Combine many small files into a few large TFRecord files
§ GPU Pulls Batch from Queue (CUDA Streams)
§ GPU pulls next batch while processing current batch
§ GPU Processes Next Batch Immediately
§ GPUs are fully utilized!
QUEUE CAPACITY PLANNING
§ batch_size
§ # of examples per batch (ie. 64 jpg)
§ Limited by GPU RAM
§ num_processing_threads
§ CPU threads pull and pre-process batches of data
§ Limited by CPU Cores
§ queue_capacity
§ Limited by CPU RAM (ie. 5 * batch_size)
Saturate those GPUs!
GPU Pulls Batches while
Processing Current Batch
AsyncMemory Transfer
with CUDA Streams
-- Thanks,Nvidia!! --
DETECT UNDERUTILIZED CPUS, GPUS
§ Instrument training code to generate “timelines”
§ Analyze with Google Web
Tracing Framework (WTF)
§ Monitor CPU with `top`, GPU with `nvidia-smi`
http://google.github.io/tracing-framework/
from tensorflow.python.client import timeline
trace =
timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as trace_file:
trace_file.write(
trace.generate_chrome_trace_format(show_memory=True))
LET’S FEED A QUEUE FROM HDFS
§ Navigate to the following notebook:
02_Feed_Queue_HDFS
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/
notebooks/
PULSE CHECK
BREAK
§ https://github.com/fluxcapacitor/pipeline/
§ Slides, code, notebooks, Docker images available here:
https://github.com/fluxcapacitor/pipeline/
gpu.ml
Need Help?
Use the Chat!
TENSORFLOW MODEL
§ GraphDef
§ Architecture of your model (nodes, edges)
§ Metadata
§ Asset: Accompanying assets to your model
§ SignatureDef: Maps external to internal graph nodes for TF Serving
§ MetaGraph
§ Combines GraphDef and Metadata
§ Variables
§ Stored separately (checkpointed) during training
§ Allows training to continue from any checkpoint
§ Values are “frozen” into Constants when deployed for inference
Graph Definition
x
W
mul add
b
MetaGraph
Metadata
Assets
SignatureDef
Tags
Version
TENSORFLOW SESSION
Session
graph: GraphDef
variables:
“W” : 0.328
“b” : -1.407
Variablesare
Periodically
Checkpointed
GraphDef is
Saved as-is
LET’S TRAIN A MODEL (CPU)
§ Navigate to the following notebook:
03_Train_Model_CPU
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/
notebooks/
LET’S TRAIN A MODEL (GPU)
§ Navigate to the following notebook:
04_Train_Model_GPU
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/
notebooks/
TENSORFLOW DEBUGGER
§ Step through Operations
§ Inspect Inputs and Outputs
§ Wrap Session in Debug Session
sess = tf.Session(config=config)
sess =
tf_debug.LocalCLIDebugWrapperSession(sess)
LET’S DEBUG A MODEL
§ Navigate to the following notebook:
05_Debug_Model
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/
notebooks/
PULSE CHECK
BREAK
§ https://github.com/fluxcapacitor/pipeline/
§ Slides, code, notebooks, Docker images available here:
https://github.com/fluxcapacitor/pipeline/
gpu.ml
Need Help?
Use the Chat!
AGENDA
§ Setup Environment
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model with XLA JIT Compiler
§ Optimize Model with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
MULTI-GPU TRAINING (SINGLE NODE)
§ Variables stored on CPU (cpu:0)
§ Model graph (aka “replica”, “tower”)
is copied to each GPU(gpu:0, gpu:1, …)
Multi-GPU Training Steps:
1. CPU transfersmodel to each GPU
2. CPU waitson all GPUs to finish batch
3. CPU copiesall gradientsback from all GPUs
4. CPU synchronizesand averagesall gradientsfrom GPUs
5. CPU updatesGPUs with new variables/weights
6. Repeat Step 1 until reaching stop condition (ie. max_epochs)
DISTRIBUTED, MULTI-NODE TRAINING
§ TensorFlow Automatically Inserts Send and Receive Ops into Graph
§ Parameter Server Synchronously AggregatesUpdates to Variables
§ Nodes with Multiple GPUs will Pre-Aggregate Before Sending to PS
Worker0 Worker0
Worker1
Worker0 Worker1 Worker2
gpu0 gpu1
gpu2 gpu3
gpu0 gpu1
gpu2 gpu3
gpu0 gpu1
gpu2 gpu3
gpu0
gpu1
gpu0
gpu0
SYNCHRONOUS VS. ASYNCHRONOUS
§ Synchronous
§ Worker (“graphreplica”, “tower”)
§ Reads samevariables from Parameter Server in parallel
§ Computes gradients for variables using partition of data
§ Sends gradients to central Parameter Server
§ Parameter Server
§ Aggregates (avg) gradients for each variable based on its portion of data
§ Applies gradients (+, -) to each variables
§ Broadcasts updated variables to each node in parallel
§ ^^ Repeat ^^
§ Asynchronous
§ Each node computes gradients independently
§ Reads stale values,does not synchronized with other nodes
DATA PARALLEL VS MODEL PARALLEL
§ Data Parallel
§ Send exact same model to each worker/device
§ Each worker/device operates on their partition of data, same model
§ Similar to how Spark sends the same function to differentworkers
§ Each Spark worker operates on their partition of the data
§ Model Parallel
§ Send different partition of model to each worker/device
§ Each worker/device operates on all data, their partition of model
Very Difficult, But Necessary for Very Large Models (RAM)
LET’S TRAIN A DISTRIBUTED MODEL
§ Navigate to the following notebook:
06_Train_Distributed_Model
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/
notebooks/
AGENDA
§ Setup Environment
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model with XLA JIT Compiler
§ Optimize Model with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
XLA FRAMEWORK
§ Accelerated Linear Algebra (XLA)
§ Experimental!
§ Goals:
§ Improve execution speed
§ Improve memory usage
§ Reduce reliance on custom operators
§ Reduce mobile footprint
§ Improve portability
XLA HIGH LEVEL OPTIMIZER (HLO)
§ Define Graphs using HLO Language
§ Compiler IntermediateRepresentation (IR)
§ Independentof source and target language
§ XLA Emits Target-Independent,OptimizedHLO
§ Backend Emits Target-Dependent,OptimizedLLVM
§ LLVM Emits Native Code for Target
§ x86-64 and ARM64 CPU Instruction Sets
§ Nvidia GPU NVPTX
JIT COMPILER
§ Just-In-Time Compiler
§ Built on XLA Framework
§ Goals:
§ Reduce memory movement – especiallyuseful on GPUs
§ Reduce overhead of multiple function calls
§ Similar to Spark Operator Fusing in Spark 2.0
§ Unroll Loops, Fuse Operators, Fold Constants, …
§ Enable session-wide or `with jit_scope():`
VISUALIZING JIT COMPILER IN ACTION
Before After
Google Web Tracing Framework:
http://google.github.io/tracing-framework/
from tensorflow.python.client import timeline
trace =
timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as trace_file:
trace_file.write(
trace.generate_chrome_trace_format(show_memory=True))
VISUALIZING FUSING OPERATORS
pip install graphviz
dot -Tpng 
/tmp/hlo_graph_99.w5LcGs.dot 
-o hlo_graph_80.png
GraphViz:
http://www.graphviz.org
hlo_*.dot files generated by XLA
LET’S TRAIN A MODEL (XLA + JIT)
§ Navigate to the following notebook:
07_Train_Model_XLA_JIT
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/
notebooks/
IT’S WORTH HIGHLIGHTING…
§ From now on, we will optimize models for inference only
§ In other words,
No More Training Optimizations!
PULSE CHECK
BREAK
§ https://github.com/fluxcapacitor/pipeline/
§ Slides, code, notebooks, Docker images available here:
https://github.com/fluxcapacitor/pipeline/
gpu.ml
Need Help?
Use the Chat!
AGENDA
§ Setup Environment
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model with XLA JIT Compiler
§ Optimize Model with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
AOT COMPILER
§ Standalone, Ahead-Of-Time (AOT) Compiler
§ `tfcompile`
§ Built on XLA framework
§ Creates executable with minimal TensorFlow Runtime
§ Creates functions with feeds (input) and fetches (output)
§ Includes only dependencies neededby subgraph computation
§ Packaged as `cc_libary` with header and object file
§ Commonly used on inference graph for mobile devices
§ CPU x86-64 and ARM only – currently, no GPU support
GRAPH TRANSFORM TOOL (GTT)
§ Optimize Trained Models for Inference
§ Examples
§ Remove training-only Ops (checkpoints, drop out)
§ Strip out unreachable nodes
§ Remove debug operations (ie. LogSummary)
§ Fold batch normalization into pre-calculated weights
§ Fuse adjacent operators
§ Quantization
WEIGHT QUANTIZATION FOR INFERENCE
§ FP16 and INT8 have much lower precision than FP32
§ Weights are known at inference time
§ Linear quantization
ACTIVATION QUANTIZATION
§ Activations are not known without forward propagation
§ Depends on input during inference
§ Requires “calibration” from representative dataset
VISUALIZING QUANTIZATION
Create
Conversion
Subgraph
Produces
QuantizedMatMul,
QuantizedRelu
EliminateAdjacent
Dequantize +
Quantize
FUSED BATCH NORMALIZATION
§ What is Batch Normalization?
§ Each batch of data may have wildly different distributions
§ Normalize per batch (and layer)
§ Speeds up training dramatically
§ Weights are learned quicker
§ Final model is more accurate
Always Use Batch Normalization!
§ GTT Fuses Final Mean and Variance MatMul into Graph
z = tf.matmul(a_prev, W)
a = tf.nn.relu(z)
a_mean, a_var = tf.nn.moments(a, [0])
scale = tf.Variable(tf.ones([depth/channels]))
beta = tf.Variable(tf.zeros ([depth/channels]))
bn = tf.nn.batch_normalizaton(a, a_mean, a_var,
beta, scale, 0.001)
LET’S OPTIMIZE MODEL WITH GTT
§ Navigate to the following notebook:
08_Optimize_Model_CPU
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/
notebooks/
LET’S OPTIMIZE MODEL WITH GTT
§ Navigate to the following notebook:
09_Optimize_Model_GPU
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/
notebooks/
PULSE CHECK
BREAK
§ https://github.com/fluxcapacitor/pipeline/
§ Slides, code, notebooks, Docker images available here:
https://github.com/fluxcapacitor/pipeline/
gpu.ml
Need Help?
Use the Chat!
AGENDA
§ Setup Environment
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model with XLA JIT Compiler
§ Optimize Model with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
MODEL SERVING TERMINOLOGY
§ Inference
§ Only Forward Propagation through Network
§ Predict, Classify, Regress, …
§ Bundle
§ GraphDef, Variables, Metadata, …
§ Assets
§ ie. Map of ClassificationID -> String
§ {9283: “penguin”, 9284: “bridge”, …}
§ Version
§ Every Model Has a Version Number (Integers Only?!)
§ Version Policy
§ ie. Serve Only Latest (Highest), Serve both Latest and Previous, …
TENSORFLOW SERVING FEATURES
§ Low-latency or High-throughput Tuning
§ Supports Auto-Scaling
§ DifferentModels/Versions Served in Same Process
§ Custom Loaders beyond File-based
§ Custom Serving Models beyond HashMap and TensorFlow
§ Custom Version Policies for A/B and Bandit Tests
§ Drain Requests for Graceful Model Shutdown or Update
§ Extensible Request Batching Strategies for Diff Use Cases and HW
§ Uses Highly-Efficient GRPC and Protocol Buffers
PREDICTION SERVICE
§ Predict (Original, Generic)
§ Input: List of Tensors
§ Output: List of Tensors
§ Classify
§ Input: List of `tf.Example` (key, value) pairs
§ Output: List of (class_label: String, score: float)
§ Regress
§ Input: List of `tf.Example` (key, value) pairs
§ Output: List of (label: String, score: float)
PREDICTION INPUTS + OUTPUTS
§ SignatureDef
§ Defines inputs and outputs
§ Maps external (logical) to internal (physical) tensor names
§ Allows internal (physical) tensor names to change
tensor_info_x_observed = utils.build_tensor_info(x_observed)
tensor_info_y_pred = utils.build_tensor_info(y_pred)
prediction_signature =
signature_def_utils.build_signature_def(
inputs = {'x_observed': tensor_info_x_observed},
outputs = {'y_pred': tensor_info_y_pred},
method_name = signature_constants.PREDICT_METHOD_NAME
)
MULTI-HEADED INFERENCE
§ Inputs Propagated Forward Only Once
§ Multiple “Heads” of Model
§ Optimizes Bandwidth, CPU, Latency, Memory, Coolness
BUILD YOUR OWN MODEL SERVER (?!)
§ Overcome HTTP <-> GRPC limitation
§ Control Latency Requirements
§ Perform Batch Inference vs. Request/Response
§ Asynchronous Request Handling
§ Custom Request Batching
§ Wrap Circuit Breakers, Fallbacks
§ Mobile Inference Use Cases
§ Reduce Number of Moving Parts
#include
“tensorflow_serving/model_servers/server_core.h”
…
class MyTensorFlowModelServer {
ServerCore::Options options;
// set options (model name, path, etc)
std::unique_ptr<ServerCore> core;
TF_CHECK_OK(
ServerCore::Create(std::move(options), &core)
);
…
}
Compile and Link
libtensorflow.so
LET’S DEPLOY AND SERVE A MODEL
§ Navigate to the following notebook:
10_Deploy_Model_Server
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/
notebooks/
PULSE CHECK
BREAK
§ https://github.com/fluxcapacitor/pipeline/
§ Slides, code, notebooks, Docker images available here:
https://github.com/fluxcapacitor/pipeline/
gpu.ml
Need Help?
Use the Chat!
AGENDA
§ Setup Environment
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model with XLA JIT Compiler
§ Optimize Model with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
REQUEST BATCH TUNING
§ max_batch_size
§ Enables throughput/latency tradeoff
§ Bound by RAM
§ batch_timeout_micros
§ Defines batch time window, latency upper-bound
§ Bound by RAM
§ num_batch_threads
§ Defines parallelism
§ Bound by CPU cores
§ max_enqueued_batches
§ Defines queue upper bound, throttling
§ Bound by RAM
Reaching either threshold
will trigger a batch
BATCH SCHEDULER STRATEGIES
§ BasicBatchScheduler
§ Best for homogeneous request types (ie. always classify or always regress)
§ Async callback when `max_batch_size` or `batch_timeout_micros` is reached
§ `BatchTask` encapsulates unit of work to be batched
§ SharedBatchScheduler
§ Best for heterogeneous request types, multi-step inference, ensembles, …
§ Groups BatchTasks into separate queues to form homogenous batches
§ Processes batches fairly through interleaving
§ StreamingBatchScheduler
§ Mixed CPU/GPU/IO-bound workloads
§ Provides fine-grained control for complex, multi-phase inference logic
Must Experiment to Find Best Strategy for You!!
LET’S OPTIMIZE THE MODEL SERVER
§ Navigate to the following notebook:
11_Tune_Model_Server
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/
notebooks/
PULSE CHECK
AGENDA
§ Setup Environment
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model with XLA JIT Compiler
§ Optimize Model with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
YOU JUST LEARNED…
§ TensorFlow Best Practices
§ To Inspect and Debug Models
§ To Distribute Training Across a Cluster
§ To Optimize Training with Queue Feeders
§ To Optimize Training with XLA JIT Compiler
§ To Optimize Inference with AOT and Graph Transform Tool (GTT)
§ Key Components of TensorFlow Serving
§ To Deploy Models with TensorFlow Serving
§ To Optimize Inference by Tuning TensorFlow Serving
Q&A
§ Thank you!!
§ https://github.com/fluxcapacitor/pipeline/
§ Slides, code, notebooks, Docker images available here:
https://github.com/fluxcapacitor/pipeline/
gpu.ml
Contact Me?
Email: chris@pipeline.io
Twitter: @cfregly

Contenu connexe

Tendances

PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...Chris Fregly
 
High Performance TensorFlow in Production - Big Data Spain - Madrid - Nov 15 ...
High Performance TensorFlow in Production - Big Data Spain - Madrid - Nov 15 ...High Performance TensorFlow in Production - Big Data Spain - Madrid - Nov 15 ...
High Performance TensorFlow in Production - Big Data Spain - Madrid - Nov 15 ...Chris Fregly
 
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...Chris Fregly
 
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUsOptimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUsChris Fregly
 
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...Chris Fregly
 
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...Chris Fregly
 
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Chris Fregly
 
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...Chris Fregly
 
High performance network programming on the jvm oscon 2012
High performance network programming on the jvm   oscon 2012 High performance network programming on the jvm   oscon 2012
High performance network programming on the jvm oscon 2012 Erik Onnen
 
Introduction to Polyaxon
Introduction to PolyaxonIntroduction to Polyaxon
Introduction to PolyaxonYu Ishikawa
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with SparkRoger Rafanell Mas
 
Kernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at NetflixKernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at NetflixBrendan Gregg
 
London devops logging
London devops loggingLondon devops logging
London devops loggingTomas Doran
 
Mасштабирование микросервисов на Go, Matt Heath (Hailo)
Mасштабирование микросервисов на Go, Matt Heath (Hailo)Mасштабирование микросервисов на Go, Matt Heath (Hailo)
Mасштабирование микросервисов на Go, Matt Heath (Hailo)Ontico
 
Spark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing sparkSpark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing sparkAnu Shetty
 
Adding replication protocol support for psycopg2
Adding replication protocol support for psycopg2Adding replication protocol support for psycopg2
Adding replication protocol support for psycopg2Alexander Shulgin
 
Solving channel coding simulation and optimization problems using GPU
Solving channel coding simulation and optimization problems using GPUSolving channel coding simulation and optimization problems using GPU
Solving channel coding simulation and optimization problems using GPUUsatyuk Vasiliy
 
Introduction to LAVA Workload Scheduler
Introduction to LAVA Workload SchedulerIntroduction to LAVA Workload Scheduler
Introduction to LAVA Workload SchedulerNopparat Nopkuat
 
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013Amazon Web Services
 

Tendances (20)

PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
 
High Performance TensorFlow in Production - Big Data Spain - Madrid - Nov 15 ...
High Performance TensorFlow in Production - Big Data Spain - Madrid - Nov 15 ...High Performance TensorFlow in Production - Big Data Spain - Madrid - Nov 15 ...
High Performance TensorFlow in Production - Big Data Spain - Madrid - Nov 15 ...
 
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
 
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUsOptimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
 
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...
 
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
 
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
 
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
 
High performance network programming on the jvm oscon 2012
High performance network programming on the jvm   oscon 2012 High performance network programming on the jvm   oscon 2012
High performance network programming on the jvm oscon 2012
 
Introduction to Polyaxon
Introduction to PolyaxonIntroduction to Polyaxon
Introduction to Polyaxon
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
 
Kernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at NetflixKernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at Netflix
 
London devops logging
London devops loggingLondon devops logging
London devops logging
 
Ndp Slides
Ndp SlidesNdp Slides
Ndp Slides
 
Mасштабирование микросервисов на Go, Matt Heath (Hailo)
Mасштабирование микросервисов на Go, Matt Heath (Hailo)Mасштабирование микросервисов на Go, Matt Heath (Hailo)
Mасштабирование микросервисов на Go, Matt Heath (Hailo)
 
Spark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing sparkSpark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing spark
 
Adding replication protocol support for psycopg2
Adding replication protocol support for psycopg2Adding replication protocol support for psycopg2
Adding replication protocol support for psycopg2
 
Solving channel coding simulation and optimization problems using GPU
Solving channel coding simulation and optimization problems using GPUSolving channel coding simulation and optimization problems using GPU
Solving channel coding simulation and optimization problems using GPU
 
Introduction to LAVA Workload Scheduler
Introduction to LAVA Workload SchedulerIntroduction to LAVA Workload Scheduler
Introduction to LAVA Workload Scheduler
 
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
 

Similaire à High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conference - May 08 2017

Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...DataWorks Summit
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practicesLior Sidi
 
Apache Submarine: Unified Machine Learning Platform
Apache Submarine: Unified Machine Learning PlatformApache Submarine: Unified Machine Learning Platform
Apache Submarine: Unified Machine Learning PlatformWangda Tan
 
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUsHow to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUsAltoros
 
Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018Holden Karau
 
饿了么 TensorFlow 深度学习平台:elearn
饿了么 TensorFlow 深度学习平台:elearn饿了么 TensorFlow 深度学习平台:elearn
饿了么 TensorFlow 深度学习平台:elearnJiang Jun
 
Profiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsProfiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsJack (Jaegeun) Han
 
Optimizing, Profiling, and Deploying High Performance Spark ML and TensorFlow AI
Optimizing, Profiling, and Deploying High Performance Spark ML and TensorFlow AIOptimizing, Profiling, and Deploying High Performance Spark ML and TensorFlow AI
Optimizing, Profiling, and Deploying High Performance Spark ML and TensorFlow AIData Con LA
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDatabricks
 
Using Deep Learning Toolkits with Kubernetes clusters
Using Deep Learning Toolkits with Kubernetes clustersUsing Deep Learning Toolkits with Kubernetes clusters
Using Deep Learning Toolkits with Kubernetes clustersJoy Qiao
 
Odsc workshop - Distributed Tensorflow on Hops
Odsc workshop - Distributed Tensorflow on HopsOdsc workshop - Distributed Tensorflow on Hops
Odsc workshop - Distributed Tensorflow on HopsJim Dowling
 
Deep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce SpitlerDeep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce SpitlerDatabricks
 
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDistributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDatabricks
 
TensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.jsTensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.jsStijn Decubber
 
High Performance Distributed TensorFlow with GPUs and Kubernetes
High Performance Distributed TensorFlow with GPUs and KubernetesHigh Performance Distributed TensorFlow with GPUs and Kubernetes
High Performance Distributed TensorFlow with GPUs and Kubernetesinside-BigData.com
 
Continuous Go Profiling & Observability
Continuous Go Profiling & ObservabilityContinuous Go Profiling & Observability
Continuous Go Profiling & ObservabilityScyllaDB
 
«Training Deep Learning Models on Multi-GPUs Systems», Dmitry Spodarets.
«Training Deep Learning Models on Multi-GPUs Systems», Dmitry Spodarets.«Training Deep Learning Models on Multi-GPUs Systems», Dmitry Spodarets.
«Training Deep Learning Models on Multi-GPUs Systems», Dmitry Spodarets.Provectus
 
Scaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa ClaraScaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa ClaraJim Dowling
 

Similaire à High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conference - May 08 2017 (20)

Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practices
 
Apache Submarine: Unified Machine Learning Platform
Apache Submarine: Unified Machine Learning PlatformApache Submarine: Unified Machine Learning Platform
Apache Submarine: Unified Machine Learning Platform
 
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUsHow to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
 
Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018
 
饿了么 TensorFlow 深度学习平台:elearn
饿了么 TensorFlow 深度学习平台:elearn饿了么 TensorFlow 深度学习平台:elearn
饿了么 TensorFlow 深度学习平台:elearn
 
Profiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsProfiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systems
 
Optimizing, Profiling, and Deploying High Performance Spark ML and TensorFlow AI
Optimizing, Profiling, and Deploying High Performance Spark ML and TensorFlow AIOptimizing, Profiling, and Deploying High Performance Spark ML and TensorFlow AI
Optimizing, Profiling, and Deploying High Performance Spark ML and TensorFlow AI
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
 
Using Deep Learning Toolkits with Kubernetes clusters
Using Deep Learning Toolkits with Kubernetes clustersUsing Deep Learning Toolkits with Kubernetes clusters
Using Deep Learning Toolkits with Kubernetes clusters
 
Odsc workshop - Distributed Tensorflow on Hops
Odsc workshop - Distributed Tensorflow on HopsOdsc workshop - Distributed Tensorflow on Hops
Odsc workshop - Distributed Tensorflow on Hops
 
Deep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce SpitlerDeep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce Spitler
 
Getting started with AMD GPUs
Getting started with AMD GPUsGetting started with AMD GPUs
Getting started with AMD GPUs
 
Clustering tensor flow con kubernetes y raspberry pi
Clustering tensor flow con kubernetes y raspberry piClustering tensor flow con kubernetes y raspberry pi
Clustering tensor flow con kubernetes y raspberry pi
 
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDistributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
 
TensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.jsTensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.js
 
High Performance Distributed TensorFlow with GPUs and Kubernetes
High Performance Distributed TensorFlow with GPUs and KubernetesHigh Performance Distributed TensorFlow with GPUs and Kubernetes
High Performance Distributed TensorFlow with GPUs and Kubernetes
 
Continuous Go Profiling & Observability
Continuous Go Profiling & ObservabilityContinuous Go Profiling & Observability
Continuous Go Profiling & Observability
 
«Training Deep Learning Models on Multi-GPUs Systems», Dmitry Spodarets.
«Training Deep Learning Models on Multi-GPUs Systems», Dmitry Spodarets.«Training Deep Learning Models on Multi-GPUs Systems», Dmitry Spodarets.
«Training Deep Learning Models on Multi-GPUs Systems», Dmitry Spodarets.
 
Scaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa ClaraScaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa Clara
 

Plus de Chris Fregly

AWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and DataAWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and DataChris Fregly
 
Pandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdfPandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdfChris Fregly
 
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS MeetupRay AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS MeetupChris Fregly
 
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedSmokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedChris Fregly
 
Amazon reInvent 2020 Recap: AI and Machine Learning
Amazon reInvent 2020 Recap:  AI and Machine LearningAmazon reInvent 2020 Recap:  AI and Machine Learning
Amazon reInvent 2020 Recap: AI and Machine LearningChris Fregly
 
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...Chris Fregly
 
Quantum Computing with Amazon Braket
Quantum Computing with Amazon BraketQuantum Computing with Amazon Braket
Quantum Computing with Amazon BraketChris Fregly
 
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-PersonChris Fregly
 
AWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapAWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapChris Fregly
 
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...Chris Fregly
 
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Chris Fregly
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Chris Fregly
 
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Chris Fregly
 
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...Chris Fregly
 
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...Chris Fregly
 
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Chris Fregly
 

Plus de Chris Fregly (16)

AWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and DataAWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and Data
 
Pandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdfPandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdf
 
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS MeetupRay AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
 
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedSmokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
 
Amazon reInvent 2020 Recap: AI and Machine Learning
Amazon reInvent 2020 Recap:  AI and Machine LearningAmazon reInvent 2020 Recap:  AI and Machine Learning
Amazon reInvent 2020 Recap: AI and Machine Learning
 
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
 
Quantum Computing with Amazon Braket
Quantum Computing with Amazon BraketQuantum Computing with Amazon Braket
Quantum Computing with Amazon Braket
 
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
 
AWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapAWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:Cap
 
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
 
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
 
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
 
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
 
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
 
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
 

Dernier

TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceanilsa9823
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 

Dernier (20)

TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 

High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conference - May 08 2017

  • 1. HIGH PERFORMANCE TENSORFLOW + GPUS @ GPU TECH CONF, MAY 8, 2017 CHRIS FREGLY, RESEARCH ENGINEER @ PIPELINE.IO
  • 3. INTRODUCTIONS: ME § Chris Fregly, Research Engineer @ in San Fran § Formerly Netflix and Databricks § Advanced Spark and TensorFlow Meetup (Global) Please Join Our 7,000 Members!!
  • 4. INTRODUCTIONS: YOU § Software Engineer or Data Scientist interested in optimizing and deploying TensorFlow models to production § Assume you have a working knowledge of TensorFlow
  • 5. NOTES ABOUT THIS MATERIAL
  • 6. CONTENT BREAKDOWN § 50% Training Optimizations (TensorFlow, XLA, Tools) § 50% Deployment and Inference Optimizations (Serving) § Why Heavy Focus on Inference? § Training: boring batch, O(num_researchers) § Inference: exciting realtime, O(num_users_of_app) § We Use Simple Models to Highlight Optimizations § Warning: This is not introductory TensorFlow material!
  • 7. 100% OPEN SOURCE CODE § https://github.com/fluxcapacitor/pipeline/ § Please Star this Repo! J § Slides, code, notebooks, Docker images available here: https://github.com/fluxcapacitor/pipeline/ gpu.ml
  • 8. HANDS-ON EXERCISES § Combo of Jupyter Notebooks and Command Line § Command Line through Jupyter Terminal § Some Exercises Based on Experimental Features In Other Words, Some Code May Be Wonky!
  • 9. YOU WILL LEARN… § TensorFlow Best Practices § To Inspect and Debug Models § To Distribute Training Across a Cluster § To Optimize Training with Queue Feeders § To Optimize Training with XLA JIT Compiler § To Optimize Inference with AOT and Graph Transform Tool (GTT) § Key Components of TensorFlow Serving § To Deploy Models with TensorFlow Serving § To Optimize Inference by Tuning TensorFlow Serving
  • 10. AGENDA § Setup Environment § Train and Debug TensorFlow Model § Train with Distributed TensorFlow Cluster § Optimize Model with XLA JIT Compiler § Optimize Model with XLA AOT and Graph Transforms § Deploy Model to TensorFlow Serving Runtime § Optimize TensorFlow Serving Runtime § Wrap-up and Q&A
  • 12. SETUP ENVIRONMENT § Step 1: Browse to the following: http://[REDACTED] § Step 2: Browse to the following: http://<ip-address> Need Help? Use the Chat!
  • 14. HOUSEKEEPING § Navigate to the following notebook: 00_Housekeeping § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
  • 16. GPU HALF-PRECISION SUPPORT § New Pascal P100: FP16, INT8 § Flexible FP32 GPU Cores § Process 2x FP16 on 2-element vectors simultaneously § Half-precision is OK for Approximate Deep Learning!
  • 17. GPU CUDA PROGRAMMING § Barbaric At Best! § CUDA Developers Must Be Very Aware of Hardware Specs § As Such, There are Many Great Debuggers/Profilers Warning: Nvidia Can Change Hardware Specs Anytime
  • 18. CUDA STREAMS § Overlapping Kernel Execution and Data Transfer
  • 19. LET’S SEE WHAT THIS THING CAN DO! § Navigate to the following notebook: 01_Explore_GPU § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
  • 21. BREAK § https://github.com/fluxcapacitor/pipeline/ § Slides, code, notebooks, Docker images available here: https://github.com/fluxcapacitor/pipeline/ gpu.ml Need Help? Use the Chat!
  • 22. AGENDA § Setup Environment § Train and Debug TensorFlow Model § Train with Distributed TensorFlow Cluster § Optimize Model with XLA JIT Compiler § Optimize Model with XLA AOT and Graph Transforms § Deploy Model to TensorFlow Serving Runtime § Optimize TensorFlow Serving Runtime § Wrap-up and Q&A
  • 23. TRAINING TERMINOLOGY § Tensors: N-Dimensional Arrays § ie. Scalar, Vector, Matrix § Operations: MatMul, Add, SummaryLog,… § Graph: Graph of Operations (DAG) § Session: ContainsGraph(s) § Feeds: Feed inputs into Operation § Fetches: Fetch output from Operation § Variables: What we learn through training § aka “weights”, “parameters” § Devices: Hardware device on which we train -TensorFlow- Trains Variables -User- Fetches Outputs -User- Feeds Inputs -TensorFlow- Performs Operations -TensorFlow- Flows Tensors with tf.device(“worker:0/device/gpu:0,worker:1/device/gpu:0”)
  • 24. TRAINING DEVICES § cpu:0 § By default, all CPUs § Requires extra config to target a CPU § gpu:0..n § Each GPU has a unique id § TF usually prefers a single GPU § xla_cpu:0, xla_gpu:0..n § “JIT Compiler Device” § Hints TF to attempt JIT Compile with tf.device(“/cpu:0”): with tf.device(“/gpu:0”): with tf.device(“/gpu:1”):
  • 25. TRAINING METRICS: TENSORBOARD § Summary Ops § Event Files /root/tensorboard/linear/<version>/events… § Tags § Organize data within Tensorboard UI loss_summary_op = tf.summary.scalar('loss', loss) merge_all_summary_op = tf.summary.merge_all() summary_writer = tf.summary.FileWriter( '/root/tensorboard/linear/<version>', graph=sess.graph)
  • 26. TRAINING ON EXISTING INFRASTRUCTURE § Data Processing § HDFS/Hadoop § Spark § Containers § Docker § Schedulers § Kubernetes § Mesos <dependency> <groupId>org.tensorflow</groupId> <artifactId>tensorflow-hadoop</artifactId> <version>1.0-SNAPSHOT</version> </dependency> https://github.com/tensorflow/ecosystem
  • 27. FEED TRAINING DATA WITH QUEUES § Don’t Use feed_dict for Production Workloads!! § feed_dict Requires C++ <-> Python Serialization § Batch Retrieval is Single-threaded,Synchronous, SLOW! § Next batch can’t be retrieved until current batch is processed § CPUs and GPUs are Not Fully Utilized § Use Queues to Read and Pre-Process Data for GPUs! § Queues use CPUs for I/O, pre-processing, shuffling, … § GPUs stay saturated and focused on cool GPU stuff!!
  • 28. DATA MOVEMENT WITH QUEUES § Queue Pulls Batch from Source (ie HDFS, Kafka) § Queue pre-processes and shuffles data § Queue should use only CPUs - no GPUs § Combine many small files into a few large TFRecord files § GPU Pulls Batch from Queue (CUDA Streams) § GPU pulls next batch while processing current batch § GPU Processes Next Batch Immediately § GPUs are fully utilized!
  • 29. QUEUE CAPACITY PLANNING § batch_size § # of examples per batch (ie. 64 jpg) § Limited by GPU RAM § num_processing_threads § CPU threads pull and pre-process batches of data § Limited by CPU Cores § queue_capacity § Limited by CPU RAM (ie. 5 * batch_size) Saturate those GPUs! GPU Pulls Batches while Processing Current Batch AsyncMemory Transfer with CUDA Streams -- Thanks,Nvidia!! --
  • 30. DETECT UNDERUTILIZED CPUS, GPUS § Instrument training code to generate “timelines” § Analyze with Google Web Tracing Framework (WTF) § Monitor CPU with `top`, GPU with `nvidia-smi` http://google.github.io/tracing-framework/ from tensorflow.python.client import timeline trace = timeline.Timeline(step_stats=run_metadata.step_stats) with open('timeline.json', 'w') as trace_file: trace_file.write( trace.generate_chrome_trace_format(show_memory=True))
  • 31. LET’S FEED A QUEUE FROM HDFS § Navigate to the following notebook: 02_Feed_Queue_HDFS § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
  • 33. BREAK § https://github.com/fluxcapacitor/pipeline/ § Slides, code, notebooks, Docker images available here: https://github.com/fluxcapacitor/pipeline/ gpu.ml Need Help? Use the Chat!
  • 34. TENSORFLOW MODEL § GraphDef § Architecture of your model (nodes, edges) § Metadata § Asset: Accompanying assets to your model § SignatureDef: Maps external to internal graph nodes for TF Serving § MetaGraph § Combines GraphDef and Metadata § Variables § Stored separately (checkpointed) during training § Allows training to continue from any checkpoint § Values are “frozen” into Constants when deployed for inference Graph Definition x W mul add b MetaGraph Metadata Assets SignatureDef Tags Version
  • 35. TENSORFLOW SESSION Session graph: GraphDef variables: “W” : 0.328 “b” : -1.407 Variablesare Periodically Checkpointed GraphDef is Saved as-is
  • 36. LET’S TRAIN A MODEL (CPU) § Navigate to the following notebook: 03_Train_Model_CPU § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
  • 37. LET’S TRAIN A MODEL (GPU) § Navigate to the following notebook: 04_Train_Model_GPU § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
  • 38. TENSORFLOW DEBUGGER § Step through Operations § Inspect Inputs and Outputs § Wrap Session in Debug Session sess = tf.Session(config=config) sess = tf_debug.LocalCLIDebugWrapperSession(sess)
  • 39. LET’S DEBUG A MODEL § Navigate to the following notebook: 05_Debug_Model § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
  • 41. BREAK § https://github.com/fluxcapacitor/pipeline/ § Slides, code, notebooks, Docker images available here: https://github.com/fluxcapacitor/pipeline/ gpu.ml Need Help? Use the Chat!
  • 42. AGENDA § Setup Environment § Train and Debug TensorFlow Model § Train with Distributed TensorFlow Cluster § Optimize Model with XLA JIT Compiler § Optimize Model with XLA AOT and Graph Transforms § Deploy Model to TensorFlow Serving Runtime § Optimize TensorFlow Serving Runtime § Wrap-up and Q&A
  • 43. MULTI-GPU TRAINING (SINGLE NODE) § Variables stored on CPU (cpu:0) § Model graph (aka “replica”, “tower”) is copied to each GPU(gpu:0, gpu:1, …) Multi-GPU Training Steps: 1. CPU transfersmodel to each GPU 2. CPU waitson all GPUs to finish batch 3. CPU copiesall gradientsback from all GPUs 4. CPU synchronizesand averagesall gradientsfrom GPUs 5. CPU updatesGPUs with new variables/weights 6. Repeat Step 1 until reaching stop condition (ie. max_epochs)
  • 44. DISTRIBUTED, MULTI-NODE TRAINING § TensorFlow Automatically Inserts Send and Receive Ops into Graph § Parameter Server Synchronously AggregatesUpdates to Variables § Nodes with Multiple GPUs will Pre-Aggregate Before Sending to PS Worker0 Worker0 Worker1 Worker0 Worker1 Worker2 gpu0 gpu1 gpu2 gpu3 gpu0 gpu1 gpu2 gpu3 gpu0 gpu1 gpu2 gpu3 gpu0 gpu1 gpu0 gpu0
  • 45. SYNCHRONOUS VS. ASYNCHRONOUS § Synchronous § Worker (“graphreplica”, “tower”) § Reads samevariables from Parameter Server in parallel § Computes gradients for variables using partition of data § Sends gradients to central Parameter Server § Parameter Server § Aggregates (avg) gradients for each variable based on its portion of data § Applies gradients (+, -) to each variables § Broadcasts updated variables to each node in parallel § ^^ Repeat ^^ § Asynchronous § Each node computes gradients independently § Reads stale values,does not synchronized with other nodes
  • 46. DATA PARALLEL VS MODEL PARALLEL § Data Parallel § Send exact same model to each worker/device § Each worker/device operates on their partition of data, same model § Similar to how Spark sends the same function to differentworkers § Each Spark worker operates on their partition of the data § Model Parallel § Send different partition of model to each worker/device § Each worker/device operates on all data, their partition of model Very Difficult, But Necessary for Very Large Models (RAM)
  • 47. LET’S TRAIN A DISTRIBUTED MODEL § Navigate to the following notebook: 06_Train_Distributed_Model § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
  • 48. AGENDA § Setup Environment § Train and Debug TensorFlow Model § Train with Distributed TensorFlow Cluster § Optimize Model with XLA JIT Compiler § Optimize Model with XLA AOT and Graph Transforms § Deploy Model to TensorFlow Serving Runtime § Optimize TensorFlow Serving Runtime § Wrap-up and Q&A
  • 49. XLA FRAMEWORK § Accelerated Linear Algebra (XLA) § Experimental! § Goals: § Improve execution speed § Improve memory usage § Reduce reliance on custom operators § Reduce mobile footprint § Improve portability
  • 50. XLA HIGH LEVEL OPTIMIZER (HLO) § Define Graphs using HLO Language § Compiler IntermediateRepresentation (IR) § Independentof source and target language § XLA Emits Target-Independent,OptimizedHLO § Backend Emits Target-Dependent,OptimizedLLVM § LLVM Emits Native Code for Target § x86-64 and ARM64 CPU Instruction Sets § Nvidia GPU NVPTX
  • 51. JIT COMPILER § Just-In-Time Compiler § Built on XLA Framework § Goals: § Reduce memory movement – especiallyuseful on GPUs § Reduce overhead of multiple function calls § Similar to Spark Operator Fusing in Spark 2.0 § Unroll Loops, Fuse Operators, Fold Constants, … § Enable session-wide or `with jit_scope():`
  • 52. VISUALIZING JIT COMPILER IN ACTION Before After Google Web Tracing Framework: http://google.github.io/tracing-framework/ from tensorflow.python.client import timeline trace = timeline.Timeline(step_stats=run_metadata.step_stats) with open('timeline.json', 'w') as trace_file: trace_file.write( trace.generate_chrome_trace_format(show_memory=True))
  • 53. VISUALIZING FUSING OPERATORS pip install graphviz dot -Tpng /tmp/hlo_graph_99.w5LcGs.dot -o hlo_graph_80.png GraphViz: http://www.graphviz.org hlo_*.dot files generated by XLA
  • 54. LET’S TRAIN A MODEL (XLA + JIT) § Navigate to the following notebook: 07_Train_Model_XLA_JIT § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
  • 55. IT’S WORTH HIGHLIGHTING… § From now on, we will optimize models for inference only § In other words, No More Training Optimizations!
  • 57. BREAK § https://github.com/fluxcapacitor/pipeline/ § Slides, code, notebooks, Docker images available here: https://github.com/fluxcapacitor/pipeline/ gpu.ml Need Help? Use the Chat!
  • 58. AGENDA § Setup Environment § Train and Debug TensorFlow Model § Train with Distributed TensorFlow Cluster § Optimize Model with XLA JIT Compiler § Optimize Model with XLA AOT and Graph Transforms § Deploy Model to TensorFlow Serving Runtime § Optimize TensorFlow Serving Runtime § Wrap-up and Q&A
  • 59. AOT COMPILER § Standalone, Ahead-Of-Time (AOT) Compiler § `tfcompile` § Built on XLA framework § Creates executable with minimal TensorFlow Runtime § Creates functions with feeds (input) and fetches (output) § Includes only dependencies neededby subgraph computation § Packaged as `cc_libary` with header and object file § Commonly used on inference graph for mobile devices § CPU x86-64 and ARM only – currently, no GPU support
  • 60. GRAPH TRANSFORM TOOL (GTT) § Optimize Trained Models for Inference § Examples § Remove training-only Ops (checkpoints, drop out) § Strip out unreachable nodes § Remove debug operations (ie. LogSummary) § Fold batch normalization into pre-calculated weights § Fuse adjacent operators § Quantization
  • 61. WEIGHT QUANTIZATION FOR INFERENCE § FP16 and INT8 have much lower precision than FP32 § Weights are known at inference time § Linear quantization
  • 62. ACTIVATION QUANTIZATION § Activations are not known without forward propagation § Depends on input during inference § Requires “calibration” from representative dataset
  • 64. FUSED BATCH NORMALIZATION § What is Batch Normalization? § Each batch of data may have wildly different distributions § Normalize per batch (and layer) § Speeds up training dramatically § Weights are learned quicker § Final model is more accurate Always Use Batch Normalization! § GTT Fuses Final Mean and Variance MatMul into Graph z = tf.matmul(a_prev, W) a = tf.nn.relu(z) a_mean, a_var = tf.nn.moments(a, [0]) scale = tf.Variable(tf.ones([depth/channels])) beta = tf.Variable(tf.zeros ([depth/channels])) bn = tf.nn.batch_normalizaton(a, a_mean, a_var, beta, scale, 0.001)
  • 65. LET’S OPTIMIZE MODEL WITH GTT § Navigate to the following notebook: 08_Optimize_Model_CPU § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
  • 66. LET’S OPTIMIZE MODEL WITH GTT § Navigate to the following notebook: 09_Optimize_Model_GPU § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
  • 68. BREAK § https://github.com/fluxcapacitor/pipeline/ § Slides, code, notebooks, Docker images available here: https://github.com/fluxcapacitor/pipeline/ gpu.ml Need Help? Use the Chat!
  • 69. AGENDA § Setup Environment § Train and Debug TensorFlow Model § Train with Distributed TensorFlow Cluster § Optimize Model with XLA JIT Compiler § Optimize Model with XLA AOT and Graph Transforms § Deploy Model to TensorFlow Serving Runtime § Optimize TensorFlow Serving Runtime § Wrap-up and Q&A
  • 70. MODEL SERVING TERMINOLOGY § Inference § Only Forward Propagation through Network § Predict, Classify, Regress, … § Bundle § GraphDef, Variables, Metadata, … § Assets § ie. Map of ClassificationID -> String § {9283: “penguin”, 9284: “bridge”, …} § Version § Every Model Has a Version Number (Integers Only?!) § Version Policy § ie. Serve Only Latest (Highest), Serve both Latest and Previous, …
  • 71. TENSORFLOW SERVING FEATURES § Low-latency or High-throughput Tuning § Supports Auto-Scaling § DifferentModels/Versions Served in Same Process § Custom Loaders beyond File-based § Custom Serving Models beyond HashMap and TensorFlow § Custom Version Policies for A/B and Bandit Tests § Drain Requests for Graceful Model Shutdown or Update § Extensible Request Batching Strategies for Diff Use Cases and HW § Uses Highly-Efficient GRPC and Protocol Buffers
  • 72. PREDICTION SERVICE § Predict (Original, Generic) § Input: List of Tensors § Output: List of Tensors § Classify § Input: List of `tf.Example` (key, value) pairs § Output: List of (class_label: String, score: float) § Regress § Input: List of `tf.Example` (key, value) pairs § Output: List of (label: String, score: float)
  • 73. PREDICTION INPUTS + OUTPUTS § SignatureDef § Defines inputs and outputs § Maps external (logical) to internal (physical) tensor names § Allows internal (physical) tensor names to change tensor_info_x_observed = utils.build_tensor_info(x_observed) tensor_info_y_pred = utils.build_tensor_info(y_pred) prediction_signature = signature_def_utils.build_signature_def( inputs = {'x_observed': tensor_info_x_observed}, outputs = {'y_pred': tensor_info_y_pred}, method_name = signature_constants.PREDICT_METHOD_NAME )
  • 74. MULTI-HEADED INFERENCE § Inputs Propagated Forward Only Once § Multiple “Heads” of Model § Optimizes Bandwidth, CPU, Latency, Memory, Coolness
  • 75. BUILD YOUR OWN MODEL SERVER (?!) § Overcome HTTP <-> GRPC limitation § Control Latency Requirements § Perform Batch Inference vs. Request/Response § Asynchronous Request Handling § Custom Request Batching § Wrap Circuit Breakers, Fallbacks § Mobile Inference Use Cases § Reduce Number of Moving Parts #include “tensorflow_serving/model_servers/server_core.h” … class MyTensorFlowModelServer { ServerCore::Options options; // set options (model name, path, etc) std::unique_ptr<ServerCore> core; TF_CHECK_OK( ServerCore::Create(std::move(options), &core) ); … } Compile and Link libtensorflow.so
  • 76. LET’S DEPLOY AND SERVE A MODEL § Navigate to the following notebook: 10_Deploy_Model_Server § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
  • 78. BREAK § https://github.com/fluxcapacitor/pipeline/ § Slides, code, notebooks, Docker images available here: https://github.com/fluxcapacitor/pipeline/ gpu.ml Need Help? Use the Chat!
  • 79. AGENDA § Setup Environment § Train and Debug TensorFlow Model § Train with Distributed TensorFlow Cluster § Optimize Model with XLA JIT Compiler § Optimize Model with XLA AOT and Graph Transforms § Deploy Model to TensorFlow Serving Runtime § Optimize TensorFlow Serving Runtime § Wrap-up and Q&A
  • 80. REQUEST BATCH TUNING § max_batch_size § Enables throughput/latency tradeoff § Bound by RAM § batch_timeout_micros § Defines batch time window, latency upper-bound § Bound by RAM § num_batch_threads § Defines parallelism § Bound by CPU cores § max_enqueued_batches § Defines queue upper bound, throttling § Bound by RAM Reaching either threshold will trigger a batch
  • 81. BATCH SCHEDULER STRATEGIES § BasicBatchScheduler § Best for homogeneous request types (ie. always classify or always regress) § Async callback when `max_batch_size` or `batch_timeout_micros` is reached § `BatchTask` encapsulates unit of work to be batched § SharedBatchScheduler § Best for heterogeneous request types, multi-step inference, ensembles, … § Groups BatchTasks into separate queues to form homogenous batches § Processes batches fairly through interleaving § StreamingBatchScheduler § Mixed CPU/GPU/IO-bound workloads § Provides fine-grained control for complex, multi-phase inference logic Must Experiment to Find Best Strategy for You!!
  • 82. LET’S OPTIMIZE THE MODEL SERVER § Navigate to the following notebook: 11_Tune_Model_Server § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
  • 84. AGENDA § Setup Environment § Train and Debug TensorFlow Model § Train with Distributed TensorFlow Cluster § Optimize Model with XLA JIT Compiler § Optimize Model with XLA AOT and Graph Transforms § Deploy Model to TensorFlow Serving Runtime § Optimize TensorFlow Serving Runtime § Wrap-up and Q&A
  • 85. YOU JUST LEARNED… § TensorFlow Best Practices § To Inspect and Debug Models § To Distribute Training Across a Cluster § To Optimize Training with Queue Feeders § To Optimize Training with XLA JIT Compiler § To Optimize Inference with AOT and Graph Transform Tool (GTT) § Key Components of TensorFlow Serving § To Deploy Models with TensorFlow Serving § To Optimize Inference by Tuning TensorFlow Serving
  • 86. Q&A § Thank you!! § https://github.com/fluxcapacitor/pipeline/ § Slides, code, notebooks, Docker images available here: https://github.com/fluxcapacitor/pipeline/ gpu.ml Contact Me? Email: chris@pipeline.io Twitter: @cfregly