HIGH PERFORMANCE TENSORFLOW
IN PRODUCTION WITH GPUS!!
CHRIS FREGLY, FOUNDER @ PIPELINE.IO
PIPELINE.AI TENSORFLOW GPU WORKSHOP
NEW YORK - JULY 8, 2017
INTRODUCTIONS: ME
§ Chris Fregly, Research Engineer @ PipelineAI
§ Formerly Netflix and Databricks
§ Advanced Spark and TensorFlow Meetup
Please Join Our 18,000+ Members Globally!!
* San Francisco
* Chicago
* Washington DC
* London
INTRODUCTIONS: YOU
§ Software Engineer or Data Scientist interested in optimizing
and deploying TensorFlow models to production
§ Assume you have a working knowledge of TensorFlow
CONTENT SUMMARY
§ 50% Training Optimizations (TensorFlow, XLA, Tools)
§ 50% Deployment and Inference Optimizations (Serving)
§ Why Heavy Focus on Inference?
§ Training: boring batch, O(num_data_scientists)
§ Inference: exciting realtime, O(num_users_of_app)
§ We Use Simple Models to Highlight Optimizations
§ Warning: This is not introductory TensorFlow material!
100% OPEN SOURCE CODE
§ https://github.com/fluxcapacitor/pipeline/
§ Please Star this Repo!
§ Slides, code, notebooks, Docker images available here:
https://github.com/fluxcapacitor/pipeline/gpu.ml
HANDS-ON EXERCISES
§ Combo of Jupyter Notebooks and Command Line
§ Command Line through Jupyter Terminal
§ Some Exercises Based on Experimental Features
In Other Words, You Will See Errors. It’s OK!!
YOU WILL LEARN…
§ To Inspect and Debug Models
§ Training and Predicting Best Practices
§ To Distribute Training Across a Cluster
§ To Optimize Training with Queue Feeders
§ To Optimize Training with XLA JIT Compiler
§ To Optimize Inference with AOT and Graph Transform Tool (GTT)
§ Key Components of TensorFlow Serving
§ To Deploy Models with TensorFlow Serving
§ To Optimize Inference by Tuning TensorFlow Serving
AGENDA
§ GPUs and TensorFlow
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model with XLA JIT Compiler
§ Optimize Model with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
EVERYBODY GETS A GPU!
SETUP ENVIRONMENT
§ Step 1: Browse to the following:
http://allocator.community.pipeline.io/allocate
§ Step 2: Browse to the following:
http://<ip-address>
Need Help?
Use the Chat!
VERIFY SETUP
http://<ip-address>
Any username,
Any password!
LET’S EXPLORE OUR ENVIRONMENT
§ Navigate to the following notebook:
01_Explore_Environment
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
PULSE CHECK
BREAK
§ https://github.com/fluxcapacitor/pipeline/
§ Slides, code, notebooks, Docker images available here:
https://github.com/fluxcapacitor/pipeline/gpu.ml
Need Help?
Use the Chat!
SETTING UP TENSORFLOW WITH GPUS
§ Very Painful!
§ Especially inside Docker
§ Use nvidia-docker
§ Especially on Kubernetes!
§ Use Kubernetes 1.7+
§ Check Out Pipeline.IO for Links to GitHub +
DockerHub
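For example, a minimal nvidia-docker invocation (the image tag is illustrative):

nvidia-docker run -it --rm tensorflow/tensorflow:latest-gpu bash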
GPU HALF-PRECISION SUPPORT
§ FP16 ("Half Precision") and INT8 are Reduced-Precision Formats
§ Supported by Pascal P100 (2016) and Volta V100 (2017)
§ Flexible FP32 GPU Cores Can Fit 2 FP16’s for 2x Throughput!
§ Half-Precision is OK for Approximate Deep Learning Use Cases
VOLTA V100 RECENTLY ANNOUNCED
§ 84 Streaming Multiprocessors (SM’s)
§ 5,376 GPU Cores
§ 672 Tensor Cores (similar to Google's TPU)
§ Mixed FP16/FP32 Precision
§ More Shared Memory
§ New L0 Instruction Cache
§ Faster L1 Data Cache
§ V100 vs. P100 Performance
§ 12x TFLOPS @ Peak Training
§ 6x Inference Throughput
V100 AND CUDA 9
§ Independent Thread Scheduling - Finally!!
§ Similar to CPU fine-grained thread synchronization semantics
§ Allows GPU to yield execution of any thread
§ Still Optimized for SIMT (Single Instruction, Multiple Threads)
§ SIMT units automatically scheduled together
§ Explicit Synchronization
GPU CUDA PROGRAMMING
§ Barbaric, But Fun!
§ Must Know Underlying Hardware Very Well
§ Many Great Debuggers/Profilers
§ Hardware Changes are Painful!
§ Newer CUDA versions
automatically JIT-compile old
CUDA code to new NVPTX
§ Not optimal, of course
CUDA STREAMS
§ Asynchronous I/O Transfer
§ Overlap Compute and I/O
§ Keeps GPUs Saturated
§ Fundamental to Queue Framework in TensorFlow
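A minimal sketch of overlapping transfers and compute on a CUDA stream, using Numba (as in the 01b_Explore_Numba notebook); the kernel and sizes are illustrative:

import numpy as np
from numba import cuda

@cuda.jit
def scale(x, factor):
    i = cuda.grid(1)
    if i < x.size:
        x[i] *= factor

stream = cuda.stream()                   # an asynchronous CUDA stream
a = np.arange(1000000, dtype=np.float32)
d_a = cuda.to_device(a, stream=stream)   # async host-to-device copy
blocks = (a.size + 255) // 256
scale[blocks, 256, stream](d_a, 2.0)     # kernel launch on the same stream
d_a.copy_to_host(a, stream=stream)       # async device-to-host copy
stream.synchronize()                     # wait for all work queued on the stream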
LET’S SEE WHAT THIS THING CAN DO!
§ Navigate to the following notebook:
01a_Explore_GPU
01b_Explore_Numba
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
PULSE CHECK
BREAK
§ https://github.com/fluxcapacitor/pipeline/
§ Slides, code, notebooks, Docker images available here:
https://github.com/fluxcapacitor/pipeline/gpu.ml
Need Help?
Use the Chat!
AGENDA
§ GPUs and TensorFlow
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model with XLA JIT Compiler
§ Optimize Model with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
TRAINING TERMINOLOGY
§ Tensors: N-Dimensional Arrays
§ ie. Scalar, Vector, Matrix
§ Operations: MatMul, Add, SummaryLog,…
§ Graph: Graph of Operations (DAG)
§ Session: Contains Graph(s)
§ Feeds: Feed inputs into Operation
§ Fetches: Fetch output from Operation
§ Variables: What we learn through training
§ aka “weights”, “parameters”
§ Devices: Hardware device on which we train
[Diagram: -User- feeds Inputs → -TensorFlow- performs Operations, flows Tensors, and trains Variables → -User- fetches Outputs]
with tf.device("/job:worker/task:0/device:GPU:0"):
TRAINING DEVICES
§ cpu:0
§ By default, all CPUs
§ Requires extra config to target a specific CPU
§ gpu:0..n
§ Each GPU has a unique id
§ TF usually prefers a single GPU
§ xla_cpu:0, xla_gpu:0..n
§ “JIT Compiler Device”
§ Hints TensorFlow to attempt JIT Compile
with tf.device("/cpu:0"):
with tf.device("/gpu:0"):
with tf.device("/gpu:1"):
TRAINING METRICS: TENSORBOARD
§ Summary Ops
§ Event Files
/root/tensorboard/linear/<version>/events…
§ Tags
§ Organize data within Tensorboard UI
loss_summary_op = tf.summary.scalar('loss', loss)

merge_all_summary_op = tf.summary.merge_all()

summary_writer = tf.summary.FileWriter(
    '/root/tensorboard/linear/<version>',
    graph=sess.graph)
TRAINING ON EXISTING INFRASTRUCTURE
§ Data Processing
§ HDFS/Hadoop
§ Spark
§ Containers
§ Docker
§ Schedulers
§ Kubernetes
§ Mesos
<dependency>
<groupId>org.tensorflow</groupId>
<artifactId>tensorflow-hadoop</artifactId>
<version>1.0-SNAPSHOT</version>
</dependency>
https://github.com/tensorflow/ecosystem
TRAINING PIPELINES: QUEUE + DATASET
§ Don’t Use feed_dict for Production Workloads!!
§ feed_dict Requires Python <-> C++ Serialization
§ Retrieval is Single-threaded, Synchronous, SLOW!
§ Can’t Retrieve Until Current Batch is Complete
§ CPUs/GPUs Not Fully Utilized!
§ Use Queue or Dataset API
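A hedged sketch of the queue-based pattern, assuming TFRecord files on HDFS (the path and parsing are placeholders):

import tensorflow as tf

# a queue of filenames feeds a reader; queue-runner threads do the I/O
filename_queue = tf.train.string_input_producer(
    tf.train.match_filenames_once('hdfs://namenode:8020/data/train-*.tfrecord'))
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())  # match_filenames_once state
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    # ... training loop pulls batches here, with no feed_dict ...
    coord.request_stop()
    coord.join(threads)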
QUEUES
§ More than Just a Traditional Queue
§ Perform I/O, pre-processing, cropping, shuffling
§ Pulls from HDFS, S3, Google Storage, Kafka, ...
§ Combine many small files into large TFRecord files
§ Typically use CPUs to focus GPUs on compute
§ Uses CUDA Streams
DATA MOVEMENT WITH QUEUES
§ GPU Pulls Batch from Queue (CUDA Streams)
§ GPU pulls next batch while processing current batch
GPUs Stay Fully Utilized!
QUEUE CAPACITY PLANNING
§ batch_size
§ # examples / batch (ie. 64 jpg)
§ Limited by GPU RAM
§ num_processing_threads
§ CPU threads pull and pre-process batches of data
§ Limited by CPU Cores
§ queue_capacity
§ Limited by CPU RAM (ie. 5 * batch_size)
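Continuing the queue sketch above, these knobs map directly onto tf.train.shuffle_batch (values are illustrative; parse_fn is a hypothetical per-record parser):

example = parse_fn(serialized_example)
batch = tf.train.shuffle_batch(
    [example],
    batch_size=64,             # ie. 64 examples/batch, limited by GPU RAM
    num_threads=4,             # num_processing_threads, limited by CPU cores
    capacity=5 * 64,           # queue_capacity, limited by CPU RAM
    min_after_dequeue=2 * 64)  # minimum buffer for decent shuffling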
DETECT UNDERUTILIZED CPUS, GPUS
§ Instrument training code to generate “timelines”
§ Analyze with Google Web
Tracing Framework (WTF)
§ Monitor CPU with `top`, GPU with `nvidia-smi`
http://google.github.io/tracing-framework/
from tensorflow.python.client import timeline

# run_metadata is populated by sess.run(..., options=run_options,
# run_metadata=run_metadata)
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as trace_file:
    trace_file.write(
        trace.generate_chrome_trace_format(show_memory=True))
LET’S FEED DATA WITH A QUEUE
§ Navigate to the following notebook:
02_Feed_Queue_HDFS
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
PULSE CHECK
BREAK
§ https://github.com/fluxcapacitor/pipeline/
§ Slides, code, notebooks, Docker images available here:
https://github.com/fluxcapacitor/pipeline/gpu.ml
Need Help?
Use the Chat!
TENSORFLOW MODEL
§ MetaGraph
§ Combines GraphDef and Metadata
§ GraphDef
§ Architecture of your model (nodes, edges)
§ Metadata
§ Asset: Accompanying assets to your model
§ SignatureDef: Maps external to internal tensor names
§ Variables
§ Stored separately during training (checkpoint)
§ Allows training to continue from any checkpoint
§ Variables are “frozen” into Constants when deployed for inference
[Diagram: MetaGraph = GraphDef (x, W, b flowing through mul and add) + Metadata (Assets, SignatureDef, Tags, Version) + Variables ("W": 0.328, "b": -1.407)]
TENSORFLOW SESSION
[Diagram: a Session holds the static GraphDef plus the current Variable values ("W": 0.328, "b": -1.407); Variables are periodically checkpointed, while the GraphDef is static]
TENSORFLOW DEBUGGER
§ Step through Operations
§ Inspect Inputs and Outputs
§ Wrap Session in Debug Session
from tensorflow.python import debug as tf_debug

sess = tf.Session(config=config)
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
LET’S DEBUG A MODEL
§ Navigate to the following notebook:
04_Debug_Model
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
BATCH NORMALIZATION
§ Each Mini-Batch May Have Wildly Different Distributions
§ Normalize per batch (and layer)
§ Speeds up Training!!
§ Weights are Learned Quicker
§ Final Model is More Accurate
§ Final mean and variance will be folded into Graph later
-- Always Use Batch Normalization! --
z = tf.matmul(a_prev, W)
a = tf.nn.relu(z)
a_mean, a_var = tf.nn.moments(a, [0])
scale = tf.Variable(tf.ones([depth]))  # depth = number of channels
beta = tf.Variable(tf.zeros([depth]))
bn = tf.nn.batch_normalization(a, a_mean, a_var,
                               beta, scale, 0.001)
AGENDA
§ GPU Environment
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model with XLA JIT Compiler
§ Optimize Model with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
MULTI-GPU TRAINING (SINGLE NODE)
§ Variables stored on CPU (cpu:0)
§ Model graph (aka “replica”, “tower”)
is copied to each GPU(gpu:0, gpu:1, …)
Multi-GPU Training Steps:
1. CPU transfers model to each GPU
2. CPU waits on all GPUs to finish batch
3. CPU copies all gradients back from all GPUs
4. CPU synchronizes + AVG all gradients from GPUs
5. CPU updates GPUs with new variables/weights
6. Repeat Step 1 until reaching stop condition (ie. max_epochs)
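A condensed sketch of these steps in the TF 1.x "tower" style; num_gpus, model_fn, and next_batch are assumptions:

import tensorflow as tf

opt = tf.train.GradientDescentOptimizer(0.01)
tower_grads = []
with tf.device('/cpu:0'):                      # variables live on the CPU
    for i in range(num_gpus):
        with tf.device('/gpu:%d' % i):
            with tf.variable_scope(tf.get_variable_scope(), reuse=(i > 0)):
                loss = model_fn(next_batch())  # one replica ("tower") per GPU
                tower_grads.append(opt.compute_gradients(loss))

# the CPU averages gradients across towers, then updates the shared variables
avg_grads = []
for grad_and_vars in zip(*tower_grads):
    grads = tf.stack([g for g, _ in grad_and_vars])
    avg_grads.append((tf.reduce_mean(grads, 0), grad_and_vars[0][1]))
train_op = opt.apply_gradients(avg_grads)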
DISTRIBUTED, MULTI-NODE TRAINING
§ TensorFlow Automatically Inserts Send and Receive Ops into Graph
§ Parameter Server Synchronously Aggregates Updates to Variables
§ Nodes with Multiple GPUs will Pre-Aggregate Before Sending to PS
[Diagram: example topologies, from one worker with a single GPU up to three workers with four GPUs each; each worker pre-aggregates its local gradients before sending them to the Parameter Server]
SYNCHRONOUS VS. ASYNCHRONOUS
§ Synchronous
§ Nodes compute gradients
§ Nodes update Parameter Server (PS)
§ Nodes sync on PS for latest gradients
§ Asynchronous
§ Some nodes delay in computing gradients
§ Nodes don’t update PS
§ Nodes get stale gradients from PS
§ May not converge due to stale reads!
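For synchronous updates, TF 1.x provides tf.train.SyncReplicasOptimizer; a hedged sketch (replica counts are illustrative):

opt = tf.train.SyncReplicasOptimizer(
    tf.train.GradientDescentOptimizer(0.01),
    replicas_to_aggregate=4,   # the PS aggregates 4 gradient updates per step
    total_num_replicas=4)
train_op = opt.minimize(loss, global_step=global_step)
sync_hook = opt.make_session_run_hook(is_chief)  # chief initializes sync queues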
DATA PARALLEL VS MODEL PARALLEL
§ Data Parallel (“Between-Graph Replication”)
§ Send exact same model to each device
§ Each device operates on its partition of data
§ ie. Spark sends same function to many workers
§ Each worker operates on its partition of data
§ Model Parallel (“In-Graph Replication”)
§ Send different partition of model to each device
§ Each device operates on all data
Very Difficult!!
Required for Large Models.
(GPU RAM Limitation)
DISTRIBUTED TENSORFLOW CONCEPTS
§ Client
§ Program that builds a TF Graph, constructs a session, interacts with the cluster
§ Written in Python, C++
§ Cluster
§ Set of distributed nodes executing a graph
§ Nodes can play any role
§ Jobs (“Roles”)
§ Parameter Server (“ps”) stores and updates variables
§ Worker (“worker”) performs compute-intensive tasks (stateless)
§ Assigned 0..* tasks
§ Task (“Server Process”)
“ps” and “worker” are
conventional names
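A minimal sketch tying these concepts together (host:port values are placeholders):

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    'ps':     ['ps0:2222'],                       # stores/updates variables
    'worker': ['worker0:2222', 'worker1:2222']})  # compute-intensive tasks

# each process runs one Server for its (job_name, task_index) pair
server = tf.train.Server(cluster, job_name='worker', task_index=0)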
CHIEF WORKER
§ Worker Task 0 is Chosen by Default
§ Task 0 is guaranteed to exist
§ Implements Maintenance Tasks
§ Writes checkpoints
§ Initializes parameters at start of training
§ Writes log summaries
§ Parameter Server health checks
NODE AND PROCESS FAILURES
§ Checkpoint to Persistent Storage (HDFS, S3)
§ Use MonitoredTrainingSession and Hooks
§ Use a Good Cluster Orchestrator (ie. Kubernetes, Mesos)
§ Understand Failure Modes and Recovery States
§ Stateless, Not Bad: Training Continues
§ Stateful, Bad: Training Must Stop
§ Dios Mio! Long Night Ahead…
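A hedged sketch of checkpoint-based recovery with MonitoredTrainingSession; the checkpoint path and step limit are placeholders:

hooks = [tf.train.StopAtStepHook(last_step=100000)]
with tf.train.MonitoredTrainingSession(
        master=server.target,
        is_chief=(task_index == 0),
        checkpoint_dir='hdfs://namenode:8020/checkpoints/linear',
        hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)  # resumes from the last checkpoint after a failure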
VALIDATING DISTRIBUTED MODEL
§ Separate Training and Validation Clusters
§ Validate using Saved Checkpoints from Parameter Servers
§ Avoids Resource Contention
Training
Cluster
Validation
Cluster
Parameter Server
Cluster
LET’S TRAIN WITH DISTRIBUTED CPU
§ Navigate to the following notebook:
05_Train_Model_Distributed_CPU
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
LET’S TRAIN WITH DISTRIBUTED GPU
§ Navigate to the following notebook:
05a_Train_Model_Distributed_GPU
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
NEW(‘ISH): EXPERIMENT + ESTIMATOR
§ Higher-Level APIs Simplify Distributed Training
§ Picks Up Configuration from Environment
§ Supports Custom Models (ie. Keras)
§ Used for Training, Validation, and Prediction
§ API is Changing, but Patterns Remain the Same
§ Works Well with Google Cloud ML (Surprised?!)
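A minimal hedged sketch of the Estimator pattern (TF 1.2-era APIs; model_fn and the input_fns are assumptions). The Estimator picks up cluster configuration from the TF_CONFIG environment variable:

import tensorflow as tf

estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir='/tmp/model')
estimator.train(input_fn=train_input_fn, steps=10000)
metrics = estimator.evaluate(input_fn=eval_input_fn)
predictions = estimator.predict(input_fn=predict_input_fn)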
PULSE CHECK
BREAK
§ https://github.com/fluxcapacitor/pipeline/
§ Slides, code, notebooks, Docker images available here:
https://github.com/fluxcapacitor/pipeline/gpu.ml
Need Help?
Use the Chat!
AGENDA
§ GPUs and TensorFlow
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model with XLA JIT Compiler
§ Optimize Model with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
XLA FRAMEWORK
§ Accelerated Linear Algebra (XLA)
§ Goals:
§ Reduce reliance on custom operators
§ Improve execution speed
§ Improve memory usage
§ Reduce mobile footprint
§ Improve portability
§ Helps TF Stay Flexible and Performant
XLA HIGH LEVEL OPTIMIZER (HLO)
§ Compiler Intermediate Representation (IR)
§ Independent of source and target language
§ Define Graphs using HLO Language
§ XLA Step 1 Emits Target-Independent HLO
§ XLA Step 2 Emits Target-Dependent LLVM
§ LLVM Emits Native Code Specific to Target
§ Supports x86-64, ARM64 (CPU), and NVPTX (GPU)
JIT COMPILER
§ Just-In-Time Compiler
§ Built on XLA Framework
§ Goals:
§ Reduce memory movement – especially useful on GPUs
§ Reduce overhead of multiple function calls
§ Similar to operator fusing in Spark 2.0
§ Unroll Loops, Fuse Operators, Fold Constants, …
§ Scope to session, device, or `with jit_scope():`
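Two hedged ways to enable the JIT in TF 1.x: session-wide via ConfigProto, or scoped to specific ops:

import tensorflow as tf
from tensorflow.contrib.compiler import jit

# session-wide JIT
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = (
    tf.OptimizerOptions.ON_1)
sess = tf.Session(config=config)

# or scoped to specific ops
with jit.experimental_jit_scope():
    y_pred = tf.matmul(x_observed, W) + b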
VISUALIZING JIT COMPILER IN ACTION
[Screenshots: timeline traces before vs. after JIT compilation]
Google Web Tracing Framework:
http://google.github.io/tracing-framework/
from tensorflow.python.client import timeline

# run_metadata is populated by sess.run(..., options=run_options,
# run_metadata=run_metadata)
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as trace_file:
    trace_file.write(
        trace.generate_chrome_trace_format(show_memory=True))
VISUALIZING FUSING OPERATORS
pip install graphviz

dot -Tpng /tmp/hlo_graph_1.w5LcGs.dot -o hlo_graph_1.png

GraphViz: http://www.graphviz.org
(hlo_*.dot files are generated by XLA)
LET’S TRAIN WITH XLA CPU
§ Navigate to the following notebook:
06_Train_Model_XLA_CPU
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
LET’S TRAIN WITH XLA GPU
§ Navigate to the following notebook:
06a_Train_Model_XLA_GPU
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
PULSE CHECK
BREAK
§ https://github.com/fluxcapacitor/pipeline/
§ Slides, code, notebooks, Docker images available here:
https://github.com/fluxcapacitor/pipeline/gpu.ml
Need Help?
Use the Chat!
IT’S WORTH HIGHLIGHTING…
§ From Now On, We Optimize Trained Models For Inference
§ In Other Words…
We’re Done with Training Optimizations!
Let’s Move to Predicting Optimizations!!
AGENDA
§ GPUs and TensorFlow
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model with XLA JIT Compiler
§ Optimize Model with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
AOT COMPILER
§ Standalone, Ahead-Of-Time (AOT) Compiler
§ Built on XLA framework
§ tfcompile
§ Creates executable with minimal TensorFlow Runtime needed
§ Includes only dependencies needed by subgraph computation
§ Creates functions with feeds (inputs) and fetches (outputs)
§ Packaged as cc_library header and object files to link into your app
§ Commonly used for mobile device inference graph
§ Currently, only CPU x86-64 and ARM are supported - no GPU
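A hedged sketch of the Bazel tf_library macro that drives tfcompile (file names and the class name are placeholders):

load("//tensorflow/compiler/aot:tfcompile.bzl", "tf_library")

tf_library(
    name = "my_graph_aot",
    graph = "my_graph.pb",               # frozen GraphDef
    config = "my_graph.config.pbtxt",    # declares feeds (inputs) and fetches (outputs)
    cpp_class = "mynamespace::MyGraph",  # generated class to link into your app
)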
GRAPH TRANSFORM TOOL (GTT)
§ Optimize Trained Models for Inference
§ Remove training-only Ops (checkpoints, dropout, logs)
§ Remove unreachable nodes between given feed -> fetch
§ Fuse adjacent operators to improve memory bandwidth
§ Fold final batch norm mean and variance into variables
§ Rounding weights/variables improves compression (ie. 70%)
§ Quantizing weights and activations simplifies the model
§ FP32 down to INT8
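The GTT is also exposed as a Python API; a hedged sketch (tensor names and the transform list are illustrative):

import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph

with tf.gfile.GFile('unoptimized_model.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

optimized_graph_def = TransformGraph(
    graph_def,
    ['x_observed'],  # feeds
    ['add'],         # fetches
    ['strip_unused_nodes', 'remove_nodes(op=Identity)',
     'fold_constants(ignore_errors=true)', 'fold_batch_norms',
     'quantize_weights'])

with tf.gfile.GFile('optimized_model.pb', 'wb') as f:
    f.write(optimized_graph_def.SerializeToString())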
BEFORE OPTIMIZATIONS
AFTER STRIPPING UNUSED NODES
§ Optimizations
§ strip_unused_nodes
§ Results
§ Graph much simpler
§ File size much smaller
AFTER REMOVING UNUSED NODES
§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ Results
§ Pesky nodes removed
§ File size a bit smaller
AFTER FOLDING CONSTANTS
§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ fold_constants
§ Results
§ Placeholders (feeds) -> Variables*
(*Why Variables and not Constants?)
AFTER FOLDING BATCH NORMS
§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ fold_constants
§ fold_batch_norms
§ Results
§ Graph remains the same
§ File size approximately the same
WEIGHT QUANTIZATION
§ FP16 and INT8 Are Smaller and Computationally Simpler
§ Weights/Variables are Constants
§ Easy to Linearly Quantize
AFTER QUANTIZING WEIGHTS
§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ fold_constants
§ fold_batch_norms
§ quantize_weights
§ Results
§ Graph is same, file size is smaller, compute is faster
LET’S OPTIMIZE FOR INFERENCE
§ Navigate to the following notebook:
07_Optimize_Model*
(*Why just CPU version? Why not both CPU and GPU?)
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
BUT WAIT, THERE’S MORE!
ACTIVATION QUANTIZATION
§ Activations Not Known Ahead of Time
§ Depends on input, not easy to quantize
§ Requires Additional Calibration Step
§ Use a “representative” dataset
§ Per Neural Network Layer…
§ Collect histogram of activation values
§ Generate many quantized distributions with different saturation thresholds
§ Choose threshold to minimize…
KL_divergence(ref_distribution, quant_distribution)
§ Not Much Time or Data is Required (Minutes on Commodity Hardware)
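A toy sketch of that threshold search, assuming NumPy and SciPy are available (bin counts are illustrative; production calibrators are more careful):

import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def best_threshold(activations, num_bins=2048, levels=256):
    hist, edges = np.histogram(np.abs(activations), bins=num_bins)
    best_i, best_kl = levels, np.inf
    for i in range(levels, num_bins):
        ref = hist[:i].astype(np.float64)
        ref[-1] += hist[i:].sum()  # clip outliers into the last bin
        # collapse i bins into `levels` buckets, spread mass back uniformly
        quant = np.concatenate(
            [np.full(len(c), c.sum() / max((c > 0).sum(), 1)) * (c > 0)
             for c in np.array_split(ref, levels)])
        kl = entropy(ref + 1e-12, quant + 1e-12)
        if kl < best_kl:
            best_i, best_kl = i, kl
    return edges[best_i]  # saturate |activation| above this threshold

# ie. threshold = best_threshold(calibration_activations_for_one_layer)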
ACTIVATION QUANTIZATION GRAPH
1. Create conversion subgraph
2. Produce QuantizedMatMul and QuantizedRelu ops
3. Eliminate adjacent Dequantize + Quantize pairs
AFTER ACTIVATION QUANTIZATION
§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ fold_constants
§ fold_batch_norms
§ quantize_weights
§ quantize_nodes (activations)
§ Results
§ Larger graph, needs calibration!
Requires additional
freeze_requantization_ranges
LET’S OPTIMIZE FOR INFERENCE (2/2)
§ Navigate to the following notebook:
08_Optimize_Model_Activations
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
PULSE CHECK
BREAK
§ https://github.com/fluxcapacitor/pipeline/
§ Slides, code, notebooks, Docker images available here:
https://github.com/fluxcapacitor/pipeline/gpu.ml
Need Help?
Use the Chat!
AGENDA
§ GPUs and TensorFlow
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model with XLA JIT Compiler
§ Optimize Model with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
MODEL SERVING TERMINOLOGY
§ Inference
§ Only Forward Propagation through Network
§ Predict, Classify, Regress, …
§ Bundle
§ GraphDef, Variables, Metadata, …
§ Assets
§ ie. Map of ClassificationID -> String
§ {9283: “penguin”, 9284: “bridge”}
§ Version
§ Every Model Has a Version Number (Integer)
§ Version Policy
§ ie. Serve Only Latest (Highest), Serve Both Latest and Previous, …
TENSORFLOW SERVING FEATURES
§ Supports Auto-Scaling
§ Custom Loaders beyond File-based
§ Tune for Low-latency or High-throughput
§ Serve Different Models/Versions in the Same Process
§ Customize Model Types beyond HashMap and TensorFlow
§ Customize Version Policies for A/B and Bandit Tests
§ Support Request Draining for Graceful Model Updates
§ Enable Request Batching for Different Use Cases and Hardware
§ Supports Optimized Transport with gRPC and Protocol Buffers
PREDICTION SERVICE
§ Predict (Original, Generic)
§ Input: List of Tensor
§ Output: List of Tensor
§ Classify
§ Input: List of tf.Example (key, value) pairs
§ Output: List of (class_label: String, score: float)
§ Regress
§ Input: List of tf.Example (key, value) pairs
§ Output: List of (label: String, score: float)
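A hedged TF Serving 1.x Python client sketch for the Predict API (host, port, model name, and tensor values are placeholders):

import tensorflow as tf
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2

channel = implementations.insecure_channel('localhost', 9000)
stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = 'linear'
request.inputs['inputs'].CopyFrom(
    tf.contrib.util.make_tensor_proto([[1.0], [2.0]]))  # List of Tensor in

result = stub.Predict(request, 10.0)  # 10-second timeout; List of Tensor out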
PREDICTION INPUTS + OUTPUTS
§ SignatureDef
§ Defines inputs and outputs
§ Maps external (logical) to internal (physical) tensor names
§ Allows internal (physical) tensor names to change
from tensorflow.python.saved_model import utils
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import signature_def_utils

graph = tf.get_default_graph()
x_observed = graph.get_tensor_by_name('x_observed:0')
y_pred = graph.get_tensor_by_name('add:0')

inputs_map = {'inputs': x_observed}
outputs_map = {'outputs': y_pred}

predict_signature = signature_def_utils.predict_signature_def(
    inputs=inputs_map, outputs=outputs_map)
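To make the signature servable, it is attached at export time; a hedged continuation of the snippet above (the export path is illustrative):

from tensorflow.python.saved_model import builder as saved_model_builder
from tensorflow.python.saved_model import tag_constants

builder = saved_model_builder.SavedModelBuilder('/models/linear/1')  # model/version
builder.add_meta_graph_and_variables(
    sess, [tag_constants.SERVING],
    signature_def_map={
        signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
            predict_signature})
builder.save()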
MULTI-HEADED INFERENCE
§ Multiple “Heads” of Model
§ Return class and scores to be fed into another model
§ Inputs Propagated Forward Only Once
§ Optimizes Bandwidth, CPU, Latency, Memory, Coolness
BUILD YOUR OWN MODEL SERVER (?!)
§ Adapt gRPC (Google) <-> HTTP (REST of the World)
§ Perform Batch Inference vs. Request/Response
§ Handle Requests Asynchronously
§ Support Mobile, Embedded Inference
§ Customize Request Batching
§ Add Circuit Breakers, Fallbacks
§ Control Latency Requirements
§ Reduce Number of Moving Parts
#include "tensorflow_serving/model_servers/server_core.h"

using tensorflow::serving::ServerCore;

int main() {
  ServerCore::Options options;
  // set options (model name, path, etc.)
  std::unique_ptr<ServerCore> core;
  TF_CHECK_OK(ServerCore::Create(std::move(options), &core));
  return 0;
}

Compile and link against libtensorflow.so
FREEZING MODEL FOR DEPLOYMENT
§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ fold_constants
§ fold_batch_norms
§ quantize_weights
§ quantize_nodes
§ freeze_graph
§ Results
§ Variables -> Constants
Finally!
We’re Ready to Deploy!!
LET’S DEPLOY OPTIMIZED MODEL
§ Navigate to the following notebook:
09_Deploy_Optimized_Model
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
PULSE CHECK
BREAK
§ https://github.com/fluxcapacitor/pipeline/
§ Slides, code, notebooks, Docker images available here:
https://github.com/fluxcapacitor/pipeline/gpu.ml
Need Help?
Use the Chat!
AGENDA
§ GPUs and TensorFlow
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model with XLA JIT Compiler
§ Optimize Model with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
REQUEST BATCH TUNING
§ max_batch_size
§ Enables throughput/latency tradeoff
§ Bounded by RAM
§ batch_timeout_micros
§ Defines batch time window, latency upper-bound
§ Bounded by RAM
§ num_batch_threads
§ Defines parallelism
§ Bounded by CPU cores
§ max_enqueued_batches
§ Defines queue upper bound, throttling
§ Bounded by RAM
Reaching either threshold
will trigger a batch
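A hedged example of wiring these knobs into TensorFlow Serving through a batching parameters file (values are illustrative; flag names as of TF Serving 1.x):

max_batch_size { value: 64 }
batch_timeout_micros { value: 1000 }
num_batch_threads { value: 8 }
max_enqueued_batches { value: 256 }

tensorflow_model_server --enable_batching=true \
    --batching_parameters_file=batching_parameters.txt ...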
BATCH SCHEDULER STRATEGIES
§ BasicBatchScheduler
§ Best for homogeneous request types (ie. always classify or always regress)
§ Async callback upon max_batch_size or batch_timeout_micros
§ BatchTask encapsulates unit of work to be batched
§ SharedBatchScheduler
§ Best for heterogeneous request types, multi-step inference, ensembles, …
§ Groups BatchTasks into separate queues to form homogeneous batches
§ Processes batches fairly through interleaving
§ StreamingBatchScheduler
§ Mixed CPU/GPU/IO-bound workloads
§ Provides fine-grained control for complex, multi-phase inference logic
You Must Experiment to Find
the Best Strategy for You!!
Co-locate and Isolate Homogeneous Workloads
LET'S OPTIMIZE THE MODEL SERVER
§ Navigate to the following notebook:
10_Optimize_Model_Server
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
AGENDA
§ GPUs and TensorFlow
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model with XLA JIT Compiler
§ Optimize Model with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
YOU JUST LEARNED…
§ To Inspect and Debug Models
§ Training and Predicting Best Practices
§ To Distribute Training Across a Cluster
§ To Optimize Training with Queue Feeders
§ To Optimize Training with XLA JIT Compiler
§ To Optimize Inference with AOT and Graph Transform Tool (GTT)
§ Key Components of TensorFlow Serving
§ To Deploy Models with TensorFlow Serving
§ To Optimize Inference by Tuning TensorFlow Serving
Q&A
§ Thank you!!
§ https://github.com/fluxcapacitor/pipeline/
§ Slides, code, notebooks, Docker images available here:
https://github.com/fluxcapacitor/pipeline/gpu.ml
Contact Me @
Email: chris@pipeline.io
Twitter: @cfregly
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 

High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017

• 18. V100 AND CUDA 9
§ Independent Thread Scheduling - Finally!!
§ Similar to CPU fine-grained thread synchronization semantics
§ Allows GPU to yield execution of any thread
§ Still Optimized for SIMT (Same Instruction Multiple Thread)
§ SIMT units automatically scheduled together
§ Explicit Synchronization
(Figure: thread scheduling on P100 vs. V100)
• 19. GPU CUDA PROGRAMMING
§ Barbaric, But Fun!
§ Must Know Underlying Hardware Very Well
§ Many Great Debuggers/Profilers
§ Hardware Changes are Painful!
§ Newer CUDA versions automatically JIT-compile old CUDA code to new NVPTX
§ Not optimal, of course
• 20. CUDA STREAMS
§ Asynchronous I/O Transfer
§ Overlap Compute and I/O
§ Keeps GPUs Saturated
§ Fundamental to Queue Framework in TensorFlow (see the stream sketch below)
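As a rough illustration of how streams let transfers overlap compute, here is a minimal sketch using Numba (which the 01b_Explore_Numba notebook also exercises); the arrays, kernel, and launch shape are hypothetical:

import numpy as np
from numba import cuda

@cuda.jit
def scale(x, factor):
    i = cuda.grid(1)
    if i < x.size:
        x[i] *= factor

a = np.ones(1000000, dtype=np.float32)
b = np.ones(1000000, dtype=np.float32)
threads = 256
blocks = (a.size + threads - 1) // threads

stream1 = cuda.stream()
stream2 = cuda.stream()

# Each stream gets its own async copy + kernel, so the transfer for one
# array can overlap compute on the other instead of serializing.
d_a = cuda.to_device(a, stream=stream1)
d_b = cuda.to_device(b, stream=stream2)
scale[blocks, threads, stream1](d_a, 2.0)
scale[blocks, threads, stream2](d_b, 3.0)
d_a.copy_to_host(a, stream=stream1)
d_b.copy_to_host(b, stream=stream2)
cuda.synchronize()  # wait for both streams to drain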
  • 21. LET’S SEE WHAT THIS THING CAN DO! § Navigate to the following notebook: 01a_Explore_GPU 01b_Explore_Numba § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
  • 23. BREAK § https://github.com/fluxcapacitor/pipeline/ § Slides, code, notebooks, Docker images available here: https://github.com/fluxcapacitor/pipeline/ gpu.ml Need Help? Use the Chat!
  • 24. AGENDA § GPUs and TensorFlow § Train and Debug TensorFlow Model § Train with Distributed TensorFlow Cluster § Optimize Model with XLA JIT Compiler § Optimize Model with XLA AOT and Graph Transforms § Deploy Model to TensorFlow Serving Runtime § Optimize TensorFlow Serving Runtime § Wrap-up and Q&A
• 25. TRAINING TERMINOLOGY
§ Tensors: N-Dimensional Arrays
§ ie. Scalar, Vector, Matrix
§ Operations: MatMul, Add, SummaryLog, …
§ Graph: Graph of Operations (DAG)
§ Session: Contains Graph(s)
§ Feeds: Feed inputs into Operation
§ Fetches: Fetch output from Operation
§ Variables: What we learn through training
§ aka “weights”, “parameters”
§ Devices: Hardware device on which we train
(Diagram: the user feeds inputs and fetches outputs; TensorFlow performs operations, flows tensors, and trains variables)
with tf.device("worker:0/device/gpu:0,worker:1/device/gpu:0")
(a minimal session tying these together follows)
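A minimal sketch (hypothetical tensor names) of the terminology in action - the user feeds inputs, TensorFlow flows tensors through operations and trains variables, and the user fetches outputs:

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None], name='x')   # feed
W = tf.Variable(0.5, name='W')                           # learned through training
b = tf.Variable(0.0, name='b')
y = tf.add(tf.multiply(x, W), b, name='y')               # operations in the graph

with tf.Session() as sess:                               # session contains the graph
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [1.0, 2.0, 3.0]}))   # fetch the output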
• 26. TRAINING DEVICES
§ cpu:0
§ By default, all CPUs
§ Requires extra config to target a CPU
§ gpu:0..n
§ Each GPU has a unique id
§ TF usually prefers a single GPU
§ xla_cpu:0, xla_gpu:0..n
§ “JIT Compiler Device”
§ Hints TensorFlow to attempt JIT Compile

with tf.device("/cpu:0"):
with tf.device("/gpu:0"):
with tf.device("/gpu:1"):

• 27. TRAINING METRICS: TENSORBOARD
§ Summary Ops
§ Event Files: /root/tensorboard/linear/<version>/events…
§ Tags
§ Organize data within Tensorboard UI

loss_summary_op = tf.summary.scalar('loss', loss)
merge_all_summary_op = tf.summary.merge_all()
summary_writer = tf.summary.FileWriter(
    '/root/tensorboard/linear/<version>', graph=sess.graph)

• 28. TRAINING ON EXISTING INFRASTRUCTURE
§ Data Processing
§ HDFS/Hadoop
§ Spark
§ Containers
§ Docker
§ Schedulers
§ Kubernetes
§ Mesos

<dependency>
  <groupId>org.tensorflow</groupId>
  <artifactId>tensorflow-hadoop</artifactId>
  <version>1.0-SNAPSHOT</version>
</dependency>

https://github.com/tensorflow/ecosystem
• 29. TRAINING PIPELINES: QUEUE + DATASET
§ Don’t Use feed_dict for Production Workloads!!
§ feed_dict Requires Python <-> C++ Serialization
§ Retrieval is Single-threaded, Synchronous, SLOW!
§ Can’t Retrieve Until Current Batch is Complete
§ CPUs/GPUs Not Fully Utilized!
§ Use Queue or Dataset API (see the Dataset sketch below)
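On the Dataset side, a minimal sketch of a TFRecord input pipeline, assuming tf.data (TF 1.4+; earlier releases shipped it as tf.contrib.data); file and feature names are illustrative:

import tensorflow as tf

def parse_fn(serialized):
    features = tf.parse_single_example(
        serialized,
        features={'x': tf.FixedLenFeature([], tf.float32),
                  'y': tf.FixedLenFeature([], tf.float32)})
    return features['x'], features['y']

dataset = tf.data.TFRecordDataset(['train-00000.tfrecord'])
dataset = dataset.map(parse_fn, num_parallel_calls=4)  # CPU-side pre-processing
dataset = dataset.shuffle(buffer_size=10000).batch(64).repeat()
dataset = dataset.prefetch(1)  # keep the next batch staged for the GPU

x_batch, y_batch = dataset.make_one_shot_iterator().get_next()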
• 30. QUEUES
§ More than Just a Traditional Queue
§ Perform I/O, pre-processing, cropping, shuffling
§ Pulls from HDFS, S3, Google Storage, Kafka, ...
§ Combine many small files into large TFRecord files
§ Typically use CPUs to focus GPUs on compute
§ Uses CUDA Streams

• 31. DATA MOVEMENT WITH QUEUES
§ GPU Pulls Batch from Queue (CUDA Streams)
§ GPU pulls next batch while processing current batch
GPUs Stay Fully Utilized!
• 32. QUEUE CAPACITY PLANNING
§ batch_size
§ # examples / batch (ie. 64 jpg)
§ Limited by GPU RAM
§ num_processing_threads
§ CPU threads pull and pre-process batches of data
§ Limited by CPU Cores
§ queue_capacity
§ Limited by CPU RAM (ie. 5 * batch_size)
(a queue sized with these knobs is sketched below)
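A sketch of a queue-based reader sized with those three knobs; the file pattern and feature spec are illustrative:

import tensorflow as tf

batch_size = 64
filename_queue = tf.train.string_input_producer(
    tf.train.match_filenames_once('/data/train-*.tfrecord'))
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)

example = tf.parse_single_example(
    serialized,
    features={'image': tf.FixedLenFeature([784], tf.float32),
              'label': tf.FixedLenFeature([], tf.int64)})

images, labels = tf.train.shuffle_batch(
    [example['image'], example['label']],
    batch_size=batch_size,
    num_threads=4,                    # num_processing_threads (CPU cores)
    capacity=5 * batch_size,          # queue_capacity (CPU RAM)
    min_after_dequeue=2 * batch_size)

# In the session: run the global and local variable initializers, then
# tf.train.start_queue_runners(sess) to launch the reader threads.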
• 33. DETECT UNDERUTILIZED CPUS, GPUS
§ Instrument training code to generate “timelines”
§ Analyze with Google Web Tracing Framework (WTF)
§ Monitor CPU with `top`, GPU with `nvidia-smi`

http://google.github.io/tracing-framework/

from tensorflow.python.client import timeline

trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as trace_file:
    trace_file.write(
        trace.generate_chrome_trace_format(show_memory=True))
  • 34. LET’S FEED DATA WITH A QUEUE § Navigate to the following notebook: 02_Feed_Queue_HDFS § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
  • 36. BREAK § https://github.com/fluxcapacitor/pipeline/ § Slides, code, notebooks, Docker images available here: https://github.com/fluxcapacitor/pipeline/ gpu.ml Need Help? Use the Chat!
• 37. TENSORFLOW MODEL
§ MetaGraph
§ Combines GraphDef and Metadata
§ GraphDef
§ Architecture of your model (nodes, edges)
§ Metadata
§ Asset: Accompanying assets to your model
§ SignatureDef: Maps external : internal tensors
§ Variables
§ Stored separately during training (checkpoint)
§ Allows training to continue from any checkpoint
§ Variables are “frozen” into Constants when deployed for inference
(Diagram: MetaGraph = GraphDef (x, W, mul, add, b) + Metadata (Assets, SignatureDef, Tags, Version); Variables: "W": 0.328, "b": -1.407)

• 38. TENSORFLOW SESSION
(Diagram: a Session holds the graph: GraphDef plus Variables "W": 0.328, "b": -1.407)
§ Variables are Periodically Checkpointed
§ GraphDef is Static
• 39. TENSORFLOW DEBUGGER
§ Step through Operations
§ Inspect Inputs and Outputs
§ Wrap Session in Debug Session

from tensorflow.python import debug as tf_debug

sess = tf.Session(config=config)
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
  • 40. LET’S DEBUG A MODEL § Navigate to the following notebook: 04_Debug_Model § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
• 41. BATCH NORMALIZATION
§ Each Mini-Batch May Have Wildly Different Distributions
§ Normalize per batch (and layer)
§ Speeds up Training!!
§ Weights are Learned Quicker
§ Final Model is More Accurate
§ Final mean and variance will be folded into Graph later
-- Always Use Batch Normalization! --

z = tf.matmul(a_prev, W)
a = tf.nn.relu(z)
a_mean, a_var = tf.nn.moments(a, [0])
scale = tf.Variable(tf.ones([depth/channels]))
beta = tf.Variable(tf.zeros([depth/channels]))
bn = tf.nn.batch_normalization(a, a_mean, a_var, beta, scale, 0.001)
  • 42. AGENDA § GPU Environment § Train and Debug TensorFlow Model § Train with Distributed TensorFlow Cluster § Optimize Model with XLA JIT Compiler § Optimize Model with XLA AOT and Graph Transforms § Deploy Model to TensorFlow Serving Runtime § Optimize TensorFlow Serving Runtime § Wrap-up and Q&A
• 43. MULTI-GPU TRAINING (SINGLE NODE)
§ Variables stored on CPU (cpu:0)
§ Model graph (aka “replica”, “tower”) is copied to each GPU (gpu:0, gpu:1, …)
Multi-GPU Training Steps:
1. CPU transfers model to each GPU
2. CPU waits on all GPUs to finish batch
3. CPU copies all gradients back from all GPUs
4. CPU synchronizes + AVG all gradients from GPUs
5. CPU updates GPUs with new variables/weights
6. Repeat Step 1 until reaching stop condition (ie. max_epochs)
(a condensed sketch of these steps follows this list)
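A condensed sketch of those six steps in TF 1.x style (similar in spirit to the classic CIFAR-10 multi-GPU example); model_fn and next_batch are hypothetical stand-ins for your own graph and input pipeline:

import tensorflow as tf

NUM_GPUS = 2
optimizer = tf.train.GradientDescentOptimizer(0.01)

tower_grads = []
for i in range(NUM_GPUS):
    with tf.device('/gpu:%d' % i):
        # reuse=(i > 0) shares one set of variables across the towers
        with tf.variable_scope('model', reuse=(i > 0)):
            loss = model_fn(next_batch())          # replica ("tower") per GPU
        tower_grads.append(optimizer.compute_gradients(loss))

# The CPU averages per-tower gradients and applies a single update
with tf.device('/cpu:0'):
    avg_grads = []
    for grad_and_vars in zip(*tower_grads):        # group grads per variable
        grads = tf.stack([g for g, _ in grad_and_vars])
        avg_grads.append((tf.reduce_mean(grads, 0), grad_and_vars[0][1]))
    train_op = optimizer.apply_gradients(avg_grads)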
• 44. DISTRIBUTED, MULTI-NODE TRAINING
§ TensorFlow Automatically Inserts Send and Receive Ops into Graph
§ Parameter Server Synchronously Aggregates Updates to Variables
§ Nodes with Multiple GPUs will Pre-Aggregate Before Sending to PS
(Diagram: example clusters of one to three workers, each with one to four GPUs)

• 45. SYNCHRONOUS VS. ASYNCHRONOUS
§ Synchronous
§ Nodes compute gradients
§ Nodes update Parameter Server (PS)
§ Nodes sync on PS for latest gradients
§ Asynchronous
§ Some nodes delay in computing gradients
§ Nodes update PS without waiting on other nodes
§ Nodes get stale gradients from PS
§ May not converge due to stale reads!

• 46. DATA PARALLEL VS MODEL PARALLEL
§ Data Parallel (“Between-Graph Replication”)
§ Send exact same model to each device
§ Each device operates on its partition of data
§ ie. Spark sends same function to many workers; each worker operates on their partition of data
§ Model Parallel (“In-Graph Replication”)
§ Send different partition of model to each device
§ Each device operates on all data
§ Very Difficult!! Required for Large Models (GPU RAM Limitation)
• 47. DISTRIBUTED TENSORFLOW CONCEPTS
§ Client
§ Program that builds a TF Graph, constructs a session, interacts with the cluster
§ Written in Python, C++
§ Cluster
§ Set of distributed nodes executing a graph
§ Nodes can play any role
§ Jobs (“Roles”)
§ Parameter Server (“ps”) stores and updates variables
§ Worker (“worker”) performs compute-intensive tasks (stateless)
§ Assigned 0..* tasks
§ Task (“Server Process”)
“ps” and “worker” are conventional names
(a cluster definition is sketched below)
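A sketch of these concepts in code; the host:port addresses are illustrative:

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    'ps':     ['ps0:2222'],
    'worker': ['worker0:2222', 'worker1:2222']})

# Each task starts its own server process with its job name and index
server = tf.train.Server(cluster, job_name='worker', task_index=0)

# replica_device_setter pins variables to the PS and ops to this worker
with tf.device(tf.train.replica_device_setter(
        worker_device='/job:worker/task:0', cluster=cluster)):
    pass  # build the model graph here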
• 48. CHIEF WORKER
§ Worker Task 0 is Chosen by Default
§ Task 0 is guaranteed to exist
§ Implements Maintenance Tasks
§ Writes checkpoints
§ Initializes parameters at start of training
§ Writes log summaries
§ Parameter Server health checks
• 49. NODE AND PROCESS FAILURES
§ Checkpoint to Persistent Storage (HDFS, S3)
§ Use MonitoredTrainingSession and Hooks (see the sketch below)
§ Use a Good Cluster Orchestrator (ie. Kubernetes, Mesos)
§ Understand Failure Modes and Recovery States
(Diagram: stateless worker fails - not bad, training continues; stateful PS fails - bad, training must stop; Dios mio! Long night ahead…)
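A sketch of checkpoint-based recovery, continuing the hypothetical cluster above (server, task_index); the HDFS path is illustrative, and the graph is assumed to define global_step and train_op:

import tensorflow as tf

hooks = [tf.train.StopAtStepHook(last_step=100000)]
with tf.train.MonitoredTrainingSession(
        master=server.target,
        is_chief=(task_index == 0),   # the chief writes checkpoints
        checkpoint_dir='hdfs://namenode:8020/checkpoints/linear',
        hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)            # resumes from the last checkpoint after a failure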
• 50. VALIDATING DISTRIBUTED MODEL
§ Separate Training and Validation Clusters
§ Validate using Saved Checkpoints from Parameter Servers
§ Avoids Resource Contention
(Diagram: separate Training, Validation, and Parameter Server clusters)
  • 51. LET’S TRAIN WITH DISTRIBUTED CPU § Navigate to the following notebook: 05_Train_Model_Distributed_CPU § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
  • 52. LET’S TRAIN WITH DISTRIBUTED GPU § Navigate to the following notebook: 05a_Train_Model_Distributed_GPU § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
• 53. NEW(‘ISH): EXPERIMENT + ESTIMATOR
§ Higher-Level APIs Simplify Distributed Training
§ Picks Up Configuration from Environment
§ Supports Custom Models (ie. Keras)
§ Used for Training, Validation, and Prediction
§ API is Changing, but Patterns Remain the Same
§ Works Well with Google Cloud ML (Surprised?!)
(a canned-Estimator sketch follows)
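A minimal sketch using a canned Estimator (API names per recent TF 1.x releases); the random data stands in for a real input pipeline, and the model_dir is illustrative:

import numpy as np
import tensorflow as tf

feature_cols = [tf.feature_column.numeric_column('x')]
estimator = tf.estimator.LinearRegressor(
    feature_columns=feature_cols, model_dir='/root/models/estimator')

train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'x': np.random.rand(1000).astype(np.float32)},
    y=np.random.rand(1000).astype(np.float32),
    batch_size=64, num_epochs=None, shuffle=True)

estimator.train(input_fn=train_input_fn, steps=1000)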
  • 55. BREAK § https://github.com/fluxcapacitor/pipeline/ § Slides, code, notebooks, Docker images available here: https://github.com/fluxcapacitor/pipeline/ gpu.ml Need Help? Use the Chat!
  • 56. AGENDA § GPUs and TensorFlow § Train and Debug TensorFlow Model § Train with Distributed TensorFlow Cluster § Optimize Model with XLA JIT Compiler § Optimize Model with XLA AOT and Graph Transforms § Deploy Model to TensorFlow Serving Runtime § Optimize TensorFlow Serving Runtime § Wrap-up and Q&A
• 57. XLA FRAMEWORK
§ Accelerated Linear Algebra (XLA)
§ Goals:
§ Reduce reliance on custom operators
§ Improve execution speed
§ Improve memory usage
§ Reduce mobile footprint
§ Improve portability
§ Helps TF Stay Flexible and Performant

• 58. XLA HIGH LEVEL OPTIMIZER (HLO)
§ Compiler Intermediate Representation (IR)
§ Independent of source and target language
§ Define Graphs using HLO Language
§ XLA Step 1 Emits Target-Independent HLO
§ XLA Step 2 Emits Target-Dependent LLVM
§ LLVM Emits Native Code Specific to Target
§ Supports x86-64, ARM64 (CPU), and NVPTX (GPU)
• 59. JIT COMPILER
§ Just-In-Time Compiler
§ Built on XLA Framework
§ Goals:
§ Reduce memory movement – especially useful on GPUs
§ Reduce overhead of multiple function calls
§ Similar to Operator Fusing in Spark 2.0
§ Unroll Loops, Fuse Operators, Fold Constants, …
§ Scope to session, device, or `with jit_scope():` (see the sketch below)
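A sketch of enabling the JIT at session scope and at op scope; the contrib import path matches TF 1.x, and x, W, b are assumed to already exist:

import tensorflow as tf
from tensorflow.contrib.compiler import jit

# Session-wide JIT (applies to GPU ops at this global level)
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = (
    tf.OptimizerOptions.ON_1)

# Scoped JIT: cluster and compile just these ops
with jit.experimental_jit_scope():
    y = tf.matmul(x, W) + b   # ops here become an XLA cluster

sess = tf.Session(config=config)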
• 60. VISUALIZING JIT COMPILER IN ACTION
(Figure: timeline traces before and after JIT compilation)
Google Web Tracing Framework: http://google.github.io/tracing-framework/

from tensorflow.python.client import timeline

trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as trace_file:
    trace_file.write(
        trace.generate_chrome_trace_format(show_memory=True))

• 61. VISUALIZING FUSING OPERATORS
pip install graphviz
dot -Tpng /tmp/hlo_graph_1.w5LcGs.dot -o hlo_graph_1.png
GraphViz: http://www.graphviz.org
hlo_*.dot files generated by XLA
  • 62. LET’S TRAIN WITH XLA CPU § Navigate to the following notebook: 06_Train_Model_XLA_CPU § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
  • 63. LET’S TRAIN WITH XLA GPU § Navigate to the following notebook: 06a_Train_Model_XLA_GPU § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
  • 65. BREAK § https://github.com/fluxcapacitor/pipeline/ § Slides, code, notebooks, Docker images available here: https://github.com/fluxcapacitor/pipeline/ gpu.ml Need Help? Use the Chat!
  • 66. IT’S WORTH HIGHLIGHTING… § From Now On, We Optimize Trained Models For Inference § In Other Words… We’re Done with Training Optimizations! Let’s Move to Predicting Optimizations!!
  • 67. AGENDA § GPUs and TensorFlow § Train and Debug TensorFlow Model § Train with Distributed TensorFlow Cluster § Optimize Model with XLA JIT Compiler § Optimize Model with XLA AOT and Graph Transforms § Deploy Model to TensorFlow Serving Runtime § Optimize TensorFlow Serving Runtime § Wrap-up and Q&A
• 68. AOT COMPILER
§ Standalone, Ahead-Of-Time (AOT) Compiler
§ Built on XLA framework
§ tfcompile
§ Creates executable with minimal TensorFlow Runtime needed
§ Includes only dependencies needed by subgraph computation
§ Creates functions with feeds (inputs) and fetches (outputs)
§ Packaged as cc_library header and object files to link into your app
§ Commonly used for mobile device inference graph
§ Currently, only CPU x86-64 and ARM are supported - no GPU
• 69. GRAPH TRANSFORM TOOL (GTT)
§ Optimize Trained Models for Inference
§ Remove training-only Ops (checkpoint, drop out, logs)
§ Remove unreachable nodes between given feed -> fetch
§ Fuse adjacent operators to improve memory bandwidth
§ Fold final batch norm mean and variance into variables
§ Round weights/variables improves compression (ie. 70%)
§ Quantize weights and activations simplifies model
§ FP32 down to INT8
(applying these transforms is sketched below)
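A sketch of applying these transforms from Python via the GTT's TransformGraph wrapper (the tool also ships as a bazel-built command-line binary); paths and tensor names are illustrative:

import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph

graph_def = tf.GraphDef()
with tf.gfile.GFile('unoptimized_model.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

optimized_graph_def = TransformGraph(
    graph_def,
    ['x_observed'],     # feeds
    ['add'],            # fetches
    ['strip_unused_nodes',
     'remove_nodes(op=Identity)',
     'fold_constants(ignore_errors=true)',
     'fold_batch_norms',
     'quantize_weights'])

with tf.gfile.GFile('optimized_model.pb', 'wb') as f:
    f.write(optimized_graph_def.SerializeToString())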
• 71. AFTER STRIPPING UNUSED NODES
§ Optimizations
§ strip_unused_nodes
§ Results
§ Graph much simpler
§ File size much smaller

• 72. AFTER REMOVING UNUSED NODES
§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ Results
§ Pesky nodes removed
§ File size a bit smaller

• 73. AFTER FOLDING CONSTANTS
§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ fold_constants
§ Results
§ Placeholders (feeds) -> Variables* (*Why Variables and not Constants?)

• 74. AFTER FOLDING BATCH NORMS
§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ fold_constants
§ fold_batch_norms
§ Results
§ Graph remains the same
§ File size approximately the same

• 75. WEIGHT QUANTIZATION
§ FP16 and INT8 Are Smaller and Computationally Simpler
§ Weights/Variables are Constants
§ Easy to Linearly Quantize

• 76. AFTER QUANTIZING WEIGHTS
§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ fold_constants
§ fold_batch_norms
§ quantize_weights
§ Results
§ Graph is same, file size is smaller, compute is faster
  • 77. LET’S OPTIMIZE FOR INFERENCE § Navigate to the following notebook: 07_Optimize_Model* (*Why just CPU version? Why not both CPU and GPU?) § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
• 79. ACTIVATION QUANTIZATION
§ Activations Not Known Ahead of Time
§ Depends on input, not easy to quantize
§ Requires Additional Calibration Step
§ Use a “representative” dataset
§ Per Neural Network Layer…
§ Collect histogram of activation values
§ Generate many quantized distributions with different saturation thresholds
§ Choose threshold to minimize KL_divergence(ref_distribution, quant_distribution)
§ Not Much Time or Data is Required (Minutes on Commodity Hardware)
(an illustrative calibration loop is sketched below)
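An illustrative numpy-only sketch of that calibration loop - deliberately simplified, not the exact production algorithm:

import numpy as np

def kl(p, q):
    p, q = p / p.sum(), q / q.sum()
    return np.sum(p * np.log(p / q))

def best_threshold(activations, num_bins=2048, num_levels=128):
    hist, edges = np.histogram(np.abs(activations), bins=num_bins)
    hist = hist.astype(np.float64) + 1e-9            # avoid zeros in the KL
    best_t, best_kl = edges[-1], np.inf
    for i in range(num_levels, num_bins + 1, 64):    # candidate saturation points
        ref = hist[:i].copy()
        ref[-1] += hist[i:].sum()                    # clip outliers into the last bin
        # collapse the reference into num_levels buckets, then expand it
        # back out so both distributions share the same support
        buckets = np.array_split(ref, num_levels)
        quant = np.concatenate([np.full(len(bk), bk.mean()) for bk in buckets])
        d = kl(ref, quant)
        if d < best_kl:
            best_t, best_kl = edges[i], d
    return best_t                                    # per-layer saturation threshold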
• 81. AFTER ACTIVATION QUANTIZATION
§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ fold_constants
§ fold_batch_norms
§ quantize_weights
§ quantize_nodes (activations)
§ Results
§ Larger graph, needs calibration!
§ Requires additional freeze_requantization_ranges
  • 82. LET’S OPTIMIZE FOR INFERENCE (2/2) § Navigate to the following notebook: 08_Optimize_Model_Activations § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
  • 84. BREAK § https://github.com/fluxcapacitor/pipeline/ § Slides, code, notebooks, Docker images available here: https://github.com/fluxcapacitor/pipeline/ gpu.ml Need Help? Use the Chat!
  • 85. AGENDA § GPUs and TensorFlow § Train and Debug TensorFlow Model § Train with Distributed TensorFlow Cluster § Optimize Model with XLA JIT Compiler § Optimize Model with XLA AOT and Graph Transforms § Deploy Model to TensorFlow Serving Runtime § Optimize TensorFlow Serving Runtime § Wrap-up and Q&A
• 86. MODEL SERVING TERMINOLOGY
§ Inference
§ Only Forward Propagation through Network
§ Predict, Classify, Regress, …
§ Bundle
§ GraphDef, Variables, Metadata, …
§ Assets
§ ie. Map of ClassificationID -> String
§ {9283: “penguin”, 9284: “bridge”}
§ Version
§ Every Model Has a Version Number (Integer)
§ Version Policy
§ ie. Serve Only Latest (Highest), Serve Both Latest and Previous, …

• 87. TENSORFLOW SERVING FEATURES
§ Supports Auto-Scaling
§ Custom Loaders beyond File-based
§ Tune for Low-latency or High-throughput
§ Serve Diff Models/Versions in Same Process
§ Customize Model Types beyond HashMap and TensorFlow
§ Customize Version Policies for A/B and Bandit Tests
§ Support Request Draining for Graceful Model Updates
§ Enable Request Batching for Diff Use Cases and HW
§ Supports Optimized Transport with GRPC and Protocol Buffers

• 88. PREDICTION SERVICE
§ Predict (Original, Generic)
§ Input: List of Tensor
§ Output: List of Tensor
§ Classify
§ Input: List of tf.Example (key, value) pairs
§ Output: List of (class_label: String, score: float)
§ Regress
§ Input: List of tf.Example (key, value) pairs
§ Output: List of (label: String, score: float)
• 89. PREDICTION INPUTS + OUTPUTS
§ SignatureDef
§ Defines inputs and outputs
§ Maps external (logical) to internal (physical) tensor names
§ Allows internal (physical) tensor names to change

from tensorflow.python.saved_model import utils
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import signature_def_utils

graph = tf.get_default_graph()
x_observed = graph.get_tensor_by_name('x_observed:0')
y_pred = graph.get_tensor_by_name('add:0')

inputs_map = {'inputs': x_observed}
outputs_map = {'outputs': y_pred}

predict_signature = signature_def_utils.predict_signature_def(
    inputs=inputs_map, outputs=outputs_map)
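Continuing the snippet above, a sketch of exporting that signature with SavedModelBuilder so TensorFlow Serving can load it; the versioned export path is illustrative:

from tensorflow.python.saved_model import builder as saved_model_builder
from tensorflow.python.saved_model import tag_constants

builder = saved_model_builder.SavedModelBuilder('/root/models/linear/1')
builder.add_meta_graph_and_variables(
    sess, [tag_constants.SERVING],
    signature_def_map={
        signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
            predict_signature})
builder.save()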
• 90. MULTI-HEADED INFERENCE
§ Multiple “Heads” of Model
§ Return class and scores to be fed into another model
§ Inputs Propagated Forward Only Once
§ Optimizes Bandwidth, CPU, Latency, Memory, Coolness

• 91. BUILD YOUR OWN MODEL SERVER (?!)
§ Adapt GRPC (Google) <-> HTTP (REST of the World)
§ Perform Batch Inference vs. Request/Response
§ Handle Requests Asynchronously
§ Support Mobile, Embedded Inference
§ Customize Request Batching
§ Add Circuit Breakers, Fallbacks
§ Control Latency Requirements
§ Reduce Number of Moving Parts

#include "tensorflow_serving/model_servers/server_core.h"

class MyTensorFlowModelServer {
 public:
  void Init() {
    ServerCore::Options options;
    // set options (model name, path, etc.)
    std::unique_ptr<ServerCore> core;
    TF_CHECK_OK(
        ServerCore::Create(std::move(options), &core));
  }
};

Compile and Link libtensorflow.so

• 92. FREEZING MODEL FOR DEPLOYMENT
§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ fold_constants
§ fold_batch_norms
§ quantize_weights
§ quantize_nodes
§ freeze_graph
§ Results
§ Variables -> Constants
Finally! We’re Ready to Deploy!!
  • 93. LET’S DEPLOY OPTIMIZED MODEL § Navigate to the following notebook: 09_Deploy_Optimized_Model § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
  • 95. BREAK § https://github.com/fluxcapacitor/pipeline/ § Slides, code, notebooks, Docker images available here: https://github.com/fluxcapacitor/pipeline/ gpu.ml Need Help? Use the Chat!
  • 96. AGENDA § GPUs and TensorFlow § Train and Debug TensorFlow Model § Train with Distributed TensorFlow Cluster § Optimize Model with XLA JIT Compiler § Optimize Model with XLA AOT and Graph Transforms § Deploy Model to TensorFlow Serving Runtime § Optimize TensorFlow Serving Runtime § Wrap-up and Q&A
• 97. REQUEST BATCH TUNING
§ max_batch_size
§ Enables throughput/latency tradeoff
§ Bounded by RAM
§ batch_timeout_micros
§ Defines batch time window, latency upper-bound
§ Bounded by RAM
§ num_batch_threads
§ Defines parallelism
§ Bounded by CPU cores
§ max_enqueued_batches
§ Defines queue upper bound, throttling
§ Bounded by RAM
Reaching either threshold will trigger a batch
(a starting-point config is sketched below)
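As a sketch, these knobs are typically set in a text-format BatchingParameters file handed to the model server; the values below are illustrative starting points, not tuned recommendations:

# batching_parameters.txt
max_batch_size { value: 128 }
batch_timeout_micros { value: 10000 }
num_batch_threads { value: 4 }
max_enqueued_batches { value: 16 }

tensorflow_model_server --port=9000 \
  --model_name=linear \
  --model_base_path=/root/models/linear \
  --enable_batching=true \
  --batching_parameters_file=batching_parameters.txt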
• 98. BATCH SCHEDULER STRATEGIES
§ BasicBatchScheduler
§ Best for homogeneous request types (ie. always classify or always regress)
§ Async callback upon max_batch_size or batch_timeout_micros
§ BatchTask encapsulates unit of work to be batched
§ SharedBatchScheduler
§ Best for heterogeneous request types, multi-step inference, ensembles, …
§ Groups BatchTasks into separate queues to form homogeneous batches
§ Processes batches fairly through interleaving
§ StreamingBatchScheduler
§ Mixed CPU/GPU/IO-bound workloads
§ Provides fine-grained control for complex, multi-phase inference logic
You Must Experiment to Find the Best Strategy for You!!
Co-locate and Isolate Homogeneous Workloads
  • 99. LET’S OPTIMIZE THE MODEL SERVER § Navigate to the following notebook: 10_Optimize_Model_Server § https://github.com/fluxcapacitor/pipeline/gpu.ml/ notebooks/
  • 100. AGENDA § GPUs and TensorFlow § Train and Debug TensorFlow Model § Train with Distributed TensorFlow Cluster § Optimize Model with XLA JIT Compiler § Optimize Model with XLA AOT and Graph Transforms § Deploy Model to TensorFlow Serving Runtime § Optimize TensorFlow Serving Runtime § Wrap-up and Q&A
• 101. YOU JUST LEARNED…
§ To Inspect and Debug Models
§ Training and Predicting Best Practices
§ To Distribute Training Across a Cluster
§ To Optimize Training with Queue Feeders
§ To Optimize Training with XLA JIT Compiler
§ To Optimize Inference with AOT and Graph Transform Tool (GTT)
§ Key Components of TensorFlow Serving
§ To Deploy Models with TensorFlow Serving
§ To Optimize Inference by Tuning TensorFlow Serving

• 102. Q&A
§ Thank you!!
§ https://github.com/fluxcapacitor/pipeline/
§ Slides, code, notebooks, Docker images available here: https://github.com/fluxcapacitor/pipeline/ gpu.ml
Contact Me @
Email: chris@pipeline.io
Twitter: @cfregly