ML gives machines the ability to learn from data without being explicitly programmed. At Netflix, machine learning is used across many areas including recommendation systems, streaming quality, resource management, regional failover, anomaly detection, and capacity forecasting. Netflix uses various ML algorithms like decision trees, neural networks, and regression models to optimize the customer experience and infrastructure operations.
2. Machine Learning (ML)
➢ ML gives machines (computers) the ability to learn without being explicitly programmed
➢ ML is about teaching machines to perform tasks based on prior experience (knowledge). Experience comes from data
➢ ML algorithms enable machines to identify patterns in observed data
➢ Machines predict things without explicit pre-programmed rules
Training: A learner is trained on a dataset and emits a learned model
Inference: A trained (learned) model takes real-world inputs and makes predictions
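The training and inference phases can be sketched in a few lines. This is a minimal illustration assuming scikit-learn is available; the synthetic dataset and the choice of a decision tree are assumptions made for the example, not something the slides prescribe.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic labeled data stands in for "prior experience" (assumption).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Training: the learner consumes a dataset and emits a learned model.
learner = DecisionTreeClassifier(max_depth=4, random_state=0)
model = learner.fit(X_train, y_train)

# Inference: the learned model takes unseen inputs and makes predictions.
predictions = model.predict(X_new)
accuracy = (predictions == y_new).mean()
print(f"held-out accuracy: {accuracy:.2f}")
```

No rule about the data was written by hand: the tree's split thresholds were all learned from `X_train` and `y_train`.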
3. ML Algorithms
➢ The main objective of an ML algorithm (algo) is to pick the most sensible place to put a fence in the data.
➢ The goal of every ML algorithm is to best estimate a target function (f) that maps input data (X) onto output variables (Y).
➢ Many ML algorithms are available. The choice depends on the specific problem.
➢ Tree-based ensemble algorithms (gradient tree boosting, random forest) are known to work well on a wide variety of datasets
➢ In addition, hyperparameter optimization of a given algorithm can lead to a significant improvement in predictive accuracy on many problems
Popular ML algorithms:
○ Gaussian Naive Bayes (GNB)
○ Bernoulli Naive Bayes (BNB)
○ Multinomial Naive Bayes (MNB)
○ Logistic Regression (LR)
○ Stochastic Gradient Descent (SGD)
○ Passive Aggressive Classifier (PAC)
○ Support Vector Classifier (SVC)
○ K-Nearest Neighbor (KNN)
○ Decision Tree (DT)
○ Random Forest (RF)
○ Extra Trees Classifier (ERF)
○ AdaBoost (AB)
○ Gradient Tree Boosting (GTB)
Algorithms are ranked by their 10-fold CV balanced accuracy on a given dataset, with a lower ranking indicating higher accuracy. The rankings show the strength of ensemble-based tree algorithms in generating accurate models: the first, second, and fourth-ranked algorithms belong to this class.
Data driven advice to applying machine learning
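The ranking procedure described above can be reproduced on a small scale. The sketch below assumes scikit-learn and uses a synthetic dataset with four of the listed algorithms. Which algorithm ranks first depends on the data, so treat the output as an illustration of the method rather than a result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic dataset (assumption); swap in a real dataset to compare in practice.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

candidates = {
    "GNB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "GTB": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Mean balanced accuracy over 10 cross-validation folds for each algorithm.
scores = {
    name: cross_val_score(clf, X, y, cv=10, scoring="balanced_accuracy").mean()
    for name, clf in candidates.items()
}

# Rank 1 is the most accurate algorithm on this dataset.
ranking = sorted(scores, key=scores.get, reverse=True)
for rank, name in enumerate(ranking, start=1):
    print(rank, name, round(scores[name], 3))
```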
4. Deep Learning (DL)
➢ Deep Learning (DL) uses deep neural networks (DNNs) that are built via deep layering of connected artificial neurons, also called perceptrons
○ A neuron can be thought of as a function that takes in multiple inputs and yields a single output. Commonly used activation functions are: sigmoid, softmax, ReLU
➢ DNN functions define the relationship between input and output layers, which is parameterized by weights.
○ Activation functions introduce the non-linearity needed for various learning tasks. Training reduces a cost function: errors are minimized by adjusting weight (w) and bias (b) via gradient descent
Features:
➢ DL models can extract useful features from raw data, called Feature Learning
➢ DL models are trained to capture non-linear relationships
➢ DL models can be tuned to reduce overfitting
➢ DL can perform non-linear dimensionality reduction, whereas PCA is limited to linear projections
○ An autoencoder can recreate an image from low-dimensional codes
➢ NN layers and weights can be reused and adjusted to implement Transfer Learning
DL has proven to work best in the fields of: image (computer vision) and speech recognition, NLP, sentiment analysis, self-driving and recommendation systems
CNN in action
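A single neuron trained by gradient descent makes the roles of weight (w), bias (b), activation and cost function concrete. This is a minimal NumPy sketch on toy data. The learning rate, iteration count and data are assumptions chosen for the example, and a real DNN stacks many such units and computes gradients by backpropagation.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))               # two inputs per example
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # toy binary labels

w = np.zeros(2)   # weights
b = 0.0           # bias
lr = 0.5          # learning rate (assumed value)

def cost(w, b):
    """Cross-entropy cost between neuron outputs and labels."""
    p = sigmoid(X @ w + b)
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

initial_cost = cost(w, b)
for _ in range(200):
    p = sigmoid(X @ w + b)              # forward pass: the neuron's output
    grad_w = X.T @ (p - y) / len(y)     # gradient of the cost w.r.t. w
    grad_b = np.mean(p - y)             # gradient of the cost w.r.t. b
    w -= lr * grad_w                    # gradient descent update
    b -= lr * grad_b
final_cost = cost(w, b)

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))
print(f"cost {initial_cost:.3f} -> {final_cost:.3f}, accuracy {accuracy:.2f}")
```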
5. Machine Learning - Scalability
➢ The vast majority of ML use cases are data parallel
➢ Parallel processing across multiple GPUs/CPUs can reduce model training time. Parallel computation can be applied to:
○ Model training via ensembles of decision trees (DT)
○ Model Evaluation via resampling procedures like k-fold cross-validation
○ Tuning hyperparameters via grid/random search
➢ ML libraries that support multi-GPU model training:
○ XGBoost
○ LightGBM
○ Horovod - trains convolutional networks and LSTMs in hours instead of days or weeks
➢ Data Parallelism in DL
○ LSTM - One layer per GPU
○ Distributed SGD - SGD mini-batches are spread over a pool of parallel workers, with the learning rate adjusted as a function of minibatch size
➢ Model Parallelism in DL
○ Stacked LSTM
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
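Distributed SGD can be simulated on one machine. The sketch below assumes a linear model with a mean squared error loss and equal-size shards; the base learning rate of 0.1 for a minibatch of 256 is an illustrative value. It shows that averaging per-worker gradients reproduces the full-minibatch gradient, and how a linear scaling rule raises the learning rate as the minibatch grows.

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of mean squared error for a linear model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def scaled_learning_rate(base_lr, base_batch, batch):
    """Linear scaling: a minibatch k times larger gets a learning rate k times larger."""
    return base_lr * batch / base_batch

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0])
w = np.zeros(4)

# Split one minibatch of 256 samples into 4 equal shards, one per simulated worker.
shards = zip(np.split(X, 4), np.split(y, 4))
worker_grads = [gradient(w, Xs, ys) for Xs, ys in shards]

# Averaging the per-worker gradients (an all-reduce in a real system)
# reproduces the gradient of the full minibatch.
averaged = np.mean(worker_grads, axis=0)
full = gradient(w, X, y)

# Four workers with 256 samples each give an effective minibatch of 1024.
lr = scaled_learning_rate(base_lr=0.1, base_batch=256, batch=1024)
print(np.allclose(averaged, full), lr)
```

The equivalence holds exactly only when all shards are the same size; with uneven shards the average must be weighted by shard size.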
6. GPU - General Purpose Computing
➢ GPUs can be programmed in high-level programming languages like C and C++
○ No knowledge of graphics programming (OpenGL or DirectX) is required
○ Requires knowledge of CUDA, a modestly extended version of C
➢ A CUDA program utilizes GPUs in conjunction with CPUs to accelerate compute-heavy tasks
○ Application code runs on the CPU but can offload compute-intensive tasks to the GPU as CUDA kernel functions
➢ Data parallel problems, common in ML/DL, are a good fit for GPU computation, where each data element can be processed in parallel and the same kernel function is applied to each data element
➢ Neural networks are built from identical neurons that are highly parallel by nature and rely heavily on matrix math operations, which are best supported on GPUs. This gives a significant speedup over CPU-only model training
General Purpose GPU programming
7. GPU vs. CPU
GPU
● Thousands of smaller cores. Ideal for compute-intensive parallel tasks or stream processing
● The GPU is connected to the system bus via PCIe
● A GPU offers much higher instruction throughput and memory bandwidth than a CPU
● A GPU has more transistors dedicated to data processing than to data caching
● A physical GPU has 20-80 streaming multiprocessors (SMs). Each SM can have hundreds of cores, which adds up to thousands of cores. Each core runs one thread
● Stream processing can get a 10x speedup on a GPU due to efficient memory access and a higher level of parallel processing
● A system can have multiple GPUs. GPU-to-GPU communication is possible via Nvidia NVLink without going over the PCIe bus
● Each SM in a GPU has an on-chip 512 KB register file and 128 KB shared memory, and an off-chip 1.5 MB shared L2 cache
● GPU cores run in lock step in groups called warps. All threads in a warp start at the same program address
Getting started with Nvidia CUDA GPUs
CPU
● Fewer cores. Optimized for sequential (serial) processing
● The CPU socket is directly attached to the system bus
● A physical CPU socket can have multiple cores, each with two hyperthreads (HT) of execution
● A system can have multiple physical CPUs. Inter-processor communication is via the system bus
● Each CPU core has dedicated L1/L2 caches and an off-core shared L3 cache
● CPU cores run independently of each other
● Stream processing speedup is limited, ~1.5% improvement
8. GPU - Performance Considerations
➢ Improve GPU utilization by reducing memory transfers between the CPU (host) and the GPU (device)
○ For example: all stages of decision tree construction can be performed efficiently on the GPU
➢ Gradient boosting is well suited to GPUs. ML libraries like XGBoost are optimized to run all phases of training on the GPU:
○ Data compression, gradient calculation, feature quantization, prediction, decision tree construction and evaluation
➢ Scale computation across multiple GPUs on a system. Nvidia GPUs support NVLink for inter-GPU communication, which offers 10x higher throughput than communicating over the PCIe bus
➢ Train models with mixed precision. Nvidia Tensor Cores (Volta/Turing GPUs) support mixed precision training
○ Precision lower than 32-bit floating point requires less memory and bandwidth
○ Math operations run faster in reduced precision
➢ GPU primitives can be composed into more complicated algorithms while retaining high performance, readability and reliability. Simple algorithms can be used to build massively parallel ones. Some examples of parallel primitives:
○ Radix sort, reduction (Harris), parallel prefix sum (scan), segmented scan and reduce, interleaved sequences (multi-reduce and multi-scan)
➢ GPUs are optimized for 32-bit floating point operations, but not for 64-bit double precision
○ In 32-bit arithmetic, parallel summation shows dramatically better numerical stability than sequential summation
○ The error of parallel summation grows as O(log n), compared to O(n) for sequential summation
Scan Primitives for GPU Computing
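The numerical stability claim can be checked on a CPU by emulating the two summation orders in 32-bit floats. The sketch below uses NumPy; the tree reduction mimics the shape of a parallel GPU reduction but runs serially here, and the input (100,000 copies of 0.1) is an assumption chosen because it makes the rounding drift easy to see.

```python
import numpy as np

def sequential_sum(values):
    """Left-to-right accumulation in float32, as a single thread would do."""
    return np.cumsum(values, dtype=np.float32)[-1]

def pairwise_sum(values):
    """Tree reduction in float32: add neighbors level by level until one value is left."""
    level = values.astype(np.float32)
    while level.size > 1:
        if level.size % 2:  # pad odd-sized levels with a zero
            level = np.append(level, np.float32(0.0))
        level = level[0::2] + level[1::2]
    return level[0]

values = np.full(100_000, 0.1, dtype=np.float32)
exact = values.astype(np.float64).sum()  # float64 reference result

seq_error = abs(float(sequential_sum(values)) - exact)
tree_error = abs(float(pairwise_sum(values)) - exact)
print(f"sequential error: {seq_error:.4f}, tree error: {tree_error:.6f}")
```

In the sequential version every addition rounds against an ever larger running total, so errors accumulate over n steps. In the tree version each value passes through only about log2(n) additions.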
9. Meson - Netflix ML Workflow and Orchestration
➢ Models should learn and adapt to new data as it arrives. The ML workflow involves:
Labeling -> Feature Generation -> Training -> Metrics
➢ At Netflix, Meson performs workflow orchestration and job scheduling, and Mesos is used for cluster management
○ Several ML pipelines are built to train and test recommendation algorithms.
➢ Meson supports:
○ Convenient authoring of workflows via a Scala-based DSL
○ ML-specific constructs like parallel parameter sweeps, cross-validation, bootstrapping, etc.
○ Custom extensions to perform various tasks: submit jobs to a Spark cluster, query Hive tables, access Netflix microservices and plug in visualizations
10. Netflix - Metaflow
➢ Netflix Python library for creating and executing DAGs (directed acyclic graphs) as workflows. Each node in the DAG is a processing step
○ Metaflow handles data flow and state transfers at each layer
➢ Gives users the freedom to design and implement their own code inside the DAG
➢ Makes it easy for ML workloads to interact with AWS cloud infrastructure like storage, compute, notebooks or other UIs
➢ Takes snapshots of the code, data and dependencies automatically. Built-in versioning and logging make collaboration easier
➢ Support for resuming workflows, reproducing past results, and inspecting workflows in a notebook
➢ Graphs can be large (fan-outs) with thousands of tasks in a single workflow
➢ The job scheduler layer (Meson, AWS Step Functions) is responsible for orchestrating the workflow and assigning the DAG to the compute layer
○ Schedules steps in topological order
○ Makes sure each step in the graph is finished before executing the next
○ Supports trigger-based (cron, external condition) execution of workflows
At Netflix scale, the scheduler handles hundreds of thousands of active workflows
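Scheduling steps in topological order can be illustrated with Python's standard library alone. The sketch below is not Metaflow or Meson code: the step names are hypothetical, and `graphlib.TopologicalSorter` (Python 3.9+) stands in for the scheduler layer.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a step, each value is the set of
# steps that must finish before it can start.
workflow = {
    "start": set(),
    "load_data": {"start"},
    "train_model_a": {"load_data"},   # fan-out: these two branches are
    "train_model_b": {"load_data"},   # independent of each other
    "join": {"train_model_a", "train_model_b"},
    "end": {"join"},
}

def run_workflow(graph):
    """Execute steps in topological order and return the execution order."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    order = []
    while sorter.is_active():
        ready = sorter.get_ready()     # steps whose dependencies are all done
        for step in sorted(ready):     # a real scheduler could dispatch these in parallel
            order.append(step)         # placeholder for actually running the step
            sorter.done(step)          # mark finished, unblocking downstream steps
    return order

order = run_workflow(workflow)
print(order)
```

Because a step only becomes ready after `done` has been called on all of its predecessors, `join` can never run before both training branches have finished.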
11. Netflix - Notebook (Polynote)
➢ A notebook is a web tool popular in the ML community for:
○ Sharing live code and visualizations
○ Data cleaning, transformation, simulation and modeling
➢ Polynote is a new notebook system built at Netflix that offers:
○ IDE-like features: autocomplete, parameter hints, in-line error highlighting
○ Parameterized notebooks for building reusable templates
○ Polyglot language support: Python, SQL, Scala
○ Apache Spark integration
➢ ML engineers are required to work with multiple languages:
○ Scala and Spark to generate training data (cleaning, subsampling, etc.)
○ Python ML libraries like TensorFlow and scikit-learn to train models
➢ Polynote improves notebook reproducibility and visibility:
○ Keeps notebook state consistent when cells are executed in any order
○ Dependency and configuration setup (Spark) is saved within the notebook
○ Data visualization with matplotlib and Vega
Source: Polynote - an IDE-inspired polyglot notebook
12. Netflix - Notebook Infrastructure
Netflix users construct entire workflows in a notebook. To support varying use cases and automation, Netflix built a notebook infrastructure with open source and homegrown projects:
➢ nteract: next-gen React-based UI for Jupyter notebooks
➢ Meson: Netflix workflow orchestration platform
➢ Papermill: Library for parameterizing, executing and analyzing Jupyter notebooks
➢ Commuter: Service for viewing and sharing notebooks stored on S3
➢ Titus: Netflix container management platform
➢ Storage: S3, EFS
➢ Compute: All jobs are scheduled on containers
Beyond Interactive, Notebook innovation at Netflix
14. Recommendation
➢ The Netflix recommendation engine is responsible for:
○ Personalizing each member's home page
○ Recommending what shows to watch
○ Displaying artwork
➢ Various ML models are tested offline on historical viewing data to see whether they would have improved recommendations. If so, a live A/B test is deployed to see whether the model performs well in production
➢ The goal is to better predict what you want to watch before you watch it.
➢ All sorts of models are tested during exploration:
○ Logistic regression (2014)
○ Ensemble Model of Decision Trees (DT)
○ Trees and Very Large GBDT (XGBoost)
○ FeedForward Neural Network (NN)
○ Recurrent NN
○ Convolutional NN
○ LSTM / Stacked LSTM
15. Streaming Quality
➢ Netflix optimizes content delivery by a combination of intelligent caching and encoding recipes that incorporate: device capabilities, title complexity, geographical location and network bandwidth
➢ Viewing experience and streaming quality are enhanced by applying predictive models:
○ Device caching takes into account the user's immediate (20 seconds) viewing history to predict which unwatched episode in a series will be watched next
○ Network quality characterization and prediction to adapt video quality during playback
○ Actively monitoring resource constraints, such as device memory and available network bandwidth, to reduce video start time
➢ By predicting regional demand, video assets can be cached closer to the subscriber's location, which reduces rebuffer events even with higher quality streaming
➢ Redundancy in encoded video is removed via spatial and temporal prediction and correlation, resulting in lower bandwidth requirements for delivering the same video quality
➢ Content allocation algorithm to improve Netflix CDN hardware utilization
Source: How Data Science Helps Power Worldwide Delivery of Netflix Content
16. Regional Failover
➢ Netflix load balances subscriber traffic across multiple AWS regions
➢ A drop in the SPS (Stream Starts per Second) metric triggers regional failover
➢ A linear regression model is used to predict the traffic that will be routed to savior regions
➢ The model is trained on the historical scaling behavior of each microservice to predict the level of scale-up or system resources required to handle the load for that time of day.
➢ Regional failover takes into account the geographical location of subscribers and the capacity requirements of each microservice to achieve graceful failover
➢ Failover efficiency (7 minutes to fail over a region) is achieved by keeping enough dark capacity online in each region to meet service scaling requirements for that day
➢ Dark capacity is whitelisted to take production traffic at failover
17. Resource Management
(Predictive container placement)
➢ Optimum container placement using combinatorial optimization and ML instead of relying solely on the Linux CFS scheduler to make placement decisions
➢ Allocate containers closer to compute resources by detecting optimum collocation opportunities
➢ A gradient boosting ML model is trained via the LightGBM library on container CPU usage data. The model predicts the 95th percentile CPU usage of each container for the next 10 minutes via conditional quantile regression
○ Container metadata (image, app name, memory, network, etc.) along with time series CPU usage for the last hour are used for model training.
○ The model prediction is fed into a MIP (Mixed Integer Programming) solver that outputs the optimized placement. Container isolation is applied through cgroup cpuset changes.
➢ Types of constraints applied to placement decisions:
○ Assign all tasks within a container to the same socket to avoid NUMA latencies
○ Assign each container a minimum of one core to avoid core and L1/L2 cache sharing
○ Spread different containers across sockets, if possible, to reduce shared L3 cache contention
○ Do not modify the placement of running containers when adding/removing containers
Predictive CPU isolation of containers at Netflix
Figure: Container task runtime distribution with and without improved isolation. Fewer outliers with container isolation applied.
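Conditional quantile regression with gradient boosting can be sketched as follows. The slides name LightGBM; this example swaps in scikit-learn's `GradientBoostingRegressor` with a quantile loss as a stand-in, and the telemetry is synthetic. The single feature, the noise model and the sample size are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for container telemetry (not real Netflix data):
# one feature (recent mean CPU usage) and a noisy future CPU usage target.
recent_cpu = rng.uniform(0.5, 4.0, size=2000)
future_cpu = recent_cpu + rng.exponential(scale=0.5, size=2000)
X = recent_cpu.reshape(-1, 1)

# loss="quantile" with alpha=0.95 fits the conditional 95th percentile
# instead of the conditional mean.
model = GradientBoostingRegressor(
    loss="quantile", alpha=0.95, n_estimators=100, random_state=0
)
model.fit(X, future_cpu)

p95 = model.predict(X)
# Fraction of observations at or below the predicted 95th percentile.
coverage = np.mean(future_cpu <= p95)
print(f"coverage: {coverage:.3f}")
```

A well-fitted P95 model should cover roughly 95% of observations. Predicting an upper percentile rather than the mean gives the placement optimizer a conservative estimate of each container's CPU demand.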
18. Anomaly Detection
➢ Anomaly detection systems are optimized for higher precision (fewer false detections) while maintaining recall (true anomalies detected)
➢ Outliers are data points that exhibit significantly different properties from the majority of points. Outlier detection is commonly used to detect anomalies in areas such as:
○ Suspicious financial or credit card transactions
○ Traffic violation and management
○ Network intrusions or hacking. Surveillance
○ Health monitoring
○ Event detection in time series and sensor data
➢ The Netflix device reliability team applies statistical and predictive modeling to prioritize device reliability issues by controlling for various covariates
➢ Models are trained with past incidents that are labeled False or True (known to be a real, actionable issue).
○ Incident data is high dimensional, with a structure rich enough to reliably determine the root cause
○ The trained model predicts the likelihood that a given set of measured conditions constitutes a real problem.
➢ The Netflix data team uses the Robust Anomaly Detection (RAD) algorithm to detect anomalies in high cardinality Big Data.
○ RAD is used at Netflix to detect failures in receiving bank payments and to identify subscriber sign-up problems across devices and browsers
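The precision and recall trade-off mentioned at the top of this slide can be made concrete with a small calculation. The scores and labels below are hypothetical, and the thresholding detector is a generic illustration, not the RAD algorithm.

```python
def precision_recall(scores, labels, threshold):
    """Flag every score >= threshold as an anomaly and compare with the true labels."""
    flagged = [s >= threshold for s in scores]
    tp = sum(f and l for f, l in zip(flagged, labels))          # real anomalies caught
    fp = sum(f and not l for f, l in zip(flagged, labels))      # false detections
    fn = sum((not f) and l for f, l in zip(flagged, labels))    # real anomalies missed
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Hypothetical anomaly scores and ground-truth labels (True = real incident).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [True, True, False, True, False, False, False, False]

for threshold in (0.5, 0.85):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

With these values, raising the threshold from 0.5 to 0.85 removes both false detections (precision rises from 0.60 to 1.00) but misses one real incident (recall drops from 1.00 to 0.67).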
19. Capacity Forecasting
➢ A regression model predicts Netflix microservice RPS (Requests per Second) by identifying its relationship with system resources (cpu, mem, net, io)
○ The model is trained with system resource usage (features) and RPS (label) metrics of popular Netflix services. Additional dimensions like AWS region and time of day can be added to make predictions more precise
○ Service metrics (last 2 weeks) are fetched from the Netflix telemetry system (Atlas) to retrain the model
➢ Helps with capacity planning by forecasting the system resources needed to scale with a service's RPS growth.
➢ The trained model is deployed as a web app or microservice to help Netflix service teams estimate the cloud cost increase in relation to service RPS changes
Feature (cpu, mem, io..) correlation with RPS
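A regression of RPS on resource usage might look like the sketch below. It assumes scikit-learn and uses synthetic telemetry with an invented linear relationship; the coefficients, feature ranges and noise level are illustrative assumptions and do not come from Atlas or any Netflix service.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic telemetry (not real Netflix data): cpu %, memory GB, network MB/s.
features = rng.uniform([10, 2, 50], [90, 16, 500], size=(500, 3))

# Assumed ground truth for the sketch: RPS is linear in the resources, plus noise.
rps = (
    40 * features[:, 0]
    + 15 * features[:, 1]
    + 2 * features[:, 2]
    + rng.normal(0, 25, 500)
)

# Train: resource usage metrics are the features, RPS is the label.
model = LinearRegression().fit(features, rps)
r_squared = model.score(features, rps)

# Estimate the RPS associated with a given resource profile.
profile = np.array([[60.0, 8.0, 300.0]])
predicted_rps = model.predict(profile)[0]
print(f"R^2={r_squared:.3f}, predicted RPS={predicted_rps:.0f}")
```

Once fitted, the same model can be queried with projected resource profiles to estimate how RPS and resource needs move together. Real service metrics are unlikely to be this cleanly linear.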
21. References
➢ Netflix Machine Learning - Techblog at Medium
➢ General Purpose GPU Computing. Getting Started with Nvidia CUDA
➢ Using Machine Learning to improve Streaming Quality at Netflix
➢ How Data Science Helps Power Worldwide Delivery of Netflix Content
➢ Telltale: Netflix Application Monitoring Simplified
➢ Predictive CPU isolation of Containers at Netflix
➢ Meson - ML Workflow Orchestration at Netflix
➢ MetaFlow, a Human-Centric Framework for Data Science, Metaflow and AWS Step Functions, Metaflow Docs
➢ Polynote - IDE inspired Polyglot Notebook
➢ Scheduling Notebooks at Netflix
➢ Justin Basilico Presentations on Netflix Personalization
➢ Introduction to Causality in Machine Learning. How Netflix Applies Computational Causal Inference
➢ Comparing Popular ML Algorithms on different Datasets. When to use a particular ML Algorithm
➢ Model Selection for Machine Learning
➢ Strength and Weaknesses of ML Algorithms
➢ Dimensionality Reduction, Feature Selection and Extraction
➢ How to use Gradient Boosting Libraries, XGBoost, LightGBM using Scikit-Learn ML framework
➢ Model Learning Rate tuning when training Deep Learning Neural Networks
➢ 4 Automatic Outlier Detection Algorithms in Python
➢ 17 Statistical Hypothesis Tests needed in ML
➢ Understand Intuitively Different ML Classification Algorithm Principles
➢ Understand Intuitively How Neural Networks Work
➢ Model Performance Validation and Metrics
➢ Convolutional Neural Network (CNN) in action
➢ Distributed Training via Large Minibatch SGD
➢ Scan Primitives for GPU Computing. GPU Primitives for implementing popular algorithms like sorting, prefix sum..
➢ GPU parallel programming using CUDA. Free online class at Udacity