Optimizing training on
Apache MXNet
Julien Simon, AI Evangelist, EMEA
@julsimon
What to expect from this session
• Techniques and tips to optimize training on Apache MXNet
• Infrastructure performance: storage and I/O, GPU throughput, distributed
training, CPU-based training, cost
• Model performance: data augmentation, initializers, optimizers, etc.
• Level 666: you should be familiar with Deep Learning and MXNet
Optimizing Infrastructure Performance
Deploying data sets to instances
• Deep Learning training sets are often very large, with a huge number of files
• How can we deploy them quickly, easily and reliably to instances?
• We strongly recommend packing the training set in a RecordIO file (see the packing sketch after this list)
• https://mxnet.incubator.apache.org/architecture/note_data_loading.html
• https://mxnet.incubator.apache.org/how_to/recordio.html
• Only one file to move around!
• Worth the effort: pack once, train many times
• In any case, you need to copy your data set to a central location
• Let’s look at Amazon EBS, Amazon S3 and Amazon EFS
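For reference, the packing step mentioned above is a short one-off script. Here is a minimal sketch using the mx.recordio API (folder, file names and labels are placeholders); for image datasets, the bundled tools/im2rec.py script builds the .rec/.idx pair directly from a list file.
import glob
import cv2                                        # OpenCV, used to load the images
import mxnet as mx

record = mx.recordio.MXIndexedRecordIO('train.idx', 'train.rec', 'w')
for i, path in enumerate(sorted(glob.glob('train_images/*.jpg'))):   # placeholder folder
    img = cv2.imread(path)                        # HxWx3 array
    label = 0.0                                   # replace with the real class label
    header = mx.recordio.IRHeader(0, label, i, 0)
    record.write_idx(i, mx.recordio.pack_img(header, img, quality=95, img_fmt='.jpg'))
record.close()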
Storing data sets in Amazon EBS
1. Prepare your data set on a dedicated EBS volume
2. Take a snapshot
3. Deploying to a new instance only takes a few seconds
a. Create a volume from the snapshot
b. Attach the volume to the instance
c. Mount the volume
• Easy to automate, including at boot time (UserData or cfn-init)
• Easy to scale to many instances, even in different accounts
• Large choice of EBS volume types (cost vs. performance)
• Caveat: no sharing for distributed training, copying is required
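The snapshot flow above is easy to script with boto3. A minimal sketch, assuming placeholder snapshot, instance and device identifiers:
import boto3

ec2 = boto3.client('ec2')

# 3a. create a volume from the snapshot (IDs and AZ are placeholders)
vol = ec2.create_volume(SnapshotId='snap-0123456789abcdef0',
                        AvailabilityZone='eu-west-1a',
                        VolumeType='gp2')
ec2.get_waiter('volume_available').wait(VolumeIds=[vol['VolumeId']])

# 3b. attach it to the training instance
ec2.attach_volume(VolumeId=vol['VolumeId'],
                  InstanceId='i-0123456789abcdef0',
                  Device='/dev/sdf')
ec2.get_waiter('volume_in_use').wait(VolumeIds=[vol['VolumeId']])

# 3c. mount it on the instance, e.g. sudo mount /dev/xvdf /data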
Storing data sets in Amazon S3
• MXNet has an S3 connector → build option USE_S3=1
https://mxnet.incubator.apache.org/how_to/s3_integration.html
• Best durability (11 9’s)
• Distributed training possible
• Caveats
• Lower performance than EBS-optimized instances
• Beware of hot spots if a lot of instances are running
https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
train_dataiter = mx.io.MNISTIter(
image="s3://bucket-name/training-data/train-images-idx3-ubyte",
label="s3://bucket-name/training-data/train-labels-idx1-ubyte", ...
Storing data sets in Amazon EFS
1. Copy your data set to an EFS volume
2. Mount the volume on instances
• Simple way to set up distributed training (no copying required)
• Caveats
• You probably want the “Max I/O” performance mode, but I’d test both
to see if latency is an issue or not
• EFS is more expensive than S3 and EBS: use it for training only, not
for long-term storage
Maximizing GPU usage
• GPUs need a high-throughput, stable flow of training data to run at top speed
• Large datasets cannot fit in RAM
• Adding more GPUs requires more throughput
• How can we check that training is running at full speed?
• Keep track of performance indicators from previous training runs (images/sec, etc.)
• Look at performance indicators and benchmarks reported by others
• Use nvidia-smi
• Look at power consumption, GPU utilization and GPU RAM
• All these values should be maxed out and stable
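These checks are easy to script. A small polling sketch around nvidia-smi (the query fields below are standard nvidia-smi options):
import subprocess, time

QUERY = 'utilization.gpu,power.draw,memory.used,memory.total'
for _ in range(60):                              # poll for ~5 minutes
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=' + QUERY, '--format=csv,noheader'])
    print(out.decode().strip())                  # one line per GPU: util %, power W, memory MiB
    time.sleep(5)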
Maximizing GPU usage: batch size
• Picking a batch size is a tradeoff between training speed and accuracy
• Larger batch size is more computationally efficient
• Smaller batch size helps find a better minimum
• Smaller data sets, few classes (MNIST, CIFAR)
• Start with 32*GPU_COUNT
• 1024 is probably the largest reasonable batch size
• Large data sets, lots of classes (ImageNet)
• Use the largest possible batch size
• Start at 32*GPU_COUNT and increase it until MXNet OOMs
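A minimal sketch of that starting point, assuming a data-parallel Module setup ('train.rec' and 'net' are placeholders):
import mxnet as mx

GPU_COUNT = 4                                     # adjust to your instance
devices = [mx.gpu(i) for i in range(GPU_COUNT)]
batch_size = 32 * GPU_COUNT                       # starting point; raise until MXNet OOMs

train_iter = mx.io.ImageRecordIter(path_imgrec='train.rec',   # placeholder RecordIO file
                                   data_shape=(3, 224, 224),
                                   batch_size=batch_size)
mod = mx.mod.Module(symbol=net, context=devices)  # 'net' is your network symbol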
Maximizing GPU usage: compute & I/O
• Check power consumption and GPU usage after each modification
• If they’re not maxed out, GPUs are probably stalling
• Can the Python process keep up? Loading images, pre-processing, etc.
• Use top to check load and count threads
• Use RecordIO and add more decoding threads
• Can the I/O layer keep up?
• Use iostat to look at volume stats
• Use faster storage: SSD or even a ramdisk!
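With ImageRecordIter, the number of decoding threads is a single parameter. A sketch with illustrative values ('train.rec' is a placeholder):
import mxnet as mx

train_iter = mx.io.ImageRecordIter(path_imgrec='train.rec',    # placeholder RecordIO file
                                   data_shape=(3, 224, 224),
                                   batch_size=256,
                                   preprocess_threads=16)      # raise this if GPUs stall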
Using distributed training
• MXNet scales almost linearly up to 256 GPUs
http://www.allthingsdistributed.com/2016/11/mxnet-default-framework-deep-learning-aws.html
• Easy to set up
https://mxnet.incubator.apache.org/how_to/multi_devices.html
• Blog post + AWS CloudFormation template
https://aws.amazon.com/blogs/compute/distributed-deep-learning-made-easy/
• Master node must have SSH access to slave nodes
• Data set must be accessible on all nodes
• Shared storage: great!
• No shared storage → automatic copy with rsync
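On the MXNet side, distributed data-parallel training mostly comes down to picking a distributed key-value store; the cluster itself is usually launched with tools/launch.py, which sets the scheduler/server/worker environment for you. A minimal sketch ('net' and the data iterators are placeholders):
import mxnet as mx

kv = mx.kvstore.create('dist_sync')               # or 'dist_async'
mod = mx.mod.Module(symbol=net, context=[mx.gpu(i) for i in range(4)])  # 'net': your symbol
mod.fit(train_iter, eval_data=val_iter,           # your RecordIO-backed iterators
        num_epoch=10,
        kvstore=kv)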
What about CPU training?
• Several libraries help speed up Deep Learning on CPUs
• Fast implementation of math primitives
• Dedicated instruction sets, e.g. Intel AVX or ARM NEON
• Fast memory allocation
• Intel Math Kernel Library https://software.intel.com/en-us/mkl → USE_MKL = 1
• NNPACK https://github.com/Maratyszcza/NNPACK → USE_NNPACK = 1
• Libjpeg-turbo https://www.libjpeg-turbo.org/ → USE_TURBO_JPEG = 1
• Jemalloc http://jemalloc.net/ → USE_JEMALLOC = 1
• Google Perf Tools https://github.com/gperftools → USE_GPERFTOOLS = 1
Intel® MKL-DNN: Math Kernel Library for Deep Neural Networks
For developers of deep learning frameworks featuring optimized performance on Intel hardware
http://github.com/01org/mkl-dnn
Distribution details:
• Open source, Apache 2.0 license
• Common DNN APIs across all Intel hardware
• Rapid release cycles, iterated with the DL community, to best support industry framework integration
• Highly vectorized & threaded for maximal performance, based on the popular Intel® MKL library
Example primitives: direct 2D convolution, rectified linear unit activation (ReLU), maximum pooling, inner product, local response normalization (LRN)
Optimizing cost
• Use Spot instances
https://aws.amazon.com/blogs/aws/natural-language-processing-at-clemson-university-1-1-million-vcpus-ec2-spot-instances/
• Sharing is caring: it’s easy to share an
instance for multiple jobs
mod = mx.mod.Module(lenet, context=(mx.gpu(7), mx.gpu(8), mx.gpu(9)))
p2.16xlarge: 89% discount
Demo: C5 + Intel MKL = ♥ ♥ ♥
Optimizing Model Performance
Using data augmentation
• Data augmentation lets you add more samples to smaller data sets
• Even a large data set may benefit from it and generalize better
• The ImageRecordIter object lets you do that easily from a RecordIO image file
• Images: crop, rotate, change colors, etc.
• https://mxnet.incubator.apache.org/api/python/io.html#mxnet.io.ImageRecordIter
• Careful: this processing is performed by the Python process, so add more decoding threads!
data_iter = mx.io.ImageRecordIter(path_imgrec="./data/caltech_train.rec",
                                  data_shape=(3, 227, 227),
                                  batch_size=4,
                                  resize=256
                                  # … you can add more augmentation options here.
                                  # use help(mx.io.ImageRecordIter) to see all possible choices
                                  )
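For example, here is the same iterator with a few of the built-in augmentation options enabled (values are illustrative):
data_iter = mx.io.ImageRecordIter(path_imgrec="./data/caltech_train.rec",
                                  data_shape=(3, 227, 227),
                                  batch_size=4,
                                  resize=256,
                                  rand_crop=True,          # random crop instead of center crop
                                  rand_mirror=True,        # random horizontal flip
                                  max_rotate_angle=15,     # random rotation up to ±15 degrees
                                  brightness=0.2,          # random brightness jitter
                                  preprocess_threads=8)    # extra decoding threads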
Picking an initializer
• MXNet supports many different initializers
https://mxnet.incubator.apache.org/api/python/optimization.html
• Initial weights should be neither “too large” nor “too small”
• There seems to be some sort of consensus on:
https://www.quora.com/What-are-good-initial-weights-in-a-neural-network
• Xavier for Convolutional Neural Networks
• Random values between 0 and 1 for everything else
• I wouldn’t use anything else unless I really knew better 
Managing the learning rate
• The learning rate is probably the most discussed parameter in Deep Learning
• Too small: your model may never converge
• Too large: your model may never reach a minimum
• Try keeping a large learning rate for a long time, then reduce it
• Here are common techniques you could use with MXNet:
1. Use a fixed learning rate
2. Use steps: scale the learning rate
• once a number of batches have been completed,
• after each epoch,
• once specific epochs have been completed
3. Use an optimizer which automatically adapts the learning rate
Scaling the learning rate with steps
• Number of steps = number of samples / batch size / number of distributed workers
• FactorScheduler object: update the learning rate after ‘n’ steps
• MultiFactorScheduler object: update the learning rate after specific step counts
• MXNet scripts let you use command-line parameters (--step-epochs)
https://github.com/apache/incubator-mxnet/tree/master/example/image-classification
lr_sch = mx.lr_scheduler.FactorScheduler(step=100, factor=0.9)
mod.init_optimizer(..., optimizer='sgd',
                   optimizer_params=(('learning_rate', 0.1), ('lr_scheduler', lr_sch)))

steps = [0, 100, 200, 250, 300, 325, 350]
lr_sch = mx.lr_scheduler.MultiFactorScheduler(step=steps, factor=0.9)
mod.init_optimizer(..., optimizer='sgd',
                   optimizer_params=(('learning_rate', 0.1), ('lr_scheduler', lr_sch)))
Picking an optimizer
• MXNet supports many different optimizers
https://mxnet.incubator.apache.org/api/python/optimization.html
http://ruder.io/optimizing-gradient-descent/
• It’s unlikely that a single one will work best every time. Experiment!
• Several SGD variants adapt the learning rate during training
• Some of them even use a specific learning rate for each parameter
Example: learning MNIST with the LeNet CNN (20 epochs)
Algorithm            SGD     NAG     Adam    NAdam   AdaGrad   AdaMax
Time / epoch         2.5s    2.55s   18.5s   15.1s   5.7s      7.5s
Validation accuracy  98.5%   98.5%   98.3%   98.4%   99.2%     98.55%
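Trying another optimizer is a one-line change on the Module; a sketch with illustrative hyper-parameters ('mod' and the iterators come from your training script):
mod.fit(train_iter, eval_data=val_iter,           # your Module and data iterators
        num_epoch=20,
        optimizer='adagrad',                      # 'sgd', 'nag', 'adam', 'adamax', ...
        optimizer_params={'learning_rate': 0.01})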
Reducing model size
• Complex neural networks are too large for resource-constrained environments
• MXNet supports Mixed Precision Training (see the float16 sketch after this list)
• Use float16 instead of float32
• Almost 2x reduction in memory consumption, no loss of accuracy
• https://devblogs.nvidia.com/parallelforall/mixed-precision-training-deep-neural-networks/
• http://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#mxnet
• BMXNet: Binary Neural Network Implementation
• Use binary values for weights and activations
• 20x to 30x reduction in model size, with limited loss
• https://github.com/hpi-xnor/BMXNet
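A minimal float16 sketch, assuming a Module-based training script: cast the input to float16, cast back to float32 before the loss, and enable multi_precision so the optimizer keeps a float32 master copy of the weights. The network and iterator below are placeholders.
import mxnet as mx

data = mx.sym.Variable('data')
data = mx.sym.cast(data, dtype='float16')             # compute in float16
net  = mx.sym.FullyConnected(data, num_hidden=10)     # ... your network here ...
net  = mx.sym.cast(net, dtype='float32')              # cast back before the loss
net  = mx.sym.SoftmaxOutput(net, name='softmax')

mod = mx.mod.Module(symbol=net, context=mx.gpu(0))
mod.fit(train_iter, num_epoch=10,                     # train_iter: your data iterator
        optimizer='sgd',
        optimizer_params={'learning_rate': 0.1, 'multi_precision': True})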
Monitoring the training process
• You can run callbacks at the end of each batch and at the end of each epoch.
• This allows you to display training speed…
• … and save parameters after each epoch
module.fit(iterator, num_epoch=n_epoch, ...
batch_end_callback=mx.callback.Speedometer(64, 10))
Epoch[0] Batch [10] Speed: 1910.41 samples/sec Train-accuracy=0.200000
Epoch[0] Batch [20] Speed: 1764.83 samples/sec Train-accuracy=0.400000
module.fit(iterator, num_epoch=n_epoch, ...
epoch_end_callback = mx.callback.do_checkpoint("mymodel", 1))
Start training with [cpu(0)]
Epoch[0] Resetting Data Iterator
Epoch[0] Time cost=0.100 Saved checkpoint to "mymodel-0001.params"
Epoch[1] Resetting Data Iterator
Epoch[1] Time cost=0.060 Saved checkpoint to "mymodel-0002.params"
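Those checkpoints can be reloaded later to resume training or to deploy the model. A minimal sketch (the data shape below is just an example):
import mxnet as mx

sym, arg_params, aux_params = mx.model.load_checkpoint("mymodel", 2)    # epoch 2
mod = mx.mod.Module(symbol=sym, context=mx.cpu(), label_names=None)
mod.bind(data_shapes=[('data', (1, 1, 28, 28))], for_training=False)    # example shape
mod.set_params(arg_params, aux_params)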
Early stopping
[Chart: training vs. validation accuracy and loss over epochs — training accuracy keeps climbing toward 100% while validation accuracy and loss plateau and then degrade (overfitting); the best checkpoint is the one taken just before validation performance starts to drop]
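The Module API has no built-in early-stopping callback, so a common pattern is a manual epoch loop that scores the validation set and keeps the best checkpoint. A sketch with illustrative hyper-parameters ('mod', 'train_iter' and 'val_iter' are assumed to exist):
import mxnet as mx

mod.bind(data_shapes=train_iter.provide_data, label_shapes=train_iter.provide_label)
mod.init_params(initializer=mx.init.Xavier())
mod.init_optimizer(optimizer='sgd', optimizer_params=(('learning_rate', 0.1),))

best_acc, bad_epochs, patience = 0.0, 0, 3
for epoch in range(100):
    train_iter.reset()
    for batch in train_iter:
        mod.forward(batch, is_train=True)
        mod.backward()
        mod.update()
    val_acc = dict(mod.score(val_iter, mx.metric.Accuracy()))['accuracy']
    if val_acc > best_acc:
        best_acc, bad_epochs = val_acc, 0
        mod.save_checkpoint('mymodel-best', epoch)   # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                   # stop once validation stops improving
            break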
Conclusion
• There is a lot of literature on selecting and tweaking hyper-parameters
• You should definitely read it but please experiment with your own data
• Train 1,000 models and pick the best one
• Optimizing infrastructure is all the more important, then!
• Make sure all parts are firing on all cylinders
• Spot instances!
• I hope this was useful. Please don’t forget to send your feedback
• Go build cool stuff and let me know! Happy to share and retweet 
Resources
https://aws.amazon.com/ai
https://aws.amazon.com/blogs/ai
https://mxnet.io
https://github.com/apache/incubator-mxnet
https://github.com/gluon-api
https://aws.amazon.com/blogs/machine-learning/speeding-up-apache-mxnet-using-the-nnpack-library/
https://medium.com/@julsimon/speeding-up-apache-mxnet-part-3-lets-smash-it-with-c5-and-intel-mkl-90ab153b8cc1
https://medium.com/@julsimon/imagenet-part-1-going-on-an-adventure-c0a62976dc72
https://medium.com/@julsimon/imagenet-part-2-the-road-goes-ever-on-and-on-578f09a749f9
Thank you!
Julien Simon, AI Evangelist, EMEA
@julsimon
THANK YOU!
Julien Simon, Principal AI/ML Evangelist, EMEA
@julsimon