This document provides tips and techniques for optimizing training on Apache MXNet. It discusses optimizing infrastructure performance through efficient data storage and distribution, maximizing GPU usage through techniques like batch size tuning, and optimizing model performance through techniques like data augmentation, learning rate scheduling, and optimizer selection. The document demonstrates these concepts through examples and recommends experimenting to find the best approach for each problem.
2. What to expect from this session
• Techniques and tips to optimize training on Apache MXNet
• Infrastructure performance: storage and I/O, GPU throughput, distributed
training, CPU-based training, cost
• Model performance: data augmentation, initializers, optimizers, etc.
• Level 666: you should be familiar with Deep Learning and MXNet
4. Deploying data sets to instances
• Deep Learning training sets are often very large, with a huge number of files
• How can we deploy them quickly, easily and reliably to instances?
• We strongly recommend packing the training set in a RecordIO file
• https://mxnet.incubator.apache.org/architecture/note_data_loading.html
• https://mxnet.incubator.apache.org/how_to/recordio.html
• Only one file to move around!
• Worth the effort: pack once, train many times
• In any case, you need to copy your data set to a central location
• Let’s look at Amazon EBS, Amazon S3 and Amazon EFS
5. Storing data sets in Amazon EBS
1. Prepare your data set on a dedicated EBS volume
2. Take a snapshot
3. Deploying to a new instance only takes a few seconds
a. Create a volume from the snapshot
b. Attach the volume to the instance
c. Mount the volume
• Easy to automate, including at boot time (UserData or cfn-init)
• Easy to scale to many instances, even in different accounts
• Large choice of EBS volume types (cost vs. performance)
• Caveat: no sharing for distributed training, copying is required
6. Storing data sets in Amazon S3
• MXNet has an S3 connector: build with the option USE_S3=1
https://mxnet.incubator.apache.org/how_to/s3_integration.html
• Best durability (11 9’s)
• Distributed training possible
• Caveats
• Lower performance than EBS-optimized instances
• Beware of hot spots if a lot of instances are running
https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
train_dataiter = mx.io.MNISTIter(
image="s3://bucket-name/training-data/train-images-idx3-ubyte",
label="s3://bucket-name/training-data/train-labels-idx1-ubyte", ...
7. Storing data sets in Amazon EFS
1. Copy your data set to an EFS volume
2. Mount the volume on instances
• Simple way to set up distributed training (no copying required)
• Caveats
• You probably want the “Max I/O” performance mode, but I’d test both
to see if latency is an issue or not
• EFS is more expensive than S3 and EBS: use it for training only, not
for long-term storage
8. Maximizing GPU usage
• GPUs need a high-throughput, stable flow of training data to run at top speed
• Large datasets cannot fit in RAM
• Adding more GPUs requires more throughput
• How can we check that training is running at full speed?
• Keep track of performance indicators from previous trainings (images / sec, etc.)
• Look at performance indicators and benchmarks reported by others
• Use nvidia-smi
• Look at power consumption, GPU utilization and GPU RAM
• All these values should be maxed out and stable
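The nvidia-smi check above can be scripted. The sketch below is one possible way to do it, not something shown in the talk: it parses the CSV output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` and flags GPUs that look starved. The 90% threshold, the helper names and the sample text are illustrative assumptions.

```python
import subprocess

# Real nvidia-smi --query-gpu properties; the threshold below is an assumption.
QUERY = "index,utilization.gpu,power.draw,power.limit,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse 'nvidia-smi --format=csv,noheader,nounits' output into dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        idx, util, draw, limit, used, total = [f.strip() for f in line.split(",")]
        stats.append({
            "gpu": int(idx),
            "util_pct": float(util),
            "power_pct": 100.0 * float(draw) / float(limit),
            "mem_pct": 100.0 * float(used) / float(total),
        })
    return stats


def looks_starved(stat, min_util_pct=90.0):
    """Flag a GPU whose utilization is below the (assumed) threshold."""
    return stat["util_pct"] < min_util_pct


def read_gpu_stats():
    """Run nvidia-smi once; requires an NVIDIA driver on the machine."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader,nounits"]
    )
    return parse_gpu_stats(out.decode())


# Hypothetical captured output, so the sketch also runs without a GPU.
sample = "0, 98, 140.5, 149.0, 10800, 11441\n1, 35, 60.2, 149.0, 10800, 11441\n"
gpu_stats = parse_gpu_stats(sample)
starved = [s["gpu"] for s in gpu_stats if looks_starved(s)]
```

On a GPU instance, calling `read_gpu_stats()` in a loop during training gives a rough view of whether the values stay high and stable. Some GPUs report `[N/A]` for power fields, which this sketch does not handle.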
9. Maximizing GPU usage: batch size
• Picking a batch size is a tradeoff between training speed and accuracy
• Larger batch size is more computationally efficient
• Smaller batch size helps find a better minimum
• Smaller data sets, few classes (MNIST, CIFAR)
• Start with 32*GPU_COUNT
• 1024 is probably the largest reasonable batch size
• Large data sets, many classes (ImageNet)
• Use the largest possible batch size
• Start at 32*GPU_COUNT and increase it until MXNet OOMs
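The "start at 32*GPU_COUNT and increase" rule can be sketched as a small search loop. This is only an illustration: `fits_in_memory` is a hypothetical callback that you would implement yourself (for example, run a few training batches and return False when the framework reports an out-of-memory error), and the `limit` cap is an assumption.

```python
def find_largest_batch_size(gpu_count, fits_in_memory, step=None, limit=4096):
    """Start at 32 * gpu_count and grow until fits_in_memory(batch) is False.

    fits_in_memory is a hypothetical callback supplied by the caller.
    Returns the largest batch size that fit, or None if even the first failed.
    """
    start = 32 * gpu_count
    step = step or start
    best = None
    batch = start
    while batch <= limit and fits_in_memory(batch):
        best = batch
        batch += step
    return best


# Toy stand-in for a real memory check: pretend anything above 300 runs out of memory.
largest = find_largest_batch_size(gpu_count=2, fits_in_memory=lambda b: b <= 300)
```

For small data sets, passing a lower `limit` (such as 1024) matches the advice on the slide.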
10. Maximizing GPU usage: compute & I/O
• Check power consumption and GPU usage after each modification
• If they’re not maxed out, GPUs are probably stalling
• Can the Python process keep up? Loading images, pre-processing, etc.
• Use top to check load and count threads
• Use RecordIO and add more decoding threads
• Can the I/O layer keep up?
• Use iostat to look at volume stats
• Use faster storage: SSD or even a ramdisk!
11. Using distributed training
• MXNet scales almost linearly up to 256 GPUs
http://www.allthingsdistributed.com/2016/11/mxnet-default-framework-deep-learning-aws.html
• Easy to set up
https://mxnet.incubator.apache.org/how_to/multi_devices.html
• Blog post + AWS CloudFormation template
https://aws.amazon.com/blogs/compute/distributed-deep-learning-made-easy/
• Master node must have SSH access to slave nodes
• Data set must be accessible on all nodes
• Shared storage: great!
• No shared storage: automatic copy with rsync
12. What about CPU training?
• Several libraries help speed up Deep Learning on CPUs
• Fast implementation of math primitives
• Dedicated instruction sets, e.g. Intel AVX or ARM NEON
• Fast memory allocation
• Intel Math Kernel Library https://software.intel.com/en-us/mkl USE_MKL = 1
• NNPACK https://github.com/Maratyszcza/NNPACK USE_NNPACK = 1
• Libjpeg-turbo https://www.libjpeg-turbo.org/ USE_TURBO_JPEG = 1
• Jemalloc http://jemalloc.net/ USE_JEMALLOC = 1
• Google Perf Tools https://github.com/gperftools USE_GPERFTOOLS = 1
13. Intel® MKL-DNN: Math Kernel Library for Deep Neural Networks
• For developers of deep learning frameworks featuring optimized performance on Intel hardware
http://github.com/01org/mkl-dnn
• Distribution details: open source, Apache 2.0 License
• Common DNN APIs across all Intel hardware
• Rapid release cycles, iterated with the DL community, to best support industry framework integration
• Highly vectorized & threaded for maximal performance, based on the popular Intel® MKL library
• Examples: direct 2D convolution, rectified linear unit neuron activation (ReLU), maximum pooling, inner product, local response normalization (LRN)
14. Optimizing cost
• Use Spot instances
https://aws.amazon.com/blogs/aws/natural-language-processing-at-clemson-university-1-1-million-vcpus-ec2-spot-instances/
• p2.16xlarge: 89% discount
• Sharing is caring: it’s easy to share an instance for multiple jobs
mod = mx.mod.Module(lenet, context=(mx.gpu(7), mx.gpu(8), mx.gpu(9)))
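When several jobs share one multi-GPU instance, each job needs its own disjoint set of GPU ids. A minimal sketch of such a split follows; the job names and the helper are hypothetical, and the 16 ids assume a p2.16xlarge.

```python
def split_gpus(gpu_ids, jobs):
    """Assign disjoint, contiguous groups of GPU ids to each job name."""
    per_job, extra = divmod(len(gpu_ids), len(jobs))
    plan, start = {}, 0
    for i, job in enumerate(jobs):
        size = per_job + (1 if i < extra else 0)  # spread the remainder
        plan[job] = gpu_ids[start:start + size]
        start += size
    return plan


# Hypothetical job names sharing the 16 GPUs of one instance.
plan = split_gpus(list(range(16)), ["lenet", "resnet", "vgg"])
```

Each list of ids would then be turned into a context list for its job, for example `[mx.gpu(i) for i in plan["lenet"]]`.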
17. Using data augmentation
• Data augmentation lets you add more samples to smaller data sets
• Even a large data set may benefit from it and generalize better
• The ImageRecordIter object lets you do that easily from a RecordIO image file
• Images: crop, rotate, change colors, etc.
• https://mxnet.incubator.apache.org/api/python/io.html#mxnet.io.ImageRecordIter
• Careful: this processing is performed by the Python process: add more threads!
data_iter = mx.io.ImageRecordIter(path_imgrec="./data/caltech_train.rec",
    data_shape=(3, 227, 227),
    batch_size=4,
    resize=256,
    …
    # you can add more augmentation options here.
    # use help(mx.io.ImageRecordIter) to see all possible choices
)
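To make the idea of augmentation concrete, here is a library-free sketch of two common transformations, random crop and random horizontal mirror, applied to a toy image stored as a list of rows. It is not how ImageRecordIter is implemented; it roughly corresponds to what its `rand_crop` and `rand_mirror` options ask for. The helper names are illustrative.

```python
import random


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position (image: list of rows)."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def random_mirror(image, rng):
    """Flip the image horizontally with probability 0.5."""
    return [row[::-1] for row in image] if rng.random() < 0.5 else image


def augment(image, crop_h, crop_w, rng):
    """Apply a random crop followed by a random mirror."""
    return random_mirror(random_crop(image, crop_h, crop_w, rng), rng)


rng = random.Random(0)
image = [[10 * r + c for c in range(6)] for r in range(6)]  # toy 6x6 "image"
samples = [augment(image, 4, 4, rng) for _ in range(3)]
```

Each call yields a slightly different 4x4 view of the same source image, which is how a small data set is stretched into more training samples.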
18. Picking an initializer
• MXNet supports many different initializers
https://mxnet.incubator.apache.org/api/python/optimization.html
• Initial weights should be neither “too large” nor “too small”
• There seems to be some sort of consensus on:
https://www.quora.com/What-are-good-initial-weights-in-a-neural-network
• Xavier for Convolutional Neural Networks
• Random values between 0 and 1 for everything else
• I wouldn’t use anything else unless I really knew better
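As an illustration of what "neither too large nor too small" means for Xavier, the sketch below draws weights uniformly from [-scale, scale], with scale = sqrt(magnitude / average of fan-in and fan-out). This follows the commonly described uniform Xavier scheme with a magnitude of 3; it is a pure-Python sketch under that assumption, not MXNet's own code.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng, magnitude=3.0):
    """Draw a fan_out x fan_in weight matrix from U(-scale, scale).

    scale shrinks as layers get wider, which keeps activations in a
    reasonable range from one layer to the next.
    """
    scale = math.sqrt(magnitude / ((fan_in + fan_out) / 2.0))
    weights = [[rng.uniform(-scale, scale) for _ in range(fan_in)]
               for _ in range(fan_out)]
    return weights, scale


rng = random.Random(42)
weights, scale = xavier_uniform(fan_in=256, fan_out=128, rng=rng)
```

In MXNet itself you would pass an initializer object such as `mx.init.Xavier()` to the module instead of generating weights by hand.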
19. Managing the learning rate
• The learning rate is probably the most discussed parameter in Deep Learning
• Too small: your model may never converge
• Too large: your model may never reach a minimum
• Try keeping a large learning rate for a long time, then reduce it
• Here are common techniques you could use with MXNet:
1. Use a fixed learning rate
2. Use steps: scale the learning rate
• once a number of batches have been completed,
• after each epoch,
• once specific epochs have been completed
3. Use an optimizer which automatically adapts the learning rate
20. Scaling the learning rate with steps
• Number of steps per epoch = number of samples / batch size / number of distributed workers
• FactorScheduler object: update the learning rate after ‘n’ steps
• MultiFactorScheduler object: update the learning rate after specific step counts
• MXNet scripts let you use command-line parameters (--step-epochs)
https://github.com/apache/incubator-mxnet/tree/master/example/image-classification
lr_sch = mx.lr_scheduler.FactorScheduler(step=100, factor=0.9)
mod.init_optimizer( ... optimizer='sgd', optimizer_params=(('learning_rate', 0.1),
('lr_scheduler', lr_sch)))
steps = [100, 200, 250, 300, 325, 350]  # each step must be >= 1, in increasing order
lr_sch = mx.lr_scheduler.MultiFactorScheduler(step=steps, factor=0.9)
mod.init_optimizer( ... optimizer='sgd', optimizer_params=(('learning_rate', 0.1),
('lr_scheduler', lr_sch)))
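The schedulers count steps (batches), while people usually think in epochs. The sketch below converts epoch boundaries into step counts using the formula above, then computes the learning rate that a multi-factor schedule would give at a given update. The helper names are illustrative, and `lr_after` is a plain-Python model of the schedule, assuming the rate is scaled once the update count exceeds each boundary.

```python
def steps_per_epoch(num_samples, batch_size, num_workers=1):
    """Number of samples / batch size / number of distributed workers."""
    return num_samples // (batch_size * num_workers)


def epochs_to_steps(step_epochs, num_samples, batch_size, num_workers=1):
    """Convert epoch boundaries into step counts for a step-based scheduler."""
    per_epoch = steps_per_epoch(num_samples, batch_size, num_workers)
    return [per_epoch * epoch for epoch in step_epochs]


def lr_after(num_update, base_lr, steps, factor):
    """Learning rate once every boundary below num_update has been applied."""
    passed = sum(1 for s in steps if num_update > s)
    return base_lr * factor ** passed


# MNIST-sized example: 60,000 samples, batch size 128, one worker.
steps = epochs_to_steps([10, 15], num_samples=60000, batch_size=128)
lr_mid = lr_after(5000, base_lr=0.1, steps=steps, factor=0.1)
```

The resulting `steps` list is the kind of value you would pass as `step=` to `mx.lr_scheduler.MultiFactorScheduler`.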
21. Picking an optimizer
• MXNet supports many different optimizers
https://mxnet.incubator.apache.org/api/python/optimization.html
http://ruder.io/optimizing-gradient-descent/
• It’s unlikely that a single one will work best every time. Experiment!
• Several SGD variants adapt the learning rate during training
• Some of them even use a specific learning rate for each parameter
Example: learning MNIST with the LeNet CNN (20 epochs)
Algorithm            SGD    NAG    Adam   NAdam  AdaGrad  AdaMax
Time / epoch         2.5s   2.55s  18.5s  15.1s  5.7s     7.5s
Validation accuracy  98.5%  98.5%  98.3%  98.4%  99.2%    98.55%
22. Reducing model size
• Complex neural networks are too large for resource-constrained environments
• MXNet supports Mixed Precision Training
• Use float16 instead of float32
• Almost 2x reduction in memory consumption, no loss of accuracy
• https://devblogs.nvidia.com/parallelforall/mixed-precision-training-deep-neural-networks/
• http://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#mxnet
• BMXNet: Binary Neural Network Implementation
• Use binary values for weights and activations
• 20x to 30x reduction in model size, with limited loss of accuracy
• https://github.com/hpi-xnor/BMXNet
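The memory saving from float16 is easy to see without any deep learning library. The sketch below uses only Python's `struct` module to store the same values in float32 and float16 and to measure the rounding error that half precision introduces. The weight values are arbitrary examples, and this says nothing about how MXNet implements mixed precision.

```python
import struct


def packed_size(values, fmt):
    """Bytes needed to store values as 'e' (float16) or 'f' (float32)."""
    return len(struct.pack("<%d%s" % (len(values), fmt), *values))


def roundtrip_fp16(x):
    """Store x as float16 and read it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


weights = [0.1, -0.25, 3.0, 1e-3]  # arbitrary example values
size32 = packed_size(weights, "f")
size16 = packed_size(weights, "e")
errors = [abs(w - roundtrip_fp16(w)) for w in weights]
```

Here `size16` is half of `size32`, and the per-value errors are small but not always zero, which is why mixed precision training typically keeps some quantities in float32.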
23. Monitoring the training process
• You can run callbacks at the end of each batch and at the end of each epoch.
• This allows you to display training speed…
• … and save parameters after each epoch
module.fit(iterator, num_epoch=n_epoch, ...
batch_end_callback=mx.callback.Speedometer(64, 10))
Epoch[0] Batch [10] Speed: 1910.41 samples/sec Train-accuracy=0.200000
Epoch[0] Batch [20] Speed: 1764.83 samples/sec Train-accuracy=0.400000
module.fit(iterator, num_epoch=n_epoch, ...
epoch_end_callback = mx.callback.do_checkpoint("mymodel", 1))
Start training with [cpu(0)]
Epoch[0] Resetting Data Iterator
Epoch[0] Time cost=0.100 Saved checkpoint to "mymodel-0001.params"
Epoch[1] Resetting Data Iterator
Epoch[1] Time cost=0.060 Saved checkpoint to "mymodel-0002.params"
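A callback is just a callable that receives a parameter object after each batch. As a sketch of how a custom one could look, the class below reports throughput in a format similar to the Speedometer output above. It relies only on the `epoch` and `nbatch` fields of the callback parameter; the `BatchParam` stand-in, the class name and the injectable `clock` are assumptions made so that the example runs without MXNet.

```python
import time
from collections import namedtuple

# Stand-in for the object passed to batch-end callbacks; only the fields
# used below are modelled here.
BatchParam = namedtuple("BatchParam", ["epoch", "nbatch"])


class ThroughputLogger:
    """Record samples/sec every `frequent` batches."""

    def __init__(self, batch_size, frequent, clock=time.time):
        self.batch_size = batch_size
        self.frequent = frequent
        self.clock = clock
        self.mark = None  # (time, nbatch) at the last report
        self.lines = []

    def __call__(self, param):
        now = self.clock()
        if self.mark is None or param.nbatch < self.mark[1]:
            self.mark = (now, param.nbatch)  # first batch, or a new epoch
            return
        done = param.nbatch - self.mark[1]
        if done >= self.frequent and now > self.mark[0]:
            speed = done * self.batch_size / (now - self.mark[0])
            self.lines.append("Epoch[%d] Batch [%d] Speed: %.2f samples/sec"
                              % (param.epoch, param.nbatch, speed))
            self.mark = (now, param.nbatch)


# Simulate 21 batches taking 0.5 s each, using a fake clock.
ticks = iter(i * 0.5 for i in range(21))
logger = ThroughputLogger(batch_size=64, frequent=10, clock=lambda: next(ticks))
for nbatch in range(21):
    logger(BatchParam(epoch=0, nbatch=nbatch))
```

With a real training run, an instance would be passed as `batch_end_callback=ThroughputLogger(64, 10)`; that usage has not been tested here against a live MXNet install.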
25. Conclusion
• There is a lot of literature on selecting and tweaking hyper-parameters
• You should definitely read it but please experiment with your own data
• Train 1,000 models and pick the best one
• Optimizing infrastructure is all the more important, then!
• Make sure all parts are firing on all cylinders
• Spot instances!
• I hope this was useful. Please don’t forget to send your feedback
• Go build cool stuff and let me know! Happy to share and retweet
28. THANK YOU!
Julien Simon, Principal AI/ML Evangelist, EMEA
@julsimon
Editor's notes
ImageNet: 1.2 million files, 152 GB
Intel® MKL-DNN (Math Kernel Library for Deep Neural Networks) is highly optimized using industry leading techniques and low level assembly code where appropriate. The API has been developed with feedback and interaction with the major framework owners, and as an open source project will track new and emerging trends in these frameworks. Intel is using this internally for our work in optimizing industry frameworks, as well as supporting the industry in their optimizations.