2. What to expect from this session
• Techniques and tips to optimize training on Apache MXNet
• Infrastructure performance: storage and I/O, GPU throughput, distributed
training, CPU-based training, cost
• Model performance: data augmentation, initializers, optimizers, etc.
• Level 666: you should be familiar with Deep Learning and MXNet
4. Deploying data sets to instances
• Deep Learning training sets are often very large, with a huge number of files
• How can we deploy them quickly, easily and reliably to instances?
• We strongly recommend packing the training set in a RecordIO file (see the sketch below)
• https://mxnet.incubator.apache.org/architecture/note_data_loading.html
• https://mxnet.incubator.apache.org/how_to/recordio.html
• Only one file to move around!
• Worth the effort: pack once, train many times
• In any case, you need to copy your data set to a central location
• Let’s look at Amazon EBS, Amazon S3 and Amazon EFS
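As an illustration, here is a minimal sketch of packing a data set into an indexed RecordIO file with mx.recordio, assuming OpenCV is installed; the file names and labels below are placeholders. In practice, the bundled tools/im2rec.py script does the same for a whole image directory.

import cv2
import mxnet as mx

samples = [("img_0001.jpg", 0), ("img_0002.jpg", 1)]  # placeholder (file, label) pairs

# Write an indexed RecordIO file: one .idx file, one .rec file
record = mx.recordio.MXIndexedRecordIO("train.idx", "train.rec", "w")
for i, (filename, label) in enumerate(samples):
    img = cv2.imread(filename)
    header = mx.recordio.IRHeader(flag=0, label=label, id=i, id2=0)
    # pack_img JPEG-encodes the image and prepends the header
    record.write_idx(i, mx.recordio.pack_img(header, img, quality=95, img_fmt=".jpg"))
record.close()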
5. Storing data sets in Amazon EBS
1. Prepare your data set on a dedicated EBS volume
2. Take a snapshot
3. Deploying to a new instance only takes a few seconds
a. Create a volume from the snapshot
b. Attach the volume to the instance
c. Mount the volume
• Easy to automate, including at boot time (UserData or cfn-init); see the sketch below
• Easy to scale to many instances, even in different accounts
• Large choice of EBS volume types (cost vs. performance)
• Caveat: no sharing for distributed training, copying is required
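A hedged sketch of the create/attach steps with boto3; the snapshot, Availability Zone, instance and device values are placeholders.

import boto3

ec2 = boto3.client("ec2")
# Create a volume from the data set snapshot (placeholder IDs)
volume = ec2.create_volume(SnapshotId="snap-0123456789abcdef0",
                           AvailabilityZone="us-east-1a",
                           VolumeType="gp2")
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
# Attach it to the training instance, then mount it there
ec2.attach_volume(VolumeId=volume["VolumeId"],
                  InstanceId="i-0123456789abcdef0",
                  Device="/dev/xvdf")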
6. Storing data sets in Amazon S3
• MXNet has an S3 connector (build option USE_S3=1)
https://mxnet.incubator.apache.org/how_to/s3_integration.html
• Best durability (11 9’s)
• Distributed training possible
• Caveats
• Lower performance than EBS-optimized instances
• Beware of hot spots if a lot of instances are running
https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
train_dataiter = mx.io.MNISTIter(
    image="s3://bucket-name/training-data/train-images-idx3-ubyte",
    label="s3://bucket-name/training-data/train-labels-idx1-ubyte", ...)
7. Storing data sets in Amazon EFS
1. Copy your data set to an EFS volume
2. Mount the volume on instances
• Simple way to set up distributed training (no copying required)
• Caveats
• You probably want the “Max I/O” performance mode, but I’d test both modes to see whether latency is an issue
• EFS is more expensive than S3 and EBS: use it for training only, not
for long-term storage
8. Maximizing GPU usage
• GPUs need a high-throughput, stable flow of training data to run at top speed
• Large datasets cannot fit in RAM
• Adding more GPUs requires more throughput
• How can we check that training is running at full speed?
• Keep track of performance indicators from previous trainings (images / sec, etc.)
• Look at performance indicators and benchmarks reported by others
• Use nvidia-smi
• Look at power consumption, GPU utilization and GPU RAM
• All these values should be maxed out and stable
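For instance, you can poll these indicators from a script during training; this sketch shells out to nvidia-smi (assumed to be on the PATH) once per second.

import subprocess
import time

# Print GPU utilization, power draw and memory usage every second
while True:
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,power.draw,memory.used",
         "--format=csv,noheader"])
    print(out.decode().strip())
    time.sleep(1)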
9. Maximizing GPU usage: batch size
• Picking a batch size is a tradeoff between training speed and accuracy
• Larger batch size is more computationally efficient
• Smaller batch size helps find a better minimum
• Small data sets with few classes (MNIST, CIFAR)
• Start with 32*GPU_COUNT
• 1024 is probably the largest reasonable batch size
• Large data sets with lots of classes (ImageNet)
• Use the largest possible batch size
• Start at 32*GPU_COUNT and increase it until MXNet runs out of GPU memory (OOM)
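A minimal sketch of scaling the batch size with the GPU count; the network and the data file are placeholders.

import mxnet as mx

GPU_COUNT = 4
batch_size = 32 * GPU_COUNT  # starting point: 32 samples per GPU

train_iter = mx.io.ImageRecordIter(path_imgrec="train.rec",  # placeholder file
                                   data_shape=(3, 224, 224),
                                   batch_size=batch_size)

data = mx.sym.Variable("data")
net = mx.sym.SoftmaxOutput(mx.sym.FullyConnected(data, num_hidden=10), name="softmax")
# Each batch is split evenly across all GPUs
mod = mx.mod.Module(net, context=[mx.gpu(i) for i in range(GPU_COUNT)])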
10. Maximizing GPU usage: compute & I/O
• Check power consumption and GPU usage after each modification
• If they’re not maxed out, GPUs are probably stalling
• Can the Python process keep up? Loading images, pre-processing, etc.
• Use top to check load and count threads
• Use RecordIO and add more decoding threads (see the sketch below)
• Can the I/O layer keep up?
• Use iostat to look at volume stats
• Use faster storage: SSD or even a ramdisk!
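On the Python side, the number of decoding threads is a parameter of ImageRecordIter; a sketch with placeholder values:

data_iter = mx.io.ImageRecordIter(path_imgrec="train.rec",  # placeholder file
                                  data_shape=(3, 224, 224),
                                  batch_size=128,
                                  preprocess_threads=8)  # default is 4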
11. Using distributed training
• MXNet scales almost linearly up to 256 GPUs
http://www.allthingsdistributed.com/2016/11/mxnet-default-framework-deep-learning-aws.html
• Easy to set up
https://mxnet.incubator.apache.org/how_to/multi_devices.html
• Blog post + AWS CloudFormation template
https://aws.amazon.com/blogs/compute/distributed-deep-learning-made-easy/
• Master node must have SSH access to slave nodes
• Data set must be accessible on all nodes
• Shared storage: great!
• No shared storage: automatic copy with rsync
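In the training script itself, distributed training mostly comes down to the choice of key-value store; a sketch, assuming the cluster was launched with MXNet's launch.py tool (mod and train_iter as in the earlier sketches):

import mxnet as mx

# Synchronous SGD across workers ('dist_async' is also available)
kv = mx.kvstore.create("dist_sync")
mod.fit(train_iter, num_epoch=10, kvstore=kv)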
12. What about CPU training?
• Several libraries help speed up Deep Learning on CPUs
• Fast implementation of math primitives
• Dedicated instruction sets, e.g. Intel AVX or ARM NEON
• Fast memory allocation
• Intel Math Kernel Library (USE_MKL=1) https://software.intel.com/en-us/mkl
• NNPACK (USE_NNPACK=1) https://github.com/Maratyszcza/NNPACK
• libjpeg-turbo (USE_TURBO_JPEG=1) https://www.libjpeg-turbo.org/
• jemalloc (USE_JEMALLOC=1) http://jemalloc.net/
• Google Perf Tools (USE_GPERFTOOLS=1) https://github.com/gperftools
13. Intel® MKL-DNN: Math Kernel Library for Deep Neural Networks
• Open source, Apache 2.0 license http://github.com/01org/mkl-dnn
• For developers of deep learning frameworks featuring optimized performance on Intel hardware
• Common DNN APIs across all Intel hardware
• Rapid release cycles, iterated with the DL community, to best support industry framework integration
• Highly vectorized and threaded for maximal performance, based on the popular Intel® MKL library
• Example primitives: direct 2D convolution, ReLU neuron activation, maximum pooling, inner product, local response normalization (LRN)
14. Optimizing cost
• Use Spot Instances: discounts can be huge (e.g. 89% on a p2.16xlarge)
https://aws.amazon.com/blogs/aws/natural-language-processing-at-clemson-university-1-1-million-vcpus-ec2-spot-instances/
• Sharing is caring: it’s easy to share an instance between multiple jobs
mod = mx.mod.Module(lenet, context=[mx.gpu(7), mx.gpu(8), mx.gpu(9)])
17. Using data augmentation
• Data augmentation lets you add more samples to smaller data sets
• Even a large data set may benefit from it and generalize better
• The ImageRecordIter object lets you do that easily from a RecordIO image file
• Images: crop, rotate, change colors, etc.
• https://mxnet.incubator.apache.org/api/python/io.html#mxnet.io.ImageRecordIter
• Careful: this processing is performed by the Python process, so add more decoding threads!
data_iter = mx.io.ImageRecordIter(path_imgrec="./data/caltech_train.rec",
                                  data_shape=(3, 227, 227),
                                  batch_size=4,
                                  resize=256,
                                  ...
                                  # You can add more augmentation options here.
                                  # Use help(mx.io.ImageRecordIter) to see all possible choices.
                                  )
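For reference, a few common augmentation parameters; the names are real ImageRecordIter options, but the values here are purely illustrative.

data_iter = mx.io.ImageRecordIter(path_imgrec="./data/caltech_train.rec",
                                  data_shape=(3, 227, 227),
                                  batch_size=4,
                                  resize=256,
                                  rand_crop=True,       # random cropping
                                  rand_mirror=True,     # random horizontal flip
                                  max_rotate_angle=15,  # rotate up to +/- 15 degrees
                                  brightness=0.2)       # brightness jitter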
18. Picking an initializer
• MXNet supports many different initializers
https://mxnet.incubator.apache.org/api/python/optimization.html
• Initial weights should be neither “too large” nor “too small”
• There seems to be some sort of consensus on:
https://www.quora.com/What-are-good-initial-weights-in-a-neural-network
• Xavier for Convolutional Neural Networks
• Random values between 0 and 1 for everything else
• I wouldn’t use anything else unless I really knew better
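A minimal sketch of setting the initializer on a Module (mod is a placeholder from the earlier examples; the Xavier arguments shown are optional):

import mxnet as mx

# Xavier for a CNN; mx.init.Uniform(scale) draws uniform random values instead
mod.init_params(initializer=mx.init.Xavier(rnd_type="gaussian",
                                           factor_type="in",
                                           magnitude=2))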
19. Managing the learning rate
• The learning rate is probably the most discussed parameter in Deep Learning
• Too small: your model may converge too slowly, or never get there at all
• Too large: your model may overshoot and never reach a minimum
• Try keeping a large learning rate for a long time, then reduce it
• Here are common techniques you could use with MXNet:
1. Use a fixed learning rate
2. Use steps: scale the learning rate
• once a number of batches have been completed,
• after each epoch,
• once specific epochs have been completed
3. Use an optimizer which automatically adapts the learning rate
20. Scaling the learning rate with steps
• Steps per epoch = number of samples / batch size / number of distributed workers
• FactorScheduler object: update the learning rate every ‘n’ steps
• MultiFactorScheduler object: update the learning rate at specific step counts
• MXNet scripts let you use command-line parameters (--step-epochs)
https://github.com/apache/incubator-mxnet/tree/master/example/image-classification
lr_sch = mx.lr_scheduler.FactorScheduler(step=100, factor=0.9)
mod.init_optimizer(..., optimizer='sgd',
                   optimizer_params=(('learning_rate', 0.1), ('lr_scheduler', lr_sch)))

steps = [0, 100, 200, 250, 300, 325, 350]
lr_sch = mx.lr_scheduler.MultiFactorScheduler(step=steps, factor=0.9)
mod.init_optimizer(..., optimizer='sgd',
                   optimizer_params=(('learning_rate', 0.1), ('lr_scheduler', lr_sch)))
21. Picking an optimizer
• MXNet supports many different optimizers
https://mxnet.incubator.apache.org/api/python/optimization.html
http://ruder.io/optimizing-gradient-descent/
• It’s unlikely that a single one will work best every time. Experiment!
• Several SGD variants adapt the learning rate during training
• Some of them even use a specific learning rate for each parameter
Example: learning MNIST with the LeNet CNN (20 epochs)

Algorithm             SGD     NAG     Adam    NAdam   AdaGrad  AdaMax
Time / epoch          2.5s    2.55s   18.5s   15.1s   5.7s     7.5s
Validation accuracy   98.5%   98.5%   98.3%   98.4%   99.2%    98.55%
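Switching optimizers is a one-line change at fit time; a sketch using AdaGrad from the table above (mod and train_iter are placeholders, and the learning rate is illustrative):

mod.fit(train_iter, num_epoch=20,
        optimizer='adagrad',
        optimizer_params={'learning_rate': 0.01})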
22. Reducing model size
• Complex neural networks are too large for resource-constrained environments
• MXNet supports Mixed Precision Training
• Use float16 instead of float32
• Almost 2x reduction in memory consumption, no loss of accuracy
• https://devblogs.nvidia.com/parallelforall/mixed-precision-training-deep-neural-networks/
• http://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#mxnet
• BMXNet: Binary Neural Network Implementation
• Use binary values for weights and activations
• 20x to 30x reduction in model size, with limited loss
• https://github.com/hpi-xnor/BMXNet
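A hedged sketch of the float16 idea with the symbolic API: cast the input to float16, compute in float16, and cast back to float32 before the loss. The single layer here is a placeholder network.

import mxnet as mx

data = mx.sym.Variable("data")
data = mx.sym.Cast(data=data, dtype="float16")        # compute in float16
fc = mx.sym.FullyConnected(data=data, num_hidden=10)  # placeholder layer
out = mx.sym.Cast(data=fc, dtype="float32")           # back to float32
net = mx.sym.SoftmaxOutput(data=out, name="softmax")  # loss in float32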
23. Monitoring the training process
• You can run callbacks at the end of each batch and at the end of each epoch.
• This allows you to display training speed…
• … and save parameters after each epoch
module.fit(iterator, num_epoch=n_epoch, ...,
           batch_end_callback=mx.callback.Speedometer(64, 10))

Epoch[0] Batch [10] Speed: 1910.41 samples/sec Train-accuracy=0.200000
Epoch[0] Batch [20] Speed: 1764.83 samples/sec Train-accuracy=0.400000

module.fit(iterator, num_epoch=n_epoch, ...,
           epoch_end_callback=mx.callback.do_checkpoint("mymodel", 1))

Start training with [cpu(0)]
Epoch[0] Resetting Data Iterator
Epoch[0] Time cost=0.100
Saved checkpoint to "mymodel-0001.params"
Epoch[1] Resetting Data Iterator
Epoch[1] Time cost=0.060
Saved checkpoint to "mymodel-0002.params"
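You can also pass your own function as a callback; it receives a BatchEndParam namedtuple carrying the current epoch, batch count and metric. A minimal sketch (iterator and n_epoch as above):

def my_batch_callback(param):
    # param is a mx.model.BatchEndParam (epoch, nbatch, eval_metric, locals)
    if param.nbatch % 100 == 0:
        print("epoch %d, batch %d" % (param.epoch, param.nbatch))

module.fit(iterator, num_epoch=n_epoch,
           batch_end_callback=my_batch_callback)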
25. Conclusion
• There is a lot of literature on selecting and tweaking hyper-parameters
• You should definitely read it but please experiment with your own data
• Train 1,000 models and pick the best one
• Optimizing infrastructure is all the more important, then!
• Make sure all parts are firing on all cylinders
• Spot instances!
• I hope this was useful. Please don’t forget to send your feedback
• Go build cool stuff and let me know! Happy to share and retweet