[AWS Tech Talk] Using containers for deep learning workflows

Slides supporting the following webinar:
https://www.youtube.com/watch?v=wbDVGAbd_dM

  1. Using Containers for Deep Learning Workflows. Shashank Prasanna, Sr. Technical Evangelist, AI/ML. 30th September 2019. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. Agenda: Common deep learning setups and challenges; Using containers for deep learning workflows (Demo 1: Containers for deep learning training workflows); Scaling deep learning training (Demo 2: Submitting training jobs using containers to Amazon Elastic Kubernetes Service (Amazon EKS); Demo 3: Running large-scale experiments using containers on Amazon SageMaker); Summary and Q&A
  3. Common machine learning setups: (1) Code & frameworks, (2) Compute (CPUs, GPUs), (3) Storage. Diagram labels: CLI, EC2 instance, DL AMI, Amazon S3; CLI, On-premises
  4. Deep learning workflow: Data acquisition, curation and labeling; Data preparation for training; Large-scale experimentation; Distributed training; Model optimization and validation; Deployment. Need for scale
  5. Deep learning is computationally expensive, but can be scaled out: from this (CLI, EC2 instance) to this (CLI, Cluster)
  6. Scaling out deep learning training. Distributed training: distributing training of a single model to train faster. Parallel experiments: different models running in parallel to find the best model
  7. …but there are challenges to scaling (CLI, Cluster): (1) Code and dependencies, (2) Infrastructure management, (3) Cluster management
  8. Machine learning stack is complex (challenge 1: Code and dependencies) • “My code requires building several dependencies from source” • “My code isn’t taking advantage of the GPU/GPUs” • “Are cudnn and nccl installed, and are they the right versions?” • “My code is running slow on CPUs” • “Oh wait, is it taking advantage of the AVX instruction set?!?” • “I updated my drivers and training is now slower/errors out” • “My cluster runs a different version of the framework/Linux distro” This makes portability, collaboration, and scaling training really hard!
  9. Multiple points of failure (challenge 1: Code and dependencies). Development system: NVIDIA drivers 436.15; Ubuntu 16.04; TensorFlow 1.13; CPU: mkl 2019 v3; GPU: cudnn 7.1, cublas 10, nccl 2, CUDA toolkit 10; Keras, horovod, numpy, scipy, scikit-learn, pandas, openmpi, Python, others…; My code. Training cluster: NVIDIA drivers 410.68; Centos 7; TensorFlow 1.14; CPU: mkl 2019 v2; GPU: cudnn 7.5, cublas 10, nccl 2.4, CUDA toolkit 10; Keras, horovod, numpy, scipy, scikit-learn, pandas, openmpi, Python, others…; My code.
  10. Containers for Machine Learning (challenge 1: Code and dependencies). Infrastructure: NVIDIA drivers, Host OS, Container runtime. TensorFlow Container Image packages: TensorFlow; CPU: mkl; GPU: cudnn, cublas, nccl, CUDA toolkit; Keras, horovod, numpy, scipy, scikit-learn, pandas, openmpi, Python, others…; + Your training scripts. ML environments that are:
  11. Development system (NVIDIA drivers, Host OS, Container runtime): push the TensorFlow Container Image + Your training scripts to the Container registry. Training cluster (NVIDIA drivers, Host OS, Container runtime): pull the same image + Your training scripts. Image packages: TensorFlow; CPU: mkl; GPU: cudnn, cublas, nccl, CUDA toolkit; Keras, horovod, numpy, scipy, scikit-learn, pandas, openmpi, Python, others…
  12. AWS Deep Learning Containers (challenge 1: Code and dependencies): https://docs.aws.amazon.com/dlami/latest/devguide/deep-learning-containers-images.html
  13. DEMO 1: Containers for deep learning workflows. Diagram labels: AWS Cloud; Amazon ECR: deep learning container images (AWS DL containers); EC2 instance with GPUs; CLI; Amazon EBS: datasets and checkpoints
  14. Challenges with scaling deep learning (CLI, Cluster): Code and dependencies; Infrastructure management; Cluster management
  15. ML infrastructure and cluster management. Image registry (container image repository): Amazon Elastic Container Registry (Amazon ECR). Compute (where the containers run): Amazon EC2. Management (deployment, scheduling, scaling, and management of containerized applications): Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS). ML services (fully-managed service that covers the entire machine learning workflow): Amazon SageMaker, with Jupyter notebook instances, high performance algorithms, large-scale training, optimization, one-click deployment, fully managed with auto-scaling
  16. DEMO 2: Submitting training jobs to Amazon Elastic Kubernetes Service (Amazon EKS). Approach: 1. Provision a Kubernetes cluster. Diagram labels: Code files, Custom container, Container registry, Amazon EKS cluster
  17. Create a Kubernetes cluster. Steps: Create cluster; Submit a training job. CLI: eksctl create cluster --name eks-gpu --version 1.13 --region us-west-2 --nodegroup-name gpu-nodes --node-type p3.8xlarge --nodes 4 --timeout=40m --ssh-access --ssh-public-key=<public-key> --auto-kubeconfig
  18. Learn more: Amazon EKS, Kubeflow and Katib. Amazon Elastic Kubernetes Service (Amazon EKS): aws.amazon.com/eks/ ; Kubeflow (machine learning workflows on Kubernetes): kubeflow.org/docs/aws/ ; Katib: Hyperparameter Tuning and Neural Architecture Search
  19. DEMO 3: Hyperparameter search experiment using Amazon SageMaker. Approach (diagram labels): Code files, Docker build, Custom container, Container registry, SageMaker SDK, Fully-managed SageMaker cluster, Amazon S3. Webinar: Machine Learning with Containers and Amazon SageMaker
  20. Takeaways • Containers let you build l • Leverage services such as Amazon SageMaker and Kubernetes + Kubeflow to manage large-scale ML workloads. • Choose fully-managed or self-managed based on needs. (Code and dependencies; Infrastructure management; Cluster management)
  21. Resources. Documentation: docs.aws.amazon.com/sagemaker/latest/dg/whatis.html ; Examples on GitHub: github.com/awslabs/amazon-sagemaker-examples ; AWS ML Blog: aws.amazon.com/blogs/machine-learning/category/artificial-intelligence/ ; AWS Deep Learning Containers: docs.aws.amazon.com/dlami/latest/devguide/deep-learning-containers-images.html ; Webinar: Machine Learning with Containers and Amazon SageMaker
  22. Thank you! Shashank Prasanna, Sr. Technical Evangelist, AI/ML. Questions? Happy to help: Twitter: @shshnkp ; LinkedIn: linkedin.com/in/shashankprasanna ; Demo code and configuration scripts: https://github.com/shashankprasanna/using-containers-for-dl
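
Slide 9 lists two software stacks that drift apart between the development system and the training cluster. The following is a minimal sketch of how that drift could be detected programmatically. The version strings are copied from the slide; the dictionary layout, the key names, and the function `find_mismatches` are illustrative assumptions, not part of the talk or of any AWS tool.

```python
# Version pins copied from slide 9. The key names and dict layout are
# illustrative assumptions, not something the talk defines.
DEV_SYSTEM = {
    "nvidia-driver": "436.15",
    "os": "Ubuntu 16.04",
    "tensorflow": "1.13",
    "mkl": "2019 v3",
    "cudnn": "7.1",
    "cublas": "10",
    "nccl": "2",
    "cuda-toolkit": "10",
}
TRAINING_CLUSTER = {
    "nvidia-driver": "410.68",
    "os": "Centos 7",
    "tensorflow": "1.14",
    "mkl": "2019 v2",
    "cudnn": "7.5",
    "cublas": "10",
    "nccl": "2.4",
    "cuda-toolkit": "10",
}


def find_mismatches(dev, cluster):
    """Return {component: (dev_version, cluster_version)} for every differing pin."""
    mismatches = {}
    # Union of keys so a component missing on one side is also reported.
    for name in sorted(set(dev) | set(cluster)):
        dev_version = dev.get(name)
        cluster_version = cluster.get(name)
        if dev_version != cluster_version:
            mismatches[name] = (dev_version, cluster_version)
    return mismatches


if __name__ == "__main__":
    for name, (dev_v, cluster_v) in find_mismatches(DEV_SYSTEM, TRAINING_CLUSTER).items():
        print(f"{name}: dev={dev_v} cluster={cluster_v}")
```

Each reported component is one of the "multiple points of failure" the slide refers to. Packaging the stack in a container image, as slides 10 and 11 propose, removes everything except the host driver and OS from this comparison.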
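
The `eksctl create cluster` call on slide 17 is long enough that scripting it can help keep the flags reviewable. Below is a hedged sketch that only assembles the command as an argument list, using exactly the flags and values shown on the slide. The function name `build_eksctl_command` and the key path in the usage lines are hypothetical placeholders; the slide leaves `<public-key>` unspecified.

```python
import shlex


def build_eksctl_command(public_key_path):
    """Assemble the slide 17 `eksctl create cluster` call as an argument list.

    All flags and values are taken from the slide; only the public key path
    is supplied by the caller, because the slide elides it.
    """
    return [
        "eksctl", "create", "cluster",
        "--name", "eks-gpu",
        "--version", "1.13",
        "--region", "us-west-2",
        "--nodegroup-name", "gpu-nodes",
        "--node-type", "p3.8xlarge",
        "--nodes", "4",
        "--timeout=40m",
        "--ssh-access",
        f"--ssh-public-key={public_key_path}",
        "--auto-kubeconfig",
    ]


if __name__ == "__main__":
    # Hypothetical key path, for illustration only.
    command = build_eksctl_command("my-key.pub")
    # Print the command for review instead of executing it.
    print(shlex.join(command))
```

To actually provision the cluster, the list could be passed to `subprocess.run(command, check=True)` on a machine where `eksctl` is installed and AWS credentials are configured. Note that four p3.8xlarge nodes incur significant cost, so printing the command first is a deliberate choice here.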
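
Slide 6 describes parallel experiments as different models running in parallel to find the best one, which is also the idea behind the hyperparameter search in Demo 3. The sketch below illustrates that pattern on a single machine with threads. It is not the SageMaker or Kubeflow API: the search space, the function names, and especially `run_experiment`, which returns a made-up score instead of training a model, are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical search space; the talk does not list specific hyperparameters.
SEARCH_SPACE = {
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size": [64, 128],
}


def expand_grid(space):
    """Return one config dict per combination of hyperparameter values."""
    names = sorted(space)
    return [dict(zip(names, values)) for values in product(*(space[n] for n in names))]


def run_experiment(config):
    """Stand-in for one training job; the score is invented, not a real metric."""
    return -abs(config["learning_rate"] - 0.01) - abs(config["batch_size"] - 128) / 1000


def best_config(space, max_workers=4):
    """Run every config concurrently and return (best config, best score)."""
    configs = expand_grid(space)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so scores line up with configs.
        scores = list(pool.map(run_experiment, configs))
    best_score, best = max(zip(scores, configs), key=lambda pair: pair[0])
    return best, best_score


if __name__ == "__main__":
    config, score = best_config(SEARCH_SPACE)
    print(config, score)
```

In the setups the talk covers, each call to `run_experiment` would instead be a containerized training job submitted to a cluster, with the orchestration handled by Amazon SageMaker or by Kubernetes with Kubeflow.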
