Contenu connexe Similaire à CON309_Containerized Machine Learning on AWS (20) Plus de Amazon Web Services (20) CON309_Containerized Machine Learning on AWS1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Containerized Machine Learning
on AWS
Asi f K h a n , Tech n i ca l Bu si n ess Devel opmen t Ma n a ger, AWS
H ok u to H osh i , H ea d of I n fra stru ctu re , C ook pa d I n c.
Yu i ch i ro Someya , Ma ch i n e Lea rn i n g En gi n eer, C ook pa d I n c.
November 28, 2017
CON309
2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What are we going to talk about
• What is deep learning
• The lifecycle of a deep learning
application
• Deploying deep learning functions on
Amazon EC2 Container Service (Amazon
ECS)
• Cookpad Inc. use case
3. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Significantly improve many applications on multiple domains
“deep learning” trend in the past 10 years
image understanding speech recognition natural language
processing
…
Machine learning
autonomy
4. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Machine learning at Amazon
5. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
The circle of machine learning (ML)
Front-end team
Data engineering team
Analysts/DS team
DevOps team
Business problem
Data
ML model
ML application
ML is great
How do we scale it?
How do we apply it?
6. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Solution for building AI apps with CICD
7. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS AI services
8. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon ECS
EC2 INSTANCES
LOAD
BALANCER
Internet
ECS
AGENT
TASK
Container
TASK
Container
ECS
AGENT
TASK
Container
TASK
Container
AGENT COMMUNICATION
SERVICE
Amazon
ECS
API
CLUSTER MANAGEMENT
ENGINE
KEY/VALUE STORE
ECS
AGENT
TASK
Container
TASK
Container
LOAD
BALANCER
9. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS developer tools
10. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Let’s do this…
11. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data scientist
Jupyter
Notebook
Amazon
S3
1. Upload
model
12. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data scientist
Jupyter
Notebook
AWS
CodeCommit
2. Commit code
Application developer
Amazon
S3
1. Upload
model
13. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data scientist
Jupyter
Notebook
AWS
CodeCommit
2. Commit code
Application developer
Amazon
S3
1. Upload
model
Mobile client
14. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data scientist
Jupyter
Notebook
AWS
CodeCommit
2. Commit code
Application developer
Amazon
S3
1. Upload
model
Amazon ECR
Instance
Spot Instance
Application Load
Balancer
Amazon Route 53
15. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data scientist
Jupyter
Notebook
AWS
CodeCommit
2. Commit code
Application developer
Amazon
S3
1. Upload
model
Amazon Route 53
AWS
CodePipeline
Ops
AWS
CodeBuild
Amazon ECR
Instance
Spot Instance
Application Load
Balancer
Amazon Route 53
Ops
16. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data scientist
Jupyter
Notebook
AWS
CodeCommit
2. Commit code
Application developer
Amazon
S3
1. Upload
model
Amazon ECR
Instance
Spot Instance
Application Load
Balancer
Amazon Route 53
AWS
CodePipeline
Ops
AWS
CodeBuild
3.UploadMXNet+applicationimage
Ops
17. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data scientist
Jupyter
Notebook
AWS
CodeCommit
2. Commit code
Application developer
Amazon
S3
1. Upload
model
Amazon Route 53
AWS
CodePipeline
Ops
AWS
CodeBuild
3.UploadMXNet+applicationimage
4. Predict
Amazon ECR
Instance
Spot Instance
Application Load
Balancer
Amazon Route 53
Ops
18. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data scientist
Jupyter
Notebook
AWS
CodeCommit
2. Commit code
Application developer
Amazon
S3
1. Upload
model
Amazon ECR
Instance
Spot Instance
Application Load
Balancer
Amazon Route 53
AWS
CodePipeline
Ops
AWS
CodeBuild
4. Predict
[ "probability=0.115212, class=n04366367 suspension bridge” ]
3.UploadMXNet+applicationimage
Ops
19. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data scientist
Jupyter
Notebook
AWS
CodeCommit
2. Commit code
Application developer
Amazon
S3
1. Upload
model Amazon ECR
Instance
Spot Instance
Application Load
Balancer
Amazon Route 53
AWS
CodePipeline
Ops
AWS
CodeBuild
3.UploadMXNet+applicationimage
4. Predict
[ "probability=0.115212, class=n04366367 suspension bridge” ]
5.Uploaddata
Ops
20. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data scientist
Jupyter
Notebook
AWS
CodeCommit
2. Commit code
Application developer
Amazon
S3
1. Upload
model
Amazon ECR
Instance
Spot Instance
Application Load
Balancer
Amazon Route 53
AWS
CodePipeline
Ops
AWS
CodeBuild
3.UploadMXNet+applicationimage
4. Predict
[ "probability=0.115212, class=n04366367 suspension bridge” ]
5.Uploaddata
6.Updateservice
Ops
21. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Give it a spin
http://amzn.to/2zWpQij
22. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
23. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
‣ Hokuto Hoshi (@kani_b)
‣ Head of Infrastructure, Cookpad Inc.
‣ hokuto@cookpad.com
24. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What is Cookpad?
25. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
26. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
About Cookpad
• “Make everyday cooking fun!” - Since 1998
• https://cookpad.com/
• Largest online recipe sharing and search service in
Japan
Over 2.7M user-authored recipes
About 60M Monthly users in Japan
27. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
• https://cookpad.com/#{your_country_code}
• us, uk, id, es, fr, br, ae, etc.…
Cookpad is global
67 countries
21 languages
Offices in Japan, UK, Spain, Indonesia etc.
28. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Our infrastructure
All-in on AWS since 2011
~1,400 EC2 instances
200+ ECS services
15,000+ requests/sec
150+ developers
9 SREs
9 Machine learning engineers
Over 2 regions
29. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Cooking Log (料理きろく)
Our very first Deep Learning powered feature
30. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Cooking log
(料理きろく)
Collect "Food Photos" from
Camera Roll automatically
Powered by
Convolutional Neural Network
31. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
DEMO
32. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
33. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
12,000,000+
"food" photos
140,000+
Users
34. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
• There were several new challenges
• Semi real-time image classification in production
• Different workloads from the rest of web applications
• Especially we needed:
• Scalable infrastructure for new workloads
• Environment isolation for new challenge
Our first deep learning feature
35. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
• What we needed: Scalable infrastructure for massive photo uploading and semi real-time
classification
• Clients send tiny thumbnails after taking photos (difficult to predict traffic)
• Traffic spikes are coming sometimes (e.g. The TV show introduces our app)
• What we chose: Asynchronous architecture
• Uploading and classification take time (~ several hundred ms)
• Synchronous processing with API servers
gives users a bad experience
• Upload directly from clients to Amazon S3 using presigned URL
• Enable Amazon S3 notification and Amazon SQS for queue of classification
Scalable and asynchronous classification
36. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Environment isolation
• What we needed: Isolated environment in production
• Different languages (Python for Machine Learning, Ruby for web application)
• Different workloads (It’s our first product that uses deep learning)
• Different hardware (GPU)
• What we chose: Container environment
• The container ensures runtimes are isolated
• Language environment, GPU drivers, many configurations
• Amazon ECS provides managed and scalable Docker environment
• And we had already used containers on Amazon ECS!
• We run all classification in Amazon ECS cluster on g2.xlarge
37. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
API server
1. Request API server to
issue Amazon S3 presigned URL
38. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
API server
1. Request API server to
issue Amazon S3 presigned URL
2. Upload thumbnail
to Amazon S3 bucket
39. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
API server
1. Request API server to
issue Amazon S3 presigned URL
2. Upload thumbnail
to Amazon S3 bucket
{ bucket: thumbnails
key: 123.jpg }
3. Queue upload event
From Amazon S3 to Amazon SQS
(s3://thumbnails/123.jpg)
40. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
API server
1. Request API server to
issue Amazon S3 presigned URL
2. Upload thumbnail
to Amazon S3 bucket
{ bucket: thumbnails
key: 123.jpg }
3. Queue upload event
From Amazon S3 to Amazon SQS
ECS cluster
4. Dequeue message
5. Download thumbnail
From Amazon S3
6. Classify thumbnail and
send result to API server
(s3://thumbnails/123.jpg)
41. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
API server
1. Request API server to
issue S3 pre-signed url
{ bucket: thumbnails
key: 123.jpg }
ECS cluster
4. Dequeue message
5. Download thumbnail
From S3
6. Classify thumbnail and
send result to API server
Scalable without any code
2. Upload thumbnail
to Amazon S3 bucket
3. Queue upload event
From Amazon S3 to Amazon SQS
(s3://thumbnails/123.jpg)
42. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
(s3://thumbnails/123.jpg)
API server
1. Request API server to
issue S3 pre-signed url
2. Upload thumbnail
to S3 bucket
{ bucket: thumbnails
key: 123.jpg }
3. Queue upload event
From S3 to SQS
5. Download thumbnail
From S3
6. Classify thumbnail and
send result to API server ECS cluster
4. Dequeue message
Scaling ECS tasks
by SQS queue length
43. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
(s3://thumbnails/123.jpg)
1. Request API server to
issue Amazon S3 presigned URL
2. Upload thumbnail
to Amazon S3 bucket
{ bucket: thumbnails
key: 123.jpg }
3. Queue upload event
From Amazon S3 to Amazon SQS
ECS cluster
4. Dequeue message
5. Download thumbnail
From Amazon S3
6. Classify thumbnail and
send result to API serverAPI server
Only need to wait for results from ECS
44. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
API server
1. Request API server to
issue Amazon S3 presigned URL
2. Upload thumbnail
to Amazon S3 bucket
{ bucket: thumbnails
key: 123.jpg }
3. Queue upload event
from Amazon S3 to Amazon SQS
4. Dequeue message
5. Download thumbnail
From S3
6. Classify thumbnail and
send result to API server ECS cluster
What is happening in containers?
(s3://thumbnails/123.jpg)
45. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Machine learning and infrastructure
Infrastructure to accelerate our machine learning projects.
46. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
‣ Yuichiro Someya (ayemos)
‣ Machine Learning Engineer @ Cookpad Inc.
# 2016(new grads) ~ Current
47. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
API server
1. Request API server to
issue Amazon S3 presigned URL
2. Upload thumbnail
to S3 bucket
{ bucket: thumbnails
key: 123.jpg }
3. Queue upload event
From Amazon S3 to Amazon SQS
4. Dequeue message
5. Download thumbnail
From S3
6. Classify thumbnail and
send result to API server ECS cluster
What is happening in containers?
(s3://thumbnails/123.jpg)
48. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
ECS cluster
The task in detail
(Report results to externals)
Image classification!
Classifier
49. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
(Serialized)
classifier
• Fetch the classifier
• Dequeue from Amazon SQS
• Load the image from
Amazon S3
• Report the results back
ECS cluster(Serialized)
classifier
Deploy the classifier on Amazon ECS
Task definition?
50. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Democratize task definitions
• hako: https://github.com/eagletmt/hako/
• Container deploy tool (Amazon ECS compatible)
• Use yaml as definition file format
• Each developer writes app.yml and sends Pull request
• 200+ applications are at work
scheduler:
type: ecs
region: ap-northeast-1
cluster: hako-production-g2
desired_count: 1
app:
image: food-photo-classifier
cpu: 128
memory: 3072
memory_reservation: 2048
env:
AWS_REGION: ap-northeast-1
COOKPADNET_ENV: production
...
51. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Why we’re using hako
• `hako` behaves as abstract operation layer over Amazon ECS (or other docker manager)
• Higher-level operations: Deploy/Rollback/Stop/Remove
• Manages *secret* environment variables
• Pluggable pre/post development operations as `scripts`
• Operations like DNS settings, Consul registrations, and so on
• We want each developer can deploy tasks on Amazon ECS individually.
• `hako` handles Infra/SRE work around Amazon ECS.
52. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
ECS cluster(Serialized)
classifier
Deploy the model on Amazon ECS
$ hako deploy classifier.yml
• Fetch the classifier
• Dequeue from Amazon
SQS
• Load the image from
Amazon S3
• Report the results back
53. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
ECS cluster (g2)
Task(Container)
GPU
NVIDIA Driver
(kernel modules)
NVIDIA Driver
(user libraries)
App, Worker
ECS and GPU
• Install drivers to clusters
(ref: https://github.com/NVIDIA/nvidia-docker/wiki/NVIDIA-driver )
• CUDA to the container
(CUDA Driver version compatibility is relatively loose)
• GPU device files have to be visible and writable
• Privileged flag
(migrating to linux_parameters.devices option)
54. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
ECS cluster(Serialized)
classifier
Deploy the model on Amazon ECS
$ hako deploy classifier.yml
• Fetch the classifier
• Dequeue from Amazon
SQS
• Load the image from
Amazon S3
• Report the results back
55. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Food/Nonfood
image classifier
Food photos
from Cookpad
Random photos
from other datasets
(Nonfood)
Labeled image dataset
train
56. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Whole set
Food
(100,000 photos~)
Nonfood
(100,000 photos~)
?
It’s hard to obtain “Complement set”
{food} ≠ {non-food}
57. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Whole set
Food
Nonfood
{Food: 50%, Nonfood: 50%}
?
58. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Whole set
Food
Nonfood
Plushies
Rebuild dataset
making use of posterior insights
59. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Food/Nonhood
image classifier
Food Photos
from Cookpad
Random Photos
from other datasets
(Nonfood)
Labeled image dataset
97.9% Accurate
(Precision: 97.6%, Recall: 95.8%)
60. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
61. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
CUDA 8.0/cuDNN7
CUDA 9.0/cuDNN7
`ssh workbench-001`
New!
62. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Cooking log
• Scalable food/non-food image classification
• Asynchronous and isolated architecture
• Containerized GPU workloads
• `hako` makes it easy to define and deploy applications
Machine learning infrastructure
• Great environment makes our research fast and creative!
• Managing multiple AMIs using Packer
• Dedicated Instances
• Operate instances via chat bot
63. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
https://info.cookpad.com/us
https://github.com/cookpad
We’re hiring!
64. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Thank you!