SlideShare une entreprise Scribd logo
1  sur  40
Télécharger pour lire hors ligne
TensorFlow 深度学习平台
AI
Machine Learning
Deep Learning
Hadoop? Spark?
MapReduce
Hadoop & Spark
Data
Executor Executor Executor Executor Executor
Result
map
reduce
MapReduce
Hadoop & Spark
“In parallel computing, an embarrassingly parallel workload or problem
(also called perfectly parallel or pleasingly parallel) is one where little or
no effort is needed to separate the problem into a number of parallel
tasks.”
—— Wikipedia
https://en.wikipedia.org/wiki/Embarrassingly_parallel
Spark MLlib
What about
TensorFlow?
TensorFlow Graph
Training Data Validation Data Test Data
Model Serving request
train
(cpu/gpu)
✔
✘
Graph
(DAG)
How does TensorFlow work
Python
Golang
Java
…
Graph
(DAG)
C++ Core
compile run
Preprocess
Data
Executor Executor Executor Executor Executor
Result
map
reduce
Hive, Spark, Storm, …
Distributed Storage
NFS, HDFS, S3, …
Training
Training Data Validation Data Test Data
Model Serving request
train
(cpu/gpu)
✔
✘
TensorFlow
(Distributed Training)
• TaaS (TensorFlow as a Service)
• 开始于 2016 年年 8 ⽉月底
• 受到 Google Cloud 的 CloudML 产品启发
• 让算法⼯工程师可以专注于算法,其它的事情交给 elearn
分布式存储
CPU 弹性需求 service 的 IP、Port 管理理
⼤大量量 container 的⽣生命周期 API
GPU
Distributed
Storage
Datastore
• Abstraction of data
• Compatible with many types of distributed storage
• Decoupling computing and storage
• Bring your own storage
Datastore
Client 缺点 应⽤用场景
NFS nfs-utils 性能问题、权限管理理
存放训练数据
模型训练
S3
compatible
s3fs-fuse
⽆无法创建空⽬目录
mkdir
存放训练数据
MinFS ⽆无法创建 symlink 存放训练数据
GlusterFS - - -
Datastore Example
S3
NFS
GPU
in container
host GPU
container GPU
GPU with Docker
• GPU 内存使⽤用有讲究

https://www.tensorflow.org/tutorials/using_gpu
• GPU Docker images

https://hub.docker.com/r/nvidia/cuda/
• nvidia-docker 不不是必须

CUDA_SO, NVIDIA_SO, DEVICES, …
GPU with Kubernetes
--feature-gates="Accelerators=true"
Distributed
Training
TensorFlow 的上⼀一代产品
DistBelief
• 解决了了数据量量 > GPU 最⼤大内存的问题
• paper 中提到最⼤大的 model,达到了了

1.7 billion parameters, utilizing 81 machines, delivering a 12x speedup.
• 它的缺点:
• 擅⻓长图像识别,但对其它机器器学习 model 适⽤用性差,⽽而且不不⽀支持 mobile
• 维护⼤大规模系统成为负担,缺乏抽象。
https://research.google.com/pubs/pub40565.html
TensorFlow Ecosystem: Integrating TensorFlow with Your Infrastructure — Derek Murray, Jonathan Hseu
TensorFlow Ecosystem: Integrating TensorFlow with Your Infrastructure — Derek Murray, Jonathan Hseu
👋⼿手动管理理的资源
• IP or DNS
• port
• task ID
• CPU/GPU ID
😫抓狂的节奏
登录 10 台服务器器
mount 10 遍共享存储
⼩小⼼心翼翼安排 GPU 资源
仔细设计 10 条启动命令
敲 10 遍这不不⼀一样的启动命令
⼿手动启动⼀一个 TensorBoard
⼿手动创建⽬目录
mkdir + cp ⼿手动管理理模型版本
……
天呐!这才训练一次
以后可怎么每周训练、天天训练啊?!
TensorFlow (GPU/CPU) cluster benchmark
Total: 6 GPUs
master worker
2 GPUs x 1 1 GPU x 4
Total: 9 GPUs
master worker
2 GPUs x 1 1 GPU x 7
Total: 3 GPUs
master worker
2 GPUs x 1 1 GPU x 1
11
global steps/s
28
global steps/s
45
global steps/s
Total: 9 CPUs
master worker
2 CPUs x 1 1 CPU x 7
5
global steps/s
Powered by
Model
Management
Serving+
Cluster
(Monthly Training)
model - 2017-07-23
model - 2017-07-23
model - 2017-07-16
model - 2017-07-09
model - 2017-07-02
Recommendation Model
Copy
GRPC
Online Serving
Use TensorFlow
Golang/Java Binding to
Load the Model
Dow
nload
System Datastore
User Defined
Datastore
Demo
Thinking of
为 TensorFlow 量量身打造
如果让 elearn ⽀支持其它 deep learning 框架,太容易易了了,

只要写 driver 就可以了了,但是我不不会去这样做的。
尽量量发挥 Kubernetes 的特⾊色功能和编排能⼒力力

Deployment, Job, StatefulSet, ConfigMap, Nginx Ingress, …
如果让 elearn ⽀支持其它 Cloud,太容易易了了,只要写 interface 够通⽤用,
然后写 driver 就可以了了,但会失去 Kubernetes 本身的意义,
所以 elearn cloud interfaces 的设计是⾯面向功能的 。
让 Kubernetes 变成 OS

所有⼩小的操作都以 Job 形式下发
⽐比如训练完成后的 post run、保存 modelVersion,这些任务都是 Kubernetes Job,
并不不在 elearn server 上运⾏行行。
I’m 江骏 / ohmystack
@饿了了么
https://github.com/ohmystack
http://weibo.com/jiangjun1990
http://ohmystack.com/
推荐项⽬目(点击链接前往)
dt (docker-tool)
最爽功能:dt ssh
gotool
管理理 Golang 开发环境 GOPATH 的利利器器
欢迎关注饿了了么技术社区

Contenu connexe

Tendances

A One-Stop Solution for Puppet and OpenStack
A One-Stop Solution for Puppet and OpenStackA One-Stop Solution for Puppet and OpenStack
A One-Stop Solution for Puppet and OpenStack
Puppet
 
Open stack china_201109_sjtu_jinyh
Open stack china_201109_sjtu_jinyhOpen stack china_201109_sjtu_jinyh
Open stack china_201109_sjtu_jinyh
OpenCity Community
 
CMPS 494 Presentation [Cloud Computing]
CMPS 494 Presentation [Cloud Computing]CMPS 494 Presentation [Cloud Computing]
CMPS 494 Presentation [Cloud Computing]
Travis McAdams
 

Tendances (20)

How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUsHow to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to Azure
 
Cloud computing and bioinformatics
Cloud computing and bioinformaticsCloud computing and bioinformatics
Cloud computing and bioinformatics
 
Bioinformatics Analysis Environment for Your Laboratory Use
Bioinformatics Analysis Environment for Your Laboratory UseBioinformatics Analysis Environment for Your Laboratory Use
Bioinformatics Analysis Environment for Your Laboratory Use
 
themidgame-tube-slides
themidgame-tube-slidesthemidgame-tube-slides
themidgame-tube-slides
 
A One-Stop Solution for Puppet and OpenStack
A One-Stop Solution for Puppet and OpenStackA One-Stop Solution for Puppet and OpenStack
A One-Stop Solution for Puppet and OpenStack
 
Open stack china_201109_sjtu_jinyh
Open stack china_201109_sjtu_jinyhOpen stack china_201109_sjtu_jinyh
Open stack china_201109_sjtu_jinyh
 
Local testing of Containerized Distributed Systems
Local testing of Containerized Distributed SystemsLocal testing of Containerized Distributed Systems
Local testing of Containerized Distributed Systems
 
Deploying deep learning models with Docker and Kubernetes
Deploying deep learning models with Docker and KubernetesDeploying deep learning models with Docker and Kubernetes
Deploying deep learning models with Docker and Kubernetes
 
TensorFlow London 14: Ben Hall 'Machine Learning Workloads with Kubernetes an...
TensorFlow London 14: Ben Hall 'Machine Learning Workloads with Kubernetes an...TensorFlow London 14: Ben Hall 'Machine Learning Workloads with Kubernetes an...
TensorFlow London 14: Ben Hall 'Machine Learning Workloads with Kubernetes an...
 
Hadoop on OpenStack
Hadoop on OpenStackHadoop on OpenStack
Hadoop on OpenStack
 
State of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigDataState of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigData
 
HTrace: Tracing in HBase and HDFS (HBase Meetup)
HTrace: Tracing in HBase and HDFS (HBase Meetup)HTrace: Tracing in HBase and HDFS (HBase Meetup)
HTrace: Tracing in HBase and HDFS (HBase Meetup)
 
OpenNebula TechDay Boston 2015 - HA HPC with OpenNebula
OpenNebula TechDay Boston 2015 - HA HPC with OpenNebulaOpenNebula TechDay Boston 2015 - HA HPC with OpenNebula
OpenNebula TechDay Boston 2015 - HA HPC with OpenNebula
 
やっとでた! OpenStack Manila
やっとでた! OpenStack Manilaやっとでた! OpenStack Manila
やっとでた! OpenStack Manila
 
CMPS 494 Presentation [Cloud Computing]
CMPS 494 Presentation [Cloud Computing]CMPS 494 Presentation [Cloud Computing]
CMPS 494 Presentation [Cloud Computing]
 
k8sjp#9 KubeCon - Service Mesh, ML/DL on k8s
k8sjp#9 KubeCon - Service Mesh, ML/DL on k8sk8sjp#9 KubeCon - Service Mesh, ML/DL on k8s
k8sjp#9 KubeCon - Service Mesh, ML/DL on k8s
 
Thoughts on Cybersecurity
Thoughts on CybersecurityThoughts on Cybersecurity
Thoughts on Cybersecurity
 
HPC in the Cloud
HPC in the CloudHPC in the Cloud
HPC in the Cloud
 
Apps software development with Vert.X
Apps software development with Vert.XApps software development with Vert.X
Apps software development with Vert.X
 

Similaire à 饿了么 TensorFlow 深度学习平台:elearn

High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
Chris Fregly
 
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
Chris Fregly
 

Similaire à 饿了么 TensorFlow 深度学习平台:elearn (20)

TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersTensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
 
Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...
Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...
Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...
 
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...
 
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
 
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on G...
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on G...Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on G...
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on G...
 
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
 
High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...
High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...
High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practices
 
High Performance Distributed TensorFlow with GPUs - TensorFlow Chicago Meetup...
High Performance Distributed TensorFlow with GPUs - TensorFlow Chicago Meetup...High Performance Distributed TensorFlow with GPUs - TensorFlow Chicago Meetup...
High Performance Distributed TensorFlow with GPUs - TensorFlow Chicago Meetup...
 
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
 
Nvidia GPU Tech Conference - Optimizing, Profiling, and Deploying TensorFlow...
Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow...Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow...
Nvidia GPU Tech Conference - Optimizing, Profiling, and Deploying TensorFlow...
 
Apache Submarine: Unified Machine Learning Platform
Apache Submarine: Unified Machine Learning PlatformApache Submarine: Unified Machine Learning Platform
Apache Submarine: Unified Machine Learning Platform
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUs
 
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUsOptimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
 
Beat the devil: towards a Drupal performance benchmark
Beat the devil: towards a Drupal performance benchmarkBeat the devil: towards a Drupal performance benchmark
Beat the devil: towards a Drupal performance benchmark
 
GPU Accelerated Data Science with RAPIDS - ODSC West 2020
GPU Accelerated Data Science with RAPIDS - ODSC West 2020GPU Accelerated Data Science with RAPIDS - ODSC West 2020
GPU Accelerated Data Science with RAPIDS - ODSC West 2020
 
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
 
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
 
The Flow of TensorFlow
The Flow of TensorFlowThe Flow of TensorFlow
The Flow of TensorFlow
 
End-to-End Deep Learning with Horovod on Apache Spark
End-to-End Deep Learning with Horovod on Apache SparkEnd-to-End Deep Learning with Horovod on Apache Spark
End-to-End Deep Learning with Horovod on Apache Spark
 

Dernier

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Dernier (20)

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

饿了么 TensorFlow 深度学习平台:elearn