Building an elastic and efficient platform that supports a variety of Big Data and Machine Learning tasks is a challenge for many companies. In this presentation, Zhongbo Tian gives an overview of Douban's Mesos-based core infrastructure and demonstrates how to integrate the platform with state-of-the-art Big Data/ML technologies.
24. Paracel
• https://github.com/douban/paracel
• Jeffrey Dean, et al. "Large scale distributed deep networks."
• Based on the parameter server idea
• Distributed machine learning framework
• Uses MPI for communication
• Stale Synchronous Parallel
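The Stale Synchronous Parallel bullet can be illustrated with a minimal sketch of the staleness bound: a worker may run ahead of the slowest worker by at most a fixed number of clock ticks. The class name `SSPClock`, its methods, and the staleness value of 2 are hypothetical choices for illustration, not Paracel's API.

```python
# Hypothetical sketch of the SSP staleness check; not Paracel's API.

class SSPClock:
    """Tracks one iteration counter (clock) per worker."""

    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers
        self.staleness = staleness  # assumed bound, chosen for illustration

    def tick(self, worker):
        # The worker has finished one iteration.
        self.clocks[worker] += 1

    def can_proceed(self, worker):
        # A worker may continue only while it is at most `staleness`
        # ticks ahead of the slowest worker.
        return self.clocks[worker] - min(self.clocks) <= self.staleness


clock = SSPClock(num_workers=2, staleness=2)

history = []
for _ in range(3):
    clock.tick(0)                      # worker 0 races ahead
    history.append(clock.can_proceed(0))

clock.tick(1)                          # worker 1 catches up by one tick
resumed = clock.can_proceed(0)
```

In this run worker 0 is allowed to continue after its first two ticks, is blocked after the third, and is released once worker 1 advances.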
[Figure: Parameter Server (𝜔′ = 𝜔 − 𝜂Δ𝜔) exchanging 𝜔 and Δ𝜔 with Model Replicas, which read from Data Shards]
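The update rule 𝜔′ = 𝜔 − 𝜂Δ𝜔 can be sketched as a toy, single-process simulation: each model replica pulls the current parameter, computes a gradient on its own data shard, and pushes it back to the server. The class and function names, the one-parameter least-squares model, the data, and the learning rate are all assumptions made for illustration; this is not Paracel's API and it does no real network communication.

```python
# Toy simulation of the parameter-server update rule w' = w - lr * delta.
# All names and values are hypothetical; this is not Paracel's API.

class ParameterServer:
    """Holds the shared weight w and applies w' = w - lr * delta."""

    def __init__(self, w, lr):
        self.w = w
        self.lr = lr

    def pull(self):
        # A replica fetches the current parameter.
        return self.w

    def push(self, delta):
        # A replica sends its gradient; the server applies the update.
        self.w = self.w - self.lr * delta


def gradient(w, shard):
    """Mean gradient of 0.5 * (w * x - y) ** 2 over one data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


# Two data shards drawn from y = 3 * x, one per model replica.
shards = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0)],
]

server = ParameterServer(w=0.0, lr=0.05)
for _ in range(50):
    for shard in shards:                  # each replica takes a turn
        w = server.pull()                 # fetch current parameter
        server.push(gradient(w, shard))   # send the update back

print(round(server.pull(), 4))
```

Because both shards come from the same line, the shared weight converges toward 3 even though each replica only ever sees its own half of the data.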
25. DMLC on Mesos
• Distributed (Deep) Machine Learning Community
• Machine learning toolbox
• MXNet
• XGBoost
• Mesos Support for dmlc-core
• dmlc/dmlc-core#241 by Douban
• Powered by PyMesos
• Fallback to mesos-execute
• XGBoost on Mesos
• Achieves near-linear speedup
27. TFMesos
• https://github.com/douban/tfmesos
• Distributed Tensorflow on Mesos
• GPU support
• Docker support
• The tfrun tool supports Between-Graph mode
import tensorflow as tf
from tfmesos import cluster

# Request two parameter-server tasks and two worker tasks.
jobs_def = [
    {"name": "ps", "num": 2},
    {"name": "worker", "num": 2},
]

with cluster(jobs_def) as c:
    # Place one variable on each parameter-server task.
    with tf.device('/job:ps/task:0'):
        a = tf.Variable(10)
    with tf.device('/job:ps/task:1'):
        b = tf.Variable(32)
    # Run the addition on the second worker task.
    with tf.device("/job:worker/task:1"):
        op = a + b

    # Connect a session to the first worker and evaluate the graph.
    grpc_url = c.targets['/job:worker/task:0']
    with tf.Session(grpc_url) as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(op))
[Figure: graph placement. Variables a and b on /job:ps/task:0 and /job:ps/task:1; the add op on /job:worker/task:1]