Topic: How to use big data to enhance AI
Outline:
1. Spark ETL
Spark SQL
Spark Streaming
2. Spark ML
Spark ML pipeline
Distributed model tuning
Spark ML model and data lineage management
3. Spark XGBoost
XGBoost introduction
XGBoost with Spark
XGBoost with GPU
4. Spark Deep Learning pipeline
Transfer learning
Build Spark ML pipeline with TensorFlow
Model selection on distributed TF model
Learn more about CS job hunting
Scan the QR code to follow us on WeChat
www.laioffer.com laiofferhelper2
Application-driven big data
⬢ The Hadoop ecosystem
○ How does Facebook use Hadoop?
■ Hive for OLAP query processing
■ HBase for tracking the activity of billions of users
○ How does Twitter use Hadoop?
■ Storm: stream processing for Twitter's streaming data
○ How does LinkedIn use Hadoop?
■ Kafka to subscribe to users' streaming data
○ How do the Hadoop components come together?
■ Ambari: node management and deployment of the different components
The leading data science platform for big data
[Figure: platform diagram with the labels Apache Spark, Hadoop, Interactive, Streaming, Batch, NoSQL and TensorFlow]
⬢ Apache Spark
○ Driven by machine learning applications
○ The leading computation engine for big data processing
○ Data pipelines across different data sources and other computation engines
○ Uniform data processing objects: RDD and DataFrame
○ Memory-based
Data pipeline for machine learning
[Figure: machine learning data pipeline built on a Resilient Distributed Dataset spread across four servers]
⬢ ETL: raw data processing, structured data
⬢ Exploration: interactive OLAP with Spark SQL
⬢ Machine learning: feature engineering, model training
⬢ Output: data product, visualization
Motivation
⬢ XGBoost is the state-of-the-art approach on Kaggle for structured data
○ Roughly 80% of winning teams reportedly build on XGBoost
○ A tree based model
○ Excellent at classification and regression
○ Ref: http://xgboost.readthedocs.io/en/latest/model.html
Motivation
⬢ From a single machine to parallel computation
○ Distributed training
○ GPU support
○ Works with the big data ecosystem
⬢ How to provide an end-to-end solution for data science?
○ Front end
■ An easy and efficient way to run parallel XGBoost computation
■ A notebook front end for model visualization
○ Back end
■ YARN allocates resources (CPU, memory, GPU) to applications
■ Docker support
How Spark enhances XGBoost
⬢ Each XGBoost node needs Rabit to communicate with the other nodes
○ Rabit is efficient but not easy to manage
[Figure: training data partitions 1 to 4, each handled by one XGBoost worker (worker1 to worker4); the workers perform a statistic sync to find the optimal split value]
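
The statistic sync in the diagram above can be illustrated without Rabit or Spark. The following is a minimal plain-Scala sketch, assuming each worker summarizes its partition as per-bin sums of gradients and hessians, the summaries are added element-wise (a stand-in for an allreduce), and the merged histogram is scanned for the split with the highest gain. The names `Stat`, `localHistogram`, `allReduce` and `bestSplit` and the toy data are hypothetical and not part of the Rabit or XGBoost APIs; the gain follows the structure score from the XGBoost model documentation, with the constant factor 1/2 and the complexity penalty left out.

```scala
// Illustration only: simulates the cross-worker statistic sync in one JVM.
// Stat, localHistogram, allReduce and bestSplit are hypothetical names,
// not part of the Rabit or XGBoost APIs.
object SplitSync {
  // Sum of first-order (g) and second-order (h) gradients in one feature bin.
  final case class Stat(g: Double, h: Double) {
    def +(o: Stat): Stat = Stat(g + o.g, h + o.h)
  }

  // Each worker builds a histogram over its own partition.
  // A row is (bin index of the feature, gradient, hessian).
  def localHistogram(rows: Seq[(Int, Double, Double)], nBins: Int): Vector[Stat] =
    rows.foldLeft(Vector.fill(nBins)(Stat(0.0, 0.0))) { case (hist, (bin, g, h)) =>
      hist.updated(bin, hist(bin) + Stat(g, h))
    }

  // Stand-in for an allreduce: element-wise sum of all local histograms.
  def allReduce(hists: Seq[Vector[Stat]]): Vector[Stat] =
    hists.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })

  // Structure score G^2 / (H + lambda) of a candidate leaf.
  private def score(s: Stat, lambda: Double): Double = s.g * s.g / (s.h + lambda)

  // Scan the global histogram; returns (last bin of the left child, gain).
  def bestSplit(global: Vector[Stat], lambda: Double): (Int, Double) = {
    val total = global.reduce(_ + _)
    val prefix = global.scanLeft(Stat(0.0, 0.0))(_ + _).tail // cumulative left sums
    prefix.init.zipWithIndex.map { case (left, bin) =>
      val right = Stat(total.g - left.g, total.h - left.h)
      (bin, score(left, lambda) + score(right, lambda) - score(total, lambda))
    }.maxBy(_._2)
  }

  def main(args: Array[String]): Unit = {
    // Four toy partitions, as in the diagram: (bin, gradient, hessian).
    val partitions = Seq(
      Seq((0, -1.0, 1.0), (1, -1.0, 1.0)),
      Seq((0, -1.0, 1.0), (2, 1.0, 1.0)),
      Seq((1, -1.0, 1.0), (3, 1.0, 1.0)),
      Seq((2, 1.0, 1.0), (3, 1.0, 1.0))
    )
    val global = allReduce(partitions.map(localHistogram(_, nBins = 4)))
    val (bin, gain) = bestSplit(global, lambda = 1.0)
    println(s"split after bin $bin, gain $gain")
  }
}
```

Only the small per-bin summaries cross the network in this scheme, never the raw rows. Because every worker ends up with the same merged histogram, every worker also picks the same split; on the toy data that split falls after bin 1.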
XGBoost on Spark ML pipeline
⬢ Distributed XGBoost inside a Spark ML pipeline
⬢ XGBoost estimator
○ Extends the Spark ML Estimator
⬢ XGBoost model
○ Extends the Spark ML Model class
○ Works naturally inside a Spark ML Pipeline for model materialization
⬢ XGBoost parameters
○ Extend Spark ML Params
○ Enable automatic parameter tuning
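
To make "automatic parameter tuning" concrete, here is a minimal plain-Scala sketch of an exhaustive grid search, the kind of search that Spark ML's `ParamGridBuilder` and `CrossValidator` automate once parameters are exposed as Spark ML params. It needs neither Spark nor XGBoost; `ParamGrid`, `expand`, `search` and the toy error function are hypothetical names, and the error function merely stands in for training a model and scoring it on validation data.

```scala
// Illustration only: plain-Scala grid search, no Spark or XGBoost required.
// ParamGrid, expand, search and the toy metric are hypothetical names.
object ParamGrid {
  type Params = Map[String, Double]

  // Cartesian product of all candidate values, one Params map per combination.
  def expand(grid: Seq[(String, Seq[Double])]): Seq[Params] =
    grid.foldLeft(Seq(Map.empty[String, Double])) { case (acc, (name, values)) =>
      for (params <- acc; v <- values) yield params + (name -> v)
    }

  // Evaluate every combination and keep the one with the lowest metric.
  def search(grid: Seq[(String, Seq[Double])], metric: Params => Double): (Params, Double) =
    expand(grid).map(p => (p, metric(p))).minBy(_._2)

  def main(args: Array[String]): Unit = {
    val grid = Seq("eta" -> Seq(0.1, 0.3), "max_depth" -> Seq(2.0, 6.0))
    // Toy stand-in for a validation error; a real run would train a model here.
    val toyError: Params => Double =
      p => math.abs(p("eta") - 0.3) + math.abs(p("max_depth") - 6.0)
    val (best, err) = search(grid, toyError)
    println(s"tried ${expand(grid).size} combinations, best = $best, error = $err")
  }
}
```

The number of combinations is the product of the list lengths, so each added parameter multiplies the training cost. That growth is why running the candidate fits on a cluster pays off.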
XGBoost on Spark ML pipeline
⬢ Distributed XGBoost
○ Imports (assumed for the xgboost4j-spark release that provides XGBoostEstimator)
import ml.dmlc.xgboost4j.scala.spark.{XGBoost, XGBoostEstimator}
import org.apache.spark.ml.Pipeline
○ Parameters
val paramMap = List("eta" -> 0.1f, "max_depth" -> 2, "objective" -> "binary:logistic").toMap
○ Training
// Positional arguments 1 and 4: one boosting round, four workers
val xgboostModelRDD = XGBoost.train(trainRDD, paramMap, 1, 4, useExternalMemory = true)
val xgboostModelDF = XGBoost.trainWithDataFrame(trainDF, paramMap, 1, 4, useExternalMemory = true)
○ Prediction
val xgboostPredictionRDD = xgboostModelRDD.predict(trainRDD.map { x => x.features })
○ XGBoost inside an ML pipeline
// assembler and dataset are defined elsewhere in the notebook
val xgboostEstimator = new XGBoostEstimator(Map[String, Any](
  "num_round" -> 30, "nworkers" -> 10, "objective" -> "reg:linear",
  "eta" -> 0.3, "max_depth" -> 6, "early_stopping_rounds" -> 10))
val pipeline = new Pipeline().setStages(Array(assembler, xgboostEstimator))
val pipelineData = dataset.withColumnRenamed("PE", "label")
val pipelineModel = pipeline.fit(pipelineData)
Speeding up XGBoost with GPUs
⬢ GPUs help, but managing a GPU cluster is not easy
○ Different GPUs need different driver versions
○ Users have to build XGBoost with GPU support
○ GPU resources are hard to manage
○ GPU resources cannot be shared
⬢ An ideal environment includes everything
○ Spark is an efficient distributed engine for data processing
○ Spark ML pipeline for model tuning
○ GPUs speed up XGBoost training
○ YARN manages the cluster's resources
○ Notebooks serve as the interface for end users
What you can learn from this notebook
⬢ Combine Spark and XGBoost
○ Train and deploy XGBoost models on a unified data platform
○ Automatically tune the XGBoost model with a Spark ML pipeline
○ Speed up XGBoost training with distributed computation and GPUs
○ Multiple users can share the same cluster with GPUs and Spark
⬢ Benefits
○ End-to-end solution for ML pipelines with XGBoost support
○ No need to care about GPU management
○ Train XGBoost with Spark ML APIs
○ Visualize the prediction results in a notebook
Spark and XGBoost for Fintech
⬢ Lending Club data
⬢ Spark DataFrame for ETL
⬢ Spark SQL for OLAP
⬢ Spark ML for automatic model tuning and model serving
⬢ Notebook links (use Databricks Community Edition):
○ Part 1: (https://bit.ly/2QuLQ9b) https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4999972933037924/27242371102049/8135547933712821/latest.html
○ Part 2: (https://bit.ly/2AZJI3Z) https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4999972933037924/27242371102070/8135547933712821/latest.html
⬢ Acknowledgment: https://databricks.com/blog/2018/08/09/loan-risk-analysis-with-xgboost-and-databricks-runtime-for-machine-learning.html
What is deep learning?
⬢ A set of machine learning techniques that learn useful representations of features directly from images, text and sound.
⬢ Achievements
○ ImageNet
○ Google Neural Machine Translation
○ AlphaGo/AlphaZero
⬢ Benefits from big data and GPUs
Deep Learning in Spark MLlib Pipeline
⬢ Spark MLlib pipeline
○ Sequence of Transformers and Estimators
○ Simple, concise API and ease of use
⬢ Integrates with Spark APIs
○ Spark is great at scaling out computations
○ Image representation and reader in Spark DataFrame/Dataset (new in Spark 2.3)
⬢ Spark Deep Learning Pipelines (github.com/databricks/spark-deep-learning)
○ Plug in your own TensorFlow Graph or Keras Model as Transformers
○ Open source under Apache 2.0 license
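
The "sequence of Transformers and Estimators" idea from this slide can be shown with a toy model in plain Scala. This is a sketch of the contract only, assuming rows are simple maps from column name to value; Spark ML's real classes operate on DataFrames, and `Row`, `Scale`, `MeanCenter` and `MiniPipeline` are hypothetical names invented for the illustration.

```scala
// Illustration only: a toy model of the Transformer / Estimator contract.
// Spark ML's real classes operate on DataFrames; Row, Scale, MeanCenter and
// MiniPipeline are hypothetical names used for this sketch.
object MiniPipeline {
  type Row = Map[String, Double]
  type Data = Seq[Row]

  // A Transformer maps a dataset to a new dataset.
  trait Transformer { def transform(data: Data): Data }

  // An Estimator learns from a dataset and returns a fitted Transformer.
  trait Estimator { def fit(data: Data): Transformer }

  // Stateless stage: multiplies one column by a constant.
  final case class Scale(col: String, factor: Double) extends Transformer {
    def transform(data: Data): Data = data.map(r => r.updated(col, r(col) * factor))
  }

  // Stateful stage: learns the column mean, then subtracts it.
  final case class MeanCenter(col: String) extends Estimator {
    def fit(data: Data): Transformer = {
      val mean = data.map(r => r(col)).sum / data.size
      new Transformer {
        def transform(d: Data): Data = d.map(r => r.updated(col, r(col) - mean))
      }
    }
  }

  // Fits stages in order, feeding each stage the output of the previous ones.
  def fit(stages: Seq[Either[Transformer, Estimator]], data: Data): Transformer = {
    val (fitted, _) = stages.foldLeft((Vector.empty[Transformer], data)) {
      case ((done, current), stage) =>
        val t = stage match {
          case Left(transformer) => transformer
          case Right(estimator)  => estimator.fit(current)
        }
        (done :+ t, t.transform(current))
    }
    // The fitted pipeline is itself a Transformer that applies every stage.
    new Transformer {
      def transform(d: Data): Data = fitted.foldLeft(d)((acc, t) => t.transform(acc))
    }
  }

  def main(args: Array[String]): Unit = {
    val train: Data = Seq(Map("x" -> 1.0), Map("x" -> 2.0), Map("x" -> 3.0))
    val model = fit(Seq(Left(Scale("x", 10.0)), Right(MeanCenter("x"))), train)
    println(model.transform(Seq(Map("x" -> 4.0))))
  }
}
```

In this picture a TensorFlow graph or Keras model plugged in as a Transformer is just another stage with a `transform` method, which is why it can sit next to ordinary feature stages.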
Auto ML in Spark ML pipeline
⬢ Spark prepares the data
○ Spark Streaming
○ Spark SQL
⬢ Spark for model parameter tuning
○ Hyperparameters
○ Reduced memory usage
⬢ Automatic network structure tuning with TensorFlow
○ Reinforcement learning
○ Transfer learning
⬢ Deploy the model as a service
What you can learn in this section
⬢ How to combine deep learning and Spark
⬢ Use a DL model as an operator in a Spark ML pipeline
⬢ Transfer learning with a DL model
⬢ DL model parameter tuning
⬢ Apply DL models in Spark SQL
⬢ Notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4999972933037924/4324977500035919/8135547933712821/latest.html
⬢ Acknowledgment: https://docs.databricks.com/applications/deep-learning/deep-learning-pipelines.html