SlideShare une entreprise Scribd logo
1  sur  37
Télécharger pour lire hors ligne
Embrace sparsity at web scale: Apache
Spark* MLlib algorithms optimization
for sparse data
Yuhao Yang / Ding Ding
Big Data Technology,Intel
*Other names and brands may be claimed as the property of others
Intel & Big Data
• Contribution to big data community
– Consistently and actively
– Enthusiastic engineering team
– https://software.intel.com/en-us/bigdata
• Wide cooperation and partnership
– Consultations and co-development
– Send to open source projects.
Contributes
Intel
Apache	
Spark*
Spark	
Users
*Other names and brands may be claimed as the property of others
Sparse data is almost everywhere
• Data Source:
– Movie ratings
– Purchase history
• Feature engineering:
– NLP: CountVectorizer, HashingTF
– Categorical: OneHotEncoder
– Image, video
0
1
2
3
4
5
6
7
8
9
10
0 2 4 6 8 10
Customers
products
Purchase History
Sparse data support in MLlib
new DenseVector(
values = Array(1.0, 0.0, 0.0, 100.0))
new SparseVector(
size = 4,
indices = Array(0, 3),
values = Array(1.0, 100.0))
*Other names and brands may be claimed as the property of others
First Tip: Anther option
• HashVector: a sparse vector backed by a hash array.
– Mutable Sparse Vector
– O(1) random access
– O(nnz) axpy, dot
• Available in Breeze and our package
Sparse data support in MLlib
• Supporting Sparse data since v1.0
– Load / Save, Sparse Vector, LIBSVM
– Supporting sparse vector is one of the primary review focus.
– Xiangrui’s talk in Spark Summit 2014: Sparse data support in MLlib
– https://spark-summit.org/2014/wp-content/uploads/2014/07/sparse_data_support_in_mllib1.pdf
*Other names and brands may be claimed as the property of others
Gaps with some industry scenarios
• Hi, I need
– LR with 1 billion dimension
– clustering with 10M dimension
– Large scale documents classification/clustering
– My data is quite sparse
• Yet with MLlib
– OOM…
– Can you help?
Sparse ML for Apache Spark*
• A Spark package containing algorithm optimization to
support the sparse data at large scope
*Other names and brands may be claimed as the property of others
Spark	Core
MLlib
Sparse	ML
SQL,
Streaming,	
GraphX	...
Sparse ML for Apache Spark*
• KMeans
• Linear methods (logistic regression, linear SVM, etc)
• HashVector
• MaxAbsScaler
• NaiveBayes
• Neural Network (WIP)
*Other names and brands may be claimed as the property of others
KMeans
• Pick initial cluster centers
– Random
– KMeans||
• Iterative training
– Points clustering, find nearest center for each point
– Re-compute center in each cluster (avg.)
• Cluster centers are vectors with the same dimension of data
0
0.5
1
1.5
2
2.5
0 0.5 1 1.5 2 2.5
KMeans scenario: e-commerce
• Cluster customers into 200 clusters according to purchase
history:
– 20M customers
– 10M different products (feature dimension)
– 200 clusters
– Avg. sparsity 1e-6
MLlib iteration
1. Broadcast current centers (all dense vectors, 200 * 10M * 8 = 16G),
to all the workers
Driver
Worker Worker Worker
Broadcast	current	centers
*Other names and brands may be claimed as the property of others
MLlib iteration
2. Compute a sum table
for each partition of data
val sum = new Array[Vector](k)
for (each point in the partition) {
val bestCenter = traverse()
sum(bestCenter) += point
}
Training	dataset
Executor	
1
Executor	
2
Executor	
3
Sums:
16G
Centers:	16G
*Other names and brands may be claimed as the property of others
Analysis: Data
• Are the cluster centers dense?
• Let’s assume all the records have no overlapping features:
– 20M records / 200 clusters = 0.1M records per cluster
– 0.1M * 10 = 1M non-zero in their sum/center
– 1M / 10M = 0.1 center sparsity at most
Analysis: operations
• Core linear algebra operation:
Operations Sparse friendly
axpy Y += A * X No if Y is sparse,
yet X + Y is sparse-friendly
dot X dot Y Yes
Sqdist Square distance Yes, sparse faster
SparseKMeans
• Represent clustering centers with SparseVector
– Reduce memory and time consumption
Cluster centers
• What a center goes through in each iteration
– Broadcast
– Compute distance with all the points (sqdist , dot)
– Discard (New centers are generated)
• Cluster centers can always use SparseVector
– Without extra cost during computation
Advanced: Sum table
• Use SparseVectors to hold the sum for each cluster
– Reduce max memory requirement;
• Isn’t it slower to compute with Sparse vectors?
– SparseVector can not support axpy, but it supports x + y
– Modern JVM handles small objects efficiently
– Automatically converts to DenseVector (sparseThrehold)
Scalable KMeans
• What if you cluster centers are dense
– Reduce max memory consumption
– Break the constraint imposed by centers and sums
• Can we make the centers distributed?
– Array[Center] => RDD[Center]
– Each point vs. each cluster center.
– That sounds like a join
Scalable KMeans
Scalable KMeans
• Scalable
– No broadcast, no sum table
– 200G -> 20G * 10
– Remove memory constraint on single node
• Not only for Sparse data
KMeans
• Sparse KMeans:
– Cluster centers can be sparse:
• Scalable KMeans
– Cluster centers can be distributed
Tip2: MaxAbsScaler for feature engineering
• MinMaxScaler destroys data sparsity
• StandardScaler does not support SparseVector withMean
Logistic Regression on Spark
Training Data
Executor
Driver
Executor
weights
task
local gradient
task
local gradient
…
task
local gradient
task
local gradient
load
Partition
Partition
Partition
Partition
load
load
…
…
…
…
load 1
1
2
2
2
2
3
Customer’s training set:
– Number of features : 200s million
– Billions ~ trillions training samples
– Each sample has 100s – 1000 non-zero elements
Large Scale Logistic Regression
Challenges: big data and big model
Training Data
Executor
Driver
Executor
weights
task
local gradient
task
local gradient
…
task
local gradient
task
local gradient
load
Partitio
n
Partitio
n
Partitio
n
Partitio
n
load
load
…
…
…
…
load 1
1
2
2
2
2
3
Driver broadcast weights(>2G)
to each executor
Each task sends local gradient(>2G)
for aggregation
Gradients in each executor (>>2G)
Training data cached in executor
Training Data
Executor
Executor
The gradient is sparse as the feature
vector is sparse
task
local gradient
task
local gradient
…
task
local gradient
task
local gradient
load
Partition
load
load
…
…
…
…
loadf1
…
f2
f5
…
f4
…
f3
f6
f7
f8
…
Exploiting sparsity in gradients
𝑔 𝑤; 𝑥, 𝑦 = 𝑓 𝑥)
𝑤; 𝑦 	+ 	𝑥
• g = points.map(p => grad(w, p)).reduce(_ + _)
• Gradients: hashSparseVector
• Adds gradients to an initial hashSparseVector :
ü Fast random access: O(1)
ü Memory friendly:
Executor: 10G -> ~200M
Switch to sparse gradients
• Weights is with great sparsity
§ Waste memory on meaningless 0
§ Use dense vector with non zero elements
0 1 2 3 4 5 6 7 8 9 10 11
0 1 2 3
Global2FeautreIdMapping
Exploiting sparsity in weights
1 2 5 11
0 1 2 3 globalIndex
featureId
Prune weights
• Implementation:
val global2FeatureIdMapping =
points.mappartition {p => p.mapping}.collect().flatMap(t => t).distinct
• GlobalIndex is used during traing
• Convert back to featureId after training
Optimize cached training data
• Use localIndex as sparse vector indices
10point1: 9 24
10point2: 3 12
10point3: 6 9
20point4: 6 12
3 12 6 20 1 9 24
0 1 2 3 4 5 6 7 localIndex
featureId
local2FeatureIdMapping:
10point1: 2 3
10point2: 4 5
10point3: 6 2
70point4: 6 5
• Encode localIndex
featureId: 0 – 200 millions localIndex: 0 – ~2 millions
Use 1-3 bytes to store localIndex
§ indices: Array[Int] -> Array[Array[Byte]] -> Array[Byte]
§ use first bit to identify if the following byte is a new localIndex
Optimize cached training data
• Support for binary(0 or 1) values
Optimize cached training data
1 3 6 90indices
values 2.1 0.0 2.2 1.01.0
61
2.22.1
indices
values
0
9
Before
Now
Dense
vector
Sparse
vector
Sparse
vector
Sparse Logistic Regression Performance
• Enviroment (12 executors with 8g memory in each)
q Spark LR: OOM
q Sparse LR: 90 seconds per epoch
Hardware: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz, 128GB DRAM
Software: Spark on yarn (Spark ver1.6.0 , Hadoop ver2.6.0)
How to use SparseSpark
• https://github.com/intel-analytics/SparseSpark
• Consistent interface with MLlib
• Compile with application code.
*Other names and brands may be claimed as the property of others
THANK YOU.
Yuhao.yang@intel.com
Ding.ding@intel.com
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is grantedby this document.
Intel disclaims all express and implied warranties, includingwithout limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as
well as any warranty arisingfrom course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/orprocesses in development. All informationprovided hereis subject tochange without notice. Contact your Intel
representative to obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may containdefects or errors known as errata whichmay causedeviations from publishedspecifications. Current characterized errata are
available on request.
Intel technologies’ features andbenefits depend on system configuration andmay require enabled hardware, software or serviceactivation. Performance varies depending on
system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at [intel.com].
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configurationwill affect actual performance. Consult
other sources of information to evaluateperformance as you consider your purchase.
For more information go to http://www.intel.com/performance.
Intel and the Intel logo are trademarks of Intel Corporation in theU.S. and/orothercountries.
*Other names and brands may be claimed as the property of other.
Copyright ©2016 IntelCorporation.
Legal Disclaimer
37

Contenu connexe

Tendances

Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDistributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Databricks
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Databricks
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Databricks
 

Tendances (20)

Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of Parameters
 
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDistributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
 
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersTensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache Spark
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
 
Huawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark StreamingHuawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark Streaming
 
Reactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark StreamingReactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark Streaming
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenRandom Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on Mesos
 
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
 
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd MostakLeveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
 
Spark Summit EU talk by Mikhail Semeniuk Hollin Wilkins
Spark Summit EU talk by Mikhail Semeniuk Hollin WilkinsSpark Summit EU talk by Mikhail Semeniuk Hollin Wilkins
Spark Summit EU talk by Mikhail Semeniuk Hollin Wilkins
 
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDeep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
 

En vedette

Hybrid Solution of the Cold-Start Problem in Context-Aware Recommender Systems
Hybrid Solution of the Cold-Start Problem in Context-Aware Recommender SystemsHybrid Solution of the Cold-Start Problem in Context-Aware Recommender Systems
Hybrid Solution of the Cold-Start Problem in Context-Aware Recommender Systems
Matthias Braunhofer
 
Sparse matrices
Sparse matricesSparse matrices
Sparse matrices
Zain Zafar
 
Cp knowledge: 31 fbt presentation ankit
Cp knowledge: 31 fbt presentation ankitCp knowledge: 31 fbt presentation ankit
Cp knowledge: 31 fbt presentation ankit
Pavan Kumar Vijay
 
Mapas conceptuales acto creativo n 2
Mapas conceptuales acto creativo n 2Mapas conceptuales acto creativo n 2
Mapas conceptuales acto creativo n 2
IE Simona Duque
 
КРЕЧЕТ Беспилотные аэрофотосъёмочные системы
КРЕЧЕТ Беспилотные аэрофотосъёмочные системыКРЕЧЕТ Беспилотные аэрофотосъёмочные системы
КРЕЧЕТ Беспилотные аэрофотосъёмочные системы
kulibin
 
อาร์ม รูปโมลานิซ่า
อาร์ม  รูปโมลานิซ่าอาร์ม  รูปโมลานิซ่า
อาร์ม รูปโมลานิซ่า
Mos BirDy
 

En vedette (20)

Alleviating Data Sparsity for Twitter Sentiment Analysis
Alleviating Data Sparsity for Twitter Sentiment AnalysisAlleviating Data Sparsity for Twitter Sentiment Analysis
Alleviating Data Sparsity for Twitter Sentiment Analysis
 
Hybrid Solution of the Cold-Start Problem in Context-Aware Recommender Systems
Hybrid Solution of the Cold-Start Problem in Context-Aware Recommender SystemsHybrid Solution of the Cold-Start Problem in Context-Aware Recommender Systems
Hybrid Solution of the Cold-Start Problem in Context-Aware Recommender Systems
 
Cold-Start Management with Cross-Domain Collaborative Filtering and Tags
Cold-Start Management with Cross-Domain Collaborative Filtering and TagsCold-Start Management with Cross-Domain Collaborative Filtering and Tags
Cold-Start Management with Cross-Domain Collaborative Filtering and Tags
 
Active Learning in Collaborative Filtering Recommender Systems : a Survey
Active Learning in Collaborative Filtering Recommender Systems : a SurveyActive Learning in Collaborative Filtering Recommender Systems : a Survey
Active Learning in Collaborative Filtering Recommender Systems : a Survey
 
Data mining tasks
Data mining tasksData mining tasks
Data mining tasks
 
Multiplication of two 3 d sparse matrices using 1d arrays and linked lists
Multiplication of two 3 d sparse matrices using 1d arrays and linked listsMultiplication of two 3 d sparse matrices using 1d arrays and linked lists
Multiplication of two 3 d sparse matrices using 1d arrays and linked lists
 
Sparse matrices
Sparse matricesSparse matrices
Sparse matrices
 
2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache Spark2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache Spark
 
Understanding the Books.Quia Homework 2012
Understanding the Books.Quia Homework 2012Understanding the Books.Quia Homework 2012
Understanding the Books.Quia Homework 2012
 
ARRA-Management-Retail
ARRA-Management-RetailARRA-Management-Retail
ARRA-Management-Retail
 
Perfil credibilidad persuasion etica- ¿que factores hacen que un cliente teng...
Perfil credibilidad persuasion etica- ¿que factores hacen que un cliente teng...Perfil credibilidad persuasion etica- ¿que factores hacen que un cliente teng...
Perfil credibilidad persuasion etica- ¿que factores hacen que un cliente teng...
 
W:\Jane Smith\Biodiversity Heritage Library News From Europe Ala2010
W:\Jane Smith\Biodiversity Heritage Library News From Europe Ala2010W:\Jane Smith\Biodiversity Heritage Library News From Europe Ala2010
W:\Jane Smith\Biodiversity Heritage Library News From Europe Ala2010
 
Coworklisboa 2014
Coworklisboa 2014Coworklisboa 2014
Coworklisboa 2014
 
Cp knowledge: 31 fbt presentation ankit
Cp knowledge: 31 fbt presentation ankitCp knowledge: 31 fbt presentation ankit
Cp knowledge: 31 fbt presentation ankit
 
Mobile world Summit - 2014
Mobile world Summit - 2014Mobile world Summit - 2014
Mobile world Summit - 2014
 
電器 & 傢俬
電器 & 傢俬電器 & 傢俬
電器 & 傢俬
 
Mapas conceptuales acto creativo n 2
Mapas conceptuales acto creativo n 2Mapas conceptuales acto creativo n 2
Mapas conceptuales acto creativo n 2
 
КРЕЧЕТ Беспилотные аэрофотосъёмочные системы
КРЕЧЕТ Беспилотные аэрофотосъёмочные системыКРЕЧЕТ Беспилотные аэрофотосъёмочные системы
КРЕЧЕТ Беспилотные аэрофотосъёмочные системы
 
อาร์ม รูปโมลานิซ่า
อาร์ม  รูปโมลานิซ่าอาร์ม  รูปโมลานิซ่า
อาร์ม รูปโมลานิซ่า
 
Using Social Media for Gift Shops
Using Social Media for Gift ShopsUsing Social Media for Gift Shops
Using Social Media for Gift Shops
 

Similaire à Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For Sparse Data

Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
Data Con LA
 

Similaire à Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For Sparse Data (20)

Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
PyData Boston 2013
PyData Boston 2013PyData Boston 2013
PyData Boston 2013
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Migrating existing open source machine learning to azure
Migrating existing open source machine learning to azureMigrating existing open source machine learning to azure
Migrating existing open source machine learning to azure
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Streaming Solutions for Real time problems
Streaming Solutions for Real time problemsStreaming Solutions for Real time problems
Streaming Solutions for Real time problems
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Kognitio - an overview
Kognitio - an overviewKognitio - an overview
Kognitio - an overview
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
Intro to SnappyData Webinar
Intro to SnappyData WebinarIntro to SnappyData Webinar
Intro to SnappyData Webinar
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to Azure
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin Odersky
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
 

Plus de Jen Aman

Plus de Jen Aman (20)

Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using Spark
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkLivy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache Spark
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
 
Spark at Bloomberg: Dynamically Composable Analytics
Spark at Bloomberg:  Dynamically Composable Analytics Spark at Bloomberg:  Dynamically Composable Analytics
Spark at Bloomberg: Dynamically Composable Analytics
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To Production
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On Spark
 
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
 

Dernier

Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 

Dernier (20)

Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 

Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For Sparse Data

  • 1. Embrace sparsity at web scale: Apache Spark* MLlib algorithms optimization for sparse data Yuhao Yang / Ding Ding Big Data Technology,Intel *Other names and brands may be claimed as the property of others
  • 2. Intel & Big Data • Contribution to big data community – Consistently and actively – Enthusiastic engineering team – https://software.intel.com/en-us/bigdata • Wide cooperation and partnership – Consultations and co-development – Send to open source projects. Contributes Intel Apache Spark* Spark Users *Other names and brands may be claimed as the property of others
  • 3. Sparse data is almost everywhere • Data Source: – Movie ratings – Purchase history • Feature engineering: – NLP: CountVectorizer, HashingTF – Categorical: OneHotEncoder – Image, video 0 1 2 3 4 5 6 7 8 9 10 0 2 4 6 8 10 Customers products Purchase History
  • 4. Sparse data support in MLlib new DenseVector( values = Array(1.0, 0.0, 0.0, 100.0)) new SparseVector( size = 4, indices = Array(0, 3), values = Array(1.0, 100.0)) *Other names and brands may be claimed as the property of others
  • 5. First Tip: Anther option • HashVector: a sparse vector backed by a hash array. – Mutable Sparse Vector – O(1) random access – O(nnz) axpy, dot • Available in Breeze and our package
  • 6. Sparse data support in MLlib • Supporting Sparse data since v1.0 – Load / Save, Sparse Vector, LIBSVM – Supporting sparse vector is one of the primary review focus. – Xiangrui’s talk in Spark Summit 2014: Sparse data support in MLlib – https://spark-summit.org/2014/wp-content/uploads/2014/07/sparse_data_support_in_mllib1.pdf *Other names and brands may be claimed as the property of others
  • 7. Gaps with some industry scenarios • Hi, I need – LR with 1 billion dimension – clustering with 10M dimension – Large scale documents classification/clustering – My data is quite sparse • Yet with MLlib – OOM… – Can you help?
  • 8. Sparse ML for Apache Spark* • A Spark package containing algorithm optimization to support the sparse data at large scope *Other names and brands may be claimed as the property of others Spark Core MLlib Sparse ML SQL, Streaming, GraphX ...
  • 9. Sparse ML for Apache Spark* • KMeans • Linear methods (logistic regression, linear SVM, etc) • HashVector • MaxAbsScaler • NaiveBayes • Neural Network (WIP) *Other names and brands may be claimed as the property of others
  • 10. KMeans • Pick initial cluster centers – Random – KMeans|| • Iterative training – Points clustering, find nearest center for each point – Re-compute center in each cluster (avg.) • Cluster centers are vectors with the same dimension of data 0 0.5 1 1.5 2 2.5 0 0.5 1 1.5 2 2.5
  • 11. KMeans scenario: e-commerce • Cluster customers into 200 clusters according to purchase history: – 20M customers – 10M different products (feature dimension) – 200 clusters – Avg. sparsity 1e-6
  • 12. MLlib iteration 1. Broadcast current centers (all dense vectors, 200 * 10M * 8 = 16G), to all the workers Driver Worker Worker Worker Broadcast current centers *Other names and brands may be claimed as the property of others
  • 13. MLlib iteration 2. Compute a sum table for each partition of data val sum = new Array[Vector](k) for (each point in the partition) { val bestCenter = traverse() sum(bestCenter) += point } Training dataset Executor 1 Executor 2 Executor 3 Sums: 16G Centers: 16G *Other names and brands may be claimed as the property of others
  • 14. Analysis: Data • Are the cluster centers dense? • Let’s assume all the records have no overlapping features: – 20M records / 200 clusters = 0.1M records per cluster – 0.1M * 10 = 1M non-zero in their sum/center – 1M / 10M = 0.1 center sparsity at most
  • 15. Analysis: operations • Core linear algebra operation: Operations Sparse friendly axpy Y += A * X No if Y is sparse, yet X + Y is sparse-friendly dot X dot Y Yes Sqdist Square distance Yes, sparse faster
  • 16. SparseKMeans • Represent clustering centers with SparseVector – Reduce memory and time consumption
  • 17. Cluster centers • What a center goes through in each iteration – Broadcast – Compute distance with all the points (sqdist , dot) – Discard (New centers are generated) • Cluster centers can always use SparseVector – Without extra cost during computation
  • 18. Advanced: Sum table • Use SparseVectors to hold the sum for each cluster – Reduce max memory requirement; • Isn’t it slower to compute with Sparse vectors? – SparseVector can not support axpy, but it supports x + y – Modern JVM handles small objects efficiently – Automatically converts to DenseVector (sparseThrehold)
  • 19. Scalable KMeans • What if you cluster centers are dense – Reduce max memory consumption – Break the constraint imposed by centers and sums • Can we make the centers distributed? – Array[Center] => RDD[Center] – Each point vs. each cluster center. – That sounds like a join
  • 21. Scalable KMeans • Scalable – No broadcast, no sum table – 200G -> 20G * 10 – Remove memory constraint on single node • Not only for Sparse data
  • 22. KMeans • Sparse KMeans: – Cluster centers can be sparse: • Scalable KMeans – Cluster centers can be distributed
  • 23. Tip2: MaxAbsScaler for feature engineering • MinMaxScaler destroys data sparsity • StandardScaler does not support SparseVector withMean
  • 24. Logistic Regression on Spark Training Data Executor Driver Executor weights task local gradient task local gradient … task local gradient task local gradient load Partition Partition Partition Partition load load … … … … load 1 1 2 2 2 2 3
  • 25. Customer’s training set: – Number of features : 200s million – Billions ~ trillions training samples – Each sample has 100s – 1000 non-zero elements Large Scale Logistic Regression
  • 26. Challenges: big data and big model Training Data Executor Driver Executor weights task local gradient task local gradient … task local gradient task local gradient load Partitio n Partitio n Partitio n Partitio n load load … … … … load 1 1 2 2 2 2 3 Driver broadcast weights(>2G) to each executor Each task sends local gradient(>2G) for aggregation Gradients in each executor (>>2G) Training data cached in executor
  • 27. Training Data Executor Executor The gradient is sparse as the feature vector is sparse task local gradient task local gradient … task local gradient task local gradient load Partition load load … … … … loadf1 … f2 f5 … f4 … f3 f6 f7 f8 … Exploiting sparsity in gradients 𝑔 𝑤; 𝑥, 𝑦 = 𝑓 𝑥) 𝑤; 𝑦 + 𝑥
  • 28. • g = points.map(p => grad(w, p)).reduce(_ + _) • Gradients: hashSparseVector • Adds gradients to an initial hashSparseVector : ü Fast random access: O(1) ü Memory friendly: Executor: 10G -> ~200M Switch to sparse gradients
  • 29. • Weights is with great sparsity § Waste memory on meaningless 0 § Use dense vector with non zero elements 0 1 2 3 4 5 6 7 8 9 10 11 0 1 2 3 Global2FeautreIdMapping Exploiting sparsity in weights 1 2 5 11 0 1 2 3 globalIndex featureId
  • 30. Prune weights • Implementation: val global2FeatureIdMapping = points.mappartition {p => p.mapping}.collect().flatMap(t => t).distinct • GlobalIndex is used during traing • Convert back to featureId after training
  • 31. Optimize cached training data • Use localIndex as sparse vector indices 10point1: 9 24 10point2: 3 12 10point3: 6 9 20point4: 6 12 3 12 6 20 1 9 24 0 1 2 3 4 5 6 7 localIndex featureId local2FeatureIdMapping: 10point1: 2 3 10point2: 4 5 10point3: 6 2 70point4: 6 5
  • 32. • Encode localIndex featureId: 0 – 200 millions localIndex: 0 – ~2 millions Use 1-3 bytes to store localIndex § indices: Array[Int] -> Array[Array[Byte]] -> Array[Byte] § use first bit to identify if the following byte is a new localIndex Optimize cached training data
  • 33. • Support for binary(0 or 1) values Optimize cached training data 1 3 6 90indices values 2.1 0.0 2.2 1.01.0 61 2.22.1 indices values 0 9 Before Now Dense vector Sparse vector Sparse vector
  • 34. Sparse Logistic Regression Performance • Enviroment (12 executors with 8g memory in each) q Spark LR: OOM q Sparse LR: 90 seconds per epoch Hardware: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz, 128GB DRAM Software: Spark on yarn (Spark ver1.6.0 , Hadoop ver2.6.0)
  • 35. How to use SparseSpark • https://github.com/intel-analytics/SparseSpark • Consistent interface with MLlib • Compile with application code. *Other names and brands may be claimed as the property of others
  • 37. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is grantedby this document. Intel disclaims all express and implied warranties, includingwithout limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arisingfrom course of performance, course of dealing, or usage in trade. This document contains information on products, services and/orprocesses in development. All informationprovided hereis subject tochange without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps. The products and services described may containdefects or errors known as errata whichmay causedeviations from publishedspecifications. Current characterized errata are available on request. Intel technologies’ features andbenefits depend on system configuration andmay require enabled hardware, software or serviceactivation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at [intel.com]. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configurationwill affect actual performance. Consult other sources of information to evaluateperformance as you consider your purchase. For more information go to http://www.intel.com/performance. Intel and the Intel logo are trademarks of Intel Corporation in theU.S. and/orothercountries. *Other names and brands may be claimed as the property of other. Copyright ©2016 IntelCorporation. Legal Disclaimer 37