Mahout Scala and Spark Bindings:
Bringing algebraic semantics
Dmitriy Lyubimov
2014
Requirements for an ideal ML Environment
Wanted:
1. Clear R (Matlab)-like semantics and type system that covers
   • Linear Algebra, Stats and Data Frames
2. Modern programming language qualities
   • Functional programming
   • Object-oriented programming
   • Sensible bytecode performance
   • A big plus: scripting and interactive shell
3. Distributed scalability with sensible performance
4. Collection of off-the-shelf building blocks and algorithms
5. Visualization
Mahout Scala & Spark Bindings aim to address (1-a), (2), (3), (4).
What is Scala and Spark Bindings? (2)
Scala & Spark Bindings are:
1. Scala as programming/scripting environment
2. R-like DSL:
     val g = bt.t %*% bt - c - c.t +
       (s_q cross s_q) * (xi dot xi)
3. Algebraic expression optimizer for distributed Linear Algebra
   • Provides a translation layer to distributed engines: Spark, (…)
What are the data types?
1. Scalar real values (Double)
2. In-core vectors (2 types of sparse, 1 type of dense)
3. In-core matrices: sparse and dense
   • A number of specialized matrices
4. Distributed Row Matrices (DRM)
   • Compatible across Mahout MR and Spark solvers via persistence format
Dual representation of in-memory DRM
Automatic row key tracking:
     // Run LSA
     val (drmU, drmV, s) = dssvd(A)
• U inherits row keys of A automatically
• Special meaning of integer row keys for physical transpose
Features (1)
• Matrix, vector, scalar operators: in-core, out-of-core
     drmA %*% drmB
     A %*% x
     A.t %*% A
     A * B
• Slicing operators
     A(5 until 20, 3 until 40)
     A(5, ::); A(5, 5)
     x(a to b)
• Assignments (in-core only)
     A(5, ::) := x
     A *= B
     A -=: B; 1 /=: x
• Vector-specific
     x dot y; x cross y
• Summaries
     A.nrow; x.length;
     A.colSums; B.rowMeans
     x.sum; A.norm …
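As a rough illustration of how such R-like operators can be spelled in Scala, here is a minimal in-core sketch. `ToyMatrix` and all of its members are hypothetical names made up for this example and are not Mahout's classes; it assumes `%*%` means matrix product and `*` means element-wise product, as in R.

```scala
// Illustrative only: a toy dense matrix with R-like operators.
// ToyMatrix is a hypothetical stand-in, not part of Mahout.
final case class ToyMatrix(rows: Vector[Vector[Double]]) {
  def nrow: Int = rows.length
  def ncol: Int = if (rows.isEmpty) 0 else rows.head.length

  // Transpose, spelled `.t` as on the slide.
  def t: ToyMatrix = ToyMatrix(rows.transpose)

  // Matrix product, spelled `%*%` as in R.
  def %*%(that: ToyMatrix): ToyMatrix = {
    require(ncol == that.nrow, "inner dimensions must match")
    val cols = that.rows.transpose
    ToyMatrix(rows.map(r => cols.map(c => r.zip(c).map { case (a, b) => a * b }.sum)))
  }

  // Element-wise product (assumed meaning of `*`).
  def *(that: ToyMatrix): ToyMatrix = {
    require(nrow == that.nrow && ncol == that.ncol, "shapes must match")
    ToyMatrix(rows.zip(that.rows).map { case (r, s) => r.zip(s).map { case (a, b) => a * b } })
  }

  // Column sums as a plain vector of doubles.
  def colSums: Vector[Double] = rows.transpose.map(_.sum)
}
```

With this in scope, `a.t %*% a` parses as expected because member selection (`.t`) binds tighter than any infix operator, and `%*%` shares the precedence level of `*` and `/`.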
Features (2) – decompositions
• In-core
     val (inCoreQ, inCoreR) = qr(inCoreM)
     val ch = chol(inCoreM)
     val (inCoreV, d) = eigen(inCoreM)
     val (inCoreU, inCoreV, s) = svd(inCoreM)
     val (inCoreU, inCoreV, s) = ssvd(inCoreM, k = 50, q = 1)
• Out-of-core
     val (drmQ, inCoreR) = thinQR(drmA)
     val (drmU, drmV, s) = dssvd(drmA, k = 50, q = 1)
Features (3) – construction and collect
• Parallelizing from an in-core matrix
     val inCoreA = dense(
       (1, 2, 3, 4),
       (2, 3, 4, 5),
       (3, -4, 5, 6),
       (4, 5, 6, 7),
       (8, 6, 7, 8)
     )
     val A = drmParallelize(inCoreA, numPartitions = 2)
• Collecting to an in-core matrix
     val inCoreB = drmB.collect
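To make the parallelize/collect round trip concrete, the following toy sketch models a distributed row matrix as row-keyed partitions held in local collections. `DrmSketch`, `KeyedRow`, `parallelize`, and `collect` are hypothetical names for illustration only; real DRMs live in the distributed engine and this says nothing about how Mahout partitions data.

```scala
// Toy model only: partitions are plain Vectors, not distributed datasets.
object DrmSketch {
  // Each row carries its integer key so order can be restored later.
  type KeyedRow = (Int, Vector[Double])

  // Split rows into at most numPartitions contiguous chunks.
  def parallelize(rows: Vector[Vector[Double]], numPartitions: Int): Vector[Vector[KeyedRow]] = {
    require(numPartitions > 0, "need at least one partition")
    val keyed = rows.zipWithIndex.map { case (r, i) => (i, r) }
    val size = math.max(1, math.ceil(rows.length.toDouble / numPartitions).toInt)
    keyed.grouped(size).toVector
  }

  // Gather every partition and restore row order by key.
  def collect(parts: Vector[Vector[KeyedRow]]): Vector[Vector[Double]] =
    parts.flatten.sortBy(_._1).map(_._2)
}
```

Applied to the 5 x 4 matrix from the slide with `numPartitions = 2`, this sketch yields one partition of three rows and one of two, and `collect` returns the original rows in order.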
Features (4) – HDFS persistence
• Load DRM from HDFS
     val drmB = drmFromHDFS(path = inputPath)
• Save DRM to HDFS
     drmA.writeDRM(path = uploadPath)
Delayed execution and actions
• Optimizer action
   • Defines optimization granularity
   • Guarantees the result will be formed in its entirety
• Computational action
   • Actually triggers a Spark action
   • Optimizer actions are implicitly triggered by computation
     // Example: A = B’U
     // Logical DAG:
     val drmA = drmB.t %*% drmU
     // Physical DAG:
     drmA.checkpoint()
     drmA.writeDRM(path)
     (drmB.t %*% drmU).writeDRM(path)
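The split between a logical DAG and a physical DAG can be sketched with a tiny expression tree and one rewrite rule. Everything below (`Plan`, `Leaf`, `Transpose`, `Times`, `TransposeTimes`, `Optimizer`) is a hypothetical illustration; the rule shown, fusing a transpose into the product that consumes it, is an assumed example of an algebraic rewrite and not a description of Mahout's actual optimizer.

```scala
// Hypothetical mini "logical plan"; none of these names come from Mahout.
sealed trait Plan
final case class Leaf(name: String) extends Plan
final case class Transpose(a: Plan) extends Plan
final case class Times(a: Plan, b: Plan) extends Plan
// Assumed fused physical operator: computes A' * B without materializing A'.
final case class TransposeTimes(a: Plan, b: Plan) extends Plan

object Optimizer {
  // Building a Plan computes nothing; this call plays the role of the
  // "optimizer action" that turns a logical plan into a physical one.
  def optimize(p: Plan): Plan = p match {
    case Times(Transpose(a), b) => TransposeTimes(optimize(a), optimize(b))
    case Times(a, b)            => Times(optimize(a), optimize(b))
    case Transpose(a)           => Transpose(optimize(a))
    case TransposeTimes(a, b)   => TransposeTimes(optimize(a), optimize(b))
    case leaf: Leaf             => leaf
  }
}
```

In this sketch, `Times(Transpose(Leaf("B")), Leaf("U"))` is the logical form of `drmB.t %*% drmU`, and `Optimizer.optimize` rewrites it to `TransposeTimes(Leaf("B"), Leaf("U"))`.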
Common computational paths
Checkpoint caching (maps 1:1 to Spark)
• Checkpoint caching is a combination of None | in-memory | disk | serialized | replicated options
• Method “checkpoint()” signature:
     def checkpoint(sLevel: StorageLevel = StorageLevel.MEMORY_ONLY): CheckpointedDrm[K]
• Unpin data when no longer needed
     drmA.uncache()
Optimization factors
• Geometry (size) of operands
• Orientation of operands
• Whether identically partitioned
• Whether computational paths are shared
E.g., matrix multiplication:
• 5 physical operators for drmA %*% drmB
• 2 operators for drmA %*% inCoreA
• 1 operator for drmA %*% x
• 1 operator for x %*% drmA
Component Stack
Customization: vertical block operator
• Custom vertical block processing
• Must produce blocks of the same height
     // A * 5.0
     drmA.mapBlock() {
       case (keys, block) =>
         block *= 5.0
         keys -> block
     }
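The "same height" contract can be illustrated with plain Scala collections. This is a sketch under stated assumptions: `BlockSketch`, its `Block` type, and this `mapBlock` are hypothetical stand-ins written for the example, and the explicit `require` check is an illustration of the rule, not a claim about how Mahout enforces it.

```scala
// Toy stand-in for a row-partitioned matrix: each block pairs its row keys
// with a dense slab of rows. Names are hypothetical, not Mahout's types.
object BlockSketch {
  type Block = (Array[Int], Array[Array[Double]])

  // Apply f to every vertical block, rejecting any result whose height changed.
  def mapBlock(blocks: Seq[Block])(f: Block => Block): Seq[Block] =
    blocks.map { in =>
      val out = f(in)
      require(out._2.length == in._2.length, "block height changed")
      out
    }

  // Two blocks: rows 0-1 and row 2.
  val blocks: Seq[Block] = Seq(
    (Array(0, 1), Array(Array(1.0, 2.0), Array(3.0, 4.0))),
    (Array(2), Array(Array(5.0, 6.0)))
  )

  // Counterpart of the slide's "A * 5.0" example, done without mutation.
  val scaled: Seq[Block] = mapBlock(blocks) {
    case (keys, block) => keys -> block.map(_.map(_ * 5.0))
  }
}
```

A function that drops or adds rows fails the `require` check here, which mirrors why a block operator must keep row keys and rows aligned.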
Customization: Externalizing RDDs
• Externalizing raw RDD
   • Triggers optimizer checkpoint implicitly
     val rawRdd: DrmRDD[K] = drmA.rdd
• Wrapping raw RDD into a DRM
   • Stitching with data prep pipelines
   • Building complex distributed algorithms
     val drmA = drmWrap(rdd = rddA [, … ])
Broadcasting an in-core matrix or vector
• We cannot wrap an in-core vector or matrix in a closure: they do not support Java serialization
• Use the broadcast API
• May also improve performance (e.g. set up Spark to broadcast via Torrent broadcast)
     // Example: Subtract vector xi from each row:
     val bcastXi = drmBroadcast(xi)
     drmA.mapBlock() {
       case (keys, block) =>
         for (row <- block) row -= bcastXi
         keys -> block
     }
Guinea Pigs – actionable lines of code
• Thin QR
• Stochastic Singular Value Decomposition
• Stochastic PCA (MAHOUT-817 re-flow)
• Co-occurrence analysis recommender (aka RSJ)
Actionable lines of code (-blanks -comments -CLI)
                          Thin QR   (d)ssvd   (d)spca
R prototype               n/a       28        38
In-core Scala bindings    n/a       29        50
DRM Spark bindings        17        32        68
Mahout/Java/MR            n/a       ~2581     ~2581
dspca (tail)
  … …
  val c = s_q cross s_b
  val inCoreBBt = (drmBt.t %*% drmBt)
    .checkpoint(StorageLevel.NONE).collect -
    c - c.t + (s_q cross s_q) * (xi dot xi)
  val (inCoreUHat, d) = eigen(inCoreBBt)
  val s = d.sqrt
  val drmU = drmQ %*% inCoreUHat
  val drmV = drmBt %*% (inCoreUHat %*%: diagv(1 /: s))
  (drmU(::, 0 until k), drmV(::, 0 until k), s(0 until k))
}
Interactive Shell & Scripting!
Pitfalls
• Side effects are not like in R
   • In-core: no copy-on-write semantics
   • Distributed: cache policies without serialization may cause cached blocks to experience side effects from subsequent actions
      • Use something like MEMORY_AND_DISK_SER for cached parents of pipelines with side effects
• Beware of naïve and verbatim translations of in-core methods
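The in-core pitfall can be shown with plain arrays, which share the relevant property: there is no copy-on-write, so an in-place update is visible through every reference to the same object. `AliasingSketch` and its two helpers are hypothetical names for this example, and plain `Array[Array[Double]]` stands in for an in-core matrix as an assumption made for illustration.

```scala
object AliasingSketch {
  // Mutates the argument and returns the very same object, so every other
  // reference to m observes the change (no copy-on-write, unlike R).
  def scaleInPlace(m: Array[Array[Double]], k: Double): Array[Array[Double]] = {
    for (row <- m; j <- row.indices) row(j) *= k
    m
  }

  // Allocates fresh arrays; the input is left untouched.
  def scaledCopy(m: Array[Array[Double]], k: Double): Array[Array[Double]] =
    m.map(_.map(_ * k))
}
```

An R programmer would expect the first form to leave the caller's matrix alone; here only the second form does, which is why in-place operators such as `*=` deserve care inside block closures over cached data.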
Recap: Key Concepts
• High-level math, algebraic and Data Frames logical semantic constructs
   • R-like (Matlab-like), easy to prototype, read, maintain, customize
   • Operator-centric: same operator semantics regardless of operand types
• Strategic notion: portability of logical semantic constructs
   • Write once, run anywhere
   • Cost-based & rewriting optimizer
• Tactical notion: low-cost POC, sensible in-memory computation performance
   • Spark
   • Strong programming language environment (Scala)
   • Scriptable & interactive shell (extra bonus)
   • Compatibility with the rest of Mahout solvers via DRM persistence
Similar work
• Breeze:
   • Excellent math and linear algebra DSL
   • In-core only
• MLlib
   • A collection of ML on Spark
   • Tightly coupled to Spark
   • Not an environment
• MLI
   • Tightly coupled to Spark
• SystemML
   • Advanced cost-based optimization
   • Tightly bound to a specific resource manager(?)
   • + yet another language
• Julia (closest conceptually)
   • + yet another language
   • + yet another backend
Wanted and WIP
• Data Frames DSL API & physical layer (M-1490)
   • E.g. for standardizing feature vectorization in Mahout
   • E.g. for custom business rules scripting
• “Bring Your Own Distributed Method” (BYODM) – build out ScalaBindings’ “write once – run everywhere” collection of things
• Bindings for http://Stratosphere.eu
• Automatic parallelism adjustments
   • Ability to scale and balance the problem to all available resources automatically
• For more, see the Spark Bindings home page
Links
• Scala and Spark Bindings
  http://mahout.apache.org/users/sparkbindings/home.html
• Stochastic Singular Value Decomposition
  http://mahout.apache.org/users/dim-reduction/ssvd.html
• Blog
  http://weatheringthrutechdays.blogspot.com
Thank you.

Editor's notes

  1. On the right side is the picture called “Statue of a miner”. I snapped it in Kutná Hora, Czech Republic, about 10 years ago. More specifically, it is a statue of a miner mining for silver for a state mint. It was a heavy