SlideShare une entreprise Scribd logo
1  sur  29
Télécharger pour lire hors ligne
Bringing Algebraic Semantics
to Apache Mahout
Sebastian Schelter
Infolunch @ Hasso Plattner Institut, Potsdam
05/07/2014
Mahout: Past & Future
Apache Mahout: History
• library for scalable machine learning (ML)
• started six years ago as ML on MapReduce
• focus on popular ML problems and algorithms
– Collaborative Filtering
• (Item-based, User-based, Matrix Factorization)
– Classification
• (Naive Bayes, Logistic Regression, Random Forest)
– Clustering
• (k-Means, Streaming k-Means)
– Dimensionality Reduction
• (Lanczos, Stochastic SVD)
+ in-memory math library (fork of Colt)
• large userbase (e.g. Adobe, AOL, Accenture, Foursquare, Mendeley, Researchgate, Twitter)
Apache Mahout: Problems
• Organisational
– Onerous to provide support and documentation for scalable ML
(need to convey technical and theoretical background at the same
time)
– partly failed to compensate for developers that left the project
– barrier for contributions too low
• Technical
– MapReduce not well suited for ML
• slow execution, especially for iterations
• constrained programming model makes code hard to
write, read and adjust
– different quality of implementations
– lack of unified preprocessing machinery
Apache Mahout: Future
• Narrowing of focus
– removed a large number of rarely used or barely maintained algorithm
implementations
• Abandonment of MapReduce
– will reject new MapReduce implementations
– widely used „legacy“ implementations will be maintained
• „Reboot“
– usage of modern parallel processing systems which offer richer progamming
model & more efficient execution
→ Apache Spark, possibly Stratosphere in the future
– DSL for linear algebraic operations as foundation for new implemenations
– automatic optimization and parallelization of programs written in this DSL
Scala & Spark-Bindings
Requirements for an ideal ML environment
1. R/Matlab-like semantics
– type system that covers (a) linear algebra,
(b) statistics and (c) data frames
2. Modern programming language qualities
– functional programming
– object oriented programming
– sensible byte code Performance
– scriptable and interactive
3. Scalability
– automatic distribution and
parallelization with sensible
performance
4. Collection of off-the-shelf building
blocks and algorithms
5. Visualization
Requirements for an ideal ML environment
1. R/Matlab-like semantics
– type system that covers (a) linear algebra,
(b) statistics and (c) data frames
2. Modern programming language qualities
– functional programming
– object oriented programming
– sensible byte code Performance
– scriptable and Interactive
3. Scalability
– automatic distribution and
parallelization with sensible
performance
4. Collection of off-the-shelf building
blocks and algorithms
5. Visualization → Mahout‘s new Scala & Spark-bindings
address requirements 1(a), 2, 3 and 4
Scala & Spark Bindings
• Scala as programming/scripting environment
• R-like DSL :
val G = B %*% B.t - C - C.t + (ksi dot ksi) * (s_q cross s_q)
• Algebraic expression optimizer for distributed linear algebra
– provides a translation layer to distributed engines
– currently supports Apache Spark only
– might support Stratosphere in the future
q
T
q
TTT
ssCCBBG 
Data Types
• Scalar real values
• In-memory vectors
– dense
– 2 types of sparse
• In-memory matrices
– sparse and dense
– a number of specialized matrices
• Distributed Row Matrices (DRM)
– huge matrix, partitioned by rows
– lives in the main memory of the cluster
– provides small set of parallelized
operations
– lazily evaluated operation execution
val x = 2.367
val v = dvec(1, 0, 5)
val w =
svec((0 -> 1)::(2 -> 5)::Nil)
val A = dense((1, 0, 5),
(2, 1, 4),
(4, 3, 1))
val drmA = drmFromHDFS(...)
Features (1)
• Matrix, vector, scalar operators:
in-memory, out-of-core
• Slicing operators
• Assignments (in-memory only)
• Vector-specific
• Summaries
drmA %*% drmB
A %*% x
A.t %*% drmB
A * B
A(5 until 20, 3 until 40)
A(5, ::); A(5, 5)
x(a to b)
A(5, ::) := x
A *= B
A -=: B; 1 /:= x
x dot y; x cross y
A.nrow; x.length;
A.colSums; B.rowMeans
x.sum; A.norm
Features (2)
• solving linear systems
• in-memory decompositions
• out-of-core decompositions
• caching of DRMs
val x = solve(A, b)
val (inMemQ, inMemR) = qr(inMemM)
val ch = chol(inMemM)
val (inMemV, d) = eigen(inMemM)
val (inMemU, inMemV, s) = svd(inMemM)
val (drmQ, inMemR) = thinQR(drmA)
val (drmU, drmV, s) =
dssvd(drmA, k = 50, q = 1)
val drmA_cached = drmA.checkpoint()
drmA_cached.uncache()
Example
Cereals
Name protein fat carbo sugars rating
Apple Cinnamon Cheerios 2 2 10.5 10 29.509541
Cap‘n‘Crunch 1 2 12 12 18.042851
Cocoa Puffs 1 1 12 13 22.736446
Froot Loops 2 1 11 13 32.207582
Honey Graham Ohs 1 2 12 11 21.871292
Wheaties Honey Gold 2 1 16 8 36.187559
Cheerios 6 2 17 1 50.764999
Clusters 3 2 13 7 40.400208
Great Grains Pecan 3 3 13 4 45.811716
http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html
Linear Regression
• Assumption: target variable y generated by linear combination of feature
matrix X with parameter vector β, plus noise ε
• Goal: find estimate of the parameter
vector β that explains the data well
• Cereals example
X = weights of ingredients
y = customer rating
  Xy
Data Ingestion
• Usually: load dataset as DRM from a distributed filesystem:
val drmData = drmFromHdfs(...)
• ‚Mimick‘ a large dataset for our example:
val drmData = drmParallelize(dense(
(2, 2, 10.5, 10, 29.509541), // Apple Cinnamon Cheerios
(1, 2, 12, 12, 18.042851), // Cap'n'Crunch
(1, 1, 12, 13, 22.736446), // Cocoa Puffs
(2, 1, 11, 13, 32.207582), // Froot Loops
(1, 2, 12, 11, 21.871292), // Honey Graham Ohs
(2, 1, 16, 8, 36.187559), // Wheaties Honey Gold
(6, 2, 17, 1, 50.764999), // Cheerios
(3, 2, 13, 7, 40.400208), // Clusters
(3, 3, 13, 4, 45.811716)), // Great Grains Pecan
numPartitions = 2)
Data Preparation
• Cereals example: target variable y is customer rating, weights of
ingredients are features X
• extract X as DRM by slicing,
fetch y as in-core vector
val drmX = drmData(::, 0 until 4)
val y = drmData.collect(::, 4)




























8117164541333
4002084071323
7649995011726
1875593681612
87129221111221
20758232131112
73644622131211
04285118121221
509541291051022
.
.
.
.
.
.
.
.
..
drmX y
Estimating β
• Ordinary Least Squares: minimizes the sum of residual squares between
true target variable and prediction of target variable
• Closed-form expression for estimation of ß as
• Computing XTX and XTy is as simple as typing the formulas:
val drmXtX = drmX.t %*% drmX
val drmXty = drmX %*% y
yXXX
TT 1
)(ˆ 

Estimating β
• Solve the following linear system to get least-squares estimate of ß
• Fetch XTX andXTy onto the driver and use an in-core solver
– assumes XTX fits into memory
– uses analogon to R’s solve() function
val XtX = drmXtX.collect
val Xty = drmXty.collect(::, 0)
val betaHat = solve(XtX, Xty)
yXXX
TT
ˆ
Estimating β
• Solve the following linear system to get least-squares estimate of ß
• Fetch XTX andXTy onto the driver and use an in-memory solver
– assumes XTX fits into memory
– uses analogon to R’s solve() function
val XtX = drmXtX.collect
val Xty = drmXty.collect(::, 0)
val betaHat = solve(XtX, Xty)
yXXX
TT
ˆ
→ We have implemented distributed linear regression!
Goodness of fit
• Prediction of the target variable is simple matrix-vector multiplication
• Check L2 norm of the difference between true target variable and our
prediction
val yHat = (drmX %*% betaHat).collect(::, 0)
(y - yHat).norm(2)
ˆˆ Xy 
Adding a bias term
• Bias term left out so far
– constant factor added to the model, “shifts the line vertically”
• Common trick is to add a column of ones to the feature matrix
– bias term will be learned automatically




























141333
171323
111726
181612
1111221
1131112
1131211
1121221
11051022 .




























41333
71323
11726
81612
111221
131112
131211
121221
105.1022
Adding a bias term
• How do we add a new column to a DRM?
→ mapBlock() allows custom modifications to the matrix
val drmXwithBiasColumn = drmX.mapBlock(ncol = drmX.ncol + 1) {
case(keys, block) =>
// create a new block with an additional column
val blockWithBiasCol = block.like(block.nrow, block.ncol+1)
// copy data from current block into the new block
blockWithBiasCol(::, 0 until block.ncol) := block
// last column consists of ones
blockWithBiasColumn(::, block.ncol) := 1
keys -> blockWithBiasColumn
}
Under the covers
Underlying systems
• currently: prototype on Apache Spark
– fast and expressive cluster
computing system
– general computation graphs,
in-memory primitives, rich API,
interactive shell
• future: add Stratosphere
– research system developed by
TU Berlin, HU Berlin, HPI
– accepted into Apache Incubator recently
– functionality similar to Apache Spark, adds data flow
optimization and efficient out-of-core execution
Optimization
• Execution is defered, user
composes logical operators
• Computational actions implicitly
trigger optimization (= selection
of physical plan) and execution
• Optimization factors: size of operands, orientation of operands,
partitioning, sharing of computational paths
• e. g.: matrix multiplication:
– 5 physical operators for drmA %*% drmB
– 2 operators for drmA %*% inMemA
– 1 operator for drm A %*% x
– 1 operator for x %*% drmA
val drmA = drmB.t %*% drmC
drmA.writeDrm(path);
val inMemA =
(drmB.t %*% drmB).collect
Optimization Example (1)
• Computation of XTX in the linear regression example
• Logical optimization:
logical plan gets rewritten to use
a special logical operator for
Transpose-Times-Self multiplication
via a rule in the optimizer
val drmXtX = drmX.t %*% drmX
val XtX = drmXtX.collect
OpAt
OpAB
drmX drmX
drmXtX
OpAtA
drmX
drmXtX
Optimization Example (2)
• Mahout computes XTX via row-outer-product formulation of matrix
multiplication that can be efficiently executed in a single pass over the
row-partitioned X
• optimizer chooses between to physical operators for OpAtA
– standard implementation:
• Map operator builds partial outer products from individual blocks based
on an estimate of a good partitioning
• Reduce operator adds them up using a Reduce operator
– AtA_slim for tall but skinny matrices when an intermediate symmetric matrix
requiring (X.ncol * X.ncol) / 2 entries fits into memory




m
i
T
ii
T
xxXX
0
Thank you. Questions?
Credits:
Design & implementation of the Scala & Spark Bindings
and parts of this slideset by Dmitriy Lyubimov
dlyubimov@apache.org

Contenu connexe

Tendances

Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David SzakallasDatabricks
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...DB Tsai
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkDB Tsai
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on HadoopVivian S. Zhang
 
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...DB Tsai
 
VAE-type Deep Generative Models
VAE-type Deep Generative ModelsVAE-type Deep Generative Models
VAE-type Deep Generative ModelsKenta Oono
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with SparkMd. Mahedi Kaysar
 
GBM in H2O with Cliff Click: H2O API
GBM in H2O with Cliff Click: H2O APIGBM in H2O with Cliff Click: H2O API
GBM in H2O with Cliff Click: H2O APISri Ambati
 
Practical data science_public
Practical data science_publicPractical data science_public
Practical data science_publicLong Nguyen
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingEd Kohlwey
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerSpark Summit
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Jen Aman
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...Spark Summit
 
Data Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsData Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsKrishna Sankar
 
Large-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PCLarge-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PCAapo Kyrölä
 
Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)Robert Metzger
 
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Robert Metzger
 
Introduction of Feature Hashing
Introduction of Feature HashingIntroduction of Feature Hashing
Introduction of Feature HashingWush Wu
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CARobert Metzger
 

Tendances (20)

Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David Szakallas
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
 
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
 
VAE-type Deep Generative Models
VAE-type Deep Generative ModelsVAE-type Deep Generative Models
VAE-type Deep Generative Models
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
 
GBM in H2O with Cliff Click: H2O API
GBM in H2O with Cliff Click: H2O APIGBM in H2O with Cliff Click: H2O API
GBM in H2O with Cliff Click: H2O API
 
Practical data science_public
Practical data science_publicPractical data science_public
Practical data science_public
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel Processing
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 
Data Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsData Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science Competitions
 
Large-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PCLarge-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PC
 
Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)
 
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
 
Introduction of Feature Hashing
Introduction of Feature HashingIntroduction of Feature Hashing
Introduction of Feature Hashing
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
 

Similaire à Bringing Algebraic Semantics to Mahout

Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLFlink Forward
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBrendan Gregg
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLMLconf
 
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1Hsien-Hsin Sean Lee, Ph.D.
 
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...PyData
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Spark Summit
 
Writing Fast MATLAB Code
Writing Fast MATLAB CodeWriting Fast MATLAB Code
Writing Fast MATLAB CodeJia-Bin Huang
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Apache SystemML Optimizer and Runtime techniques by Matthias Boehm
Apache SystemML Optimizer and Runtime techniques by Matthias BoehmApache SystemML Optimizer and Runtime techniques by Matthias Boehm
Apache SystemML Optimizer and Runtime techniques by Matthias BoehmArvind Surve
 
Apache SystemML Optimizer and Runtime techniques by Matthias Boehm
Apache SystemML Optimizer and Runtime techniques by Matthias BoehmApache SystemML Optimizer and Runtime techniques by Matthias Boehm
Apache SystemML Optimizer and Runtime techniques by Matthias BoehmArvind Surve
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMartin Zapletal
 
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Gruter
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupTill Rohrmann
 
Intro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupIntro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupGwen (Chen) Shapira
 
Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015Sperasoft
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteMatt Massie
 
Sparse Data Support in MLlib
Sparse Data Support in MLlibSparse Data Support in MLlib
Sparse Data Support in MLlibXiangrui Meng
 
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized EngineApache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized EngineDataWorks Summit
 
maxbox starter60 machine learning
maxbox starter60 machine learningmaxbox starter60 machine learning
maxbox starter60 machine learningMax Kleiner
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsFlink Forward
 

Similaire à Bringing Algebraic Semantics to Mahout (20)

Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame Graphs
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
 
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
 
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
 
Writing Fast MATLAB Code
Writing Fast MATLAB CodeWriting Fast MATLAB Code
Writing Fast MATLAB Code
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Apache SystemML Optimizer and Runtime techniques by Matthias Boehm
Apache SystemML Optimizer and Runtime techniques by Matthias BoehmApache SystemML Optimizer and Runtime techniques by Matthias Boehm
Apache SystemML Optimizer and Runtime techniques by Matthias Boehm
 
Apache SystemML Optimizer and Runtime techniques by Matthias Boehm
Apache SystemML Optimizer and Runtime techniques by Matthias BoehmApache SystemML Optimizer and Runtime techniques by Matthias Boehm
Apache SystemML Optimizer and Runtime techniques by Matthias Boehm
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache Spark
 
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning Group
 
Intro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupIntro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data Meetup
 
Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger institute
 
Sparse Data Support in MLlib
Sparse Data Support in MLlibSparse Data Support in MLlib
Sparse Data Support in MLlib
 
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized EngineApache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
 
maxbox starter60 machine learning
maxbox starter60 machine learningmaxbox starter60 machine learning
maxbox starter60 machine learning
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
 

Plus de sscdotopen

Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenderssscdotopen
 
New Directions in Mahout's Recommenders
New Directions in Mahout's RecommendersNew Directions in Mahout's Recommenders
New Directions in Mahout's Recommenderssscdotopen
 
Introduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache MahoutIntroduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache Mahoutsscdotopen
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReducesscdotopen
 
Latent factor models for Collaborative Filtering
Latent factor models for Collaborative FilteringLatent factor models for Collaborative Filtering
Latent factor models for Collaborative Filteringsscdotopen
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraphsscdotopen
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processingsscdotopen
 

Plus de sscdotopen (8)

Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenders
 
New Directions in Mahout's Recommenders
New Directions in Mahout's RecommendersNew Directions in Mahout's Recommenders
New Directions in Mahout's Recommenders
 
Introduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache MahoutIntroduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache Mahout
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduce
 
Latent factor models for Collaborative Filtering
Latent factor models for Collaborative FilteringLatent factor models for Collaborative Filtering
Latent factor models for Collaborative Filtering
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
 
mahout-cf
mahout-cfmahout-cf
mahout-cf
 

Dernier

VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...gajnagarg
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...amitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...amitlee9823
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...amitlee9823
 
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Pooja Nehwal
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...gajnagarg
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...amitlee9823
 

Dernier (20)

VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
 

Bringing Algebraic Semantics to Mahout

  • 1. Bringing Algebraic Semantics to Apache Mahout Sebastian Schelter Infolunch @ Hasso Plattner Institut, Potsdam 05/07/2014
  • 2. Mahout: Past & Future
  • 3. Apache Mahout: History • library for scalable machine learning (ML) • started six years ago as ML on MapReduce • focus on popular ML problems and algorithms – Collaborative Filtering • (Item-based, User-based, Matrix Factorization) – Classification • (Naive Bayes, Logistic Regression, Random Forest) – Clustering • (k-Means, Streaming k-Means) – Dimensionality Reduction • (Lanczos, Stochastic SVD) + in-memory math library (fork of Colt) • large userbase (e.g. Adobe, AOL, Accenture, Foursquare, Mendeley, Researchgate, Twitter)
  • 4. Apache Mahout: Problems • Organisational – Onerous to provide support and documentation for scalable ML (need to convey technical and theoretical background at the same time) – partly failed to compensate for developers that left the project – barrier for contributions too low • Technical – MapReduce not well suited for ML • slow execution, especially for iterations • constrained programming model makes code hard to write, read and adjust – different quality of implementations – lack of unified preprocessing machinery
  • 5. Apache Mahout: Future • Narrowing of focus – removed a large number of rarely used or barely maintained algorithm implementations • Abandonment of MapReduce – will reject new MapReduce implementations – widely used „legacy“ implementations will be maintained • „Reboot“ – usage of modern parallel processing systems which offer richer progamming model & more efficient execution → Apache Spark, possibly Stratosphere in the future – DSL for linear algebraic operations as foundation for new implemenations – automatic optimization and parallelization of programs written in this DSL
  • 7. Requirements for an ideal ML environment 1. R/Matlab-like semantics – type system that covers (a) linear algebra, (b) statistics and (c) data frames 2. Modern programming language qualities – functional programming – object oriented programming – sensible byte code Performance – scriptable and interactive 3. Scalability – automatic distribution and parallelization with sensible performance 4. Collection of off-the-shelf building blocks and algorithms 5. Visualization
  • 8. Requirements for an ideal ML environment 1. R/Matlab-like semantics – type system that covers (a) linear algebra, (b) statistics and (c) data frames 2. Modern programming language qualities – functional programming – object oriented programming – sensible byte code Performance – scriptable and Interactive 3. Scalability – automatic distribution and parallelization with sensible performance 4. Collection of off-the-shelf building blocks and algorithms 5. Visualization → Mahout‘s new Scala & Spark-bindings address requirements 1(a), 2, 3 and 4
  • 9. Scala & Spark Bindings • Scala as programming/scripting environment • R-like DSL : val G = B %*% B.t - C - C.t + (ksi dot ksi) * (s_q cross s_q) • Algebraic expression optimizer for distributed linear algebra – provides a translation layer to distributed engines – currently supports Apache Spark only – might support Stratosphere in the future q T q TTT ssCCBBG 
  • 10. Data Types • Scalar real values • In-memory vectors – dense – 2 types of sparse • In-memory matrices – sparse and dense – a number of specialized matrices • Distributed Row Matrices (DRM) – huge matrix, partitioned by rows – lives in the main memory of the cluster – provides small set of parallelized operations – lazily evaluated operation execution val x = 2.367 val v = dvec(1, 0, 5) val w = svec((0 -> 1)::(2 -> 5)::Nil) val A = dense((1, 0, 5), (2, 1, 4), (4, 3, 1)) val drmA = drmFromHDFS(...)
  • 11. Features (1) • Matrix, vector, scalar operators: in-memory, out-of-core • Slicing operators • Assignments (in-memory only) • Vector-specific • Summaries drmA %*% drmB A %*% x A.t %*% drmB A * B A(5 until 20, 3 until 40) A(5, ::); A(5, 5) x(a to b) A(5, ::) := x A *= B A -=: B; 1 /:= x x dot y; x cross y A.nrow; x.length; A.colSums; B.rowMeans x.sum; A.norm
  • 12. Features (2) • solving linear systems • in-memory decompositions • out-of-core decompositions • caching of DRMs val x = solve(A, b) val (inMemQ, inMemR) = qr(inMemM) val ch = chol(inMemM) val (inMemV, d) = eigen(inMemM) val (inMemU, inMemV, s) = svd(inMemM) val (drmQ, inMemR) = thinQR(drmA) val (drmU, drmV, s) = dssvd(drmA, k = 50, q = 1) val drmA_cached = drmA.checkpoint() drmA_cached.uncache()
  • 14. Cereals Name protein fat carbo sugars rating Apple Cinnamon Cheerios 2 2 10.5 10 29.509541 Cap‘n‘Crunch 1 2 12 12 18.042851 Cocoa Puffs 1 1 12 13 22.736446 Froot Loops 2 1 11 13 32.207582 Honey Graham Ohs 1 2 12 11 21.871292 Wheaties Honey Gold 2 1 16 8 36.187559 Cheerios 6 2 17 1 50.764999 Clusters 3 2 13 7 40.400208 Great Grains Pecan 3 3 13 4 45.811716 http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html
  • 15. Linear Regression • Assumption: target variable y generated by linear combination of feature matrix X with parameter vector β, plus noise ε • Goal: find estimate of the parameter vector β that explains the data well • Cereals example X = weights of ingredients y = customer rating   Xy
  • 16. Data Ingestion • Usually: load dataset as DRM from a distributed filesystem: val drmData = drmFromHdfs(...) • ‚Mimick‘ a large dataset for our example: val drmData = drmParallelize(dense( (2, 2, 10.5, 10, 29.509541), // Apple Cinnamon Cheerios (1, 2, 12, 12, 18.042851), // Cap'n'Crunch (1, 1, 12, 13, 22.736446), // Cocoa Puffs (2, 1, 11, 13, 32.207582), // Froot Loops (1, 2, 12, 11, 21.871292), // Honey Graham Ohs (2, 1, 16, 8, 36.187559), // Wheaties Honey Gold (6, 2, 17, 1, 50.764999), // Cheerios (3, 2, 13, 7, 40.400208), // Clusters (3, 3, 13, 4, 45.811716)), // Great Grains Pecan numPartitions = 2)
  • 17. Data Preparation • Cereals example: target variable y is customer rating, weights of ingredients are features X • extract X as DRM by slicing, fetch y as in-core vector val drmX = drmData(::, 0 until 4) val y = drmData.collect(::, 4)                             8117164541333 4002084071323 7649995011726 1875593681612 87129221111221 20758232131112 73644622131211 04285118121221 509541291051022 . . . . . . . . .. drmX y
  • 18. Estimating β • Ordinary Least Squares: minimizes the sum of residual squares between true target variable and prediction of target variable • Closed-form expression for estimation of ß as • Computing XTX and XTy is as simple as typing the formulas: val drmXtX = drmX.t %*% drmX val drmXty = drmX %*% y yXXX TT 1 )(ˆ  
  • 19. Estimating β • Solve the following linear system to get least-squares estimate of ß • Fetch XTX andXTy onto the driver and use an in-core solver – assumes XTX fits into memory – uses analogon to R’s solve() function val XtX = drmXtX.collect val Xty = drmXty.collect(::, 0) val betaHat = solve(XtX, Xty) yXXX TT ˆ
  • 20. Estimating β • Solve the following linear system to get least-squares estimate of ß • Fetch XTX andXTy onto the driver and use an in-memory solver – assumes XTX fits into memory – uses analogon to R’s solve() function val XtX = drmXtX.collect val Xty = drmXty.collect(::, 0) val betaHat = solve(XtX, Xty) yXXX TT ˆ → We have implemented distributed linear regression!
  • 21. Goodness of fit • Prediction of the target variable is simple matrix-vector multiplication • Check L2 norm of the difference between true target variable and our prediction val yHat = (drmX %*% betaHat).collect(::, 0) (y - yHat).norm(2) ˆˆ Xy 
  • 22. Adding a bias term • Bias term left out so far – constant factor added to the model, “shifts the line vertically” • Common trick is to add a column of ones to the feature matrix – bias term will be learned automatically                             141333 171323 111726 181612 1111221 1131112 1131211 1121221 11051022 .                             41333 71323 11726 81612 111221 131112 131211 121221 105.1022
  • 23. Adding a bias term • How do we add a new column to a DRM? → mapBlock() allows custom modifications to the matrix val drmXwithBiasColumn = drmX.mapBlock(ncol = drmX.ncol + 1) { case(keys, block) => // create a new block with an additional column val blockWithBiasCol = block.like(block.nrow, block.ncol+1) // copy data from current block into the new block blockWithBiasCol(::, 0 until block.ncol) := block // last column consists of ones blockWithBiasColumn(::, block.ncol) := 1 keys -> blockWithBiasColumn }
  • 25. Underlying systems • currently: prototype on Apache Spark – fast and expressive cluster computing system – general computation graphs, in-memory primitives, rich API, interactive shell • future: add Stratosphere – research system developed by TU Berlin, HU Berlin, HPI – accepted into Apache Incubator recently – functionality similar to Apache Spark, adds data flow optimization and efficient out-of-core execution
  • 26. Optimization • Execution is defered, user composes logical operators • Computational actions implicitly trigger optimization (= selection of physical plan) and execution • Optimization factors: size of operands, orientation of operands, partitioning, sharing of computational paths • e. g.: matrix multiplication: – 5 physical operators for drmA %*% drmB – 2 operators for drmA %*% inMemA – 1 operator for drm A %*% x – 1 operator for x %*% drmA val drmA = drmB.t %*% drmC drmA.writeDrm(path); val inMemA = (drmB.t %*% drmB).collect
  • 27. Optimization Example (1) • Computation of XTX in the linear regression example • Logical optimization: logical plan gets rewritten to use a special logical operator for Transpose-Times-Self multiplication via a rule in the optimizer val drmXtX = drmX.t %*% drmX val XtX = drmXtX.collect OpAt OpAB drmX drmX drmXtX OpAtA drmX drmXtX
  • 28. Optimization Example (2) • Mahout computes XTX via row-outer-product formulation of matrix multiplication that can be efficiently executed in a single pass over the row-partitioned X • optimizer chooses between to physical operators for OpAtA – standard implementation: • Map operator builds partial outer products from individual blocks based on an estimate of a good partitioning • Reduce operator adds them up using a Reduce operator – AtA_slim for tall but skinny matrices when an intermediate symmetric matrix requiring (X.ncol * X.ncol) / 2 entries fits into memory     m i T ii T xxXX 0
  • 29. Thank you. Questions? Credits: Design & implementation of the Scala & Spark Bindings and parts of this slideset by Dmitriy Lyubimov dlyubimov@apache.org