Hands-on: Exercise
Machine Learning using
Apache Spark MLlib
July 2016
Dr. Thanachart Numnonda
IMC Institute
thanachart@imcinstitute.com
What is MLlib?
Source: MapR Academy
MLlib is a Spark subproject providing machine
learning primitives:
– initial contribution from AMPLab, UC Berkeley
– shipped with Spark since version 0.8
– 33 contributors
What is MLlib?
Classification: logistic regression, linear support vector
machine (SVM), naive Bayes
Regression: generalized linear regression (GLM)
Collaborative filtering: alternating least squares (ALS)
Clustering: k-means
Decomposition: singular value decomposition (SVD),
principal component analysis (PCA)
MLlib Algorithms
What is in MLlib?
Source: MLlib: Spark's Machine Learning Library, A. Talwalkar
Part of Spark
Scalable
Support: Python, Scala, Java
Broad coverage of applications & algorithms
Rapid developments in speed & robustness
MLlib: Benefits
Machine Learning
Machine learning is a scientific discipline that
explores the construction and study of algorithms
that can learn from data.
[Wikipedia]
A point is just a set of numbers. This set of numbers, or
coordinates, defines the point's position in space.
Points and vectors are the same thing.
Dimensions in vectors are called features.
Hyperspace is a space with more than three dimensions.
Example: A person has the following dimensions:
– Weight
– Height
– Age
Thus, the point (160,69,24) would be interpreted as
160 lb weight, 69 inches height, and 24 years of age.
Vectors
Source: Spark Cookbook
Spark has local vectors and matrices and also distributed
matrices.
– Distributed matrix is backed by one or more RDDs.
– A local vector has numeric indices and double values, and is
stored on a single machine.
Two types of local vectors in MLlib:
– Dense vector is backed by an array of its values.
– Sparse vector is backed by two parallel arrays, one for indices
and another for values.
Example
– Dense vector: [160.0,69.0,24.0]
– Sparse vector: (3,[0,1,2],[160.0,69.0,24.0])
Vectors in MLlib
Source: Spark Cookbook
Library
– import org.apache.spark.mllib.linalg.{Vectors,Vector}
Signature of Vectors.dense:
– def dense(values: Array[Double]): Vector
Signature of Vectors.sparse:
– def sparse(size: Int, indices: Array[Int], values: Array[Double]): Vector
Vectors in MLlib (cont.)
Example
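The example on this slide is an image in the original deck. A minimal re-creation in the spark-shell, using only the dense and sparse signatures above (expected output shown in comments):
scala> import org.apache.spark.mllib.linalg.{Vectors,Vector}
scala> val dense = Vectors.dense(160.0,69.0,24.0)
scala> val sparse = Vectors.sparse(3,Array(0,1,2),Array(160.0,69.0,24.0))
scala> println(dense)   // [160.0,69.0,24.0]
scala> println(sparse)  // (3,[0,1,2],[160.0,69.0,24.0])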
A labeled point is a local vector (sparse or dense) that
has an associated label with it.
Labeled data is used in supervised learning to help train
algorithms.
The label is stored as a double value in LabeledPoint.
Labeled point
Source: Spark Cookbook
Example
scala> import org.apache.spark.mllib.linalg.{Vectors,Vector}
scala> import org.apache.spark.mllib.regression.LabeledPoint
scala> val willBuySUV = LabeledPoint(1.0,Vectors.dense(300.0,80,40))
scala> val willNotBuySUV = LabeledPoint(0.0,Vectors.dense(150.0,60,25))
scala> val willBuySUV = LabeledPoint(1.0,Vectors.sparse(3,Array(0,1,2),Array(300.0,80,40)))
scala> val willNotBuySUV = LabeledPoint(0.0,Vectors.sparse(3,Array(0,1,2),Array(150.0,60,25)))
Example (cont.)
# vi person_libsvm.txt
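The contents of person_libsvm.txt are not shown in the deck. A hypothetical file in LIBSVM format (label first, then 1-based index:value pairs), matching the earlier SUV example, might look like:
1 1:300 2:80 3:40
0 1:150 2:60 3:25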
scala> import org.apache.spark.mllib.util.MLUtils
scala> import org.apache.spark.rdd.RDD
scala> val persons = MLUtils.loadLibSVMFile(sc,"hdfs:///user/cloudera/person_libsvm.txt")
scala> persons.first()
Spark has local matrices and also distributed matrices.
– Distributed matrix is backed by one or more RDDs.
– A local matrix is stored on a single machine.
There are three types of distributed matrices in MLlib:
– RowMatrix: This has each row as a feature vector.
– IndexedRowMatrix: This also has row indices.
– CoordinateMatrix: This is simply a matrix of MatrixEntry. A
MatrixEntry represents an entry in the matrix, identified by its
row and column index.
Matrices in MLlib
Source: Spark Cookbook
Example
scala> import org.apache.spark.mllib.linalg.{Vectors,Matrix,Matrices}
scala> val people = Matrices.dense(3,2,Array(150d,60d,25d,300d,80d,40d))
scala> val personRDD = sc.parallelize(List(Vectors.dense(150,60,25), Vectors.dense(300,80,40)))
scala> import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix, CoordinateMatrix, MatrixEntry}
scala> val personMat = new RowMatrix(personRDD)
Example
scala> print(personMat.numRows)
scala> val personRDD = sc.parallelize(List(IndexedRow(0L, Vectors.dense(150,60,25)), IndexedRow(1L, Vectors.dense(300,80,40))))
scala> val pirmat = new IndexedRowMatrix(personRDD)
scala> val personMat = pirmat.toRowMatrix
scala> val meRDD = sc.parallelize(List(MatrixEntry(0,0,150), MatrixEntry(1,0,60), MatrixEntry(2,0,25), MatrixEntry(0,1,300), MatrixEntry(1,1,80), MatrixEntry(2,1,40)))
scala> val pcmat = new CoordinateMatrix(meRDD)
Central tendency of data: mean, mode, median
Spread of data: variance, standard deviation
Boundary conditions: min, max
Statistic functions
Example
scala> import org.apache.spark.mllib.linalg.{Vectors,Vector}
scala> import org.apache.spark.mllib.stat.Statistics
scala> val personRDD = sc.parallelize(List(Vectors.dense(150,60,25), Vectors.dense(300,80,40)))
scala> val summary = Statistics.colStats(personRDD)
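The returned summary is a MultivariateStatisticalSummary; its column-wise accessors can then be queried:
scala> println(summary.mean)      // column-wise mean
scala> println(summary.variance)  // column-wise variance
scala> println(summary.min)       // column-wise minimum
scala> println(summary.max)       // column-wise maximum
scala> println(summary.count)     // number of rows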
Hands-on
Movie Recommendation
Source: MLlib: Spark's Machine Learning Library, A. Talwalkar
Recommendation
Source: MLlib: Spark's Machine Learning Library, A. Talwalkar
Recommendation: Collaborative Filtering
Source: MLlib: Spark's Machine Learning Library, A. Talwalkar
Recommendation
Source: MLlib: Spark's Machine Learning Library, A. Talwalkar
Recommendation: ALS
Source: MLlib: Scalable Machine Learning on Spark, X. Meng
Alternating least squares (ALS)
numBlocks is the number of blocks used to parallelize
computation (set to -1 to autoconfigure)
rank is the number of latent factors in the model
iterations is the number of iterations to run
lambda specifies the regularization parameter in ALS
implicitPrefs specifies whether to use the explicit feedback
ALS variant or one adapted for implicit feedback data
alpha is a parameter applicable to the implicit feedback
variant of ALS that governs the baseline confidence in
preference observations
MLlib: ALS Algorithm
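To see how these parameters map onto the API, here is a sketch of the two training calls (assuming an RDD[Rating] named ratings, like the one built in the hands-on below; the values are illustrative):
scala> import org.apache.spark.mllib.recommendation.ALS
scala> // explicit feedback: rank = 50, iterations = 10, lambda = 0.01
scala> val explicitModel = ALS.train(ratings, 50, 10, 0.01)
scala> // implicit feedback: same parameters plus the alpha confidence weight
scala> val implicitModel = ALS.trainImplicit(ratings, 50, 10, 0.01, 0.01)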
MovieLens Dataset
1) Type command > wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
2) Type command > yum install unzip
3) Type command > unzip ml-100k.zip
4) Type command > more ml-100k/u.user
Moving dataset to HDFS
1) Type command > cd ml-100k
2) Type command > hadoop fs -mkdir /user/cloudera/movielens
3) Type command > hadoop fs -put u.user /user/cloudera/movielens
4) Type command > hadoop fs -put u.data /user/cloudera/movielens
5) Type command > hadoop fs -put u.genre /user/cloudera/movielens
6) Type command > hadoop fs -put u.item /user/cloudera/movielens
7) Type command > hadoop fs -ls /user/cloudera/movielens
Start Spark-shell with extra memory
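The exact command in the original screenshot is not reproduced; a typical invocation would be (the memory size is illustrative):
$ spark-shell --driver-memory 2g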
Extracting features from the MovieLens dataset
scala> val rawData = sc.textFile("hdfs:///user/cloudera/movielens/u.data")
scala> rawData.first()
scala> val rawRatings = rawData.map(_.split("\t").take(3))
scala> rawRatings.first()
scala> import org.apache.spark.mllib.recommendation.Rating
scala> val ratings = rawRatings.map { case Array(user, movie, rating) => Rating(user.toInt, movie.toInt, rating.toDouble) }
scala> ratings.first()
Training the recommendation model
scala> import org.apache.spark.mllib.recommendation.ALS
scala> val model = ALS.train(ratings, 50, 10, 0.01)
Note: We'll use a rank of 50, 10 iterations, and a lambda parameter of 0.01
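A quick sanity check on the trained model; userFeatures and productFeatures are the factor RDDs of the returned MatrixFactorizationModel:
scala> model.userFeatures.count     // one latent-factor vector per user
scala> model.productFeatures.count  // one latent-factor vector per movie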
Inspecting the recommendations
scala> val movies = sc.textFile("hdfs:///user/cloudera/movielens/u.item")
scala> val titles = movies.map(line => line.split("\\|").take(2)).map(array => (array(0).toInt, array(1))).collectAsMap()
Inspecting the recommendations (cont.)
scala> val moviesForUser = ratings.keyBy(_.user).lookup(789)
scala> moviesForUser.sortBy(-_.rating).take(10).map(rating => (titles(rating.product), rating.rating)).foreach(println)
Top 10 Recommendation for userid 789
scala> val topKRecs = model.recommendProducts(789,10)
scala> topKRecs.map(rating => (titles(rating.product), rating.rating)).foreach(println)
Evaluating Performance: Mean Squared Error
scala> val actualRating = moviesForUser.take(1)(0)
scala> val predictedRating = model.predict(789, actualRating.product)
scala> val squaredError = math.pow(predictedRating - actualRating.rating, 2.0)
Overall Mean Squared Error
scala> val usersProducts = ratings.map{ case Rating(user, product, rating) => (user, product)}
scala> val predictions = model.predict(usersProducts).map{ case Rating(user, product, rating) => ((user, product), rating)}
scala> val ratingsAndPredictions = ratings.map{
         case Rating(user, product, rating) => ((user, product), rating)
       }.join(predictions)
scala> val MSE = ratingsAndPredictions.map{
         case ((user, product), (actual, predicted)) => math.pow((actual - predicted), 2)
       }.reduce(_ + _) / ratingsAndPredictions.count
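To display the result (the square root, RMSE, is on the same scale as the ratings themselves):
scala> println("Mean Squared Error = " + MSE)
scala> val RMSE = math.sqrt(MSE)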
Clustering using K-Means
Market segmentation
Social network analysis: Finding a coherent group of
people in the social network for ad targeting
Data center computing clusters
Real estate: Identifying neighborhoods based on similar
features
Text analysis: Dividing text documents, such as novels or
essays, into genres
Clustering use cases
Source: Mahout in Action
Sample Data
Distance Measures
Source: www.edureka.in/data-science
Euclidean distance measure
Squared Euclidean distance measure
Manhattan distance measure
Cosine distance measure
Distance Measures
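As a concrete reference, a plain-Scala sketch of these measures (function names are illustrative):
def squaredEuclidean(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
def euclidean(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(squaredEuclidean(a, b))
def manhattan(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => math.abs(x - y) }.sum
def cosineDistance(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val norm = (v: Array[Double]) => math.sqrt(v.map(x => x * x).sum)
  1.0 - dot / (norm(a) * norm(b))
}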
Distance Measures
K-Means Clustering
Source: www.edureka.in/data-science
Example of K-Means Clustering
http://stanford.edu/class/ee103/visualizations/kmeans/kmeans.html
K-Means with different distance measures
Source: Mahout in Action
Choosing number of clusters
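A common way to choose k is the elbow method: train for a range of k and look for the point where the within-set sum of squared errors (WSSSE) stops dropping sharply. A sketch, assuming a cached RDD[Vector] named data:
scala> :paste
import org.apache.spark.mllib.clustering.KMeans
(2 to 10).foreach { k =>
  val model = KMeans.train(data, k, 10)   // k clusters, 10 iterations
  println(s"k=$k, WSSSE=" + model.computeCost(data))
}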
Dimensionality reduction
Process of reducing the number of dimensions or
features.
Dimensionality reduction serves several purposes
– Data compression
– Visualization
The most popular algorithm: Principal component
analysis (PCA).
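In MLlib, PCA is exposed on RowMatrix. A minimal sketch, assuming an RDD[Vector] named vectors (for example, the movieVectors built in the clustering hands-on later):
scala> :paste
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val mat = new RowMatrix(vectors)
val pc = mat.computePrincipalComponents(2)  // top 2 principal components
val projected = mat.multiply(pc)            // rows projected into the 2-D space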
Dimensionality reduction
Source: Spark Cookbook
Dimensionality reduction with SVD
Singular Value Decomposition (SVD) is based on a theorem
from linear algebra: a rectangular matrix A can be broken
down into the product of three matrices.
Dimensionality reduction with SVD
The basic idea behind SVD:
– Take a high-dimensional, highly variable set of data points.
– Reduce it to a lower-dimensional space that exposes the
structure of the original data more clearly and orders it
from the most variation to the least.
So we can simply ignore variation below a certain threshold
to massively reduce the original data, making sure that the
original relationships of interest are retained.
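MLlib exposes SVD on RowMatrix as well. A minimal sketch, again assuming an RDD[Vector] named vectors:
scala> :paste
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val mat = new RowMatrix(vectors)
val svd = mat.computeSVD(2, computeU = true) // keep the top 2 singular values
val U = svd.U  // distributed left singular vectors (a RowMatrix)
val s = svd.s  // singular values, largest first (a Vector)
val V = svd.V  // right singular vectors (a local Matrix)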
Hands-on
Clustering on MovieLens Dataset
Extracting features from the MovieLens dataset
scala> val movies = sc.textFile("hdfs:///user/cloudera/movielens/u.item")
scala> println(movies.first)
scala> val genres = sc.textFile("hdfs:///user/cloudera/movielens/u.genre")
scala> genres.take(5).foreach(println)
Extracting features from the MovieLens dataset (cont.)
scala> val genreMap = genres.filter(!_.isEmpty).map(line => line.split("\\|")).map(array => (array(1), array(0))).collectAsMap
Extracting features from the MovieLens dataset (cont.)
scala> val titlesAndGenres = movies.map(_.split("\\|")).map { array =>
  val genres = array.toSeq.slice(5, array.size)
  val genresAssigned = genres.zipWithIndex.filter { case (g, idx) =>
    g == "1"
  }.map { case (g, idx) =>
    genreMap(idx.toString)
  }
  (array(0).toInt, (array(1), genresAssigned))
}
Training the recommendation model
scala> :paste
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
val rawData = sc.textFile("hdfs:///user/cloudera/movielens/u.data")
val rawRatings = rawData.map(_.split("\t").take(3))
val ratings = rawRatings.map{ case Array(user, movie, rating) => Rating(user.toInt, movie.toInt, rating.toDouble) }
ratings.cache
val alsModel = ALS.train(ratings, 50, 10, 0.1)
import org.apache.spark.mllib.linalg.Vectors
val movieFactors = alsModel.productFeatures.map { case (id, factor) => (id, Vectors.dense(factor)) }
val movieVectors = movieFactors.map(_._2)
val userFactors = alsModel.userFeatures.map { case (id, factor) => (id, Vectors.dense(factor)) }
val userVectors = userFactors.map(_._2)
Normalization
scala> :paste
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val movieMatrix = new RowMatrix(movieVectors)
val movieMatrixSummary = movieMatrix.computeColumnSummaryStatistics()
val userMatrix = new RowMatrix(userVectors)
val userMatrixSummary = userMatrix.computeColumnSummaryStatistics()
println("Movie factors mean: " + movieMatrixSummary.mean)
println("Movie factors variance: " + movieMatrixSummary.variance)
println("User factors mean: " + userMatrixSummary.mean)
println("User factors variance: " + userMatrixSummary.variance)
Output from Normalization
Training a clustering model
scala> import org.apache.spark.mllib.clustering.KMeans
scala> val numClusters = 5
scala> val numIterations = 10
scala> val numRuns = 3
scala> val movieClusterModel = KMeans.train(movieVectors, numClusters, numIterations, numRuns)
Making predictions using a clustering model
scala> val movie1 = movieVectors.first
scala> val movieCluster = movieClusterModel.predict(movie1)
scala> val predictions = movieClusterModel.predict(movieVectors)
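A quick check on how the movies spread across the five clusters (not in the original deck):
scala> predictions.countByValue.toSeq.sortBy(_._1).foreach(println)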
Interpreting cluster predictions
scala> :paste
import breeze.linalg._
import breeze.numerics.pow
def computeDistance(v1: DenseVector[Double], v2: DenseVector[Double]) = pow(v1 - v2, 2).sum
val titlesWithFactors = titlesAndGenres.join(movieFactors)
val moviesAssigned = titlesWithFactors.map { case (id, ((title, genres), vector)) =>
  val pred = movieClusterModel.predict(vector)
  val clusterCentre = movieClusterModel.clusterCenters(pred)
  val dist = computeDistance(DenseVector(clusterCentre.toArray), DenseVector(vector.toArray))
  (id, title, genres.mkString(" "), pred, dist)
}
Interpreting cluster predictions (cont.)
val clusterAssignments = moviesAssigned.groupBy { case (id, title, genres, cluster, dist) => cluster }.collectAsMap
for ( (k, v) <- clusterAssignments.toSeq.sortBy(_._1)) {
  println(s"Cluster $k:")
  val m = v.toSeq.sortBy(_._5)
  println(m.take(20).map { case (_, title, genres, _, d) => (title, genres, d) }.mkString("\n"))
  println("=====\n")
}
Real-time Machine Learning
using Streaming K-Means
Online learning with Spark Streaming
Streaming regression
– trainOn: This takes DStream[LabeledPoint] as its
argument.
– predictOn: This also takes DStream[LabeledPoint].
Streaming KMeans
– An extension of the mini-batch K-means algorithm
Streaming K-Means Program
MovieLens Training Dataset
● The rows of the training text files must be vector data in the form [x1,x2,x3,...,xn]
1) Type command > wget https://s3.amazonaws.com/imcbucket/data/movietest.data
2) Type command > more movietest.data
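The file itself is not reproduced in the deck; with numDimensions = 3 in the program below, rows would look like this (values illustrative):
[0.12,-0.38,0.97]
[1.05,0.23,-0.44]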
Install & Start Kafka Server
# wget http://www-us.apache.org/dist/kafka/0.9.0.1/kafka_2.10-0.9.0.1.tgz
# tar xzf kafka_2.10-0.9.0.1.tgz
# cd kafka_2.10-0.9.0.1
# bin/kafka-server-start.sh config/server.properties&
Start Spark-shell with extra memory
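As before, start the shell with extra driver memory, e.g. $ spark-shell --driver-memory 2g (the size is illustrative).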
Streaming K-Means
scala> :paste
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import StorageLevel._
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka.KafkaUtils
val ssc = new StreamingContext(sc, Seconds(2))
val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer-group", Map("java-topic" -> 5))
val lines = kafkaStream.map(_._2)
val ratings = lines.map(Vectors.parse)
val numDimensions = 3
val numClusters = 5
val model = new StreamingKMeans()
.setK(numClusters)
.setDecayFactor(1.0)
.setRandomCenters(numDimensions, 0.0)
model.trainOn(ratings)
model.predictOn(ratings).print()
ssc.start()
ssc.awaitTermination()
Running HelloKafkaProducer in another window
● Open a new ssh window
Java Code: Kafka Producer
import java.util.Properties;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;
import java.io.*;
public class HelloKafkaProducer {
final static String TOPIC = "java-topic";
public static void main(String[] argv){
Properties properties = new Properties();
properties.put("metadata.broker.list","localhost:9092");
properties.put("serializer.class","kafka.serializer.StringEncoder");
Java Code: Kafka Producer (cont.)
try(BufferedReader br = new BufferedReader(new FileReader(argv[0]))) {
  StringBuilder sb = new StringBuilder();
  ProducerConfig producerConfig = new ProducerConfig(properties);
  kafka.javaapi.producer.Producer<String,String> producer = new kafka.javaapi.producer.Producer<String,String>(producerConfig);
  String line = br.readLine();
  while (line != null) {
    KeyedMessage<String, String> message = new KeyedMessage<String, String>(TOPIC, line);
    producer.send(message);
    line = br.readLine();
  }
Java Code: Kafka Producer (cont.)
producer.close();
} catch (IOException ex) {
ex.printStackTrace();
}
}
}
Compile & Run the program
// Use a vi editor to edit the source code
# vi HelloKafkaProducer.java
// Alternatively
# wget https://s3.amazonaws.com/imcbucket/apps/HelloKafkaProducer.java
// Compile the program
# export CLASSPATH=".:/root/kafka_2.10-0.9.0.1/libs/*"
# javac HelloKafkaProducer.java
// Prepare the data
# cd
# wget https://s3.amazonaws.com/imcbucket/input/pg2600.txt
# cd kafka_2.10-0.9.0.1
// Run the program
# java HelloKafkaProducer /root/movietest.data
Example Result
Recommended Books
Thank you
www.imcinstitute.com
www.facebook.com/imcinstitute

Contenu connexe

Tendances

Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 

Tendances (20)

Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Introduction to Pig
Introduction to PigIntroduction to Pig
Introduction to Pig
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Apache flink
Apache flinkApache flink
Apache flink
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Mining Data Streams
Mining Data StreamsMining Data Streams
Mining Data Streams
 

En vedette

New Data Transfer Tools for Hadoop: Sqoop 2
New Data Transfer Tools for Hadoop: Sqoop 2New Data Transfer Tools for Hadoop: Sqoop 2
New Data Transfer Tools for Hadoop: Sqoop 2
DataWorks Summit
 

En vedette (20)

Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine Learning
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Library
 
Crab: A Python Framework for Building Recommender Systems
Crab: A Python Framework for Building Recommender Systems Crab: A Python Framework for Building Recommender Systems
Crab: A Python Framework for Building Recommender Systems
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
 
Collaborative Filtering using KNN
Collaborative Filtering using KNNCollaborative Filtering using KNN
Collaborative Filtering using KNN
 
Recommender Systems with Apache Spark's ALS Function
Recommender Systems with Apache Spark's ALS FunctionRecommender Systems with Apache Spark's ALS Function
Recommender Systems with Apache Spark's ALS Function
 
Big data processing using Hadoop with Cloudera Quickstart
Big data processing using Hadoop with Cloudera QuickstartBig data processing using Hadoop with Cloudera Quickstart
Big data processing using Hadoop with Cloudera Quickstart
 
Apache Sqoop: A Data Transfer Tool for Hadoop
Apache Sqoop: A Data Transfer Tool for HadoopApache Sqoop: A Data Transfer Tool for Hadoop
Apache Sqoop: A Data Transfer Tool for Hadoop
 
สมุดกิจกรรม Code for Kids
สมุดกิจกรรม Code for Kidsสมุดกิจกรรม Code for Kids
สมุดกิจกรรม Code for Kids
 
ITSS Overview
ITSS OverviewITSS Overview
ITSS Overview
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using Mahout
 
Apache sqoop with an use case
Apache sqoop with an use caseApache sqoop with an use case
Apache sqoop with an use case
 
Thai Software & Software Market Survey 2015
Thai Software & Software Market Survey 2015Thai Software & Software Market Survey 2015
Thai Software & Software Market Survey 2015
 
Introduction to Apache Sqoop
Introduction to Apache SqoopIntroduction to Apache Sqoop
Introduction to Apache Sqoop
 
New Data Transfer Tools for Hadoop: Sqoop 2
New Data Transfer Tools for Hadoop: Sqoop 2New Data Transfer Tools for Hadoop: Sqoop 2
New Data Transfer Tools for Hadoop: Sqoop 2
 
Big data: Loading your data with flume and sqoop
Big data:  Loading your data with flume and sqoopBig data:  Loading your data with flume and sqoop
Big data: Loading your data with flume and sqoop
 
Advanced Sqoop
Advanced Sqoop Advanced Sqoop
Advanced Sqoop
 
Mobile User and App Analytics in China
Mobile User and App Analytics in ChinaMobile User and App Analytics in China
Mobile User and App Analytics in China
 
Collaborative Filtering and Recommender Systems By Navisro Analytics
Collaborative Filtering and Recommender Systems By Navisro AnalyticsCollaborative Filtering and Recommender Systems By Navisro Analytics
Collaborative Filtering and Recommender Systems By Navisro Analytics
 
Install Apache Hadoop for Development/Production
Install Apache Hadoop for  Development/ProductionInstall Apache Hadoop for  Development/Production
Install Apache Hadoop for Development/Production
 

Similaire à Machine Learning using Apache Spark MLlib

MLconf NYC Xiangrui Meng
MLconf NYC Xiangrui MengMLconf NYC Xiangrui Meng
MLconf NYC Xiangrui Meng
MLconf
 
MLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedMLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reduced
Chao Chen
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
OSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine LearningOSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine Learning
Robin Anil
 

Similaire à Machine Learning using Apache Spark MLlib (20)

Large scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkLarge scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using spark
 
2014.06.24.what is ubix
2014.06.24.what is ubix2014.06.24.what is ubix
2014.06.24.what is ubix
 
Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streaming
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the union
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahout
 
MLconf NYC Xiangrui Meng
MLconf NYC Xiangrui MengMLconf NYC Xiangrui Meng
MLconf NYC Xiangrui Meng
 
Big data analytics_beyond_hadoop_public_18_july_2013
Big data analytics_beyond_hadoop_public_18_july_2013Big data analytics_beyond_hadoop_public_18_july_2013
Big data analytics_beyond_hadoop_public_18_july_2013
 
MLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedMLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reduced
 
Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
OSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine LearningOSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine Learning
 
Productionalizing spark streaming applications
Productionalizing spark streaming applicationsProductionalizing spark streaming applications
Productionalizing spark streaming applications
 
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark ML
 
A look ahead at spark 2.0
A look ahead at spark 2.0 A look ahead at spark 2.0
A look ahead at spark 2.0
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 

Plus de IMC Institute

Plus de IMC Institute (20)

นิตยสาร Digital Trends ฉบับที่ 14
นิตยสาร Digital Trends ฉบับที่ 14นิตยสาร Digital Trends ฉบับที่ 14
นิตยสาร Digital Trends ฉบับที่ 14
 
Digital trends Vol 4 No. 13 Sep-Dec 2019
Digital trends Vol 4 No. 13  Sep-Dec 2019Digital trends Vol 4 No. 13  Sep-Dec 2019
Digital trends Vol 4 No. 13 Sep-Dec 2019
 
บทความ The evolution of AI
บทความ The evolution of AIบทความ The evolution of AI
บทความ The evolution of AI
 
IT Trends eMagazine Vol 4. No.12
IT Trends eMagazine  Vol 4. No.12IT Trends eMagazine  Vol 4. No.12
IT Trends eMagazine Vol 4. No.12
 
เพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformation
เพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformationเพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformation
เพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformation
 
IT Trends 2019: Putting Digital Transformation to Work
IT Trends 2019: Putting Digital Transformation to WorkIT Trends 2019: Putting Digital Transformation to Work
IT Trends 2019: Putting Digital Transformation to Work
 
มูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรม
มูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรมมูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรม
มูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรม
 
IT Trends eMagazine Vol 4. No.11
IT Trends eMagazine  Vol 4. No.11IT Trends eMagazine  Vol 4. No.11
IT Trends eMagazine Vol 4. No.11
 
แนวทางการทำ Digital transformation
แนวทางการทำ Digital transformationแนวทางการทำ Digital transformation
แนวทางการทำ Digital transformation
 
บทความ The New Silicon Valley
บทความ The New Silicon Valleyบทความ The New Silicon Valley
บทความ The New Silicon Valley
 
นิตยสาร IT Trends ของ IMC Institute ฉบับที่ 10
นิตยสาร IT Trends ของ  IMC Institute  ฉบับที่ 10นิตยสาร IT Trends ของ  IMC Institute  ฉบับที่ 10
นิตยสาร IT Trends ของ IMC Institute ฉบับที่ 10
 
แนวทางการทำ Digital transformation
แนวทางการทำ Digital transformationแนวทางการทำ Digital transformation
แนวทางการทำ Digital transformation
 
The Power of Big Data for a new economy (Sample)
The Power of Big Data for a new economy (Sample)The Power of Big Data for a new economy (Sample)
The Power of Big Data for a new economy (Sample)
 
บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง
บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง
บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง
 
IT Trends eMagazine Vol 3. No.9
IT Trends eMagazine  Vol 3. No.9 IT Trends eMagazine  Vol 3. No.9
IT Trends eMagazine Vol 3. No.9
 
Thailand software & software market survey 2016
Thailand software & software market survey 2016Thailand software & software market survey 2016
Thailand software & software market survey 2016
 
Developing Business Blockchain Applications on Hyperledger
Developing Business  Blockchain Applications on Hyperledger Developing Business  Blockchain Applications on Hyperledger
Developing Business Blockchain Applications on Hyperledger
 
Digital transformation @thanachart.org
Digital transformation @thanachart.orgDigital transformation @thanachart.org
Digital transformation @thanachart.org
 
บทความ Big Data จากบล็อก thanachart.org
บทความ Big Data จากบล็อก thanachart.orgบทความ Big Data จากบล็อก thanachart.org
บทความ Big Data จากบล็อก thanachart.org
 
กลยุทธ์ 5 ด้านกับการทำ Digital Transformation
กลยุทธ์ 5 ด้านกับการทำ Digital Transformationกลยุทธ์ 5 ด้านกับการทำ Digital Transformation
กลยุทธ์ 5 ด้านกับการทำ Digital Transformation
 

Dernier

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Dernier (20)

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 

Machine Learning using Apache Spark MLlib

  • 1. thanachart@imcinstitute.com1 Hands-on: Exercise Machine Learning using Apache Spark MLlib July 2016 Dr.Thanachart Numnonda IMC Institute thanachart@imcinstitute.com
  • 3. thanachart@imcinstitute.com3 MLlib is a Spark subproject providing machine learning primitives: – initial contribution from AMPLab, UC Berkeley – shipped with Spark since version 0.8 – 33 contributors What is MLlib?
  • 4. thanachart@imcinstitute.com4 Classification: logistic regression, linear support vector machine(SVM), naive Bayes Regression: generalized linear regression (GLM) Collaborative filtering: alternating least squares (ALS) Clustering: k-means Decomposition: singular value decomposition (SVD), principal component analysis (PCA) Mllib Algorithms
  • 5. thanachart@imcinstitute.com5 What is in MLlib? Source: Mllib:Spark's Machine Learning Library, A. Talwalkar
  • 6. thanachart@imcinstitute.com6 Part of Spark Scalable Support: Python, Scala, Java Broad coverage of applications & algorithms Rapid developments in speed & robustness MLlib: Benefits
  • 7. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Machine Learning Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from data. [Wikipedia]
  • 8. thanachart@imcinstitute.com8 A point is just a set of numbers. This set of numbers or coordinates defines the point's position in space. Points and vectors are same thing. Dimensions in vectors are called features Hyperspace is a space with more than three dimensions. Example: A person has the following dimensions: – Weight – Height – Age Thus, the interpretation of point (160,69,24) would be 160 lb weight, 69 inches height, and 24 years age. Vectors Source:Spark Cookbook
  • 9. thanachart@imcinstitute.com9 Spark has local vectors and matrices and also distributed matrices. – Distributed matrix is backed by one or more RDDs. – A local vector has numeric indices and double values, and is stored on a single machine. Two types of local vectors in MLlib: – Dense vector is backed by an array of its values. – Sparse vector is backed by two parallel arrays, one for indices and another for values. Example – Dense vector: [160.0,69.0,24.0] – Sparse vector: (3,[0,1,2],[160.0,69.0,24.0]) Vectors in MLlib Source:Spark Cookbook
  • 10. thanachart@imcinstitute.com10 Library – import org.apache.spark.mllib.linalg.{Vectors,Vector} Signature of Vectors.dense: – def dense(values: Array[Double]): Vector Signature of Vectors.sparse: – def sparse(size: Int, indices: Array[Int], values: Array[Double]): Vector Vectors in Mllib (cont.)
  • 12. thanachart@imcinstitute.com12 Labeled point is a local vector (sparse/dense), ), which has an associated label with it. Labeled data is used in supervised learning to help train algorithms. Label is stored as a double value in LabeledPoint. Labeled point Source:Spark Cookbook
  • 13. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Example scala> import org.apache.spark.mllib.linalg.{Vectors,Vector} scala> import org.apache.spark.mllib.regression.LabeledPoint scala> val willBuySUV = LabeledPoint(1.0,Vectors.dense(300.0,80,40)) scala> val willNotBuySUV = LabeledPoint(0.0,Vectors.dense(150.0,60,25)) scala> val willBuySUV = LabeledPoint(1.0,Vectors.sparse(3,Array(0,1,2),Array(300.0,80, 40))) scala> val willNotBuySUV = LabeledPoint(0.0,Vectors.sparse(3,Array(0,1,2),Array(150.0,60, 25)))
  • 14. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Example (cont) # vi person_libsvm.txt scala> import org.apache.spark.mllib.util.MLUtils scala> import org.apache.spark.rdd.RDD scala> val persons = MLUtils.loadLibSVMFile(sc,"hdfs:///user/cloudera/person_libsvm .txt") scala> persons.first()
  • 15. thanachart@imcinstitute.com15 Spark has local matrices and also distributed matrices. – Distributed matrix is backed by one or more RDDs. – A local matrix stored on a single machine. There are three types of distributed matrices in MLlib: – RowMatrix: This has each row as a feature vector. – IndexedRowMatrix: This also has row indices. – CoordinateMatrix: This is simply a matrix of MatrixEntry. A MatrixEntry represents an entry in the matrix represented by its row and column index Matrices in MLlib Source:Spark Cookbook
  • 16. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Example scala> import org.apache.spark.mllib.linalg.{Vectors,Matrix, Matrices} scala> val people = Matrices.dense(3,2,Array(150d,60d,25d, 300d,80d,40d)) scala> val personRDD = sc.parallelize(List(Vectors.dense(150,60,25), Vectors.dense(300,80,40))) scala> import org.apache.spark.mllib.linalg.distributed. {IndexedRow, IndexedRowMatrix,RowMatrix, CoordinateMatrix, MatrixEntry} scala> val personMat = new RowMatrix(personRDD)
  • 17. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Example scala> print(personMat.numRows) scala> val personRDD = sc.parallelize(List(IndexedRow(0L, Vectors.dense(150,60,25)), IndexedRow(1L, Vectors.dense(300,80,40)))) scala> val pirmat = new IndexedRowMatrix(personRDD) scala> val personMat = pirmat.toRowMatrix scala> val meRDD = sc.parallelize(List( MatrixEntry(0,0,150), MatrixEntry(1,0,60), MatrixEntry(2,0,25), MatrixEntry(0,1,300), MatrixEntry(1,1,80),MatrixEntry(2,1,40) )) scala> val pcmat = new CoordinateMatrix(meRDD)
  • 18. thanachart@imcinstitute.com18 Central tendency of data—mean, mode, median Spread of data—variance, standard deviation Boundary conditions—min, max Statistic functions
  • 19. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Example scala> import org.apache.spark.mllib.linalg.{Vectors,Vector} scala> import org.apache.spark.mllib.stat.Statistics scala> val personRDD = sc.parallelize(List(Vectors.dense(150,60,25), Vectors.dense(300,80,40))) scala> val summary = Statistics.colStats(personRDD)
  • 20. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Hands-on Movie Recommendation
  • 21. thanachart@imcinstitute.com21 Source: Mllib:Spark's Machine Learning Library, A. Talwalkar Recommendation
  • 22. thanachart@imcinstitute.com22 Source: Mllib:Spark's Machine Learning Library, A. Talwalkar Recommendation: Collaborative Filtering
  • 23. thanachart@imcinstitute.com23 Source: Mllib:Spark's Machine Learning Library, A. Talwalkar Recommendation
  • 24. thanachart@imcinstitute.com24 Source: Mllib:Spark's Machine Learning Library, A. Talwalkar Recommendation: ALS
  • 25. thanachart@imcinstitute.com25 Source: MLlib: Scalable Machine Learning on Spark, X. Meng Alternating least squares (ALS)
  • 26. thanachart@imcinstitute.com26 numBlocks is the number of blocks used to parallelize computation (set to -1 to autoconfigure) rank is the number of latent factors in the model iterations is the number of iterations to run lambda specifies the regularization parameter in ALS implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for an implicit feedback data alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations MLlib: ALS Algorithm
  • 27. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer MovieLen Dataset 1)Type command > wget http://files.grouplens.org/datasets/movielens/ml-100k.zip 2)Type command > yum install unzip 3)Type command > unzip ml-100k.zip 4)Type command > more ml-100k/u.user
  • 28. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Moving dataset to HDFS 1)Type command > cd ml-100k 2)Type command > hadoop fs -mkdir /user/cloudera/movielens 3)Type command > hadoop fs -put u.user /user/cloudera/movielens 4)Type command > hadoop fs -put u.data /user/cloudera/movielens 4)Type command > hadoop fs -put u.genre /user/cloudera/movielens 5)Type command > hadoop fs -put u.item /user/cloudera/movielens 6)Type command > hadoop fs -ls /user/cloudera/movielens
  • 29. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Start Spark-shell with extra memory
  • 30. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Extracting features from the MovieLens dataset scala> val rawData = sc.textFile("hdfs:///user/cloudera/movielens/u.data") scala> rawData.first() scala> val rawRatings = rawData.map(_.split("t").take(3)) scala> rawRatings.first() scala> import org.apache.spark.mllib.recommendation.Rating scala> val ratings = rawRatings.map { case Array(user, movie, rating) =>Rating(user.toInt, movie.toInt, rating.toDouble) } scala> ratings.first()
  • 31. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Training the recommendation model scala> import org.apache.spark.mllib.recommendation.ALS scala> val model = ALS.train(ratings, 50, 10, 0.01) Note: We'll use rank of 50, 10 iterations, and a lambda parameter of 0.01
  • 32. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Inspecting the recommendations scala> val movies = sc.textFile("hdfs:///user/cloudera/movielens/u.item") scala> val titles = movies.map(line => line.split("|").take(2)).map(array =>(array(0).toInt,array(1))).collectAsMap()
  • 33. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Inspecting the recommendations (cont.) scala> val moviesForUser = ratings.keyBy(_.user).lookup(789) scala> moviesForUser.sortBy(-_.rating).take(10).map(rating => (titles(rating.product), rating.rating)).foreach(println)
  • 34. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Top 10 Recommendation for userid 789 scala> val topKRecs = model.recommendProducts(789,10) scala> topKRecs.map(rating => (titles(rating.product), rating.rating)).foreach(println)
  • 35. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Evaluating Performance: Mean Squared Error scala> val actualRating = moviesForUser.take(1)(0) scala> val predictedRating = model.predict(789, actualRating.product) scala> val squaredError = math.pow(predictedRating - actualRating.rating, 2.0)
  • 36. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Overall Mean Squared Error scala> val usersProducts = ratings.map{ case Rating(user, product, rating) => (user, product)} scala> val predictions = model.predict(usersProducts).map{ case Rating(user, product, rating) => ((user, product), rating)} scala> val ratingsAndPredictions = ratings.map{ case Rating(user, product, rating) => ((user, product), rating) }.join(predictions) scala> val MSE = ratingsAndPredictions.map{ case ((user, product), (actual, predicted)) => math.pow((actual - predicted), 2) }.reduce(_ + _) / ratingsAndPredictions.count
  • 37. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Clustering using K-Means
  • 38. thanachart@imcinstitute.com38 Market segmentation Social network analysis: Finding a coherent group of people in the social network for ad targeting Data center computing clusters Real estate: Identifying neighborhoods based on similar features Text analysis: Dividing text documents, such as novels or essays, into genres Clustering use cases
  • 42. thanachart@imcinstitute.com42 Euclidean distance measure Squared Euclidean distance measure Manhattan distance measure Cosine distance measure Distance Measures
  • 47. thanachart@imcinstitute.com47 K-Means with different distance measures Source: Mahout in Action
  • 49. thanachart@imcinstitute.com49 Dimensionality reduction Process of reducing the number of dimensions or features. Dimensionality reduction serves several purposes – Data compression – Visualization The most popular algorithm: Principal component analysis (PCA).
  • 51. thanachart@imcinstitute.com51 Dimensionality reduction with SVD Singular Value Decomposition (SVD): is based on a theorem from linear algebra that a rectangular matrix A can be broken down into a product of three matrices
  • 52. thanachart@imcinstitute.com52 Dimensionality reduction with SVD The basic idea behind SVD – Take a high dimension, a highly variable set of data points – Reduce it to a lower dimensional space that exposes the structure of the original data more clearly and orders it from the most variation to the least. So we can simply ignore variation below a certain threshold to massively reduce the original data, making sure that the original relationship interests are retained.
  • 53. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Hands-on Clustering on MovieLens Dataset
  • 54. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Extracting features from the MovieLens dataset scala> val rawData = sc.textFile("hdfs:///user/cloudera/movielens/u.item") scala> println(movies.first) scala> val genres = sc.textFile("hdfs:///user/cloudera/movielens/u.genre") scala> genres.take(5).foreach(println)
  • 55. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Extracting features from the MovieLens dataset (cont.) scala> val genreMap = genres.filter(!_.isEmpty).map(line => line.split("|")).map(array=> (array(1), array(0))).collectAsMap
  • 56. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Extracting features from the MovieLens dataset (cont.) scala> val titlesAndGenres = movies.map(_.split("|")).map { array => val genres = array.toSeq.slice(5, array.size) val genresAssigned = genres.zipWithIndex.filter { case (g, idx) => g == "1" }.map { case (g, idx) => genreMap(idx.toString) } (array(0).toInt, (array(1), genresAssigned)) }
  • 57. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Training the recommendation model scala> :paste import org.apache.spark.mllib.recommendation.ALS import org.apache.spark.mllib.recommendation.Rating val rawData = sc.textFile("hdfs:///user/cloudera/movielens/u.data") val rawRatings = rawData.map(_.split("t").take(3)) val ratings = rawRatings.map{ case Array(user, movie, rating) => Rating(user.toInt, movie.toInt, rating.toDouble) } ratings.cache val alsModel = ALS.train(ratings, 50, 10, 0.1) import org.apache.spark.mllib.linalg.Vectors val movieFactors = alsModel.productFeatures.map { case (id, factor) => (id, Vectors.dense(factor)) } val movieVectors = movieFactors.map(_._2) val userFactors = alsModel.userFeatures.map { case (id, factor) => (id, Vectors.dense(factor)) } val userVectors = userFactors.map(_._2)
  • 58. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Normalization scala> :paste import org.apache.spark.mllib.linalg.distributed.RowMatrix val movieMatrix = new RowMatrix(movieVectors) val movieMatrixSummary = movieMatrix.computeColumnSummaryStatistics() val userMatrix = new RowMatrix(userVectors) val userMatrixSummary = userMatrix.computeColumnSummaryStatistics() println("Movie factors mean: " + movieMatrixSummary.mean) println("Movie factors variance: " + movieMatrixSummary.variance) println("User factors mean: " + userMatrixSummary.mean) println("User factors variance: " + userMatrixSummary.variance)
  • 59. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Output from Normalization
  • 60. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Training a clustering model
scala> import org.apache.spark.mllib.clustering.KMeans
scala> val numClusters = 5
scala> val numIterations = 10
scala> val numRuns = 3
scala> val movieClusterModel = KMeans.train(movieVectors, numClusters,
  numIterations, numRuns)
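One simple way to judge the choice of k (not part of the original exercise) is the model's within-cluster sum of squared errors, which KMeansModel exposes as computeCost:
scala> val wcss = movieClusterModel.computeCost(movieVectors)
scala> println(s"WCSS for k=$numClusters: $wcss")
// Re-train with different k values and compare; lower is tighter,
// though WCSS always decreases as k grows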
  • 61. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Making predictions using a clustering model
scala> val movie1 = movieVectors.first
scala> val movieCluster = movieClusterModel.predict(movie1)
scala> val predictions = movieClusterModel.predict(movieVectors)
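As a quick follow-up (an illustrative addition, not from the original slides), the cluster sizes can be inspected by counting the predicted labels:
// predictions is an RDD[Int]; count how many movies land in each cluster
scala> predictions.countByValue().toSeq.sortBy(_._1).foreach {
  case (cluster, count) => println(s"Cluster $cluster: $count movies") }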
  • 62. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Interpreting cluster predictions
scala> :paste
import breeze.linalg._
import breeze.numerics.pow
// Squared Euclidean distance between two Breeze vectors
def computeDistance(v1: DenseVector[Double], v2: DenseVector[Double]) =
  pow(v1 - v2, 2).sum
val titlesWithFactors = titlesAndGenres.join(movieFactors)
val moviesAssigned = titlesWithFactors.map {
  case (id, ((title, genres), vector)) =>
    val pred = movieClusterModel.predict(vector)
    val clusterCentre = movieClusterModel.clusterCenters(pred)
    val dist = computeDistance(DenseVector(clusterCentre.toArray),
      DenseVector(vector.toArray))
    (id, title, genres.mkString(" "), pred, dist)
}
  • 63. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Interpreting cluster predictions (cont.)
val clusterAssignments = moviesAssigned.groupBy {
  case (id, title, genres, cluster, dist) => cluster
}.collectAsMap
// Print the 20 movies closest to each cluster centre
for ((k, v) <- clusterAssignments.toSeq.sortBy(_._1)) {
  println(s"Cluster $k:")
  val m = v.toSeq.sortBy(_._5)
  println(m.take(20).map { case (_, title, genres, _, d) =>
    (title, genres, d) }.mkString("\n"))
  println("=====\n")
}
  • 65. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Real-time Machine Learning using Streaming K-Means
  • 66. thanachart@imcinstitute.com66 Online learning with Spark Streaming Streaming regression – trainOn: This takes DStream[LabeledPoint] as its argument. – predictOn: This takes DStream[Vector] and returns a DStream of predicted values (a sketch follows below). Streaming KMeans – An extension of the mini-batch K-means algorithm
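A minimal sketch of the streaming regression API, assuming a spark-shell session and two hypothetical directories (/tmp/training, /tmp/test) into which LabeledPoint text data is dropped:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(2))
// Lines in the form "(label,[f1,f2,f3])" are parsed into LabeledPoints
val trainingStream = ssc.textFileStream("/tmp/training").map(LabeledPoint.parse)
val testStream = ssc.textFileStream("/tmp/test").map(LabeledPoint.parse)
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(3))  // 3 features assumed
model.trainOn(trainingStream)
// predictOnValues keeps the true label next to each prediction
model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()
ssc.start()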
  • 68. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer MovieLens Training Dataset ● The rows of the training text files must be vector data in the form [x1,x2,x3,...,xn], e.g. a line such as [0.5,0.2,0.1] for three features. 1) Type command > wget https://s3.amazonaws.com/imcbucket/data/movietest.data 2) Type command > more movietest.data
  • 69. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Install & Start Kafka Server
# wget http://www-us.apache.org/dist/kafka/0.9.0.1/kafka_2.10-0.9.0.1.tgz
# tar xzf kafka_2.10-0.9.0.1.tgz
# cd kafka_2.10-0.9.0.1
# bin/kafka-server-start.sh config/server.properties &
(This assumes a ZooKeeper instance is already running on localhost:2181, as on the Cloudera quickstart VM; otherwise start one first with bin/zookeeper-server-start.sh config/zookeeper.properties &)
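Depending on the broker's auto.create.topics.enable setting, the topic used later in the exercise may need to be created explicitly first; a sketch:
# bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic java-topic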
  • 70. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Start Spark-shell with extra memory
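The slide shows a screenshot; the command would look roughly like the following (the exact memory size is an assumption, and the --packages flag is only needed if the spark-streaming-kafka classes are not already on the classpath — the artifact version shown is an assumption for Spark 1.6):
$ spark-shell --driver-memory 2g --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0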
  • 71. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Streaming K-Means
scala> :paste
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import StorageLevel._
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka.KafkaUtils
val ssc = new StreamingContext(sc, Seconds(2))
  • 72. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer
val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181",
  "spark-streaming-consumer-group", Map("java-topic" -> 5))
val lines = kafkaStream.map(_._2)
// Each line is a vector in [x1,x2,x3] form; Vectors.parse converts it
val ratings = lines.map(Vectors.parse)
val numDimensions = 3
val numClusters = 5
val model = new StreamingKMeans()
  .setK(numClusters)
  .setDecayFactor(1.0)
  .setRandomCenters(numDimensions, 0.0)
model.trainOn(ratings)
model.predictOn(ratings).print()
ssc.start()
ssc.awaitTermination()
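A note on the design choice: setDecayFactor(1.0) weights all batches equally, so the model never forgets old data, while values below 1.0 discount older batches. The same behaviour can be expressed in half-lives (an alternative configuration, not part of the original exercise):
// Forget old data with a half-life of 5 batches
val forgetfulModel = new StreamingKMeans()
  .setK(numClusters)
  .setHalfLife(5.0, "batches")
  .setRandomCenters(numDimensions, 0.0)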
  • 73. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Running HelloKafkaProducer in another window ● Open a new ssh window
  • 74. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Java Code: Kafka Producer
import java.util.Properties;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;
import java.io.*;

public class HelloKafkaProducer {
  final static String TOPIC = "java-topic";

  public static void main(String[] argv){
    Properties properties = new Properties();
    properties.put("metadata.broker.list", "localhost:9092");
    properties.put("serializer.class", "kafka.serializer.StringEncoder");
  • 75. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Java Code: Kafka Producer (cont.)
    try (BufferedReader br = new BufferedReader(new FileReader(argv[0]))) {
      ProducerConfig producerConfig = new ProducerConfig(properties);
      kafka.javaapi.producer.Producer<String, String> producer =
        new kafka.javaapi.producer.Producer<String, String>(producerConfig);
      // Send each line of the input file as a message to the topic
      String line = br.readLine();
      while (line != null) {
        KeyedMessage<String, String> message =
          new KeyedMessage<String, String>(TOPIC, line);
        producer.send(message);
        line = br.readLine();
      }
  • 76. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Java Code: Kafka Producer (cont.)
      producer.close();
    } catch (IOException ex) {
      ex.printStackTrace();
    }
  }
}
  • 77. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Compile & Run the program
// Use a vi editor to edit the source code
# vi HelloKafkaProducer.java
// Alternatively
# wget https://s3.amazonaws.com/imcbucket/apps/HelloKafkaProducer.java
// Compile the program
# export CLASSPATH=".:/root/kafka_2.10-0.9.0.1/libs/*"
# javac HelloKafkaProducer.java
// Prepare the data (skip if movietest.data was already downloaded to /root)
# cd
# wget https://s3.amazonaws.com/imcbucket/data/movietest.data
# cd kafka_2.10-0.9.0.1
// Run the program
# java HelloKafkaProducer /root/movietest.data
  • 78. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Example Result
  • 81. Thanachart Numnonda, thanachart@imcinstitute.com June 2016Apache Spark : Train the trainer Thank you www.imcinstitute.com www.facebook.com/imcinstitute