# Multiclass Logistic Regression: Derivation and Apache Spark Examples

The article provides a detailed derivation of multiclass logistic regression as well as Apache Spark example code.


Author: Marjan Sterjev

Logistic Regression is a supervised classification algorithm. Algorithms of this kind work in two steps:

1. A classification model is trained on a provided training set of N samples whose class labels are known (the class labels are usually assigned manually by humans).
2. The class labels of new, previously unseen samples are predicted with the model produced in the previous step. This is known as sample classification.

Each sample in the training set has M numerical features (coordinates) as well as a class label y. The number of classes is K and the class label takes one of the values 0, 1, 2, ..., K−1. If K = 2 the classifier is binary; for K ≥ 3 the classifier is multiclass. A particular sample X_i is represented as a column vector of length M + 1:

$$X_i^T = [1, x_1, x_2, \ldots, x_M] \tag{1}$$

Note that the first feature x_0 of each sample vector is 1. It is "artificially" prepended to the "native" sample features in order to support the intercept in the model vectors.

The model is represented by K − 1 vectors of size M + 1, or equivalently a matrix of dimension (K − 1) × (M + 1):

$$W = \begin{bmatrix} w_{1,0} & w_{1,1} & \ldots & w_{1,M} \\ w_{2,0} & w_{2,1} & \ldots & w_{2,M} \\ \ldots & \ldots & \ldots & \ldots \\ w_{K-1,0} & w_{K-1,1} & \ldots & w_{K-1,M} \end{bmatrix} \tag{2}$$

We will denote the model column vectors as:

$$W_1^T = [w_{1,0}, w_{1,1}, \ldots, w_{1,M}], \quad W_2^T = [w_{2,0}, w_{2,1}, \ldots, w_{2,M}], \quad \ldots, \quad W_{K-1}^T = [w_{K-1,0}, w_{K-1,1}, \ldots, w_{K-1,M}] \tag{3}$$

The purpose of the Logistic Regression algorithm is to build a model that predicts the class label of a given sample. In Multiclass Logistic Regression this is a two-step process. First, the sample X is projected into K probabilities, one probability per class:
$$P(0 \mid X, W), \; P(1 \mid X, W), \; \ldots, \; P(K-1 \mid X, W) \tag{4}$$

Each probability is obtained as a function of the sample and the model vectors:

$$P(i \mid X, W) = f(X, W_1, W_2, \ldots, W_{K-1}) \tag{5}$$

The predicted class is the class with the maximum probability for that particular sample. The class probabilities are defined as:

$$P(0 \mid X, W) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{W_j^T X}} \tag{6}$$

and for k = 1, 2, ..., K−1:

$$P(k \mid X, W) = \frac{e^{W_k^T X}}{1 + \sum_{j=1}^{K-1} e^{W_j^T X}} \tag{7}$$

Note that $\sum_{j=0}^{K-1} P(j \mid X, W) = 1$, i.e. (6) and (7) define a probability distribution.

The training set consists of N samples with known class labels:

$$Sample_0 = (X_0, y_0), \quad Sample_1 = (X_1, y_1), \quad \ldots, \quad Sample_{N-1} = (X_{N-1}, y_{N-1}) \tag{8}$$

The joint likelihood of the training set is:

$$\prod_{i=0}^{N-1} P(y_i \mid X_i, W) \tag{9}$$

The Logistic Regression model training process is a procedure that searches for model vectors W_1, W_2, ..., W_{K−1} that maximize the above joint probability. The procedure is known as Maximum Likelihood Estimation (MLE). Maximizing the logarithm of a function is equivalent to maximizing the function itself. If we also divide the logarithm by the number of samples, so that we deal with the average log-likelihood, the result is:
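The reference-class probabilities in (6) and (7) are easy to check numerically. Below is a minimal NumPy sketch with made-up model and sample values (K = 3 classes, M = 2 features; all numbers are hypothetical):

```python
import numpy as np

# Hypothetical model: K = 3 classes, so K-1 = 2 model vectors of
# length M+1 = 3 (intercept coefficient first).
W = np.array([[0.5, -1.0,  0.3],
              [0.1,  0.4, -0.2]])
# Sample with the artificial leading 1 for the intercept, equation (1).
X = np.array([1.0, 2.0, -1.0])

exponents = np.exp(W @ X)          # e^{W_k^T X} for k = 1..K-1
denom = 1.0 + exponents.sum()
p0 = 1.0 / denom                   # P(0 | X, W), equation (6)
pk = exponents / denom             # P(k | X, W), equation (7)

probs = np.concatenate(([p0], pk))
print(probs, probs.sum())          # the K probabilities sum to 1
```
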
$$L = \frac{1}{N} \log \prod_{i=0}^{N-1} P(y_i \mid X_i, W) = \frac{1}{N} \sum_{i=0}^{N-1} \log P(y_i \mid X_i, W) \tag{10}$$

If we substitute the probability formulas defined above we get:

$$
\begin{aligned}
L &= \frac{1}{N} \sum_{i=0}^{N-1} \left( I(y_i = 0) \log \frac{1}{1 + \sum_{j=1}^{K-1} e^{W_j^T X_i}} + (1 - I(y_i = 0)) \log \frac{e^{W_{y_i}^T X_i}}{1 + \sum_{j=1}^{K-1} e^{W_j^T X_i}} \right) \\
  &= \frac{1}{N} \sum_{i=0}^{N-1} \left( (1 - I(y_i = 0)) \, W_{y_i}^T X_i - \log \left( 1 + \sum_{j=1}^{K-1} e^{W_j^T X_i} \right) \right)
\end{aligned}
\tag{11}
$$

where I is the indicator function defined as I(true) = 1, I(false) = 0.

The likelihood depends on each model vector and each coefficient therein. The gradient with respect to the m-th coefficient of the k-th model vector, where k = 1, 2, ..., K−1 and m = 0, 1, 2, ..., M, is:

$$
\begin{aligned}
\frac{\partial L}{\partial w_{k,m}} &= \frac{1}{N} \sum_{i=0}^{N-1} \left( I(y_i = k) X_{i,m} - \frac{e^{W_k^T X_i} X_{i,m}}{1 + \sum_{j=1}^{K-1} e^{W_j^T X_i}} \right) \\
&= \frac{1}{N} \sum_{i=0}^{N-1} X_{i,m} \left( I(y_i = k) - P(k \mid X_i, W) \right)
\end{aligned}
\tag{12}
$$

The gradient with respect to the whole model vector is the vector of gradients with respect to each of its coefficients:
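The analytic gradient (12) can be sanity-checked against a finite difference of the average log-likelihood (11). A small NumPy sketch on randomly generated toy data (all sizes and values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, N = 3, 2, 20
W = rng.normal(size=(K - 1, M + 1))
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, M))])  # leading 1 per sample
y = rng.integers(0, K, size=N)

def probabilities(W, X):
    # Columns: P(1|X,W) .. P(K-1|X,W); class 0 is the reference class.
    e = np.exp(X @ W.T)                        # shape (N, K-1)
    return e / (1.0 + e.sum(axis=1, keepdims=True))

def likelihood(W):
    # Average log-likelihood L, equation (11).
    p = probabilities(W, X)
    p0 = 1.0 - p.sum(axis=1)
    full = np.hstack([p0[:, None], p])         # P(0..K-1 | X_i, W)
    return np.log(full[np.arange(N), y]).mean()

def gradient(W):
    # Equation (13): dL/dW_k = (1/N) sum_i X_i (I(y_i = k) - P(k|X_i,W))
    p = probabilities(W, X)
    indicator = np.eye(K)[y][:, 1:]            # I(y_i = k) for k = 1..K-1
    return (indicator - p).T @ X / N

# Compare one analytic gradient entry with a forward difference.
g = gradient(W)
eps = 1e-6
Wp = W.copy(); Wp[1, 2] += eps
numeric = (likelihood(Wp) - likelihood(W)) / eps
print(g[1, 2], numeric)   # the two values should agree to ~1e-5
```
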
$$\frac{\partial L}{\partial W_k} = \left[ \frac{\partial L}{\partial w_{k,0}}, \frac{\partial L}{\partial w_{k,1}}, \ldots, \frac{\partial L}{\partial w_{k,M}} \right]^T = \frac{1}{N} \sum_{i=0}^{N-1} X_i \left( I(y_i = k) - P(k \mid X_i, W) \right) \tag{13}$$

If a coefficient's gradient is positive, the likelihood increases when the coefficient increases. On the contrary, if the gradient is negative, the likelihood decreases when the coefficient increases. The likelihood is therefore maximized by updating the coefficients proportionally, in the same direction as the gradient:

$$w_{k,m} = w_{k,m} + \lambda \frac{\partial L}{\partial w_{k,m}}, \qquad W_k = W_k + \lambda \frac{\partial L}{\partial W_k} \tag{14}$$

This procedure is known as Gradient Ascent. If we were instead minimizing a loss function (quadratic loss, negative likelihood), the update in (14) would go in the opposite direction (minus), which is known as Gradient Descent.

Logistic Regression is an iterative algorithm. The model vectors start from some initial state (all zeros, for example) and are recalculated in each iteration. The updated vectors are the input for the gradient calculations in the next iteration. The iterative procedure ends after some maximum number of iterations, or when the model vectors no longer change substantially between iterations.

The parameter λ is the update step size. It is usually a small number like 0.1, or a number that decreases with each iteration. For example:

$$a = -2 \log 5, \qquad \lambda = step \cdot e^{\frac{a \cdot i}{maxIterations}} \tag{15}$$

With this choice of a, λ decays smoothly from step at the first iteration down to step/25 at the last one.

### The Iris Data Set Preparation

The examples below demonstrate training of a Multiclass Logistic Regression model on the Iris data set. The Iris data set is well known and can be found online. The algorithm requires numeric class labels, so the labels Iris-setosa, Iris-versicolor and Iris-virginica shall be replaced with 0, 1 and 2 accordingly. Most text editors support this kind of find/replace modification. You can try the following examples in the Spark shell.
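The find/replace preparation step can equally well be scripted. A small Python sketch (the file names in the commented usage are assumptions; adjust them to your local copy of the data set):

```python
def convert_labels(lines):
    """Map the textual Iris class labels to numeric ones (0, 1, 2)."""
    labels = {"Iris-setosa": "0", "Iris-versicolor": "1", "Iris-virginica": "2"}
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # the raw file ends with blank lines
        parts = line.split(",")
        parts[-1] = labels[parts[-1]]
        out.append(",".join(parts))
    return out

# Usage (hypothetical paths):
# with open("iris.data") as src, open("iris-mv.data", "w") as dst:
#     dst.write("\n".join(convert_labels(src)) + "\n")
print(convert_labels(["5.1,3.5,1.4,0.2,Iris-setosa"]))
# -> ['5.1,3.5,1.4,0.2,0']
```
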
### Apache Spark Multiclass Logistic Regression Example

```scala
import scala.util._
import org.apache.spark.sql._

object ArrayExt extends Serializable {

  implicit class ArrayIntOperations(a: Array[Int]) extends Serializable {
    def +(b: Array[Int]): Array[Int] = (a, b).zipped.map(_ + _)
  }

  implicit class ArrayDoubleOperations(a: Array[Double]) extends Serializable {
    def +(b: Array[Double]): Array[Double] = (a, b).zipped.map(_ + _)
    def -(b: Array[Double]): Array[Double] = (a, b).zipped.map(_ - _)
    def *(b: Array[Double]): Array[Double] = (a, b).zipped.map(_ * _)
    def *(b: Double): Array[Double] = a.map(_ * b)
    def /(b: Array[Double]): Array[Double] = (a, b).zipped.map(_ / _)
    def /(b: Double): Array[Double] = a.map(_ / b)
    def dot(b: Array[Double]): Double = (a, b).zipped.map(_ * _).sum
  }

  implicit class ArrayDouble2Operations(a: Array[Array[Double]]) extends Serializable {
    def +(b: Array[Array[Double]]): Array[Array[Double]] = (a, b).zipped.map(_ + _)
    def -(b: Array[Array[Double]]): Array[Array[Double]] = (a, b).zipped.map(_ - _)
    def *(b: Array[Array[Double]]): Array[Array[Double]] = (a, b).zipped.map(_ * _)
    def *(b: Double): Array[Array[Double]] = a.map(_ * b)
    def /(b: Array[Array[Double]]): Array[Array[Double]] = (a, b).zipped.map(_ / _)
    def /(b: Int): Array[Array[Double]] = a.map(_ / b)
  }
}

import ArrayExt._

case class IrisSample(values: Array[Double], label: Int, var predicted: Int, sampler: Double)

case class Accumulator(var gradient: Array[Array[Double]], var count: Int)

// Load the Iris data set
val data = sc.textFile("C:/ml/iris-mv.data").map(line => {
  val parts = line.split(",")
  val values = Array(1.0, parts(0).toDouble, parts(1).toDouble,
    parts(2).toDouble, parts(3).toDouble)
  val label = parts(4).toInt
  IrisSample(values, label, -1, Random.nextDouble)
})
data.cache()

val trainData = data.filter(sample => sample.sampler >= 0.4)
trainData.cache()
val testData = data.filter(sample => sample.sampler < 0.4)
testData.cache()

// MLE Logistic Regression
val numFeatures = 4
val numClasses = 3
val maxNumIterations = 200
val step = 1.0
val mi = 0.01
val a = -2 * Math.log(5)

var w = Array.ofDim[Double](numClasses - 1, numFeatures + 1)
var finished = false
for (i <- 0 to maxNumIterations if !finished) {
  val lambda = step * Math.exp(a * i / maxNumIterations)
  println(s"Round $i with lambda = $lambda ...")
  val accumulator = trainData.aggregate(
    Accumulator(Array.ofDim[Double](numClasses - 1, numFeatures + 1), 0))(
    (a, sample) => {
      val exponents: Array[Double] = w.map(v => Math.exp(v dot sample.values))
      val referenceClassProbability = 1.0 / (1.0 + exponents.sum)
      val classProbabilities = exponents * referenceClassProbability
      for (i <- a.gradient.indices) {
        val indicator = if (sample.label == (i + 1)) 1 else 0
        val probability = classProbabilities(i)
        a.gradient(i) = a.gradient(i) + sample.values * (indicator - probability)
      }
      a.count = a.count + 1
      a
    },
    (x, y) => Accumulator(x.gradient + y.gradient, x.count + y.count)
  )
  val w_old = w.clone
  val gradient = accumulator.gradient / accumulator.count
  val update = gradient - w * mi
  w = w + update * lambda
  val w_diff = w - w_old
  finished = w_diff.map(x => Math.sqrt(x dot x)).forall(_ < 0.01)
}

def predict(w: Array[Array[Double]], sample: IrisSample): Int = {
  val exponents: Array[Double] = w.map(v => Math.exp(v dot sample.values))
  val referenceClassProbability = 1.0 / (1.0 + exponents.sum)
  val classProbabilities = exponents * referenceClassProbability
  val classMaxProbability = classProbabilities.max
  if (classMaxProbability < referenceClassProbability) 0
  else classProbabilities.indexOf(classMaxProbability) + 1
}

val irisPredictions = testData.map(sample => {
  sample.predicted = predict(w, sample)
  sample
})

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val irisPredictionsDF = irisPredictions.zipWithIndex.map({ case (sample, i) =>
  (i, sample.label, sample.predicted)
}).toDF("id", "label", "predicted")
irisPredictionsDF.registerTempTable("iris_predictions")
sqlContext.sql(
  "SELECT id, label, predicted FROM iris_predictions WHERE label != predicted"
).show(100)
```
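The decision rule inside `predict` above — return the reference class 0 when its probability beats every other class, otherwise the arg-max class plus one — can be sketched outside Spark as well. A minimal NumPy sketch with made-up model and sample values:

```python
import numpy as np

def predict(w, x):
    """Pick the class with maximum probability under equations (6) and (7).

    w: (K-1, M+1) model matrix; x: (M+1,) sample with the leading 1.
    Class 0 is the reference class.
    """
    exponents = np.exp(w @ x)              # e^{W_k^T x} for k = 1..K-1
    p0 = 1.0 / (1.0 + exponents.sum())     # P(0 | x, W)
    pk = exponents * p0                    # P(k | x, W)
    return 0 if pk.max() < p0 else int(pk.argmax()) + 1

# Toy model (hypothetical numbers) that strongly favours class 2.
w = np.array([[0.0, 0.1],
              [2.0, 1.0]])
x = np.array([1.0, 1.0])
print(predict(w, x))   # -> 2
```
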
### Apache Spark MLlib Logistic Regression Example

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql._

// Load the Iris data set
val data = sc.textFile("C:/ml/iris-mv.data").map(line => {
  val parts = line.split(",")
  val values = Vectors.dense(parts(0).toDouble, parts(1).toDouble,
    parts(2).toDouble, parts(3).toDouble)
  val label = parts(4).toDouble
  LabeledPoint(label, values)
})

val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1).cache()

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(3)
  .run(training)

val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Note the column order: the tuples are (prediction, label).
val irisPredictionsDF = predictionAndLabels.toDF("predicted", "label")
irisPredictionsDF.registerTempTable("iris_predictions")
sqlContext.sql(
  "SELECT label, predicted FROM iris_predictions WHERE label != predicted"
).show(100)
```
