Multiclass Logistic Regression: Derivation and Apache Spark Examples

Author: Marjan Sterjev
Logistic Regression is a supervised classification algorithm. Algorithms of this kind work as follows:
1. A classification model is trained on a provided training set of N samples whose class
labels are known (the class labels are usually assigned manually by a human).
2. The class labels of new, previously unseen samples are predicted with the model
generated in the previous step. This is known as sample classification.
Each sample in the training data set has $M$ numerical features (coordinates) as well as a class label
$y$. The number of classes is $K$ and the class label takes one of the values
$0, 1, 2, \ldots, K-1$. If $K = 2$ the classifier is binary. For $K \ge 3$ the classifier is multiclass.
A particular sample $X_i$ is represented as a column vector of length $M+1$:
$$
X_i^T = [1, x_1, x_2, \ldots, x_M] \tag{1}
$$
Note that the first feature $x_0$ of each sample vector is 1. It is “artificially” added to the “native”
sample features in order to support the intercept in the model vectors.
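The model is represented by $K-1$ vectors of size $M+1$ or, equivalently, by a matrix of dimension
$(K-1) \times (M+1)$:
$$
W = \begin{bmatrix}
w_{1,0} & w_{1,1} & \cdots & w_{1,M} \\
w_{2,0} & w_{2,1} & \cdots & w_{2,M} \\
\vdots & \vdots & \ddots & \vdots \\
w_{K-1,0} & w_{K-1,1} & \cdots & w_{K-1,M}
\end{bmatrix} \tag{2}
$$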
We will denote the model column vectors as:
$$
\begin{aligned}
W_1^T &= [w_{1,0}, w_{1,1}, \ldots, w_{1,M}] \\
W_2^T &= [w_{2,0}, w_{2,1}, \ldots, w_{2,M}] \\
&\ \ \vdots \\
W_{K-1}^T &= [w_{K-1,0}, w_{K-1,1}, \ldots, w_{K-1,M}]
\end{aligned} \tag{3}
$$
The purpose of the Logistic Regression algorithm is to build a model that predicts the class label for a
given sample. In Multiclass Logistic Regression this is a two-step process. First, the sample $X$ is
projected into $K$ probabilities, one probability per class:
$$
P(0 \mid X, W),\ P(1 \mid X, W),\ \ldots,\ P(K-1 \mid X, W) \tag{4}
$$
Each probability is obtained as a function of the sample and the model vectors:
$$
P(i \mid X, W) = f(X, W_1, W_2, \ldots, W_{K-1}) \tag{5}
$$
Second, the predicted class is the class with the maximum probability for that particular sample.
The class probabilities are defined as:
$$
P(0 \mid X, W) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{W_j^T X}} \tag{6}
$$
and for $k = 1, 2, \ldots, K-1$:
$$
P(k \mid X, W) = \frac{e^{W_k^T X}}{1 + \sum_{j=1}^{K-1} e^{W_j^T X}} \tag{7}
$$
Note that $\sum_{j=0}^{K-1} P(j \mid X, W) = 1$, i.e. (6) and (7) define a probability distribution.
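Equations (6) and (7) can be sketched in plain Scala, without Spark. This is only an illustration under a few assumptions: the object name `ClassProbabilities` and its methods are hypothetical, the model is passed as an array of the $K-1$ vectors $W_1, \ldots, W_{K-1}$, and the sample already carries the leading 1.

```scala
object ClassProbabilities {
  // Dot product of two equally sized vectors.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // Returns P(0|X,W), P(1|X,W), ..., P(K-1|X,W) for one sample.
  // w holds the K-1 model vectors W_1..W_{K-1}; x starts with the constant 1.
  def probabilities(w: Array[Array[Double]], x: Array[Double]): Array[Double] = {
    val exponents = w.map(wk => math.exp(dot(wk, x)))
    val denominator = 1.0 + exponents.sum
    // The reference class 0 contributes the numerator 1, see (6).
    (1.0 +: exponents).map(_ / denominator)
  }

  // Predicted class = index of the largest probability.
  def predict(w: Array[Array[Double]], x: Array[Double]): Int = {
    val p = probabilities(w, x)
    p.indexOf(p.max)
  }
}
```

With an all-zero model every exponent equals 1, so each class gets the probability $1/K$. This is the state from which the training described below starts.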
The training set consists of N samples with known class labels:
$$
\begin{aligned}
\text{Sample}_0 &= (X_0, y_0) \\
\text{Sample}_1 &= (X_1, y_1) \\
&\ \ \vdots \\
\text{Sample}_{N-1} &= (X_{N-1}, y_{N-1})
\end{aligned} \tag{8}
$$
The joint likelihood of the training set, assuming independent samples, is:
$$
\prod_{i=0}^{N-1} P(y_i \mid X_i, W) \tag{9}
$$
The Logistic Regression training process is a procedure that searches for the model
vectors $W_1, W_2, \ldots, W_{K-1}$ that maximize the above joint probability. The procedure is also known
as MLE (Maximum Likelihood Estimation).
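Because the logarithm is a strictly increasing function, maximizing the logarithm of a positive function
is the same as maximizing the function itself. If we also divide the logarithm by the number of samples,
in order to work with the average log-likelihood, the result is:
$$
\begin{aligned}
L &= \frac{1}{N} \log\left( \prod_{i=0}^{N-1} P(y_i \mid X_i, W) \right) \\
L &= \frac{1}{N} \sum_{i=0}^{N-1} \log\left( P(y_i \mid X_i, W) \right)
\end{aligned} \tag{10}
$$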
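If we substitute the probability formulas defined above we get:
$$
\begin{aligned}
L &= \frac{1}{N} \sum_{i=0}^{N-1} \log\left( P(y_i \mid X_i, W) \right) \\
L &= \frac{1}{N} \sum_{i=0}^{N-1} \left( I(y_i = 0) \log\left( \frac{1}{1 + \sum_{j=1}^{K-1} e^{W_j^T X_i}} \right) + (1 - I(y_i = 0)) \log\left( \frac{e^{W_{y_i}^T X_i}}{1 + \sum_{j=1}^{K-1} e^{W_j^T X_i}} \right) \right) \\
L &= \frac{1}{N} \sum_{i=0}^{N-1} \left( (1 - I(y_i = 0))\, W_{y_i}^T X_i + \log\left( \frac{1}{1 + \sum_{j=1}^{K-1} e^{W_j^T X_i}} \right) \right) \\
L &= \frac{1}{N} \sum_{i=0}^{N-1} \left( (1 - I(y_i = 0))\, W_{y_i}^T X_i - \log\left( 1 + \sum_{j=1}^{K-1} e^{W_j^T X_i} \right) \right)
\end{aligned} \tag{11}
$$
where $I$ is the indicator function defined as $I(\text{true}) = 1$, $I(\text{false}) = 0$. The factor
$1 - I(y_i = 0)$ removes the term $W_{y_i}^T X_i$ for samples of class 0, for which no model vector is defined.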
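The last line of (11) translates almost literally into code. The following plain Scala sketch is an illustration only: the object name `LogLikelihood` and the representation of the training set as a sequence of `(X_i, y_i)` pairs are assumptions, and each sample is expected to carry the leading 1.

```scala
object LogLikelihood {
  // Dot product of two equally sized vectors.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // Average log-likelihood L, last line of (11).
  // w: the K-1 model vectors W_1..W_{K-1}; samples: pairs (X_i, y_i).
  def average(w: Array[Array[Double]], samples: Seq[(Array[Double], Int)]): Double = {
    val perSample = samples.map { case (x, y) =>
      val scores = w.map(wk => dot(wk, x))
      // (1 - I(y = 0)) * W_y^T X: zero for the reference class 0.
      val ownScore = if (y == 0) 0.0 else scores(y - 1)
      ownScore - math.log(1.0 + scores.map(s => math.exp(s)).sum)
    }
    perSample.sum / samples.size
  }
}
```

For an all-zero model the result is $-\log K$, the average log-likelihood of a uniform guess, which gives a convenient baseline for checking that training actually improves the model.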
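The likelihood depends on each model vector and each coefficient therein. The gradient with respect to the
m-th coefficient of the k-th model vector, where $k = 1, 2, \ldots, K-1$ and $m = 0, 1, 2, \ldots, M$, is:
$$
\begin{aligned}
\frac{\partial L}{\partial w_{k,m}} &= \frac{1}{N} \sum_{i=0}^{N-1} \left( I(y_i = k)\, X_{i,m} - \frac{e^{W_k^T X_i}\, X_{i,m}}{1 + \sum_{j=1}^{K-1} e^{W_j^T X_i}} \right) \\
\frac{\partial L}{\partial w_{k,m}} &= \frac{1}{N} \sum_{i=0}^{N-1} X_{i,m} \left( I(y_i = k) - \frac{e^{W_k^T X_i}}{1 + \sum_{j=1}^{K-1} e^{W_j^T X_i}} \right) \\
\frac{\partial L}{\partial w_{k,m}} &= \frac{1}{N} \sum_{i=0}^{N-1} X_{i,m} \left( I(y_i = k) - P(k \mid X_i, W) \right)
\end{aligned} \tag{12}
$$
Here $X_{i,m}$ denotes the m-th feature of the sample $X_i$.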
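The gradient with respect to the whole model vector is the vector of the gradients with respect to each
coefficient, i.e.:
$$
\begin{aligned}
\frac{\partial L}{\partial W_k} &= \left[ \frac{\partial L}{\partial w_{k,0}}, \frac{\partial L}{\partial w_{k,1}}, \frac{\partial L}{\partial w_{k,2}}, \ldots, \frac{\partial L}{\partial w_{k,M}} \right]^T \\
\frac{\partial L}{\partial W_k} &= \frac{1}{N} \sum_{i=0}^{N-1} X_i \left( I(y_i = k) - P(k \mid X_i, W) \right)
\end{aligned} \tag{13}
$$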
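Equation (13) can be evaluated on a small in-memory training set as sketched below. The sketch is plain Scala and only illustrative: the names `Gradient`, `Sample` and `nonReferenceProbabilities` are hypothetical, and the samples are assumed to carry the leading 1.

```scala
object Gradient {
  // A training sample: the vector X_i (with leading 1) and its label y_i.
  type Sample = (Array[Double], Int)

  // Dot product of two equally sized vectors.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // P(k|X,W) for k = 1..K-1, stored at index k-1, see (7).
  def nonReferenceProbabilities(w: Array[Array[Double]], x: Array[Double]): Array[Double] = {
    val exponents = w.map(wk => math.exp(dot(wk, x)))
    val denominator = 1.0 + exponents.sum
    exponents.map(_ / denominator)
  }

  // Equation (13): row k-1 of the result holds dL/dW_k.
  def gradient(w: Array[Array[Double]], samples: Seq[Sample]): Array[Array[Double]] = {
    val g = Array.fill(w.length, w(0).length)(0.0)
    for ((x, y) <- samples) {
      val p = nonReferenceProbabilities(w, x)
      for (k <- w.indices; m <- x.indices) {
        val indicator = if (y == k + 1) 1.0 else 0.0
        g(k)(m) += x(m) * (indicator - p(k))
      }
    }
    // Average over the N samples.
    g.map(row => row.map(_ / samples.size))
  }
}
```

Each sample pushes the model vector of its own class towards the sample and pulls the other model vectors away from it, in proportion to the probability the model currently assigns to them.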
If the gradient with respect to a coefficient is positive, the likelihood increases when the coefficient
increases. Conversely, if the gradient is negative, the likelihood decreases when the coefficient increases.
The likelihood therefore increases if we update the coefficients by a small step proportional to the
gradient, in the direction of the gradient:
$$
\begin{aligned}
w_{k,m} &= w_{k,m} + \lambda \frac{\partial L}{\partial w_{k,m}} \\
W_k &= W_k + \lambda \frac{\partial L}{\partial W_k}
\end{aligned} \tag{14}
$$
The procedure is known as Gradient Ascent. If we were minimizing a loss function
(quadratic loss, negative log-likelihood), the update in (14) would go in the opposite direction (minus sign),
which is known as Gradient Descent.
Logistic Regression training is an iterative algorithm. The model vectors start from some initial state
(all zeros, for example) and are recalculated in each iteration. The updated vectors are the input for the
gradient calculations in the next iteration. The iterative procedure ends after some maximum number of
iterations, or earlier if the model vectors no longer change substantially from one iteration to the next.
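The iteration and its two stopping rules can be sketched independently of the likelihood by passing the gradient in as a function. This is an illustrative sketch, not the training code of this article: the names `GradientAscent`, `Model` and `run` are hypothetical, and the convergence test (every model vector moves by less than a tolerance in Euclidean norm) is one possible reading of "do not change substantially".

```scala
object GradientAscent {
  // The K-1 model vectors W_1..W_{K-1}.
  type Model = Array[Array[Double]]

  // Euclidean norm of a vector.
  private def norm(v: Array[Double]): Double = math.sqrt(v.map(c => c * c).sum)

  // Repeats the update W_k = W_k + lambda * dL/dW_k until every model vector
  // moves by less than `tolerance` or `maxIterations` is reached.
  // Returns the final model and the number of iterations performed.
  def run(initial: Model,
          gradient: Model => Model,
          stepSize: Int => Double,
          maxIterations: Int,
          tolerance: Double): (Model, Int) = {
    var w = initial
    var iterations = 0
    var converged = false
    while (iterations < maxIterations && !converged) {
      val g = gradient(w)
      val lambda = stepSize(iterations)
      val next = w.zip(g).map { case (wk, gk) =>
        wk.zip(gk).map { case (c, d) => c + lambda * d }
      }
      val changes = next.zip(w).map { case (a, b) =>
        norm(a.zip(b).map { case (u, v) => u - v })
      }
      converged = changes.forall(_ < tolerance)
      w = next
      iterations += 1
    }
    (w, iterations)
  }
}
```

For Logistic Regression the `gradient` argument would compute (13) over the training set, and `stepSize` would return the value of $\lambda$ for the given iteration.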
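The parameter $\lambda$ is the update step size. It is usually a constant such as 0.1 or a number that
decreases with each iteration. For example, with $i$ being the iteration index:
$$
a = -2\log(5), \qquad \lambda = \text{step} \cdot e^{\frac{a \cdot i}{\text{maxIterations}}} \tag{15}
$$
With this choice $\lambda$ decays from $\text{step}$ at $i = 0$ towards $\text{step}/25$ as $i$ approaches
$\text{maxIterations}$, because $e^{-2\log(5)} = 1/25$.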
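The schedule (15) is easy to tabulate. The sketch below is a plain Scala illustration, with the hypothetical names `StepSchedule` and `lambda`, and with $\log$ taken as the natural logarithm.

```scala
object StepSchedule {
  // a = -2 * log(5), so exp(a) = 1/25.
  val a: Double = -2.0 * math.log(5.0)

  // Step size for iteration i according to (15).
  def lambda(step: Double, i: Int, maxIterations: Int): Double =
    step * math.exp(a * i / maxIterations)
}
```

For `step = 1.0` and `maxIterations = 200` this gives 1.0 at iteration 0, 0.2 at iteration 100 and 0.04 at iteration 200.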
The Iris Data Set Preparation
The examples below demonstrate the training of a Multiclass Logistic Regression model on the Iris
data set. The Iris data set is well known and can be found online. The algorithm requires numeric class
labels. For that purpose the labels Iris-setosa, Iris-versicolor and Iris-virginica shall be replaced with
0, 1 and 2 respectively. Most text editors support this kind of find/replace modification.
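As an alternative to a manual find/replace, the label substitution can be scripted. The sketch below assumes the usual comma-separated layout of the Iris file with the class name in the last column. The names `IrisLabels`, `codes` and `convert` are hypothetical.

```scala
object IrisLabels {
  // Numeric codes used in place of the textual class names.
  val codes: Map[String, Int] = Map(
    "Iris-setosa" -> 0,
    "Iris-versicolor" -> 1,
    "Iris-virginica" -> 2)

  // Rewrites one CSV line by replacing the class name in the last column
  // with its numeric code. Returns None for blank or unrecognized lines.
  def convert(line: String): Option[String] = {
    val parts = line.trim.split(",")
    for {
      name <- parts.lastOption
      code <- codes.get(name)
    } yield (parts.init :+ code.toString).mkString(",")
  }
}
```

For example, `IrisLabels.convert("5.1,3.5,1.4,0.2,Iris-setosa")` yields `Some("5.1,3.5,1.4,0.2,0")`, while a blank line yields `None` and can simply be dropped.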
You can try the following examples in the Spark shell.
Apache Spark Multiclass Logistic Regression Example
import scala.util._
import org.apache.spark.sql._

// Element-wise arithmetic on arrays, used for the vector and matrix operations below
object ArrayExt extends Serializable {
  implicit class ArrayIntOperations(a: Array[Int]) extends Serializable {
    def +(b: Array[Int]): Array[Int] = (a, b).zipped.map(_ + _)
  }
  implicit class ArrayDoubleOperations(a: Array[Double]) extends Serializable {
    def +(b: Array[Double]): Array[Double] = (a, b).zipped.map(_ + _)
    def -(b: Array[Double]): Array[Double] = (a, b).zipped.map(_ - _)
    def *(b: Array[Double]): Array[Double] = (a, b).zipped.map(_ * _)
    def *(b: Double): Array[Double] = a.map(_ * b)
    def /(b: Array[Double]): Array[Double] = (a, b).zipped.map(_ / _)
    def /(b: Double): Array[Double] = a.map(_ / b)
    def dot(b: Array[Double]): Double = (a, b).zipped.map(_ * _).sum
  }
  implicit class ArrayDouble2Operations(a: Array[Array[Double]]) extends Serializable {
    def +(b: Array[Array[Double]]): Array[Array[Double]] = (a, b).zipped.map(_ + _)
    def -(b: Array[Array[Double]]): Array[Array[Double]] = (a, b).zipped.map(_ - _)
    def *(b: Array[Array[Double]]): Array[Array[Double]] = (a, b).zipped.map(_ * _)
    def *(b: Double): Array[Array[Double]] = a.map(_ * b)
    def /(b: Array[Array[Double]]): Array[Array[Double]] = (a, b).zipped.map(_ / _)
    def /(b: Int): Array[Array[Double]] = a.map(_ / b)
  }
}
import ArrayExt._
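// values: the features prefixed with the constant 1.0, see (1)
// sampler: random number used for the train/test split
case class IrisSample(values: Array[Double], label: Int, var predicted: Int, sampler: Double)
// Running sum of the per-sample gradients and the number of samples seen
case class Accumulator(var gradient: Array[Array[Double]], var count: Int)

//Load the Iris data set (blank lines, if any, are skipped)
val data = sc.textFile("C:/ml/iris-mv.data").filter(_.trim.nonEmpty).map(line => {
  val parts = line.split(",")
  val values = Array(1.0, parts(0).toDouble, parts(1).toDouble, parts(2).toDouble, parts(3).toDouble)
  val label = parts(4).toInt
  IrisSample(values, label, -1, Random.nextDouble)
})
data.cache()
// Roughly 60% of the samples are used for training and 40% for testing
val trainData = data.filter(sample => sample.sampler >= 0.4)
trainData.cache()
val testData = data.filter(sample => sample.sampler < 0.4)
testData.cache()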
//MLE Logistic Regression
val numFeatures = 4
val numClasses = 3
val maxNumIterations = 200
val step = 1.0
// L2 regularization (weight decay) coefficient applied in the update below
val mi = 0.01
val a = -2 * Math.log(5)
// One model vector per non-reference class, initialized to zeros
var w = Array.ofDim[Double](numClasses - 1, numFeatures + 1)
var finished = false
for (i <- 0 until maxNumIterations if !finished) {
  // Decaying step size, see (15)
  val lambda = step * Math.exp(a * i / maxNumIterations)
  println(s"Round $i with lambda = $lambda ...")
  val accumulator = trainData.aggregate(Accumulator(Array.ofDim[Double](numClasses - 1, numFeatures + 1), 0))(
    // Add the gradient contribution of one sample, see (13)
    (acc, sample) => {
      val exponents: Array[Double] = w.map(v => Math.exp(v dot sample.values))
      val referenceClassProbability = 1.0 / (1.0 + exponents.sum)
      val classProbabilities = exponents * referenceClassProbability
      for (k <- acc.gradient.indices) {
        // Row k of the gradient belongs to class k + 1
        val indicator = if (sample.label == (k + 1)) 1.0 else 0.0
        val probability = classProbabilities(k)
        acc.gradient(k) = acc.gradient(k) + sample.values * (indicator - probability)
      }
      acc.count = acc.count + 1
      acc
    },
    // Merge the partial results of two partitions
    (x, y) => { Accumulator(x.gradient + y.gradient, x.count + y.count) }
  )
  val w_old = w.clone
  val gradient = accumulator.gradient / accumulator.count
  // Gradient ascent step (14), with the penalty term mi * w subtracted from the gradient
  val update = gradient - w * mi
  w = w + update * lambda
  val w_diff = w - w_old
  // Stop when no model vector has moved by 0.01 or more (Euclidean norm)
  finished = w_diff.map(x => Math.sqrt(x dot x)).forall(_ < 0.01)
}
// Returns the class with the largest probability, see (6) and (7)
def predict(w: Array[Array[Double]], sample: IrisSample): Int = {
  val exponents: Array[Double] = w.map(v => Math.exp(v dot sample.values))
  val referenceClassProbability = 1.0 / (1.0 + exponents.sum)
  val classProbabilities = exponents * referenceClassProbability
  val classMaxProbability = classProbabilities.max
  val predicted = if (classMaxProbability < referenceClassProbability) 0 else (classProbabilities.indexOf(classMaxProbability) + 1)
  predicted
}
// Copy each test sample with its predicted class instead of mutating it in place
val irisPredictions = testData.map(sample => sample.copy(predicted = predict(w, sample)))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val irisPredictionsDF = irisPredictions.zipWithIndex.map({ case (sample, i) =>
  (i, sample.label, sample.predicted)
}).toDF("id", "label", "predicted")
// Note: in Spark 2.x and later registerTempTable is deprecated in favor of createOrReplaceTempView
irisPredictionsDF.registerTempTable("iris_predictions")
// Show the misclassified test samples
sqlContext.sql("SELECT id, label, predicted FROM iris_predictions WHERE label != predicted").show(100)
Apache Spark MLlib Logistic Regression Example
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql._

//Load the Iris data set
val data = sc.textFile("C:/ml/iris-mv.data").map(line => {
  val parts = line.split(",")
  val values = Vectors.dense(parts(0).toDouble, parts(1).toDouble, parts(2).toDouble, parts(3).toDouble)
  val label = parts(4).toDouble
  LabeledPoint(label, values)
})
// 60% of the samples are used for training and 40% for testing
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1).cache()
// Kept on one line so that the statement can be pasted into the shell as is
val model = new LogisticRegressionWithLBFGS().setNumClasses(3).run(training)
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// The tuples are (prediction, label), so the columns are named in that order
val irisPredictionsDF = predictionAndLabels.toDF("predicted", "label")
irisPredictionsDF.registerTempTable("iris_predictions")
// Show the misclassified test samples
sqlContext.sql("SELECT label, predicted FROM iris_predictions WHERE label != predicted").show(100)