Linear Regression Ordinary Least Squares Distributed Calculation Example

Provides an example for distributed linear regression ordinary least squares coefficients calculation.


1. Linear Regression – Ordinary Least Squares Distributed Calculation Example

Author: Marjan Sterjev

Linear regression is one of the most essential machine learning algorithms. It is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables X: x1, x2, x3, ..., xn. The model is also known as a trend line. If we can explain that relationship with a simple linear equation of the form

    y = bn*xn + ... + b2*x2 + b1*x1 + b0

then we can predict the value of y by substituting the X values into that equation. For example, consider that we have the following pairs of numbers (x, y):

    x:    0    1    2    3    4    5
    y:    3   16   24   37   44   56

Based on the provided example pairs (x, y), our task is to find a linear equation y = b1*x1 + b0 that matches the above pairs as closely as possible:

    b1 * 0 + b0 ~  3
    b1 * 1 + b0 ~ 16
    b1 * 2 + b0 ~ 24
    b1 * 3 + b0 ~ 37
    b1 * 4 + b0 ~ 44
    b1 * 5 + b0 ~ 56

The solution for the coefficients b1 and b0 shall minimize the overall squared error between the values predicted by the linear equation and the real ones. Let's define the matrices X, B and Y:

    X = | 0 1 |    B = | b1 |    Y = |  3 |
        | 1 1 |        | b0 |        | 16 |
        | 2 1 |                      | 24 |
        | 3 1 |                      | 37 |
        | 4 1 |                      | 44 |
        | 5 1 |                      | 56 |

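For illustration, the squared error being minimized can be written out in a few lines of R. The snippet below is a minimal sketch of ours (the function name squared.error is not part of the original slides):

    # Example (x, y) pairs from the slide
    x <- c(0, 1, 2, 3, 4, 5)
    y <- c(3, 16, 24, 37, 44, 56)

    # Overall squared error for candidate coefficients b1 and b0
    squared.error <- function(b1, b0) sum((b1 * x + b0 - y)^2)

    squared.error(10, 4)        # 18
    squared.error(10.34, 4.14)  # about 9.94, close to the OLS minimum
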
2. The matrix form of the conditions above is:

    X * B ~ Y

The Ordinary Least Squares (https://en.wikipedia.org/wiki/Ordinary_least_squares) closed-form solution for B is:

    B = (X^T * X)^-1 * X^T * Y

In R the linear regression model coefficients can be calculated as:

    > X <- matrix(c(0,1,1,1,2,1,3,1,4,1,5,1), ncol=2, byrow=TRUE)
    > Y <- matrix(c(3,16,24,37,44,56), ncol=1, byrow=TRUE)
    > solve(t(X)%*%X, t(X)%*%Y)
              [,1]
    [1,] 10.342857
    [2,]  4.142857

The linear regression coefficients are b1 = 10.34 and b0 = 4.14. Based on the linear regression model we can predict the value y for a previously unseen x value. For example, if x = 7 the predicted y value will be:

    10.34 * 7 + 4.14 = 76.52

The problem arises if the number of pairs (x, y) is very large, several billions for example. The matrices X and Y will then have several billions of rows too. Calculating the matrix products X^T*X and X^T*Y will be time and memory consuming, i.e. a single worker process would have to store the matrices X and Y in memory and execute billions of multiplications and additions. The natural question is whether we can divide the job among several processes that will join their efforts and calculate X^T*X and X^T*Y in a distributed fashion. Let us split the above input pairs (x, y) into 3 chunks that will be processed by 3 different processes (the mappers):

    X1 = | 0 1 |    Y1 = |  3 |
         | 1 1 |         | 16 |

    X2 = | 2 1 |    Y2 = | 24 |
         | 3 1 |         | 37 |

3. The third chunk is:

    X3 = | 4 1 |    Y3 = | 44 |
         | 5 1 |         | 56 |

For each chunk the mapper takes (Xi, Yi) as input and produces the partial matrix products Xi^T*Xi and Xi^T*Yi (i = 1, 2, 3) as output:

    X1^T*X1 = | 1 1 |    X1^T*Y1 = | 16 |
              | 1 2 |              | 19 |

    X2^T*X2 = | 13 5 |    X2^T*Y2 = | 159 |
              |  5 2 |              |  61 |

    X3^T*X3 = | 41 9 |    X3^T*Y3 = | 456 |
              |  9 2 |              | 100 |

Note that the partial multiplication is executed with matrices that are small, so the multiplication is fast. All partial matrix product results shall be collected by another process (the reducer) that will sum the partial matrices and reconstruct the same result as if the complete matrix cross products were produced by a single process:

    R1 = X^T*X = X1^T*X1 + X2^T*X2 + X3^T*X3
       = | 55 15 |
         | 15  6 |

    R2 = X^T*Y = X1^T*Y1 + X2^T*Y2 + X3^T*Y3
       = | 631 |
         | 180 |

Once we have the reconstructed matrices X^T*X and X^T*Y, the solution is as simple as:

    (X^T*X) * B = X^T*Y
    B = (X^T*X)^-1 * X^T*Y = [10.34, 4.14]

The approach described above is an example of Map-Reduce based linear regression model training that can easily be implemented on top of Apache Hadoop (a local simulation in plain R is sketched below). The pairs of numbers can be stored in files (a single line per pair). Once the model calculation starts, Hadoop's file splitting mechanism will automatically delegate units of work to several map processes.

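The whole map/reduce flow can be simulated in plain R before moving to Hadoop. The short sketch below is ours (names such as chunks and partial are illustrative, not from the slides); it splits the rows into the 3 chunks above, computes the per-chunk cross products (the map step), sums them (the reduce step) and solves for B:

    X <- matrix(c(0,1,1,1,2,1,3,1,4,1,5,1), ncol=2, byrow=TRUE)
    Y <- matrix(c(3,16,24,37,44,56), ncol=1, byrow=TRUE)

    # Map: per-chunk partial products Xi^T*Xi and Xi^T*Yi
    chunks <- list(1:2, 3:4, 5:6)
    partial <- lapply(chunks, function(rows) {
      Xi <- X[rows, , drop = FALSE]
      Yi <- Y[rows, , drop = FALSE]
      list(XtX = t(Xi) %*% Xi, XtY = t(Xi) %*% Yi)
    })

    # Reduce: sum the partial matrices
    XtX <- Reduce(`+`, lapply(partial, `[[`, "XtX"))   # 55 15 / 15 6
    XtY <- Reduce(`+`, lapply(partial, `[[`, "XtY"))   # 631 / 180

    # Same coefficients as the single-process calculation: 10.342857, 4.142857
    solve(XtX, XtY)
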
4. The distribution of the partial results to the anchor reducer is also handled automatically by Hadoop. What is left to the developer is to provide several lines of mapper/reducer code that parse the input lines into (small) matrices and execute the cross products and additions against those matrices.

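To make that last point concrete, here is a rough sketch in the spirit of Hadoop Streaming, written in R like the rest of the examples. It is our own illustration, not code from the slides: the mapper accumulates the partial products Xi^T*Xi and Xi^T*Yi over the lines of its split (one "x y" pair per line) and emits them, flattened, under a single constant key; the reducer sums the flattened partials and solves for B. The key, the tab-separated record format and the script names are assumptions.

    # mapper.R (Hadoop Streaming sketch)
    XtX <- matrix(0, 2, 2); XtY <- matrix(0, 2, 1)
    con <- file("stdin", open = "r")
    while (length(line <- readLines(con, n = 1)) > 0) {
      v  <- as.numeric(strsplit(line, "[ \t]+")[[1]])  # expects "x y" per line
      xi <- matrix(c(v[1], 1), nrow = 1)               # row [x, 1]
      XtX <- XtX + t(xi) %*% xi
      XtY <- XtY + t(xi) * v[2]
    }
    close(con)
    # One record per mapper, under a constant key, so a single reducer sees all partials
    cat("K", c(XtX, XtY), sep = "\t"); cat("\n")

    # reducer.R (Hadoop Streaming sketch)
    total <- rep(0, 6)
    con <- file("stdin", open = "r")
    while (length(line <- readLines(con, n = 1)) > 0) {
      v <- as.numeric(strsplit(line, "\t")[[1]][-1])   # drop the key
      total <- total + v
    }
    close(con)
    XtX <- matrix(total[1:4], 2, 2)
    XtY <- matrix(total[5:6], 2, 1)
    cat(solve(XtX, XtY), sep = "\n")                   # b1 and b0

Locally the pipeline can be checked with a single "mapper", e.g. cat pairs.txt | Rscript mapper.R | Rscript reducer.R, before wiring the two scripts into a streaming job via the -mapper and -reducer options.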
