
# Principal Components Analysis, Calculation and Visualization

The article explains the principles of dimensionality reduction, the PCA algorithm, and the mathematics behind it. The PCA calculation and data projection are demonstrated in R, Python and Apache Spark. Finally, the results are visualized with D3.js.


Author: Marjan Sterjev

Every day, humans and machines produce tremendous amounts of data that are collected by the systems they interact with. The value of that data is enormous if we can analyze it and derive conclusions that help humans improve their lives, increase business revenues and so on. However, that information is very often hidden deep in the data, and data analysts must dig for valuable clues.

Usually the data is provided in a columnar format with M samples, each sample having N dimensions (example dimensions could be color, weight, height etc.). The spreadsheet representation of the data is:

|          | Dimension 1 | Dimension 2 | Dimension 3 | ... | Dimension N |
|----------|-------------|-------------|-------------|-----|-------------|
| Sample 1 | value11     | value12     | value13     | ... | value1N     |
| Sample 2 | value21     | value22     | value23     | ... | value2N     |
| Sample 3 | value31     | value32     | value33     | ... | value3N     |
| ...      | ...         | ...         | ...         | ... | ...         |
| Sample M | valueM1     | valueM2     | valueM3     | ... | valueMN     |

The first step in any data science work is data exploration. During exploration, the data analyst gets a quick first impression of the data: which dimensions are useful and which are not, whether there is any grouping (clustering) among the samples, etc. A particular dimension is more useful than another if the values in that dimension vary more than the values in the other. For example, if all of the values in Dimension 1 are the same, then that dimension is useless because it gives us no information that helps distinguish one data sample from another. We can safely drop that dimension (column) and proceed with the analysis without it.

Data visualization can help the analyst detect clustering among the data samples. Humans can understand 2-D or 3-D plots. However, the data usually has tens or hundreds of dimensions. How can we plot such data? In order to produce a 2-D plot from high-dimensional data we need to:

1. Transform the existing N dimensions of the data into a new set of N dimensions. The data value variance in some of the new dimensions will be large compared with the variance in the others. Simply speaking, some of the new dimensions will be more important and the others less important.
2. Order the new dimensions based on the data value variance therein (order by importance). The dimension with the highest variance of the values shall come first (it shall become the leftmost dimension column in the tabular view above).
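As a quick illustration of the "useless dimension" idea above, a constant column can be detected and dropped by checking the per-dimension variance. A minimal NumPy sketch (the toy data matrix and variable names are my own):

```python
import numpy as np

# 4 samples x 3 dimensions; the second dimension is constant (zero variance)
data = np.array([[1.0, 5.0, 2.0],
                 [2.0, 5.0, 1.0],
                 [3.0, 5.0, 4.0],
                 [4.0, 5.0, 3.0]])

variances = data.var(axis=0, ddof=1)  # per-dimension sample variance
keep = variances > 0                  # mask of informative dimensions
reduced = data[:, keep]               # the constant column is dropped

print(reduced.shape)  # (4, 2)
```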
3. Keep the 2 leftmost dimensions that have the maximum value variance and drop the remaining N-2 dimensions.

Principal Components Analysis (PCA) is a well known and established algorithm for dimension reduction. The output of the algorithm is a small set of principal components that help us project the data into a new, low-dimensional space. In order to define PCA, let's start with some statistics definitions.

Each sample with N dimensions can be represented as a vector of the following form:

$$X = [X_1, X_2, X_3, X_4, \ldots, X_N] \tag{1}$$

Ex:

$$X = [1, 2, 3] \tag{2}$$

The mean of the values in the vector X is defined as:

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i \tag{3}$$

Ex:

$$\bar{X} = \frac{1}{3}(1+2+3) = 6/3 = 2 \tag{4}$$

The standard deviation of the values in the vector X is defined as:

$$\sigma = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2} \tag{5}$$

Ex:

$$\sigma = \sqrt{\frac{1}{2}\left((1-2)^2 + (2-2)^2 + (3-2)^2\right)} = 1 \tag{6}$$

The variance of the values in the vector X is defined as:

$$\mathrm{var}(X) = \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2 = \sigma^2 \tag{7}$$

The co-variance between two vectors X and Y is defined as:

$$\mathrm{cov}(X, Y) = \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right) \tag{8}$$
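The example calculations in equations (2)-(7) can be checked with NumPy; note that `ddof=1` selects the N-1 denominator used in the definitions above. The second vector Y for the covariance check is my own choice:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])   # the example vector from equation (2)
Y = np.array([2.0, 4.0, 5.0])   # a second vector, my own choice

mean = X.mean()                  # (1 + 2 + 3) / 3 = 2
std = X.std(ddof=1)              # ddof=1 gives the N-1 denominator of (5)
var = X.var(ddof=1)              # sigma^2 = 1
cov = ((X - X.mean()) * (Y - Y.mean())).sum() / (len(X) - 1)

print(mean, std, var, cov)  # cov is 1.5 (up to rounding)
```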
For a square matrix A, an eigenvalue/eigenvector pair is defined as:

$$Ax = \lambda x \tag{9}$$

The matrix eigenvalues are calculated by solving the following determinant equation:

$$\det(A - \lambda I) = 0 \tag{10}$$

Ex:

$$A = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}, \quad \det(A - \lambda I) = \det\begin{pmatrix} 2-\lambda & 0 \\ 0 & 3-\lambda \end{pmatrix} = 0, \quad (2-\lambda)(3-\lambda) = 0, \quad \lambda_1 = 2, \; \lambda_2 = 3 \tag{11}$$

The eigenvectors can be calculated by substituting the eigenvalue solutions back into the eigenvector equation:

$$Ax_1 = \lambda_1 x_1, \quad Ax_2 = \lambda_2 x_2 \tag{12}$$

$$\begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}\begin{pmatrix} x_{11} \\ x_{12} \end{pmatrix} = 2\begin{pmatrix} x_{11} \\ x_{12} \end{pmatrix} \;\Rightarrow\; 2x_{11} = 2x_{11}, \; 3x_{12} = 2x_{12} \;\Rightarrow\; x_{11} = ?, \; x_{12} = 0, \; x_1 = [x_{11}, 0] \tag{13}$$

$$\begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}\begin{pmatrix} x_{21} \\ x_{22} \end{pmatrix} = 3\begin{pmatrix} x_{21} \\ x_{22} \end{pmatrix} \;\Rightarrow\; 2x_{21} = 3x_{21}, \; 3x_{22} = 3x_{22} \;\Rightarrow\; x_{21} = 0, \; x_{22} = ?, \; x_2 = [0, x_{22}] \tag{14}$$

The eigenvectors x1 and x2 shall be orthonormal. Orthonormal vectors are vectors with unit length and zero dot product:

$$|x_1| = \sqrt{x_{11}^2 + x_{12}^2} = \sqrt{x_{11}^2} = 1, \quad |x_2| = \sqrt{x_{21}^2 + x_{22}^2} = \sqrt{x_{22}^2} = 1, \quad x_1 \cdot x_2 = x_{11}x_{21} + x_{12}x_{22} = 0 \tag{15}$$
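The worked eigenvalue example above can be verified numerically with NumPy's eigendecomposition routine:

```python
import numpy as np

# The matrix from the worked example
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)  # eigenvalues 2 and 3 (order may vary)

# Each column of `eigenvectors` is a unit-length eigenvector: A x = lambda x
for i in range(2):
    v = eigenvectors[:, i]
    assert np.allclose(A @ v, eigenvalues[i] * v)
    assert np.isclose(np.linalg.norm(v), 1.0)

# Orthonormality: zero dot product between distinct eigenvectors
assert np.isclose(eigenvectors[:, 0] @ eigenvectors[:, 1], 0.0)
```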
Finally, we can choose the following values that satisfy the orthonormality condition:

$$x_{11} = -1, \; x_{22} = 1, \quad x_1 = [-1, 0], \quad x_2 = [0, 1] \tag{16}$$

The data samples matrix Data defined above has N dimensions:

$$Dim_1, Dim_2, Dim_3, \ldots, Dim_N \tag{17}$$

Each dimension can be represented as a vector of M elements:

$$Dim_i = [value_{1i}, value_{2i}, \ldots, value_{Mi}] \tag{18}$$

Each dimension vector has a mean value of:

$$\bar{Dim_i} = \frac{1}{M}\sum_{k=1}^{M} value_{ki} \tag{19}$$

The co-variance between two dimension vectors Dim_i and Dim_j is:

$$\mathrm{cov}(Dim_i, Dim_j) = \frac{1}{M-1}\sum_{k=1}^{M}\left(value_{ki} - \bar{Dim_i}\right)\left(value_{kj} - \bar{Dim_j}\right) \tag{20}$$

The co-variance matrix for N vectors is an N x N square matrix where the element at position (i, j) equals the co-variance of Dim_i and Dim_j:

$$C = \left(c_{ij} = \mathrm{cov}(Dim_i, Dim_j)\right) \tag{21}$$

$$C = \begin{pmatrix} \mathrm{cov}(Dim_1, Dim_1) & \mathrm{cov}(Dim_1, Dim_2) & \ldots & \mathrm{cov}(Dim_1, Dim_N) \\ \mathrm{cov}(Dim_2, Dim_1) & \mathrm{cov}(Dim_2, Dim_2) & \ldots & \mathrm{cov}(Dim_2, Dim_N) \\ \ldots & \ldots & \ldots & \ldots \\ \mathrm{cov}(Dim_N, Dim_1) & \mathrm{cov}(Dim_N, Dim_2) & \ldots & \mathrm{cov}(Dim_N, Dim_N) \end{pmatrix} \tag{22}$$

Note that the co-variance matrix can be calculated as:

$$C = \frac{1}{M-1}\, AdjData^T \, AdjData \tag{23}$$

where AdjData is obtained by subtracting the column's mean from each data value.
The Iris flower data set is described at https://en.wikipedia.org/wiki/Iris_flower_data_set. This is not a large data set: it contains only 150 samples, 50 for each flower category: setosa, versicolor and virginica. The data has 4 dimensions: Sepal length, Sepal width, Petal length, Petal width. Our task is to transform the original data set into a 2-dimensional data set that is suitable for 2-D plotting. The following Scala Spark code demonstrates how to perform that transformation. It also presents some other Spark features like Spark SQL querying and JSON marshalling of the results.

```scala
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.sql._

val irisBase = sc.textFile("C:/ml/iris.data").zipWithIndex()

val irisRows = irisBase.map({ case (line, index) => {
  val parts = line.split(",")
  new IndexedRow(index, Vectors.dense(parts.reverse.tail.map(_.toDouble).reverse))
}})

val irisMatrix = new IndexedRowMatrix(irisRows)
val pc = irisMatrix.toRowMatrix().computePrincipalComponents(2)
val irisPCProjection = irisMatrix.multiply(pc).rows.map(x => (x.index, x.vector.toArray))

val irisLabels = irisBase.map({ case (line, index) => {
  val parts = line.split(",")
  (index, parts.reverse.head)
}})

case class Iris(id: Long, x: Double, y: Double, label: String)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val irisProjection = irisPCProjection.join(irisLabels)
  .map(x => Iris(x._1 + 1, x._2._1(0), x._2._1(1), x._2._2)).toDF()

irisProjection.registerTempTable("iris")

sqlContext.sql("SELECT * FROM iris ORDER BY id")
  .repartition(1).toJSON.saveAsTextFile("C:/ml/iris-results")
```

After successful execution of the above commands in the Spark shell, the new samples are located in the file "iris-results/part-00000". The content shall look like the following extract:
```json
{"id":1,"x":-2.8271359726790117,"y":-5.641331045573361,"label":"Iris-setosa"}
{"id":2,"x":-2.7959524821488304,"y":-5.14516688325295,"label":"Iris-setosa"}
{"id":3,"x":-2.621523558165045,"y":-5.177378121203946,"label":"Iris-setosa"}
```

Note that each row in the file is valid JSON; however, the content as a whole is not valid JSON. We can manually produce valid JSON by replacing each new line with a comma plus a new line, and adding an opening square bracket at the beginning of the file and a closing square bracket at the end. This manual transformation step generates a JavaScript array of JSON projected samples.

### Data Visualization with D3.js

Data visualization is the final step when exploring data. There is an excellent JavaScript data visualization library called D3.js: http://d3js.org/. This plotting library is an excellent tool for generating online, interactive, data-driven dashboards. In our case we can visualize the projected Iris flower data set as presented below.

```html
<!DOCTYPE html>
<html>
<head>
<style>
body {
  font: 11px sans-serif;
}
.axis path, .axis line {
  fill: none;
  stroke: #000;
  shape-rendering: crispEdges;
}
.dot {
  stroke: #000;
}
</style>
<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.6/d3.min.js" charset="utf-8"></script>
<script type="text/javascript">
var data = [
  {"id":1,"x":-2.8271359726790117,"y":-5.641331045573361,"label":"Iris-setosa"},
  {"id":2,"x":-2.7959524821488304,"y":-5.14516688325295,"label":"Iris-setosa"},
  {"id":3,"x":-2.621523558165045,"y":-5.177378121203946,"label":"Iris-setosa"},
  . . .
];

function drawScatterPlot() {
  var margin = {top: 20, right: 20, bottom: 30, left: 40};
  var width = 600 - margin.left - margin.right;
  var height = 600 - margin.top - margin.bottom;

  // Setup x
  var xValue = function(d) { return d.x; },
      xScale = d3.scale.linear().range([0, width]),
      xMap = function(d) { return xScale(xValue(d)); },
      xAxis = d3.svg.axis().scale(xScale).orient("bottom");

  // Setup y
  var yValue = function(d) { return d.y; },
      yScale = d3.scale.linear().range([height, 0]),
      yMap = function(d) { return yScale(yValue(d)); },
      yAxis = d3.svg.axis().scale(yScale).orient("left");

  // Setup colors
  var cValue = function(d) { return d.label; },
      color = d3.scale.category10();

  var svg = d3.select("body").append("svg")
      .attr("width", width + margin.left + margin.right)
      .attr("height", height + margin.top + margin.bottom)
    .append("g")
      .attr("transform", "translate(" + margin.left + "," + margin.top + ")");

  xScale.domain([d3.min(data, xValue) - 1, d3.max(data, xValue) + 1]);
  yScale.domain([d3.min(data, yValue) - 1, d3.max(data, yValue) + 1]);

  // x-axis
  svg.append("g")
      .attr("class", "x axis")
      .attr("transform", "translate(0," + height + ")")
      .call(xAxis)
    .append("text")
      .attr("class", "label")
      .attr("x", width)
      .attr("y", -6)
      .style("text-anchor", "end")
      .text("X");

  // y-axis
  svg.append("g")
      .attr("class", "y axis")
      .call(yAxis)
    .append("text")
      .attr("class", "label")
      .attr("transform", "rotate(-90)")
      .attr("y", 6)
      .attr("dy", ".71em")
      .style("text-anchor", "end")
      .text("Y");

  // Plot Iris data
  svg.selectAll(".dot").data(data)
    .enter().append("circle")
      .attr("class", "dot")
      .attr("r", 4)
      .attr("cx", xMap)
      .attr("cy", yMap)
      .style("fill", function(d) { return color(cValue(d)); })
      .on("mouseover", function(d) {
        var circle = d3.select(this);
        circle.transition().duration(100)
          .attr("r", 8);
      })
      .on("mouseout", function(d) {
        var circle = d3.select(this);
        circle.transition().duration(100)
          .attr("r", 4);
      });

  // Draw legend
  var legend = svg.selectAll(".legend")
      .data(color.domain())
    .enter().append("g")
      .attr("class", "legend")
      .attr("transform", function(d, i) { return "translate(0," + i * 20 + ")"; });

  // Draw legend colored rectangles
  legend.append("rect")
      .attr("x", width - 18)
      .attr("width", 18)
      .attr("height", 18)
      .style("fill", color);

  // Draw legend text
  legend.append("text")
      .attr("x", width - 24)
      .attr("y", 9)
      .attr("dy", ".35em")
      .style("text-anchor", "end")
      .text(function(d) { return d; });
}
</script>
</head>
<body onload="drawScatterPlot()">
</body>
</html>
```

The result is shown in Figure 1. The data analyst can observe the flower group boundaries with only 2 dimensions involved. All of the principles explained in this article (dimensionality reduction, principal component analysis etc.), realized with processing tools and frameworks like R, scikit-learn and Spark, as well as front-end JavaScript libraries for visualization, plotting and interaction like the famous D3.js, provide us with a foundation for building extremely useful online dashboards and services for solving problems in different domains.
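The manual "wrap the JSON lines into an array" step described after the Spark output extract can also be scripted. A minimal Python sketch, with the first two Spark output rows inlined as a string so it is self-contained (in practice the lines would be read from "iris-results/part-00000"):

```python
import json

# Two rows of the Spark JSON-lines output from the article
jsonl = '\n'.join([
    '{"id":1,"x":-2.8271359726790117,"y":-5.641331045573361,"label":"Iris-setosa"}',
    '{"id":2,"x":-2.7959524821488304,"y":-5.14516688325295,"label":"Iris-setosa"}',
])

# Parse each line, then serialize the whole list as one valid JSON array
records = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
array_text = json.dumps(records)

print(array_text[0])  # '[' : the result is a valid JSON array
```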
Figure 1. PCA projected Iris flower data set
