ClusterAnalysis-GeoData
Anbarasan S
January 23, 2016
4.1 Introduction
• Cluster analysis, known as unsupervised learning in the machine learning literature, is used to discover unknown patterns hidden in the data. Similar data points are grouped together to form clusters, which share some commonality.
• From the geodata dataset, given the location details and booking times, we can figure out the number of bookings in a particular location over a period of time.
• Dense clusters indicate more bookings in that region. Routing more vehicles to such a region during a particular time range can help increase company revenue by accepting more bookings, and decrease the waiting time for passengers.
• For instance, the clusters in the Cairo region are denser than in any other area during noon hours; hence, offering more services in this region during the daytime can attract even more bookings and reduce customer waiting time.
• Clustering was performed using the k-means, DBSCAN and model-based clustering algorithms. Of these, model-based clustering gives the most meaningful insights into the clusters.
4.1.1 Load the Data
geodata = read.csv("G:/Careem/Data files/GeoData.csv")
head(geodata,10)
## Latitude Longitude booking_time
## 1 30.05464 31.49216 6:09:21 PM
## 2 30.05464 31.49216 6:09:14 PM
## 3 30.05900 31.49587 6:09:11 PM
## 4 30.07490 31.24056 6:09:00 PM
## 5 30.05646 31.48968 6:08:56 PM
## 6 30.05900 31.49587 6:08:48 PM
## 7 30.05646 31.48968 6:08:45 PM
## 8 30.05900 31.49587 6:08:41 PM
## 9 30.05646 31.48968 6:08:33 PM
## 10 30.05646 31.48968 6:08:26 PM
str(geodata)
## 'data.frame': 3029 obs. of 3 variables:
## $ Latitude : num 30.1 30.1 30.1 30.1 30.1 ...
## $ Longitude : num 31.5 31.5 31.5 31.2 31.5 ...
## $ booking_time: Factor w/ 2901 levels " 1:00:06 PM",..: 2535 2534 2533
2532 2531 2530 2529 2528 2527 2526 ...
# Check For any missing data
sum(is.na(geodata))
## [1] 0
4.3 Project the Spatial Data onto a Map
• R provides a number of useful packages for dealing with maps and spatial data.
• Here, maps are created using two packages: RgoogleMaps and ggmap.
• When the spatial data provided is projected onto these base maps, it helps reveal useful information hidden in the data.
# Use RGoogleMaps
# Load the library RgoogleMaps
library(RgoogleMaps)
## Warning: package 'RgoogleMaps' was built under R version 3.2.3
# Get the Latitude and longitude range
lat.range = range(geodata$Latitude)
lon.range = range(geodata$Longitude)
# Get Maps Based upon the range of longitude and latitude values
geo.basemap <- GetMap.bbox(lonR = lon.range, latR = lat.range,
destfile = "geo_BaseMap.png",
maptype = "roadmap",
zoom = 11)
# Plot the geo-data on the map obtained from the latitude and longitude range
PlotOnStaticMap(geo.basemap,
lat = geodata$Latitude, lon = geodata$Longitude,
zoom = 18, cex = 0.5, pch = 19, col = "red",
FUN = points, add = F)
# Use ggmap
# Load the required library
library(ggmap)
## Warning: package 'ggmap' was built under R version 3.2.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.3
lat.centre = median(geodata$Latitude)
lon.centre = median(geodata$Longitude)
geo.basemap2 <- get_map(location = c(lon.centre,lat.centre),
maptype = "roadmap",
source="google",
zoom = 11)
## Map from URL :
http://maps.googleapis.com/maps/api/staticmap?center=30.046766,31.306726&zoom
=11&size=640x640&scale=2&maptype=roadmap&language=en-EN&sensor=false
ggmap(geo.basemap2) +
geom_point(data = geodata, aes(x = Longitude, y = Latitude),
color="red", size= 1.5, alpha=0.5)
## Warning: Removed 400 rows containing missing values (geom_point).
ggmap(geo.basemap2)+
stat_density2d(aes(x = Longitude, y = Latitude, fill =
..level.., alpha = ..level..),
size = 2, data = geodata, geom = "polygon")
## Warning: Removed 400 rows containing non-finite values (stat_density2d).
4.4 Clustering Using the K-Means Algorithm
• The k-means algorithm groups data into clusters using a partitioning approach.
• Two factors determine the quality of a k-means clustering:
• The initial choice of centroids
• The number of clusters, which must be known before performing the clustering operation
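The effect of the initial centroids can be sketched on synthetic data (illustrative, not from the report): a single random start can land in a poor local optimum, while `nstart` keeps the best of several random initializations, which is why `nstart = 20` is used later in this report.

```r
# Sketch: sensitivity of k-means to the initial centroids (base R only).
set.seed(42)
xy <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
            matrix(rnorm(100, mean = 5),  ncol = 2),
            matrix(rnorm(100, mean = 10), ncol = 2))
km1  <- kmeans(xy, centers = 3, nstart = 1)   # one random initialization
km20 <- kmeans(xy, centers = 3, nstart = 20)  # best of 20 initializations
km1$tot.withinss   # may be much larger if the single start was unlucky
km20$tot.withinss  # lowest total WSS found over the 20 starts
```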
4.4.1 Computing the Distance Matrix
• There are different approaches for calculating the distance between two data points, such as:
• Hamming distance
• Euclidean distance
• Manhattan (city-block) distance
• However, Euclidean distance does not give the actual distance for spatial data involving latitude and longitude.
• The Haversine distance, provided in the "geosphere" package, was therefore used to calculate a meaningful distance between spatial points, taking the spherical/elliptical shape of the Earth into account.
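The difference between a raw Euclidean distance on degrees and the Haversine distance can be sketched with geosphere; the points below are illustrative values near the Cairo area, not taken from the data set.

```r
# Sketch: why plain Euclidean distance on degrees misleads for lat/long.
# Two pairs of points are each exactly 1 degree apart, yet one degree of
# longitude at latitude 30 spans far fewer metres than one degree of latitude.
library(geosphere)
p0    <- c(31.0, 30.0)  # (longitude, latitude)
p_lon <- c(32.0, 30.0)  # 1 degree east of p0
p_lat <- c(31.0, 31.0)  # 1 degree north of p0
distHaversine(p0, p_lon)   # roughly 96 km
distHaversine(p0, p_lat)   # roughly 111 km
sqrt(sum((p0 - p_lon)^2))  # Euclidean: 1 "degree" in both cases
sqrt(sum((p0 - p_lat)^2))
```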
## Compute the distance matrix using the geosphere package
## Function to compute the pairwise Haversine distance matrix
geo.dist <- function(df) {
  require(geosphere)
  d <- function(i, z){ # z[, 1:2] must contain (longitude, latitude)
    dist <- rep(0, nrow(z))
    dist[i:nrow(z)] <- distHaversine(z[i:nrow(z), 1:2], z[i, 1:2])
    return(dist)
  }
  dm <- do.call(cbind, lapply(1:nrow(df), d, df))
  return(as.dist(dm))
}
# distHaversine expects (longitude, latitude) order, so pass the
# columns as (Longitude, Latitude)
distance.matrix <- geo.dist(geodata[, c(2, 1)])
## Loading required package: geosphere
## Warning: package 'geosphere' was built under R version 3.2.3
## Loading required package: sp
## Warning: package 'sp' was built under R version 3.2.3
4.4.2 Determining the Optimal Number of Clusters for K-Means
• A sign of a good clustering is high intra-cluster similarity and low inter-cluster similarity.
• The within sum of squares (WSS) is a measure of the coherence inside a cluster: it is the sum of squared distances between the centroid and every point inside the cluster. Once the optimal number of clusters has been reached, adding further clusters decreases the WSS only slightly.
• There is a fast drop in the WSS values up to 8 clusters, after which there is only a slight decrease, which suggests that 8 is the optimal number of clusters.
## Determine the number of clusters
wssplot.distancematrix <- function(data, nc = 15, seed = 1234){
  wss <- rep(0, nc)
  for (i in 1:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:nc, wss,
       type = "b",
       xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")
}
wssplot.distancematrix(distance.matrix)
4.4.3 Perform K-Means Clustering
• Perform k-means clustering with 8 clusters and 20 random initializations of the cluster centroids.
• Visualize the clusters on the map.
## Perform K-Means Clustering
cluster.kmeans.geodata = kmeans(distance.matrix, 8, nstart =20 )
summary(cluster.kmeans.geodata)
## Length Class Mode
## cluster 3029 -none- numeric
## centers 24232 -none- numeric
## totss 1 -none- numeric
## withinss 8 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 8 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
## visualize k-means clustering
cluster.factors = as.factor(cluster.kmeans.geodata$cluster)
plot.kmeans <- ggmap(geo.basemap2)+
geom_point(data = geodata, aes(x = Longitude, y = Latitude),
color=cluster.factors, size= 1.5, alpha=0.5)+
ggtitle("k-Means cluster of Booking Locations-Pointwise")
plot.kmeans
## Warning: Removed 400 rows containing missing values (geom_point).
4.5 Density-Based Clustering
• Conventional clustering methods like k-means and hierarchical clustering construct spherically shaped clusters, and in the process even assign outliers to some nearest cluster.
• Density-based clustering, by contrast, is useful for finding non-linear clusters, such as S-shaped, oval or otherwise non-linearly shaped clusters.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) performs good clustering even in the presence of outliers.
• kNNdistplot is used to find the epsilon for forming clusters; here it is around 0.015.
library(dbscan)
## Warning: package 'dbscan' was built under R version 3.2.3
library(cluster)
## Warning: package 'cluster' was built under R version 3.2.3
kNNdistplot(geodata[,c(1,2)], k = 3)
abline(h=0.01, col="red")
db <- dbscan(geodata[,c(1,2)], eps=0.02, minPts=45)
db
## DBSCAN clustering for 3029 objects.
## Parameters: eps = 0.02, minPts = 45
## The clustering contains 7 cluster(s).
## Available fields: cluster, eps, minPts
cluster.factors.db = as.factor(db$cluster)
## DBSCAN Visualization
plot.dbscan <- ggmap(geo.basemap2)+
geom_point(data = geodata, aes(x = Longitude, y = Latitude),
color=db$cluster+1L, size= 1.5, alpha=0.5)+
ggtitle("DBSCAN cluster of Booking Locations-Pointwise")
plot.dbscan
## Warning: Removed 400 rows containing missing values (geom_point).
• The plot suggests that the DBSCAN clusters are more cohesive than those of the k-means algorithm, since the outliers are not included in any cluster and are shown as black points, whereas the k-means clusters include the outlier points as well.
4.6 Timely Evolution of Bookings (Hourly Evolution)
• Perform feature engineering on the booking time using strptime, and group the data based on the booking hour.
• Plot density plots of the observations for every hour.
GroupByHours <- function(df){
TimeFormat = "%I:%M:%S %p"
lt_time = strptime(df,TimeFormat)
return (lt_time$hour)
}
geodata$Hours <- sapply(geodata$booking_time,GroupByHours)
geodata <- geodata[order(geodata$Hours),]
geo.basemap3 <- ggmap(geo.basemap2)
geo.basemap3 +
stat_density2d(aes(x = Longitude, y = Latitude,fill = ..level..,alpha =
..level..),
geom = "polygon", data = geodata) +
facet_wrap( ~ Hours) +
theme(strip.text.x = element_text(size=12, face="bold"),
strip.background = element_rect(colour="red", fill="#CCCCFF"))+
ggtitle("Hourly Density estimation of Points in 24 hr Format")
## Warning: Removed 400 rows containing non-finite values (stat_density2d).
• The evolution of the density plots suggests that bookings are more diversely distributed during the daytime, from 8 am to 5 pm.
• During the midnight hours, the density distribution is very sparse and restricted to a few hotspots.
Cluster Analysis-Model Based Clustering
Anbarasan S
January 22, 2016
1. Load the Data
# Load the Library
library(mclust)
## Warning: package 'mclust' was built under R version 3.2.3
## Package 'mclust' version 5.1
## Type 'citation("mclust")' for citing this R package in publications.
library(ggmap)
## Warning: package 'ggmap' was built under R version 3.2.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.3
library(grid)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.2.3
#load the data
geodata <- read.csv("G:/Careem/Data files/GeoData.csv")
# get the base Map
lat.centre = median(geodata$Latitude)
lon.centre = median(geodata$Longitude)
geo.basemap2 <- get_map(location = c(lon.centre,lat.centre),
maptype = "roadmap",
source="google",
zoom = 11)
## Map from URL :
http://maps.googleapis.com/maps/api/staticmap?center=30.046766,31.306726&zoom
=11&size=640x640&scale=2&maptype=roadmap&language=en-EN&sensor=false
2. Multiplot Function
multiplot <- function(..., plotlist = NULL, file, cols = 1, layout = NULL) {
  require(grid)
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  numPlots = length(plots)
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  if (numPlots == 1) {
    print(plots[[1]])
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
3. Perform Model-Based Clustering
• Model-based clustering fits a finite mixture of Gaussian distributions by the EM algorithm and interprets cluster membership probabilistically.
• Hence, in model-based clustering, each data point (observation) belongs to more than one cluster with a certain probability, and the observation is assigned to the cluster with the maximum probability.
3.1 Advantages of Model-Based Clustering over Other Clustering Techniques
• No need to determine the number of clusters in advance
• Very useful for determining clusters of any shape, like oval or S-shaped
• Not sensitive to outliers
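The soft assignments described above can be inspected directly on a fitted model. This is a sketch on R's built-in `faithful` data set, not on the report's geodata:

```r
# Sketch: the probabilistic assignments behind Mclust.
# fit$z holds, per observation, the posterior probability of each mixture
# component; fit$classification is the column with the maximum probability.
library(mclust)
fit <- Mclust(faithful)               # built-in 2-column demo data set
head(round(fit$z, 3))                 # membership probabilities per cluster
head(fit$classification)              # hard assignment per observation
all(fit$classification == apply(fit$z, 1, which.max))  # TRUE
```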
#Apply Mclust
geodata.mclust <- Mclust(geodata[,c(1,2)])
summary(geodata.mclust)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model
with 9 components:
##
## log.likelihood n df BIC ICL
## 9351.399 3029 53 18277.95 17990.1
##
## Clustering table:
## 1 2 3 4 5 6 7 8 9
## 411 88 591 626 363 255 311 65 319
# Visualize the Clusters
plot.mclust <- ggmap(geo.basemap2)+
geom_point(data = geodata, aes(x = Longitude, y = Latitude),
color=geodata.mclust$classification, size= 1.5, alpha=0.5)+
ggtitle("Model Based clustering of Booking Locations-Pointwise")
plot.mclust
## Warning: Removed 400 rows containing missing values (geom_point).
Inference from clustering:
• Model-based clustering gives more meaningful clusters than DBSCAN or k-means, as can be seen from the visualizations produced by the three clustering approaches.
4. Hourly Evolution of Clusters
• To understand how the clusters evolve over time, we group the booking time into four categories:
• Midnight (12 AM - 6 AM)
• Morning (6 AM - 10 AM)
• Noon (10 AM - 4 PM)
• Evening (4 PM - 6 PM)
## Create a feature Hours
GroupByHours <- function(df){
TimeFormat = "%I:%M:%S %p"
lt_time = strptime(df,TimeFormat)
return (lt_time$hour)
}
geodata$Hours <- sapply(geodata$booking_time,GroupByHours)
geodata <- geodata[order(geodata$Hours),]
## Hourly Evolution of model based clusters
## Group the data based on Hours
geodata_midnight <- subset(geodata, Hours>=0 & Hours <6)
geodata_morning <- subset(geodata, Hours>=6 & Hours <11)
geodata_noon <- subset(geodata, Hours>=11 & Hours <16)
geodata_evening <- subset(geodata, Hours>=16 & Hours <=18)
# Perform Model Based clustering during midnight hours
geodata.mclust.midnight <- Mclust(geodata_midnight[,c(1,2)])
summary(geodata.mclust.midnight)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 9 components:
##
## log.likelihood n df BIC ICL
## 744.7721 209 45 1249.139 1234.731
##
## Clustering table:
## 1 2 3 4 5 6 7 8 9
## 65 13 13 39 10 16 18 16 19
midnight <- ggmap(geo.basemap2)+
geom_point(data = geodata_midnight, aes(x = Longitude, y = Latitude),
color=geodata.mclust.midnight$classification, size= 1.5,
alpha=0.5)+
ggtitle("Midnight Bookings")
# Perform Model Based clustering during morning hours
geodata.mclust.morning <- Mclust(geodata_morning[,c(1,2)])
summary(geodata.mclust.morning)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model
with 9 components:
##
## log.likelihood n df BIC ICL
## 1608.484 521 53 2885.414 2861.465
##
## Clustering table:
## 1 2 3 4 5 6 7 8 9
## 160 60 92 49 48 10 24 47 31
morning <- ggmap(geo.basemap2)+
geom_point(data = geodata_morning, aes(x = Longitude, y = Latitude),
color=geodata.mclust.morning$classification, size= 1.5,
alpha=0.5)+
ggtitle("Morning Bookings")
# Perform Model Based clustering during afternoon hours
geodata.mclust.noon <- Mclust(geodata_noon[,c(1,2)])
summary(geodata.mclust.noon)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVE (ellipsoidal, equal orientation) model with 9 components:
##
## log.likelihood n df BIC ICL
## 5086.66 1621 45 9840.734 9556.333
##
## Clustering table:
## 1 2 3 4 5 6 7 8 9
## 234 162 203 339 100 107 296 57 123
noon <- ggmap(geo.basemap2)+
geom_point(data = geodata_noon, aes(x = Longitude, y = Latitude),
color=geodata.mclust.noon$classification, size= 1.5,
alpha=0.5)+
ggtitle("Noon Bookings")
# Perform Model Based clustering during evening hours
geodata.mclust.evening <- Mclust(geodata_evening[,c(1,2)])
summary(geodata.mclust.evening)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVE (ellipsoidal, equal orientation) model with 9 components:
##
## log.likelihood n df BIC ICL
## 2223.575 678 45 4153.789 4096.327
##
## Clustering table:
## 1 2 3 4 5 6 7 8 9
## 121 12 164 64 11 61 77 100 68
evening <- ggmap(geo.basemap2)+
geom_point(data = geodata_evening, aes(x = Longitude, y = Latitude),
color=geodata.mclust.evening$classification, size= 1.5,
alpha=0.5)+
ggtitle("Evening bookings")
# Multi Plot
multiplot(midnight, morning, noon, evening,cols=2)
## Warning: Removed 18 rows containing missing values (geom_point).
## Warning: Removed 78 rows containing missing values (geom_point).
## Warning: Removed 235 rows containing missing values (geom_point).
## Warning: Removed 69 rows containing missing values (geom_point).
4.1 Inference from the Timely Evolution of Clusters
• The clusters are densest during the noon/mid-day period from 10 am to 4 pm. A possible reason for these dense daytime clusters is the office-going population.
• Over all periods of the day, most bookings come from the Cairo region, followed by the all hay illamin region.
• The midnight clusters suggest that the majority of bookings come from the Cairo region, with a few interspersed bookings elsewhere.
• The clusters are found around the Ring Road area.

Contenu connexe

Tendances

Introduction to spatial data analysis in r
Introduction to spatial data analysis in rIntroduction to spatial data analysis in r
Introduction to spatial data analysis in r
Richard Wamalwa
 
Andrew Goldberg. Highway Dimension and Provably Efficient Shortest Path Algor...
Andrew Goldberg. Highway Dimension and Provably Efficient Shortest Path Algor...Andrew Goldberg. Highway Dimension and Provably Efficient Shortest Path Algor...
Andrew Goldberg. Highway Dimension and Provably Efficient Shortest Path Algor...
Computer Science Club
 
nips report
nips reportnips report
nips report
?? ?
 

Tendances (19)

Quiz 2
Quiz 2Quiz 2
Quiz 2
 
Introduction to spatial data analysis in r
Introduction to spatial data analysis in rIntroduction to spatial data analysis in r
Introduction to spatial data analysis in r
 
11 clusadvanced
11 clusadvanced11 clusadvanced
11 clusadvanced
 
Mask R-CNN
Mask R-CNNMask R-CNN
Mask R-CNN
 
Performance features12102 doag_2014
Performance features12102 doag_2014Performance features12102 doag_2014
Performance features12102 doag_2014
 
Graph Regularised Hashing
Graph Regularised HashingGraph Regularised Hashing
Graph Regularised Hashing
 
Optimisation random graph presentation
Optimisation random graph presentationOptimisation random graph presentation
Optimisation random graph presentation
 
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...
 
Andrew Goldberg. Highway Dimension and Provably Efficient Shortest Path Algor...
Andrew Goldberg. Highway Dimension and Provably Efficient Shortest Path Algor...Andrew Goldberg. Highway Dimension and Provably Efficient Shortest Path Algor...
Andrew Goldberg. Highway Dimension and Provably Efficient Shortest Path Algor...
 
TENSOR VOTING BASED BINARY CLASSIFIER
TENSOR VOTING BASED BINARY CLASSIFIERTENSOR VOTING BASED BINARY CLASSIFIER
TENSOR VOTING BASED BINARY CLASSIFIER
 
Learning to Project and Binarise for Hashing-based Approximate Nearest Neighb...
Learning to Project and Binarise for Hashing-based Approximate Nearest Neighb...Learning to Project and Binarise for Hashing-based Approximate Nearest Neighb...
Learning to Project and Binarise for Hashing-based Approximate Nearest Neighb...
 
Vgg
VggVgg
Vgg
 
Dsp lecture vol 4 digital filters
Dsp lecture vol 4 digital filtersDsp lecture vol 4 digital filters
Dsp lecture vol 4 digital filters
 
Blind separation of complex-valued satellite-AIS data for marine surveillance...
Blind separation of complex-valued satellite-AIS data for marine surveillance...Blind separation of complex-valued satellite-AIS data for marine surveillance...
Blind separation of complex-valued satellite-AIS data for marine surveillance...
 
GTC 2014 - DirectX 11 Rendering and NVIDIA GameWorks in Batman: Arkham Origins
GTC 2014 - DirectX 11 Rendering and NVIDIA GameWorks in Batman: Arkham OriginsGTC 2014 - DirectX 11 Rendering and NVIDIA GameWorks in Batman: Arkham Origins
GTC 2014 - DirectX 11 Rendering and NVIDIA GameWorks in Batman: Arkham Origins
 
distance_matrix_ch
distance_matrix_chdistance_matrix_ch
distance_matrix_ch
 
Algorithm
AlgorithmAlgorithm
Algorithm
 
VJAI Paper Reading#3-KDD2019-ClusterGCN
VJAI Paper Reading#3-KDD2019-ClusterGCNVJAI Paper Reading#3-KDD2019-ClusterGCN
VJAI Paper Reading#3-KDD2019-ClusterGCN
 
nips report
nips reportnips report
nips report
 

Similaire à ClusterAnalysis

Fuzzy c means_realestate_application
Fuzzy c means_realestate_applicationFuzzy c means_realestate_application
Fuzzy c means_realestate_application
Cemal Ardil
 
Clustering (from Google)
Clustering (from Google)Clustering (from Google)
Clustering (from Google)
Sri Prasanna
 
Drobics, m. 2001: datamining using synergiesbetween self-organising maps and...
Drobics, m. 2001:  datamining using synergiesbetween self-organising maps and...Drobics, m. 2001:  datamining using synergiesbetween self-organising maps and...
Drobics, m. 2001: datamining using synergiesbetween self-organising maps and...
ArchiLab 7
 

Similaire à ClusterAnalysis (20)

RDataMining slides-clustering-with-r
RDataMining slides-clustering-with-rRDataMining slides-clustering-with-r
RDataMining slides-clustering-with-r
 
Machine Learning in R
Machine Learning in RMachine Learning in R
Machine Learning in R
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 
DCSM report2
DCSM report2DCSM report2
DCSM report2
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptx
 
Visualization of Crisp and Rough Clustering using MATLAB
Visualization of Crisp and Rough Clustering using MATLABVisualization of Crisp and Rough Clustering using MATLAB
Visualization of Crisp and Rough Clustering using MATLAB
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
cluster(python)
cluster(python)cluster(python)
cluster(python)
 
Clustering.pptx
Clustering.pptxClustering.pptx
Clustering.pptx
 
Data miningpresentation
Data miningpresentationData miningpresentation
Data miningpresentation
 
Data Processing Using THEOS Satellite Imagery for Disaster Monitoring (Case S...
Data Processing Using THEOS Satellite Imagery for Disaster Monitoring (Case S...Data Processing Using THEOS Satellite Imagery for Disaster Monitoring (Case S...
Data Processing Using THEOS Satellite Imagery for Disaster Monitoring (Case S...
 
Fuzzy c means_realestate_application
Fuzzy c means_realestate_applicationFuzzy c means_realestate_application
Fuzzy c means_realestate_application
 
Clustering (from Google)
Clustering (from Google)Clustering (from Google)
Clustering (from Google)
 
Neural nw k means
Neural nw k meansNeural nw k means
Neural nw k means
 
Drobics, m. 2001: datamining using synergiesbetween self-organising maps and...
Drobics, m. 2001:  datamining using synergiesbetween self-organising maps and...Drobics, m. 2001:  datamining using synergiesbetween self-organising maps and...
Drobics, m. 2001: datamining using synergiesbetween self-organising maps and...
 
Project PPT
Project PPTProject PPT
Project PPT
 

ClusterAnalysis

  • 1. ClusterAnalysis-GeoData Anbarasan S January 23, 2016 4.1.Introduction  Cluster Analysis , known as unsupervised learning in machine learning literature , is used to discover unknown patterns, hidden in the data. Similar data points are grouped together to form clusters, which share some commonality among them.  From the geodata dataset, given the location details and their booking time , we can figure out the number of bookings in a particular location over a period of time .  Dense clusters indicate more bookings in that cluster and routing more vehicles in that region in a particular range of time ,can help in increasing the revenue of company by accepting all the bookings , and decrease the waiting time for the passengers .  For instance .,clusters in the cairo region is denser than any other area during noon hours ,hence offering more services in this region during day time can get even more bookings and reduced waiting time for customers.  Performed Clustering using k-means algorithm, DBSCAN and model based clustering algorithms. Out of these model based clustering gives some meaningful insights on clusters 4.1.1.Load The Data geodata = read.csv("G:/Careem/Data files/GeoData.csv") head(geodata,10) ## Latitude Longitude booking_time ## 1 30.05464 31.49216 6:09:21 PM ## 2 30.05464 31.49216 6:09:14 PM ## 3 30.05900 31.49587 6:09:11 PM ## 4 30.07490 31.24056 6:09:00 PM ## 5 30.05646 31.48968 6:08:56 PM ## 6 30.05900 31.49587 6:08:48 PM ## 7 30.05646 31.48968 6:08:45 PM ## 8 30.05900 31.49587 6:08:41 PM ## 9 30.05646 31.48968 6:08:33 PM ## 10 30.05646 31.48968 6:08:26 PM str(geodata) ## 'data.frame': 3029 obs. of 3 variables: ## $ Latitude : num 30.1 30.1 30.1 30.1 30.1 ...
  • 2. ## $ Longitude : num 31.5 31.5 31.5 31.2 31.5 ... ## $ booking_time: Factor w/ 2901 levels " 1:00:06 PM",..: 2535 2534 2533 2532 2531 2530 2529 2528 2527 2526 ... # Check For any missing data sum(is.na(geodata)) ## [1] 0 4.3 Project the Spatial data on to a map • R provides a number of useful packages for dealing with maps and spatial data. • Here ,maps are created using two useful packages RGoogleMaps & ggmap. • The Spatial Data provided ,when projected on to this base maps , help us to reveal some useful information hidden in the data # Use RGoogleMaps # Load the library RgoogleMaps library(RgoogleMaps) ## Warning: package 'RgoogleMaps' was built under R version 3.2.3 # Get the Latitude and longitude range lat.range = range(geodata$Latitude) lon.range = range(geodata$Longitude) # Get Maps Based upon the range of longitude and latitude values geo.basemap <- GetMap.bbox(lonR = lon.range, latR = lat.range, destfile = "geo_BaseMap.png", maptype = "roadmap", zoom = 11) # Plot The Geo-Data on the Map obtained fro the latitude and Longitude Range PlotOnStaticMap(geo.basemap, lat = geodata$Latitude, lon = geodata$Longitude, zoom = 18, cex = 0.5, pch = 19, col = "red", FUN = points, add = F) # Use ggmap # Load the required library library(ggmap) ## Warning: package 'ggmap' was built under R version 3.2.3
  • 3. ## Loading required package: ggplot2 ## Warning: package 'ggplot2' was built under R version 3.2.3 lat.centre = median(geodata$Latitude) lon.centre = median(geodata$Longitude) geo.basemap2 <- get_map(location = c(lon.centre,lat.centre), maptype = "roadmap", source="google", zoom = 11) ## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=30.046766,31.306726&zoom =11&size=640x640&scale=2&maptype=roadmap&language=en-EN&sensor=false ggmap(geo.basemap2) + geom_point(data = geodata, aes(x = Longitude, y = Latitude), color="red", size= 1.5, alpha=0.5) ## Warning: Removed 400 rows containing missing values (geom_point).
  • 4. ggmap(geo.basemap2)+ stat_density2d(aes(x = Longitude, y = Latitude, fill = ..level.., alpha = ..level..), size = 2, data = geodata, geom = "polygon") ## Warning: Removed 400 rows containing non-finite values (stat_density2d).
  • 5. 4.4 Clustering using K-Means Algorithm • K-Means algorithm groups clusters using partitioning approach. • There are 2 factors that determine the quality of k-means clustering • Initial choice of centroids • The number of clusters present in the data should be known before performing the clustering operation 4.4.1. Computing the Distance matrix • There are different approaches for calculating the distance between two data points, like • Hamming Distance • Euclidean Distance • Manhattan or City Block Distance • However,using Euclidean distance as a measure does not give the actual distance measure for spatial data involving latitude and longitude. • Haversian Distance Measure ,provided in the "geosphere" package was used to calculate meaningful distance between spatial points , taking into account the spherical/elliptical shape of earth. ## Compute the distance matrix using Geosphere package ## Function to Haversian Distance geo.dist <- function(df) {
  • 6. require(geosphere) d <- function(i,z){ # z[1:2] contain long, lat dist <- rep(0,nrow(z)) dist[i:nrow(z)] <- distHaversine(z[i:nrow(z),1:2],z[i,1:2]) return(dist) } dm <- do.call(cbind,lapply(1:nrow(df),d,df)) return(as.dist(dm)) } distance.matrix <- geo.dist(geodata[,c(1,2)]) ## Loading required package: geosphere ## Warning: package 'geosphere' was built under R version 3.2.3 ## Loading required package: sp ## Warning: package 'sp' was built under R version 3.2.3 4.4.2 Determining the optimal number of clusters for K-means • Sign of a good cluster is to have high inter-Cluster similarity and low intra-cluster similarity. • "within Sum of Squares"-WSS, is a measure to find the coherence inside a cluster . WSS is the sum of Squared distance between centroid and every point inside a cluster.So, Once a optimal no of clusters are formed in the data , there is very low decrease in WSS • There is a fast drop in WSS values upto 4 clusters ,after which there is only slight decrease in WSS , which suggests that, 8 is th optimal number of clusters . ## Determine the no of clusters wssplot.distancematrix <- function(data, nc=15, seed=1234){ wss <- rep(0,15) for (i in 2:nc){ set.seed(seed) wss[i] <- sum(kmeans(data, centers=i)$withinss) } plot(1:nc, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares") } wssplot.distancematrix(distance.matrix)
4.4.3 Perform K-Means Clustering
• Perform k-means clustering with 8 clusters and 20 random initializations of the centroids (nstart = 20).
• Visualize the clusters on the map.
## Perform K-Means Clustering
cluster.kmeans.geodata = kmeans(distance.matrix, 8, nstart = 20)
summary(cluster.kmeans.geodata)
##              Length Class  Mode
## cluster        3029 -none- numeric
## centers       24232 -none- numeric
## totss             1 -none- numeric
## withinss          8 -none- numeric
## tot.withinss      1 -none- numeric
## betweenss         1 -none- numeric
## size              8 -none- numeric
## iter              1 -none- numeric
## ifault            1 -none- numeric
## visualize k-means clustering
cluster.factors = as.factor(cluster.kmeans.geodata$cluster)
plot.kmeans <- ggmap(geo.basemap2)+
  geom_point(data = geodata,
             aes(x = Longitude, y = Latitude),
             color=cluster.factors, size= 1.5, alpha=0.5)+
  ggtitle("k-Means cluster of Booking Locations-Pointwise")
plot.kmeans
## Warning: Removed 400 rows containing missing values (geom_point).
4.5 Density-Based Clustering
• Conventional clustering methods such as k-means and hierarchical clustering construct clusters of roughly spherical shape and even assign outliers to some nearest cluster.
• Density-based clustering, in contrast, can find non-linear clusters: S-shaped, oval, or any other non-linear shape.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) performs well even in the presence of outliers.
• kNNdistplot is used to choose the epsilon for forming clusters; the knee of the curve lies at around 0.015, and eps = 0.02 was used below.
library(dbscan)
## Warning: package 'dbscan' was built under R version 3.2.3
library(cluster)
## Warning: package 'cluster' was built under R version 3.2.3
kNNdistplot(geodata[,c(1,2)], k = 3)
abline(h=0.01, col="red")
db <- dbscan(geodata[,c(1,2)], eps=0.02, minPts=45)
db
## DBSCAN clustering for 3029 objects.
## Parameters: eps = 0.02, minPts = 45
## The clustering contains 7 cluster(s).
## Available fields: cluster, eps, minPts
cluster.factors.db = as.factor(db$cluster)
## DBSCAN Visualization
plot.dbscan <- ggmap(geo.basemap2)+
  geom_point(data = geodata,
             aes(x = Longitude, y = Latitude),
             color=db$cluster+1L, size= 1.5, alpha=0.5)+
  ggtitle("DBSCAN cluster of Booking Locations-Pointwise")
plot.dbscan
## Warning: Removed 400 rows containing missing values (geom_point).
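The noise-handling behaviour of DBSCAN can be seen on a small synthetic example (toy data, not the booking set): points that belong to no dense region receive the cluster label 0.

```r
library(dbscan)

set.seed(42)
# Two tight blobs of 50 points each, plus 5 scattered points
blob1 <- matrix(rnorm(100, mean = 0, sd = 0.1), ncol = 2)
blob2 <- matrix(rnorm(100, mean = 2, sd = 0.1), ncol = 2)
scatter <- matrix(runif(10, min = -1, max = 3), ncol = 2)
x <- rbind(blob1, blob2, scatter)

db.toy <- dbscan(x, eps = 0.3, minPts = 5)
table(db.toy$cluster)  # label 0 collects the points treated as noise
```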
• The plot suggests that the DBSCAN clusters are more cohesive than those produced by k-means: outliers are not assigned to any cluster and appear as black points, whereas the k-means clusters include the outlier points as well.
4.6 Timely Evolution of Bookings (Hourly Evolution)
• Perform feature engineering on the booking time using strptime, and group the data by booking hour.
• Plot density plots of the observations for every hour.
GroupByHours <- function(df){
  TimeFormat = "%I:%M:%S %p"
  lt_time = strptime(df,TimeFormat)
  return (lt_time$hour)
}
geodata$Hours <- sapply(geodata$booking_time,GroupByHours)
geodata <- geodata[order(geodata$Hours),]
geo.basemap3 <- ggmap(geo.basemap2)
geo.basemap3 +
  stat_density2d(aes(x = Longitude, y = Latitude,
                     fill = ..level.., alpha = ..level..),
                 geom = "polygon", data = geodata) +
  facet_wrap( ~ Hours) +
  theme(strip.text.x = element_text(size=12, face="bold"),
        strip.background = element_rect(colour="red", fill="#CCCCFF"))+
  ggtitle("Hourly Density estimation of Points in 24 hr Format")
## Warning: Removed 400 rows containing non-finite values (stat_density2d).
• The evolution of the density plots suggests that bookings are most widely distributed during the daytime, from 8 am to 5 pm.
• During the midnight hours the density distribution is very sparse and restricted to a few hotspots.
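The hour extraction performed by GroupByHours above can be checked on a single timestamp: strptime with the %I/%p format parses the 12-hour clock, and the $hour field returns the hour in 24-hour form.

```r
# Parse a 12-hour-clock booking time; the date part defaults to today
t <- strptime("6:09:21 PM", format = "%I:%M:%S %p")
t$hour  # 18, i.e. 6 PM on the 24-hour clock
```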
Cluster Analysis - Model Based Clustering
Anbarasan S
January 22, 2016
1. Load the Data
# Load the libraries
library(mclust)
## Warning: package 'mclust' was built under R version 3.2.3
## Package 'mclust' version 5.1
## Type 'citation("mclust")' for citing this R package in publications.
library(ggmap)
## Warning: package 'ggmap' was built under R version 3.2.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.3
library(grid)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.2.3
# Load the data
geodata <- read.csv("G:/Careem/Data files/GeoData.csv")
# Get the base map, centred on the median booking location
lat.centre = median(geodata$Latitude)
lon.centre = median(geodata$Longitude)
geo.basemap2 <- get_map(location = c(lon.centre,lat.centre),
                        maptype = "roadmap", source="google", zoom = 11)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=30.046766,31.306726&zoom=11&size=640x640&scale=2&maptype=roadmap&language=en-EN&sensor=false
2. Multiplot Function
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  require(grid)
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  numPlots = length(plots)
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  if (numPlots==1) {
    print(plots[[1]])
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
3. Perform Model Based Clustering
• Model-based clustering uses a Bayesian probabilistic interpretation to assign observations to clusters.
• Hence, in model-based clustering each observation belongs to more than one cluster with a certain probability, and is assigned to the cluster with the maximum probability.
3.1 Advantages of Model Based Clustering over Other Clustering Techniques
• No need to determine the number of clusters in advance
• Very useful for finding clusters of any shape, such as oval or S-shaped
• Less sensitive to outliers
# Apply Mclust
geodata.mclust <- Mclust(geodata[,c(1,2)])
summary(geodata.mclust)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model
## with 9 components:
##
##  log.likelihood    n df      BIC     ICL
##        9351.399 3029 53 18277.95 17990.1
##
## Clustering table:
##   1   2   3   4   5   6   7   8   9
## 411  88 591 626 363 255 311  65 319
# Visualize the clusters
plot.mclust <- ggmap(geo.basemap2)+
  geom_point(data = geodata,
             aes(x = Longitude, y = Latitude),
             color=geodata.mclust$classification, size= 1.5, alpha=0.5)+
  ggtitle("Model Based clustering of Booking Locations-Pointwise")
plot.mclust
## Warning: Removed 400 rows containing missing values (geom_point).
Inference from Clustering:
• Model-based clustering gives more meaningful clusters than DBSCAN or k-means, as can be seen by comparing the visualizations of the three clusterings.
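Mclust's automatic choice of the number of components can be illustrated on synthetic data (a toy sketch, not the booking set): by default Mclust fits models with 1 to 9 components and keeps the one with the best BIC.

```r
library(mclust)

set.seed(7)
# Two well-separated Gaussian blobs of 100 points each
toy <- rbind(matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(200, mean = 3, sd = 0.3), ncol = 2))

fit <- Mclust(toy)
fit$G        # number of components selected by BIC
head(fit$z)  # soft membership probabilities, one column per component
```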
4. Hourly Evolution of Clusters
• To understand how the clusters evolve over time, group the booking times into four categories:
• Midnight (12 AM - 6 AM)
• Morning (6 AM - 11 AM)
• Noon (11 AM - 4 PM)
• Evening (4 PM - 6 PM)
## Create a feature Hours
GroupByHours <- function(df){
  TimeFormat = "%I:%M:%S %p"
  lt_time = strptime(df,TimeFormat)
  return (lt_time$hour)
}
geodata$Hours <- sapply(geodata$booking_time,GroupByHours)
geodata <- geodata[order(geodata$Hours),]
## Hourly evolution of model based clusters
## Group the data based on Hours
geodata_midnight <- subset(geodata, Hours>=0 & Hours <6)
geodata_morning <- subset(geodata, Hours>=6 & Hours <11)
geodata_noon <- subset(geodata, Hours>=11 & Hours <16)
geodata_evening <- subset(geodata, Hours>=16 & Hours <=18)
# Perform model based clustering during midnight hours
geodata.mclust.midnight <- Mclust(geodata_midnight[,c(1,2)])
summary(geodata.mclust.midnight)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 9 components:
##
##  log.likelihood   n df      BIC      ICL
##        744.7721 209 45 1249.139 1234.731
##
## Clustering table:
##  1  2  3  4  5  6  7  8  9
## 65 13 13 39 10 16 18 16 19
midnight <- ggmap(geo.basemap2)+
  geom_point(data = geodata_midnight,
             aes(x = Longitude, y = Latitude),
             color=geodata.mclust.midnight$classification, size= 1.5, alpha=0.5)+
  ggtitle("Midnight Bookings")
# Perform model based clustering during morning hours
geodata.mclust.morning <- Mclust(geodata_morning[,c(1,2)])
summary(geodata.mclust.morning)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model
## with 9 components:
##
##  log.likelihood   n df      BIC      ICL
##        1608.484 521 53 2885.414 2861.465
##
## Clustering table:
##   1   2   3   4   5   6   7   8   9
## 160  60  92  49  48  10  24  47  31
morning <- ggmap(geo.basemap2)+
  geom_point(data = geodata_morning,
             aes(x = Longitude, y = Latitude),
             color=geodata.mclust.morning$classification, size= 1.5, alpha=0.5)+
  ggtitle("Morning Bookings")
# Perform model based clustering during afternoon hours
geodata.mclust.noon <- Mclust(geodata_noon[,c(1,2)])
summary(geodata.mclust.noon)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVE (ellipsoidal, equal orientation) model with 9 components:
##
##  log.likelihood    n df      BIC      ICL
##         5086.66 1621 45 9840.734 9556.333
##
## Clustering table:
##   1   2   3   4   5   6   7   8   9
## 234 162 203 339 100 107 296  57 123
noon <- ggmap(geo.basemap2)+
  geom_point(data = geodata_noon,
             aes(x = Longitude, y = Latitude),
             color=geodata.mclust.noon$classification, size= 1.5, alpha=0.5)+
  ggtitle("Noon Bookings")
# Perform model based clustering during evening hours
geodata.mclust.evening <- Mclust(geodata_evening[,c(1,2)])
summary(geodata.mclust.evening)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVE (ellipsoidal, equal orientation) model with 9 components:
##
##  log.likelihood   n df      BIC      ICL
##        2223.575 678 45 4153.789 4096.327
##
## Clustering table:
##   1   2   3   4   5   6   7   8   9
## 121  12 164  64  11  61  77 100  68
evening <- ggmap(geo.basemap2)+
  geom_point(data = geodata_evening,
             aes(x = Longitude, y = Latitude),
             color=geodata.mclust.evening$classification, size= 1.5, alpha=0.5)+
  ggtitle("Evening Bookings")
# Multi plot
multiplot(midnight, morning, noon, evening, cols=2)
## Warning: Removed 18 rows containing missing values (geom_point).
## Warning: Removed 78 rows containing missing values (geom_point).
## Warning: Removed 235 rows containing missing values (geom_point).
## Warning: Removed 69 rows containing missing values (geom_point).
4.1 Inference from the Timely Evolution of Clusters
• The clusters are densest during the midday period from 10 am to 4 pm; a possible reason is the office-going population travelling during the day.
• Across all periods of the day, most bookings come from the Cairo region, followed by the hay illamin region.
• The midnight clustering suggests that the majority of bookings come from the Cairo region, with a few interspersed bookings elsewhere.
• The clusters are concentrated around the Ring Road area.
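The four subset() calls used in section 4 can also be written as a single cut() over the Hours feature; a sketch with a few hypothetical hour values (the break points mirror the subsets above):

```r
# One representative hour from each period: 2 AM, 7 AM, noon, 5 PM
hours <- c(2, 7, 12, 17)
periods <- cut(hours,
               breaks = c(-1, 5, 10, 15, 18),
               labels = c("Midnight", "Morning", "Noon", "Evening"))
as.character(periods)  # "Midnight" "Morning" "Noon" "Evening"
```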