This document discusses different clustering techniques applied to geo-location booking data to identify patterns.
It loads booking data containing latitude, longitude, and time. It performs k-means clustering with 8 clusters, choosing the number of clusters from the within-cluster sum of squares. DBSCAN clustering identifies 7 cohesive clusters and leaves outliers unassigned.
Model-based clustering using Mclust is also applied. It identifies 9 non-spherical clusters and does not require pre-specifying the number of clusters. Hourly density plots show bookings are more dispersed during daytime compared to night. Overall, model-based clustering provides the most meaningful insights into clusters within the geo-location data.
ClusterAnalysis-GeoData
Anbarasan S
January 23, 2016
4.1. Introduction
Cluster analysis, a form of unsupervised learning in the machine learning
literature, is used to discover unknown patterns hidden in the data. Similar data
points are grouped together to form clusters whose members share some commonality.
From the geodata dataset, given the location details and their booking times, we
can work out the number of bookings in a particular location over a period of time.
Dense clusters indicate more bookings in that region, and routing more vehicles to
that region during the corresponding time range can help increase the company's
revenue, by making it possible to accept all the bookings, and decrease the waiting
time for passengers.
For instance, clusters in the Cairo region are denser than in any other area during
noon hours, so offering more services in this region during the daytime can attract
even more bookings and reduce waiting time for customers.
Clustering was performed using the k-means, DBSCAN and model-based clustering
algorithms. Of these, model-based clustering gives the most meaningful insights
into the clusters.
4.1.1. Load the Data
geodata = read.csv("G:/Careem/Data files/GeoData.csv")
head(geodata,10)
## Latitude Longitude booking_time
## 1 30.05464 31.49216 6:09:21 PM
## 2 30.05464 31.49216 6:09:14 PM
## 3 30.05900 31.49587 6:09:11 PM
## 4 30.07490 31.24056 6:09:00 PM
## 5 30.05646 31.48968 6:08:56 PM
## 6 30.05900 31.49587 6:08:48 PM
## 7 30.05646 31.48968 6:08:45 PM
## 8 30.05900 31.49587 6:08:41 PM
## 9 30.05646 31.48968 6:08:33 PM
## 10 30.05646 31.48968 6:08:26 PM
str(geodata)
## 'data.frame': 3029 obs. of 3 variables:
## $ Latitude : num 30.1 30.1 30.1 30.1 30.1 ...
## $ Longitude : num 31.5 31.5 31.5 31.2 31.5 ...
## $ booking_time: Factor w/ 2901 levels " 1:00:06 PM",..: 2535 2534 2533 2532 2531 2530 2529 2528 2527 2526 ...
# Check For any missing data
sum(is.na(geodata))
## [1] 0
4.3 Project the Spatial Data onto a Map
• R provides a number of useful packages for dealing with maps and spatial data.
• Here, maps are created using two of them: RgoogleMaps and ggmap.
• Projecting the spatial data onto these base maps helps reveal useful information
hidden in the data.
# Use RGoogleMaps
# Load the library RgoogleMaps
library(RgoogleMaps)
## Warning: package 'RgoogleMaps' was built under R version 3.2.3
# Get the Latitude and longitude range
lat.range = range(geodata$Latitude)
lon.range = range(geodata$Longitude)
# Get Maps Based upon the range of longitude and latitude values
geo.basemap <- GetMap.bbox(lonR = lon.range, latR = lat.range,
destfile = "geo_BaseMap.png",
maptype = "roadmap",
zoom = 11)
# Plot the geo-data on the map obtained from the latitude and longitude range
PlotOnStaticMap(geo.basemap,
                lat = geodata$Latitude, lon = geodata$Longitude,
                zoom = 18, cex = 0.5, pch = 19, col = "red",
                FUN = points, add = FALSE)
# Use ggmap
# Load the required library
library(ggmap)
## Warning: package 'ggmap' was built under R version 3.2.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.3
lat.centre = median(geodata$Latitude)
lon.centre = median(geodata$Longitude)
geo.basemap2 <- get_map(location = c(lon.centre,lat.centre),
maptype = "roadmap",
source="google",
zoom = 11)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=30.046766,31.306726&zoom=11&size=640x640&scale=2&maptype=roadmap&language=en-EN&sensor=false
ggmap(geo.basemap2) +
geom_point(data = geodata, aes(x = Longitude, y = Latitude),
color="red", size= 1.5, alpha=0.5)
## Warning: Removed 400 rows containing missing values (geom_point).
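The "Removed 400 rows" warning most likely arises because the plot is limited to the extent of the downloaded map tile, so bookings lying outside that extent are dropped. The sketch below counts such points in base R. The bounding box and the third coordinate pair are illustrative assumptions, not values read from the actual map; with ggmap the extent could presumably be taken from the "bb" attribute of the map object, which is also an assumption here.

```r
# Count points that fall outside a rectangular map extent.
# `bbox` is a hypothetical extent chosen for illustration only.
count_outside <- function(lat, lon, bbox) {
  inside <- lat >= bbox[["lat_min"]] & lat <= bbox[["lat_max"]] &
            lon >= bbox[["lon_min"]] & lon <= bbox[["lon_max"]]
  sum(!inside)
}

bbox <- c(lat_min = 29.85, lat_max = 30.25, lon_min = 31.08, lon_max = 31.53)

# First two pairs come from the head() output; the third is a made-up far point
lat <- c(30.05464, 30.07490, 30.60000)
lon <- c(31.49216, 31.24056, 31.30000)

n_outside <- count_outside(lat, lon, bbox)
n_outside
```

Running the same check on the full data set before plotting would show whether a smaller zoom level is needed to keep every booking on the map.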
4.4 Clustering using K-Means Algorithm
• The k-means algorithm groups observations into clusters using a partitioning approach.
• Two factors determine the quality of k-means clustering:
• The initial choice of centroids
• The number of clusters, which must be specified before performing the
clustering operation
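Both factors can be seen in a small base-R sketch on synthetic data (the points below are generated for illustration and are not the booking data). The number of clusters is passed as `centers`, and `nstart` repeats the random initialisation of centroids and keeps the best run.

```r
set.seed(42)

# Two well-separated synthetic groups of 50 points each (illustrative data)
pts <- rbind(
  cbind(rnorm(50, mean = 0, sd = 0.3), rnorm(50, mean = 0, sd = 0.3)),
  cbind(rnorm(50, mean = 5, sd = 0.3), rnorm(50, mean = 5, sd = 0.3))
)

# k must be given up front; nstart = 20 tries 20 random sets of initial centroids
fit <- kmeans(pts, centers = 2, nstart = 20)

sort(fit$size)       # cluster sizes
fit$tot.withinss     # total within-cluster sum of squares of the best run
```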
4.4.1. Computing the Distance matrix
• There are different approaches for calculating the distance between two data points, such as
• Hamming Distance
• Euclidean Distance
• Manhattan or City Block Distance
• However, Euclidean distance computed on latitude and longitude values does not give the
actual ground distance between spatial points.
• The Haversine distance measure, provided in the "geosphere" package, was used to calculate
a meaningful distance between spatial points; it treats the earth as a sphere and measures
distance along its surface.
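## Compute the distance matrix using the geosphere package
## Function to compute pairwise Haversine distances (in metres)
geo.dist <- function(df) {
  require(geosphere)
  d <- function(i, z) { # z[, 1:2] must contain longitude, latitude (in that order)
    dist <- rep(0, nrow(z))
    dist[i:nrow(z)] <- distHaversine(z[i:nrow(z), 1:2], z[i, 1:2])
    return(dist)
  }
  # Column i holds the distances from point i to points i..n (lower triangle)
  dm <- do.call(cbind, lapply(1:nrow(df), d, df))
  return(as.dist(dm))
}
# distHaversine expects (longitude, latitude), so select the columns in that order
distance.matrix <- geo.dist(geodata[, c("Longitude", "Latitude")])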
## Loading required package: geosphere
## Warning: package 'geosphere' was built under R version 3.2.3
## Loading required package: sp
## Warning: package 'sp' was built under R version 3.2.3
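For reference, the Haversine distance used above can be written out directly in base R. This is a minimal sketch of the standard formula, not the geosphere implementation; the function name `haversine_m` is hypothetical, and the radius of 6378137 m is assumed to match the default of `distHaversine`.

```r
# Great-circle distance in metres between (lon1, lat1) and (lon2, lat2),
# all given in decimal degrees. The earth is treated as a sphere of radius r.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))   # pmin guards against rounding above 1
}

# Rows 1 and 3 of the head(geodata) output
d <- haversine_m(31.49216, 30.05464, 31.49587, 30.05900)
d   # roughly 600 metres
```

The same two points are only about 0.006 apart in raw degrees, which is why distances computed directly on latitude and longitude are hard to interpret.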
4.4.2 Determining the optimal number of clusters for K-means
• The sign of a good clustering is high intra-cluster similarity and low inter-cluster similarity.
• "Within Sum of Squares" (WSS) measures the coherence inside a cluster: it is the sum of
squared distances between the centroid and every point in the cluster. Once the optimal
number of clusters is reached, adding further clusters decreases WSS only slightly.
• There is a fast drop in WSS values up to 8 clusters, after which there is only a slight
decrease in WSS, which suggests that 8 is the optimal number of clusters.
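## Determine the number of clusters
wssplot.distancematrix <- function(data, nc = 15, seed = 1234) {
  x <- as.matrix(data)   # kmeans() applies the same conversion to a dist object
  wss <- numeric(nc)
  wss[1] <- sum(scale(x, scale = FALSE)^2)   # k = 1: total sum of squares
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(x, centers = i)$withinss)
  }
  plot(1:nc, wss,
       type = "b",
       xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")
}
wssplot.distancematrix(distance.matrix)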
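The elbow heuristic can be tried on synthetic data where the true number of groups is known. The sketch below is illustrative only: the three groups are simulated, and the variable names are hypothetical.

```r
set.seed(1234)

# Three synthetic, well-separated groups of 40 points each (illustrative data)
centres <- rbind(c(0, 0), c(6, 0), c(3, 6))
x <- do.call(rbind, lapply(1:3, function(j) {
  cbind(rnorm(40, centres[j, 1], 0.5), rnorm(40, centres[j, 2], 0.5))
}))

max_k <- 6
wss <- numeric(max_k)
wss[1] <- sum(scale(x, scale = FALSE)^2)   # one cluster: total sum of squares
for (k in 2:max_k) {
  wss[k] <- kmeans(x, centers = k, nstart = 20)$tot.withinss
}

# Relative drop in WSS when moving from k - 1 to k clusters
rel_drop <- -diff(wss) / wss[-max_k]
round(rel_drop, 2)
```

On data like this the relative drop is large up to the true number of groups and small afterwards, which is the pattern the WSS plot is read for.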
4.4.3 Perform K-Means Clustering
• Perform k-means clustering using 8 clusters and 20 sets of random initial centroids
for the clusters.
• Visualize the clusters on the map
## Perform K-Means Clustering
cluster.kmeans.geodata = kmeans(distance.matrix, 8, nstart =20 )
summary(cluster.kmeans.geodata)
## Length Class Mode
## cluster 3029 -none- numeric
## centers 24232 -none- numeric
## totss 1 -none- numeric
## withinss 8 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 8 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
## visualize k-means clustering
cluster.factors = as.factor(cluster.kmeans.geodata$cluster)
plot.kmeans <- ggmap(geo.basemap2)+
geom_point(data = geodata, aes(x = Longitude, y = Latitude),
color=cluster.factors, size= 1.5, alpha=0.5)+
ggtitle("k-Means cluster of Booking Locations-Pointwise")
plot.kmeans
## Warning: Removed 400 rows containing missing values (geom_point).
4.5 Density Based Clustering
• Conventional clustering methods such as k-means and hierarchical clustering tend to construct
roughly spherical clusters and assign every outlier to its nearest cluster.
• Density-based clustering is useful for finding non-spherical clusters, such as S-shaped, oval
or other irregular shapes.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clusters well
even in the presence of outliers.
• kNNdistplot is used to find the epsilon for forming clusters, and it is around 0.015
(in degrees, since the raw coordinates are used); the dbscan() call uses the slightly larger eps = 0.02.
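The quantity behind the epsilon heuristic is the distance from each point to its k-th nearest neighbour, sorted in increasing order; the "knee" of that curve is a common choice for eps. A minimal base-R sketch follows, with a hypothetical helper name and synthetic coordinates. It is not the dbscan implementation.

```r
# Distance from every row of `coords` to its k-th nearest neighbour
knn_distance <- function(coords, k) {
  d <- as.matrix(dist(coords))                  # pairwise Euclidean distances
  apply(d, 1, function(row) sort(row)[k + 1])   # position 1 is the point itself
}

set.seed(7)
# A dense synthetic hotspot plus a few scattered points (illustrative only)
dense  <- cbind(rnorm(60, 30.05, 0.005), rnorm(60, 31.30, 0.005))
sparse <- cbind(runif(5, 29.9, 30.2), runif(5, 31.1, 31.5))

kd <- sort(knn_distance(rbind(dense, sparse), k = 3))
tail(kd)   # the scattered points have much larger 3-NN distances
```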
library(dbscan)
## Warning: package 'dbscan' was built under R version 3.2.3
library(cluster)
## Warning: package 'cluster' was built under R version 3.2.3
kNNdistplot(geodata[,c(1,2)], k = 3)
abline(h=0.01, col="red")
db <- dbscan(geodata[,c(1,2)], eps=0.02, minPts=45)
db
## DBSCAN clustering for 3029 objects.
## Parameters: eps = 0.02, minPts = 45
## The clustering contains 7 cluster(s).
## Available fields: cluster, eps, minPts
cluster.factors.db = as.factor(db$cluster)
## DBSCAN Visualization
plot.dbscan <- ggmap(geo.basemap2)+
geom_point(data = geodata, aes(x = Longitude, y = Latitude),
color=db$cluster+1L, size= 1.5, alpha=0.5)+
ggtitle("DBSCAN cluster of Booking Locations-Pointwise")
plot.dbscan
## Warning: Removed 400 rows containing missing values (geom_point).
• The plot suggests that DBSCAN clusters are more cohesive than those of the k-means
algorithm, since the outliers are not included in any cluster and are shown as black
points, whereas k-means clusters include the outlier points as well.
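Why DBSCAN leaves outliers unassigned can be illustrated with the core, border and noise distinction. The sketch below only classifies points; it does not perform the cluster expansion step, so it is not a full DBSCAN. The function name and the toy coordinates are hypothetical, and it assumes (as the dbscan package documents) that the eps neighbourhood includes the point itself.

```r
# Label each point as "core", "border" or "noise" for given eps and min_pts
classify_points <- function(coords, eps, min_pts) {
  d <- as.matrix(dist(coords))
  is_core <- rowSums(d <= eps) >= min_pts          # neighbourhood includes the point
  near_core <- rowSums(d[, is_core, drop = FALSE] <= eps) > 0
  unname(ifelse(is_core, "core", ifelse(near_core, "border", "noise")))
}

coords <- rbind(
  c(0.00, 0.00), c(0.01, 0.00), c(0.00, 0.01), c(0.01, 0.01),  # tight group
  c(1.00, 1.00)                                                 # isolated point
)

labels <- classify_points(coords, eps = 0.02, min_pts = 4)
labels
```

The isolated point has no core point within eps, so it stays noise instead of being forced into the nearest cluster as k-means would do.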
4.6 Timely Evolution of Bookings (Hourly Evolution)
• Perform feature engineering on the booking time using strptime and group the data by
booking hour.
• Plot density plots of the observations for every hour
GroupByHours <- function(df){
TimeFormat = "%I:%M:%S %p"
lt_time = strptime(df,TimeFormat)
return (lt_time$hour)
}
geodata$Hours <- sapply(geodata$booking_time,GroupByHours)
geodata <- geodata[order(geodata$Hours),]
geo.basemap3 <- ggmap(geo.basemap2)
geo.basemap3 +
  stat_density2d(aes(x = Longitude, y = Latitude, fill = ..level.., alpha = ..level..),
                 geom = "polygon", data = geodata) +
  facet_wrap( ~ Hours) +
  theme(strip.text.x = element_text(size=12, face="bold"),
        strip.background = element_rect(colour="red", fill="#CCCCFF"))+
  ggtitle("Hourly Density estimation of Points in 24 hr Format")
## Warning: Removed 400 rows containing non-finite values (stat_density2d).
• The evolution of the density plots suggests that bookings are more widely distributed
during the daytime, from 8 am to 5 pm.
• During midnight hours the density is very sparse and restricted to a few hotspots.
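The hour extraction can also be done in one vectorised call instead of sapply. The sketch below is a minimal example, assuming an English or C locale for the AM/PM marker; the first two time strings appear in the data shown above, and the last two are made up for illustration.

```r
# %p (AM/PM) is locale dependent, so fix the time locale for this example
invisible(Sys.setlocale("LC_TIME", "C"))

booking_time <- c("6:09:21 PM", " 1:00:06 PM", "12:15:00 AM", "9:30:45 AM")

# %I = hour on the 12-hour clock, %p = AM/PM marker; trimws drops stray spaces
parsed <- strptime(trimws(booking_time), format = "%I:%M:%S %p", tz = "UTC")
hours <- parsed$hour
hours   # 18 13 0 9

# Bookings per hour of day, including hours with no bookings
table(factor(hours, levels = 0:23))
```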
Cluster Analysis-Model Based Clustering
Anbarasan S
January 22, 2016
1.Load The Data
# Load the Library
library(mclust)
## Warning: package 'mclust' was built under R version 3.2.3
## Package 'mclust' version 5.1
## Type 'citation("mclust")' for citing this R package in publications.
library(ggmap)
## Warning: package 'ggmap' was built under R version 3.2.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.3
library(grid)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.2.3
#load the data
geodata <- read.csv("G:/Careem/Data files/GeoData.csv")
# get the base Map
lat.centre = median(geodata$Latitude)
lon.centre = median(geodata$Longitude)
geo.basemap2 <- get_map(location = c(lon.centre,lat.centre),
maptype = "roadmap",
source="google",
zoom = 11)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=30.046766,31.306726&zoom=11&size=640x640&scale=2&maptype=roadmap&language=en-EN&sensor=false
2.Multiplot Function
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
require(grid)
# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)
numPlots = length(plots)
# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout)) {
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}
if (numPlots==1) {
print(plots[[1]])
} else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}
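The layout matrix built inside multiplot decides where each plot lands. A short sketch of that one step, for four plots in two columns, shows that the matrix is filled column by column, so the plots are placed down the first column before the second.

```r
num_plots <- 4
cols <- 2

# Same construction as in multiplot(): filled column-wise by default
layout <- matrix(seq(1, cols * ceiling(num_plots / cols)),
                 ncol = cols, nrow = ceiling(num_plots / cols))
layout
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    4

# Position of the third plot: row 1, column 2
pos <- which(layout == 3, arr.ind = TRUE)
pos
```

With this layout, a call with four plots and cols = 2 puts the first plot top-left, the second bottom-left, the third top-right and the fourth bottom-right.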
3.Perform Model Based Clustering
• Model-based clustering uses a probabilistic (Bayesian) interpretation to assign an
observation to a cluster.
• Hence, in model-based clustering, each data point (or observation) can belong to more than
one cluster, each with a certain probability, and the observation is assigned to the cluster
with the maximum probability.
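The assignment rule can be sketched for a one-dimensional, two-component Gaussian mixture. All parameter values below are illustrative assumptions and are not fitted to the booking data; the variable names loosely mirror the kind of quantities a fitted mixture model reports.

```r
# Illustrative mixture parameters (assumed, not estimated from the data)
weights <- c(0.6, 0.4)
means   <- c(30.05, 30.12)
sds     <- c(0.02, 0.03)

x <- c(30.04, 30.08, 30.13)   # three hypothetical latitudes

# Weighted density of every observation under every component (3 x 2 matrix)
dens <- sapply(1:2, function(g) weights[g] * dnorm(x, means[g], sds[g]))

# Posterior membership probabilities: each row sums to 1
posterior <- dens / rowSums(dens)

# Assign each observation to the component with the highest probability
classification <- max.col(posterior, ties.method = "first")
uncertainty <- 1 - apply(posterior, 1, max)

round(posterior, 3)
classification   # 1 1 2
```

The middle observation lies between the two components, so its uncertainty is the largest of the three even though it is still assigned to the first component.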
3.1 Advantages of Model Based Clustering over other Clustering techniques
• No need to fix the number of clusters in advance; Mclust selects it by comparing BIC
across candidate models
• Useful for finding non-spherical (ellipsoidal) clusters of varying volume, shape and
orientation
• Less sensitive to outliers.
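How a mixture model can choose the number of clusters is sketched below with a small hand-written EM fit in one dimension. This is a simplified illustration, not the mclust algorithm: the function `gmm_bic`, its initialisation and the synthetic data are all assumptions made for the example. BIC is computed in the "larger is better" convention, 2 * logLik minus the number of parameters times log(n).

```r
# EM for a 1-D Gaussian mixture with G components; returns the BIC of the fit
gmm_bic <- function(x, G, iters = 100) {
  n <- length(x)
  mu <- as.numeric(quantile(x, (seq_len(G) - 0.5) / G))  # spread initial means
  sigma <- rep(sd(x), G)
  w <- rep(1 / G, G)
  weighted_density <- function() {
    vapply(seq_len(G), function(g) w[g] * dnorm(x, mu[g], sigma[g]), numeric(n))
  }
  for (it in seq_len(iters)) {
    dens <- weighted_density()
    z <- dens / rowSums(dens)            # E-step: responsibilities (n x G)
    nk <- colSums(z)                     # M-step: update all parameters
    w <- nk / n
    mu <- colSums(z * x) / nk
    sigma <- sqrt(colSums(z * outer(x, mu, "-")^2) / nk)
  }
  loglik <- sum(log(rowSums(weighted_density())))
  2 * loglik - (3 * G - 1) * log(n)      # G means + G sds + (G - 1) weights
}

# Two synthetic groups of latitudes, built from normal quantiles (illustrative)
x <- c(qnorm(ppoints(100), 30.05, 0.01), qnorm(ppoints(100), 30.12, 0.01))

bic <- sapply(1:3, gmm_bic, x = x)
round(bic, 1)
which.max(bic)   # candidate number of components with the best BIC
```

For clearly bimodal data like this, the two-component model scores far better than a single Gaussian, and a third component typically adds too little likelihood to offset its penalty.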
#Apply Mclust
geodata.mclust <- Mclust(geodata[,c(1,2)])
summary(geodata.mclust)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 9 components:
##
## log.likelihood n df BIC ICL
## 9351.399 3029 53 18277.95 17990.1
##
## Clustering table:
## 1 2 3 4 5 6 7 8 9
## 411 88 591 626 363 255 311 65 319
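The df and BIC values in this summary can be checked by hand from the reported log-likelihood and sample size. The sketch below assumes the usual parameter count for an unconstrained (VVV) Gaussian mixture and the "larger is better" BIC convention; all numbers are copied from the output above.

```r
# Values copied from the summary(geodata.mclust) output
log_lik <- 9351.399
n_obs   <- 3029
G <- 9   # mixture components
d <- 2   # dimensions (Latitude, Longitude)

# Free parameters of an unconstrained Gaussian mixture:
# (G - 1) mixing weights + G * d means + G * d * (d + 1) / 2 covariance terms
n_par <- (G - 1) + G * d + G * d * (d + 1) / 2
n_par             # 53, the reported df

bic <- 2 * log_lik - n_par * log(n_obs)
round(bic, 2)     # 18277.95, matching the reported BIC
```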
# Visualize the Clusters
plot.mclust <- ggmap(geo.basemap2)+
geom_point(data = geodata, aes(x = Longitude, y = Latitude),
color=geodata.mclust$classification, size= 1.5, alpha=0.5)+
ggtitle("Model Based clustering of Booking Locations-Pointwise")
plot.mclust
## Warning: Removed 400 rows containing missing values (geom_point).
Inference From Clustering
• Model-based clustering gives more meaningful clusters than DBSCAN or k-means, as can be
seen from the visualizations of the three clusterings.
4.Hourly Evolution of clusters
• To understand how the clusters evolve over time, we group the booking time into four
categories:
• Midnight (12 AM - 6 AM)
• Morning (6 AM - 11 AM)
• Noon (11 AM - 4 PM)
• Evening (4 PM - 7 PM)
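The same grouping can be expressed as a single labelled factor with cut(). The sketch below uses the cut points of the subset() calls that follow (hours 0 to 5, 6 to 10, 11 to 15 and 16 to 18); the example hours are illustrative.

```r
hours <- c(0, 5, 6, 10, 11, 15, 16, 18)

# right = FALSE makes each interval [lower, upper), e.g. Morning is 6 <= h < 11
period <- cut(hours,
              breaks = c(0, 6, 11, 16, 19),
              labels = c("Midnight", "Morning", "Noon", "Evening"),
              right = FALSE)

data.frame(hours, period)
table(period)
```

A factor column like this could then be used with split() or facet_wrap() instead of four separate data frames.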
## Create a feature Hours
GroupByHours <- function(df){
TimeFormat = "%I:%M:%S %p"
lt_time = strptime(df,TimeFormat)
return (lt_time$hour)
}
geodata$Hours <- sapply(geodata$booking_time,GroupByHours)
geodata <- geodata[order(geodata$Hours),]
## Hourly Evolution of model based clusters
## Group the data based on Hours
geodata_midnight <- subset(geodata, Hours>=0 & Hours <6)
geodata_morning <- subset(geodata, Hours>=6 & Hours <11)
geodata_noon <- subset(geodata, Hours>=11 & Hours <16)
geodata_evening <- subset(geodata, Hours>=16 & Hours <=18)
# Perform Model Based clustering during midnight hours
geodata.mclust.midnight <- Mclust(geodata_midnight[,c(1,2)])
summary(geodata.mclust.midnight)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 9 components:
##
## log.likelihood n df BIC ICL
## 744.7721 209 45 1249.139 1234.731
##
## Clustering table:
## 1 2 3 4 5 6 7 8 9
## 65 13 13 39 10 16 18 16 19
midnight <- ggmap(geo.basemap2)+
geom_point(data = geodata_midnight, aes(x = Longitude, y = Latitude),
color=geodata.mclust.midnight$classification, size= 1.5,
alpha=0.5)+
ggtitle("Midnight Bookings")
# Perform Model Based clustering during morning hours
geodata.mclust.morning <- Mclust(geodata_morning[,c(1,2)])
summary(geodata.mclust.morning)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 9 components:
##
## log.likelihood n df BIC ICL
## 1608.484 521 53 2885.414 2861.465
##
## Clustering table:
## 1 2 3 4 5 6 7 8 9
## 160 60 92 49 48 10 24 47 31
morning <- ggmap(geo.basemap2)+
geom_point(data = geodata_morning, aes(x = Longitude, y = Latitude),
color=geodata.mclust.morning$classification, size= 1.5,
alpha=0.5)+
ggtitle("Morning Bookings")
# Perform Model Based clustering during afternoon hours
geodata.mclust.noon <- Mclust(geodata_noon[,c(1,2)])
summary(geodata.mclust.noon)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVE (ellipsoidal, equal orientation) model with 9 components:
##
## log.likelihood n df BIC ICL
## 5086.66 1621 45 9840.734 9556.333
##
## Clustering table:
## 1 2 3 4 5 6 7 8 9
## 234 162 203 339 100 107 296 57 123
noon <- ggmap(geo.basemap2)+
geom_point(data = geodata_noon, aes(x = Longitude, y = Latitude),
color=geodata.mclust.noon$classification, size= 1.5,
alpha=0.5)+
ggtitle("Noon Bookings")
# Perform Model Based clustering during evening hours
geodata.mclust.evening <- Mclust(geodata_evening[,c(1,2)])
summary(geodata.mclust.evening)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVE (ellipsoidal, equal orientation) model with 9 components:
##
## log.likelihood n df BIC ICL
## 2223.575 678 45 4153.789 4096.327
##
## Clustering table:
## 1 2 3 4 5 6 7 8 9
## 121 12 164 64 11 61 77 100 68
evening <- ggmap(geo.basemap2)+
geom_point(data = geodata_evening, aes(x = Longitude, y = Latitude),
color=geodata.mclust.evening$classification, size= 1.5,
alpha=0.5)+
ggtitle("Evening bookings")
# Multi Plot
multiplot(midnight, morning, noon, evening,cols=2)
## Warning: Removed 18 rows containing missing values (geom_point).
## Warning: Removed 78 rows containing missing values (geom_point).
## Warning: Removed 235 rows containing missing values (geom_point).
## Warning: Removed 69 rows containing missing values (geom_point).
4.1 Inference from timely evolution of clusters
• The clusters are densest during the noon or mid-day period, from 11 am to 4 pm. A possible
reason for these dense daytime clusters is the office-going population.
• Most bookings come from the Cairo region, followed by the all hay illamin region, across all
periods of the day.
• The midnight clusters suggest that the majority of bookings are from the Cairo region, with
a few scattered bookings elsewhere.
• The clusters are found around the Ring Road area.
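The relative load of each period can be read off the sample sizes (n) reported in the four Mclust summaries. The sketch below is simple arithmetic on those numbers; the period lengths follow the subset() bounds, and treating the evening as three full hours is an assumption, since the last hour may be only partly covered by the data.

```r
# n from the four Mclust summaries above
n_bookings <- c(Midnight = 209, Morning = 521, Noon = 1621, Evening = 678)

# Hours covered by each period according to the subset() bounds
hours_in_period <- c(Midnight = 6, Morning = 5, Noon = 5, Evening = 3)

share_pct <- round(100 * n_bookings / sum(n_bookings), 1)
per_hour  <- round(n_bookings / hours_in_period, 1)

rbind(share_pct, per_hour)
names(which.max(per_hour))   # "Noon"
```

Noon accounts for more than half of all bookings and has the highest hourly rate, which is consistent with the dense mid-day clusters described above.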