This presentation gives an overview of the Open data. A number of case studies are given on the spatio-temporal analysis and visualization of the Social Media data (Twitter). The presentation also explains the creation of a heatmap visualisation by using R.
1. Open Data: Analysis and Visualisation
Muhammad Adnan
Department of Geography, University College London
Web: http://www.uncertaintyofidentity.com
Twitter: @gisandtech
2. Dr. Muhammad Adnan
• Research Associate
– Working on an EPSRC funded project “Uncertainty of Identity”
– http://www.uncertaintyofidentity.com
Research Interests
• Data Mining
• Social Media Analysis
• Data Visualisation
3. Outline
• Open Data
• Crowd-Sourced Data (Social Media)
• Analysis and Visualisation Challenges
• Twitter Case Study
• Spatial Analysis
• Temporal Analysis
• R
• A brief introduction
• How to create heat maps
4. Open data
Data that is:
Open and Free to the public
Complete
Accessible
Timely
Machine processable
Non-discriminatory
12. Open Data Profiler
• http://www.opendataprofiler.com/
• Users can visualise 60 different 2011 Census datasets
13. Crowd Sourced datasets
• Twitter
• Public streaming API can be used to download live tweets
• Four Square
• Has an API which can be used to access the Four Square data
• Facebook
• Facebook applications can access user information
• Flickr
• Wikipedia
• Youtube
14. How big are crowd sourced datasets ?
• Facebook
• Number of active users: 850 Million
• Average daily uploaded photos: 360 Million
• Total data size: 30+ Petabytes
• Twitter
• Number of active users: 200 Million
• Daily tweets (posts): 350 Million
• Foursquare
• Number of active users: 15 Million
• Total check-ins: 1.5 Billion
15. What are the issues with these datasets ?
• How representative social media data sets are of the Census or
Electoral roll data ?
• Who: Ethnicity, Gender, and Age of social media users
• Where: Where social media conversations are happening
and who is leading them
•
Intelligence about where people are located and what they are doing
• When: What time of day conversations happen
16. Twitter (www.twitter.com)
• Online social-networking and micro blogging service
• Launched in 2006
• Users can send messages of 140 characters or less
• Approximately 200 million active users
• 350 million tweets daily
• In 2012, UK and London were ranked 4th and
3rd, respectively, in terms of the number of posted tweets
18. Data available through the Twitter API
•
•
•
•
•
•
•
•
•
User Creation Date
Followers
Friends
User ID
Language
Location
Name
Screen Name
Time Zone
•
•
•
•
•
Geo Enabled
Latitude
Longitude
Tweet date and time
Tweet text
Users can download 1% sample of the live tweets through the API
27. Analysis of User Names on Twitter
• A name is a statement of the person‟s ethnic, linguistic, and
cultural identity.
• E.g. Alex Singleton is an Anglo-Saxon name. Similarly, Pablo
mateos is a Spanish (Hispanic) name.
28. Analysing Names on Twitter
• Some examples of NAME variations on Twitter
Real Names
Kevin Hodge
Andre Alves
Jose de Franco
Carolina Thomas, Dr.
Prof. Martha Del Val
Fabíola Sanchez Fernandes
Fake Names
JustinBieber_Home.
WHAT IS LOVE?
MysticMind
KIRILL_aka_KID
Vanessa
Petuna
29. Analysing Names on Twitter
• Some examples of NAME variations on Twitter
Real Names
Kevin Hodge -> F: „Kevin‟ ; S: „Hodge‟
Andre Alves -> F: „Andre‟ ; S: „Alves‟
Jose De Franco -> F: „Jose‟ ; S: „De Franco‟
Carolina Thomas, Dr. -> F: „Carolina‟ ; S: „Thomas‟
Prof. Martha Del Val -> F: „Martha‟ ; S: „Del Val‟
Fabíola Sanchez Fernandes -> F: „Fabíola‟ ; S: „Fernandes‟
33. Predicting Ethnicity of Twitter Users by
using their „Names‟
• A name is a statement of the person‟s ethnic, linguistic, and
cultural identity.
• E.g. Alex Singleton is an Anglo-Saxon name. Similarly, Pablo
mateos is a Spanish (Hispanic) name.
34. Classifying Twitter Data to ethnic origins
• Applied ONOMAP (www.onomap.org) on FORENAME +
SURNAME pairs
Kevin Hodge (ENGLISH)
Pablo Mateos (Spanish)
…
…
…
…
37. Comparison of Ethnic Groups between „2011
Census‟ and „Twitter‟
• Onomap groups were aggregated to match the appropriate
groups from the Census
London
Total
White
British
White
other
Indian
Pakistani
Bangladeshi
Black
Chinese
African
Week
Night
53611
71.35% 12.12%
2.63%
2.63%
1.82%
1.52%
1.74%
Week Day
80676
73.12% 11.80%
2.41%
2.41%
1.56%
1.25%
1.61%
Weekend
67351
72.86% 12.17%
2.61%
2.61%
1.67%
1.39%
1.73%
44.89% 12.65%
6.64%
2.74%
2.72%
7.02%
1.52%
2011 Census
38. Comparison of the distribution of ethnicity with the
2011 Census
White British (Quintiles)
2011 Census
Twitter
39. Gender and Age Analysis of Twitter Users
by using their „forenames‟
40. Gender Analysis of Twitter Users
60%
50%
40%
30%
20%
10%
0%
Male
Female
Number of Tweets
Unisex
Number of Unique Users
Not Found
41. Age estimation from „forenames‟
Data: Monica (CACI, Ltd.) and Birth Certificate Data (Office of National
Statistics)
45%
40%
35%
Percent
30%
25%
20%
15%
10%
5%
0%
Age group
PAUL
BETTY
GUY
MUHAMMAD
47. Segregation Analysis
• The value of the information theory index is between 0 (low
segregation) and 1 (high segregation).
Ethnic Groups
H (Domestic
buildings and
gardens)
H (Week Nights)
H (Week Days)
H (Weekend)
British
0.483
0.401
0.211
0.315
Irish
0.670
0.571
0.357
0.475
White Other
0.630
0.510
0.303
0.420
Pakistani
0.765
0.679
0.488
0.633
Indian
0.748
0.673
0.451
0.590
Bangladeshi
0.864
0.834
0.671
0.784
Black Caribbean
0.831
0.808
0.548
0.666
Black African
0.764
0.704
0.492
0.640
Chinese
0.712
0.608
0.403
0.524
Other
0.710
0.593
0.374
0.497
59. Data available through the Twitter API
•
•
•
•
•
•
•
•
•
User Creation Date
Followers
Friends
User ID
Language
Location
Name
Screen Name
Time Zone
•
•
•
•
•
Geo Enabled
Latitude
Longitude
Tweet date and time
Tweet text
66. Temporal Analysis of the Twitter Data
• Data: 12 September, 2012 – 25 September, 2013
• We extracted a total of approx. 800 million tweets over the last year
• A temporal activity analysis of different cities could potentially reveal a lot
of information about the residents of the city
• But Twitter data is not clean and has lots of problems !
67. Problems with the data
1) Extracting the data for individual cities or places
• Use of bounding boxes to extract the data
• New York City NW: 40.91762, -73.7004 SW: 40.47662, -74.2589
• http://isithackday.com could be used to find the bounding boxes of
different cities
68. Problems with the data
2) Twitter data has a GMT and BST timestamp. Conversion to other time
stamp is very time consuming
• 12p.m. in „London‟ is 5a.m in Los Angeles, if the time stamp is GMT.
• 12p.m. in „London‟ is 6a.m in Los Angeles, if the time stamp is BST.
69. Temporal Analysis of different cities
• Approx. 170 million tweets were sent from the following 30 cities.
40
35
Number of Tweets (Millions)
30
25
20
15
10
5
0
76. What is R?
• The R statistical programming language is a free open
source package based on the S language developed
by Bell Labs.
• The language is very powerful for writing programs.
• Many statistical functions are already built in.
• Very easy to create maps and different visualizations.
77. What is R?
• You will have to write some code to get the things
done !
• R is available @ www.r-project.org
• Supports both 32 and 64 bit Windows
PCs, Linux, Unix, and Mac OS operating sytems
80. Interacting with R
Math:
> 1 + 1
[1] 2
> 1 + 1 * 7
[1] 8
> (1 + 1) * 7
[1] 14
Variables:
> x
> x
[1]
> y
> y
[1]
> z
> z
[1]
<- 1
1
<- 2
2
<- x+y
3
> sqrt(16)
[1] 4
80
81. Importing Data
• How do we get data into R?
• First make sure your data is in an easy to read
format such as CSV (Comma Separated Values)
• Use code:
– D <- read.csv(“path”,sep=“,”,header=T)
– D <- read.table(“path”,sep=“,”,header=T)
82. Working with data.
• Accessing columns.
• D has our data in it…. But you can‟t see it directly.
• To select a column use D$column.
85. How to create a heat map in R ?
• Three steps:
– Read a CSV file
– Chose the colours for the heat map
– Create the heat map
86. How to create a heat map in R ?
• Step 1: Read a CSV file
read.csv(“FILE NAME", sep=",", header=T)
87. How to create a heat map in R ?
• Step 1: Read a CSV file
read.csv(“FILE NAME", sep=",", header=T)
• Assign it to a variable
Input <- read.csv(“FILE NAME", sep=",", header=T)
i.e. with „<„ (less than) and „-‟ (dash) symbols.
88. How to create a heat map in R ?
• Step 1: Read a CSV file
89. How to create a heat map in R ?
• Step 2: Chose the colours for the heat map
colours <- c(0) (Create an empty variable)
90. How to create a heat map in R ?
• Step 2: Chose the colours for the heat map
colours <- c(0)
colours[1] <- "#FDD49E"
colours[2] <- "#FDBB84"
colours[3] <- "#FC8D59"
colours[4] <- "#EF6548"
colours[5] <- "#D7301F"
colours[6] <- "#B30000"
colours[7] <- "#7F0000"
91. How to create a heat map in R ?
• Step 2: Chose the colours for the heat map
colours <- c(0)
colours[1] <- "#FDD49E"
colours[2] <- "#FDBB84"
colours[3] <- "#FC8D59"
colours[4] <- "#EF6548"
colours[5] <- "#D7301F"
colours[6] <- "#B30000"
colours[7] <- "#7F0000"
92. How to create a heat map in R ?
• Step 3: Create the heat map
heatmap(Input1_matrix, scale="col", Rowv = NA, Colv =
NA, col=colours)
93. How to create a heat map in R ?
• Step 3: Create the heat map
heatmap(Input1_matrix, scale="col", Rowv = NA, Colv =
NA, col=colours)
Input Data
94. How to create a heat map in R ?
• Step 3: Create the heat map
heatmap(Input1_matrix, scale="col", Rowv = NA, Colv =
NA, col=colours)
Whether to apply scaling on the data. Options
are „col‟, „row‟, and „none‟.
95. How to create a heat map in R ?
• Step 3: Create the heat map
heatmap(Input1_matrix, scale="col", Rowv = NA, Colv =
NA, col=colours)
Leave them as they are!
96. How to create a heat map in R ?
• Step 3: Create the heat map
heatmap(Input1_matrix, scale="col", Rowv = NA, Colv =
NA, col=colours)
Colours
97. •
•
•
•
Open Data
Crowd-Sourced Data (Social Media)
Analysis and Visualisation Challenges
Twitter Case Study
• Spatial Analysis
• Temporal Analysis
• R
• A brief introduction
• How to create heat maps
Any Questions ?