Open Data: Analysis and Visualisation

Open Data: Analysis and Visualisation
Muhammad Adnan
Department of Geography, University College London

Web: http://www.uncertaintyofidentity.com
Twitter: @gisandtech

Dr. Muhammad Adnan
• Research Associate
– Working on an EPSRC funded project “Uncertainty of Identity”
– http://www.uncertaintyofidentity.com

Research Interests
• Data Mining
• Social Media Analysis
• Data Visualisation

Outline
• Open Data
• Crowd-Sourced Data (Social Media)
• Analysis and Visualisation Challenges
• Twitter Case Study
• Spatial Analysis
• Temporal Analysis

• R
• A brief introduction
• How to create heat maps

Open data
Data that is:
 Open and Free to the public
 Complete
 Accessible
 Timely
 Machine processable
 Non-discriminatory

Dataset examples
•
•
•
•
•
•
•
•
•

National Budgets
Car registries
National roads
Water heights
Schools
Weather
Public transport
Council tax bands
And many more

Census Profiler
• http://www.censusprofiler.org/
• Users can visualise 2001 Census data

Education Profiler
• http://www.educationprofiler.org/
• Users can visualise education datasets

Open Data Profiler
• http://www.opendataprofiler.com/
• Users can visualise 60 different 2011 Census datasets

Crowd Sourced datasets
• Twitter
• Public streaming API can be used to download live tweets

• Four Square
• Has an API which can be used to access the Four Square data

• Facebook
• Facebook applications can access user information
• Flickr
• Wikipedia
• Youtube

How big are crowd sourced datasets ?
• Facebook
• Number of active users: 850 Million
• Average daily uploaded photos: 360 Million
• Total data size: 30+ Petabytes

• Twitter
• Daily tweets (posts): 350 Million

• Foursquare
• Total check-ins: 1.5 Billion

What are the issues with these datasets ?
• How representative social media data sets are of the Census or
Electoral roll data ?

• Who: Ethnicity, Gender, and Age of social media users
• Where: Where social media conversations are happening
and who is leading them
•

Intelligence about where people are located and what they are doing

• When: What time of day conversations happen

Twitter (www.twitter.com)
• Online social-networking and micro blogging service
• Launched in 2006

• Users can send messages of 140 characters or less
• Approximately 200 million active users

• 350 million tweets daily
• In 2012, UK and London were ranked 4th and
3rd, respectively, in terms of the number of posted tweets

Basic Analysis of the Twitter data

Data available through the Twitter API
•
•
•
•
•
•
•
•
•

User Creation Date
Followers
Friends
User ID
Language
Location
Name
Screen Name
Time Zone

•
•
•
•
•

Geo Enabled
Latitude
Longitude
Tweet date and time
Tweet text

Users can download 1% sample of the live tweets through the API

Created with approx. 100 million tweets

4 million geo-tagged tweets downloaded during August and
December, 2012

Hourly and Daily Twitter Activity in London

Hourly Twitter Activity in London

Daily Twitter Activity in London
Monday

Tuesday

12000

12000

10000

10000

8000

8000

6000

6000

4000

4000

2000

2000

0

0
1

2

3

4

5

6

7

8

9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

1

2

3

4

5

6

7

8

Hour

9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Hour

Wednesday

Thursday

12000

12000

10000

10000

8000

8000

6000

6000

4000

4000

2000

2000

0

0
1

2

3

4

5

6

7

8

9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Hour

1

2

3

4

5

6

7

8

9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Hour

Daily Twitter Activity in London
Friday

Saturday

12000

12000

10000

10000

8000

8000

6000

6000

4000

4000

2000

2000

0

0
1

2

3

4

5

6

7

8

9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Hour

Sunday
12000
10000
8000
6000
4000
2000
0
1

2

3

4

5

6

7

8

9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Hour

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Hour

Analysis of User Names on Twitter
• A name is a statement of the person‟s ethnic, linguistic, and
cultural identity.
• E.g. Alex Singleton is an Anglo-Saxon name. Similarly, Pablo
mateos is a Spanish (Hispanic) name.

Analysing Names on Twitter
• Some examples of NAME variations on Twitter
Real Names
Kevin Hodge
Andre Alves
Jose de Franco
Carolina Thomas, Dr.
Prof. Martha Del Val
Fabíola Sanchez Fernandes

Fake Names
JustinBieber_Home.
WHAT IS LOVE?
MysticMind
KIRILL_aka_KID
Vanessa
Petuna

Analysing Names on Twitter
• Some examples of NAME variations on Twitter
Real Names
Kevin Hodge -> F: „Kevin‟ ; S: „Hodge‟
Andre Alves -> F: „Andre‟ ; S: „Alves‟
Jose De Franco -> F: „Jose‟ ; S: „De Franco‟
Carolina Thomas, Dr. -> F: „Carolina‟ ; S: „Thomas‟
Prof. Martha Del Val -> F: „Martha‟ ; S: „Del Val‟
Fabíola Sanchez Fernandes -> F: „Fabíola‟ ; S: „Fernandes‟

Predicting Ethnicity of Twitter Users by
using their „Names‟
• A name is a statement of the person‟s ethnic, linguistic, and
cultural identity.
• E.g. Alex Singleton is an Anglo-Saxon name. Similarly, Pablo
mateos is a Spanish (Hispanic) name.

Classifying Twitter Data to ethnic origins
• Applied ONOMAP (www.onomap.org) on FORENAME +
SURNAME pairs
Kevin Hodge (ENGLISH)
Pablo Mateos (Spanish)
…
…
…
…

Top 10 Ethnic Groups of Twitter Users

Tweeting Activity by different Ethnic Groups

Comparison of Ethnic Groups between „2011
Census‟ and „Twitter‟
• Onomap groups were aggregated to match the appropriate
groups from the Census

London

Total

White
British

White
other

Indian

Pakistani

Bangladeshi

Black
Chinese
African

Week
Night

53611

71.35% 12.12%

2.63%

2.63%

1.82%

1.52%

1.74%

Week Day

80676

73.12% 11.80%

2.41%

2.41%

1.56%

1.25%

1.61%

Weekend

67351

72.86% 12.17%

2.61%

2.61%

1.67%

1.39%

1.73%

44.89% 12.65%

6.64%

2.74%

2.72%

7.02%

1.52%

2011 Census

Comparison of the distribution of ethnicity with the
2011 Census
White British (Quintiles)

2011 Census

Twitter

Gender and Age Analysis of Twitter Users
by using their „forenames‟

Gender Analysis of Twitter Users
60%
50%
40%
30%
20%
10%
0%
Male

Female
Number of Tweets

Unisex
Number of Unique Users

Not Found

Age estimation from „forenames‟
Data: Monica (CACI, Ltd.) and Birth Certificate Data (Office of National
Statistics)
45%
40%
35%

Percent

30%
25%
20%
15%
10%
5%
0%

Age group
PAUL

BETTY

GUY

MUHAMMAD

Age-Sex structure of Twitter Users and 2011 Census

Male

Female

Tweets by different Land-use Categories

Temporal Activity: Tweets from different Land-use
Categories

Ethnic Segregation of Twitter Users

Segregation Analysis
• The value of the information theory index is between 0 (low
segregation) and 1 (high segregation).
Ethnic Groups

H (Domestic
buildings and
gardens)

H (Week Nights)

H (Week Days)

H (Weekend)

British

0.483

0.401

0.211

0.315

Irish

0.670

0.571

0.357

0.475

White Other

0.630

0.510

0.303

0.420

Pakistani

0.765

0.679

0.488

0.633

Indian

0.748

0.673

0.451

0.590

Bangladeshi

0.864

0.834

0.671

0.784

Black Caribbean

0.831

0.808

0.548

0.666

Black African

0.764

0.704

0.492

0.640

Chinese

0.712

0.608

0.403

0.524

Other

0.710

0.593

0.374

0.497

Extending the analysis to other cities

Tweet density map of New York City

Top 10 ethnic groups in London

Tweeting Activity by different Ethnic Groups (NYC)

Tweeting Activity by different Ethnic Groups (Paris)

Exploring the Languages on Twitter

Data available through the Twitter API
•
•
•
•
•
•
•
•
•

User Creation Date
Followers
Friends
User ID
Language
Location
Name
Screen Name
Time Zone

•
•
•
•
•

Geo Enabled
Latitude
Longitude
Tweet date and time
Tweet text

Temporal Analysis of the data sets

Temporal Analysis of the Twitter Data
• Data: 12 September, 2012 – 25 September, 2013
• We extracted a total of approx. 800 million tweets over the last year
• A temporal activity analysis of different cities could potentially reveal a lot
of information about the residents of the city
• But Twitter data is not clean and has lots of problems !

Problems with the data
1) Extracting the data for individual cities or places

• Use of bounding boxes to extract the data
• New York City NW: 40.91762, -73.7004 SW: 40.47662, -74.2589

• http://isithackday.com could be used to find the bounding boxes of
different cities

Problems with the data
2) Twitter data has a GMT and BST timestamp. Conversion to other time
stamp is very time consuming

• 12p.m. in „London‟ is 5a.m in Los Angeles, if the time stamp is GMT.
• 12p.m. in „London‟ is 6a.m in Los Angeles, if the time stamp is BST.

Temporal Analysis of different cities
• Approx. 170 million tweets were sent from the following 30 cities.
40

35

Number of Tweets (Millions)

30

25

20

15

10

5

0

LONDON

LONDON
PARIS

JAKARTA

JAKARTA
RIYADH

ISTANBUL
JAKARTA

What is R?
• The R statistical programming language is a free open
source package based on the S language developed
by Bell Labs.
• The language is very powerful for writing programs.
• Many statistical functions are already built in.
• Very easy to create maps and different visualizations.

What is R?
• You will have to write some code to get the things
done !

• R is available @ www.r-project.org
• Supports both 32 and 64 bit Windows
PCs, Linux, Unix, and Mac OS operating sytems

Getting Started

• The R GUI?

Interacting with R
Math:
> 1 + 1
[1] 2
> 1 + 1 * 7
[1] 8
> (1 + 1) * 7
[1] 14

Variables:
> x
> x
[1]
> y
> y
[1]
> z
> z
[1]

<- 1
1
<- 2

2
<- x+y
3

> sqrt(16)
[1] 4
80

Importing Data
• How do we get data into R?
• First make sure your data is in an easy to read
format such as CSV (Comma Separated Values)
• Use code:
– D <- read.csv(“path”,sep=“,”,header=T)
– D <- read.table(“path”,sep=“,”,header=T)

Working with data.
• Accessing columns.
• D has our data in it…. But you can‟t see it directly.
• To select a column use D$column.

Basic Graphics

• Histogram
– hist(D$wg)

How to create a heat map in R ?

• Three steps:
– Read a CSV file
– Chose the colours for the heat map
– Create the heat map

• Step 1: Read a CSV file
read.csv(“FILE NAME", sep=",", header=T)

read.csv(“FILE NAME", sep=",", header=T)

• Assign it to a variable
Input <- read.csv(“FILE NAME", sep=",", header=T)
i.e. with „<„ (less than) and „-‟ (dash) symbols.

• Step 2: Chose the colours for the heat map
colours <- c(0) (Create an empty variable)

• Step 2: Chose the colours for the heat map
colours <- c(0)
colours[1] <- "#FDD49E"
colours[2] <- "#FDBB84"
colours[3] <- "#FC8D59"
colours[4] <- "#EF6548"
colours[5] <- "#D7301F"
colours[6] <- "#B30000"
colours[7] <- "#7F0000"

• Step 3: Create the heat map
heatmap(Input1_matrix, scale="col", Rowv = NA, Colv =
NA, col=colours)

NA, col=colours)

Input Data

NA, col=colours)

Whether to apply scaling on the data. Options
are „col‟, „row‟, and „none‟.

NA, col=colours)

Leave them as they are!

NA, col=colours)

Colours

•
•
•
•

Open Data
Crowd-Sourced Data (Social Media)
Analysis and Visualisation Challenges
Twitter Case Study
• Spatial Analysis
• Temporal Analysis

• R
• A brief introduction
• How to create heat maps

Any Questions ?

Open Data: Analysis and Visualisation

Recommandé

Recommandé

Contenu connexe

Similaire à Open Data: Analysis and Visualisation

Similaire à Open Data: Analysis and Visualisation (20)

Dernier

Dernier (20)

Open Data: Analysis and Visualisation