SlideShare une entreprise Scribd logo
1  sur  97
Open Data: Analysis and Visualisation
Muhammad Adnan
Department of Geography, University College London

Web: http://www.uncertaintyofidentity.com
Twitter: @gisandtech
Dr. Muhammad Adnan
• Research Associate
– Working on an EPSRC funded project “Uncertainty of Identity”
– http://www.uncertaintyofidentity.com

Research Interests
• Data Mining
• Social Media Analysis
• Data Visualisation
Outline
• Open Data
• Crowd-Sourced Data (Social Media)
• Analysis and Visualisation Challenges
• Twitter Case Study
• Spatial Analysis
• Temporal Analysis

• R
• A brief introduction
• How to create heat maps
Open data
Data that is:
 Open and Free to the public
 Complete
 Accessible
 Timely
 Machine processable
 Non-discriminatory
Dataset examples
•
•
•
•
•
•
•
•
•

National Budgets
Car registries
National roads
Water heights
Schools
Weather
Public transport
Council tax bands
And many more
Census Profiler
• http://www.censusprofiler.org/
• Users can visualise 2001 Census data
Education Profiler
• http://www.educationprofiler.org/
• Users can visualise education datasets
Open Data Profiler
• http://www.opendataprofiler.com/
• Users can visualise 60 different 2011 Census datasets
Crowd Sourced datasets
• Twitter
• Public streaming API can be used to download live tweets

• Four Square
• Has an API which can be used to access the Four Square data

• Facebook
• Facebook applications can access user information
• Flickr
• Wikipedia
• Youtube
How big are crowd sourced datasets ?
• Facebook
• Number of active users: 850 Million
• Average daily uploaded photos: 360 Million
• Total data size: 30+ Petabytes

• Twitter
• Number of active users: 200 Million
• Daily tweets (posts): 350 Million

• Foursquare
• Number of active users: 15 Million
• Total check-ins: 1.5 Billion
What are the issues with these datasets ?
• How representative social media data sets are of the Census or
Electoral roll data ?

• Who: Ethnicity, Gender, and Age of social media users
• Where: Where social media conversations are happening
and who is leading them
•

Intelligence about where people are located and what they are doing

• When: What time of day conversations happen
Twitter (www.twitter.com)
• Online social-networking and micro blogging service
• Launched in 2006

• Users can send messages of 140 characters or less
• Approximately 200 million active users

• 350 million tweets daily
• In 2012, UK and London were ranked 4th and
3rd, respectively, in terms of the number of posted tweets
Basic Analysis of the Twitter data
Data available through the Twitter API
•
•
•
•
•
•
•
•
•

User Creation Date
Followers
Friends
User ID
Language
Location
Name
Screen Name
Time Zone

•
•
•
•
•

Geo Enabled
Latitude
Longitude
Tweet date and time
Tweet text

Users can download 1% sample of the live tweets through the API
Created with approx. 100 million tweets
4 million geo-tagged tweets downloaded during August and
December, 2012
4 million geo-tagged tweets downloaded during August and
December, 2012
Hourly and Daily Twitter Activity in London
Hourly Twitter Activity in London
Daily Twitter Activity in London
Monday

Tuesday

12000

12000

10000

10000

8000

8000

6000

6000

4000

4000

2000

2000

0

0
1

2

3

4

5

6

7

8

9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

1

2

3

4

5

6

7

8

Hour

9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Hour

Wednesday

Thursday

12000

12000

10000

10000

8000

8000

6000

6000

4000

4000

2000

2000

0

0
1

2

3

4

5

6

7

8

9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Hour

1

2

3

4

5

6

7

8

9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Hour
Daily Twitter Activity in London
Friday

Saturday

12000

12000

10000

10000

8000

8000

6000

6000

4000

4000

2000

2000

0

0
1

2

3

4

5

6

7

8

9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Hour

Sunday
12000
10000
8000
6000
4000
2000
0
1

2

3

4

5

6

7

8

9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Hour

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Hour
Analysis of User Names on Twitter
• A name is a statement of the person‟s ethnic, linguistic, and
cultural identity.
• E.g. Alex Singleton is an Anglo-Saxon name. Similarly, Pablo
mateos is a Spanish (Hispanic) name.
Analysing Names on Twitter
• Some examples of NAME variations on Twitter
Real Names
Kevin Hodge
Andre Alves
Jose de Franco
Carolina Thomas, Dr.
Prof. Martha Del Val
Fabíola Sanchez Fernandes

Fake Names
JustinBieber_Home.
WHAT IS LOVE?
MysticMind
KIRILL_aka_KID
Vanessa
Petuna
Analysing Names on Twitter
• Some examples of NAME variations on Twitter
Real Names
Kevin Hodge -> F: „Kevin‟ ; S: „Hodge‟
Andre Alves -> F: „Andre‟ ; S: „Alves‟
Jose De Franco -> F: „Jose‟ ; S: „De Franco‟
Carolina Thomas, Dr. -> F: „Carolina‟ ; S: „Thomas‟
Prof. Martha Del Val -> F: „Martha‟ ; S: „Del Val‟
Fabíola Sanchez Fernandes -> F: „Fabíola‟ ; S: „Fernandes‟
Where they tweet from:
Where they tweet from:
Where they tweet from:
Predicting Ethnicity of Twitter Users by
using their „Names‟
• A name is a statement of the person‟s ethnic, linguistic, and
cultural identity.
• E.g. Alex Singleton is an Anglo-Saxon name. Similarly, Pablo
mateos is a Spanish (Hispanic) name.
Classifying Twitter Data to ethnic origins
• Applied ONOMAP (www.onomap.org) on FORENAME +
SURNAME pairs
Kevin Hodge (ENGLISH)
Pablo Mateos (Spanish)
…
…
…
…
Top 10 Ethnic Groups of Twitter Users
Tweeting Activity by different Ethnic Groups
Comparison of Ethnic Groups between „2011
Census‟ and „Twitter‟
• Onomap groups were aggregated to match the appropriate
groups from the Census

London

Total

White
British

White
other

Indian

Pakistani

Bangladeshi

Black
Chinese
African

Week
Night

53611

71.35% 12.12%

2.63%

2.63%

1.82%

1.52%

1.74%

Week Day

80676

73.12% 11.80%

2.41%

2.41%

1.56%

1.25%

1.61%

Weekend

67351

72.86% 12.17%

2.61%

2.61%

1.67%

1.39%

1.73%

44.89% 12.65%

6.64%

2.74%

2.72%

7.02%

1.52%

2011 Census
Comparison of the distribution of ethnicity with the
2011 Census
White British (Quintiles)

2011 Census

Twitter
Gender and Age Analysis of Twitter Users
by using their „forenames‟
Gender Analysis of Twitter Users
60%
50%
40%
30%
20%
10%
0%
Male

Female
Number of Tweets

Unisex
Number of Unique Users

Not Found
Age estimation from „forenames‟
Data: Monica (CACI, Ltd.) and Birth Certificate Data (Office of National
Statistics)
45%
40%
35%

Percent

30%
25%
20%
15%
10%
5%
0%

Age group
PAUL

BETTY

GUY

MUHAMMAD
Age-Sex structure of Twitter Users and 2011 Census

Male

Female
Tweets by different Land-use Categories
Temporal Activity: Tweets from different Land-use
Categories
Ethnic Segregation of Twitter Users
Segregation Analysis
Segregation Analysis
• The value of the information theory index is between 0 (low
segregation) and 1 (high segregation).
Ethnic Groups

H (Domestic
buildings and
gardens)

H (Week Nights)

H (Week Days)

H (Weekend)

British

0.483

0.401

0.211

0.315

Irish

0.670

0.571

0.357

0.475

White Other

0.630

0.510

0.303

0.420

Pakistani

0.765

0.679

0.488

0.633

Indian

0.748

0.673

0.451

0.590

Bangladeshi

0.864

0.834

0.671

0.784

Black Caribbean

0.831

0.808

0.548

0.666

Black African

0.764

0.704

0.492

0.640

Chinese

0.712

0.608

0.403

0.524

Other

0.710

0.593

0.374

0.497
Extending the analysis to other cities
Tweet density map of London
Tweet density map of Paris
Tweet density map of New York City
Top 10 ethnic groups in London
Top 10 ethnic groups in Paris
Top 10 ethnic groups in NYC
Tweeting Activity by different Ethnic Groups (NYC)
Tweeting Activity by different Ethnic Groups (Paris)
Gender Analysis
Exploring the Languages on Twitter
Data available through the Twitter API
•
•
•
•
•
•
•
•
•

User Creation Date
Followers
Friends
User ID
Language
Location
Name
Screen Name
Time Zone

•
•
•
•
•

Geo Enabled
Latitude
Longitude
Tweet date and time
Tweet text
Twitter Languages (World)
Twitter Languages (Europe)
Twitter Language Maps
Twitter Language Maps
Twitter Language Maps
Temporal Analysis of the data sets
Temporal Analysis of the Twitter Data
• Data: 12 September, 2012 – 25 September, 2013
• We extracted a total of approx. 800 million tweets over the last year
• A temporal activity analysis of different cities could potentially reveal a lot
of information about the residents of the city
• But Twitter data is not clean and has lots of problems !
Problems with the data
1) Extracting the data for individual cities or places

• Use of bounding boxes to extract the data
• New York City NW: 40.91762, -73.7004 SW: 40.47662, -74.2589

• http://isithackday.com could be used to find the bounding boxes of
different cities
Problems with the data
2) Twitter data has a GMT and BST timestamp. Conversion to other time
stamp is very time consuming

• 12p.m. in „London‟ is 5a.m in Los Angeles, if the time stamp is GMT.
• 12p.m. in „London‟ is 6a.m in Los Angeles, if the time stamp is BST.
Temporal Analysis of different cities
• Approx. 170 million tweets were sent from the following 30 cities.
40

35

Number of Tweets (Millions)

30

25

20

15

10

5

0
Temporal Analysis of different cities
LONDON
Temporal Analysis of different cities
LONDON
PARIS
Temporal Analysis of different cities
JAKARTA
Temporal Analysis of different cities
JAKARTA
RIYADH
Temporal Analysis of different cities
ISTANBUL
JAKARTA
Introduction to R
What is R?
• The R statistical programming language is a free open
source package based on the S language developed
by Bell Labs.
• The language is very powerful for writing programs.
• Many statistical functions are already built in.
• Very easy to create maps and different visualizations.
What is R?
• You will have to write some code to get the things
done !

• R is available @ www.r-project.org
• Supports both 32 and 64 bit Windows
PCs, Linux, Unix, and Mac OS operating sytems
Getting Started

• The R GUI?
Getting Started
Interacting with R
Math:
> 1 + 1
[1] 2
> 1 + 1 * 7
[1] 8
> (1 + 1) * 7
[1] 14

Variables:
> x
> x
[1]
> y
> y
[1]
> z
> z
[1]

<- 1
1
<- 2

2
<- x+y
3

> sqrt(16)
[1] 4
80
Importing Data
• How do we get data into R?
• First make sure your data is in an easy to read
format such as CSV (Comma Separated Values)
• Use code:
– D <- read.csv(“path”,sep=“,”,header=T)
– D <- read.table(“path”,sep=“,”,header=T)
Working with data.
• Accessing columns.
• D has our data in it…. But you can‟t see it directly.
• To select a column use D$column.
Basic Graphics

• Histogram
– hist(D$wg)
How to create a heat map in R ?
How to create a heat map in R ?
• Three steps:
– Read a CSV file
– Chose the colours for the heat map
– Create the heat map
How to create a heat map in R ?
• Step 1: Read a CSV file
read.csv(“FILE NAME", sep=",", header=T)
How to create a heat map in R ?
• Step 1: Read a CSV file
read.csv(“FILE NAME", sep=",", header=T)

• Assign it to a variable
Input <- read.csv(“FILE NAME", sep=",", header=T)
i.e. with „<„ (less than) and „-‟ (dash) symbols.
How to create a heat map in R ?
• Step 1: Read a CSV file
How to create a heat map in R ?
• Step 2: Chose the colours for the heat map
colours <- c(0) (Create an empty variable)
How to create a heat map in R ?
• Step 2: Chose the colours for the heat map
colours <- c(0)
colours[1] <- "#FDD49E"
colours[2] <- "#FDBB84"
colours[3] <- "#FC8D59"
colours[4] <- "#EF6548"
colours[5] <- "#D7301F"
colours[6] <- "#B30000"
colours[7] <- "#7F0000"
How to create a heat map in R ?
• Step 2: Chose the colours for the heat map
colours <- c(0)
colours[1] <- "#FDD49E"
colours[2] <- "#FDBB84"
colours[3] <- "#FC8D59"
colours[4] <- "#EF6548"
colours[5] <- "#D7301F"
colours[6] <- "#B30000"
colours[7] <- "#7F0000"
How to create a heat map in R ?
• Step 3: Create the heat map
heatmap(Input1_matrix, scale="col", Rowv = NA, Colv =
NA, col=colours)
How to create a heat map in R ?
• Step 3: Create the heat map
heatmap(Input1_matrix, scale="col", Rowv = NA, Colv =
NA, col=colours)

Input Data
How to create a heat map in R ?
• Step 3: Create the heat map
heatmap(Input1_matrix, scale="col", Rowv = NA, Colv =
NA, col=colours)

Whether to apply scaling on the data. Options
are „col‟, „row‟, and „none‟.
How to create a heat map in R ?
• Step 3: Create the heat map
heatmap(Input1_matrix, scale="col", Rowv = NA, Colv =
NA, col=colours)

Leave them as they are!
How to create a heat map in R ?
• Step 3: Create the heat map
heatmap(Input1_matrix, scale="col", Rowv = NA, Colv =
NA, col=colours)

Colours
•
•
•
•

Open Data
Crowd-Sourced Data (Social Media)
Analysis and Visualisation Challenges
Twitter Case Study
• Spatial Analysis
• Temporal Analysis

• R
• A brief introduction
• How to create heat maps

Any Questions ?

Contenu connexe

Similaire à Open Data: Analysis and Visualisation

Spatio-temporal demographic classification of the Twitter users
Spatio-temporal demographic classification of the Twitter usersSpatio-temporal demographic classification of the Twitter users
Spatio-temporal demographic classification of the Twitter usersDr Muhammad Adnan
 
Analysing the digital traces of Social Media users
Analysing the digital traces of Social Media usersAnalysing the digital traces of Social Media users
Analysing the digital traces of Social Media usersDr Muhammad Adnan
 
Linking Socio-economic and Demographic Characteristics to Twitter Topics - Gu...
Linking Socio-economic and Demographic Characteristics to Twitter Topics - Gu...Linking Socio-economic and Demographic Characteristics to Twitter Topics - Gu...
Linking Socio-economic and Demographic Characteristics to Twitter Topics - Gu...Guy Lansley
 
Data for the Non-Data Librarian
Data for the Non-Data LibrarianData for the Non-Data Librarian
Data for the Non-Data LibrarianSAGE Publishing
 
Going beyond google 2 philadelphia loss conference
Going beyond google 2 philadelphia loss conferenceGoing beyond google 2 philadelphia loss conference
Going beyond google 2 philadelphia loss conferencemikep007
 
Social Media in Australia: A ‘Big Data’ Perspective on Twitter
Social Media in Australia: A ‘Big Data’ Perspective on TwitterSocial Media in Australia: A ‘Big Data’ Perspective on Twitter
Social Media in Australia: A ‘Big Data’ Perspective on TwitterAxel Bruns
 
Uncertainty of Identity: Classifying Twitter Data
Uncertainty of Identity: Classifying Twitter DataUncertainty of Identity: Classifying Twitter Data
Uncertainty of Identity: Classifying Twitter DataDr Muhammad Adnan
 
A Geodemographic Analysis of Ethnicity and Identity of Twitter Users in Great...
A Geodemographic Analysis of Ethnicity and Identity of Twitter Users in Great...A Geodemographic Analysis of Ethnicity and Identity of Twitter Users in Great...
A Geodemographic Analysis of Ethnicity and Identity of Twitter Users in Great...Dr Muhammad Adnan
 
Monitoring Regional Alcohol Consumption through Social Media
Monitoring Regional Alcohol Consumption through Social MediaMonitoring Regional Alcohol Consumption through Social Media
Monitoring Regional Alcohol Consumption through Social MediaDaniel Kershaw
 
Phd Colloquium Spatial Analysis
Phd Colloquium Spatial AnalysisPhd Colloquium Spatial Analysis
Phd Colloquium Spatial Analysisalistairleak
 
Big Data in Economic Research: Twitter, Phone calls and Political events
Big Data in Economic Research: Twitter, Phone calls and Political eventsBig Data in Economic Research: Twitter, Phone calls and Political events
Big Data in Economic Research: Twitter, Phone calls and Political eventsPhDSofiaUniversity
 
Public libraries and hispanics
Public libraries and hispanicsPublic libraries and hispanics
Public libraries and hispanicsSWX-WAL
 
The language of social media
The language of social mediaThe language of social media
The language of social mediaDiana Maynard
 
Social Web 2014: Final Presentations (Part II)
Social Web 2014: Final Presentations (Part II)Social Web 2014: Final Presentations (Part II)
Social Web 2014: Final Presentations (Part II)Lora Aroyo
 
Digital Humanities Venice Group Presentation - Opening the Libro d'Oro
Digital Humanities Venice Group Presentation - Opening the Libro d'OroDigital Humanities Venice Group Presentation - Opening the Libro d'Oro
Digital Humanities Venice Group Presentation - Opening the Libro d'OroMichael Mitchell
 
The Challenge of Digital Sources in the Web Age: Common Tensions Across Three...
The Challenge of Digital Sources in the Web Age: Common Tensions Across Three...The Challenge of Digital Sources in the Web Age: Common Tensions Across Three...
The Challenge of Digital Sources in the Web Age: Common Tensions Across Three...Digital History
 
20140408 digital newspapers collections [idlc kuala lumpur]
20140408 digital newspapers collections [idlc kuala lumpur]20140408 digital newspapers collections [idlc kuala lumpur]
20140408 digital newspapers collections [idlc kuala lumpur]Frederick Zarndt
 
Evaluating the Utility of Geo-referenced Twitter Data as a Source of Reliable...
Evaluating the Utility of Geo-referenced Twitter Data as a Source of Reliable...Evaluating the Utility of Geo-referenced Twitter Data as a Source of Reliable...
Evaluating the Utility of Geo-referenced Twitter Data as a Source of Reliable...Guy Lansley
 
Twitter and Alcohol - BrightonSEO Pressentation
Twitter and Alcohol - BrightonSEO PressentationTwitter and Alcohol - BrightonSEO Pressentation
Twitter and Alcohol - BrightonSEO PressentationDaniel Kershaw
 
Social Media Training :: Market Research Assoc. 2010
Social Media Training :: Market Research Assoc. 2010Social Media Training :: Market Research Assoc. 2010
Social Media Training :: Market Research Assoc. 2010Eric Schwartzman
 

Similaire à Open Data: Analysis and Visualisation (20)

Spatio-temporal demographic classification of the Twitter users
Spatio-temporal demographic classification of the Twitter usersSpatio-temporal demographic classification of the Twitter users
Spatio-temporal demographic classification of the Twitter users
 
Analysing the digital traces of Social Media users
Analysing the digital traces of Social Media usersAnalysing the digital traces of Social Media users
Analysing the digital traces of Social Media users
 
Linking Socio-economic and Demographic Characteristics to Twitter Topics - Gu...
Linking Socio-economic and Demographic Characteristics to Twitter Topics - Gu...Linking Socio-economic and Demographic Characteristics to Twitter Topics - Gu...
Linking Socio-economic and Demographic Characteristics to Twitter Topics - Gu...
 
Data for the Non-Data Librarian
Data for the Non-Data LibrarianData for the Non-Data Librarian
Data for the Non-Data Librarian
 
Going beyond google 2 philadelphia loss conference
Going beyond google 2 philadelphia loss conferenceGoing beyond google 2 philadelphia loss conference
Going beyond google 2 philadelphia loss conference
 
Social Media in Australia: A ‘Big Data’ Perspective on Twitter
Social Media in Australia: A ‘Big Data’ Perspective on TwitterSocial Media in Australia: A ‘Big Data’ Perspective on Twitter
Social Media in Australia: A ‘Big Data’ Perspective on Twitter
 
Uncertainty of Identity: Classifying Twitter Data
Uncertainty of Identity: Classifying Twitter DataUncertainty of Identity: Classifying Twitter Data
Uncertainty of Identity: Classifying Twitter Data
 
A Geodemographic Analysis of Ethnicity and Identity of Twitter Users in Great...
A Geodemographic Analysis of Ethnicity and Identity of Twitter Users in Great...A Geodemographic Analysis of Ethnicity and Identity of Twitter Users in Great...
A Geodemographic Analysis of Ethnicity and Identity of Twitter Users in Great...
 
Monitoring Regional Alcohol Consumption through Social Media
Monitoring Regional Alcohol Consumption through Social MediaMonitoring Regional Alcohol Consumption through Social Media
Monitoring Regional Alcohol Consumption through Social Media
 
Phd Colloquium Spatial Analysis
Phd Colloquium Spatial AnalysisPhd Colloquium Spatial Analysis
Phd Colloquium Spatial Analysis
 
Big Data in Economic Research: Twitter, Phone calls and Political events
Big Data in Economic Research: Twitter, Phone calls and Political eventsBig Data in Economic Research: Twitter, Phone calls and Political events
Big Data in Economic Research: Twitter, Phone calls and Political events
 
Public libraries and hispanics
Public libraries and hispanicsPublic libraries and hispanics
Public libraries and hispanics
 
The language of social media
The language of social mediaThe language of social media
The language of social media
 
Social Web 2014: Final Presentations (Part II)
Social Web 2014: Final Presentations (Part II)Social Web 2014: Final Presentations (Part II)
Social Web 2014: Final Presentations (Part II)
 
Digital Humanities Venice Group Presentation - Opening the Libro d'Oro
Digital Humanities Venice Group Presentation - Opening the Libro d'OroDigital Humanities Venice Group Presentation - Opening the Libro d'Oro
Digital Humanities Venice Group Presentation - Opening the Libro d'Oro
 
The Challenge of Digital Sources in the Web Age: Common Tensions Across Three...
The Challenge of Digital Sources in the Web Age: Common Tensions Across Three...The Challenge of Digital Sources in the Web Age: Common Tensions Across Three...
The Challenge of Digital Sources in the Web Age: Common Tensions Across Three...
 
20140408 digital newspapers collections [idlc kuala lumpur]
20140408 digital newspapers collections [idlc kuala lumpur]20140408 digital newspapers collections [idlc kuala lumpur]
20140408 digital newspapers collections [idlc kuala lumpur]
 
Evaluating the Utility of Geo-referenced Twitter Data as a Source of Reliable...
Evaluating the Utility of Geo-referenced Twitter Data as a Source of Reliable...Evaluating the Utility of Geo-referenced Twitter Data as a Source of Reliable...
Evaluating the Utility of Geo-referenced Twitter Data as a Source of Reliable...
 
Twitter and Alcohol - BrightonSEO Pressentation
Twitter and Alcohol - BrightonSEO PressentationTwitter and Alcohol - BrightonSEO Pressentation
Twitter and Alcohol - BrightonSEO Pressentation
 
Social Media Training :: Market Research Assoc. 2010
Social Media Training :: Market Research Assoc. 2010Social Media Training :: Market Research Assoc. 2010
Social Media Training :: Market Research Assoc. 2010
 

Dernier

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 

Dernier (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

Open Data: Analysis and Visualisation

  • 1. Open Data: Analysis and Visualisation Muhammad Adnan Department of Geography, University College London Web: http://www.uncertaintyofidentity.com Twitter: @gisandtech
  • 2. Dr. Muhammad Adnan • Research Associate – Working on an EPSRC funded project “Uncertainty of Identity” – http://www.uncertaintyofidentity.com Research Interests • Data Mining • Social Media Analysis • Data Visualisation
  • 3. Outline • Open Data • Crowd-Sourced Data (Social Media) • Analysis and Visualisation Challenges • Twitter Case Study • Spatial Analysis • Temporal Analysis • R • A brief introduction • How to create heat maps
  • 4. Open data Data that is:  Open and Free to the public  Complete  Accessible  Timely  Machine processable  Non-discriminatory
  • 5. Dataset examples • • • • • • • • • National Budgets Car registries National roads Water heights Schools Weather Public transport Council tax bands And many more
  • 6.
  • 7.
  • 8.
  • 9.
  • 10. Census Profiler • http://www.censusprofiler.org/ • Users can visualise 2001 Census data
  • 11. Education Profiler • http://www.educationprofiler.org/ • Users can visualise education datasets
  • 12. Open Data Profiler • http://www.opendataprofiler.com/ • Users can visualise 60 different 2011 Census datasets
  • 13. Crowd Sourced datasets • Twitter • Public streaming API can be used to download live tweets • Four Square • Has an API which can be used to access the Four Square data • Facebook • Facebook applications can access user information • Flickr • Wikipedia • Youtube
  • 14. How big are crowd sourced datasets ? • Facebook • Number of active users: 850 Million • Average daily uploaded photos: 360 Million • Total data size: 30+ Petabytes • Twitter • Number of active users: 200 Million • Daily tweets (posts): 350 Million • Foursquare • Number of active users: 15 Million • Total check-ins: 1.5 Billion
  • 15. What are the issues with these datasets ? • How representative social media data sets are of the Census or Electoral roll data ? • Who: Ethnicity, Gender, and Age of social media users • Where: Where social media conversations are happening and who is leading them • Intelligence about where people are located and what they are doing • When: What time of day conversations happen
  • 16. Twitter (www.twitter.com) • Online social-networking and micro blogging service • Launched in 2006 • Users can send messages of 140 characters or less • Approximately 200 million active users • 350 million tweets daily • In 2012, UK and London were ranked 4th and 3rd, respectively, in terms of the number of posted tweets
  • 17. Basic Analysis of the Twitter data
  • 18. Data available through the Twitter API • • • • • • • • • User Creation Date Followers Friends User ID Language Location Name Screen Name Time Zone • • • • • Geo Enabled Latitude Longitude Tweet date and time Tweet text Users can download 1% sample of the live tweets through the API
  • 19. Created with approx. 100 million tweets
  • 20.
  • 21. 4 million geo-tagged tweets downloaded during August and December, 2012
  • 22. 4 million geo-tagged tweets downloaded during August and December, 2012
  • 23. Hourly and Daily Twitter Activity in London
  • 25. Daily Twitter Activity in London Monday Tuesday 12000 12000 10000 10000 8000 8000 6000 6000 4000 4000 2000 2000 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 1 2 3 4 5 6 7 8 Hour 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Hour Wednesday Thursday 12000 12000 10000 10000 8000 8000 6000 6000 4000 4000 2000 2000 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Hour 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Hour
  • 26. Daily Twitter Activity in London Friday Saturday 12000 12000 10000 10000 8000 8000 6000 6000 4000 4000 2000 2000 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Hour Sunday 12000 10000 8000 6000 4000 2000 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Hour 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Hour
  • 27. Analysis of User Names on Twitter • A name is a statement of the person‟s ethnic, linguistic, and cultural identity. • E.g. Alex Singleton is an Anglo-Saxon name. Similarly, Pablo mateos is a Spanish (Hispanic) name.
  • 28. Analysing Names on Twitter • Some examples of NAME variations on Twitter Real Names Kevin Hodge Andre Alves Jose de Franco Carolina Thomas, Dr. Prof. Martha Del Val Fabíola Sanchez Fernandes Fake Names JustinBieber_Home. WHAT IS LOVE? MysticMind KIRILL_aka_KID Vanessa Petuna
  • 29. Analysing Names on Twitter • Some examples of NAME variations on Twitter Real Names Kevin Hodge -> F: „Kevin‟ ; S: „Hodge‟ Andre Alves -> F: „Andre‟ ; S: „Alves‟ Jose De Franco -> F: „Jose‟ ; S: „De Franco‟ Carolina Thomas, Dr. -> F: „Carolina‟ ; S: „Thomas‟ Prof. Martha Del Val -> F: „Martha‟ ; S: „Del Val‟ Fabíola Sanchez Fernandes -> F: „Fabíola‟ ; S: „Fernandes‟
  • 33. Predicting Ethnicity of Twitter Users by using their „Names‟ • A name is a statement of the person‟s ethnic, linguistic, and cultural identity. • E.g. Alex Singleton is an Anglo-Saxon name. Similarly, Pablo mateos is a Spanish (Hispanic) name.
  • 34. Classifying Twitter Data to ethnic origins • Applied ONOMAP (www.onomap.org) on FORENAME + SURNAME pairs Kevin Hodge (ENGLISH) Pablo Mateos (Spanish) … … … …
  • 35. Top 10 Ethnic Groups of Twitter Users
  • 36. Tweeting Activity by different Ethnic Groups
  • 37. Comparison of Ethnic Groups between „2011 Census‟ and „Twitter‟ • Onomap groups were aggregated to match the appropriate groups from the Census London Total White British White other Indian Pakistani Bangladeshi Black Chinese African Week Night 53611 71.35% 12.12% 2.63% 2.63% 1.82% 1.52% 1.74% Week Day 80676 73.12% 11.80% 2.41% 2.41% 1.56% 1.25% 1.61% Weekend 67351 72.86% 12.17% 2.61% 2.61% 1.67% 1.39% 1.73% 44.89% 12.65% 6.64% 2.74% 2.72% 7.02% 1.52% 2011 Census
  • 38. Comparison of the distribution of ethnicity with the 2011 Census White British (Quintiles) 2011 Census Twitter
  • 39. Gender and Age Analysis of Twitter Users by using their „forenames‟
  • 40. Gender Analysis of Twitter Users 60% 50% 40% 30% 20% 10% 0% Male Female Number of Tweets Unisex Number of Unique Users Not Found
  • 41. Age estimation from „forenames‟ Data: Monica (CACI, Ltd.) and Birth Certificate Data (Office of National Statistics) 45% 40% 35% Percent 30% 25% 20% 15% 10% 5% 0% Age group PAUL BETTY GUY MUHAMMAD
  • 42. Age-Sex structure of Twitter Users and 2011 Census Male Female
  • 43. Tweets by different Land-use Categories
  • 44. Temporal Activity: Tweets from different Land-use Categories
  • 45. Ethnic Segregation of Twitter Users
  • 47. Segregation Analysis • The value of the information theory index is between 0 (low segregation) and 1 (high segregation). Ethnic Groups H (Domestic buildings and gardens) H (Week Nights) H (Week Days) H (Weekend) British 0.483 0.401 0.211 0.315 Irish 0.670 0.571 0.357 0.475 White Other 0.630 0.510 0.303 0.420 Pakistani 0.765 0.679 0.488 0.633 Indian 0.748 0.673 0.451 0.590 Bangladeshi 0.864 0.834 0.671 0.784 Black Caribbean 0.831 0.808 0.548 0.666 Black African 0.764 0.704 0.492 0.640 Chinese 0.712 0.608 0.403 0.524 Other 0.710 0.593 0.374 0.497
  • 48. Extending the analysis to other cities
  • 49. Tweet density map of London
  • 50. Tweet density map of Paris
  • 51. Tweet density map of New York City
  • 52. Top 10 ethnic groups in London
  • 53. Top 10 ethnic groups in Paris
  • 54. Top 10 ethnic groups in NYC
  • 55. Tweeting Activity by different Ethnic Groups (NYC)
  • 56. Tweeting Activity by different Ethnic Groups (Paris)
  • 59. Data available through the Twitter API • • • • • • • • • User Creation Date Followers Friends User ID Language Location Name Screen Name Time Zone • • • • • Geo Enabled Latitude Longitude Tweet date and time Tweet text
  • 65. Temporal Analysis of the data sets
  • 66. Temporal Analysis of the Twitter Data • Data: 12 September, 2012 – 25 September, 2013 • We extracted a total of approx. 800 million tweets over the last year • A temporal activity analysis of different cities could potentially reveal a lot of information about the residents of the city • But Twitter data is not clean and has lots of problems !
  • 67. Problems with the data 1) Extracting the data for individual cities or places • Use of bounding boxes to extract the data • New York City NW: 40.91762, -73.7004 SW: 40.47662, -74.2589 • http://isithackday.com could be used to find the bounding boxes of different cities
  • 68. Problems with the data 2) Twitter data has a GMT and BST timestamp. Conversion to other time stamp is very time consuming • 12p.m. in „London‟ is 5a.m in Los Angeles, if the time stamp is GMT. • 12p.m. in „London‟ is 6a.m in Los Angeles, if the time stamp is BST.
  • 69. Temporal Analysis of different cities • Approx. 170 million tweets were sent from the following 30 cities. 40 35 Number of Tweets (Millions) 30 25 20 15 10 5 0
  • 70. Temporal Analysis of different cities LONDON
  • 71. Temporal Analysis of different cities LONDON PARIS
  • 72. Temporal Analysis of different cities JAKARTA
  • 73. Temporal Analysis of different cities JAKARTA RIYADH
  • 74. Temporal Analysis of different cities ISTANBUL JAKARTA
  • 76. What is R? • The R statistical programming language is a free open source package based on the S language developed by Bell Labs. • The language is very powerful for writing programs. • Many statistical functions are already built in. • Very easy to create maps and different visualizations.
  • 77. What is R? • You will have to write some code to get the things done ! • R is available @ www.r-project.org • Supports both 32 and 64 bit Windows PCs, Linux, Unix, and Mac OS operating sytems
  • 80. Interacting with R Math: > 1 + 1 [1] 2 > 1 + 1 * 7 [1] 8 > (1 + 1) * 7 [1] 14 Variables: > x > x [1] > y > y [1] > z > z [1] <- 1 1 <- 2 2 <- x+y 3 > sqrt(16) [1] 4 80
  • 81. Importing Data • How do we get data into R? • First make sure your data is in an easy to read format such as CSV (Comma Separated Values) • Use code: – D <- read.csv(“path”,sep=“,”,header=T) – D <- read.table(“path”,sep=“,”,header=T)
  • 82. Working with data. • Accessing columns. • D has our data in it…. But you can‟t see it directly. • To select a column use D$column.
  • 84. How to create a heat map in R ?
  • 85. How to create a heat map in R ? • Three steps: – Read a CSV file – Chose the colours for the heat map – Create the heat map
  • 86. How to create a heat map in R ? • Step 1: Read a CSV file read.csv(“FILE NAME", sep=",", header=T)
  • 87. How to create a heat map in R ? • Step 1: Read a CSV file read.csv(“FILE NAME", sep=",", header=T) • Assign it to a variable Input <- read.csv(“FILE NAME", sep=",", header=T) i.e. with „<„ (less than) and „-‟ (dash) symbols.
  • 88. How to create a heat map in R ? • Step 1: Read a CSV file
  • 89. How to create a heat map in R ? • Step 2: Chose the colours for the heat map colours <- c(0) (Create an empty variable)
  • 90. How to create a heat map in R ? • Step 2: Chose the colours for the heat map colours <- c(0) colours[1] <- "#FDD49E" colours[2] <- "#FDBB84" colours[3] <- "#FC8D59" colours[4] <- "#EF6548" colours[5] <- "#D7301F" colours[6] <- "#B30000" colours[7] <- "#7F0000"
  • 91. How to create a heat map in R ? • Step 2: Chose the colours for the heat map colours <- c(0) colours[1] <- "#FDD49E" colours[2] <- "#FDBB84" colours[3] <- "#FC8D59" colours[4] <- "#EF6548" colours[5] <- "#D7301F" colours[6] <- "#B30000" colours[7] <- "#7F0000"
  • 92. How to create a heat map in R ? • Step 3: Create the heat map heatmap(Input1_matrix, scale="col", Rowv = NA, Colv = NA, col=colours)
  • 93. How to create a heat map in R ? • Step 3: Create the heat map heatmap(Input1_matrix, scale="col", Rowv = NA, Colv = NA, col=colours) Input Data
  • 94. How to create a heat map in R ? • Step 3: Create the heat map heatmap(Input1_matrix, scale="col", Rowv = NA, Colv = NA, col=colours) Whether to apply scaling on the data. Options are „col‟, „row‟, and „none‟.
  • 95. How to create a heat map in R ? • Step 3: Create the heat map heatmap(Input1_matrix, scale="col", Rowv = NA, Colv = NA, col=colours) Leave them as they are!
  • 96. How to create a heat map in R ? • Step 3: Create the heat map heatmap(Input1_matrix, scale="col", Rowv = NA, Colv = NA, col=colours) Colours
  • 97. • • • • Open Data Crowd-Sourced Data (Social Media) Analysis and Visualisation Challenges Twitter Case Study • Spatial Analysis • Temporal Analysis • R • A brief introduction • How to create heat maps Any Questions ?