SlideShare une entreprise Scribd logo
1  sur  13
Télécharger pour lire hors ligne
Geo-statistical Exploration of 
Milano Datasets 
Irene Celino and Gloria Re Calegari 
CEFRIEL 
1 
The 13th International Semantic Web Conference 2014 
Riva del Garda, Italy 
19 – 23 October 2014 
Semantic Statistics workshop 
19th October 2014
Research goal 
Long term goal 
• Compare data related to the same city but obtained from heterogeneous 
sources. Do they provide the same ‘picture’ of the city? 
• Can we update a dataset (expensive and time consuming, updated once every 
5/10 years) with another dataset (cheaper and always up to date)? 
Presentation target 
• Comparison of two heterogeneous datasets referring to the same city (Milan) to 
discover if they have any intrinsic correlation 
Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 2
Datasets available 
• Telecom (phone activity data) provided for their “Big Data Challenge” 
• Milan + surroundings 
• Nov-dic 2013 
• Grid of 10.000 cells 
• Activity recorded every ten minutes 
• A footprint for each cell 
• ISTAT - Italian National Statistical Institute 
• demographic data of 2011 and 2001 
• divided by sex, age and nationality 
Analysis performed using data in their original format (GeoJSON, csv) and serialization in RDF format 
at the end of the analysis. 
Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 3
Datasets available – different spatial granularity 
ISTAT – 50 surrounding towns + 9 
internal Milano districts 
Telecom – grid of 10.000 cells (250 x 
250 m each) 
Pre-processing of 
data required. 
Mapping of 
Telecom data into 
municipality 
granularity. 
Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 4
Methodology of analysis 
Goal of analysis: do the two datasets have the same intrinsic meaning? 
Unsupervised clustering to group data in each dataset: 
• k-means (euclidean distance) 
• Adaptive k-means (cosine distance) 
Comparison of the two clustering results to find correlation between the two datasets. 
Validation of the comparison: 
• Rand Index 
• Kappa Index 
(the closer to 1 the index is, the stronger the correlation between the data clusterings) 
Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 5
Experiments- Telecom vs ISTAT (range on total poulation) 
Telecom - Kmeans 6 classes 
Rand 
Index 
Kappa 
Index 
0,23 0,23 
ISTAT 2011 - range 6 classes 
Only partial 
correlation 
Try to add 
information to total 
population (feature 
vector with 
distribution of 
populatin divided by 
sex, age and 
nationality) 
Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 6
Experiments - Telecom vs ISTAT (2011 fine-grained population) 
Telecom - Kmeans 4 classes 
Rand 
Index 
Kappa 
Index 
0,77 0,81 
ISTAT 2011 – Kmeans 4 classes 
Good correlation 
Try to add historical 
information (2001 
datasets). 
Does adding 
temporal dynamic 
improve corrrelation? 
Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 7
Telecom - Kmeans 4 classes 
Rand 
Index 
Kappa 
Index 
0,77 0,81 
ISTAT 2011 + 2001 – Kmeans 4 classes 
Good correlation. 
The same as the 
previous test. 
Experiments- Telecom vs ISTAT (2011+2001 fine-grained population) 
Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 8
Clustering results serialization 
Results of 
clustering 
analysis 
Geographic format 
RDF format 
Geo + RDF format 
GeoJSON file 
RDF Data Cube vocabulary (adding in the 
Observation definition concepts of clustering 
algo and cluster number) 
Enrich GeoJSON with power of 
JSON-LD -> ‘’GeoJSON-LD’’ 
Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 9
‘’GeoJSON-LD’’ 
‘’GeoJSON-LD’’ is obtained by adding the ‘@context’ prefix to the GeoJSON file. 
The prefix specify how to interpret GeoJSON tags as RDF resources 
@context prefix GeoJSON file 
Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 10
‘’GeoJSON-LD’’ pros and cons 
Correct interpretation of the geographic information 
in the GeoJSON-LD file 
Problems in the interpretation of the RDF information. 
‘’GeoJSON-LD’’ does not interpret correctly a lists of lists 
(polygon coordinates representation) 
GeoJSON-LD is correctly interpreted as GeoJSON by 
GIS 
GeoJSON list of lists 
RDF serialization 
Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 11
Conclusion and future works 
• Identification of a correlation between phone activity data and 
demographic information 
• Further tests using different datasets (land use, Point of interests) 
• Find efficient method for handling the different granularity levels of the 
datasets 
• Method for handling temporal aspect of phone activity data (we lost temporal 
information during clustering process) 
• Find a method to overcome ‘’GeoJSON-LD’’ serialization problem 
Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 12
Thank you! Any question? 
Further details at: http://swa.cefriel.it/geo/ 
Irene Celino and Gloria Re Calegari 
CEFRIEL 
Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 13

Contenu connexe

Tendances

Tendances (8)

The RIPE NCC, Internet Measurements and IXPs
The RIPE NCC, Internet Measurements and IXPsThe RIPE NCC, Internet Measurements and IXPs
The RIPE NCC, Internet Measurements and IXPs
 
Dr Richard Fry - Using R as a GIS
Dr Richard Fry - Using R as a GISDr Richard Fry - Using R as a GIS
Dr Richard Fry - Using R as a GIS
 
Creating and Utilizing Linked Open Statistical Data for the Development of Ad...
Creating and Utilizing Linked Open Statistical Data for the Development of Ad...Creating and Utilizing Linked Open Statistical Data for the Development of Ad...
Creating and Utilizing Linked Open Statistical Data for the Development of Ad...
 
RIPE Atlas
RIPE AtlasRIPE Atlas
RIPE Atlas
 
A Data Scientist Exploration in the World of Heterogeneous Open Geospatial Data
A Data Scientist Exploration in the World of Heterogeneous Open Geospatial DataA Data Scientist Exploration in the World of Heterogeneous Open Geospatial Data
A Data Scientist Exploration in the World of Heterogeneous Open Geospatial Data
 
From Data to City Indicators: A Knowledge Graph for Supporting Automatic Gene...
From Data to City Indicators: A Knowledge Graph for Supporting Automatic Gene...From Data to City Indicators: A Knowledge Graph for Supporting Automatic Gene...
From Data to City Indicators: A Knowledge Graph for Supporting Automatic Gene...
 
Using R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy LansleyUsing R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy Lansley
 
So much data so many uses
So much data so many usesSo much data so many uses
So much data so many uses
 

Similaire à Geo-statistical Exploration of Milano Datasets

2013 GISCO Track, Quality Assessment and Improvement for Addressed Locations ...
2013 GISCO Track, Quality Assessment and Improvement for Addressed Locations ...2013 GISCO Track, Quality Assessment and Improvement for Addressed Locations ...
2013 GISCO Track, Quality Assessment and Improvement for Addressed Locations ...
GIS in the Rockies
 
Data Ecosystems for Geospatial Data
Data Ecosystems for Geospatial DataData Ecosystems for Geospatial Data
Data Ecosystems for Geospatial Data
Slim Turki, Dr.
 
Approaches to representing and delivering geospatial data in the semantic Web...
Approaches to representing and delivering geospatial data in the semantic Web...Approaches to representing and delivering geospatial data in the semantic Web...
Approaches to representing and delivering geospatial data in the semantic Web...
Paul Box
 
Egu2013 final
Egu2013 finalEgu2013 final
Egu2013 final
sirf13
 
Towards emergency vehicle routing using Geolinked Open Data: the case study o...
Towards emergency vehicle routing using Geolinked Open Data: the case study o...Towards emergency vehicle routing using Geolinked Open Data: the case study o...
Towards emergency vehicle routing using Geolinked Open Data: the case study o...
Sergio Consoli
 
Location Intelligence for Italian and UK Justice and Public Safety - BIWASumm...
Location Intelligence for Italian and UK Justice and Public Safety - BIWASumm...Location Intelligence for Italian and UK Justice and Public Safety - BIWASumm...
Location Intelligence for Italian and UK Justice and Public Safety - BIWASumm...
Iconsulting
 
LokeshShanmuganandam_BigData_FinalProjectReport
LokeshShanmuganandam_BigData_FinalProjectReportLokeshShanmuganandam_BigData_FinalProjectReport
LokeshShanmuganandam_BigData_FinalProjectReport
lokesh shanmuganandam
 

Similaire à Geo-statistical Exploration of Milano Datasets (20)

City Data Dating: emerging affinities between diverse urban datasets
City Data Dating: emerging affinities between diverse urban datasetsCity Data Dating: emerging affinities between diverse urban datasets
City Data Dating: emerging affinities between diverse urban datasets
 
2013 GISCO Track, Quality Assessment and Improvement for Addressed Locations ...
2013 GISCO Track, Quality Assessment and Improvement for Addressed Locations ...2013 GISCO Track, Quality Assessment and Improvement for Addressed Locations ...
2013 GISCO Track, Quality Assessment and Improvement for Addressed Locations ...
 
Lemmens kessler-agile-linked data v3-slideshare
Lemmens kessler-agile-linked data v3-slideshareLemmens kessler-agile-linked data v3-slideshare
Lemmens kessler-agile-linked data v3-slideshare
 
Data Ecosystems for Geospatial Data
Data Ecosystems for Geospatial DataData Ecosystems for Geospatial Data
Data Ecosystems for Geospatial Data
 
Hotspot Analysis with QGIS - FOSS4G-IT 2017
Hotspot Analysis with QGIS  - FOSS4G-IT 2017Hotspot Analysis with QGIS  - FOSS4G-IT 2017
Hotspot Analysis with QGIS - FOSS4G-IT 2017
 
Approaches to representing and delivering geospatial data in the semantic Web...
Approaches to representing and delivering geospatial data in the semantic Web...Approaches to representing and delivering geospatial data in the semantic Web...
Approaches to representing and delivering geospatial data in the semantic Web...
 
An evaluation of SimRank and Personalized PageRank to build a recommender sys...
An evaluation of SimRank and Personalized PageRank to build a recommender sys...An evaluation of SimRank and Personalized PageRank to build a recommender sys...
An evaluation of SimRank and Personalized PageRank to build a recommender sys...
 
03 --metadata
03 --metadata03 --metadata
03 --metadata
 
Egu2013 final
Egu2013 finalEgu2013 final
Egu2013 final
 
02 -how-will-inspire-influence-local-authorities-and-spatial-planning
02 -how-will-inspire-influence-local-authorities-and-spatial-planning02 -how-will-inspire-influence-local-authorities-and-spatial-planning
02 -how-will-inspire-influence-local-authorities-and-spatial-planning
 
GISrock1_DG_REwork
GISrock1_DG_REworkGISrock1_DG_REwork
GISrock1_DG_REwork
 
Towards emergency vehicle routing using Geolinked Open Data: the case study o...
Towards emergency vehicle routing using Geolinked Open Data: the case study o...Towards emergency vehicle routing using Geolinked Open Data: the case study o...
Towards emergency vehicle routing using Geolinked Open Data: the case study o...
 
Big Data Day LA 2016/ Data Science Track - The Evolving Data Science Landscap...
Big Data Day LA 2016/ Data Science Track - The Evolving Data Science Landscap...Big Data Day LA 2016/ Data Science Track - The Evolving Data Science Landscap...
Big Data Day LA 2016/ Data Science Track - The Evolving Data Science Landscap...
 
Murphy, P. - Defining functional areas in Canada through open source packages
Murphy, P. - Defining functional areas in Canada through open source packagesMurphy, P. - Defining functional areas in Canada through open source packages
Murphy, P. - Defining functional areas in Canada through open source packages
 
Individual movements and geographical data mining. Clustering algorithms for ...
Individual movements and geographical data mining. Clustering algorithms for ...Individual movements and geographical data mining. Clustering algorithms for ...
Individual movements and geographical data mining. Clustering algorithms for ...
 
Online Index Extraction from Linked Open Data Sources
Online Index Extraction from Linked Open Data SourcesOnline Index Extraction from Linked Open Data Sources
Online Index Extraction from Linked Open Data Sources
 
03 SensLog – solution for sensors and VGI
03 SensLog – solution for sensors and VGI03 SensLog – solution for sensors and VGI
03 SensLog – solution for sensors and VGI
 
Location Intelligence for Italian and UK Justice and Public Safety - BIWASumm...
Location Intelligence for Italian and UK Justice and Public Safety - BIWASumm...Location Intelligence for Italian and UK Justice and Public Safety - BIWASumm...
Location Intelligence for Italian and UK Justice and Public Safety - BIWASumm...
 
TEAMS 6, 7 and 8
TEAMS 6, 7 and 8TEAMS 6, 7 and 8
TEAMS 6, 7 and 8
 
LokeshShanmuganandam_BigData_FinalProjectReport
LokeshShanmuganandam_BigData_FinalProjectReportLokeshShanmuganandam_BigData_FinalProjectReport
LokeshShanmuganandam_BigData_FinalProjectReport
 

Dernier

Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
cnajjemba
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
wsppdmt
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
vexqp
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 

Dernier (20)

Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdf
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 

Geo-statistical Exploration of Milano Datasets

  • 1. Geo-statistical Exploration of Milano Datasets Irene Celino and Gloria Re Calegari CEFRIEL 1 The 13th International Semantic Web Conference 2014 Riva del Garda, Italy 19 – 23 October 2014 Semantic Statistics workshop 19th October 2014
  • 2. Research goal Long term goal • Compare data related to the same city but obtained from heterogeneous sources. Do they provide the same ‘picture’ of the city? • Can we update a dataset (expensive and time consuming, updated once every 5/10 years) with another dataset (cheaper and always up to date)? Presentation target • Comparison of two heterogeneous datasets referring to the same city (Milan) to discover if they have any intrinsic correlation Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 2
  • 3. Datasets available • Telecom (phone activity data) provided for their “Big Data Challenge” • Milan + surroundings • Nov-dic 2013 • Grid of 10.000 cells • Activity recorded every ten minutes • A footprint for each cell • ISTAT - Italian National Statistical Institute • demographic data of 2011 and 2001 • divided by sex, age and nationality Analysis performed using data in their original format (GeoJSON, csv) and serialization in RDF format at the end of the analysis. Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 3
  • 4. Datasets available – different spatial granularity ISTAT – 50 surrounding towns + 9 internal Milano districts Telecom – grid of 10.000 cells (250 x 250 m each) Pre-processing of data required. Mapping of Telecom data into municipality granularity. Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 4
  • 5. Methodology of analysis Goal of analysis: do the two datasets have the same intrinsic meaning? Unsupervised clustering to group data in each dataset: • k-means (euclidean distance) • Adaptive k-means (cosine distance) Comparison of the two clustering results to find correlation between the two datasets. Validation of the comparison: • Rand Index • Kappa Index (the closer to 1 the index is, the stronger the correlation between the data clusterings) Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 5
  • 6. Experiments- Telecom vs ISTAT (range on total poulation) Telecom - Kmeans 6 classes Rand Index Kappa Index 0,23 0,23 ISTAT 2011 - range 6 classes Only partial correlation Try to add information to total population (feature vector with distribution of populatin divided by sex, age and nationality) Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 6
  • 7. Experiments - Telecom vs ISTAT (2011 fine-grained population) Telecom - Kmeans 4 classes Rand Index Kappa Index 0,77 0,81 ISTAT 2011 – Kmeans 4 classes Good correlation Try to add historical information (2001 datasets). Does adding temporal dynamic improve corrrelation? Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 7
  • 8. Telecom - Kmeans 4 classes Rand Index Kappa Index 0,77 0,81 ISTAT 2011 + 2001 – Kmeans 4 classes Good correlation. The same as the previous test. Experiments- Telecom vs ISTAT (2011+2001 fine-grained population) Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 8
  • 9. Clustering results serialization Results of clustering analysis Geographic format RDF format Geo + RDF format GeoJSON file RDF Data Cube vocabulary (adding in the Observation definition concepts of clustering algo and cluster number) Enrich GeoJSON with power of JSON-LD -> ‘’GeoJSON-LD’’ Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 9
  • 10. ‘’GeoJSON-LD’’ ‘’GeoJSON-LD’’ is obtained by adding the ‘@context’ prefix to the GeoJSON file. The prefix specify how to interpret GeoJSON tags as RDF resources @context prefix GeoJSON file Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 10
  • 11. ‘’GeoJSON-LD’’ pros and cons Correct interpretation of the geographic information in the GeoJSON-LD file Problems in the interpretation of the RDF information. ‘’GeoJSON-LD’’ does not interpret correctly a lists of lists (polygon coordinates representation) GeoJSON-LD is correctly interpreted as GeoJSON by GIS GeoJSON list of lists RDF serialization Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 11
  • 12. Conclusion and future works • Identification of a correlation between phone activity data and demographic information • Further tests using different datasets (land use, Point of interests) • Find efficient method for handling the different granularity levels of the datasets • Method for handling temporal aspect of phone activity data (we lost temporal information during clustering process) • Find a method to overcome ‘’GeoJSON-LD’’ serialization problem Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 12
  • 13. Thank you! Any question? Further details at: http://swa.cefriel.it/geo/ Irene Celino and Gloria Re Calegari CEFRIEL Semantic Statistics - ISWC 2014 - Riva del Garda, Italy 13