Geographic knowledge discovery (PhD Theme) by Roberto Zagal
1. Geographical Knowledge Discovery applied to the Social Perception of Pollution in Mexico City
Roberto Zagal, Instituto Politecnico Nacional, ESCOM-IPN
Felix Mata, Instituto Politecnico Nacional, UPIITA-IPN
Christophe Claramunt, Naval Academy Research Institute
5. Introduction (4)
What about inconsistency?
Id | Type | Description
1 | Tweet (newspaper1) | The index of IMECAS is 135 #CDMX
2 | Tweet (Newspaper2) | @ the #contamination of air is 127 IMECAS #CDMX #bad #news
6. Related work
• The social data problem has been addressed through:
1. KDD and social mining
2. Formal publications (news media) guiding the classification of the interests of social media users [1]
3. Opinion mining and topic modeling [2]
But not with a GKD approach that crosses data layers.
7. Goal
Know how to discover the certainty level of information by crossing geographic and social information.
9. Data extraction: Sample tweet (Phase 1)
Id | Type | Description
1 | Tweet (newspaper1) | The index of IMECAS is 135 #CDMX
2 | Tweet (Newspaper2) | @ the #contamination of air is 127 IMECAS #CDMX #bad #news
We consider tweets from accounts that periodically report air pollution data.
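The extraction and cleaning steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the seed hashtag list, the stop-word subset, and the helper names are assumptions.

```python
import re

# Hashtags used to seed the extraction (illustrative; the slides mention
# #CDMX and #AirPollution as examples).
SEED_TAGS = {"#cdmx", "#airpollution", "#contamination"}

def is_pollution_candidate(tweet_text):
    """Return True if the tweet contains any of the seed hashtags."""
    tags = {t.lower() for t in re.findall(r"#\w+", tweet_text)}
    return bool(tags & SEED_TAGS)

def clean(tweet_text):
    """Minimal cleaning: lowercase, strip mentions, tokenize, drop stop words."""
    stop_words = {"the", "of", "is", "a", "an", "in"}  # illustrative subset
    text = re.sub(r"@\w+", "", tweet_text.lower())
    tokens = re.findall(r"[#\w]+", text)
    return [t for t in tokens if t not in stop_words]

tweet = "The index of IMECAS is 135 #CDMX"
print(is_pollution_candidate(tweet))   # True
print(clean(tweet))                    # ['index', 'imecas', '135', '#cdmx']
```

Stemming is omitted here; in practice a stemmer would be applied to the token list.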
10. Data extraction: Domain detection (Phase 1)
Id | Type | Description
2 | Tweet (Newspaper2) | @ #contamination air is 127 IMECAS #CDMX #bad #news
The post is related to a pollution topic.
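The matching idea behind domain detection (synonymy plus subclass links, as in the "contamination" → "Pollution" and IMECAS → "IndexOfAirQuality" examples) can be sketched with a toy class hierarchy. The dictionaries below are hypothetical stand-ins for the actual ontology:

```python
# Toy ontology fragment (hypothetical; only the matching mechanism is
# taken from the slides).
SYNONYMS = {"contamination": "Pollution", "smog": "Pollution"}
SUBCLASS_OF = {"IMECAS": "IndexOfAirQuality", "IndexOfAirQuality": "Pollution"}

def detect_domain(tokens):
    """Map tweet tokens to ontology classes by synonymy and subclass links."""
    matched = set()
    for tok in tokens:
        word = tok.lstrip("#")
        if word.lower() in SYNONYMS:
            matched.add(SYNONYMS[word.lower()])
        cls = word.upper() if word.upper() in SUBCLASS_OF else None
        while cls:                      # walk up the class hierarchy
            matched.add(cls)
            cls = SUBCLASS_OF.get(cls)
    return matched

tokens = ["#contamination", "air", "127", "IMECAS", "#CDMX"]
print(detect_domain(tokens))  # includes 'Pollution', the generic class
```

A tweet whose matched classes include "Pollution" is flagged as related to the pollution topic.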
11. Preprocessing (Phase 2)
• Emotion detection [3]
• Location extraction
Id | Type | Description
2 | Tweet (Newspaper2) | @ #contamination air is 127 IMECAS #CDMX #bad #news
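Both preprocessing steps can be sketched as below; the word/emoticon lexicons and the place-hashtag gazetteer are invented for illustration (the paper uses a dedicated emotion-detection method [3]):

```python
import re

# Tiny illustrative lexicons, not the paper's actual resources.
NEGATIVE = {"#bad", "bad", ":("}
POSITIVE = {"#good", "good", ":)"}
KNOWN_PLACES = {"#taxqueña", "#indiosverdes"}   # hypothetical gazetteer

def detect_emotion(text):
    """Classify a post as positive/negative/neutral by words or emoticons."""
    tokens = set(text.lower().split())
    if tokens & NEGATIVE:
        return "negative"
    if tokens & POSITIVE:
        return "positive"
    return "neutral"

def extract_location(text, metadata=None):
    """Prefer geotag metadata; fall back to place hashtags in the text."""
    if metadata and metadata.get("geo"):
        return metadata["geo"]
    for tag in re.findall(r"#\w+", text.lower()):
        if tag in KNOWN_PLACES:
            return tag
    return None  # no location: later phases use only the publication time

tweet = "@ #contamination air is 127 IMECAS #CDMX #bad #news"
print(detect_emotion(tweet))      # negative
print(extract_location(tweet))    # None
```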
12. Classification: C5 algorithm (Phase 3)
• If we detect to which category each set of data belongs:
• Health and Pollution, Transport and Pollution
Then we can select which data sources should be crossed with the tweet, in order to discover knowledge.
Id | Description | Category
2 | @ #contamination air is 127 IMECAS #CDMX #bad #news | Health and pollution
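C5.0 grows decision trees by information gain. The pure-Python "decision stump" below illustrates that idea on invented training tweets; it is a sketch of the mechanism, not the C5.0 implementation or the paper's training data:

```python
import math
from collections import Counter

# Invented training data for illustration only.
train = [
    ("imecas 135 cdmx air quality index", "Health and pollution"),
    ("contamination air imecas bad health", "Health and pollution"),
    ("traffic jam bus metro pollution", "Transport and pollution"),
    ("cars smog transport avenue pollution", "Transport and pollution"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_word(data):
    """Word whose presence/absence yields the lowest weighted entropy."""
    words = {w for text, _ in data for w in text.split()}
    def split_entropy(w):
        yes = [lab for text, lab in data if w in text.split()]
        no = [lab for text, lab in data if w not in text.split()]
        n = len(data)
        return (len(yes) / n) * entropy(yes) + (len(no) / n) * entropy(no)
    return min(sorted(words), key=split_entropy)

def classify(text, data, word):
    """Majority label on the side of the split the new text falls into."""
    present = word in text.split()
    side = [lab for t, lab in data if (word in t.split()) == present]
    return Counter(side).most_common(1)[0][0]

w = best_word(train)
tweet = "contamination air is 127 imecas cdmx bad"
print(classify(tweet, train, w))
```

A real C5.0 tree recurses on further attributes instead of stopping at one split; the gain criterion is the shared idea.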
13. Crossing data (Phase 4)
• Example 1:
• Inconsistencies in tweets 1 and 2?
Id | Type | Description
1 | Tweet (Newspaper1) | The index of IMECAS is 135 #CDMX
2 | Tweet (Newspaper2) | @ the #contamination of air is 127 IMECAS #CDMX
Which one is correct?
14. How do we know which tweet is correct?
Answer:
It was classified in the domain of Health and pollution (in Phase 3).
Then the official data from health and pollution reports are selected to be crossed with the tweet (in Phase 4).
28/10/16
15. Crossing data (Phase 4)
• Data are crossed considering different attributes; from the tweet we take the date and hour of publication.
• When these are crossed with the date and hour from official air quality reports, a match is found.
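The crossing step can be sketched as a value-and-time match against official readings. The station records below are hypothetical stand-ins for the CDMX air quality data, and the one-hour tolerance is an assumption:

```python
from datetime import datetime

# Hypothetical official records: (station, timestamp, IMECA value).
official = [
    ("Taxqueña",      datetime(2016, 10, 28, 10, 0), 135),
    ("Indios Verdes", datetime(2016, 10, 28, 15, 0), 127),
    ("Centro",        datetime(2016, 10, 28, 15, 0), 140),
]

def cross(tweet_value, tweet_time, records, max_hours=1):
    """Find official readings matching the tweet's IMECA value and hour."""
    matches = []
    for station, ts, value in records:
        close_in_time = abs((ts - tweet_time).total_seconds()) <= max_hours * 3600
        if value == tweet_value and close_in_time:
            matches.append(station)
    return matches

# Tweet 2: "... 127 IMECAS #CDMX", published 28/10/16 at 15:00, no location.
print(cross(127, datetime(2016, 10, 28, 15, 0), official))  # ['Indios Verdes']
```

A match both confirms the reported value and recovers a probable location (the matching station) for a tweet that carried none.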
16. Crossing data (Phase 4)
We discovered that both tweets are correct, but with different locations (the location is not included in the original tweet).
Id | Type | Description
1 | Tweet (newspaper1) | The index of IMECAS is 135 #CDMX #Taxqueña, 10:00 hours
2 | Tweet (Newspaper2) | The #contamination of air is 127 IMECAS #CDMX #Indios Verdes, 15:00 hours
Knowledge discovered!
17. Other preliminary results
• Following the same approach
• Knowledge discovered: which topics are talked about, by region
Topic | Geographic region | Period
Health | South, West | March–June
Transport | North, East | January–December
Policy and programs | Center | January–December
Pollution | Surrounding Mexico City | January–June
Public roads | Surrounding Mexico City | January–December
18. Conclusions and future work
• The integration of the geographical and temporal dimensions allows us to discover data correlations; this knowledge can increase the certainty of some information in social networks.
• The main contribution is that the domain discovery and classification of information is a key element of new approaches for discovering geographic information.
19. Conclusions and future work
• Future work
• Use clustering or deep learning approaches to improve the classification process
• Location detection is a hard problem; other machine learning methods for social media can be tested [4, 5]
• How can we improve geographic knowledge discovery when there are no explicit links between traditional data sources and social sources?
21. References
[1] Jonghyun Han, Hyunju Lee, Characterizing the interests of social media users: Refinement of a topic model for
incorporating heterogeneous media, Information Sciences, Volumes 358–359, 1 September 2016, Pages 112-128, ISSN
0020-0255.
[2] Schubert, E., Weiler, M., & Kriegel, H. P. (2014, August). Signitrend: scalable detection of emerging topics in textual
streams by hashed significance thresholds. In Proceedings of the 20th ACM SIGKDD international conference on
Knowledge discovery and data mining (pp. 871-880). ACM.
[3] Architecture for analysis of feelings in Facebook with semantic approach (Spanish), pp. 59–69; rec. 2014-06-22; acc. 2014-07-21. Research in Computing Science 75 (2014). http://www.rcs.cic.ipn.mx/rcs/2014_75/
[4] Ting Hua, Liang Zhao, Feng Chen, Chang-Tien Lu, and Naren Ramakrishnan. 2016. How events unfold: spatiotemporal
mining in social media. SIGSPATIAL Special 7, 3 (January 2016), 19-25. DOI=http://dx.doi.org/10.1145/2876480.2876485
[5] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. Earthquake shakes twitter users: real-time event detection by social
sensors. In Proceedings of the 19th International Conference on World Wide Web, pages 851–860. ACM, 2010.
Editor's notes
SLIDE 1:
1.- Good morning.
2.- My name is Roberto. I'm a PhD student at the National Polytechnic Institute in Mexico City.
3.- Thanks for the invitation to be here today.
4.- I will be talking about "Geographical Knowledge Discovery applied to the Social Perception of Pollution in Mexico City".
5.- This research is advised by Dr. Felix Mata and Dr. Christophe Claramunt.
7.- In recent years, air pollution in Mexico City has increased considerably.
8.- Air pollution is a problem that requires analysis across multiple domains of knowledge, because we now have more information spread over increasingly complex data sources.
SLIDE 2:
Currently, social networks are becoming increasingly relevant as a means of diffusing and sharing citizen views.
In order to discover new knowledge about air pollution, we need to consider data from different sources, such as:
government, social groups, social media, and other web data.
In social media, people make comments and observations that might reflect important views on different topics related to air pollution.
SLIDE 3:
1.- We reviewed three representative and heterogeneous data sources:
2.- The Government of Mexico City, because it generates information about pollution in traditional databases. This information is trustworthy.
3.- News media, an important element, because it provides a valuable source for deriving on-the-fly citizens' opinions.
4.- For example, people in social networks express complaints, opinions, problem reports, and observations regarding the air pollution topic.
5.- We consider social networks an instantaneous picture of the social perception of air pollution.
6.- Now, the question is: how can we cross this information to discover new, confident knowledge about pollution?
SLIDE 4:
1. Information produced by institutions has a degree of certainty and veracity; it is assumed to be true.
2. But.
3. Can all information produced in social networks be trusted?
4. What is the level of certainty of the information produced in social networks, relative to other sources?
5. This is the problem statement of this preliminary investigation.
SLIDE 5:
1. The information sometimes needs to be verified to know whether it is correct or not.
2. For example:
3. We have an inconsistency in the following two tweets about air quality.
4. IMECAS is the acronym of the Metropolitan Index of Air Quality in Mexico City.
5. In tweet 1, a newspaper reports that the IMECAS index is one hundred thirty-five (135).
6. In tweet 2, a newspaper reports that the IMECAS index is one hundred twenty-seven (127).
7. Which one has the correct information?
8. How can we detect and resolve the inconsistency in the information?
SLIDE 6:
1.- These papers do not have an explicit relation with the geographic dimension.
2.- And they do not explore the certainty of information.
SLIDE 7:
1. It means that we can discover the level of certainty of the publications that appear in social media
2. by crossing these data with additional formal data sources.
4. Geographic information can be used as a linker to different data sources.
SLIDE 8:
1.- We propose a GKD framework for air pollution that includes four phases:
2.- Data extraction: oriented toward getting information from social sources and newspapers.
3.- The preprocessing phase: includes location and sentiment detection.
4.- Classification: categorizes the data into specific topics.
5.- Crossing data: helps to detect the level of information certainty.
SLIDE 9:
1.- For extraction, we consider tweets from accounts that periodically report air pollution data, for example digital newspapers of Mexico.
2.- Extraction continues using initial key phrases and hashtags, like #CDMX or #AirPollution.
4.- Afterwards, data cleaning is performed, including tokenization, stop-word removal, and stemming.
SLIDE 10:
1. Domain detection semantically pre-classifies tweets into a category of pollution, for example:
2. In tweet 2, the term "contamination" matches the "pollution" class by synonymy.
3. Next, the word IMECAS matches the class "IMECAS", which is a subclass of "IndexOfAirQuality".
4. We can say that the post is related to a pollution topic, which is a generic class.
5. It is possible that the tweet belongs to a more specific category that describes the nature of the post.
SLIDE 11:
1. In this part, we detect whether the post is related to a positive or negative feeling through words or emoticons.
This detection is useful for identifying trends in the social perception of a specific topic of pollution, for example positive tweets that talk about politics and pollution.
2. Regarding the location of the tweet, we assume that each tweet contains metadata about its place and time of publication.
3. Sometimes a tweet does not contain explicit or implicit information that allows its location to be defined. In this case, only the time of publication is considered in the following phases.
SLIDE 12:
1. If we detect to which specific category each set of data belongs,
we can select the data sources that should be crossed with the tweet, in order to discover new knowledge and certainty.
2. Tweet 2 is classified in a more specific category: health and pollution.
3. We chose C5 because it is one of the algorithms that has shown good performance in knowledge discovery in databases.
Slide 13:
At this stage, quantitative and qualitative values are separated.
1) Using the ontology, we can identify and separate terms like IMECAS, Air, and Pollution.
2) The numerical IMECA value is separated.
3) Now, we know that this value must be in a range from 0 to 201 according to the definition of the IMECA index. If this holds, we can say that we have found a valid air quality value.
4) It is possible that this approach does not work in some cases.
5) The tweets do not contain information about their location, but we consider the time of publication.
6) Using the IMECA value and the time of the tweet, we proceed to search for matches in government data sources on air quality.
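The value-separation and range check described above can be sketched as follows; the regular expression and the treatment of the range boundaries are assumptions, only the 0–201 validity range comes from the notes:

```python
import re

def extract_imeca(tweet_text):
    """Return the IMECA value in the tweet if it is in the valid 0-201 range."""
    # Accept both "127 IMECAS" and "IMECAS is 135" phrasings (assumed patterns).
    m = re.search(r"(\d+)\s*IMECAS?|IMECAS?\s*(?:is\s*)?(\d+)", tweet_text, re.I)
    if not m:
        return None
    value = int(m.group(1) or m.group(2))
    return value if 0 <= value <= 201 else None

print(extract_imeca("@ the #contamination of air is 127 IMECAS #CDMX"))  # 127
print(extract_imeca("The index of IMECAS is 135 #CDMX"))                 # 135
print(extract_imeca("index IMECAS is 500 #CDMX"))                        # None (out of range)
```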
Slide 14:
1. Through the categorization of the tweet, we know that we can cross information with the air quality database, because the tweet is related to pollution and public health topics.
SLIDE 15:
1.- The air quality data is provided by the environmental monitoring ministry of the CDMX government.
SLIDE 16:
1. The tweet has no location, but we use its time component.
2. We find a match in official data using the IMECA value.
3. Then, the official data help us to discover the tweet location.
SLIDE 17:
1. In these additional results, we can see the classification of tweets by topic and location.
2. These results show the trend of social perception in certain subjects and geographic areas.
Slide 18:
1.- The integration of the geographic and temporal dimensions.
2.- The domain discovery.