News construction from microblogging posts using open data
Francisco Berrizbeitia
Universidad Simón Bolívar
Caracas, Venezuela
fberrizbeitia@gmail.com
June, 2014
Abstract
Information access can be limited in situations where traditional media outlets cannot cover events due to geographical limitations or censorship, for example during civil unrest, war or natural disasters. In these situations citizen journalism replaces or complements traditional media in the documentation of such events. Microblogging services such as Twitter have become of great use in these scenarios due to their mobile nature and multimedia capabilities.
In this research we propose a method to create searchable, semantically annotated news
articles from tweets in an automated way using the cloud of linked open data.
Keywords
Semantic web, News, Microblogging, Twitter, Automatic document generation, Data journalism, Citizen journalism.
1 Introduction
Citizen journalism has become a very common practice with the arrival of smartphones and microblogging services such as Twitter. Due to the multimedia capabilities of these devices and the mobile nature of the social networks, people all over the world are documenting all sorts of events and publishing them on the Web in real time. This type of journalism is particularly important in situations where traditional media cannot cover the events, such as natural disasters, war or civil unrest, or because of government or self-imposed censorship.
Citizen journalism is protected by article 19 of the Universal Declaration of Human Rights (United Nations):
“Everyone has the right to freedom of opinion and expression; this right includes
freedom to hold opinions without interference and to seek, receive and impart
information and ideas through any media and regardless of frontiers.”
This protection has had tremendous implications in the recent past, in situations where the only available information was found on social networks and in international media outlets with very limited coverage. We believe it is of great importance to develop a technology that allows the creation of “fair” documents from all the contributions made by users during such events.
The hope is that the documents created automatically by this technology will be closer to what really happened and will help guarantee impartiality.
As a first step in this research we want to construct a news article from a single 140-character message using the open data cloud.
In the rest of the report we first describe the overall approach we took to the problem, then describe the system we developed for this task, and finally look at the results.
2 Related Work
Information extraction from Twitter and other microblogging platforms has been done in the past. (David Laniado, 2010) explored the semantic value of hashtags as identifiers for the semantic web. (Shinavier, 2010) proposed the possibility of creating a real-time semantic web using structured microblogging messages. (Ritter, 2012) used natural language processing and information extraction techniques over a corpus of tweets to extract machine-readable information.
Sentiment analysis has also been a topic of research, such as the work of (Alexander Pak, 2010), where a machine learning method is proposed to classify tweets as positive, negative or neutral.
3 Description
The main objective is to obtain, from the Open Data Cloud, the semantically meaningful concepts expressed in the micropost and then create a document that extends the original text with the retrieved concepts. If we succeed in this task we will end up with a news article where the answers to the questions who, what, where, when and why (Wikipedia, 2014) are derived from the micropost and extended with the linked open data cloud.
Figure 1. Overall view of the process
Figure 1 shows the overall process of news creation. Since this is our first approach to the problem, we decided to limit the sources of information to Twitter as the only microblogging input and DBpedia as our source of semantically annotated information.
The system was implemented as a web application written in PHP. In the following sections we describe each part of the system.
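Before detailing each step, the sketch below shows how such a pipeline could be wired together in PHP. All function names are illustrative placeholders for the steps sketched in the following subsections, not the actual implementation.

<?php
// Illustrative wiring of the pipeline, from tweet ID to an rNews Turtle file.
// Each helper function is sketched in the corresponding subsection below.

$bearerToken = 'APP-ONLY-ACCESS-TOKEN';             // placeholder credential
$tweetId     = $_GET['tweet_id'];                   // the only input of the system

$tweet      = fetchTweet($tweetId, $bearerToken);   // 3.1: retrieval via the Twitter API
$words      = denoise($tweet['text']);              // 3.1: remove stop words, links, RT/FF, '#'
$candidates = buildCandidates($words);              // 3.2: one-word and two-word candidates

$abstracts = array();
foreach ($candidates as $candidate) {
    if (wikipediaPageUrl($candidate) !== null) {    // 3.2: keep candidates with a Wikipedia page
        $abstract = dbpediaAbstract($candidate);    // 3.3: query DBpedia's SPARQL endpoint
        if ($abstract !== null) {
            $abstracts[] = $abstract;
        }
    }
}

file_put_contents('article.ttl', toTurtle($tweet, $abstracts));  // 3.3: rNews-annotated output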
3.1 Information gathering and text preparation
The first task consists of gathering the information posted by a user of the social network; we collect not only the published text, but also the media, when available, and information about the author. We obtain all the information using the public API provided by Twitter. As shown in Figure 2, the only input the system needs is the tweet ID.
Figure 2. Input screen of the system
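A minimal sketch of this retrieval step is shown below. It assumes application-only (bearer token) authentication against the statuses/show endpoint of the Twitter REST API v1.1; the selection of fields is illustrative.

<?php
// Fetch a tweet by ID through the Twitter REST API v1.1 (statuses/show),
// assuming an application-only bearer token obtained beforehand.

function fetchTweet($tweetId, $bearerToken) {
    $url = 'https://api.twitter.com/1.1/statuses/show.json?id=' . urlencode($tweetId);

    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER     => array('Authorization: Bearer ' . $bearerToken),
    ));
    $response = curl_exec($ch);
    curl_close($ch);

    $tweet = json_decode($response, true);

    // Keep only what the system needs: text, author and attached media (if any).
    return array(
        'id'     => $tweet['id_str'],
        'text'   => $tweet['text'],
        'author' => $tweet['user']['screen_name'],
        'media'  => isset($tweet['entities']['media']) ? $tweet['entities']['media'] : array(),
    );
}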
After the text is retrieved it must be “denoised” before any further processing. At this point all stop words are removed, as well as links and Twitter-specific words such as RT or FF. The hashtag character (#) is removed, leaving the remaining word.
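A possible implementation of this denoising step is sketched below; the stop word list is a short illustrative one and the regular expressions are an assumption about how the cleaning could be done.

<?php
// "Denoise" the tweet text: remove links, Twitter-specific tokens (RT, FF),
// punctuation, the hashtag character and stop words, keeping the remaining words.

function denoise($text) {
    $stopWords = array('the', 'a', 'an', 'of', 'in', 'on', 'at', 'is', 'are', 'and', 'to');

    $text = preg_replace('/https?:\/\/\S+/', ' ', $text);    // drop links
    $text = preg_replace('/\b(RT|FF)\b/', ' ', $text);       // drop Twitter-specific words
    $text = str_replace('#', '', $text);                     // keep hashtag word, drop '#'
    $text = preg_replace('/[^\p{L}\p{N}\s]/u', ' ', $text);  // drop remaining punctuation

    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $words = array_filter($words, function ($word) use ($stopWords) {
        return !in_array(mb_strtolower($word), $stopWords);  // case-insensitive stop word check
    });

    return array_values($words);
}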
3.2 Candidate selection
Before querying the DBpedia endpoint we first run a local analysis using a version of the WordNet database. Each word is analyzed and a matrix of word senses is built. Following a set of rules we create a list of possible two-word and one-word candidates that may be relevant concepts, places or persons. By doing this we reduce the number of queries we need to make to the endpoint.
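The sketch below is an illustrative stand-in for this step: the actual rule set works over the WordNet sense matrix, which is not reproduced here; instead, adjacent pairs are taken as two-word candidates and capitalized words as one-word candidates (likely names or places).

<?php
// Build candidate concepts from the cleaned word list: adjacent two-word
// candidates plus capitalized single words (possible proper nouns).

function buildCandidates(array $words) {
    $candidates = array();

    // Two-word candidates from adjacent pairs.
    for ($i = 0; $i < count($words) - 1; $i++) {
        $candidates[] = $words[$i] . ' ' . $words[$i + 1];
    }

    // One-word candidates: keep tokens starting with an uppercase letter.
    foreach ($words as $word) {
        if (preg_match('/^\p{Lu}/u', $word)) {
            $candidates[] = $word;
        }
    }

    return array_unique($candidates);
}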
Since Wikipedia and DBpedia are tightly related, we decided to first query the Wikipedia API to obtain the URL of the Wikipedia page of which the candidate is the main topic. At the end of this process we end up with a list of candidates with known Wikipedia pages.
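The lookup can be done against the standard MediaWiki query API; a minimal sketch (error handling omitted, redirects followed):

<?php
// Ask the English Wikipedia API whether a candidate has an article and,
// if so, return the article URL; null means no page was found.

function wikipediaPageUrl($candidate) {
    $url = 'https://en.wikipedia.org/w/api.php?action=query&prop=info&inprop=url'
         . '&redirects=1&format=json&titles=' . urlencode($candidate);

    $json  = json_decode(file_get_contents($url), true);
    $pages = $json['query']['pages'];
    $page  = reset($pages);            // a single title yields a single page entry

    // A "missing" flag (or no page id) means there is no such article.
    if (isset($page['missing']) || !isset($page['pageid'])) {
        return null;
    }
    return $page['fullurl'];
}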
3.3 Semantically annotated information retrieval from the Open Data Cloud
The next step is to query DBpedia's SPARQL endpoint to retrieve the semantically annotated information related to the tweet topics detected in the previous step. Once the information is received from the endpoint, it is put together with the author information from Twitter in a Turtle file in order to make it available via a SPARQL endpoint. We used a subset of the rNews Ontology (International Press Telecommunications Council, 2011), shown in Figure 3.
Figure 3. Subset of the rNews Ontology used for the project
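A minimal sketch of this step is shown below. The SPARQL query only retrieves the English abstract of the matched resource for illustration, and the rNews namespace and property names used in the Turtle output are assumptions standing in for the subset of Figure 3.

<?php
// Query the public DBpedia SPARQL endpoint for the English abstract of the
// resource behind a Wikipedia page title, then serialize a small Turtle
// fragment. The rNews terms below are illustrative, not the full subset.

function dbpediaAbstract($wikipediaTitle) {
    $resource = 'http://dbpedia.org/resource/' . str_replace(' ', '_', $wikipediaTitle);
    $query = 'SELECT ?abstract WHERE { <' . $resource . '> '
           . '<http://dbpedia.org/ontology/abstract> ?abstract . '
           . 'FILTER (lang(?abstract) = "en") } LIMIT 1';

    $url = 'http://dbpedia.org/sparql?format=' . urlencode('application/sparql-results+json')
         . '&query=' . urlencode($query);

    $json     = json_decode(file_get_contents($url), true);
    $bindings = $json['results']['bindings'];

    return empty($bindings) ? null : $bindings[0]['abstract']['value'];
}

function toTurtle(array $tweet, array $abstracts) {
    // addslashes() is a crude stand-in for proper Turtle string escaping.
    $ttl  = "@prefix rnews: <http://iptc.org/std/rNews/2011-10-07#> .\n\n";
    $ttl .= "<urn:tweet:{$tweet['id']}> a rnews:Article ;\n";
    $ttl .= "    rnews:headline \"" . addslashes($tweet['text']) . "\" ;\n";
    $ttl .= "    rnews:creator \"" . addslashes($tweet['author']) . "\" ;\n";
    $ttl .= "    rnews:articleBody \"" . addslashes(implode(' ', $abstracts)) . "\" .\n";
    return $ttl;
}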
4 Results
To test the approach and the system we selected 90 tweets directly from the Twitter search on three subjects: the Brazilian riots during the 2014 World Cup, Barack Obama and Venezuela. The process of collecting the microposts consisted of searching through the API and collecting the first 30 messages with an associated picture for each of the selected topics.
After the sample was obtained we proceeded to manually tag each tweet. This was done twice, by different persons, to minimize human error. After the sample was manually tagged we ran the automated process for each tweet and saved the results for each case. The results can be seen in Figure 4. We expected to find 415 terms for all tweets and found 433; of those, 317 were an exact match to what was expected from the manual process, 63 were information that is not wrong but adds no real value, and 53 were wrong concepts. This gives a precision of 76.36%, that is, the share of expected terms that were automatically detected using the method, and an error rate of 12.24%.
Figure 4. Result of the test cases
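As a rough check of these figures, if precision is taken as the exact matches over the expected terms and the error rate as the wrong concepts over the retrieved terms, then 317/415 ≈ 76.4% and 53/433 ≈ 12.24%, in line with the reported values.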
Analyzing the errors we noticed that the automatically retrieved concept often carried the wrong meaning for the context. For example, in the context of the Brazilian riots, the concept “fire” was interpreted as in “a burning fire” instead of “to fire a gun”. Similar cases can be found in the other topics that were tested.
The terms that were not detected by the automated method were candidates with known Wikipedia pages that had no corresponding entry in DBpedia.
5 Future work
The results obtained encourage us to further develop the method and to include automated context detection as a way to improve precision. A possible approach to this problem is described in (Esther Villar Rodríguez, 2012) and (Nebhi, 2012).
We would also like to further develop the system so that it not only detects, retrieves and saves information for a single message, but can build a complete documentation of an event over an extended period of time, based on several microblogging platforms and media outlets, both independent and corporate. The end result we hope to reach is a fully searchable, semantically annotated news stream that will serve as a neutral and centralized endpoint for data journalism.
6 Conclusions
In this research we proposed a method to automatically create a news article from a tweet using the cloud of linked open data. To do so, we implemented a web system that takes a tweet ID as input and generates a semantically annotated news article based on a subset of the rNews Ontology. To test our approach we collected 90 tweets on three subjects: the Brazilian riots during the 2014 World Cup, Barack Obama and Venezuela. The messages were tagged manually and then compared with the automatically found annotations. Our method was able to capture 76.36% of the manually detected terms, with an error rate of 12.24% due mostly to disambiguation problems.
These results encourage us to further develop the method and the system: first to solve the disambiguation problems, and then to pursue a more ambitious approach that will allow us to create a semantically annotated news stream based not only on tweets but also on other microblogging services, independent blogs and corporate media outlets, serving as a centralized semantic endpoint for data journalism.
7 References
Alexander Pak, P. P. (2010). Twitter as a Corpus for Sentiment Analysis and Opinion Mining.
Valletta, Malta: Proceedings of the Seventh International Conference on Language
Resources and Evaluation.
David Laniado, P. M. (2010). Making sense of Twitter. Shanghai, China: ISWC 2010.
Esther Villar Rodríguez, A. I. (2012). Using Linked Open Data sources for Entity Disambiguation. Rome: CLEF Initiative.
International Press Telecommunications Council. (2011, October 7). rNews. Retrieved June 21, 2014, from IPTC site for developers: http://dev.iptc.org/rNews
Nebhi, K. (2012). Ontology-Based Information Extraction from Twitter. (pp. 17-22). Mumbai: Proceedings of the Workshop on Information Extraction and Entity Analytics on Social Media Data.
Ritter, A. (2012). Extracting Knowledge from Twitter and The Web. Doctoral thesis, University of Washington.
Shinavier, J. (2010). Realtime #SemanticWeb in <= 140 Characters. WWW2010. Raleigh, North
Carolina.
United Nations. (n.d.). United Nations. Retrieved June 22, 2014, from The Universal Declaration of Human Rights: http://www.un.org/en/documents/udhr/index.shtml
Wikipedia. (2014, June 11). Five Ws. Retrieved June 20, 2014, from wikipedia.org: http://en.wikipedia.org/wiki/Five_Ws