With online publication and social media taking the main role in the dissemination of news, and with the decline of traditional printed media, it has become necessary to devise ways to automatically extract meaningful information from the plethora of sources available and to make that information readily available to interested parties. In this paper we present a method of automated analysis of the underlying structure of online newspapers based on Q-analysis and modularity. We show how the combination of the two strategies allows for the identification of well-defined news clusters that are free of noise (unrelated stories) and provides automated clustering of information on trending topics in news published online.
Identifying news clusters using Q-analysis and Modularity
1. Identifying news clusters using Q-analysis and Modularity
David Rodrigues+
Centre for Complexity and Design
+The Open University, UK – david.rodrigues@open.ac.uk
3. Motivation
• Find structure in collections of text documents
• Create computer algorithms to automate this discovery with minimal human supervision.
• Use of hybrid methodologies to improve quality of results
– Topology based approach describes data
– Clustering technique to identify modules
4. Problem Description
• Identify the structure of the news published online by The Guardian (among other newspapers)
– Clustering?
– Topology?
– Topic Modelling?
– Noise?
– Novelty?
– Change?
[Kohut, A. and Remez, M. (2008)]
5. Clustering Techniques in Topic Modelling
• Nearest neighbour classification
• Bayesian probabilistic techniques
• Decision trees
• Regression Models
• Neural Networks
• Support Vector Machines
• Language dependent / Human intervention in the definition of categories for training samples.
6. Clustering in Graphs is Community Detection
• Modularity based techniques [majority]
• Spectral algorithms
• Synchronization based techniques
• …
• [Community detection in graphs - Fortunato, 2010, for comprehensive review]
• Binary relations between nodes don’t capture the multi-level structure of existing relations.
– Move to n-ary relations and descriptions
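The modularity score that these techniques optimise can be sketched as follows. This is a minimal illustration of the standard Newman modularity for an undirected, unweighted graph, not code from the talk; the function name `modularity` and the toy graph are assumptions made for the example.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each edge listed once.
    community_of: dict mapping every node to a community label.
    """
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends inside the community
    degree_sum = defaultdict(int)  # summed degree of the community's nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # Q = sum over communities of (internal edge fraction - expected fraction)
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy graph (hypothetical data): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "a" for node in range(6)}
print(modularity(edges, split), modularity(edges, merged))
```

Splitting the toy graph into its two triangles scores higher than keeping all nodes in one community, which is the behaviour a modularity optimiser exploits.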
7. Previously
• We used a sliding window over the time series of the news stories
• Used Variation of Information to measure changes in an evolving adaptive network of news [Meilă 2007, Rodrigues 2010]
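As an illustration of the measure mentioned above, the sketch below computes the Variation of Information between two clusterings of the same items, for instance the cluster labels obtained in two consecutive windows. It follows the standard definition VI = H(A|B) + H(B|A); the function name and the toy labels are assumptions, and the windowing itself is not shown.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Variation of Information (in nats) between two clusterings.

    labels_a[i] and labels_b[i] are the cluster labels of item i
    in the first and second clustering respectively.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        # H(A|B) + H(B|A) = -sum p_ab * [log(p_ab / p_a) + log(p_ab / p_b)]
        vi -= p_ab * (log(n_ab / count_a[a]) + log(n_ab / count_b[b]))
    return vi


# Hypothetical labels for four stories in two consecutive windows.
window_1 = [0, 0, 1, 1]
window_2 = [0, 1, 0, 1]
print(variation_of_information(window_1, window_1))
print(variation_of_information(window_1, window_2))
```

Identical clusterings give a distance of zero, and the value grows as the two partitions share less information.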
8. Our Proposal
• Use a high dimensional representation of the documents (Simplicial Complex)
• Use Q-analysis to describe the system constructed from the Documents x Tags Incidence Matrix
• Use Q-connected components to filter noise.
• Use modularity optimisation to find communities in the resulting induced graphs
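The Q-analysis step above can be sketched as follows, assuming each document is treated as a simplex whose vertices are its tags. Two documents are taken to be q-near when they share at least q + 1 tags, and q-connected components are the transitive closure of that relation. The function names and the toy tag sets are hypothetical, and this is not the implementation used in the talk.

```python
from itertools import combinations


def q_components(doc_tags, q):
    """q-connected components among documents of dimension >= q.

    doc_tags maps a document id to its set of tags; a document with
    k tags is a simplex of dimension k - 1.
    """
    eligible = [d for d, tags in doc_tags.items() if len(tags) - 1 >= q]
    parent = {d: d for d in eligible}

    def find(d):
        # Union-find lookup with path halving.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for a, b in combinations(eligible, 2):
        # Sharing q + 1 tags means sharing a q-dimensional face.
        if len(doc_tags[a] & doc_tags[b]) - 1 >= q:
            parent[find(a)] = find(b)

    groups = {}
    for d in eligible:
        groups.setdefault(find(d), set()).add(d)
    return sorted(groups.values(), key=sorted)


def structure_vector(doc_tags):
    """Number of q-connected components for q = q_max down to 0."""
    q_max = max(len(tags) for tags in doc_tags.values()) - 1
    return [len(q_components(doc_tags, q)) for q in range(q_max, -1, -1)]


# Hypothetical tagged stories.
docs = {
    "d1": {"politics", "uk", "election"},
    "d2": {"politics", "uk", "economy"},
    "d3": {"economy", "uk", "banks"},
    "d4": {"football"},
    "d5": {"football", "politics"},
}
print(q_components(docs, 1))
print(structure_vector(docs))
```

In this toy data the three stories that share pairs of tags form one component at q = 1, while the remaining stories only join the structure at q = 0.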
9. Noise?
• In the news context, we define noise news as news that is only loosely related to the main topics published.
• We can filter it out by assuming that the Q-connectedness of this news is very low.
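One way to read this filtering rule is sketched below: a story is kept only if it shares at least `q_min + 1` tags with some other story, which is equivalent to belonging to a non-trivial q-connected component at level `q_min`. The threshold `q_min`, the function name and the toy data are assumptions for illustration; the talk does not state which threshold was used.

```python
def split_noise(doc_tags, q_min):
    """Separate documents into (kept, noise) by their best q-nearness.

    A document's best q-nearness is the largest number of tags it
    shares with any other document, minus one.
    """
    kept, noise = {}, {}
    for doc, tags in doc_tags.items():
        best_q = max(
            (len(tags & other_tags) - 1
             for other, other_tags in doc_tags.items() if other != doc),
            default=-1,  # a lone document is connected to nothing
        )
        target = kept if best_q >= q_min else noise
        target[doc] = tags
    return kept, noise


# Hypothetical tagged stories.
docs = {
    "d1": {"politics", "uk", "election"},
    "d2": {"politics", "uk", "economy"},
    "d3": {"economy", "uk", "banks"},
    "d4": {"football"},
    "d5": {"football", "politics"},
}
kept, noise = split_noise(docs, q_min=1)
print(sorted(kept), sorted(noise))
```

Raising `q_min` makes the filter stricter, so the choice of threshold trades coverage against cluster coherence.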
10. The Guardian
• Classifies news with useful metadata:
– …
– Section
– Tags
– …
http://www.theguardian.com/open-platform
Open Platform with API for application development.
3 years of data: 2010, 2011 and 2012
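The tag metadata listed above is what populates a Documents x Tags incidence matrix. A minimal sketch of that construction follows, assuming the stories have already been downloaded into plain dictionaries; the record shape (`"id"` and `"tags"` keys) is hypothetical and is not the Open Platform response format.

```python
def incidence_matrix(records):
    """Build a binary Documents x Tags matrix from tagged records.

    records: list of dicts with an "id" and a list of "tags"
    (an assumed shape used only for this illustration).
    Returns (doc_ids, tags, matrix) where matrix[i][j] is 1 when
    document doc_ids[i] carries tag tags[j].
    """
    tags = sorted({tag for record in records for tag in record["tags"]})
    column = {tag: j for j, tag in enumerate(tags)}
    doc_ids = [record["id"] for record in records]
    matrix = [[0] * len(tags) for _ in records]
    for i, record in enumerate(records):
        for tag in record["tags"]:
            matrix[i][column[tag]] = 1
    return doc_ids, tags, matrix


# Hypothetical records.
records = [
    {"id": "a", "tags": ["uk", "politics"]},
    {"id": "b", "tags": ["uk", "sport"]},
]
doc_ids, tags, matrix = incidence_matrix(records)
print(tags)
print(matrix)
```

Each row of the matrix is one document seen as a simplex over the tag vertices, which is the input Q-analysis works on.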
11. Pseudo code for the automated news clustering and filtering algorithm
12. Pseudo code for the automated news clustering and filtering algorithm
26. Developed Tools
• Theseus – A Python application for collecting, processing and visualising the textual dataset - https://github.com/sixhat/theseus
• Visualisation tool
28. Conclusions
• Q-analysis gives a descriptive overview of the structure of the system, in terms of the local connectivity of the news stories.
• Clustering (on top of the Q-analysis) gives a natural (highly modular) division of the resulting structures.
• This allows the identification of coherent news clusters and the filtering of noise news.
29. Generalisation of applicability
• Instead of human tagged documents, one can apply this to any kind of text based documents:
– HTML webpages: use the keywords tag from the header, or extract keywords with topic modelling (LDA, for example)
– Scientific documents: tag documents with topic modelling strategies like LDA and, instead of treating weakly connected documents as noise, explore the possibility that they might be emerging scientific trends.
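For the HTML case, the keywords can be read from the page header with the Python standard library. The sketch below is one possible way to do it, assuming the page exposes a `<meta name="keywords">` element with comma-separated values; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class KeywordsParser(HTMLParser):
    """Collects the comma-separated values of <meta name="keywords">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if tag == "meta" and name == "keywords":
            content = attributes.get("content") or ""
            # Normalise each keyword and skip empty entries.
            self.keywords.extend(
                k.strip().lower() for k in content.split(",") if k.strip()
            )


def extract_keywords(html):
    """Return the set of header keywords found in an HTML string."""
    parser = KeywordsParser()
    parser.feed(html)
    return set(parser.keywords)


# Hypothetical page.
page = (
    '<html><head><meta name="keywords" content="Politics, UK , elections">'
    "</head><body></body></html>"
)
print(extract_keywords(page))
```

The resulting keyword sets can play the role of the human-assigned tags when building the document simplices.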
30. Take home message
• Real complex systems are multi-dimensional. Community detection methods need to take those descriptions into account.
• The construction of descriptions with all the relations (hyper-simplices) gives qualitatively better results.
• In the newspapers case, this helps the filtering of "noise" news (unrelated news).