3. the idea
Meme: a postulated unit or element of cultural ideas transmitted from one mind to another through speech or similar phenomena.
Zeitgeist: a German-language expression referring to "the spirit of the times".
Semantic Web: an evolving development of the World Wide Web in which the meaning (semantics) of information on the web is defined, making it possible for machines to process it.
Flux of MEME: analysis of the web Zeitgeist through geo-localized memes, updated and shared on social media, mainly via mobile networks.
4. background
yahoo research
WWW2011 - Who Says What to Whom on Twitter - Wu, Hofman, Mason, Watts
WSDM2011 - Who Uses Web Search for What? And How? - Weber, Jaimes
CSCW2011 - Peaks and Persistence: Modeling the Shape of Microblog Conversations - Shamma, Kennedy, Churchill
others
WWW2010 - What is Twitter, a Social Network or a News Media? - Kwak, Lee, Park, Moon
Tech report 2009 (Princeton / Carnegie Mellon) - Topic Models - Blei, Lafferty
Tech report 2009 (Facebook / Maryland / Princeton) - Reading Tea Leaves: How Humans Interpret Topic Models - Chang, Boyd-Graber, Gerrish, Wang, Blei
5. algorithm steps
1. fetch data
2. create clusters
3. extract topics
4. analyze stats
7. step 1. fetch data!
using the free Spritzer access to the Twitter streaming API (~1% of total tweets)
defined set of location boxes (Italy, UK, France, Spain); see the sketch below
reinforcing locations with GeoNames didn't prove efficient (origin: "from a galaxy far far away")
enrich content through web scraping, also collecting meta & OpenGraph keywords
blacklist of noisy sources
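A minimal sketch of this step with twitter4j (the library listed in the refs at the end). The credentials setup, the listener body and the Italy bounding-box coordinates are assumptions, not the project's actual values.

    import twitter4j.FilterQuery;
    import twitter4j.Status;
    import twitter4j.StatusAdapter;
    import twitter4j.TwitterStream;
    import twitter4j.TwitterStreamFactory;

    public class FetchData {
        public static void main(String[] args) {
            // OAuth credentials are read from twitter4j.properties
            TwitterStream stream = new TwitterStreamFactory().getInstance();
            stream.addListener(new StatusAdapter() {
                @Override public void onStatus(Status status) {
                    if (status.getGeoLocation() != null) {
                        // save the post and its coordinates to the db here
                    }
                }
            });
            // one bounding box per country; Italy shown, coordinates approximate:
            // {south-west lon, lat}, {north-east lon, lat}
            stream.filter(new FilterQuery().locations(new double[][] {
                    { 6.6, 35.5 }, { 18.5, 47.1 } }));
        }
    }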
8. step 2. create geo-clusters
create time slices
select all the posts within a time slice
choose geo-granularity (radius of clusters)
agglomerate posts with Hierarchical Agglomerative Clustering (HAC)
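A compact sketch of these four sub-steps, assuming centroid linkage over haversine distance (the project's exact linkage criterion may differ): every post in the slice starts as its own cluster, and the two nearest clusters are merged until the closest pair is farther apart than the chosen radius.

    import java.util.ArrayList;
    import java.util.List;

    public class GeoHac {
        /** Each point is {lat, lon}; returns the clusters as lists of points. */
        static List<List<double[]>> cluster(List<double[]> posts, double radiusKm) {
            List<List<double[]>> clusters = new ArrayList<>();
            for (double[] p : posts) {                // every post starts as a singleton cluster
                List<double[]> c = new ArrayList<>();
                c.add(p);
                clusters.add(c);
            }
            while (clusters.size() > 1) {
                int bi = -1, bj = -1;
                double best = Double.MAX_VALUE;       // find the nearest pair of centroids
                for (int i = 0; i < clusters.size(); i++)
                    for (int j = i + 1; j < clusters.size(); j++) {
                        double d = haversineKm(centroid(clusters.get(i)), centroid(clusters.get(j)));
                        if (d < best) { best = d; bi = i; bj = j; }
                    }
                if (best > radiusKm) break;           // geo-granularity reached, stop merging
                clusters.get(bi).addAll(clusters.remove(bj));
            }
            return clusters;
        }

        static double[] centroid(List<double[]> c) {
            double lat = 0, lon = 0;
            for (double[] p : c) { lat += p[0]; lon += p[1]; }
            return new double[] { lat / c.size(), lon / c.size() };
        }

        static double haversineKm(double[] a, double[] b) {
            double dLat = Math.toRadians(b[0] - a[0]), dLon = Math.toRadians(b[1] - a[1]);
            double h = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                     + Math.cos(Math.toRadians(a[0])) * Math.cos(Math.toRadians(b[0]))
                     * Math.sin(dLon / 2) * Math.sin(dLon / 2);
            return 2 * 6371.0 * Math.asin(Math.sqrt(h));
        }
    }

The naive pairwise scan is quadratic per merge, which is one of the reasons the improvements slide talks about scalability.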
9. step 3. extract topics
a geo-cluster provides the whole bag of words used to define a document
topic extraction is implemented with LDA
α: Dirichlet prior on the per-document topic distributions (frontend output: weight)
β: Dirichlet prior on the per-topic word distributions
θ_i is the topic distribution for document i,
z_ij is the topic for the j-th word in document i, and
w_ij is the specific word.
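Putting the notation together (writing φ_k for the per-topic word distribution that β parametrizes), standard LDA generates each word as:

    θ_i ∼ Dir(α),   φ_k ∼ Dir(β),   z_ij ∼ Mult(θ_i),   w_ij ∼ Mult(φ_{z_ij})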
user-defined params:
number of topics,
number of words per topic,
min followers
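A sketch of this step with MALLET (the machine-learning library in the refs); the pipeline, the prior values and the iteration count are assumptions, not the project's exact settings.

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.regex.Pattern;

    import cc.mallet.pipe.*;
    import cc.mallet.pipe.iterator.StringArrayIterator;
    import cc.mallet.topics.ParallelTopicModel;
    import cc.mallet.types.InstanceList;

    public class ClusterTopics {
        /** Runs LDA over geo-cluster documents and prints the top words per topic. */
        static void extract(String[] clusterDocs, int numTopics, int wordsPerTopic)
                throws IOException {
            // pipeline: tokenize on letters, lowercase, map tokens to feature indices
            Pipe pipe = new SerialPipes(new Pipe[] {
                    new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")),
                    new TokenSequenceLowercase(),
                    new TokenSequence2FeatureSequence() });
            InstanceList docs = new InstanceList(pipe);
            docs.addThruPipe(new StringArrayIterator(clusterDocs)); // one doc per geo-cluster

            // numTopics, alphaSum and beta map to the user-defined params and priors above
            ParallelTopicModel lda = new ParallelTopicModel(numTopics, 50.0, 0.01);
            lda.addInstances(docs);
            lda.setNumIterations(1000);
            lda.estimate();

            Object[][] topWords = lda.getTopWords(wordsPerTopic);
            for (int k = 0; k < numTopics; k++)
                System.out.println("topic " + k + ": " + Arrays.toString(topWords[k]));
        }
    }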
10. step 4. analyze data
define search context: topics or keywords
perform live search with TF-IDF indicators
display time-lapse of clusters' analytics evolution (log-scale count and average size)
quick and easy interface: toggle visibility of clusters
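As an illustration, the usual tf × idf weighting behind these indicators (one common variant; the exact formula in the codebase may differ):

    /** tf-idf of a term in one document, against a corpus of totalDocs documents. */
    static double tfIdf(int termCountInDoc, int docLength, int docsWithTerm, int totalDocs) {
        double tf = (double) termCountInDoc / docLength;                // term frequency
        double idf = Math.log((double) totalDocs / (1 + docsWithTerm)); // +1 guards an empty df
        return tf * idf;
    }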
11. step 4. analyze data
drag and zoom on specific location boxes
select time interval
display aggregated stats of clusters (count and size) within the location box
show and export breakdown of posts' languages
12. step 4. analyze data
show stats and content of specific clusters
lat-lon of centroids, std. deviation, surface and radius
display weighted topics, TF-IDF of terms within topics, TF-IDF of meta keywords
show / export list of posts
show related links
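These per-cluster stats can be derived from the member coordinates roughly as below, reusing haversineKm from the clustering sketch; taking radius = mean + 2·std is an assumption, not necessarily the project's definition.

    /** Returns {centroidLat, centroidLon, stdKm, radiusKm, surfaceKm2} for points {lat, lon}. */
    static double[] clusterStats(java.util.List<double[]> points) {
        double lat = 0, lon = 0;
        for (double[] p : points) { lat += p[0]; lon += p[1]; }
        lat /= points.size();
        lon /= points.size();                                      // centroid
        double sum = 0, sumSq = 0;
        for (double[] p : points) {
            double d = haversineKm(new double[] { lat, lon }, p);
            sum += d;
            sumSq += d * d;
        }
        double mean = sum / points.size();
        double std = Math.sqrt(Math.max(sumSq / points.size() - mean * mean, 0));
        double radius = mean + 2 * std;                            // assumed definition
        return new double[] { lat, lon, std, radius, Math.PI * radius * radius };
    }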
13. step 4. analyze data
show query metrics and parameters
display overall TF-IDF for the selected query
14. demo
http://fom.londondroids.com/fom/
15. sorry guys, now the boring stuff...
backend, front-end API, cron jobs
16. Backend
Streaming API
a batch process runs constantly, saving data to the db
options: fetch by search query, expand terms with wikiminer, access the full stream, filter geotagged posts, filter by location box, fetch related content
Clustering and Topic extraction
define geo granularity
time/size of geo clusters
followers and retweets
number of topics / keywords
language mapping
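Collected in one place, the knobs above suggest a configuration surface along these lines (a purely hypothetical sketch; the actual option names live in the repo wiki):

    # streaming fetcher
    fetch.search_query=
    fetch.expand_with_wikiminer=false
    fetch.full_stream=true
    fetch.filter_geotagged=true
    fetch.location_box=44.61,8.52,45.57,11.33
    fetch.related_content=true
    # clustering and topic extraction
    cluster.radius_km=10
    cluster.timespan=daily
    cluster.min_followers=0
    cluster.min_retweets=0
    lda.num_topics=10
    lda.words_per_topic=8
    lang.mapping=it,en,fr,es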
17. API
search clusters containing specific topics / keywords
returns lists of clusters ordered by topic weight
the whole data extraction API conforms to a RESTful model and returns JSON-structured data
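To give a flavor of the output, a hypothetical payload for a cluster search (every field name here is illustrative; the real schema is documented in the repo wiki):

    {
      "clusters": [
        { "id": 1026,
          "lat": 45.46, "lon": 9.19, "radius_km": 12.4,
          "topics": [ { "weight": 0.31, "words": ["concert", "piazza", "duomo"] } ] }
      ]
    }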
18. API
read list of geographic clusters
usually called after a search for a topic has been issued
19. API
read semantic content of a geographic cluster
topics grouped by score (the α parameter in LDA), with words weighted by TF-IDF with respect to the whole cluster content
20. API
read meta / opengraph content of a geographic cluster
21. API
export list of posts
exports all the posts contained in a cluster
example request: /cluster/export_posts/1026/csv
read post content
reads the content of a post
example request: /cluster/read_post/560951
read related link
reads the content of a link related to a post (the id is usually fetched through the "links" variable returned by the function above)
example request: /cluster/read_link/16268
execute cluster stats within a location box
reads the list of clusters contained within a location box and creates stat charts (as Google Chart images)
example request: /cluster/dzstat/c_since=2011-05-07/c_until=2011-05-10/swLat=44.61/swLon=8.52/neLat=45.57/neLon=11.33
execute post stats within a location box
reads the list of posts contained within a location box and performs stats on their languages
example request: /search/dzstat/p_since=2011-05-07/p_until=2011-05-10/p_timespan=daily/swLat=44.61/swLon=8.52/neLat=45.57/neLon=11.33
read query content
reads the list of geo-clusters associated with a specific query id (usually fetched via the function above)
example request: /cluster/read/2
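As a usage sketch, the read_post example above fetched with plain java.net, assuming the endpoints are served under the demo base URL:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class FomClient {
        public static void main(String[] args) throws IOException {
            URL url = new URL("http://fom.londondroids.com/fom/cluster/read_post/560951");
            try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
                String line;
                while ((line = in.readLine()) != null)
                    System.out.println(line); // raw JSON body
            }
        }
    }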
22. Cron
keep everything running
restart the streaming API now and then, so as to keep twitter happy
create the clusters at the end of the day
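A hypothetical crontab along these lines (paths and timings are placeholders):

    # restart the streaming fetcher every 6 hours, so as to keep twitter happy
    0 */6 * * * /opt/fom/restart_stream.sh
    # build the day's geo-clusters just before midnight
    55 23 * * * /opt/fom/run_clustering.sh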
26. improvements
optimize time slicing!
emerging topics should be checked on an hourly basis across the complete dataset
train models!
a training set would be ideal to create models and optimize the performance of the topic extraction algorithm
models could relate to specific contexts in order to improve results (e.g. all the tweets from newspapers)
create language classifiers
increase the precision of language detection with naive Bayes classifiers (see the sketch after this list)
think of scalability
increasing the amount of data makes it necessary to scale up to Map/Reduce architectures
increase flexibility (e.g. manage multimedia data, offer a rich contextualized API, ...)
enhance analysis and visualization (e.g. reinforce topic correlation / n-grams)
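One possible shape for the proposed language classifier: naive Bayes over character bigrams with add-one smoothing. Everything below is a from-scratch illustration of the idea, not existing FOM code.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class LangNB {
        private final Map<String, Map<String, Integer>> counts = new HashMap<>(); // lang -> bigram -> count
        private final Map<String, Integer> totals = new HashMap<>();              // lang -> total bigrams

        public void train(String lang, String text) {
            Map<String, Integer> c = counts.computeIfAbsent(lang, k -> new HashMap<>());
            List<String> grams = bigrams(text);
            for (String g : grams) c.merge(g, 1, Integer::sum);
            totals.merge(lang, grams.size(), Integer::sum);
        }

        public String classify(String text) {
            String best = null;
            double bestLp = Double.NEGATIVE_INFINITY;
            for (String lang : counts.keySet()) {
                Map<String, Integer> c = counts.get(lang);
                double total = totals.get(lang);
                double lp = 0;                 // log-likelihood under this language's bigram model
                for (String g : bigrams(text)) // add-one smoothing; 65536 is a crude vocab-size guess
                    lp += Math.log((c.getOrDefault(g, 0) + 1.0) / (total + 65536));
                if (lp > bestLp) { bestLp = lp; best = lang; }
            }
            return best;
        }

        private static List<String> bigrams(String text) {
            List<String> out = new ArrayList<>();
            String t = text.toLowerCase();
            for (int i = 0; i + 2 <= t.length(); i++) out.add(t.substring(i, i + 2));
            return out;
        }
    }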
27. other refs
algorithms
LDA - http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
HAC - http://en.wikipedia.org/wiki/Cluster_analysis
libraries
twitter4j - http://twitter4j.org
machine learning - http://mallet.cs.umass.edu/
jquery (core + ui) - http://jquery.org/
data tables - http://datatables.net/
chart api - http://code.google.com/apis/chart/
image courtesy
http://yesyesno.com/nike-city-runs
28. ?
thanks!
codebase source + wiki https://github.com/grudelsud/fom
thomas alisi
@grudelsud
giuseppe serra
@giuseppeserra
marco bertini
@bertinimarco