3. the idea
Meme: a postulated unit or element of cultural ideas transmitted from one mind to another through speech or similar phenomena.
Zeitgeist: a German-language expression referring to "the spirit of the times".
Semantic Web: an evolving development of the World Wide Web in which the meaning (semantics) of information on the web is defined, making it possible for machines to process it.
Flux of MEME: analysis of the web Zeitgeist through geo-localized memes, updated and shared on social media, mainly via mobile networks.
4. background
yahoo research
WWW2011 - Who Says What to Whom on Twitter - Wu, Hofman, Mason, Watts
WSDM2011 - Who Uses Web Search for What? And How? - Weber, Jaimes
CSCW2011 - Peaks and Persistence: Modeling the Shape of Microblog Conversations - Shamma, Kennedy, Churchill
others
WWW2010 - What is Twitter, a Social Network or a News Media? - Kwak, Lee, Park, Moon
Tech report 2009 (Princeton / Carnegie Mellon) - Topic Models - Blei, Lafferty
Tech report 2009 (Facebook / Maryland / Princeton) - Reading Tea Leaves: How Humans Interpret Topic Models - Chang, Boyd-Graber, Gerrish, Wang, Blei
5. algorithm steps
1. fetch data
2. create clusters
3. extract topics
4. analyze stats
7. step 1. fetch data!
using the free Spritzer access to the Twitter streaming API (~1% of total tweets)
defined set of location boxes (Italy, UK, France, Spain); see the sketch below
reinforcing locations with GeoNames didn't prove efficient (origin: "from a galaxy far far away")
enrich content through web scraping, also collecting meta & OpenGraph keywords
blacklist of noisy sources
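A minimal sketch of this step with twitter4j (the library listed in the refs at the end). The credentials setup, the listener body and the Italy bounding-box coordinates are assumptions, not the project's actual values.

    import twitter4j.FilterQuery;
    import twitter4j.Status;
    import twitter4j.StatusAdapter;
    import twitter4j.TwitterStream;
    import twitter4j.TwitterStreamFactory;

    public class FetchData {
        public static void main(String[] args) {
            // OAuth credentials are read from twitter4j.properties
            TwitterStream stream = new TwitterStreamFactory().getInstance();
            stream.addListener(new StatusAdapter() {
                @Override public void onStatus(Status status) {
                    if (status.getGeoLocation() != null) {
                        // save the post and its coordinates to the db here
                    }
                }
            });
            // one bounding box per country; Italy shown, coordinates approximate:
            // {south-west lon, lat}, {north-east lon, lat}
            stream.filter(new FilterQuery().locations(new double[][] {
                    { 6.6, 35.5 }, { 18.5, 47.1 } }));
        }
    }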
8. step 2. create geo-clusters
create time slices
select all the posts within a time slice
choose geo-granularity (radius of clusters)
agglomerate posts with Hierarchical Agglomerative Clustering (HAC)
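A compact sketch of these four sub-steps, assuming centroid linkage over haversine distance (the project's exact linkage criterion may differ): every post in the slice starts as its own cluster, and the two nearest clusters are merged until the closest pair is farther apart than the chosen radius.

    import java.util.ArrayList;
    import java.util.List;

    public class GeoHac {
        /** Each point is {lat, lon}; returns the clusters as lists of points. */
        static List<List<double[]>> cluster(List<double[]> posts, double radiusKm) {
            List<List<double[]>> clusters = new ArrayList<>();
            for (double[] p : posts) {                // every post starts as a singleton cluster
                List<double[]> c = new ArrayList<>();
                c.add(p);
                clusters.add(c);
            }
            while (clusters.size() > 1) {
                int bi = -1, bj = -1;
                double best = Double.MAX_VALUE;       // find the nearest pair of centroids
                for (int i = 0; i < clusters.size(); i++)
                    for (int j = i + 1; j < clusters.size(); j++) {
                        double d = haversineKm(centroid(clusters.get(i)), centroid(clusters.get(j)));
                        if (d < best) { best = d; bi = i; bj = j; }
                    }
                if (best > radiusKm) break;           // geo-granularity reached, stop merging
                clusters.get(bi).addAll(clusters.remove(bj));
            }
            return clusters;
        }

        static double[] centroid(List<double[]> c) {
            double lat = 0, lon = 0;
            for (double[] p : c) { lat += p[0]; lon += p[1]; }
            return new double[] { lat / c.size(), lon / c.size() };
        }

        static double haversineKm(double[] a, double[] b) {
            double dLat = Math.toRadians(b[0] - a[0]), dLon = Math.toRadians(b[1] - a[1]);
            double h = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                     + Math.cos(Math.toRadians(a[0])) * Math.cos(Math.toRadians(b[0]))
                     * Math.sin(dLon / 2) * Math.sin(dLon / 2);
            return 2 * 6371.0 * Math.asin(Math.sqrt(h));
        }
    }

The naive pairwise scan is quadratic per merge, which is one of the reasons the improvements slide talks about scalability.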
9. step 3. extract topics
a geo-cluster provides the whole bag of words used to define a document
topic extraction is implemented with LDA
α: Dirichlet prior on the per-document topic distributions (frontend output: weight)
β: Dirichlet prior on the per-topic word distributions
θ_i is the topic distribution for document i,
z_ij is the topic for the j-th word in document i, and
w_ij is the specific word.
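Putting the notation together (writing φ_k for the per-topic word distribution that β parametrizes), standard LDA generates each word as:

    θ_i ∼ Dir(α),   φ_k ∼ Dir(β),   z_ij ∼ Mult(θ_i),   w_ij ∼ Mult(φ_{z_ij})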
user-defined params:
number of topics,
number of words per topic,
min followers
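A sketch of this step with MALLET (the machine-learning library in the refs); the pipeline, the prior values and the iteration count are assumptions, not the project's exact settings.

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.regex.Pattern;

    import cc.mallet.pipe.*;
    import cc.mallet.pipe.iterator.StringArrayIterator;
    import cc.mallet.topics.ParallelTopicModel;
    import cc.mallet.types.InstanceList;

    public class ClusterTopics {
        /** Runs LDA over geo-cluster documents and prints the top words per topic. */
        static void extract(String[] clusterDocs, int numTopics, int wordsPerTopic)
                throws IOException {
            // pipeline: tokenize on letters, lowercase, map tokens to feature indices
            Pipe pipe = new SerialPipes(new Pipe[] {
                    new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")),
                    new TokenSequenceLowercase(),
                    new TokenSequence2FeatureSequence() });
            InstanceList docs = new InstanceList(pipe);
            docs.addThruPipe(new StringArrayIterator(clusterDocs)); // one doc per geo-cluster

            // numTopics, alphaSum and beta map to the user-defined params and priors above
            ParallelTopicModel lda = new ParallelTopicModel(numTopics, 50.0, 0.01);
            lda.addInstances(docs);
            lda.setNumIterations(1000);
            lda.estimate();

            Object[][] topWords = lda.getTopWords(wordsPerTopic);
            for (int k = 0; k < numTopics; k++)
                System.out.println("topic " + k + ": " + Arrays.toString(topWords[k]));
        }
    }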
10. step 4. analyze data
define search context: topics or keywords
perform live search with TF-IDF indicators
display time-lapse of clusters' analytics evolution (log-scale count and average size)
quick and easy interface: toggle visibility of clusters
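As an illustration, the usual tf × idf weighting behind these indicators (one common variant; the exact formula in the codebase may differ):

    /** tf-idf of a term in one document, against a corpus of totalDocs documents. */
    static double tfIdf(int termCountInDoc, int docLength, int docsWithTerm, int totalDocs) {
        double tf = (double) termCountInDoc / docLength;                // term frequency
        double idf = Math.log((double) totalDocs / (1 + docsWithTerm)); // +1 guards an empty df
        return tf * idf;
    }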
11. step 4. analyze data
drag and zoom on specific location boxes
select time interval
display aggregated stats of clusters (count and size) within the location box
show and export breakdown of posts' languages
12. step 4. analyze data
show stats and content of specific clusters
lat-lon of centroids, std. deviation, surface and radius
display weighted topics, TF-IDF of terms within topics, TF-IDF of meta keywords
show / export list of posts
show related links
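These per-cluster stats can be derived from the member coordinates roughly as below, reusing haversineKm from the clustering sketch; taking radius = mean + 2·std is an assumption, not necessarily the project's definition.

    /** Returns {centroidLat, centroidLon, stdKm, radiusKm, surfaceKm2} for points {lat, lon}. */
    static double[] clusterStats(java.util.List<double[]> points) {
        double lat = 0, lon = 0;
        for (double[] p : points) { lat += p[0]; lon += p[1]; }
        lat /= points.size();
        lon /= points.size();                                      // centroid
        double sum = 0, sumSq = 0;
        for (double[] p : points) {
            double d = haversineKm(new double[] { lat, lon }, p);
            sum += d;
            sumSq += d * d;
        }
        double mean = sum / points.size();
        double std = Math.sqrt(Math.max(sumSq / points.size() - mean * mean, 0));
        double radius = mean + 2 * std;                            // assumed definition
        return new double[] { lat, lon, std, radius, Math.PI * radius * radius };
    }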
13. step 4. analyze data
show query metrics and parameters
display overall TF-IDF for the selected query
14. demo
http://fom.londondroids.com/fom/
15. sorry guys, now the boring stuff...
backend, front-end API, cron jobs
16. Backend
Streaming API
a batch process runs constantly, saving data to the db
options: fetch by search query, expand terms with wikiminer, access the full stream, filter geotagged posts, filter by location box, fetch related content
Clustering and Topic extraction
define geo granularity
time/size of geo clusters
followers and retweets
number of topics / keywords
language mapping
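Collected in one place, the knobs above suggest a configuration surface along these lines (a purely hypothetical sketch; the actual option names live in the repo wiki):

    # streaming fetcher
    fetch.search_query=
    fetch.expand_with_wikiminer=false
    fetch.full_stream=true
    fetch.filter_geotagged=true
    fetch.location_box=44.61,8.52,45.57,11.33
    fetch.related_content=true
    # clustering and topic extraction
    cluster.radius_km=10
    cluster.timespan=daily
    cluster.min_followers=0
    cluster.min_retweets=0
    lda.num_topics=10
    lda.words_per_topic=8
    lang.mapping=it,en,fr,es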
17. API
search clusters containing specific topics / keywords
returns lists of clusters ordered by topic weight
the whole data extraction API conforms to a RESTful model and returns JSON-structured data
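To give a flavor of the output, a hypothetical payload for a cluster search (every field name here is illustrative; the real schema is documented in the repo wiki):

    {
      "clusters": [
        { "id": 1026,
          "lat": 45.46, "lon": 9.19, "radius_km": 12.4,
          "topics": [ { "weight": 0.31, "words": ["concert", "piazza", "duomo"] } ] }
      ]
    }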
18. API
read list of geographic clusters
usually called after a search for a topic has been issued
19. API
read semantic content of a geographic cluster
topics grouped by score (the α parameter in LDA), with words weighted by TF-IDF with respect to the whole cluster content
20. API
read meta / opengraph content of a geographic cluster
21. API
export list of posts
exports all the posts contained in a cluster
example request: /cluster/export_posts/1026/csv
read post content
reads the content of a post
example request: /cluster/read_post/560951
read related link
reads the content of a link related to a post (the id is usually fetched through the "links" variable returned by the function above)
example request: /cluster/read_link/16268
execute cluster stats within a location box
reads the list of clusters contained within a location box and creates stat charts (as Google Chart images)
example request: /cluster/dzstat/c_since=2011-05-07/c_until=2011-05-10/swLat=44.61/swLon=8.52/neLat=45.57/neLon=11.33
execute post stats within a location box
reads the list of posts contained within a location box and performs stats on their languages
example request: /search/dzstat/p_since=2011-05-07/p_until=2011-05-10/p_timespan=daily/swLat=44.61/swLon=8.52/neLat=45.57/neLon=11.33
read query content
reads the list of geo-clusters associated with a specific query id (usually fetched via the function above)
example request: /cluster/read/2
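As a usage sketch, the read_post example above fetched with plain java.net, assuming the endpoints are served under the demo base URL:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class FomClient {
        public static void main(String[] args) throws IOException {
            URL url = new URL("http://fom.londondroids.com/fom/cluster/read_post/560951");
            try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
                String line;
                while ((line = in.readLine()) != null)
                    System.out.println(line); // raw JSON body
            }
        }
    }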
22. Cron
keep everything running
restart the streaming API now and then, so as to keep twitter happy
create the clusters at the end of the day
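A hypothetical crontab along these lines (paths and timings are placeholders):

    # restart the streaming fetcher every 6 hours, so as to keep twitter happy
    0 */6 * * * /opt/fom/restart_stream.sh
    # build the day's geo-clusters just before midnight
    55 23 * * * /opt/fom/run_clustering.sh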
26. improvements
optimize time slicing!
emerging topics should be checked on an hourly basis across the complete dataset
train models!
a training set would be ideal to create models and optimize the performance of the topic extraction algorithm
models could relate to specific contexts in order to improve results (e.g. all the tweets from newspapers)
create language classifiers
increase the precision of language detection with naive Bayes classifiers (see the sketch after this list)
think of scalability
increasing the amount of data makes it necessary to scale up to Map/Reduce architectures
increase flexibility (e.g. manage multimedia data, offer a rich contextualized API, ...)
enhance analysis and visualization (e.g. reinforce topic correlation / n-grams)
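One possible shape for the proposed language classifier: naive Bayes over character bigrams with add-one smoothing. Everything below is a from-scratch illustration of the idea, not existing FOM code.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class LangNB {
        private final Map<String, Map<String, Integer>> counts = new HashMap<>(); // lang -> bigram -> count
        private final Map<String, Integer> totals = new HashMap<>();              // lang -> total bigrams

        public void train(String lang, String text) {
            Map<String, Integer> c = counts.computeIfAbsent(lang, k -> new HashMap<>());
            List<String> grams = bigrams(text);
            for (String g : grams) c.merge(g, 1, Integer::sum);
            totals.merge(lang, grams.size(), Integer::sum);
        }

        public String classify(String text) {
            String best = null;
            double bestLp = Double.NEGATIVE_INFINITY;
            for (String lang : counts.keySet()) {
                Map<String, Integer> c = counts.get(lang);
                double total = totals.get(lang);
                double lp = 0;                 // log-likelihood under this language's bigram model
                for (String g : bigrams(text)) // add-one smoothing; 65536 is a crude vocab-size guess
                    lp += Math.log((c.getOrDefault(g, 0) + 1.0) / (total + 65536));
                if (lp > bestLp) { bestLp = lp; best = lang; }
            }
            return best;
        }

        private static List<String> bigrams(String text) {
            List<String> out = new ArrayList<>();
            String t = text.toLowerCase();
            for (int i = 0; i + 2 <= t.length(); i++) out.add(t.substring(i, i + 2));
            return out;
        }
    }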
27. other refs
algorithms
LDA - http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
HAC - http://en.wikipedia.org/wiki/Cluster_analysis
libraries
twitter4j - http://twitter4j.org
machine learning - http://mallet.cs.umass.edu/
jquery (core + ui) - http://jquery.org/
data tables - http://datatables.net/
chart api - http://code.google.com/apis/chart/
image courtesy
http://yesyesno.com/nike-city-runs
28. ?
thanks!
codebase source + wiki https://github.com/grudelsud/fom
thomas alisi
@grudelsud
giuseppe serra
@giuseppeserra
marco bertini
@bertinimarco