flux of meme - final report
            telecom italia, milan 30.9.11
            thomas alisi
            @grudelsud




Friday, September 30, 11
the basics




the idea

                  Meme: a postulated unit or element of cultural ideas transmitted from one mind to
                  another through speech or similar phenomena.


                  Zeitgeist: German language expression referring to "the spirit of the times"


                  Semantic Web: an evolving development of the World Wide Web in which the
                  meaning (semantics) of information on the web is defined, making it possible for
                  machines to process it


                  Flux of MEME: analysis of the web Zeitgeist through geo-localized Memes, updated
                  and shared on social media mainly via mobile networks


background

                  yahoo research
                           WWW2011 - Who Says What to Whom on Twitter - Wu, Hofman, Mason, Watts
                           WSDM2011 - Who Uses Web Search for What? And How? - Weber, Jaimes
                           CSCW2011 - Peaks and Persistence: Modeling the Shape of Microblog
                           Conversations - Shamma, Kennedy, Churchill


                  others
                           WWW2010 - What is Twitter, a Social Network or a News Media? - Kwak, Lee,
                           Park, Moon
                           Tech report 2009 (Princeton / Carnegie Mellon) - Topic Models - Blei, Lafferty
                           Tech report 2009 (Facebook / Maryland / Princeton) - Reading Tea Leaves: How
                           Humans Interpret Topic Models - Chang, Boyd-Graber, Gerrish, Wang, Blei

algorithm steps




                           1. fetch data   2. create clusters   3. extract topics   4. analyze stats




implementation




step 1. fetch data!


                  using the free Spritzer access to
                  Twitter streaming API (~1% of total
                  tweets)
                  defined set of location boxes (Italy, UK,
                  France, Spain)
                  reinforcing locations with geonames
                  didn’t prove to be efficient (origin: from
                  a galaxy far far away)
                  enrich content through web scraping,
                  also carrying meta & opengraph
                  keywords
                  blacklist of noisy sources
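
A minimal sketch of the location-box filter, assuming each tweet has already been reduced to a (lat, lon) pair. The box coordinates are rough illustrative values, not the ones used in the project, and `LocationBox` and `first_matching_box` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LocationBox:
    """Axis-aligned box given by its south-west and north-east corners."""
    name: str
    sw_lat: float
    sw_lon: float
    ne_lat: float
    ne_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.sw_lat <= lat <= self.ne_lat
                and self.sw_lon <= lon <= self.ne_lon)


# Rough, illustrative boxes (assumed values, not taken from the project).
BOXES = [
    LocationBox("Italy", 36.0, 6.5, 47.5, 19.0),
    LocationBox("UK", 49.5, -8.5, 61.0, 2.0),
]


def first_matching_box(lat: float, lon: float) -> Optional[str]:
    """Return the name of the first box containing the point, or None."""
    for box in BOXES:
        if box.contains(lat, lon):
            return box.name
    return None
```

Boxes that cross the antimeridian are not handled here; none of the four countries listed needs that case.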


step 2. create geo-clusters




                  create time slices
                  select all the posts within a time slice
                  choose geo-granularity (radius of clusters)
                  agglomerate posts with Hierarchical
                  Agglomerative Clustering (HAC)
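
The clustering step can be sketched as follows, assuming posts in one time slice are given as (lat, lon) pairs and the geo-granularity is a radius in kilometres. The slides do not say which linkage was used; centroid linkage is an assumption here, and `geo_hac` is a hypothetical name. The naive pair search is cubic in the number of posts, so this is for illustration only.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (lat, lon) in degrees


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(h)))


def centroid(cluster: List[Point]) -> Point:
    """Arithmetic mean of the coordinates (adequate for small clusters)."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)


def geo_hac(points: List[Point], radius_km: float) -> List[List[Point]]:
    """Merge the two closest clusters until none are within radius_km."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = haversine_km(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > radius_km:
            break  # every remaining pair is farther apart than the radius
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Calling `geo_hac(posts, radius_km=50)` on a slice with posts around two distant cities should return one cluster per city.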




step 3. extract topics
                  a geo-cluster is treated as one document: the bag of words of all the posts
                  it contains
                  topic extraction is implemented with LDA
                           α: Dirichlet prior parameter on the per-document topic
                           distributions (frontend output: weight)
                           β: Dirichlet prior parameter on the per-topic word distributions
                           θ_i: topic distribution for document i
                           z_ij: topic for the j-th word in document i
                           w_ij: the j-th word in document i
                  user defined params:
                           number of topics,
                           number of words per topic,
                           min followers
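
To make the symbols concrete, here is a small sketch of the LDA generative process (not the inference step, for which the deck lists MALLET among its libraries). The vocabulary, the prior values and the function names are all illustrative assumptions.

```python
import random
from typing import List, Sequence, Tuple


def sample_dirichlet(alpha: Sequence[float], rng: random.Random) -> List[float]:
    """Draw from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(
    n_words: int,
    vocab: Sequence[str],
    topic_word: Sequence[Sequence[float]],
    alpha: Sequence[float],
    rng: random.Random,
) -> Tuple[List[float], List[int], List[str]]:
    """Generate one document: theta_i, then (z_ij, w_ij) for each word slot."""
    theta = sample_dirichlet(alpha, rng)  # theta_i: topic mix of this document
    topic_ids = list(range(len(topic_word)))
    assignments, words = [], []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]      # z_ij
        w = rng.choices(vocab, weights=topic_word[z])[0]  # w_ij
        assignments.append(z)
        words.append(w)
    return theta, assignments, words


# Illustrative setup: 2 topics over a 4-word vocabulary (assumed values).
rng = random.Random(42)
vocab = ["sciopero", "treni", "derby", "goal"]
alpha = [0.5, 0.5]                 # prior on per-document topic mix
beta = [0.5] * len(vocab)          # prior on per-topic word distribution
topic_word = [sample_dirichlet(beta, rng) for _ in alpha]
theta, z, w = generate_document(10, vocab, topic_word, alpha, rng)
```

Topic extraction runs this process in reverse: given only the words of each geo-cluster, it estimates the topic mix per cluster and the word distribution per topic.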

step 4. analyze data




                  define search context: topics or keywords
                  perform live search with TF-IDF indicators
                  display time-lapse of clusters’ analytics
                  evolution (log-scale count and average size)
                  quick and easy interface: toggle visibility of
                  clusters
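
A minimal sketch of a TF-IDF indicator, assuming each geo-cluster is one document given as a list of tokens. The slides do not state which TF-IDF variant was used; the plain tf × log(N/df) form below is an assumption.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: score} map per document.

    tf = term count / document length; idf = log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores
```

A term that appears in every cluster scores zero, which is what makes the indicator useful for surfacing terms specific to one place.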




step 4. analyze data




                  drag and zoom on specific location boxes
                  select time interval
                  display aggregated stats of clusters (count
                  and size) within location box
                  show and export breakdown of posts’
                  languages




step 4. analyze data

                                   show stats and content of
                                   specific clusters
                                     lat-lon of centroids, std.
                                     deviation, surface and
                                     radius
                                   display weighted topics,
                                   TF-IDF of terms within
                                   topics, TF-IDF of meta
                                   keywords
                                   show / export list of posts
                                   show related links
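
The per-cluster figures could be computed roughly as below. The definitions are assumptions: the radius is taken as the largest distance from the centroid, distances use an equirectangular approximation, and the surface is left out because the slides do not define it.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (lat, lon) in degrees
EARTH_RADIUS_KM = 6371.0


def approx_distance_km(a: Point, b: Point) -> float:
    """Equirectangular approximation; reasonable for city-sized clusters."""
    mean_lat = math.radians((a[0] + b[0]) / 2)
    x = math.radians(b[1] - a[1]) * math.cos(mean_lat)
    y = math.radians(b[0] - a[0])
    return EARTH_RADIUS_KM * math.hypot(x, y)


def cluster_stats(points: List[Point]) -> Dict[str, object]:
    """Centroid, per-axis standard deviation (degrees) and radius (km)."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    centre = (mean(lats), mean(lons))
    return {
        "centroid": centre,
        "std_lat": pstdev(lats),
        "std_lon": pstdev(lons),
        "radius_km": max(approx_distance_km(centre, p) for p in points),
    }
```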




step 4. analyze data




                                   show query metrics and
                                   parameters
                                   display overall TF-IDF for
                                   the selected query




demo
            http://fom.londondroids.com/fom/




sorry guys, now the boring stuff...
            backend, front-end API, cron jobs




Backend
                  Streaming API
                           a batch process is constantly
                           running and saving data on the
                           db
                           options: fetch by search query,
                           expand terms with wikiminer,
                           access all the stream, filter
                           geotagged, filter location box,
                           fetch related content
                  Clustering and Topic extraction
                           define geo granularity
                           time/size of geo clusters
                           followers and retweets
                           number of topics / keywords
                           language mapping

API




                  search clusters containing
                  specific topics / keywords
                  returns lists of clusters
                  ordered by topic weight
                  all the data extraction API
                  conforms to a RESTful
                  model and returns JSON
                  structured data




API




                  read list of geographic
                  clusters
                  usually called after a topic
                  search has been issued




API




                  read semantic content of a
                  geographic cluster
                  topics grouped by score (alpha
                  parameter in LDA) and words
                  weighted with TF-IDF with
                  respect to the whole cluster
                  content




API




                  read meta / opengraph
                  content of a geographic
                  cluster




API
                  export list of posts
                           exports all the posts contained in a cluster
                           example request: /cluster/export_posts/1026/csv
                  read post content
                           reads the content of a post
                           example request: /cluster/read_post/560951
                  read related link
                           reads the content of a link related to a post (the id is usually fetched through the variable “links” returned by the function above)
                           example request: /cluster/read_link/16268
                  execute cluster stats within a location box
                           reads the list of clusters contained within a location box and creates stat charts (as Google Chart images)
                           example request: /cluster/dzstat/c_since=2011-05-07/c_until=2011-05-10/swLat=44.61/swLon=8.52/neLat=45.57/neLon=11.33
                  execute post stats within a location box
                           reads the list of posts contained within a location box and computes stats on their languages
                           example request: /search/dzstat/p_since=2011-05-07/p_until=2011-05-10/p_timespan=daily/swLat=44.61/swLon=8.52/neLat=45.57/neLon=11.33
                  read query content
                           reads the list of geo-clusters associated to a specific query id (usually fetched by the function above)
                           example request: /cluster/read/2
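
The example requests above all follow a `/controller/action/key=value/...` pattern. A small client-side helper for building such paths might look like this; `build_path` is a hypothetical name, and the pattern is inferred only from the examples on this slide.

```python
from typing import Mapping, Sequence
from urllib.parse import quote


def build_path(controller: str, action: str,
               args: Sequence[object] = (),
               params: Mapping[str, object] = {}) -> str:
    """Build '/controller/action/arg/.../key=value/...' in the given order."""
    segments = [controller, action]
    segments += [quote(str(a), safe="") for a in args]
    segments += [f"{key}={quote(str(value), safe='')}"
                 for key, value in params.items()]
    return "/" + "/".join(segments)


# Cluster stats within a location box, as in the example request.
stats_path = build_path("cluster", "dzstat", params={
    "c_since": "2011-05-07",
    "c_until": "2011-05-10",
    "swLat": "44.61",
    "swLon": "8.52",
    "neLat": "45.57",
    "neLon": "11.33",
})

# Export of all posts in cluster 1026 as CSV.
export_path = build_path("cluster", "export_posts", args=[1026, "csv"])
```

Parameter order is preserved because the examples suggest the server reads them positionally; whether it actually does is not stated in the slides.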


Cron




                  keep everything running
                           restart the streaming API
                           now and then, so as to
                           keep twitter happy
                           create the clusters at the
                           end of the day




servers




final thoughts




improvements

                  optimize time slicing!
                           emerging topics should be checked on an hourly basis across the complete dataset
                  train models!
                           a training set would be ideal to create models and optimize the performance of the topic
                           extraction algorithm
                           models could relate to specific contexts in order to improve results (e.g. all the tweets from
                           newspapers)
                  create language classifiers
                           increase the precision of language detection with naive Bayes classifiers
                  think of scalability
                           increasing the amount of data makes it necessary to scale up to Map/Reduce architectures
                  increase flexibility (e.g. manage multimedia data, offer a rich contextualized API, ...)
                  enhance analysis and visualization (e.g. reinforce topic correlation / n-grams)
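
As an illustration of the language-classifier item, here is a sketch of a multinomial naive Bayes classifier over character bigrams with add-one smoothing. The feature choice, the class name and the tiny training set in the usage note are assumptions, not part of the project.

```python
import math
from collections import Counter, defaultdict
from typing import List, Optional


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Lower-case the text, pad it with spaces and return its n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLanguageClassifier:
    def __init__(self) -> None:
        self.ngram_counts = defaultdict(Counter)  # language -> n-gram counts
        self.doc_counts = Counter()               # language -> training texts
        self.vocab = set()                        # all n-grams seen in training

    def train(self, text: str, language: str) -> None:
        grams = char_ngrams(text)
        self.ngram_counts[language].update(grams)
        self.doc_counts[language] += 1
        self.vocab.update(grams)

    def predict(self, text: str) -> Optional[str]:
        """Return the language with the highest log-posterior, or None."""
        total_docs = sum(self.doc_counts.values())
        best_lang, best_score = None, -math.inf
        for lang, counts in self.ngram_counts.items():
            score = math.log(self.doc_counts[lang] / total_docs)  # prior
            denom = sum(counts.values()) + len(self.vocab)
            for gram in char_ngrams(text):
                # Add-one smoothing keeps unseen n-grams from zeroing the score.
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang
```

Usage would be a handful of `train(text, "it")` / `train(text, "en")` calls followed by `predict(tweet_text)`; a realistic classifier would need far more training text per language than a few sentences.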

other refs

                  algorithms
                           LDA - http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
                           HAC - http://en.wikipedia.org/wiki/Cluster_analysis
                  libraries
                           twitter 4 java - http://twitter4j.org
                           machine learning - http://mallet.cs.umass.edu/
                           jquery (core + ui) - http://jquery.org/
                           data tables - http://datatables.net/
                           chart api - http://code.google.com/apis/chart/
                  image courtesy
                           http://yesyesno.com/nike-city-runs

?
            thanks!
                  codebase source + wiki https://github.com/grudelsud/fom
                  thomas alisi
                  @grudelsud
                  giuseppe serra
                  @giuseppeserra
                  marco bertini
                  @bertinimarco





Flux of MEME - final report

  • 1. flux of meme - final report; telecom italia, milan 30.9.11; thomas alisi @grudelsud
  • 3. the idea. Meme: a postulated unit or element of cultural ideas transmitted from one mind to another through speech or similar phenomena. Zeitgeist: a German-language expression referring to "the spirit of the times". Semantic Web: an evolving development of the World Wide Web in which the meaning (semantics) of information on the web is defined, making it possible for machines to process it. Flux of MEME: analysis of the web Zeitgeist through geo-localized Memes, updated and shared on social media mainly via mobile networks.
  • 4. background. yahoo research: WWW2011 - Who Says What to Whom on Twitter - Wu, Hofman, Mason, Watts; WSDM2011 - Who Uses Web Search for What? And How? - Weber, Jaimes; CSCW2011 - Peaks and Persistence: Modeling the Shape of Microblog Conversations - Shamma, Kennedy, Churchill. others: WWW2010 - What is Twitter, a Social Network or a News Media? - Kwak, Lee, Park, Moon; Tech report 2009 (Princeton / Carnegie Mellon) - Topic Models - Blei, Lafferty; Tech report 2009 (Facebook / Maryland / Princeton) - Reading Tea Leaves: How Humans Interpret Topic Models - Chang, Boyd-Graber, Gerrish, Wang, Blei.
  • 5. algorithm steps: 1. fetch data; 2. create clusters; 3. extract topics; 4. analyze stats
  • 7. step 1. fetch data! using the free Spritzer access to the Twitter streaming API (~1% of total tweets); defined set of location boxes (Italy, UK, France, Spain); reinforcing locations with geonames didn’t prove to be efficient (origin: "from a galaxy far far away"); enrich content through web scraping, also carrying meta & opengraph keywords; blacklist of noisy sources.
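
As a minimal sketch of the location-box and blacklist filtering described in step 1, the Python below keeps a post only if its coordinates fall inside one of the configured boxes and its source is not blacklisted. The box coordinates are rough illustrative values, and the names `LocationBox`, `match_box`, `keep_post`, and the post fields `lat`, `lon`, `source_domain` are all assumptions made for this example, not taken from the project.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LocationBox:
    """A lat/lon bounding box given by its south-west and north-east corners."""
    name: str
    sw_lat: float
    sw_lon: float
    ne_lat: float
    ne_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.sw_lat <= lat <= self.ne_lat
                and self.sw_lon <= lon <= self.ne_lon)


# Rough, illustrative boxes (assumption): not the ones used by the project.
BOXES = [
    LocationBox("Italy", 36.0, 6.5, 47.5, 18.6),
    LocationBox("UK", 49.8, -8.7, 60.9, 1.8),
]

# Hypothetical blacklist of noisy sources.
BLACKLIST = {"noisy.example.com"}


def match_box(lat: float, lon: float, boxes=BOXES) -> Optional[str]:
    """Return the name of the first box containing the point, or None."""
    for box in boxes:
        if box.contains(lat, lon):
            return box.name
    return None


def keep_post(post: dict) -> bool:
    """Keep geotagged posts inside a box whose source is not blacklisted."""
    if post.get("lat") is None or post.get("lon") is None:
        return False
    if post.get("source_domain") in BLACKLIST:
        return False
    return match_box(post["lat"], post["lon"]) is not None
```

Rectangular boxes overlap at borders, so a first-match rule like this one can attribute a post near a frontier to the wrong country; a real setup would need a tie-breaking policy.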
  • 8. step 2. create geo-clusters: create time slices; select all the posts within a time slice; choose geo-granularity (radius of clusters); agglomerate posts with Hierarchical Agglomerative Clustering (HAC).
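
The geo-clustering step could look roughly like the sketch below: posts are bucketed into fixed-length time slices, then the posts of one slice are merged bottom-up until the two closest clusters are farther apart than the chosen radius. The slides do not say which linkage or distance was used, so centroid linkage, the haversine distance, and the function names here are assumptions for illustration only.

```python
import math
from collections import defaultdict
from itertools import combinations

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def centroid(points):
    """Arithmetic mean of lat/lon; a fair approximation for small clusters."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def time_slices(posts, slice_seconds):
    """Group (timestamp, lat, lon) posts by slice index; keep only coordinates."""
    slices = defaultdict(list)
    for ts, lat, lon in posts:
        slices[ts // slice_seconds].append((lat, lon))
    return dict(slices)


def hac(points, radius_km):
    """Naive agglomerative clustering with centroid linkage.

    Repeatedly merges the two clusters with the closest centroids and stops
    once that distance exceeds radius_km. Cubic time: fine for a sketch only.
    """
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        (i, j), dist = min(
            (((i, j), haversine_km(centroid(clusters[i]), centroid(clusters[j])))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if dist > radius_km:
            break
        clusters[i].extend(clusters[j])  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters
```

A larger radius yields fewer, wider clusters; the "geo-granularity" parameter in the slide plays the role of `radius_km` here.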
  • 9. step 3. extract topics: a geo-cluster represents the whole bag of words used to define a document; topic extraction is implemented with LDA. α is the Dirichlet prior parameter on the per-document topic distributions (frontend output: weight); β is the Dirichlet prior parameter on the per-topic word distribution; θ_i is the topic distribution for document i, z_ij is the topic for the jth word in document i, and w_ij is the specific word. user defined params: number of topics, number of words per topic, min followers.
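
For reference, the symbols in step 3 fit together in the standard LDA generative process, sketched below. The per-topic word distribution is written φ_k here; that symbol does not appear in the slides and is introduced only for this illustration.

```latex
\begin{align*}
\theta_i &\sim \operatorname{Dirichlet}(\alpha) && \text{for each document (geo-cluster) } i \\
\varphi_k &\sim \operatorname{Dirichlet}(\beta) && \text{for each topic } k \\
z_{ij} &\sim \operatorname{Categorical}(\theta_i) && \text{topic of the } j\text{th word in document } i \\
w_{ij} &\sim \operatorname{Categorical}(\varphi_{z_{ij}}) && \text{the observed word}
\end{align*}
```

Topic extraction amounts to inferring θ_i and φ_k from the observed words w_ij, with the number of topics fixed in advance as one of the user-defined parameters.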
  • 10. step 4. analyze data: define search context (topics or keywords); perform live search with TF-IDF indicators; display time-lapse of clusters’ analytics evolution (log-scale count and average size); quick and easy interface: toggle visibility of clusters.
  • 11. step 4. analyze data: drag and zoom on specific location boxes; select time interval; display aggregated stats of clusters (count and size) within location box; show and export breakdown of posts’ languages.
  • 12. step 4. analyze data: show stats and content of specific clusters: lat-lon of centroids, std. deviation, surface and radius; display weighted topics, TF-IDF of terms within topics, TF-IDF of meta keywords; show / export list of posts; show related links.
  • 13. step 4. analyze data: show query metrics and parameters; display overall TF-IDF for the selected query.
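
The TF-IDF indicators used throughout step 4 can be illustrated with the short sketch below. It uses one common variant (term frequency normalized by document length, and idf = log(N / df)); the slides do not state which weighting the project applied, so treat the formula and the function name `tf_idf` as assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF scores for a list of tokenized documents.

    Returns one {term: score} dict per document, where
    tf = count / document length and idf = log(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document

    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Each "document" could be the bag of words of one geo-cluster.
docs = [
    ["strike", "milan", "metro", "strike"],
    ["milan", "derby"],
    ["rome", "strike", "milan"],
]
scores = tf_idf(docs)
```

With this weighting a term that occurs in every cluster scores zero, which is why TF-IDF tends to surface terms specific to one cluster rather than globally frequent ones.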
  • 14. demo: http://fom.londondroids.com/fom/
  • 15. sorry guys, now the boring stuff... backend, front-end API, cron jobs
  • 16. Backend. Streaming API: a batch process is constantly running and saving data on the db; options: fetch by search query, expand terms with wikiminer, access all the stream, filter geotagged, filter location box, fetch related content. Clustering and Topic extraction: define geo granularity; time/size of geo clusters; followers and retweets; number of topics / keywords; language mapping.
  • 17. API: search clusters containing specific topics / keywords; returns lists of clusters ordered by topic weight. all the data extraction API conforms to a RESTful model and returns JSON structured data.
  • 18. API: read list of geographic clusters; usually called after a search topic has been raised.
  • 19. API: read semantic content of a geographic cluster; topics grouped by score (alpha parameter in LDA) and words weighted with TF-IDF with respect to the whole cluster content.
  • 20. API: read meta / opengraph content of a geographic cluster.
  • 21. API
      • export list of posts: exports all the posts contained in a cluster. example request: /cluster/export_posts/1026/csv
      • read post content: reads the content of a post. example request: /cluster/read_post/560951
      • read related link: reads the content of a link related to a post (the id is usually fetched through the variable “links” returned by the function above). example request: /cluster/read_link/16268
      • execute cluster stats within a location box: reads the list of clusters contained within a location box and creates stat charts (in the form of google chart images). example request: /cluster/dzstat/c_since=2011-05-07/c_until=2011-05-10/swLat=44.61/swLon=8.52/neLat=45.57/neLon=11.33
      • execute post stats within a location box: reads the list of posts contained within a location box and performs stats on languages. example request: /search/dzstat/p_since=2011-05-07/p_until=2011-05-10/p_timespan=daily/swLat=44.61/swLon=8.52/neLat=45.57/neLon=11.33
      • read query content: reads the list of geo-clusters associated to a specific query id (usually fetched by the function above). example request: /cluster/read/2
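
To make the request formats above concrete, here is a small hedged sketch of a client that builds those paths. The path layouts are copied from the example requests; the helper names, the assumption that the API is served under the demo URL, and the assumption that `fetch_json` would receive a JSON body from a given endpoint are illustrative only (the demo host may no longer be online, and the export endpoint returns CSV rather than JSON).

```python
import json
from urllib.request import urlopen

# Demo URL from the slides; assumed (not confirmed) to also host the API.
BASE_URL = "http://fom.londondroids.com/fom"


def export_posts_path(cluster_id, fmt="csv"):
    """Path for exporting all posts of a cluster, e.g. /cluster/export_posts/1026/csv."""
    return f"/cluster/export_posts/{cluster_id}/{fmt}"


def read_post_path(post_id):
    """Path for reading a single post, e.g. /cluster/read_post/560951."""
    return f"/cluster/read_post/{post_id}"


def cluster_dzstat_path(since, until, sw, ne):
    """Path for cluster stats within a location box.

    since/until are YYYY-MM-DD strings; sw/ne are (lat, lon) corner tuples.
    """
    return (f"/cluster/dzstat/c_since={since}/c_until={until}"
            f"/swLat={sw[0]}/swLon={sw[1]}/neLat={ne[0]}/neLon={ne[1]}")


def fetch_json(path, base_url=BASE_URL, timeout=10):
    """Hypothetical helper: GET a path and decode the body as JSON."""
    with urlopen(base_url + path, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping path construction separate from the network call makes the request formats easy to check without contacting the server.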
  • 22. Cron: keep everything running; restart the streaming API now and then, so as to keep twitter happy; create the clusters at the end of the day.
  • 26. improvements. optimize time slicing! emerging topics should be checked on an hourly basis among the complete dataset. train models! a training set would be ideal to create models and optimize the performance of the topic extraction algorithm; models could relate to a specific context in order to improve results (e.g. all the tweets from newspapers). create language classifiers: increase the precision of language detection with naive Bayes classifiers. think of scalability: increasing the amount of data makes it necessary to scale up to Map/Reduce architectures. increase flexibility (e.g. manage multimedia data, offer a rich contextualized API, ...). enhance analysis and visualization (e.g. reinforce topic correlation / n-grams).
  • 27. other refs
      • algorithms: LDA - http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
      • algorithms: HAC - http://en.wikipedia.org/wiki/Cluster_analysis
      • libraries: twitter 4 java - http://twitter4j.org
      • libraries: machine learning - http://mallet.cs.umass.edu/
      • libraries: jquery (core + ui) - http://jquery.org/
      • libraries: data tables - http://datatables.net/
      • libraries: chart api - http://code.google.com/apis/chart/
      • image courtesy: http://yesyesno.com/nike-city-runs
  • 28. ? thanks! thomas alisi @grudelsud, giuseppe serra @giuseppeserra, marco bertini @bertinimarco. codebase source + wiki: https://github.com/grudelsud/fom