The document summarizes the SocialSensor project, which aims to surface trusted and relevant information from social media streams in real-time. It describes the project's objectives, architecture, and two use cases - news and capturing large-scale events. For news, it discusses challenges in verifying information from social media. For events, it presents the EventSense framework and its application at the Thessaloniki Film Festival to detect entities, topics, and sentiment in tweets about films. Evaluation showed good performance in entity and topic detection but room for improvement in sentiment analysis.
DevoxxFR 2024 Reproducible Builds with Apache Maven
SocialSensor Project Captures Real-Time Insights from Social Media
1. SocialSensor: Sensing User Generated Input for
Improved Media Discovery and Experience
Social Multimedia Crawling & Mining
EventSense: Capturing the Pulse of Large-scale Events by
Mining Social Media Streams
Dr. Yiannis Kompatsiaris, Project Coordinator
Samos 2013 Summit on Digital Innovation for Government, Business and Society
2. #2
Overview
• Motivation
• Objectives
• Architecture
• Use Cases and Requirements
• News
– Social Multimedia Crawling & Mining
• Infotainment
– EventSense: Capturing the Pulse of Large-scale Events by
Mining Social Media Streams
• Conclusions
3. #3
What is SocialSensor?
• 3-year FP7 European Integrated Project
– http://www.socialsensor.eu
• Members: CERTH, ATC (Greece), Deutsche Welle,
University Koblenz, Research Center for Artificial
Intelligence (Germany), The City University London,
Alcatel – Lucent Bell Labs, JCP Consult (France),
University of Klagenfurt (Austria), IBM Israel, Yahoo
Iberia + Robert Gordon University Aberdeen (UK)
• 1.5 years into the project (Development of user
requirements, use case scenarios, architecture and
implementation and first R&D components and
prototypes. Currently: Evaluation and 2nd
round)
4. Motivation: Social Networks as Sensors
• Social Networks is a data source with an
extremely dynamic nature that reflects
events and the evolution of community
focus (user’s interests)
• Transform individually rare but
collectively frequent media to meaningful
topics, events, points of interest,
emotional states and social connections
• Mine the data and their relations and
exploit them in the right context
• Scalable mining and indexing approaches
taking into account the content and social
context of social networks
5. Relevant Applications
Xin Jin, Andrew Gallagher, Liangliang Cao, Jiebo Luo, and
Jiawei Han. The wisdom of social multimedia:
using flickr for prediction and forecast,
International conference on Multimedia (MM '10). ACM.
Federal Emergency Management Agency plans to
engage the public more in disaster response by
sharing data and leveraging reports from mobile
phones and social media
5
“…if you're more than 100 km away from the epicenter
[of an earthquake] you can read about the quake on
twitter before it hits you…”
6. Objective
SocialSensor quickly surfaces trusted and relevant material
from social media – with context
DySCODySCO
behaviou
r
location
timecontent
usage
social context
Massive social media
and unstructured web
Social media mining
Aggregation & indexing
News - Infotainment
Personalised access
Ad-hoc P2P networks
7. #7
The SocialSensor Vision
SocialSensor quickly surfaces trusted and relevant
material from social media – with context.
•“quickly”: in real time
•“surfaces”: automatically discovers, clusters and searches
•“trusted”: automatic support in verification process
•“relevant”: to the users, personalized
•“material”: any material (text, image, audio, video =
multimedia), aggregated with other sources (e.g. web)
•“social media”: across many relevant social media platforms
•“with context”: location, time, sentiment, influence, trust
8. #10
Conceptual Architecture and Main components
SEMANTIC MIDDLEWARE
Public
Data
In-project
Data
SEARCH & RECOMMENDATION
USER MODELLING & PRESENTATION
INDEXINGMINING
STORAGE
DATA COLLECTION / CRAWLING
• Real time dynamic topic
and event clustering
• Trend, popularity
and sentiment analysis
• Calculate trust/influence
scores around people
• Personalized search,
access & presentation
based on social network
interactions
• Semantic enrichment
and discovery of services
9. DySCO concept
• Integrate social content mining, search and intelligent
presentation in a personalized, context and network-
aware way, through the new concept of Dynamic Social
COntainers (DySCOs)
• Composite objects containing a number of items (e.g. articles,
tweets, images, videos)
• Focused on a particular topic of interest (e.g. an event, a story)
• Contain all available information about the topic
• Metadata can be added dynamically
• Ability to search for DySCOS by matching “DySCO features” with
“search features”
• Ability to check and recommend similar DySCOs
recommendations
#11
11. #14
“It has changed the way we do
news”(MSN)
“Social media is the key place for emerging stories –
internationally, nationally, locally” (BBC)
“Social media is transforming the way we do journalism”
(New York Times)
Source: picture alliance / dpa
13. #16
Source: Getty Images
“It’s really hard to find the nuggets of useful stuff
in an ocean of content” (BBC)
“Things that aren’t relevant crowd out the content
you are looking for” (MSN)
“The filters aren’t configurable
enough” (CNN)
15. #18
An example: BBC Verification Procedure: Arab
Spring Coverage
• Referencing locations against maps and existing images
from, in particular, geo-located ones.
• Working with our colleagues in BBC Arabic and BBC
Monitoring to ascertain that accents and language are
correct for the location.
• Searching for the original source of the
upload/sequences as an indicator of date.
• Examining weather reports and shadows to confirm that
the conditions shown fit with the claimed date and time.
• Maintaining lists of previously verified material to act as
reference for colleagues covering the stories.
• Checking scenery, weaponry, vehicles and licence plates
against those known for the given country.
16. News application
• Real-time Search/browse news items crawled from different social media
• Automatically discovered trending topics
• Web analytics
• Sentiment scores for topics
#19
17. Alethiometer
• Measuring the degree of truth behind tweets
• Overall trust score for a tweet
• Various Contributor, Content, Context validity metrics
#20
18. #21
Social Multimedia Crawling & Mining
Case study: #OccupyGezi
E. Schinas, S. Papadopoulos, I. Tsampoulatidis, K. Iliakopoulou, Y. Kompatsiaris
19. #22
Multimedia crawling & mining
• Monitor/query multiple sources for shared media
content: Twitter, Facebook, Flickr, YouTube, etc.
• Multiple indexing schemes:
– text-based (Solr)
– visual content-based (SURF+VLAD for feature extraction,
ADC for similarity-based indexing)
• Clustering
– geo-spatial (BIRCH)
– visual (SCAN)
• Web-based presentation of results
20. #23
Crawling & mining system deployment
StreamManager
Twitter Facebook Flickr YouTube RSS Instagram
160.xx.xx.207
MongoDBWrapper
160.xx.xx.207
TextIndexer (Solr)
160.xx.xx.207
160.xx.xx.207
MediaFetcher, FeatureExtractor (HDFS)
160.xx.xx.58 160.xx.xx.107
Social Focused Crawler (HDFS)
160.xx.xx.187
Nutch
Nutch VLAD
FeatureIndexer (HDFS)
160.xx.xx.207
IVFADC
Data Mining
160.xx.xx.191
Visual Clust. Geo Clust. Statistics
Web server
160.xx.xx.116
API (3)API (4)
API (1) API (2)
22. #25
Geographical spread of event
• Seems like it is not a localized event (as several
official Turkish news sources claimed), but spreads
all over Turkey and even in major cities abroad
28. #32
Capturing & mining large-scale events
• Large-scale events attended by thousands of
people captured by mobile devices in the form of
status updates, photos, ratings, etc.
• Challenge:
– Organize information
around entities of
interest
– Extract meaningful
insights, obtain
informative summaries
• EventSense framework
31. #36
ThessFest
• Gather “realistic” user requirements
• Early showcase and evaluation of SocialSensor
technologies in real-world event scale
• Engage users and create an informed user basis
• TDF14, 15 + TIFF53
– 1400+ users
– 40K+ user sessions
– positive response to social media
• Next version
– Updated features bases on SocialSensor prototype
32. Fête de la Musique Berlin app
• FETEberlin in App Store and Google Play
• More than 100K visitors
• About 5K musicians
• More than 5K app downloads
App features
•Browse and filter detailed program
•Interactive maps and routing
•Social Sharing
•Artists’ and Stages Details
•Social Monitoring
Main benefits for attendants
•Visitors can browse through maps and
don’t get lost as stages are numerous
•Event schedule is available always and per
stage
– Very useful when the server was down and
there was no access to the online schedule
#37
34. FETEberlin Facts & User Feedback
Unique Users Sessions Frequency of Use
App Store 2904 13751 2,5 sessions per day
Google Play 2210 12097 2,9 sessions per day
Total 5114 25848 Avg: 2,7 sessions
per day
Future Plans
•Enhanced Event and Visitor
Engagement
•Send Last minute updates
•Create buzz around the event and make
users the event ambassadors
•Gain insightful knowledge on the
impact that the event made via social
media analysis
– Organize better events
#39
35. #40
EventSense: Capturing the Pulse of Large-scale
Events by Mining Social Media Streams
Case study: Thessaloniki International Film Festival
E. Schinas, S. Papadopoulos, S. Diplaris, Y. Kompatsiaris,
Y. Mass, J. Herzig, L. Boudakidis
36. #41
Entity Detection
• Entities are defined as lists of properties:
– a film consists of a title, description, names of
director(s)/actors
• Matching status updates (tweets) to entities relies on
representing both as vectors, using cosine similarity,
and thresholding:
m: message (tweet), f: feature (term), M: set of all event messages
boost(f): boosting factor when f is a named entity
37. #42
Topic detection
• For each new message, find the Nearest Neighbour
(NN) using Locality Sensitive Hashing (LSH)
• If similarity exceeds an empirically selected
threshold, assign to the topic of NN, otherwise
create new topic
• Clusters of one messages are discarded as outliers
• Cluster-merging is conducted as a post-processing
step to compensate for topic over-segmentation
• Similar approach to (Petrovic et al., 2010)
S. Petrovic, M. Osborne, V. Lavrenko. Streaming first story detection with application to twitter. In
Human Language Technologies, HLT ’10, pages 181–189, Stroudsburg, PA, USA, 2010. ACL
38. #43
Sentiment detection (1/2)
• Build positive/negative sentiment classifiers using
emoticons
• Build neutral classifier using positive/negative
classifiers
• Feature extraction:
– Remove stop words, emoticons, terms occurring only
once, trim repeated letters
– Negation terms (“not”, “isn’t”) are attached to subsequent
terms to form new unigrams (e.g. “nothappy”)
– Treat user mentions, URLs, punctuation, repeated letters
and all-caps words as additional features
39. #44
Sentiment detection (2/2)
• Naïve Bayes classifier per sentiment
• P(f|c) estimate using ML and Laplace correction
• Probability estimate for special features (e.g. user mentions)
using Bernoulli model and Laplace correction
• Classification using maximum log-likelihood
• Neutral messages:
– Mutual Information (MI)
– MI of features
– Sentiment intensity of message
– Use thresholding for decision
40. #45
Evaluation
• Case study: 53rd
Thessaloniki International Film Festival
(TIFF53), Nov 2-11, 2012
• 168 films included in the program (titles, descriptions in
Greek and English)
• 3974 tweets using #tiff53
• Manual annotation regarding:
– film
– sentiment (pos/neg/neut)
• Additional data using ThessFest mobile app:
– #bookmarks per film (number of times a user added the film to their
schedule)
– #ratings + avg. rating per film
42. #47
Topic analysis
• Top-10 topics
• Manual inspection
of clusters:
– 53.8% of topic titles
considered
informative
– 98.5% of clusters
were found to be
“clean”
• Topics in time
43. #48
Sentiment analysis
• Training (using emoticons and Twitter API)
– 800K positive & negative tweets for English
– 12K positive & negative tweets for Greek
• Tuning (for threshold)
– Manually annotated dataset from Thessaloniki Documentary Festival
(similar event)
– 325/73/553 in English and 781/216/781 in Greek
• Testing
– 324/33/724 in English and 901/315/1667 in Greek
– Best accuracy (English) ~ 0.75
– Performance in Greek much poorer
compared to English
need for richer training corpus
pos neg neut
44. #49
Aggregation & summarization (1/2)
#T: number of tweets
Pol: polarity of film tweets
Subj: subjectivity of film tweets
R: average rating
#R: number of ratings
#F: number of times the film was bookmarked
• Films with positive polarity are rated higher.
• Films that are tweeted a lot are also more
likely to be rated.
• Films that are tweet a lot are also more likely
to be added to the users’ bookmarks.
45. #50
Aggregation & summarization (2/2)
Most active & influential Twitter
accounts (+sentiment per user)
Most shared photos
(+number of retweets)
46. Conclusions
• Great interest in both use cases
• In news social media have transformed both news generation and
consumption
• Social media data mining can provide interesting results in
many applications
• Not all data always available (e.g. User queries, fb)
– Infrastructure, Policy issues
• Technical challenges
– Fusion (multi-modality, context), real-time, noise, big data,
aggregation (web, Linked Open Data)
• Applications challenges
• User engagement, visualization, become part of existing workflows,
privacy, copyright, commercialization
Benefits: (i) Intelligent extraction of objects and events from the social Web, (ii) multimodal indexing and organization, (iii) personalized access and presentation of content (incl. media delivery and caching), and (iv) concrete and real integration of the social dimension of the current Web.
----- Besprechungsnotizen (03.04.12 14:41) ----- In the course of the project we have interviewed a considerable number of journalists and executives from some of the worlds biggest media outlets like CNN, the BBC, The New York Times and others... Here are some of the quotes. 3x klicken (bis alle 3 Quotes sichtbar). But journalists are not only describing the positive side. There are also huge challenges. And you can see from the following slide what is most challenging...
----- Besprechungsnotizen (03.04.12 16:44) ----- Or we can turn it the other way round: We have a known source whom the reporter trusts...