Twinder: A Search Engine for Twitter Streams

                                   #ICWE2012 Berlin, Germany
                                              July 25th, 2012




     Ke Tao, Fabian Abel, Claudia Hauff, Geert-Jan Houben
                      Web Information Systems, TU Delft



Get information from Twitter
How do people use Twitter as a source of information?

•  Twitter is used more like a news medium.


•  How do people search Twitter?
   •  Search on Twitter (Teevan et al.)



                             Web Search     Twitter Search
      Query length (chars)      18.80           12.00
      Query length (words)       3.08            1.64
      Is a celebrity name        3.11%          15.22%



           Twinder: A search engine for Twitter streams           2
Research Questions
What are the challenges we are facing?

 1. Given a topic, can we identify the relevant tweets
 based on the characteristics of the tweets?


 2. Are semantics meaningful for determining the tweets’
 relevance for a topic?


 3. How can we design an architecture that is scalable?




Search on Twitter
What can you do on Twitter today?


• Twitter search interface

• Ordered by time

• Keyword-based match




Twinder = TWeet + fINDER
Our solution - Architecture
[Figure: Twinder architecture.
 Users send queries and feedback to the Search User Interface and receive results.
 Twinder Search Engine: Relevance Estimation; Feature Extraction; Index; Feature Extraction Task Broker.
 Topic-insensitive features: Syntactical Features, Contextual Features, Semantic Features.
 Topic-sensitive features: Keyword-based Relevance, Semantics-based Relevance.
 The Task Broker sends feature extraction tasks to the Cloud Computing Infrastructure.
 Messages arrive from Social Web Streams.]

Core Components of Twinder
Feature Extraction

 •  Receive Twitter messages from Social Web Streams


 •  Features of two categories:
    •  (1) Topic-sensitive features

    •  (2) Topic-insensitive features



 •  Different extracting strategies designed for different features




Core Components of Twinder
Feature Extraction Task Broker

 •  Twinder makes use of MapReduce and cloud computing
    infrastructures to allow for high scalability and frequent
    updates of its multifaceted index.


 •  The Feature Extraction Task Broker dispatches feature
    extraction tasks and indexing tasks to the cloud computing
    infrastructure.
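
The map-and-reduce split behind such an index can be sketched in a few lines. This is a minimal single-process illustration, assuming a plain inverted index over whitespace tokens; the function names (`map_tweet`, `reduce_postings`, `build_index`) are hypothetical, and this is not Twinder's actual MapReduce job.

```python
from collections import defaultdict


def map_tweet(tweet_id, text):
    """Map step: emit one (term, tweet_id) pair per distinct token."""
    return [(term, tweet_id) for term in sorted(set(text.lower().split()))]


def reduce_postings(term, tweet_ids):
    """Reduce step: merge all tweet ids for a term into a sorted posting list."""
    return term, sorted(set(tweet_ids))


def build_index(tweets):
    """Run the map step, group pairs by term (the 'shuffle'), then reduce."""
    grouped = defaultdict(list)
    for tweet_id, text in tweets.items():
        for term, tid in map_tweet(tweet_id, text):
            grouped[term].append(tid)
    return dict(reduce_postings(term, ids) for term, ids in grouped.items())


# Toy input; ids and texts are made up for illustration.
tweets = {1: "Hu Jintao visits the US", 2: "US state dinner tonight"}
index = build_index(tweets)
print(index["us"])
```

On a real cluster the map and reduce calls would run on separate workers; here they run in one loop only to show the data flow.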




Core Components of Twinder
Relevance Estimation

 •  Accepts search queries from the front-end and passes them to
    the Feature Extraction component.


 •  The Relevance Estimation component classifies tweets as
    relevant or non-relevant and delivers them to the
    front-end for rendering.


 •  Twinder can learn the classification model, initially from
    a training dataset, then from usage data.




Efficiency of Indexing
How well does Twinder make use of cloud-computing
infrastructure?


Corpus size        Mainstream server   EMR (10 instances)
100k (13 MB)       0.4 min             5 min
1m (122 MB)        5 min               8 min
10m (1.3 GB)       48 min              19 min
32m (3.9 GB)       283 min             47 min
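
The trade-off in these timings is easier to see as a speedup ratio. The sketch below simply divides the single-server time by the EMR time for each corpus size; the dictionary layout and the name `speedup` are illustrative choices, and the numbers are copied from the table.

```python
# Indexing times in minutes: (mainstream server, EMR with 10 instances).
timings = {
    "100k": (0.4, 5),
    "1m": (5, 8),
    "10m": (48, 19),
    "32m": (283, 47),
}


def speedup(server_min, emr_min):
    """Ratio above 1 means the EMR cluster was faster than the single server."""
    return server_min / emr_min


for corpus, (server_min, emr_min) in timings.items():
    print(f"{corpus}: {speedup(server_min, emr_min):.2f}x")
```

For the two smaller corpora the ratio is below 1, so the cluster overhead dominates; the cluster pays off from the 10m corpus onward (about 2.5x) and reaches about 6x at 32m.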




Features of Microposts

 Hypothesis H1: The greater the keyword-
 based relevance score, the more relevant
 and interesting the tweet is to the topic.

 We already have keyword-based relevance,
 and…?

   Topic sensitive            Topic insensitive
    Keyword-based
      relevance
          ?
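
The slides do not say which retrieval function produces the keyword-based relevance score. As a stand-in, the sketch below uses a simple term-overlap score (the fraction of query terms found in the tweet); both the scoring rule and the name `keyword_relevance` are assumptions for illustration only.

```python
import re


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]


def keyword_relevance(query, tweet):
    """Fraction of distinct query terms that occur in the tweet (0.0 to 1.0)."""
    query_terms = set(tokenize(query))
    if not query_terms:
        return 0.0
    tweet_terms = set(tokenize(tweet))
    return len(query_terms & tweet_terms) / len(query_terms)


score = keyword_relevance("Hu Jintao visit", "Hu Jintao arrives in Washington")
print(round(score, 2))
```

A production system would more likely use a weighted retrieval model, but the feature plays the same role: a topic-sensitive score that grows with the match between query and tweet.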
Semantic-based relevance
Expand the queries to match more tweets




  Example entities: dbp:United_States, dbp:Hu_Jintao

  Reformulated query is expected to get a
  more accurate retrieval score.

 Hypothesis H2 : The greater the semantic-
 based relevance score, the more relevant
 and interesting the tweet is.
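
One way to picture the expansion step: entities recognised in the query contribute extra terms to the reformulated query. The sketch assumes a hypothetical lookup table `ENTITY_LABELS` that maps a surface form to a DBpedia-style identifier and some alternative labels; the slides do not describe how Twinder links entities or which terms it adds.

```python
# Hypothetical entity table: surface form -> (identifier, alternative labels).
ENTITY_LABELS = {
    "hu jintao": ("dbp:Hu_Jintao", ["president hu", "chinese president"]),
    "us": ("dbp:United_States", ["united states", "usa", "america"]),
}


def expand_query(query):
    """Return the original query terms plus the labels of entities found in it."""
    padded = " " + query.lower() + " "
    terms = query.lower().split()
    entities = []
    for surface, (identifier, labels) in ENTITY_LABELS.items():
        # Whole-word match, so "us" does not fire inside "focus".
        if f" {surface} " in padded:
            entities.append(identifier)
            terms.extend(labels)
    return terms, entities


terms, entities = expand_query("Hu Jintao visit to the US")
print(entities)
print(terms)
```

The expanded term list would then be scored against tweets in the same way as the original keyword query.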
Semantic-based relatedness
Is there a semantic overlap between the query and the tweet?




Example entities: dbp:the_United_States, dbp:Hu_Jintao


Hypothesis H3 : If a tweet is considered to
be semantically related to the query then it
is also relevant and interesting for the user.
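
Read this way, semantic relatedness reduces to a set-overlap test between the entities linked from the query and those linked from the tweet. The sketch assumes entity linking has already happened elsewhere and that the feature is binary; `is_semantically_related` is a hypothetical name.

```python
def is_semantically_related(query_entities, tweet_entities):
    """True if the query and the tweet share at least one entity."""
    return not set(query_entities).isdisjoint(tweet_entities)


query_entities = {"dbp:Hu_Jintao", "dbp:United_States"}
print(is_semantically_related(query_entities, {"dbp:United_States"}))
print(is_semantically_related(query_entities, {"dbp:Lyon"}))
```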


Overview of features
What do we have now?




   Topic sensitive             Topic insensitive
    Keyword-based                      ?
   Semantic-based




Syntactical feature : Hashtag
Is a tweet more relevant if it contains a #hashtag?

  Hypothesis 4: tweets that contain hashtags
  are more likely to be relevant than tweets
  that do not contain hashtags.




Syntactical feature : hasURL
Is a tweet that contains a URL more relevant?

  Hypothesis 5: tweets that contain a URL are
  more likely to be relevant than tweets that
  do not contain a URL.




Syntactical feature : isReply
Is a tweet which is a reply to @somebody more relevant?

  Hypothesis 6: tweets that are formulated as
  a reply to another tweet are less likely to be
  relevant than other tweets.




Syntactical feature : length
Does the length of a tweet influence its relevance for a topic?




  Hypothesis 7: the longer a tweet, the more
  likely it is to be relevant and interesting.
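
The four syntactical features (hasHashtag, hasURL, isReply, length) can all be read off the raw tweet text. The sketch below is one possible extraction, under some assumptions the slides do not confirm: a reply is detected by a leading `@`, length is measured in characters, and the regular expressions are deliberately simple.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
HASHTAG_PATTERN = re.compile(r"#\w+")


def syntactical_features(tweet):
    """Extract the four syntactical features from raw tweet text."""
    return {
        "hasHashtag": bool(HASHTAG_PATTERN.search(tweet)),
        "hasURL": bool(URL_PATTERN.search(tweet)),
        # Assumption: a tweet that starts with @user is a reply.
        "isReply": tweet.lstrip().startswith("@"),
        # Assumption: length is counted in characters, not words.
        "length": len(tweet),
    }


features = syntactical_features("@ktao slides are at http://ktao.nl/ #ICWE2012")
print(features)
```

Because these features ignore the query, they are topic-insensitive and can be computed once per tweet at indexing time.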



Overview of features
 Short summary




    Topic sensitive             Topic insensitive
     Keyword-based              Syntactical features
    Semantic-based                       ?


   Are there further features that
allow for estimating the relevance?

Semantic features
Find semantics in a tweet to estimate the relevance


Entities found in the example tweet:
  dbp:Tim_Berners-Lee
  dbp:World_Wide_Web
  dbp:France
  dbp:Lyon
  dbp:International_World_Wide_Web_Conference




Semantic features : #entity
Is a tweet with more entities more interesting?


• 5 entities extracted.




  Hypothesis 8: the more entities a tweet
  mentions, the more likely it is to be relevant
  and interesting.
Semantic features : diversity
How many types are there in the entities?


• 4 types of entities




  Hypothesis 9: the greater the diversity of
  concepts mentioned in a tweet, the more
  likely it is to be interesting and relevant.
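
Both the entity count and the diversity can be derived from a tweet's annotated entities. In the sketch below the five identifiers come from the example tweet, but the type labels are assumed for illustration (the slides only state that there are 4 types), as is the choice to count distinct entities rather than mentions.

```python
# (identifier, type) pairs; the type labels are assumptions, not from the slides.
entities = [
    ("dbp:Tim_Berners-Lee", "Person"),
    ("dbp:World_Wide_Web", "Technology"),
    ("dbp:France", "Place"),
    ("dbp:Lyon", "Place"),
    ("dbp:International_World_Wide_Web_Conference", "Event"),
]


def entity_count(annotated):
    """Number of distinct entities mentioned in the tweet."""
    return len({identifier for identifier, _ in annotated})


def diversity(annotated):
    """Number of distinct entity types mentioned in the tweet."""
    return len({entity_type for _, entity_type in annotated})


print(entity_count(entities), diversity(entities))
```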
Semantic features : sentiment
Was the author of the tweet happy or not?


• Sentiment : Neutral




  Hypothesis 10: the likelihood of a tweet’s
  relevance is influenced by its sentiment
  polarity.
Overview of features
By now, we have 4 types of features.




    Topic sensitive              Topic insensitive
     Keyword-based                  Syntactical
    Semantic-based                  Semantics



       Can we utilize the contextual
          information of tweets?
Contextual features
Does the number of followers influence the relatedness?




    Hypothesis 11: The higher the number of
    followers a creator of a message has, the
    more likely it is that her tweets are relevant.


Contextual features
Or the number of lists in which the author appears?

 Hypothesis 12: The higher the number of
 lists in which the creator of a message
 appears, the more likely it is that her tweets
 are relevant.




Contextual features
How long has the author been on Twitter?

 Hypothesis 13: The older the Twitter
 account of a user, the more likely it is that
 her tweets are relevant.




         Signed up                    Post


   July 2008                    June 2012
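
The account-age feature is the time between sign-up and posting. A minimal sketch, assuming the age is measured in days (the slides do not state the unit); the months are taken from the example, while the day of the month is a placeholder.

```python
from datetime import date


def account_age_days(signed_up, posted):
    """Age of the account, in days, at the moment the tweet was posted."""
    return (posted - signed_up).days


# July 2008 and June 2012 come from the example; the 1st is a placeholder day.
age = account_age_days(date(2008, 7, 1), date(2012, 6, 1))
print(age)
```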




Summary of Features
The features




    Topic sensitive             Topic insensitive
     Keyword-based                 Syntactical
    Semantic-based                 Semantics
                                   Contextual




Analysis

• Research Questions:
  1.  Which features are most influential in predicting the
      relatedness of a tweet to a given topic?
  2.  Which types of features are more important? Are
      semantics meaningful?
  3.  What’s the performance that we can achieve by utilizing
      these features?

• Twinder Setup
  •  Consider the search problem as a classification task
  •  Classification algorithm = Logistic Regression
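
Treating search as classification means each (topic, tweet) pair becomes a feature vector with a relevant or non-relevant label. The sketch below fits a logistic regression with plain batch gradient descent on made-up toy data; the feature choice, the training data, and the hyperparameters are all assumptions, and the slides do not say which toolkit or training procedure Twinder uses.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=2000, learning_rate=0.5):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n_features = len(samples[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(samples, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - learning_rate * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias


def predict(weights, bias, x):
    """Classify a feature vector as relevant (True) or non-relevant (False)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5


# Toy vectors: [keyword relevance, hasURL, isReply]; label 1 means relevant.
X = [[0.9, 1, 0], [0.8, 1, 0], [0.7, 0, 0], [0.2, 0, 1], [0.1, 0, 1], [0.3, 1, 1]]
y = [1, 1, 1, 0, 0, 0]
weights, bias = train(X, y)
print(predict(weights, bias, [0.85, 1, 0]), predict(weights, bias, [0.15, 0, 1]))
```

The learned weights are what make this model convenient for feature analysis: the sign and magnitude of each weight indicate how a feature pushes the relevance decision.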



Dataset
From TREC 2011 Microblog Track


•  Twitter corpus
   •  16 million tweets (Jan. 24th – Feb. 8th, 2011)
   •  4,766,901 tweets classified as English
   •  6.2 million entity-extractions (140k distinct entities)

•  Relevance judgments
   •  49 topics
   •  40,855 (topic, tweet) pairs
   •  60.31 relevant tweets per topic (on average)
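
The relevance judgments imply a strongly imbalanced classification problem. The short calculation below derives the approximate number of relevant pairs and their share of all judgments from the three figures reported above; the variable names are illustrative.

```python
topics = 49
judged_pairs = 40_855
relevant_per_topic = 60.31  # average, as reported

relevant_total = topics * relevant_per_topic
relevant_share = relevant_total / judged_pairs
print(f"about {relevant_total:.0f} relevant pairs, {relevant_share:.1%} of all judgments")
```

Only roughly 7% of the judged (topic, tweet) pairs are relevant, which is worth keeping in mind when reading precision and recall figures for this dataset.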




Results
Which type of features matters?
Features              Precision    Recall    F-measure
keyword relevance       0.3036     0.2851     0.2940
semantic relevance      0.3050     0.3294     0.3167
topic-sensitive         0.3135     0.3252     0.3192
topic-insensitive       0.1956     0.0064     0.0123
without semantics       0.3363     0.4618     0.3965
without sentiment       0.3701     0.3923     0.4048
without context         0.3827     0.4714     0.4225
all features            0.3674     0.4736     0.4138

 Overall, applying all the features yields a
 precision of over 35% and a recall of over
 45%.
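
For reference, the sketch below computes the balanced F-measure (F1), the harmonic mean of precision and recall. That the slides use this particular variant is an assumption. The slides also do not say how the figures were averaged, so values recomputed from the rounded precision and recall may differ slightly from the reported F-measure.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall from the "keyword relevance" and "all features" rows.
print(round(f_measure(0.3036, 0.2851), 3))
print(round(f_measure(0.3674, 0.4736), 3))
```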
Weights of features
Which feature matters?

[Figure: bar charts of the learned feature weights, y-axis from -1 to 2, one panel per feature type.
 Keyword-based: keyword-based relevance
 Semantic-based: relevance, relatedness
 Syntactical: hasHashtag, hasURL, isReply, length
 Semantics: #entities, diversity, sentiment
 Contextual: #followers, #lists, age]
Topics of different categories
The impact on the performance and models

•  The 49 topics are split into two groups along each of three dimensions:
   •  Popularity
   •  Global vs. local
   •  Temporal persistence

•  Popularity
   •  Higher recall for popular topics
   •  Less impact from sentiment features on unpopular topics

•  Temporal persistence
   •  Higher performance on shorter-term topics
   •  Less impact from sentiment features on persistent topics



Conclusions
What are our contributions?

1.  Twinder search engine proposed: analyzing various
    features to determine the relevance and interestingness of
    Twitter messages for a given topic.

2.  Scalability demonstrated for the Twinder search engine.

3.  Extensive analysis of 13 features along two dimensions:
    topic-sensitive features and topic-insensitive features.




Conclusions
The lessons learned

1.  The learned models which take advantage of semantics
    and topic-sensitive features outperform those which do not
    take the semantics and topic-sensitive features into
    account.

2.  Contextual features that characterize the users who are
    posting the messages have little impact on the relevance
    estimation.

3.  The importance of a feature differs depending on the topic
    characteristics; for example, the sentiment-based features
    are more important for popular than for unpopular topics.



QUESTIONS?



July 25th, 2012

k.tao@tudelft.nl       http://ktao.nl/


THANK YOU!


 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

Twinder: A Search Engine for Twitter Streams

  • 1. Twinder: A Search Engine for Twitter Streams. #ICWE2012, Berlin, Germany, July 25th, 2012. Ke Tao, Fabian Abel, Claudia Hauff, Geert-Jan Houben. Web Information Systems, TU Delft.
  • 2. Get information from Twitter: How do people use Twitter as a source of information?
       •  Twitter is more like a news medium.
       •  How do people search Twitter? See "Search on Twitter" (Teevan et al.):

                                 Web Search   Twitter Search
          Query length (chars)     18.80         12.00
          Query length (words)      3.08          1.64
          Is a celebrity name       3.11%        15.22%
  • 3. Research Questions: What are the challenges we are facing?
       1. Given a topic, can we identify the relevant tweets based on the characteristics of the tweets?
       2. Are semantics meaningful for determining a tweet's relevance to a topic?
       3. How can we design an architecture that is scalable?
  • 4. Search on Twitter: What can you do on Twitter today?
       •  Twitter search interface
       •  Results ordered by time
       •  Keyword-based matching
  • 5. Twinder = TWeet + fINDER. Our solution: Architecture
       [Architecture diagram. Recoverable labels: Search User Interface (query, feedback, results); Relevance Estimation; Feature Extraction (keyword-based relevance, semantics-based relevance, syntactical features, semantic features, contextual features); Feature Extraction Task Broker (feature extraction tasks); Index; Cloud Computing Infrastructure; messages arriving from Social Web Streams.]
  • 6. Core Components of Twinder: Feature Extraction
       •  Receives Twitter messages from Social Web Streams.
       •  Features fall into two categories:
          •  (1) Topic-sensitive features
          •  (2) Topic-insensitive features
       •  Different extraction strategies are designed for different features.
  • 7. Core Components of Twinder: Feature Extraction Task Broker
       •  Twinder makes use of MapReduce and cloud computing infrastructures to allow for high scalability and frequent updates of its multifaceted index.
       •  The Feature Extraction Task Broker dispatches feature extraction tasks and indexing tasks to the cloud computing infrastructure.
  • 8. Core Components of Twinder: Relevance Estimation
       •  Accepts search queries from the front-end and passes them to the Feature Extraction component.
       •  The Relevance Estimation component classifies tweets as relevant or non-relevant and delivers them to the front-end for rendering.
       •  Twinder can learn the classification model, initially from a training dataset, then from usage data.
  • 9. Efficiency of Indexing: How well does Twinder make use of cloud-computing infrastructure?

          Corpus size          Mainstream server   EMR (10 instances)
          100k (13 MBytes)          0.4 min              5 min
          1m (122 MBytes)           5 min                8 min
          10m (1.3 GBytes)         48 min               19 min
          32m (3.9 GBytes)        283 min               47 min
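The indexing times on slide 9 can be turned into speedup ratios with a little arithmetic. The sketch below only restates the figures from the table; the names `INDEXING_MINUTES` and `speedup` are illustrative and do not come from the slides.

```python
# Indexing times in minutes, copied from the slide 9 table.
# Keys are corpus sizes in tweets; values are
# (mainstream server, EMR with 10 instances).
INDEXING_MINUTES = {
    100_000: (0.4, 5),
    1_000_000: (5, 8),
    10_000_000: (48, 19),
    32_000_000: (283, 47),
}


def speedup(server_min: float, emr_min: float) -> float:
    """Return how many times faster the EMR run was than the single server."""
    return server_min / emr_min


if __name__ == "__main__":
    for size, (server, emr) in INDEXING_MINUTES.items():
        print(f"{size:>11,} tweets: speedup {speedup(server, emr):.2f}x")
```

For the two smaller corpora the ratio is below 1, so the single server is faster. The cluster pulls ahead at 10m tweets and is roughly six times faster at 32m tweets.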
  • 10. Features of Microposts. Hypothesis H1: The greater the keyword-based relevance score, the more relevant and interesting the tweet is to the topic. We already have keyword-based relevance, and…?

          Topic sensitive            Topic insensitive
          Keyword-based relevance    ?
  • 11. Semantic-based relevance: Expand the queries to match more tweets. The reformulated query is expected to get a more accurate retrieval score. Example entities: dbp:United_States, dbp:Hu_Jintao. Hypothesis H2: The greater the semantic-based relevance score, the more relevant and interesting the tweet is.
  • 12. Semantic-based relatedness: Is there a semantic overlap between the query and the tweet? Example entities: dbp:the_United_States, dbp:Hu_Jintao. Hypothesis H3: If a tweet is considered to be semantically related to the query, then it is also relevant and interesting for the user.
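One way to read the "semantic overlap" on slide 12 is as a binary feature that fires when the query and the tweet share at least one extracted entity. The following is a minimal sketch under that assumption. The function name and the example entity sets are hypothetical, and entity extraction itself is assumed to happen elsewhere.

```python
def semantically_related(query_entities: set[str], tweet_entities: set[str]) -> bool:
    """Return True if the query and the tweet share at least one entity URI.

    Assumption: both sets hold DBpedia-style identifiers produced by some
    upstream entity-extraction step that is not shown here.
    """
    return not query_entities.isdisjoint(tweet_entities)


if __name__ == "__main__":
    query = {"dbp:Hu_Jintao", "dbp:United_States"}
    tweet = {"dbp:Hu_Jintao", "dbp:White_House"}  # hypothetical extraction result
    print(semantically_related(query, tweet))
```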
  • 13. Overview of features: What do we have now?

          Topic sensitive    Topic insensitive
          Keyword-based      ?
          Semantic-based
  • 14. Syntactical feature: hasHashtag. Is a tweet more relevant if it contains a #hashtag? Hypothesis H4: Tweets that contain hashtags are more likely to be relevant than tweets that do not contain hashtags.
  • 15. Syntactical feature: hasURL. Is a tweet that contains a URL more relevant? Hypothesis H5: Tweets that contain a URL are more likely to be relevant than tweets that do not contain a URL.
  • 16. Syntactical feature: isReply. Is a tweet that is a reply to @somebody more relevant? Hypothesis H6: Tweets that are formulated as a reply to another tweet are less likely to be relevant than other tweets.
  • 17. Syntactical feature: length. Does the length of a tweet influence its relevance for a topic? Hypothesis H7: The longer a tweet, the more likely it is to be relevant and interesting.
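The four syntactical features from slides 14 to 17 can be read directly off the tweet text. The sketch below is one possible implementation, not the authors' code. The regular expressions, the decision to measure length in characters, and the rule that a reply starts with "@" are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Assumed patterns; the slides do not specify how URLs or hashtags are detected.
URL_PATTERN = re.compile(r"https?://\S+")
HASHTAG_PATTERN = re.compile(r"#\w+")


@dataclass
class SyntacticalFeatures:
    has_hashtag: bool
    has_url: bool
    is_reply: bool
    length: int


def extract_syntactical_features(text: str) -> SyntacticalFeatures:
    """Compute hasHashtag, hasURL, isReply and length for one tweet."""
    # Strip URLs before looking for hashtags so that a URL fragment
    # such as "/#top" is not counted as a hashtag.
    without_urls = URL_PATTERN.sub("", text)
    return SyntacticalFeatures(
        has_hashtag=bool(HASHTAG_PATTERN.search(without_urls)),
        has_url=bool(URL_PATTERN.search(text)),
        is_reply=text.lstrip().startswith("@"),  # assumed reply convention
        length=len(text),  # assumed: length in characters
    )


if __name__ == "__main__":
    print(extract_syntactical_features("@bob great talk at #ICWE2012 http://ktao.nl/"))
```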
  • 18. Overview of features: Short summary

          Topic sensitive    Topic insensitive
          Keyword-based      Syntactical features
          Semantic-based     ?

       Are there further features that allow for estimating the relevance?
  • 19. Semantic features: Find semantics in a tweet to estimate the relevance. Example entities: dbp:Tim_Berners-Lee, dbp:World_Wide_Web, dbp:France, dbp:Lyon, dbp:International_World_Wide_Web_Conference.
  • 20. Semantic features: #entities. Is a tweet with more entities more interesting? Example: 5 entities extracted. Hypothesis H8: The more entities a tweet mentions, the more likely it is to be relevant and interesting.
  • 21. Semantic features: diversity. How many types are there among the entities? Example: 4 types of entities. Hypothesis H9: The greater the diversity of concepts mentioned in a tweet, the more likely it is to be interesting and relevant.
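Given a list of extracted entities with their types, the #entities and diversity features reduce to two counts. The sketch below uses the entities from slide 19, but the type labels attached to them are hypothetical (the transcript does not list the types) and were chosen only so that the example yields the 5 entities and 4 types mentioned on slides 20 and 21. Counting distinct entities rather than mentions is also an assumption.

```python
def entity_features(entities: list[tuple[str, str]]) -> tuple[int, int]:
    """Return (#entities, diversity) for one tweet.

    `entities` is a list of (entity_uri, entity_type) pairs. Diversity is
    taken here to be the number of distinct entity types.
    """
    distinct_entities = {uri for uri, _ in entities}
    distinct_types = {etype for _, etype in entities}
    return len(distinct_entities), len(distinct_types)


# Entity URIs come from slide 19; the type labels are illustrative assumptions.
EXAMPLE_ENTITIES = [
    ("dbp:Tim_Berners-Lee", "Person"),
    ("dbp:World_Wide_Web", "Technology"),
    ("dbp:France", "Place"),
    ("dbp:Lyon", "Place"),
    ("dbp:International_World_Wide_Web_Conference", "Event"),
]

if __name__ == "__main__":
    print(entity_features(EXAMPLE_ENTITIES))
```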
  • 22. Semantic features: sentiment. Was the author of the tweet happy or not? Example: sentiment is neutral. Hypothesis H10: The likelihood of a tweet's relevance is influenced by its sentiment polarity.
  • 23. Overview of features: By now, we have 4 types of features.

          Topic sensitive    Topic insensitive
          Keyword-based      Syntactical
          Semantic-based     Semantics

       Can we utilize the contextual information of tweets?
  • 24. Contextual features: Does the number of followers influence the relevance? Hypothesis H11: The higher the number of followers a creator of a message has, the more likely it is that her tweets are relevant.
  • 25. Contextual features: Or the number of lists in which the author appears? Hypothesis H12: The higher the number of lists in which the creator of a message appears, the more likely it is that her tweets are relevant.
  • 26. Contextual features: How long has the author been on Twitter? Hypothesis H13: The older the Twitter account of a user, the more likely it is that her tweets are relevant. Example: signed up July 2008, post in June 2012.
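The account-age feature from slide 26 is the time between sign-up and the posting of the tweet. A minimal sketch follows. The slide gives only months, so the day-of-month values in the example and the choice of days as the unit are assumptions.

```python
from datetime import date


def account_age_days(signed_up: date, posted: date) -> int:
    """Return the account age in days at the time the tweet was posted."""
    return (posted - signed_up).days


if __name__ == "__main__":
    # Slide 26 example: signed up July 2008, post in June 2012.
    # The first of each month is assumed because the slide omits the day.
    print(account_age_days(date(2008, 7, 1), date(2012, 6, 1)))
```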
  • 27. Summary of Features: The features

          Topic sensitive    Topic insensitive
          Keyword-based      Syntactical
          Semantic-based     Semantics
                             Contextual
  • 28. Analysis
       •  Research Questions:
          1.  Which features are more influential in predicting the relevance of a tweet to a certain topic?
          2.  Which types of features are more important? Are semantics meaningful?
          3.  What performance can we achieve by utilizing these features?
       •  Twinder Setup
          •  Consider the search problem as a classification task.
          •  Classification algorithm = Logistic Regression
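Treating search as classification means each (topic, tweet) pair becomes a feature vector and the model outputs a relevance probability. The sketch below is a plain-Python logistic regression trained with stochastic gradient descent on toy data. It illustrates the technique only. The two features, the toy values, the learning rate and the epoch count are assumptions, and the slides do not say which implementation or training procedure was used.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights: list[float], bias: float, features: list[float]) -> float:
    """Probability that a tweet is relevant, given its feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def train(examples: list[list[float]], labels: list[int],
          epochs: int = 2000, lr: float = 0.5) -> tuple[list[float], float]:
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    weights, bias = [0.0] * len(examples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            error = predict_proba(weights, bias, x) - y  # gradient of log loss w.r.t. z
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Toy data (hypothetical): [keyword-based relevance score, hasURL].
TOY_X = [[0.9, 1], [0.8, 1], [0.7, 0], [0.2, 0], [0.1, 1], [0.3, 0]]
TOY_Y = [1, 1, 1, 0, 0, 0]

if __name__ == "__main__":
    w, b = train(TOY_X, TOY_Y)
    print("weights:", w, "bias:", b)
    print("P(relevant):", predict_proba(w, b, [0.85, 1]))
```

The sign and size of each learned weight show how strongly a feature pushes a tweet towards the relevant class, which is the kind of per-feature reading the deck reports on slide 31.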
  • 29. Dataset: From the TREC 2011 Microblog Track
       •  Twitter corpus
          •  16 million tweets (Jan. 24th, 2011 – Feb. 8th)
          •  4,766,901 tweets classified as English
          •  6.2 million entity extractions (140k distinct entities)
       •  Relevance judgments
          •  49 topics
          •  40,855 (topic, tweet) pairs
          •  60.31 relevant tweets per topic (on average)
  • 30. Results: Which type of features matters?

          Features              Precision   Recall   F-measure
          keyword relevance      0.3036     0.2851    0.2940
          semantic relevance     0.3050     0.3294    0.3167
          topic-sensitive        0.3135     0.3252    0.3192
          topic-insensitive      0.1956     0.0064    0.0123
          without semantics      0.3363     0.4618    0.3965
          without sentiment      0.3701     0.3923    0.4048
          without context        0.3827     0.4714    0.4225
          all features           0.3674     0.4736    0.4138

       Overall, we can achieve precision and recall of over 35% and 45% respectively by applying all the features.
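The F-measure column on slide 30 is presumably the balanced F1 score, the harmonic mean of precision and recall. That is an assumption, since the slides do not define it. The helper below (the name `f_measure` is illustrative) computes it from the reported precision and recall.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # "all features" row from slide 30.
    print(round(f_measure(0.3674, 0.4736), 4))
```

For the "all features" row this gives about 0.4138, in line with the table. The slides do not state how scores were averaged across the 49 topics, so other rows need not match this formula exactly.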
  • 31. Weights of features: Which feature matters?
       [Bar charts of the learned feature weights, y-axis from -1 to 2, one panel per feature group: Keyword-based (keyword-based relevance); Semantic-based (relevance, relatedness); Syntactical (hasHashtag, hasURL, isReply, length); Semantics (#entities, diversity, sentiment); Contextual (#followers, #lists, age). The bar values are not recoverable from the transcript.]
  • 32. Topics of different categories: The impact on the performance and models
       •  49 topics split into two groups along each of 3 dimensions:
          •  Popularity
          •  Global vs. local
          •  Temporal persistence
       •  Popularity
          •  Higher recall for popular topics
          •  Less impact from sentiment features on unpopular topics
       •  Temporal persistence
          •  Higher performance on shorter-term topics
          •  Less impact from sentiment features on persistent topics
  • 33. Conclusions: What are our contributions?
       1.  The Twinder search engine is proposed: it analyzes various features to determine the relevance and interestingness of Twitter messages for a given topic.
       2.  Scalability is demonstrated for the Twinder search engine.
       3.  Extensive analysis of 13 features along two dimensions: topic-sensitive features and topic-insensitive features.
  • 34. Conclusions: The lessons learned
       1.  Learned models that take advantage of semantics and topic-sensitive features outperform those that do not.
       2.  Contextual features that characterize the users who post the messages have little impact on the relevance estimation.
       3.  The importance of a feature differs depending on the topic characteristics; for example, the sentiment-based features are more important for popular than for unpopular topics.
  • 35. QUESTIONS? THANK YOU! July 25th, 2012. k.tao@tudelft.nl, http://ktao.nl/