Digital Enterprise Research Institute                                                           www.deri.ie




                            Augmenting Social Media Items with
                            Metadata using Related Web Content

                                                                              Sheila Kinsella




 Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
Outline




              Motivation
              Example Scenario
              Tag prediction                  Approach
              Geolocation                     Evaluation
              Topic classification            Summary

        Combining the approaches
        Impact
        Conclusions




Motivation




              Social media is an important information source
                     e.g., real time citizen journalism, Q&A sites, niche topics
              Search and navigation can be challenging
                     Short and informal posts
                     Items are not curated and often lack metadata
                     Users conversing share a common context and therefore omit
                      relevant information, e.g. location
              Making use of related Web data can help us to infer
               such context information
                     e.g., hyperlinks, posts with similar content




Example Scenario:
       Adding metadata to a blog post



                                        Last night I saw Connacht
                                        play at The Sportsground.
          tags?                         The match started well for
                                        Connacht with a great try but
                                        after half time the opposition
                                        closed the gap. Finally we
                                        managed to hold out for the      topic?
                                        win. It was a great game
          location?                     from both sides. Here's a clip
                                        of the first try.




Example Scenario:
       Possible clues from content



                                        Last night I saw Connacht

       tags?                            play at The Sportsground.
                                        The match started well for
                                        Connacht with a great try but
                                        after half time the opposition
                                        closed the gap. Finally we
                                        managed to hold out for the      topic?
                                        win. It was a great game
          location?                     from both sides. Here's a clip
                                        of the first try.




Example Scenario:
       Exploiting related Web content



   ...didn’t see                        Last night I saw Connacht
   the match but                        play at The Sportsground.
   here’s      a                        The match started well for
   summary       href                   Connacht with a great try but
   from John..                          after half time the opposition
                                        closed the gap. Finally we
                                        managed to hold out for the
  ..............This                    win. It was a great game
  review of the                         from both sides. Here's a clip
  Connacht           href               of the first try.
  match shows
  that they are
  getting back in
  form!......
                                   tags from anchortext
Example Scenario:
       Exploiting related Web content



                                        Last night I saw Connacht
                                        play at The Sportsground.
                                        The match started well for
                                        Connacht with a great try but
                                                                         location
                                        after half time the opposition   from
                                        closed the gap. Finally we
                                        managed to hold out for the      geotagged
                                        win. It was a great game
                                        from both sides. Here's a clip   social
                                        of the first try.
                                                                         media
                                            JohnSmith John Smith
                                            I’m at the Galway
                                            Sportsground


Example Scenario:
       Exploiting related Web content



                                        Last night I saw Connacht
                                        play at The Sportsground.                 YouTube
                                        The match started well for              Title:
                                                                                Fionn Carr try
                                        Connacht with a great try but
                                        after half time the opposition
                                        closed the gap. Finally we
                                        managed to hold out for the             Category:
                                                                         href
                                        win. It was a great game                Sport
                                        from both sides. Here's a clip          Tags:
                                                                                rugby, try, carr,
                                        of the first try.
                                                                                connacht



   topic from hyperlinked objects
Example Scenario:
       Overview of approaches



   ...didn’t see                        Last night I saw Connacht
   the match but                        play at The Sportsground.                  YouTube
   here’s      a                        The match started well for               Title:
                                                                                 Fionn Carr try
   summary       href                   Connacht with a great try but
   from John..                          after half time the opposition
                                        closed the gap. Finally we       href
                                        managed to hold out for the              Category:
  ..............This                    win. It was a great game                 Sport
  review of the                         from both sides. Here's a clip           Tags:
  Connacht           href                                                        rugby, try, carr,
                                        of the first try.
  match shows                                                                    connacht
  that they are
  getting back in                            JohnSmith John Smith
  form!......                                I’m at the Galway
                                             Sportsground
TAG PREDICTION                                GEOLOCATION                       TOPIC CLASSIFICATION
Tag Prediction: Approach




              Aim: Automatic tag generation based on anchortext
       1.      Data collection and preprocessing
                   Retrieve document and extract META information
                   Retrieve inlinking documents and extract anchortext
                   Preprocessing (e.g. stemming, stopword removal)
       2.      Tag indexing and ranking
                   Generate term vectors from the preprocessed annotations
                   Ranking: tf and tf-idf
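
The two ranking options above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the stopword list, the tokenizer, the function names and the choice of log(N/df) as the idf term are all assumptions, and stemming is omitted.

```python
import math
from collections import Counter

# Hypothetical stopword list; the slides do not say which one was used.
STOPWORDS = {"a", "the", "of", "from", "here", "this"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    tokens = (t.strip(".,!?:;()\"'").lower() for t in text.split())
    return [t for t in tokens if t and t not in STOPWORDS]

def rank_tags(anchors_by_doc, doc_id, k=5, use_idf=True):
    """Rank candidate tags for doc_id from the anchortext of its inlinks."""
    # Term frequency over all anchortext pointing at this document.
    tf = Counter(t for a in anchors_by_doc[doc_id] for t in tokenize(a))
    n_docs = len(anchors_by_doc)
    # Document frequency: in how many target documents' anchortext a term occurs.
    df = Counter()
    for anchors in anchors_by_doc.values():
        df.update({t for a in anchors for t in tokenize(a)})

    def score(term):
        if not use_idf:
            return tf[term]
        return tf[term] * math.log(n_docs / df[term])

    # Sort by descending score, breaking ties alphabetically.
    return sorted(tf, key=lambda t: (-score(t), t))[:k]

anchors = {
    "post1": ["summary from John", "review of the Connacht match",
              "Connacht match review"],
    "post2": ["match summary"],
    "post3": ["recipe of the day"],
}
print(rank_tags(anchors, "post1", k=3, use_idf=False))
print(rank_tags(anchors, "post1", k=3, use_idf=True))
```

With tf-idf, terms such as "match" that also appear in anchortext for other documents are pushed down the ranking relative to plain tf.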




Tag Prediction: Evaluation (1)




              Datasets:
                      Web: WEBSPAM-2007, 12M pages from .uk domain
                      Delicious: 2007 Crawl containing tags for 4.5M URLs
                      Overlap between datasets: 192k URLs


              Goals:
                      Compare overlap of predicted tags and delicious tags
                      Assess relevance of predicted tags and relevance of delicious
                       tags




Tag Prediction: Evaluation (2)




              Automatic Evaluation
                      Relative precision@k (Average proportion of predicted tags
                       that are also among delicious tags)

                                        k=1    k=2         k=3    k=4    k=5
                                        0.48   0.45        0.42   0.39   0.37

                      Relative recall@k (Average proportion of delicious tags that
                       can also be inferred from anchortext)

                                        k=1    k=2         k=3    k=4    k=5
                                        0.41   0.35        0.32   0.29   0.28
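
The two measures above can be sketched as follows. The slides give only the one-line definitions, so the exact cut-offs are assumptions: precision is taken over the top-k predicted tags, and recall over the top-k Delicious tags. All function names are hypothetical.

```python
def relative_precision_at_k(predicted, reference, k):
    """Share of the top-k predicted tags that also occur among the reference tags."""
    top = predicted[:k]
    ref = set(reference)
    return sum(t in ref for t in top) / len(top) if top else 0.0

def relative_recall_at_k(predicted, reference, k):
    """Share of the top-k reference tags that also occur among the predicted terms."""
    top = reference[:k]
    pred = set(predicted)
    return sum(t in pred for t in top) / len(top) if top else 0.0

def mean_over_docs(metric, pairs, k):
    """Average a per-document metric over (predicted, reference) pairs."""
    return sum(metric(p, r, k) for p, r in pairs) / len(pairs)

# Toy data: (tags predicted from anchortext, tags assigned on Delicious).
pairs = [
    (["connacht", "review", "john"], ["rugby", "connacht", "galway"]),
    (["news", "uk"], ["news", "bbc"]),
]
print(mean_over_docs(relative_precision_at_k, pairs, 2))
print(mean_over_docs(relative_recall_at_k, pairs, 2))
```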




Tag Prediction: Evaluation (3)




              Human Evaluation
                      80 documents, each assessed by 3 judges
                       – 0: not relevant; 1: quite relevant; 2: very relevant
                      Evaluator agreement
                       – In 85% of cases, judges at least almost agree
                               –    i.e., two agree and the third differs by just one point
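
The agreement criterion above can be written as a small check. This is a sketch of one reading of the slide, in which full agreement also counts as "at least almost" agreeing; the function name is hypothetical.

```python
def at_least_almost_agree(scores):
    """True if at least two of three judges agree and the scores span at most one point."""
    low, mid, high = sorted(scores)
    return (low == mid or mid == high) and (high - low) <= 1

print(at_least_almost_agree((1, 1, 1)))  # full agreement
print(at_least_almost_agree((0, 1, 1)))  # two agree, third differs by one
print(at_least_almost_agree((0, 0, 2)))  # two agree, third differs by two
print(at_least_almost_agree((0, 1, 2)))  # no two judges agree
```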




Tag Prediction: Evaluation (4)




              Human Evaluation
                     Precision@k (Average proportion of tags judged relevant by
                      evaluators)
                      – Relevance threshold: 1

                                        k=1    k=2     k=3    k=4    k=5
                    Delicious           0.86   0.84    0.82   0.80   0.78
                    predicted           0.78   0.76    0.69   0.67   0.66


                    Not feasible to measure recall




Tag Prediction: Summary




            Substantial overlap between tags assigned on a social
             bookmarking site and terms from anchortext
            Human evaluators rate relevance of terms from
             anchortext as not much lower than tags
            This approach can provide useful and novel
             annotations for untagged social media items, if other
             users link to them with anchortext




Geolocation: Approach (1)




              Aim: Location prediction based on models built from
               geotagged social media
                     Enables detection of implicit location clues such as slang,
                      venues, other terms of local relevance
       1.      Reverse Geocoding
                   Filter geotagged tweets from Twitter stream
                   Reverse-geocode each coordinate to corresponding places
                       – Postal code, City, State, Country
                       – Yahoo! Geoplanet service
                   Aggregate all of the text from each place together for model
                    building




Geolocation: Approach (2)




       2.      Language Modelling
                   Approach from information retrieval – given a query, find the
                    most relevant document in a collection
                   Model each document and query as bag of words
                   For each document, calculate probability that a random sampling
                    would result in the query
                   Based on the intuition that users create queries by guessing
                    words that would occur in the document
                   For our geolocation task: estimates the probability that a random
                    sampling of a location would result in the social media post
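
The query-likelihood idea above can be sketched as follows: each place is a bag of words built from its aggregated text, and a post is assigned to the place whose model is most likely to generate it. The slides do not state which smoothing was used, so Jelinek-Mercer smoothing with lam=0.5 is an assumption here, as are the whitespace tokenizer, the function names and the toy data.

```python
import math
from collections import Counter

def build_models(texts_by_place):
    """Aggregate all text from each place into one bag of words."""
    return {place: Counter(w for t in texts for w in t.lower().split())
            for place, texts in texts_by_place.items()}

def log_query_likelihood(post, model, collection, lam=0.5):
    """log P(post | place) under a unigram model, smoothed with the collection model."""
    doc_len = sum(model.values())
    coll_len = sum(collection.values())
    score = 0.0
    for w in post.lower().split():
        p_doc = model[w] / doc_len if doc_len else 0.0
        p_coll = collection[w] / coll_len if coll_len else 0.0
        p = lam * p_doc + (1 - lam) * p_coll
        if p > 0.0:  # words unseen in every place carry no evidence and are skipped
            score += math.log(p)
    return score

def predict_place(post, models):
    """Return the place whose language model ranks highest for the post."""
    collection = Counter()
    for m in models.values():
        collection.update(m)
    return max(models, key=lambda pl: log_query_likelihood(post, models[pl], collection))

models = build_models({
    "Galway": ["at the sportsground for the connacht match",
               "great craic in eyre square"],
    "Dublin": ["leinster playing at the rds tonight",
               "stuck on the luas again"],
})
print(predict_place("Connacht play at the Sportsground tonight", models))
```

Note that the example post never names Galway; the prediction rests on locally relevant terms such as the team and venue.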




Geolocation: Evaluation (1)




            Dataset
                   Twitter Firehose stream
                   7.3 million geotagged tweets posted during Summer 2010
                   Retweets removed, #hashtags and @usernames preserved
                                 Place type    # Tweets    # Distinct places
                                 Country            7.3m                  222
                                 State              7.3m                 2.3k
                                 City               6.3m                72.6k
                                 Postal code        7.2m               104.7k

            Baseline: Yahoo! Placemaker
                   identifies and disambiguates placenames in text and returns the
                    spatial entity most likely to encompass them

Geolocation: Evaluation (2)




            Prediction Methods
                   Trivial Classifier
                       – Each tweet assigned to the most common place in training set
                   Placemaker (Tweet)
                       – Each tweet is submitted to Placemaker and the most probable
                         candidate is selected. Allows detection of explicit geographic
                         references in the tweet
                   Language Model
                       – Locations are ranked according to their query likelihood and the
                         location whose model ranks highest is selected
                   Placemaker (Location)
                       – The location field from the tweet is submitted to Placemaker and the
                         most probable candidate is selected. Allows detection of explicit
                         geographic references in the self-reported location
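
The trivial classifier in the list above is simple enough to sketch in full, together with the accuracy measure used to compare the methods. The function names and toy data are hypothetical.

```python
from collections import Counter

def train_trivial(train_places):
    """Return a classifier that always predicts the most common training place."""
    most_common_place = Counter(train_places).most_common(1)[0][0]
    return lambda tweet: most_common_place

def accuracy(classifier, tweets, true_places):
    """Fraction of tweets whose predicted place matches the geotagged place."""
    hits = sum(classifier(t) == p for t, p in zip(tweets, true_places))
    return hits / len(tweets)

clf = train_trivial(["United States", "United States", "Ireland"])
print(accuracy(clf, ["tweet a", "tweet b"], ["United States", "Ireland"]))
```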


Geolocation: Evaluation (3)




            Tweet location prediction accuracy
                   Common location focused services removed

                                         Zip     Town    State   Country
      Trivial Classifier                0.005    0.061   0.060    0.434
      Placemaker (Tweet)                0.018    0.060   0.076    0.120
      Language Model                    0.052    0.217   0.246    0.514
      Placemaker (Location)             0.017    0.269   0.401    0.518




Geolocation: Summary




            Language models of geotagged tweets enable the
             location of non-geotagged items to be predicted
            The approach gives large improvements compared to
             parsing for explicit placenames
                   City level accuracy – 21.7% versus 6%
            The approach can be used to detect implicit
             geographical information in social media posts




Topic Classification: Approach



              Aim: Improve topic classification using structured data from
               hyperlinks
       1.      Identify sources of structured data from hyperlinks
                    Based on domains, e.g., wikipedia.org
       2.      Retrieve structured data for these hyperlinks
                    From Linked Data/APIs, e.g., dbpedia.org
       3.      Perform text classification
                    Requires set of already categorised posts for training
                    Post content and external metadata as sources of textual features
                    Compare accuracy achieved by different metadata types
       4.      Related to IR studies that classify documents based on fielded text
               from hyperlinked pages, but they consider structural rather than
               semantic fields


Topic Classification: Evaluation (1)




            Datasets
                                                         Forum           Twitter
                Data source                     message board    microblogging site
                Ground truth topics             forums           #hashtags
                # classes (topics)              10               6
                # posts                         6,626            2,415

            External data sources
                                  Linked Data                    Web APIs




Topic Classification: Evaluation (2)




            Experimental Setup
                   Multinomial Naïve Bayes classifier (WEKA)
                   10-fold cross-validation
                   Compared classification accuracy for different post
                    representations based on:
                       –   post content
                       –   hyperlinked HTML pages
                       –   hyperlinked object metadata
                       –   combinations of these
                   Experimented to find optimal ways of combining feature vectors
                    (e.g., weightings)
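
The setup above can be illustrated with a small stand-in for the WEKA classifier: a from-scratch multinomial Naive Bayes over a combined bag of words. The weighting scheme (scaling metadata term counts by a constant), the Laplace smoothing, the class and function names and the toy data are all assumptions made for illustration, and cross-validation is omitted.

```python
import math
from collections import Counter, defaultdict

def features(content, metadata, meta_weight=2.0):
    """Combine post content and hyperlinked-object metadata into one weighted bag."""
    bag = Counter(content.lower().split())
    for w in metadata.lower().split():
        bag[w] += meta_weight  # assumed weighting: metadata terms count extra
    return bag

class MultinomialNB:
    def fit(self, bags, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for bag, y in zip(bags, labels):
            self.term_counts[y].update(bag)
        self.vocab = {w for c in self.term_counts.values() for w in c}
        return self

    def predict(self, bag):
        n = sum(self.class_counts.values())

        def log_posterior(y):
            # Laplace-smoothed term probabilities for class y.
            total = sum(self.term_counts[y].values()) + len(self.vocab)
            lp = math.log(self.class_counts[y] / n)
            for w, c in bag.items():
                if w in self.vocab:
                    lp += c * math.log((self.term_counts[y][w] + 1) / total)
            return lp

        return max(self.class_counts, key=log_posterior)

train = [
    ("great try in the match", "rugby connacht sport", "Sport"),
    ("new album out today", "music rock video", "Music"),
    ("what a game last night", "sport football goal", "Sport"),
    ("love this song", "music pop", "Music"),
]
nb = MultinomialNB().fit([features(c, m) for c, m, _ in train],
                         [y for _, _, y in train])
print(nb.predict(features("here is a clip", "rugby try sport")))
```

The test post "here is a clip" says almost nothing about its topic on its own; the prediction comes from the metadata of the linked object.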




Topic Classification: Evaluation (3)




            Results
                       Data Source            Forum       Twitter
                       Content (no URLs)      0.745        0.722
                       Content (with URLs)    0.811        0.759
                       HTML                   0.730        0.645
                       Metadata               0.835        0.683
                       Content + HTML         0.832        0.784
                       Content + Metadata     0.899        0.820
                                              (micro-averaged F1)
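
For reference, micro-averaged F1 pools true positives, false positives and false negatives over all classes before computing precision and recall. A minimal sketch follows, with a hypothetical function name and toy labels; when every post has exactly one true and one predicted topic, the value coincides with plain accuracy.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            tp += (t == label and p == label)
            fp += (t != label and p == label)
            fn += (t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

y_true = ["Sport", "Sport", "Music", "News"]
y_pred = ["Sport", "Music", "Music", "News"]
print(micro_f1(y_true, y_pred))
```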




Topic Classification: Evaluation (4)




            Results – comparing metadata types
                                                      Wikipedia
    Metadata type                       Content (no URLs)   Metadata only   Content+M’data
  Category                                    0.761               0.811         0.851
  Description                                 0.761               0.798         0.850
  Title                                       0.761               0.685         0.809

                                                       YouTube
     Metadata type                      Content (no URLs)   Metadata only   Content+M’data
  Tag                                         0.709               0.838         0.864
  Title                                       0.709               0.773         0.824
  Description                                 0.709               0.752         0.810
  Category                                    0.709               0.514         0.753


Topic Classification: Summary




            Topic classification in social media can be improved by
             making use of structured metadata from hyperlinked
             objects
            The most useful metadata types can be found
             experimentally, but for different objects, the usefulness
             of metadata types varies
            The categories assigned by this approach would allow a
             user to browse social media posts with hyperlinks by
             topic, even if the text of the post itself is not
             sufficient for accurate automatic categorisation of
             the post.


Combining the approaches (1)



            location
                                             topic




                                             tags


Combining the approaches (2)



                                        Last night I watched
                                        Connacht play at The
          tags?                         Sportsground. The match
                                        started well for Connacht
                                        with a great try but after
                                        half time the opposition
                                        closed the gap. Finally       topic?
                                        we managed to hold out
          location?                     for the win. It was a great
                                        game from both sides.
                                        Here's a clip of the first
                                        try.




Combining the approaches (3)

@prefix        ex: <http://example.org/> .
@prefix        rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix        content: <http://purl.org/rss/1.0/modules/content/> .
@prefix        dc: <http://purl.org/dc/terms/> .
@prefix        sioc: <http://rdfs.org/sioc/ns#> .

ex:post1 rdf:type sioc:Post .
ex:post1 content:encoded """Last night I watched Connacht play at The
   Sportsground. The match started well for Connacht with a great try but
   after half time the opposition closed the gap. Finally we managed to hold
   out for the win. It was a great game from both sides. Here's a
   [url='http://www.youtube.com/watch?v=[...]']clip of the first try.[/url]""" .
ex:post1 sioc:links_to <http://www.youtube.com/watch?v=[...]> .
ex:post1 dc:subject "connacht" .
ex:post1 dc:subject "match" .
ex:post1 dc:subject "review" .
ex:post1 dc:subject "summary" .
ex:post1 dc:spatial <http://sws.geonames.org/2964180/> .
ex:post1 sioc:topic <http://www.dmoz.org/Sports/Football/Rugby_Union/> .
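
As a rough sketch of how the outputs of the three approaches could be serialised into triples like those above, the following helper emits Turtle statements for one post. The function name and parameters are hypothetical, the prefixes are assumed to be declared elsewhere as on the slide, and a real system would more likely use an RDF library than string formatting.

```python
def post_to_turtle(post_id, tags, geonames_id, topic_uri):
    """Emit Turtle triples for a post's predicted tags, location and topic."""
    def literal(text):
        # Escape backslashes and double quotes for a Turtle string literal.
        return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

    subject = f"ex:{post_id}"
    lines = [f"{subject} rdf:type sioc:Post ."]
    lines += [f"{subject} dc:subject {literal(tag)} ." for tag in tags]
    lines.append(f"{subject} dc:spatial <http://sws.geonames.org/{geonames_id}/> .")
    lines.append(f"{subject} sioc:topic <{topic_uri}> .")
    return "\n".join(lines)

turtle = post_to_turtle(
    "post1",
    ["connacht", "match"],
    2964180,
    "http://www.dmoz.org/Sports/Football/Rugby_Union/",
)
print(turtle)
```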



Combining the approaches (4)


    Use-case 1: local search
    A blogger is looking for media to enhance a post about the Volvo Ocean
     Race
                    PREFIX       rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
                    PREFIX       sioc: <http://rdfs.org/sioc/ns#>
                    PREFIX       dc: <http://purl.org/dc/terms/>

                    SELECT ?post WHERE {
                      ?post rdf:type sioc:Post .
                      ?post dc:spatial <http://sws.geonames.org/2964180/> .
                      ?post dc:created ?date .
                      FILTER (str(?date) > "2009-05-23T00:00:00") .
                      FILTER (str(?date) < "2009-06-06T23:59:59") .
                      FILTER EXISTS {
                        { ?post dc:subject "volvooceanrace" } UNION
                        { ?post dc:subject "vor" } UNION
                        { ?post dc:subject "oceanrace" } UNION
                        { ?post dc:subject "yacht" }
                      }
                    }




Combining the approaches (5)


    Use-case 2: local browsing
    A sports fan wants to follow conversations about sports in their
     local area
                    PREFIX       rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
                    PREFIX       sioc: <http://rdfs.org/sioc/ns#>
                    PREFIX       dc: <http://purl.org/dc/terms/>
                    PREFIX       gn: <http://www.geonames.org/ontology#>
                    PREFIX       skos: <http://www.w3.org/2004/02/skos/core#>

                    SELECT ?post WHERE {
                      ?post rdf:type sioc:Post .
                      ?post dc:spatial ?location .
                      ?location gn:parentADM1 <http://sws.geonames.org/2963597/> .
                      ?post sioc:topic ?topic .
                      ?topic skos:broader+ <http://www.dmoz.org/Sports/> .
                    }




Impact




            5 conference papers
                   ESWC 2011, ECIR 2011, I-Semantics 2010, IV 2008, ASNA 2008
            2 workshop papers
                   WIDM @ CIKM 2008, SMUC @ CIKM 2011
            2 book chapters
                   Advances in Computers 76 (Elsevier)
                   Reasoning Web (Springer)
            Tutorial
                   "Combining the Social and the Semantic Web", ESWC 2011




Summary




            Proposed approaches for automatically generating
             metadata for social media posts using related Web
             content
                   Tags, location and topic
            Evaluated the accuracy of each approach
            Illustrated how the approaches can be used in
             combination in order to semantically enrich social
             media posts and enable enhanced search and
             browsing in a social media dataset




                                               36

Sheila Kinsella PhD Defense
