The Semantic Web and the News:
Exploitation and Adoption

Ken Ellis
Chief Scientist
Agenda

  Intro to Daylife

 Exploiting the Semantic Web
        Named Entities
    
        Toolsets, issues
    

    Adopting / Enabling

        Others
    
        Daylife
    
Daylife

                A Platform for News Innovation:

A scalable solution for publishers of all sizes to generate more content
        and more inventory – with no additional personnel costs
Daylife: What We Do
    Aggregate Content

        Licensed photos (Getty, AP, Reuters)
    
        Articles (scraped, real-time)
    
    Create Metadata

        Topics (people, organizations, concepts)
    
        Topic taxonomy, descriptions
    
        Quotes with attribution
    
        Photo identification
    
        Relatedness
    
        Authorship, sentiment analysis, etc.
    
    Deliver to Clients

        Web Sites / Modules / Data
    
        Flexibility: API w/ 500 distinct queries
    
        Novel search/ranking algorithms
    
        Free API
    
[Wiki|DB]Pedia and Named Entities

 We also want to collect content around a named entity
…and associate it with external data (Wikipedia, Freebase)
[Wiki|DB]Pedia and Named Entities
                                         … for a lot of NE’s
                                   (55k newsworthy ones last month)
[Figure: log-log plot of Articles Per Month (1 to 1,000,000) against NE Rank (1 to 100,000)]
[Wiki|DB]Pedia and Named Entities

    Without getting swamped

Daylife and the Semantic Web

    Wikipedia

        website
    
        API
    
        Wikimedia dumps
    

    DBPedia

    Freebase

    Partners

        IPTC, NewsML
    

    Clients

        Proprietary metadata
    
Resources for News Organizations

    Named Entities

        vetting
        disambiguation
        aliases
        prominence

    Wikipedia

        website
        API
        Wikimedia dumps

    DBPedia

    Freebase

    Partners

        IPTC, NewsML

    Clients

        Proprietary metadata
    
[Wiki|DB]Pedia and Named Entities

                        But:
“… Now, team owner Kevin Buckler is looking to debut
 in NASCAR Sprint Cup Series competition, when Mike
     Wallace runs in Thursday's Gatorade Duel …”

    Which Mike Wallace?

        Mike_Wallace_(journalist)
    
        Mike_Wallace_(NASCAR)
    


    Two disambiguation approaches

        Given an article and an extracted name, what Wikipedia entry does it map to?
    
        Given a Wikipedia entry, what articles match?
    
[Wiki|DB]Pedia and Named Entities

    Articles First:

    Wikimedia dumps and DBPedia

        Filter for people, organizations, other NE
    
        Construct weighted graph from links
    
        Proxy for prominence (# edits, pageviews, dumps only)
    
        Redirects & disambiguation pages
    
            “Hillary Clinton” redirects to Hillary_Rodham_Clinton: a human
            decided the reference is unambiguous; likewise Usama/Osama


    Identify names, possibly matching graph nodes

    Select set of nodes that minimizes total distance

        Perhaps factor in node prominence
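
A minimal sketch of the "select set of nodes that minimizes total distance" step. The toy link graph, the candidate lists and the function names are all invented for illustration; the slides do not say how distances were computed, so this assumes unweighted hop counts and a brute-force search over candidate combinations, and it leaves out the optional prominence weighting.

```python
from collections import deque
from itertools import combinations, product

# Toy undirected link graph. The edges are made up for this example.
LINKS = {
    "Mike_Wallace_(journalist)": {"Chicago_Sun-Times"},
    "Mike_Wallace_(NASCAR)": {"NASCAR", "Gatorade"},
    "Kevin_Buckler": {"NASCAR"},
    "Chicago_Sun-Times": {"Mike_Wallace_(journalist)", "Chicago_Bulls"},
    "Chicago_Bulls": {"Chicago_Sun-Times", "Gatorade"},
    "NASCAR": {"Mike_Wallace_(NASCAR)", "Kevin_Buckler", "Gatorade"},
    "Gatorade": {"Mike_Wallace_(NASCAR)", "Chicago_Bulls", "NASCAR"},
}


def hop_distance(graph, start, goal):
    """Breadth-first search: number of links on the shortest path, inf if none."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return float("inf")


def total_distance(graph, nodes):
    """Sum of pairwise hop distances between the chosen nodes."""
    return sum(hop_distance(graph, a, b) for a, b in combinations(nodes, 2))


def disambiguate(graph, candidates):
    """Pick one node per mention so that the total pairwise distance is smallest.

    candidates maps each extracted name to a non-empty list of graph nodes.
    """
    mentions = list(candidates)
    options = product(*(candidates[m] for m in mentions))
    best = min(options, key=lambda choice: total_distance(graph, choice))
    return dict(zip(mentions, best))


# Names extracted from the example article, with their possible graph nodes.
CANDIDATES = {
    "Mike Wallace": ["Mike_Wallace_(journalist)", "Mike_Wallace_(NASCAR)"],
    "Kevin Buckler": ["Kevin_Buckler"],
    "NASCAR": ["NASCAR"],
    "Gatorade": ["Gatorade"],
}

RESULT = disambiguate(LINKS, CANDIDATES)
```

Brute force is only workable for a handful of ambiguous names per article; a larger candidate set would need a cheaper search.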
    
[Wiki|DB]Pedia and Named Entities

[Figure: illustrative link graph with nodes Mike Wallace (journalist), Chicago Sun-Times, Chicago Bulls, Gatorade, Kevin Buckler, NASCAR and Mike Wallace (NASCAR)]

                           I made this up!
[Wiki|DB]Pedia and Named Entities

    Another possibility: compare text of Wikipedia entry to the article

    But:

        Wikipedia entries largely historical, small fraction related to current events
    
        Journalists, in providing context for lesser-known individuals, often mention a few other named entities
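
One way the text comparison could look, as a rough sketch only: bag-of-words cosine similarity between the article and each candidate entry. The entry summaries are placeholder text written for this example, not real Wikipedia content, and the function names are hypothetical. The caveats above still apply: an entry that is mostly historical may share few words with a current news story.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_entry(article_text, entries):
    """Return the entry title whose text is most similar to the article."""
    article = bag_of_words(article_text)
    return max(
        entries,
        key=lambda title: cosine_similarity(article, bag_of_words(entries[title])),
    )


ARTICLE = (
    "Now, team owner Kevin Buckler is looking to debut in NASCAR Sprint Cup "
    "Series competition, when Mike Wallace runs in Thursday's Gatorade Duel"
)

# Placeholder summaries, invented for illustration.
ENTRIES = {
    "Mike_Wallace_(journalist)": "American television journalist and correspondent",
    "Mike_Wallace_(NASCAR)": "American stock car racing driver who has competed in NASCAR series",
}

CHOSEN = best_entry(ARTICLE, ENTRIES)
```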
[Wiki|DB]Pedia and Named Entities

    NE First approach:

    Classifier for race car drivers, Wikipedia to identify names

        Filter based on prominence
    
        See EVRI taxonomical paths
    
        http://www.evri.com/mainline-ui/jsp/index.jsf#searching-with-taxonomical-paths
[Wiki|DB]Pedia and Named Entities

    NE First:

        Tractable for a human (limited number of classifiers)
    
        Better for low-recall high-precision
    


    Article First:

        Low editorial oversight
    
        Best-guess
    


    Neither is a complete solution

    Not for locations

[Wiki|DB]Pedia and Named Entities
General Nits

    Sticky Graffiti

        Wikipedia can be updated real-time if you don’t like it
    
        Some derived data sets can’t. Makes it our problem!
    
        On-demand updates from Wikipedia API / HTML
[Wiki|DB]Pedia and Named Entities
General Nits

    Career Changes

        Mike Wallace (journalist) becomes a NASCAR driver
    
        Joe Wurzelbacher becomes a political pundit
    
    Not a complete solution, but we knew that.
[Wiki|DB]Pedia and Named Entities
General Nits

    Staleness

    Infrequent Wikimedia dumps

    GWB is still president?

        DBPedia bad
    
        Wikimedia dumps bad
    
        Freebase good
    
        Wikipedia HTML/API good
    
                                  DBPedia, 3/5/09
[Wiki|DB]Pedia and Named Entities
    Obscure Information

    Clint Eastwood:

        Is prominent, is a politician
    
        Not a prominent politician
    
[Wiki|DB]Pedia and Named Entities
      URI Stability
  
      If this were 1981, unambiguous “George Bush”:
  

<rdf:RDF xmlns:rdf="http://www.w3.org/1981/02/22-rdf-syntax-ns#"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/George_Bush">
    <dc:title>George Bush</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>



      The NYTimes did this, and still does (API):
  
          “George Bush” tag → George H. W. Bush
      

      A lucky problem to have!
  
Resources

    Named Entities

        GUID’s!
        tagging
        associations (members of teams)
        other data

    Wikipedia

        website
        API
        Wikimedia dumps

    DBPedia

    Freebase

    Partners

        IPTC, NewsML

    Clients

        Proprietary metadata
    
Freebase
    GUID’s are stable

    Query by Wikipedia URI

        http://www.freebase.com/api/service/mqlread?query={"query":{"*":null,"id":"/wikipedia/en/Mike_Wallace_$0028journalist$0029"}}

    Easy-to-find redirects

    GWB isn’t president

    Professions vs. Types

    Easier for topic tagging



    Clint Eastwood still a politician

        but: easier to tell he’s a minor one
    
        multiple types/professions, not much political data
    


    No good proxy for significance

        cross-reference
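
A small sketch of how the "query by Wikipedia URI" request above might be assembled. It only builds the strings and makes no network call. The escaping rule is an assumption generalised from the single example on the slide, where "(" and ")" appear as $0028 and $0029: here every character outside ASCII letters, digits and underscore is written as "$" plus four uppercase hex digits. The function names are hypothetical.

```python
import json
import re

# Endpoint exactly as written on the slide.
MQLREAD = "http://www.freebase.com/api/service/mqlread"


def freebase_key(title):
    """Escape a Wikipedia title for use in a /wikipedia/en/ id.

    Assumption: anything outside [A-Za-z0-9_] becomes $XXXX (uppercase hex),
    which reproduces the $0028 / $0029 seen in the slide's example.
    """
    return re.sub(
        r"[^A-Za-z0-9_]",
        lambda match: "${:04X}".format(ord(match.group())),
        title,
    )


def mql_query_for(title):
    """Build the compact JSON envelope shown on the slide for one Wikipedia title."""
    envelope = {"query": {"*": None, "id": "/wikipedia/en/" + freebase_key(title)}}
    return json.dumps(envelope, separators=(",", ":"))


def mqlread_url(title):
    """Join endpoint and query as on the slide (not percent-encoded)."""
    return MQLREAD + "?query=" + mql_query_for(title)


EXAMPLE_URL = mqlread_url("Mike_Wallace_(journalist)")
```

The slide shows the query unencoded; an actual HTTP request would normally need the query parameter percent-encoded first.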
    
Resources

    Inter-agency standards

        Newswire services
        IPTC: photo information
        NewsML: article information, topics

    Wikipedia

        website
        API
        Wikimedia dumps

    DBPedia

    Freebase

    Partners

        IPTC, NewsML

    Clients

        Proprietary metadata
    
Interagency Metadata

     Data:

       authorship
       location
       caption
       sometimes people, category
       NE’s hand-typed, often quickly
     RSS almost as good

        Stripped
    
        Matching problem,
    
        but STILL USEFUL
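
The matching problem with hand-typed names could be approached with fuzzy string matching. This is a sketch using Python's standard difflib, not a description of what Daylife did; the topic list, the 0.85 cutoff and the function names are assumptions chosen for the example.

```python
import difflib


def normalise(name):
    """Lower-case and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def match_topic(raw_name, canonical_topics, cutoff=0.85):
    """Return the canonical topic closest to a hand-typed name, or None.

    cutoff is the minimum difflib similarity ratio accepted; 0.85 is an
    arbitrary value picked for this example.
    """
    lookup = {normalise(topic): topic for topic in canonical_topics}
    hits = difflib.get_close_matches(
        normalise(raw_name), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None


TOPICS = ["Barack Obama", "Michelle Obama", "Mike Wallace"]

# A hurried caption with a misspelling and a doubled space.
MATCHED = match_topic("Barak  Obama", TOPICS)
# A name with no close canonical topic.
UNMATCHED = match_topic("Kevin Buckler", TOPICS)
```

String similarity alone cannot tell two people with the same name apart, so a match like this would still feed into one of the disambiguation approaches from earlier in the talk.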
Resources

    Q: “Can you use our metadata?”

    A: “Sometimes”

        Again, matching problem, but good for client-specific topics,
        still useful

    Wikipedia

        website
        API
        Wikimedia dumps

    DBPedia

    Freebase

    Partners

        IPTC, NewsML

    Clients

        Proprietary metadata
    
Others Using the Semantic Web

    Having an API

        not the Semantic Web, but at least machine-friendly
    
        eventually common, even for publishers
    



    Publishing URI’s for Wikipedia, Freebase, IMDB, etc.

        common among non-publishers
    
        parasitic (not bad!)
    



    Querying using the same URI’s

        not so common
    
        mutualistic
    
Others Using the Semantic Web

    EVRI

        API
    
        Topics (mostly, all?) from Wikipedia
    
        Probably taxonomic pathways, facets, derived from Wikipedia
    
        Disambiguation based on above
    
        Published Wikipedia URL’s
    
        Can’t query by Wikipedia, other URI’s
    
Others Using the Semantic Web

    Zemanta

        Lots of Linked Data
    
        API provides text markup
    


        Developing (with others) a simplified RDFa-based semantic tagging standard
Others Using the Semantic Web

    Calais (Thomson Reuters)

        API extracts NE’s, other information
    
        Provides Linked Data URI’s to others (one-way)
    
        Provides their own endpoints
    
        Not an aggregator
    
        Eventual support for querying
    
        Very clean!
    
Others Using the Semantic Web
    The New York Times

        Leading charge with publisher API
    
        Their own tagging, great quality
    
        Some major newspapers following suit
    
        Other APIs: NewsGator, Inform, Outside.in
    Slow Moves to Digital Access

        Full-text RSS rare
    
        API rare
    
        Semantic Web standards rare
    
    Wouldn’t it be great if:

        You could ask for content about Mike_Wallace_(American_football)
    
        They pointed you to other rich data sources
    
Wikipedia URI Lookup
A quick service to support lookup for Wikipedia URI’s

http://labs.daylife.com/wikipedia_topic_getInfo.php?uri=http://en.wikipedia.org/wiki/Mike_Wallace_(journalist)
                                     or
http://labs.daylife.com/wikipedia_topic_getInfo.php?uri=Barack_Obama
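
A short sketch of building the two request forms shown above. Only the endpoint and the uri parameter come from the slide; the helper names are hypothetical, and nothing is assumed about the response, so the code stops at constructing the URL.

```python
from urllib.parse import urlparse

# Endpoint as shown on the slide.
LOOKUP = "http://labs.daylife.com/wikipedia_topic_getInfo.php"


def wikipedia_title(uri_or_title):
    """Return the article title from a full Wikipedia URL, or a bare title unchanged."""
    parsed = urlparse(uri_or_title)
    if parsed.scheme in ("http", "https") and parsed.path.startswith("/wiki/"):
        return parsed.path[len("/wiki/"):]
    return uri_or_title


def lookup_url(uri_or_title):
    """Build the lookup request; the slide shows both full URIs and bare titles."""
    return LOOKUP + "?uri=" + uri_or_title


FULL = lookup_url("http://en.wikipedia.org/wiki/Mike_Wallace_(journalist)")
SHORT = lookup_url("Barack_Obama")
```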
Thank you


            Web Site
            http://www.daylife.com

            Daylife API
            http://developer.daylife.com

            Labs
            http://labs.daylife.com

            Email
            ken@daylife.com

The Semantic Web And The News

  • 1. The Semantic Web and the News: Exploitation and Adoption Ken Ellis Chief Scientist
  • 2. Agenda Intro to Daylife   Exploiting the Semantic Web Named Entities  Toolsets, issues  Adopting / Enabling  Others  Daylife 
  • 3. Daylife A Platform for News Innovation: A scalable solution for publishers of all sizes to generate more content and more inventory – with no additional personnel costs
  • 4. Daylife: What We Do Aggregate Content  Licensed photos (Getty, AP, Reuters)  Articles (scraped, real-time)  Create Metadata  Topics (people, organizations, concepts)  Topic taxonomy, descriptions  Quotes with attribution  Photo identification  Relatedness  Authorship, sentiment analysis, etc.  Deliver to Clients  Web Sites / Modules / Data  Flexibility: API w/ 500 distinct queries  Novel search/ranking algorithms  Free API 
  • 5. [Wiki|DB]Pedia and Named Entites We also want to collect content around a named entity …and associate it with external data (Wikipedia, Freebase)
  • 6. [Wiki|DB]Pedia and Named Entites … for a lot of NE’s (55k newsworthy ones last month) 1000000 100000 Articles Per Month 10000 1000 100 10 1 1 10 100 1000 10000 100000 NE Rank
  • 7. [Wiki|DB]Pedia and Named Entites Without getting swamped 
  • 8. Daylife and the Semantic Web Wikipedia  website  API  Wikimedia dumps  DBPedia  Freebase  Partners  IPTC, NewsML  Clients  Proprietary metadata 
  • 9. Resources for News Organizations Named Entities Wikipedia   vetting website   disambiguation API   aliases Wikimedia dumps   prominence DBPedia   Freebase  Partners  IPTC, NewsML  Clients  Proprietary metadata 
  • 10. [Wiki|DB]Pedia and Named Entites But: “… Now, team owner Kevin Buckler is looking to debut in NASCAR Sprint Cup Series competition, when Mike Wallace runs in Thursday's Gatorade Duel …” Which Mike Wallace?  Mike_Wallace_(journalist)  Mike_Wallace_(NASCAR)  Two disambiguation approaches  Given an article, extracted name, what Wikipedia entry does  it map to? Given a Wikipedia entry, what articles match? 
  • 11. [Wiki|DB]Pedia and Named Entites Articles First:  Wikimedia dumps and DBPedia  Filter for people, organizations, other NE  Construct weighted graph from links  Proxy for prominence (# edits, pageviews, dumps only)  Redirects & disambiguation pages  “Hillary Clinton” redirect to Hillary_Rodham_Clinton: human  decided reference is unambiguous; Usama/Osama Identify names, possibly matching graph nodes  Select set of nodes that minimizes total distance  Perhaps factor in node prominence 
  • 12. [Wiki|DB]Pedia and Named Entites Mike Wallace journalist NASCAR Chicago Sun- Times Mike Kevin Chicago Wallace Buckler Bulls NASCAR Gatorade I made this up!
  • 13. [Wiki|DB]Pedia and Named Entites Another possibility: compare text of Wikipedia entry to  the article But:  Wikipedia entries largely historical, small fraction related to  current events Journalists, in providing context for lesser-known individuals,  often mention a few other named entities
  • 14. [Wiki|DB]Pedia and Named Entites NE First approach:  Classifier for race car drivers, Wikipedia to identify names  Filter based on prominence  See EVRI taxonomical paths  http://www.evri.com/mainline-ui/jsp/index.jsf#searching-with- taxonomical-paths
  • 15. [Wiki|DB]Pedia and Named Entites NE First:  Tractable for a human (limited number of classifiers)  Better for low-recall high-precision  Article First:  Low editorial oversight  Best-guess  Neither is a complete solution  Not for locations 
  • 16. [Wiki|DB]Pedia and Named Entites General Nits Sticky Graffiti  Wikipedia can be updated  real-time if you don’t like it Some derived data sets  can’t. Makes it our problem! On-demand updates from  Wikipedia API / HTML
  • 17. [Wiki|DB]Pedia and Named Entites General Nits Career Changes  Mike Wallace (journalist)  becomes a NASCAR driver Joe Wurtzelbacher  becomes a political pundit Not a complete solution,  but we knew that.
  • 18. [Wiki|DB]Pedia and Named Entites General Nits Staleness  Infrequent Wikimedia  dumps GWB is still president?  DBPedia bad  Wikimedia dumps bad  Freebase good  Wikipedia HTML/API good  DBPedia, 3/5/09
  • 19. [Wiki|DB]Pedia and Named Entities: Obscure Information
      • Clint Eastwood:
          • Is prominent, is a politician
          • Not a prominent politician
  • 20. [Wiki|DB]Pedia and Named Entities: URI Stability
      • If this were 1981, “George Bush” would be unambiguous:
          <rdf:RDF xmlns:rdf="http://www.w3.org/1981/02/22-rdf-syntax-ns#"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
            <rdf:Description rdf:about="http://en.wikipedia.org/wiki/George_Bush">
              <dc:title>George Bush</dc:title>
              <dc:publisher>Wikipedia</dc:publisher>
            </rdf:Description>
          </rdf:RDF>
      • The NYTimes did this, and still does (API):
          • The “George Bush” tag means George H. W. Bush
      • A lucky problem to have!
  • 21. Resources
      • Wikipedia: website, API, Wikimedia dumps
      • DBPedia
      • Freebase
      • Partners: IPTC, NewsML
      • Clients: proprietary metadata
      • Named Entities
          • GUIDs!
          • tagging
          • associations (members of teams)
          • other data
  • 22. Freebase
      • GUIDs are stable
      • Query by Wikipedia URI
          • http://www.freebase.com/api/service/mqlread?query={"query":{"*":null,"id":"/wikipedia/en/Mike_Wallace_$0028journalist$0029"}}
      • Easy-to-find redirects
      • GWB isn’t president
      • Professions vs. Types
          • Easier for topic tagging
      • Clint Eastwood is still a politician
          • but: easier to tell he’s a minor one
          • multiple types/professions, not much political data
      • No good proxy for significance
          • cross-reference
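The query URL on the Freebase slide can be assembled as in the sketch below. It assumes that the `$0028` and `$0029` in the slide's URL are four-digit uppercase hex code points for `(` and `)`, and generalizes that to every character other than ASCII letters, digits, `_` and `-`; this is a simplification, not the full Freebase key rule set. The function names `freebase_key` and `mqlread_url` are hypothetical. The code only builds the URL and makes no request, since the Freebase service has since been shut down.

```python
import json
from urllib.parse import quote

# Endpoint copied from the slide.
MQLREAD_ENDPOINT = "http://www.freebase.com/api/service/mqlread"


def freebase_key(wikipedia_title):
    """Escape a Wikipedia page title as a Freebase key (simplified).

    Assumption: anything other than ASCII letters, digits, '_' and '-'
    becomes '$' plus the 4-digit uppercase hex code point.
    Characters outside the Basic Multilingual Plane are not handled.
    """
    escaped = []
    for ch in wikipedia_title.replace(" ", "_"):
        if ch.isascii() and (ch.isalnum() or ch in "_-"):
            escaped.append(ch)
        else:
            escaped.append("$%04X" % ord(ch))
    return "".join(escaped)


def mqlread_url(wikipedia_title, language="en"):
    """Build an mqlread URL asking for every property of the topic."""
    envelope = {
        "query": {
            "*": None,
            "id": "/wikipedia/%s/%s" % (language, freebase_key(wikipedia_title)),
        }
    }
    payload = json.dumps(envelope, separators=(",", ":"))
    return MQLREAD_ENDPOINT + "?query=" + quote(payload)


print(freebase_key("Mike_Wallace_(journalist)"))
print(mqlread_url("Mike_Wallace_(journalist)"))
```

Unlike the raw URL on the slide, this version percent-encodes the JSON payload, which is a choice made here for safe transport rather than something the slide shows.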
  • 23. Resources
      • Wikipedia: website, API, Wikimedia dumps
      • DBPedia
      • Freebase
      • Partners: IPTC, NewsML
      • Clients: proprietary metadata
      • Inter-agency standards
          • Newswire services
          • IPTC: photo information
          • NewsML: article information, topics
  • 24. Interagency Metadata
      • Data:
          • authorship
          • location
          • caption
          • sometimes people, category
      • NEs hand-typed, often quickly
      • RSS almost as good
      • Stripped
      • Matching problem, but STILL USEFUL
  • 25. Resources
      • Wikipedia: website, API, Wikimedia dumps
      • DBPedia
      • Freebase
      • Partners: IPTC, NewsML
      • Clients: proprietary metadata
      • Q: “Can you use our metadata?”
      • A: “Sometimes”
      • Again, a matching problem, but good for client-specific topics, still useful
  • 26. Others Using the Semantic Web
      • Having an API
          • not the Semantic Web, but at least machine-friendly
          • eventually common, even for publishers
      • Publishing URIs for Wikipedia, Freebase, IMDB, etc.
          • common among non-publishers
          • parasitic (not bad!)
      • Querying using the same URIs
          • not so common
          • mutualistic
  • 27. Others Using the Semantic Web: EVRI
      • API
      • Topics (mostly, all?) from Wikipedia
      • Probably taxonomic pathways, facets, derived from Wikipedia
      • Disambiguation based on the above
      • Published Wikipedia URLs
      • Can’t query by Wikipedia or other URIs
  • 28. Others Using the Semantic Web: Zemanta
      • Lots of Linked Data
      • API provides text markup
      • Developing (with others) a simplified RDFa-based semantic tagging standard
  • 29. Others Using the Semantic Web: Calais (Thomson Reuters)
      • API extracts NEs, other information
      • Provides Linked Data URIs to others (one-way)
      • Provides their own endpoints
      • Not an aggregator
      • Eventual support for querying
      • Very clean!
  • 30. Others Using the Semantic Web
      • The New York Times
          • Leading the charge with a publisher API
          • Their own tagging, great quality
      • Some major newspapers following suit
      • Other APIs: NewsGator, Inform, Outside.in
      • Slow Moves to Digital Access
          • Full-text RSS rare
          • API rare
          • Semantic Web standards rare
      • Wouldn’t it be great if:
          • You could ask for content about Mike_Wallace_(American_football)
          • They pointed you to other rich data sources
  • 31. Wikipedia URI Lookup
      • A quick service to support lookup for Wikipedia URIs:
          • http://labs.daylife.com/wikipedia_topic_getInfo.php?uri=http://en.wikipedia.org/wiki/Mike_Wallace_(journalist)
          • or http://labs.daylife.com/wikipedia_topic_getInfo.php?uri=Barack_Obama
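A request URL for the lookup service on slide 31 could be built as in the sketch below. The endpoint and the `uri` parameter come from the slide; the helper name `lookup_url` is hypothetical, and percent-encoding the value is an assumption (the slide shows the Wikipedia URI passed unencoded). The response format is not described in the slides, so the sketch stops at building the URL.

```python
from urllib.parse import urlencode

# Endpoint and parameter name copied from the slide.
LOOKUP_ENDPOINT = "http://labs.daylife.com/wikipedia_topic_getInfo.php"


def lookup_url(wikipedia_uri_or_title):
    """Return the lookup URL for a full Wikipedia URI or a bare page title."""
    return LOOKUP_ENDPOINT + "?" + urlencode({"uri": wikipedia_uri_or_title})


print(lookup_url("Barack_Obama"))
print(lookup_url("http://en.wikipedia.org/wiki/Mike_Wallace_(journalist)"))
```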
  • 32. Thank you
      • Web Site: http://www.daylife.com
      • Daylife API: http://developer.daylife.com
      • Labs: http://labs.daylife.com
      • Email: ken@daylife.com