BibBase Linked Data Triplification Challenge 2010 Presentation

A short talk on BibBase (http://data.bibbase.org), given at LDTC 2010 in Graz, Austria.

Published in: Education, Technology

  1. BibBase Triplified
      http://data.bibbase.org/
      Presented by: Reynold S. Xin (UC Berkeley)
      Joint work with: Oktie Hassanzadeh, Yang Yang, Jiang Du, Minghua Zhao, Renee J. Miller (University of Toronto); Christian Fritz (University of Southern California)
  2. Outline
      • Goals and Status
      • Duplicate detection
      • Interlinking of data sources
      • Additional features
      • Conclusions and future work
  3. Goals (http://www.bibbase.org)
      • Makes it easy for scientists to maintain publications pages
      • Scientists maintain a bibtex file; BibBase does the rest
          • Publishes them in HTML
  4. Goals (http://data.bibbase.org)
      • Makes it easy for scientists to maintain publications pages
      • Scientists maintain a bibtex file; BibBase does the rest
          • Publishes them in HTML
          • Publishes them in RDF
          • Links entries to the open linked data cloud
      • With incentive, scientists are helping us build a bibliographic database (think DBLP but automated)
      • Invaluable data set for benchmarking duplicate detection and semantic link discovery systems
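The "publishes them in RDF" step can be pictured with a minimal sketch. The slides do not show the vocabulary or URI scheme BibBase actually uses, so everything here is an assumption made for illustration: the layout of the `entry` dict, the `http://example.org/publication/` base URI, and the choice of Dublin Core properties. The output is plain N-Triples, one statement per line.

```python
# Hypothetical sketch: turn one parsed bibtex entry into N-Triples lines.
# The Dublin Core namespaces are real; the base URI and the dict layout are
# illustrative assumptions, not BibBase's actual scheme.
DCTERMS = "http://purl.org/dc/terms/"
DC = "http://purl.org/dc/elements/1.1/"


def literal(text):
    """Quote a string as an N-Triples literal (escapes \\, " and newline)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def entry_to_ntriples(entry, base="http://example.org/publication/"):
    """Return a list of N-Triples statements describing one entry."""
    subject = f"<{base}{entry['key']}>"
    triples = [f"{subject} <{DCTERMS}title> {literal(entry['title'])} ."]
    for author in entry.get("authors", []):
        triples.append(f"{subject} <{DC}creator> {literal(author)} .")
    if "year" in entry:
        triples.append(f"{subject} <{DCTERMS}issued> {literal(str(entry['year']))} .")
    return triples
```

Calling `entry_to_ntriples` on a dict with `key`, `title`, `authors` and `year` yields one title statement, one creator statement per author, and one issued statement. Authors are emitted as literals here only to keep the sketch short; a linked-data publisher would normally mint a URI per author so that duplicates can be merged and linked.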
  5. Some statistics
      • “Beta” went online in June 2010
      • As of yesterday (September 1, 2010)
          • ~100 active users
          • 4520 publications, 4883 authors, 502 journals, 1881 proceedings, 88 keywords
          • 39201 author links, 2768 publication links, 30 keyword links
      • Note that this is before we do any form of “marketing”
  6. Duplicate Detection
      • Examples
          • Authors: “Renee J. Miller” or “R. J. Miller” or “RJ Miller”
          • Publication entries
          • Journals & conferences: “VLDB” or “Very Large Data Base”
      • Solutions
          • Local detection (within a single bibtex file)
          • Global detection (across multiple files)
  7. Local Detection
      • A set of predefined rules to identify duplicates.
          • E.g. within a single file, it is highly likely that “Renee J Miller” is the same as “RJ Miller”.
      • Users can specify a suffix to the name to differentiate them (DBLP approach).
          • E.g. “Min Wang” vs “Min Wang2”
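One such local rule could look like the sketch below. It is not the rule set BibBase ships, which the slides do not spell out: it simply assumes that two names in the same file refer to the same author when their last names match and their given names reduce to the same initials. The function names `name_key` and `same_author_locally` are made up for this example.

```python
def name_key(name):
    """Reduce a "Given ... Last" name to (last name, given-name initials).

    Illustrative rule only: periods are ignored, and a short all-caps token
    such as "RJ" is read as a run of initials.
    """
    tokens = name.replace(".", " ").split()
    *given, last = tokens  # raises ValueError on an empty name
    initials = []
    for token in given:
        if token.isupper() and len(token) <= 3:
            initials.extend(token.lower())    # "RJ" -> "r", "j"
        else:
            initials.append(token[0].lower())  # "Renee" -> "r"
    return last.lower(), "".join(initials)


def same_author_locally(a, b):
    """True when both names reduce to the same key within one bibtex file."""
    return name_key(a) == name_key(b)
```

With this rule, “Renee J. Miller”, “R. J. Miller” and “RJ Miller” all reduce to the same key, while “Min Wang” and “Min Wang2” stay apart because the suffix becomes part of the last name. Matching on initials alone is deliberately aggressive: it would also merge “Robert J. Miller” with “Renee J. Miller”, which is only tolerable under the single-file assumption and is exactly the case the suffix convention is there to override.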
  8. Global Detection
      • Duplicate detection, also known as entity resolution, record linkage, or reference reconciliation, is a well-studied problem and an active research area. [Tutorial-VLDB’05, Tutorial-SIGMOD’06]
      • We use existing declarative techniques [D.App.σ-SIGMOD’07] to detect duplicates across multiple files.
      • Displays a disambiguation page on the HTML interface and an rdfs:seeAlso property on the RDF interface.
      • Also lets users provide feedback through a bibtex @string definition, e.g. @string{vldb = Very Large Data Base}
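How an `@string` definition can act as user feedback is easy to sketch. The slides do not describe how BibBase consumes these definitions internally, so the following is only an assumption: it collects `@string` abbreviations from a bibtex file and uses them to map an abbreviated venue to its long form, so that “VLDB” and “Very Large Data Base” end up with the same label. The helper names are hypothetical, and the regular expression covers only simple single-line definitions.

```python
import re

# Matches @string{key = value}, with the value bare, in quotes, or in braces.
STRING_DEF = re.compile(
    r'@string\s*\{\s*(\w+)\s*=\s*["{]?([^"{}]+?)["}]?\s*\}',
    re.IGNORECASE,
)


def read_abbreviations(bibtex_text):
    """Collect @string definitions as a lowercase-key -> expansion dict."""
    return {
        key.lower(): value.strip()
        for key, value in STRING_DEF.findall(bibtex_text)
    }


def canonical_venue(venue, abbreviations):
    """Expand a venue name if the user defined it as an abbreviation."""
    cleaned = venue.strip()
    return abbreviations.get(cleaned.lower(), cleaned)
```

After `read_abbreviations` has seen `@string{vldb = Very Large Data Base}`, both `canonical_venue("VLDB", ...)` and `canonical_venue("Very Large Data Base", ...)` return the same string, which is the signal a duplicate detector needs to merge the two venue records.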
  9. Interlinking of Data Sources
      • Leverages both offline dictionaries and online real-time URL verifications.
      • Some external data sources
          • DBLP
          • DBpedia
          • RKBExplorer
          • Semantic Web Dogfood
          • LOD foaf
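The combination of offline dictionaries and online URL verification can be sketched in two steps: look up candidate URLs for a label in a prebuilt dictionary, then keep only the candidates that currently resolve. This is a guess at the general shape rather than BibBase's code. The dictionary layout and the names `url_is_live` and `candidate_links` are assumptions, and the check uses an HTTP HEAD request from Python's standard library.

```python
import urllib.request


def url_is_live(url, timeout=5.0):
    """Return True if a HEAD request to the URL ends in a 2xx/3xx status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # URLError, HTTPError and timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def candidate_links(label, dictionary, verify=url_is_live):
    """Look a label up offline, then keep the URLs that pass verification.

    `dictionary` maps lowercase labels to lists of candidate URLs in an
    external source; its layout is an assumption made for this sketch.
    """
    candidates = dictionary.get(label.strip().lower(), [])
    return [url for url in candidates if verify(url)]
```

Passing the verifier in as a parameter keeps the lookup testable without network access, for example `candidate_links("VLDB", table, verify=lambda url: True)`. Some servers reject HEAD requests, so a production check would probably fall back to GET.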
  10. Additional Features
      • Storage and publication of provenance information
      • Dynamic grouping of entities (by year, keyword, etc.)
      • RSS feed for notification
      • DBLP scraper to generate bibtex files from DBLP records
      • Statistics on usage
      • Enhancement to existing MIT bibtex ontology file
  11. Conclusion and Future Work
      • BibBase
          • Light-weight publication of bibliographic data
          • Semantic web technologies as a result of complex triplification performed inside the system
          • Invaluable data set
      • Future Work
          • More comprehensive duplicate detection
          • Links to more external data sources
          • Better engineering and service level agreement (99.99%?)
          • Broader user base
  12. Questions?