distilling the Web of Data
           drop by drop (with Java)


Sourcesense UK “Last Wednesday” - Davide Palmisano @dpalmisano

Wednesday, June 29, 2011
the shortest introduction ever to the Web of Data

      Web page markup technologies are
      intended for human consumption

      they let machines present raw
      data to humans

      extracting valuable data may
      require fancy scraping techniques

      scraping: one size doesn’t fit all



the shortest introduction ever to the Web of Data

     <div>
         <div> Canon Rebel T2i (EOS 550D) $899 </div>
         <div> The Rebel T2i EOS 550D is Canon's
             top-of-the-line consumer digital SLR
             camera. It can shoot up
             <div> AN_UCC-13: 013803123784 </div>
             <div> price: 899 USD </div>
         </div>
     </div>

     what does this tag mean?
     is this a currency or what?
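
To make the "one size doesn't fit all" point concrete, here is a minimal sketch of a scraper written for exactly this page layout. The class name PriceScraper and the pattern are hypothetical and not part of any23; the pattern only recognises the literal wording "price: <amount> <code>".

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical scraper tied to one page layout; not part of any23. */
final class PriceScraper {

    // Matches e.g. "price: 899 USD"; any other wording or ordering is silently missed.
    private static final Pattern PRICE =
            Pattern.compile("price:\\s*(\\d+(?:\\.\\d+)?)\\s+([A-Z]{3})");

    private PriceScraper() { }

    /** Returns {amount, currency} if the hard-coded pattern occurs in the HTML. */
    static Optional<String[]> scrape(String html) {
        Matcher m = PRICE.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new String[] { m.group(1), m.group(2) });
    }
}
```

The same price written as "$899" in the first div is not found by this pattern, which is the brittleness the slide is hinting at: the markup itself says nothing about what is a price or a currency.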


“meaning”, Joseph Kosuth, The Panza Collection, MART - Rovereto, Italy

Microformats




    “Microformats are a way of adding simple markup
    to human-readable data items such as events,
    contact details or locations, on web pages”
                                        Andy Mabbett


    -    community-driven initiative
    -    widely adopted
    -    quick & dirty
    -    scarcely extensible


Microformats



     <div class="hlisting item">
         <div> Canon Rebel T2i (EOS 550D) $899 </div>
         <div class="description"> The Rebel T2i EOS
             550D is Canon's
             top-of-the-line consumer digital SLR
             camera. It can shoot up
             <div> AN_UCC-13: 013803123784 </div>
             <div class="price"> price: 899 USD </div>
         </div>
     </div>

RDFa: RDF in attributes


       model your data as if they were Web pages
       connected with named links and properties
       and embed them in your (X)HTML using
       @attributes

       - RDF, graph-based model
       - W3C Recommendation
       - highly extensible

       e.g. GoodRelations [1], a full-featured
       vocabulary for e-commerce



RDFa: RDF in attributes

       model your data

       http://mystore.com/product/5642
           ex:producer      http://canon.co.uk
           ex:description   "The Rebel T2i EOS 550D blah blah"
           ex:price         (price node)

       (price node)
           ex:value         899
           ex:currency      USD
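
The product graph can also be written down as plain subject-predicate-object statements. The sketch below is only an illustration: Triple and ProductGraph are hypothetical helper classes (not the Sesame or any23 model), the namespace chosen for the ex: prefix is an assumption, and the price node is assumed to be a blank node.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal stand-in for an RDF statement; not the Sesame or any23 model. */
final class Triple {
    final String subject;   // already serialized: <iri> or _:label
    final String predicate; // <iri>
    final String object;    // <iri>, _:label or "literal"

    Triple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }

    /** One N-Triples line: subject, predicate, object, terminated by " ." */
    String toNTriples() {
        return subject + " " + predicate + " " + object + " .";
    }
}

final class ProductGraph {
    // Hypothetical namespace standing in for the ex: prefix on the slide.
    static final String EX = "http://example.org/vocab#";

    private ProductGraph() { }

    static String iri(String value) {
        return "<" + value + ">";
    }

    static String literal(String value) {
        // Escape backslashes and double quotes as N-Triples requires.
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Builds the five statements of the product graph. */
    static List<Triple> build() {
        String product = iri("http://mystore.com/product/5642");
        String price = "_:price"; // assumed blank node for the price
        List<Triple> graph = new ArrayList<>();
        graph.add(new Triple(product, iri(EX + "producer"), iri("http://canon.co.uk")));
        graph.add(new Triple(product, iri(EX + "description"),
                literal("The Rebel T2i EOS 550D blah blah")));
        graph.add(new Triple(product, iri(EX + "price"), price));
        graph.add(new Triple(price, iri(EX + "value"), literal("899")));
        graph.add(new Triple(price, iri(EX + "currency"), literal("USD")));
        return graph;
    }
}
```

Calling toNTriples() on each element of ProductGraph.build() yields one statement per line, the same line-based format that the NTriplesWriter shown later in the deck produces.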

RDFa: RDF in attributes

       and then embed them in your
       (X)HTML pages
    <div xmlns:gr="http://purl.org/goodrelations/v1#"
         about="http://mystore.com/product/5642">
        <div>Canon Rebel T2i (EOS 550D) $899</div>
        <div property="gr:description">The Rebel T2i EOS 550D
            is Canon's blah blah</div>

        <div rel="gr:hasPriceSpecification">
            <span> price:
               <span property="gr:hasCurrencyValue">899</span>
               <span property="gr:hasCurrency">USD</span>
            </span>
        </div>
    </div>

HTML5: Microdata


       Microdata allows nested groups of name-value
       pairs to be added to HTML documents, in
       parallel with the existing content

       - W3C Working Draft
       - native to the HTML5 specification
       - serializable as RDF


       - Google, Yahoo! and Bing endorsed Schema.org
       - wide adoption expected


HTML5: Microdata

          <div itemscope itemtype="http://schema.org/Offer">
              <div itemprop="name"> Canon Rebel T2i (EOS 550D) $899 </div>
              <div itemprop="description"> The Rebel T2i EOS 550D
                  is Canon's blah blah</div>

              <div>
                  <span> price:
                     <span itemprop="price"> 899 </span>
                     <span itemprop="priceCurrency"> USD </span>
                  </span>
              </div>
          </div>
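
As a rough illustration of the "name-value pairs" idea, the sketch below collects every itemprop in a well-formed XHTML fragment using only the JDK's XML parser. MicrodataSketch is a hypothetical class and not how any23 extracts Microdata: it assumes XML-well-formed input (so the boolean attribute has to be written as itemscope=""), and it flattens nested items instead of honouring itemscope boundaries.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical, flat itemprop collector; real Microdata parsing is more involved. */
final class MicrodataSketch {

    private MicrodataSketch() { }

    /** Maps each itemprop name to the trimmed text content of its element. */
    static Map<String, String> properties(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            Map<String, String> result = new LinkedHashMap<>();
            // "*" selects every element in document order.
            NodeList all = doc.getElementsByTagName("*");
            for (int i = 0; i < all.getLength(); i++) {
                Element element = (Element) all.item(i);
                if (element.hasAttribute("itemprop")) {
                    result.put(element.getAttribute("itemprop"),
                            element.getTextContent().trim());
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }
}
```

Unlike the regular-expression approach, nothing here depends on the visible wording of the page: the names price and priceCurrency come straight from the markup.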


% of marked up Web pages

      [bar chart: percentage of Web pages carrying RDFa, hCard, adr, xfn and
      hReview markup, measured in 09/2008, 03/2009 and 10/2010;
      vertical axis from 0 to 3.5]

      data from Yahoo! [2]

tie ‘em all together




 uniform, reconciled and
 unified RDF representation

a drop-by-drop distiller

        Anything To Triples (any23) is an open source,
        Apache-licensed:

            - Java library,
            - Web service and
            - a command-line tool

        able to distill RDF triples from a
        variety of semantically marked up Web
        documents

        http://developers.any23.org

live demo http://any23.org




                Web site with ~5000 product descriptions marked up
                with GoodRelations using RDFa

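use Any23 in your Java programs

      // Standard library import; the Any23 types used below ship with the Any23 library.
      import java.io.ByteArrayOutputStream;

      // Configure the runner and the HTTP client used to fetch the page.
      Any23 runner = new Any23();
      runner.setHTTPUserAgent("test-user-agent");
      HTTPClient httpClient = runner.getHTTPClient();
      DocumentSource source = new HTTPDocumentSource(
            httpClient,
            "http://test.com/index.html"
         );

      // Collect the extracted triples in memory, serialized as N-Triples.
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      TripleHandler handler = new NTriplesWriter(out);
      try {
          runner.extract(source, handler);
      } finally {
          handler.close();   // flush the writer even if extraction fails
      }
      String nTriples = out.toString("UTF-8");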




Any23: Command-Line tool
      any23-core/bin$ ./any23

      usage: any23 [-e <arg>] [-f <arg>] [-l <arg>] [-n] [-o <arg>]
             [-p] [-s] [-t] [-v] {<url>|<file>}
       -e <arg>            comma-separated list of extractors, e.g.
                           rdf-xml,rdf-turtle
       -f,--format <arg>   Output format [turtle (default),
                           ntriples, rdfxml, quad, uris]
       -l,--log <arg>      logging, please specify a file
       -n,--nesting        disable production of nesting triples
       -o,--output <arg>   ouput file (defaults to stdout)
       -p,--pedantic       validates and fixes HTML content
                           detecting commons issues
       -s,--stats          print out statistics of Any23


Any23: Web Service
  blacky:~ davide$ curl 'http://any23.org/any23/?format=nquads&url=http://www.bbc.co.uk/programmes/b00kygwh&report=on'

      <response>
          <extractors>
              <extractor>rdf-xml</extractor>
          </extractors>
          <report>
              ...
              <validationReport>
                  <ruleActivations></ruleActivations>
                  ...
              </validationReport>
          </report>
          <data>
              <rdf:Description rdf:about="http://www.bbc.co.uk/programmes/b00kygwh#programme">
                  <rdf:type rdf:resource="http://purl.org/ontology/po/Episode"/>
                  <po:pid>b00kygwh</po:pid>
                  <dc:title>The Terminator</dc:title>
              </rdf:Description>
          </data>
      </response>
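
When the same request is built from Java rather than typed into a shell, the document URL is best percent-encoded before it goes into the query string. The sketch below is an assumption-laden helper, not an any23 API: Any23Request is a hypothetical class, and only the endpoint and the format, url and report parameters are taken from the curl call above. It builds the URL without performing the HTTP request.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that builds a request URL for the any23 Web service. */
final class Any23Request {

    private Any23Request() { }

    /**
     * @param endpoint    service endpoint, e.g. "http://any23.org/any23/"
     * @param format      output format name, e.g. "nquads"
     * @param documentUrl URL of the document to distill
     * @param report      whether to ask for the validation report
     */
    static String build(String endpoint, String format, String documentUrl, boolean report) {
        try {
            // Encode the nested URL so its ':' and '/' cannot be confused with the outer query.
            String encoded = URLEncoder.encode(documentUrl, "UTF-8");
            return endpoint + "?format=" + format + "&url=" + encoded
                    + (report ? "&report=on" : "");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }
}
```

The resulting string can be handed to any HTTP client; it is assumed here that the service decodes the url parameter in the usual way.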

[architecture diagram, top to bottom]

       mimetype detection    Apache Tika
       DOM extraction        CyberNeko HTML
       Validator             Rule, Fix
       extractors            Microdata Extractor, RDFa Extractor,
                             Microformat Extractors (hListing, hReview,
                             hCalendar, hCard)
       ExtractionResult      Sesame
       writers               RDF/XML Writer, NQuads Writer, JSON Writer



extractor
  public interface Extractor<Input> {

        /**
         * Executes the extractor. Will be invoked only once, extractors are
         * not reusable.
         *
         * @param in         The extractor's input
         * @param documentURI The document's URI
         * @param out        Sink for extracted data
         * @throws IOException         On error while reading from the input stream
         * @throws ExtractionException On other error, such as parse errors
         */
        void run(Input in, URI documentURI, ExtractionResult out)
               throws IOException, ExtractionException;

        /**
         * Returns a {@link org.deri.any23.extractor.ExtractorDescription} of
         * this extractor.
         */
        ExtractorDescription getDescription();

  }
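
To show what the "invoked only once" contract might look like in practice, here is a self-contained sketch that does not use the any23 classes. StatementSink, MiniExtractor and TitleExtractor are hypothetical stand-ins for ExtractionResult, Extractor and a concrete extractor; the real interfaces carry more than this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical, simplified stand-in for ExtractionResult: collects statements as strings. */
final class StatementSink {
    final List<String> statements = new ArrayList<>();

    void write(String subject, String predicate, String object) {
        statements.add(subject + " " + predicate + " " + object);
    }
}

/** Hypothetical, simplified stand-in for Extractor<Input>. */
interface MiniExtractor<I> {
    void run(I in, String documentURI, StatementSink out);
}

/** Emits the document title and enforces the "invoked only once" contract. */
final class TitleExtractor implements MiniExtractor<String> {

    private final AtomicBoolean used = new AtomicBoolean(false);

    @Override
    public void run(String html, String documentURI, StatementSink out) {
        // A second call on the same instance is rejected: extractors are not reusable.
        if (!used.compareAndSet(false, true)) {
            throw new IllegalStateException("extractors are not reusable");
        }
        int start = html.indexOf("<title>");
        int end = html.indexOf("</title>");
        if (start >= 0 && end > start) {
            String title = html.substring(start + "<title>".length(), end).trim();
            out.write(documentURI, "dc:title", title);
        }
    }
}
```

Because each instance is single-use, a caller would create a fresh extractor per document, which is presumably why the plugin slide exposes an ExtractorFactory rather than an extractor instance.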

validate and fix
  public interface Rule {

        String getHRName();

        boolean applyOn(
           DOMDocument document,
           RuleContext context,
           ValidationReportBuilder validationReportBuilder
        );
  }

  public interface Fix {

        String getHRName();

        void execute(Rule rule, RuleContext context, DOMDocument document);

  }



      void addRule(Class<? extends Rule> rule, Class<? extends Fix> fix);
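
The pairing of a rule with its fix can be sketched without the any23 types. Everything below is hypothetical: MiniRule, MiniFix and MiniValidator mimic the shape of the interfaces above but work on a text buffer instead of a DOMDocument, register instances rather than classes, and assume that a fix runs whenever its rule reports true. The example rule detects the typographic quotes that often creep into copied markup.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for Rule, working on a text buffer instead of a DOMDocument. */
interface MiniRule {
    String getHRName();
    boolean applyOn(StringBuilder document);
}

/** Hypothetical stand-in for Fix. */
interface MiniFix {
    String getHRName();
    void execute(MiniRule rule, StringBuilder document);
}

/** Runs every registered rule and executes the paired fix when the rule fires. */
final class MiniValidator {
    private final Map<MiniRule, MiniFix> rules = new LinkedHashMap<>();

    void addRule(MiniRule rule, MiniFix fix) {
        rules.put(rule, fix);
    }

    /** Returns the number of fixes that were applied. */
    int validate(StringBuilder document) {
        int applied = 0;
        for (Map.Entry<MiniRule, MiniFix> entry : rules.entrySet()) {
            if (entry.getKey().applyOn(document)) {
                entry.getValue().execute(entry.getKey(), document);
                applied++;
            }
        }
        return applied;
    }
}

/** Fires when the document contains typographic double quotes. */
final class CurlyQuoteRule implements MiniRule {
    public String getHRName() { return "curly-quote-rule"; }

    public boolean applyOn(StringBuilder document) {
        return document.indexOf("\u201C") >= 0 || document.indexOf("\u201D") >= 0;
    }
}

/** Replaces typographic double quotes with plain ASCII ones. */
final class StraightQuoteFix implements MiniFix {
    public String getHRName() { return "straight-quote-fix"; }

    public void execute(MiniRule rule, StringBuilder document) {
        for (int i = 0; i < document.length(); i++) {
            char c = document.charAt(i);
            if (c == '\u201C' || c == '\u201D') {
                document.setCharAt(i, '"');
            }
        }
    }
}
```

Keeping detection and repair in separate objects means a rule can be reported in a validation report even when no fix is registered for it, which is one plausible reading of the ruleActivations element in the Web service response.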


plugins
  @PluginImplementation
  @Author(name="Michele Mostarda (mostarda@fbk.eu)")
  public class HTMLScraperPlugin implements ExtractorPlugin {

      private static final Logger logger =
          LoggerFactory.getLogger(HTMLScraperPlugin.class);

      @Init
      public void init() {
          logger.info("Plugin initialization.");
      }

      @Shutdown
      public void shutdown() {
          logger.info("Plugin shutdown.");
      }

      public ExtractorFactory getExtractorFactory() {
          return HTMLScraperExtractor.factory;
      }

  }

roadmap
      upcoming 0.6.0 release
       - support for Microdata
       - support for CSV
       - support for the RDFa 1.1 prefix mechanism
       - improved app configuration
       - bug fixes

      Apache (pre-)Incubation process
          - http://wiki.apache.org/incubator/Any23Proposal
          - supporters and mentors (thanks guys!)
            Simone Tripodi (@stripodi)
            Tommaso Teofili (@tteofili)
          - we’re looking for mentors

closing credits




                                  active committers

                             Giovanni Tummarello ( @jccq )
                              Michele Mostarda ( @micmos )
                           Davide Palmisano ( @dpalmisano )
                              Richard Cyganiak ( @cygri )

                   thanks to the whole Semantic Web community,
                  especially those who tirelessly challenge us
                     with bug reports and feature requests

References



      [1] http://purl.org/goodrelations/v1

      [2] http://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/



