This document discusses techniques for extracting structured data from unstructured web pages, including microformats, RDFa, HTML5 Microdata, and the Any23 tool. Any23 is an open-source Java library and command-line tool that can extract RDF triples from various semantically marked up web documents like those using RDFa, microformats, etc. It allows distilling the semantic web "drop by drop" from ordinary web pages.
distilling the Web of Data drop by drop (with Java)
1. distilling the Web of Data
drop by drop (with Java)
Sourcesense UK “Last Wednesday” - Davide Palmisano @dpalmisano
Wednesday, June 29, 2011
2. the shortest introduction
ever to the Web of Data
Web page markup technologies are
intended for human consumption
they let machines present raw
data to humans
extracting valuable data may
require fancy scraping techniques
scraping: one size doesn’t fit all
3. the shortest introduction
ever to the Web of Data
<div>
<div> Canon Rebel T2i (EOS 550D) $899 </div>
<div> The Rebel T2i EOS 550D is Canon's
top-of-the-line consumer digital SLR
camera. It can shoot up
<div> AN_UCC-13: 013803123784 </div>
<div> price: 899 USD </div>
</div>
</div>
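Why "one size doesn't fit all": a minimal Java sketch of a scraper tied to one page layout. The regular expression and the second page (`pageB`) are assumptions made up for illustration; they are not part of the talk.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PriceScraper {
    // Hypothetical pattern tied to one page layout: "price: <amount> <currency>".
    private static final Pattern PRICE =
            Pattern.compile("price:\\s*(\\d+(?:\\.\\d+)?)\\s+([A-Z]{3})");

    /** Returns "amount currency" if the layout-specific pattern matches. */
    static Optional<String> scrapePrice(String html) {
        Matcher m = PRICE.matcher(html);
        return m.find()
                ? Optional.of(m.group(1) + " " + m.group(2))
                : Optional.empty();
    }

    public static void main(String[] args) {
        String pageA = "<div><div> Canon Rebel T2i (EOS 550D) $899 </div>"
                + "<div> price: 899 USD </div></div>";
        // Same data, different wording: the scraper silently finds nothing.
        String pageB = "<div><div> Canon Rebel T2i (EOS 550D) </div>"
                + "<div> cost 899 dollars </div></div>";
        System.out.println(scrapePrice(pageA).orElse("no price found")); // 899 USD
        System.out.println(scrapePrice(pageB).orElse("no price found")); // no price found
    }
}
```

Nothing in the markup tells the scraper which `<div>` holds the price, so every new layout needs a new pattern.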
4. the shortest introduction
ever to the Web of Data
<div>
<div> Canon Rebel T2i (EOS 550D) $899 </div>
<div> The Rebel T2i EOS 550D is Canon's
top-of-the-line consumer digital SLR
camera. It can shoot up
<div> AN_UCC-13: 013803123784 </div>
<div> price: 899 USD </div>
</div>
</div>
(callout: what does this tag mean?)
5. the shortest introduction
ever to the Web of Data
<div>
<div> Canon Rebel T2i (EOS 550D) $899 </div>
<div> The Rebel T2i EOS 550D is Canon's
top-of-the-line consumer digital SLR
camera. It can shoot up
<div> AN_UCC-13: 013803123784 </div>
<div> price: 899 USD </div>
</div>
</div>
(callouts: what does this tag mean? is this a currency or what?)
7. Microformats
“Microformats are a way of adding simple markup
to human-readable data items such as events,
contact details or locations, on web pages”
Andy Mabbett
- community driven initiative
- largely adopted
- quick & dirty
- scarcely extensible
8. Microformats
<div class="hlisting item">
<div> Canon Rebel T2i (EOS 550D) $899 </div>
<div class="description"> The Rebel T2i EOS
550D is Canon's
top-of-the-line consumer digital SLR
camera. It can shoot up
<div> AN_UCC-13: 013803123784 </div>
<div class="price"> price: 899 USD </div>
</div>
</div>
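With microformat class names in place, a consumer can look fields up by name instead of by position. The sketch below uses only the JDK's XML and XPath APIs on a small, well-formed variant of the listing; the class `HListingReader` and its method are hypothetical names, and real-world HTML would first need a tolerant HTML parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class HListingReader {
    /**
     * Returns the trimmed text of the first element whose class attribute
     * contains the given token, or "" if there is none.
     */
    static String textByClass(String xhtml, String token) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Match the token as a whole word inside a space-separated class list.
            String expr = "string(//*[contains(concat(' ', "
                    + "normalize-space(@class), ' '), ' " + token + " ')])";
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc).trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<div class=\"hlisting item\">"
                + "<div>Canon Rebel T2i (EOS 550D) $899</div>"
                + "<div class=\"description\">The Rebel T2i EOS 550D is a"
                + " consumer digital SLR camera.</div>"
                + "<div class=\"price\">price: 899 USD</div>"
                + "</div>";
        System.out.println(textByClass(xhtml, "description"));
        System.out.println(textByClass(xhtml, "price"));
    }
}
```

The price is still an unstructured string ("price: 899 USD"), which hints at the limited expressiveness noted on the previous slide.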
9. RDFa: RDF in attributes
model your data as if they were Web pages
connected with named links and properties
and embed them in your (X)HTML using
@attributes
- RDF, graph-based model
- W3C Recommendation
- highly extensible
e.g. GoodRelations [1], a full-featured
vocabulary for e-commerce
10. RDFa: RDF in attributes
model your data
http://mystore.com/product/5642
  ex:description -> "The Rebel T2i EOS 550D blah blah"
  ex:producer -> http://canon.co.uk
  ex:price -> (price node)
    ex:value -> 899
    ex:currency -> USD
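The graph above can be written down as plain subject-predicate-object statements. This is a minimal sketch, not tied to any RDF library: the `Triple` class is hypothetical, the namespace behind `ex:` is an assumed placeholder, and the price is modelled as a blank node, which is one reading of the diagram.

```java
import java.util.Arrays;
import java.util.List;

final class ProductGraph {
    // Assumed placeholder namespace for the slide's "ex:" prefix.
    static final String EX = "http://example.org/vocab#";

    /** One RDF statement; the object is an IRI/blank node or a plain literal. */
    static final class Triple {
        final String subject, predicate, object;
        final boolean literal;

        Triple(String subject, String predicate, String object, boolean literal) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
            this.literal = literal;
        }

        /** Serializes the statement as one N-Triples line. */
        String toNTriples() {
            String obj = literal
                    ? "\"" + object.replace("\\", "\\\\").replace("\"", "\\\"") + "\""
                    : term(object);
            return term(subject) + " " + term(predicate) + " " + obj + " .";
        }

        private static String term(String t) {
            return t.startsWith("_:") ? t : "<" + t + ">";
        }
    }

    static List<Triple> canonExample() {
        String product = "http://mystore.com/product/5642";
        String price = "_:price"; // blank node for the price
        return Arrays.asList(
                new Triple(product, EX + "description",
                        "The Rebel T2i EOS 550D blah blah", true),
                new Triple(product, EX + "producer", "http://canon.co.uk", false),
                new Triple(product, EX + "price", price, false),
                new Triple(price, EX + "value", "899", true),
                new Triple(price, EX + "currency", "USD", true));
    }

    public static void main(String[] args) {
        for (Triple t : canonExample()) {
            System.out.println(t.toNTriples());
        }
    }
}
```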
11. RDFa: RDF in attributes
and then embed them in your
(X)HTML pages
<div about="http://mystore.com/product/5642">
<div>Canon Rebel T2i (EOS 550D) $899</div>
<div property="gr:description">The Rebel T2i EOS 550D
is Canon's blah blah</div>
<div rel="gr:hasPriceSpecification">
<span> price:
<span property="gr:hasCurrencyValue">899</span>
<span property="gr:hasCurrency">USD</span>
</span>
</div>
</div>
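To show how these attributes turn into statements, here is a deliberately tiny walker over well-formed XHTML. It is an illustrative sketch only, far from a conforming RDFa processor: it handles just `about`, `property` and `rel`, treats every `rel` as pointing to a fresh blank node, and leaves prefixes such as `gr:` unexpanded. The class name `TinyRdfaWalker` is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TinyRdfaWalker {
    private final List<String> triples = new ArrayList<>();
    private int blankCounter = 0;

    /** Parses well-formed XHTML and returns Turtle-like statements. */
    static List<String> extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            TinyRdfaWalker walker = new TinyRdfaWalker();
            walker.walk(doc.getDocumentElement(), "");
            return walker.triples;
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XHTML", e);
        }
    }

    private void walk(Element el, String subject) {
        // @about sets the subject for this element and its descendants.
        if (el.hasAttribute("about")) {
            subject = "<" + el.getAttribute("about") + ">";
        }
        // @property emits a literal taken from the element's text.
        if (el.hasAttribute("property")) {
            triples.add(subject + " " + el.getAttribute("property")
                    + " \"" + el.getTextContent().trim() + "\" .");
        }
        // @rel links to a new blank node that becomes the children's subject.
        String childSubject = subject;
        if (el.hasAttribute("rel")) {
            childSubject = "_:b" + (blankCounter++);
            triples.add(subject + " " + el.getAttribute("rel")
                    + " " + childSubject + " .");
        }
        NodeList kids = el.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            if (kids.item(i) instanceof Element) {
                walk((Element) kids.item(i), childSubject);
            }
        }
    }

    public static void main(String[] args) {
        String xhtml = "<div about=\"http://mystore.com/product/5642\">"
                + "<div property=\"gr:description\">The Rebel T2i EOS 550D</div>"
                + "<div rel=\"gr:hasPriceSpecification\"><span> price: "
                + "<span property=\"gr:hasCurrencyValue\">899</span> "
                + "<span property=\"gr:hasCurrency\">USD</span>"
                + "</span></div></div>";
        extract(xhtml).forEach(System.out::println);
    }
}
```

The price specification becomes its own node with a value and a currency, which is exactly the structure the plain `<div>` soup could not express.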
12. HTML5: Microdata
Microdata allows nested groups of name-value
pairs to be added to HTML documents, in
parallel with the existing content
- W3C Working Draft
- native to the HTML5 specification
- serializable as RDF
- Google, Yahoo! and Bing endorsed Schema.org
- large adoption expected
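"Nested groups of name-value pairs" maps naturally onto nested maps. The sketch below reads `itemscope`, `itemtype` and `itemprop` from a small well-formed snippet using Schema.org's Product and Offer types. It is an assumption-laden illustration rather than a Microdata implementation: `TinyMicrodataReader` is a hypothetical name, `itemscope` is written as `itemscope=""` so the JDK XML parser accepts it, and features such as `itemref` or repeated properties are ignored.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TinyMicrodataReader {
    /** Parses the snippet and reads its root element as one item. */
    static Map<String, Object> parse(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return readItem(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XHTML", e);
        }
    }

    private static Map<String, Object> readItem(Element scope) {
        Map<String, Object> item = new LinkedHashMap<>();
        if (scope.hasAttribute("itemtype")) {
            item.put("@type", scope.getAttribute("itemtype"));
        }
        collect(scope, item);
        return item;
    }

    private static void collect(Element parent, Map<String, Object> item) {
        NodeList kids = parent.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            if (!(kids.item(i) instanceof Element)) {
                continue;
            }
            Element el = (Element) kids.item(i);
            boolean nested = el.hasAttribute("itemscope");
            if (el.hasAttribute("itemprop")) {
                // A property is either a nested item or the element's text.
                item.put(el.getAttribute("itemprop"),
                        nested ? readItem(el) : el.getTextContent().trim());
            }
            if (!nested) {
                collect(el, item); // properties may sit deeper in the markup
            }
        }
    }

    public static void main(String[] args) {
        String xhtml = "<div itemscope=\"\" itemtype=\"http://schema.org/Product\">"
                + "<span itemprop=\"name\">Canon Rebel T2i (EOS 550D)</span>"
                + "<div itemprop=\"offers\" itemscope=\"\""
                + " itemtype=\"http://schema.org/Offer\">"
                + "<span itemprop=\"price\">899</span>"
                + "<span itemprop=\"priceCurrency\">USD</span>"
                + "</div></div>";
        System.out.println(parse(xhtml));
    }
}
```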
14. % of marked up Web pages
[chart: share of Web pages (y-axis 0 to 3.5%) carrying RDFa, hCard, adr, xfn and hReview markup, sampled in 09/2008, 03/2009 and 10/2010]
data from Yahoo! [2]
15. tie ‘em all together
uniform, reconciled and
unified RDF representation
16. a drop-by-drop distiller
Anything To Triples (any23) is an open source,
Apache-licensed:
- Java library,
- Web service and
- a command-line tool
able to distill RDF triples from a
variety of semantically marked up Web
documents
http://developers.any23.org
17. live demo http://any23.org
Web site with ~5000 product descriptions marked up with
GoodRelations using RDFa
18. use Any23 in your Java
programs
// Configure the runner and the HTTP client used to fetch the page.
Any23 runner = new Any23();
runner.setHTTPUserAgent("test-user-agent");
HTTPClient httpClient = runner.getHTTPClient();
DocumentSource source = new HTTPDocumentSource(
    httpClient,
    "http://test.com/index.html"
);
// Collect the extracted triples in memory.
ByteArrayOutputStream out = new ByteArrayOutputStream();
TripleHandler handler = new NTriplesWriter(out);
runner.extract(source, handler);
// The string holds N-Triples, as produced by NTriplesWriter.
String n3 = out.toString("UTF-8");
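Once the N-Triples text is in a string, a quick sanity check is to tally statements per predicate. The helper below is a hypothetical, JDK-only sketch: it relies on each statement sitting on one line with whitespace-separated subject and predicate, and it is not a full N-Triples parser.

```java
import java.util.Map;
import java.util.TreeMap;

final class NTriplesTally {
    /** Counts statements per predicate in line-oriented N-Triples text. */
    static Map<String, Integer> countByPredicate(String ntriples) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String raw : ntriples.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            // Subject and predicate never contain spaces; the object may.
            String[] parts = line.split("\\s+", 3);
            if (parts.length == 3) {
                counts.merge(parts[1], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample =
                "<http://mystore.com/product/5642> <http://example.org/p> \"a b\" .\n"
              + "<http://mystore.com/product/5642> <http://example.org/p> \"c\" .\n"
              + "<http://mystore.com/product/5642> <http://example.org/q> _:b0 .\n";
        countByPredicate(sample)
                .forEach((p, n) -> System.out.println(n + "  " + p));
    }
}
```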
19. Any23: Command-Line tool
any23-core/bin$ ./any23
usage: any23 [-e <arg>] [-f <arg>] [-l <arg>] [-n] [-o <arg>]
             [-p] [-s] [-t] [-v] {<url>|<file>}
 -e <arg>            comma-separated list of extractors, e.g.
                     rdf-xml,rdf-turtle
 -f,--format <arg>   Output format [turtle (default),
                     ntriples, rdfxml, quad, uris]
 -l,--log <arg>      logging, please specify a file
 -n,--nesting        disable production of nesting triples
 -o,--output <arg>   output file (defaults to stdout)
 -p,--pedantic       validates and fixes HTML content,
                     detecting common issues
 -s,--stats          print out statistics of Any23
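The same tool can be driven from a Java program by assembling the argument list shown in the usage text. This is a sketch under assumptions: the `./any23` path, the output file name and the helper class `Any23Cli` are placeholders, and the actual launch is left commented out so the example runs without the tool installed.

```java
import java.util.Arrays;
import java.util.List;

final class Any23Cli {
    /** Builds the argument list using the -f and -o options from the usage text. */
    static List<String> command(String executable, String format,
                                String outputFile, String input) {
        return Arrays.asList(executable, "-f", format, "-o", outputFile, input);
    }

    public static void main(String[] args) {
        List<String> cmd = command(
                "./any23", "ntriples", "out.nt", "http://test.com/index.html");
        System.out.println(String.join(" ", cmd));
        // To actually launch the tool (assumes ./any23 exists in the working
        // directory and main declares "throws Exception"):
        // int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```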
21. Apache Tika
mimetype detection
CyberNeko HTML
DOM extraction
Validator (Rule, Fix)
Extractors: Microdata Extractor, RDFa Extractor, Microformat Extractors (hListing, hReview, hCalendar, hCard)
ExtractionResult
Writers (Sesame): RDF/XML Writer, NQuads Writer, JSON Writer
22. extractor
public interface Extractor<Input> {

    /**
     * Executes the extractor. Will be invoked only once, extractors are
     * not reusable.
     *
     * @param in The extractor's input
     * @param documentURI The document's URI
     * @param out Sink for extracted data
     * @throws IOException On error while reading from the input stream
     * @throws ExtractionException On other error, such as parse errors
     */
    void run(Input in, URI documentURI, ExtractionResult out)
        throws IOException, ExtractionException;

    /**
     * Returns a {@link org.deri.any23.extractor.ExtractorDescription} of
     * this extractor.
     */
    ExtractorDescription getDescription();
}
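To give a feel for what an implementation of this contract looks like, here is a self-contained sketch that mirrors its shape without depending on Any23. Everything in it is a stand-in: `SimpleExtractor` and `TripleSink` are hypothetical simplifications of `Extractor` and `ExtractionResult` (whose real methods are not shown in the talk), `java.net.URI` replaces the library's own URI type, and the regex-based title lookup is only for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TitleExtractorSketch {
    /** Hypothetical stand-in for ExtractionResult: receives statements. */
    interface TripleSink {
        void writeTriple(String subject, String predicate, String object);
    }

    /** Hypothetical stand-in with the same shape as Extractor#run. */
    interface SimpleExtractor<I> {
        void run(I in, URI documentURI, TripleSink out) throws IOException;
    }

    /** Emits one dcterms:title statement for the page's <title>, if any. */
    static final class TitleSketchExtractor implements SimpleExtractor<String> {
        private static final Pattern TITLE = Pattern.compile(
                "<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        @Override
        public void run(String html, URI documentURI, TripleSink out) {
            Matcher m = TITLE.matcher(html);
            if (m.find()) {
                out.writeTriple("<" + documentURI + ">",
                        "<http://purl.org/dc/terms/title>",
                        "\"" + m.group(1).trim() + "\"");
            }
        }
    }

    /** Runs the extractor once and collects its output as text lines. */
    static List<String> extractTitle(String html, String uri) {
        List<String> triples = new ArrayList<>();
        new TitleSketchExtractor().run(html, URI.create(uri),
                (s, p, o) -> triples.add(s + " " + p + " " + o + " ."));
        return triples;
    }

    public static void main(String[] args) {
        String html = "<html><head><title> Canon Rebel T2i </title></head></html>";
        extractTitle(html, "http://mystore.com/product/5642")
                .forEach(System.out::println);
    }
}
```

The extractor only pushes statements into the sink it is given; where they end up (a writer, a store, a test list) is decided by the caller, which is the separation the interface above is designed around.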
24. plugins
@PluginImplementation
@Author(name="Michele Mostarda (mostarda@fbk.eu)")
public class HTMLScraperPlugin implements ExtractorPlugin {

    private static final Logger logger =
        LoggerFactory.getLogger(HTMLScraperPlugin.class);

    @Init
    public void init() {
        logger.info("Plugin initialization.");
    }

    @Shutdown
    public void shutdown() {
        logger.info("Plugin shutdown.");
    }

    public ExtractorFactory getExtractorFactory() {
        return HTMLScraperExtractor.factory;
    }
}
25. roadmap
upcoming 0.6.0 release
- support for Microdata
- support for CSV
- support for RDFa 1.1 prefix mechanism
- improved app configuration
- bug fixes
Apache (pre) Incubation process
- http://wiki.apache.org/incubator/Any23Proposal
- supporters and mentors (thanks guys!)
Simone Tripodi (@stripodi)
Tommaso Teofili (@tteofili)
- we’re looking for mentors
26. closing credits
active committers
Giovanni Tummarello ( @jccq )
Michele Mostarda ( @micmos )
Davide Palmisano ( @dpalmisano )
Richard Cyganiak ( @cygri )
thanks to the whole Semantic Web community,
especially those who tirelessly challenge us
with bug reports and feature requests