XPath (1.0) for web scraping
Paul Tremberth, 17 October 2015, PyCon FR
Who am I?
I’m currently Head of Support at
Scrapinghub.
I got introduced to Python through web
scraping.
You can find me on StackOverflow under the “xpath”,
“scrapy”, and “lxml” tags.
I have a few repos on GitHub.
Talk outline
● What is XPath?
● Location paths
● HTML data extraction examples
● Advanced use-cases
What is XPath?
XPath is a language
"XPath is a language for addressing parts of an
XML document" — XML Path Language 1.0
XPath data model is a tree of nodes:
● element nodes (<p>...</p>)
● attribute nodes (href="page.html" )
● text nodes ("Some Title")
● comment nodes (<!-- a comment --> )
(and 3 other types that we won’t cover here.)
Why learn XPath?
● navigate everywhere inside a DOM tree
● a must-have skill for accurate web data
extraction
● more powerful than CSS selectors
○ fine-grained look at the text content
○ complex conditioning with axes
● extensible with custom functions
(we won’t cover that in this talk though)
Also, it’s kind of fun :-)
XPath Data Model: sample HTML
<html>
<head>
<title>Ceci est un titre</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type">
</head>
<body>
<div>
<div>
<p>Ceci est un paragraphe.</p>
<p>Est-ce <a href="page2.html">un lien</a>?</p>
<br>
Apparemment.
</div>
<div class="second">
Rien &agrave; ajouter.
Sauf cet <a href="page3.html">autre lien</a>.
<!-- Et ce commentaire -->
</div>
</div>
</body>
</html>
XPath Data Model (cont.)
root
└── html (1)
    ├── head (2)
    │   ├── title (3)
    │   │   └── text: "Ceci..." (4)
    │   └── meta (5)  [attribute nodes: content, http-equiv]
    └── body (6)
        ├── div (7)
        │   ├── p (8)
        │   │   └── text: "Ceci..." (9)
        │   ├── p (10)
        │   │   ├── text: "Est-ce.." (11)
        │   │   ├── a (12)  [attribute node: href]
        │   │   │   └── text: "un lien" (13)
        │   │   └── text: "?" (14)
        │   ├── br (15)
        │   └── text: "Apparemment." (16)
        └── div (17)
            ├── text: "Rien..." (18)
            ├── a (19)  [attribute node: href]
            │   └── text: "autre lien" (20)
            ├── text: "." (21)
            └── comment (22)
Nodes have an order, the document order: the order they appear in the XML/HTML source (the numbers above).
XPath return types
XPath expressions can return different
things:
● node-sets (most common case, and
most often element nodes)
● strings
● numbers (floating point)
● booleans
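For example, with lxml (a minimal sketch; htmlsource is assumed to hold the sample HTML from the previous slides):
>>> import lxml.html
>>> root = lxml.html.fromstring(htmlsource)
>>> root.xpath('//p')                             # node-set: a list of element nodes
[<Element p at 0x...>, <Element p at 0x...>]
>>> root.xpath('string(//title)')                 # string (string-value of the first node)
'Ceci est un titre'
>>> root.xpath('count(//div)')                    # number (always a float)
3.0
>>> root.xpath('boolean(//div[@class="second"])') # boolean
True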
Example XPath expressions
<html>
<head>
<title>Ceci est un titre</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type">
</head>
<body>
<div>
<div>
<p>Ceci est un paragraphe.</p>
<p>Est-ce <a href="page2.html">un lien</a>?</p>
<br>
Apparemment.
</div>
<div class="second">
Rien &agrave; ajouter.
Sauf cet <a href="page3.html">autre lien</a>.
<!-- Et ce commentaire -->
</div>
</div>
</body>
</html>
/html/head/title
//meta/@content
//div/p
//div/a/@href
//div/a/text()
//div/div[@class="second"]
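Evaluated with lxml against the HTML above, these return roughly the following (a sketch; root is parsed as on the later slides):
>>> root.xpath('/html/head/title')            # [<Element title at 0x...>]
>>> root.xpath('//meta/@content')             # ['text/html; charset=utf-8']
>>> root.xpath('//div/p')                     # the two <p> element nodes
>>> root.xpath('//div/a/@href')               # ['page3.html'] (the first <a> is a child of <p>, not of a <div>)
>>> root.xpath('//div/a/text()')              # ['autre lien']
>>> root.xpath('//div/div[@class="second"]')  # [<Element div at 0x...>]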
Location Paths:
how to move inside the document tree
Location Paths
A location path is the most common XPath expression.
It is used to move in any direction from a starting point (the context node) to any node(s) in the tree.
● a string, with a series of “steps”:
○ "step1 / step2 / step3 ..."
● represents selection & filtering of nodes,
processed step by step, from left to right
● each step is:
○ AXIS :: NODETEST [PREDICATE]*
(Whitespace does NOT matter, except inside “//”: “/ /” is a syntax error. So don’t be afraid of indenting your XPath expressions.)
Relative vs. absolute paths
● "step1/step2/step3" is relative
● "/step1/step2/step3" is absolute
● i.e. an absolute path is a relative path
starting with "/" (slash)
● in fact, absolute paths are relative to the
root node
● use relative paths whenever possible
○ prevents unexpectedly selecting the same nodes on every loop iteration (see the sketch below)
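A minimal sketch of that pitfall, assuming root was parsed from the sample HTML with lxml:
>>> second = root.xpath('//div[@class="second"]')[0]
>>> second.xpath('//a/@href')   # absolute: starts over from the document root
['page2.html', 'page3.html']
>>> second.xpath('.//a/@href')  # relative to the context node (second)
['page3.html']
Inside a loop over elements, the absolute form silently re-selects the same document-wide nodes on every iteration.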
Location Paths: abbreviations
What we’ve seen earlier is in fact “abbreviated syntax”.
Full syntax is quite verbose (again, whitespace doesn’t matter):

Abbreviated syntax              Full syntax
/html/head/title                /child:: html /child:: head /child:: title
//meta/@content                 /descendant-or-self::node()/child::meta/attribute::content
//div/div[@class="second"]      /descendant-or-self::node()
                                    /child::div
                                    /child::div [ attribute::class = "second" ]
//div/a/text()                  /descendant-or-self::node()
                                    /child::div/child::a/child::text()
Axes: moving around
AXIS :: nodetest [predicate]*
Axes give the direction to go next.
● self (where you are)
● parent, child (direct hop)
● ancestor, ancestor-or-self, descendant,
descendant-or-self (multi-hop)
● following, following-sibling, preceding,
preceding-sibling (document order)
● attribute, namespace (non-element)
Axes: move up or down the tree
<body>
<div>
<div>
<p>Ceci est un paragraphe.</p>
<p>Est-ce <a href="page2.html">un lien</a>?</p>
<br>
Apparemment.
</div>
<div class="second">
Rien &agrave; ajouter.
Sauf cet <a href="page3.html">autre lien</a>.
<!-- Et ce commentaire -->
</div>
</div>
</body>
● self: context node
● child: children of context
node in the tree
● descendant: children of
context node, children of
children, …
● ancestor: parent of
context node (here,
<body>)
Axes: move on same tree level
<div>
<p>Ceci est un paragraphe.</p>
<p>Est-ce <a href="page2.html">un lien</a>?</p>
<br>
Apparemment.
</div>
● preceding-sibling: same
level in tree, but BEFORE
in document order
● self: context node
● following-sibling: same
level in tree, but AFTER
in document order
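Starting from the first <p> of the sample, for instance (a sketch with lxml):
>>> first_p = root.xpath('//p')[0]
>>> first_p.xpath('following-sibling::*')   # elements after it, same level: the 2nd <p> and <br>
[<Element p at 0x...>, <Element br at 0x...>]
>>> first_p.xpath('preceding-sibling::*')   # nothing before it on that level
[]
>>> first_p.xpath('normalize-space(following-sibling::text()[last()])')  # "Apparemment." is a sibling text node
'Apparemment.'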
Axes: document partitioning
<html>
<head>
<title>Ceci est un titre</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type">
</head>
<body>
<div>
<div>
<p>Ceci est un paragraphe.</p>
<p>Est-ce <a href="page2.html">un lien</a>?</p>
<br>
Apparemment.
</div>
<div class="second">
Rien &agrave; ajouter.
Sauf cet <a href="page3.html">autre lien</a>.
<!-- Et ce commentaire -->
</div>
</div>
</body>
</html>
● ancestor: parent, parent
of parent, ...
● preceding: before in
document order,
excluding ancestors
● self: context node
● descendant: children,
children of children, ...
● following: after in
document order,
excluding descendants
self U (ancestor U preceding) U (descendant U following) == all nodes
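A small check of that partition with lxml (attribute and namespace nodes live on their own axes and are not counted here):
>>> link = root.xpath('//a')[0]  # pick any context node
>>> axes = ('self', 'ancestor', 'preceding', 'descendant', 'following')
>>> in_partition = sum(int(link.xpath('count(%s::node())' % axis)) for axis in axes)
>>> all_nodes = int(link.xpath('count(/descendant-or-self::node())'))  # the root node and everything below it
>>> in_partition == all_nodes
True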
Node tests
axis :: NODETEST [predicate]*
Select node types along the axes:
● either a name test:
○ element names: “html”, “p”, ...
○ attribute names: “http-equiv”, “href”, …
● or a node type test:
○ node(): ALL node types
○ text(): text nodes, i.e. character data
○ comment(): comment nodes
○ or * (i.e. the axis’ principal node type)
(Note: text() is not a function call.)
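On the sample's second <div>, the different node tests select (a sketch with lxml):
>>> second = root.xpath('//div[@class="second"]')[0]
>>> second.xpath('child::*')          # elements only: just the <a>
>>> second.xpath('child::text()')     # the three text node children
>>> second.xpath('child::comment()')  # the <!-- Et ce commentaire --> node
>>> second.xpath('child::node()')     # all five children, in document order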
Predicates
axis :: nodetest [PREDICATE]*
Additional node properties to filter on:
● simplest are positional (start at 1, not 0):
//div[3] (i.e. 3rd child div element)
● can be nested, can use location paths:
//div[p[a/@href="sample.html"]]
● applied in order, from left to right:
//div[2][@class="content"]
vs.
//div[@class="content"][2]
(see the sketch below)
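The last two expressions select different things. A minimal sketch with lxml on a hypothetical three-div fragment:
>>> doc = lxml.html.fromstring(
...     '<html><body><div>one</div><div class="content">two</div>'
...     '<div class="content">three</div></body></html>')
>>> doc.xpath('//div[2][@class="content"]/text()')  # the 2nd <div> child, if it also has the class
['two']
>>> doc.xpath('//div[@class="content"][2]/text()')  # the 2nd among the <div>s having the class
['three']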
Abbreviations: to remember
Abbreviated step    Meaning
*  (asterisk)       all element nodes (excluding text nodes, attribute nodes, etc.);
                    .//* != .//node() (note that there is no element() node test)
@*                  attribute::*, all attribute nodes
//                  /descendant-or-self::node()/
                    (exactly this, so .//* != ./descendant-or-self::*)
.  (a single dot)   self::node(), the context node
..  (2 dots)        parent::node()
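The .//* vs .//node() difference in practice, on the sample's second <div> (a quick sketch with lxml):
>>> second = root.xpath('//div[@class="second"]')[0]
>>> len(second.xpath('.//*'))       # element descendants only: just the <a>
1
>>> len(second.xpath('.//node()'))  # elements, text nodes and the comment
6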
Use cases:
“Show me some Python code!”
Text extraction
>>> import lxml.html
>>> root = lxml.html.fromstring(htmlsource)
>>> root.xpath('//div[@class="second"]/text()') # you get back a node-set of text nodes
[u'\n Rien \xe0 ajouter.\n Sauf cet ', '. \n ', '\n ']
>>> root.xpath('//div[@class="second"]//text()')
[u'\n Rien \xe0 ajouter.\n Sauf cet ', 'autre lien', '. \n ', '\n ']
>>> root.xpath('string(//div[@class="second"])') # you get back a single string
u'\n Rien \xe0 ajouter.\n Sauf cet autre lien. \n \n '
<div class="second">
Rien &agrave; ajouter.
Sauf cet <a href="page3.html">autre lien</a>.
<!-- Et ce commentaire -->
</div>
Attributes extraction
>>> import lxml.html
>>> root = lxml.html.fromstring(htmlsource)
>>> root.xpath('/html/head/meta/@content')
['text/html; charset=utf-8']
>>> root.xpath('/html/head/meta/@*')
['text/html; charset=utf-8', 'content-type']
<html><head>
<title>Ceci est un titre</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type">
</head>...</html>
Attribute names extraction
>>> for element in root.xpath('/html/head/meta'):
... attributes = []
...
... # loop over all attribute nodes of the element
... for index, attribute in enumerate(element.xpath('@*'), start=1):
...
... # use XPath's name() string function on each attribute,
... # using their position
... attribute_name = element.xpath('name(@*[%d])' % index)
... attributes.append((attribute_name, attribute))
...
>>> attributes
[('content', 'text/html; charset=utf-8'), ('http-equiv', 'content-type')]
>>> dict(attributes)
{'content': 'text/html; charset=utf-8', 'http-equiv': 'content-type'}
CSS Selectors
>>> selector.css('html > body > ul > li:first-child').extract()
[u'<li class="a b">apple</li>']
>>> selector.css('ul li + li').extract()
[u'<li class="b c">banana</li>', u'<li class="c a lastone">carrot</li>']
<html>
<body>
<ul>
<li class="a b">apple</li>
<li class="b c">banana</li>
<li class="c a lastone">carrot</li>
</ul>
</body>
</html>
lxml, scrapy and parsel use cssselect under the hood.
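Here selector is a parsel.Selector built from the HTML above (its construction is shown on the next slides). “Under the hood” means cssselect translates every CSS selector into an equivalent XPath 1.0 expression; a sketch (the exact string may vary between versions):
>>> from cssselect import GenericTranslator
>>> GenericTranslator().css_to_xpath('li.a')
"descendant-or-self::li[@class and contains(concat(' ', normalize-space(@class), ' '), ' a ')]"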
CSS Selectors (cont.)
>>> selector.css('li.a').extract()
[u'<li class="a b">apple</li>', u'<li class="c a lastone">carrot</li>']
>>> selector.css('li.a.c').extract()
[u'<li class="c a lastone">carrot</li>']
>>> selector.css('li[class$="one"]').extract()
[u'<li class="c a lastone">carrot</li>']
<html>
<body>
<ul>
<li class="a b">apple</li>
<li class="b c">banana</li>
<li class="c a lastone">carrot</li>
</ul>
</body>
</html>
Loop on elements (rows, lists, ...)
>>> import parsel
>>> selector = parsel.Selector(text=htmlsource)
>>> dict((li.xpath('string(./strong)').extract_first(),
li.xpath('normalize-space( \
./strong/following-sibling::text())').extract_first())
for li in selector.css('div.detail-product > ul > li'))
{u'Doublure': u'Textile',
u'Ref': u'22369',
u'Type': u'Baskets'}
<div class='detail-product'>
<ul>
<li><strong>Type</strong> Baskets</li>
<li><strong>Ref</strong> 22369</li>
<li><strong>Doublure</strong> Textile</li>
</ul>
</div> <!-- borrowed from spartoo.com... -->
XPath buckets
>>> for cnt, h in enumerate(selector.css('h2'), start=1):
... print h.xpath('string()').extract_first()
... print [p.xpath('string()').extract_first()
... for p in h.xpath('''./following-sibling::p[
... count(preceding-sibling::h2) = %d]''' % cnt)]
...
My BBQ Invitees
[u'Joe', u'Jeff', u'Suzy']
My Dinner Invitees
[u'Dylan', u'Hobbes']
<h2>My BBQ Invitees</h2>
<p>Joe</p>
<p>Jeff</p>
<p>Suzy</p>
<h2>My Dinner Invitees</h2>
<p>Dylan</p>
<p>Hobbes</p>
All elements are at the same level, all siblings. The idea here is to select the <p> elements and filter them by how many <h2> siblings come before them.
XPath buckets: generalization
>>> all_elements = selector.css('h2, p')
>>> h2_elements = selector.css('h2')
>>> order = lambda e: int(float(e.xpath('count(preceding::*) \
... + count(ancestor::*)').extract_first()))
>>> boundaries = [order(h) for h in h2_elements]
>>> buckets = []
>>> for pos, e in sorted((order(e), e) for e in all_elements):
... if pos in boundaries:
... bucket = []
... buckets.append(bucket)
... bucket.append((e.xpath('name()').extract_first(),
... e.xpath('string()').extract_first()))
>>> buckets
[[(u'h2', u'My BBQ Invitees'),
(u'p', u'Joe'), (u'p', u'Jeff'), (u'p', u'Suzy')],
[(u'h2', u'My Dinner Invitees'), (u'p', u'Dylan'), (u'p', u'Hobbes')]]
Counting ancestors and preceding elements gives you the node position in document order.
EXSLT extensions
<div itemscope itemtype ="http://schema.org/Movie">
<h1 itemprop="name">Avatar</h1>
<div itemprop="director" itemscope itemtype="http://schema.org/Person">
Director: <span itemprop="name">James Cameron</span> (born <span itemprop="birthDate">August 16, 1954</span>)
</div>
<span itemprop="genre">Science fiction</span>
<a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>
_xp_item = lxml.etree.XPath('descendant-or-self::*[@itemscope]')
_xp_prop = lxml.etree.XPath("""set:difference(.//*[@itemprop],
.//*[@itemscope]//*[@itemprop])""",
namespaces = {"set": "http://exslt.org/sets"})
EXSLT extensions (cont.)
{'items': [{'properties': {'director': {'properties': {'birthDate': u'August 16, 1954',
                                                       'name': u'James Cameron'},
                                        'type': ['http://schema.org/Person']},
                           'genre': u'Science fiction',
                           'name': u'Avatar',
                           'trailer': 'http://www.example.com/../movies/avatar-theatrical-trailer.html'},
            'type': ['http://schema.org/Movie']}]}
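Other EXSLT extensions are available from lxml's XPath engine in the same way, e.g. regular expressions (a sketch on the earlier sample HTML, unrelated to the microdata code above):
>>> root.xpath("//a[re:test(@href, '^page[0-9]+\\.html$')]/@href",
...            namespaces={'re': 'http://exslt.org/regular-expressions'})
['page2.html', 'page3.html']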
Scraping Javascript code
<div id="map"></div>
<script>
function initMap() {
var myLatLng = {lat: -25.363, lng: 131.044};
}
</script>
>>> import js2xml
>>> import parsel
>>>
>>> selector = parsel.Selector(text=htmlsource)
>>> jssnippet = selector.css('div#map + script::text').extract_first()
>>> jstree = js2xml.parse(jssnippet)
>>> js2xml.jsonlike.make_dict(jstree.xpath('//object[1]')[0])
{'lat': -25.363, 'lng': 131.044}
XPath tips & tricks
● Use relative XPath expressions
whenever possible
● Know your axes
● Remember XPath has string() and
normalize-space()
● text() is a node test, not a function call
● CSS selectors are very handy, easier to
maintain, but also less powerful
● js2xml is easier than regex+json.loads()
XPath resources
● Read http://www.w3.org/TR/xpath/
(it’s worth it!)
● XPath 1.0 tutorial by Zvon.org
● Concise XPath by Aristotle Pagaltzis
● XPath in lxml
● cssselect by Simon Sapin
● EXSLT extensions
Thank you!
Paul Tremberth, 17 October 2015
paul@scrapinghub.com
https://github.com/redapple
http://linkedin.com/in/paultremberth
Oh, and we’re hiring!
http://scrapinghub.com/jobs
