Frontera: open source, large scale web crawling framework
Alexander Sibiryakov, October 15, 2015
• Software Engineer @ Scrapinghub
• Born in Yekaterinburg, RU
• 5 years at Yandex, search
quality department: social and
QA search, snippets.
• 2 years at Avast! antivirus,
research team: automatic false-
positive solving, large-scale
prediction of malicious
Hola a todos! (Hello everyone!)
• Over 2 billion requests per month (~800 per second)
• Focused crawls & Broad crawls
We help turn web content
into useful data
"text": "'Extreme poverty' to fall below 10% of world population for first time",
"points": "9 points",
• Crawl the Spanish web to gather statistics about hosts and their
• Limit the crawl to the .es zone.
• Breadth-first strategy: first crawl documents one click away, then two clicks away, and so on.
• Finishing condition: absence of hosts with less than 100
• Low costs.
Spanish internet (.es) in 2012
• Domain names registered: 1.56M (39% growth per year)
• Web servers in the zone: 283.4K (33.1%)
• Hosts: 4.2M (21%)
• Spanish websites in the DMOZ catalog: 22,043
* Source: OECD Communications Outlook 2013 report
• Scrapy* - network operations.
• Apache Kafka - data bus (offsets, partitioning).
• Apache HBase - storage (random access, linear scanning).
• Twisted.Internet - library for async primitives for use in workers.
• Snappy - efficient compression algorithm for IO-bound workloads.
* - network operations in Scrapy are implemented asynchronously,
based on the same Twisted.Internet
1. Big and small hosts
• When the crawler encounters a huge number of links from a single host and a simple prioritization model is used, the queue ends up flooded with URLs from that host.
• That leaves the rest of the downloader capacity underused.
• We adopted an additional per-host (optionally per-IP) queue and a metering algorithm: URLs from big hosts are cached in memory.
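The per-host metering idea can be sketched as follows (a minimal illustration; the class name, the round-robin policy and the limits are my own, not Frontera's actual implementation):

```python
from collections import defaultdict, deque

class PerHostQueue:
    """Hold URLs in per-host memory caches so that one oversized host
    cannot flood a download batch; each host contributes at most
    `max_per_host` URLs per batch."""

    def __init__(self, max_per_host=10):
        self.max_per_host = max_per_host
        self.per_host = defaultdict(deque)  # in-memory cache, one deque per host

    def push(self, host, url):
        self.per_host[host].append(url)

    def next_batch(self, size):
        """Round-robin over hosts, taking at most max_per_host URLs each."""
        batch = []
        for host, urls in list(self.per_host.items()):
            take = min(self.max_per_host, len(urls), size - len(batch))
            batch.extend(urls.popleft() for _ in range(take))
            if not urls:
                del self.per_host[host]
            if len(batch) >= size:
                break
        return batch
```

With `max_per_host=2`, a host holding thousands of cached URLs still yields only two of them per batch, so small hosts keep getting scheduled.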
2. DDoS of the DNS service
• The breadth-first strategy implies visiting previously unknown hosts first, which generates a huge number of DNS queries.
• Solution: a recursive DNS server on each downloading node, with upstream set to Verizon and
• We used dnsmasq.
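A minimal dnsmasq configuration along these lines might look like the following (the upstream addresses are placeholders, not the resolvers used in the project):

```
# /etc/dnsmasq.conf - caching resolver local to each download node
listen-address=127.0.0.1   # serve only the local crawler
no-resolv                  # ignore /etc/resolv.conf; use the servers below
server=203.0.113.1         # placeholder upstream resolver
server=203.0.113.2         # placeholder upstream resolver
cache-size=10000           # cache answers for repeated lookups
```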
3. Tuning the Scrapy thread pool for efficient DNS resolution
• Scrapy uses a thread pool to resolve DNS names to IP addresses.
• When an IP is absent from the cache, the request is sent to the DNS server in its own thread, which is
• Scrapy reported numerous errors related to DNS name resolution and timeouts.
• We added options to Scrapy for the thread pool size and the DNS timeout.
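In current Scrapy these knobs are exposed as ordinary settings; a sketch of the relevant ones (the values here are illustrative, not the ones used in the talk):

```python
# settings.py - DNS-related Scrapy settings
REACTOR_THREADPOOL_MAXSIZE = 20   # threads available for DNS resolution
DNS_TIMEOUT = 60                  # seconds before a DNS lookup fails
DNSCACHE_ENABLED = True           # cache resolved IPs in-process
DNSCACHE_SIZE = 10000             # entries kept in the DNS cache
```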
4. Overloaded HBase region servers during the state check
• The crawler extracts hundreds of links per document on average.
• Before adding these links to the queue, they need to be checked against what was already crawled (to avoid repetitive crawling).
• On small volumes SSDs were just fine. After the table grew, we had to move to HDDs, and response times grew dramatically.
• Host-local fingerprint function for keys in HBase.
• Tuning the HBase block cache to fit average host states into one block.
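One way to build such a host-local key (an illustrative sketch, not Frontera's exact fingerprint function): prefix the row key with a hash of the hostname, so that all URLs of one host land in a contiguous HBase key range and tend to share blocks and regions.

```python
import hashlib

def host_local_fingerprint(url: str, hostname: str) -> str:
    """Row key = 8 hex chars of the hostname hash + 40 hex chars of the
    URL hash; same-host keys therefore sort next to each other."""
    host_part = hashlib.sha1(hostname.encode()).hexdigest()[:8]
    url_part = hashlib.sha1(url.encode()).hexdigest()
    return host_part + url_part
```

A batched state check for one host then becomes a scan over one small key range instead of random reads scattered across the table.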
5. Intensive network traffic from workers to services
• We observed throughput between workers, Kafka and HBase of up to 1 Gbit/s.
• Switched to the Thrift compact protocol for HBase communication.
• Message compression in Kafka using Snappy.
6. Further query and traffic optimizations to HBase
• The state check accounted for the lion's share of requests and traffic.
• Consistency was another concern.
• We created a local state cache in the strategy worker.
• For consistency, the spider log was partitioned by host, to avoid cache overlap.
• All operations are batched:
• if a key is absent from the cache, it is requested from HBase,
• every ~4K documents the cache is flushed to HBase,
• on reaching 3M elements (~1 GB), flush and clean up.
• A Least-Recently-Used (LRU) eviction policy seems a good fit here.
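The batched cache logic might look roughly like this (a sketch; the `storage` object stands in for the HBase client, and all names, thresholds and the simple clear-instead-of-LRU cleanup are illustrative):

```python
class StateCache:
    """Local state cache for a strategy worker: batched reads, periodic
    flushes of dirty entries, and a size-triggered cleanup."""

    def __init__(self, storage, flush_every=4096, max_size=3_000_000):
        self.storage = storage        # hypothetical client with fetch()/write()
        self.flush_every = flush_every
        self.max_size = max_size
        self.cache = {}               # fingerprint -> crawl state
        self.dirty = {}               # entries not yet written to storage

    def get_states(self, fingerprints):
        missing = [fp for fp in fingerprints if fp not in self.cache]
        if missing:                   # batch-fetch all absent keys at once
            self.cache.update(self.storage.fetch(missing))
        return {fp: self.cache.get(fp) for fp in fingerprints}

    def set_state(self, fp, state):
        self.cache[fp] = state
        self.dirty[fp] = state
        if len(self.dirty) >= self.flush_every:
            self.flush()              # ~every 4K documents
        if len(self.cache) >= self.max_size:
            self.flush()
            self.cache.clear()        # cleanup; an LRU policy would evict instead

    def flush(self):
        if self.dirty:
            self.storage.write(self.dirty)
            self.dirty = {}
```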
Spider priority queue (slot)
• A cell has an array of:
• Dequeueing the top N.
• Such a design is prone to flooding by a few huge hosts.
• This problem can be partially solved with a scoring model that takes the known document count per host into account.
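A hypothetical version of such a scoring tweak (not the model actually used): damp a page's priority by how many documents of its host are already known, so huge hosts stop monopolizing the top-N dequeue.

```python
import math

def adjusted_score(base_score: float, docs_seen_for_host: int) -> float:
    """Divide the base score by a slowly growing function of the number
    of documents already known for the page's host."""
    return base_score / math.log2(2 + docs_seen_for_host)
```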
7. The problem of big and small hosts (strikes back!)
• During crawling we found a few very large hosts (>20M docs).
• All queue partitions were flooded with pages from those few huge hosts, because of the queue design and the scoring model used.
• We wrote two MapReduce jobs:
• queue shuffling,
• limiting all hosts to no more than 100 documents.
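The host-limiting job can be sketched in a few lines (the real version ran as a MapReduce job over the HBase queue; this standalone sketch only shows the capping logic):

```python
from collections import Counter

def cap_per_host(queue, limit=100):
    """Keep at most `limit` queued documents per host, preserving order.
    `queue` is an iterable of (host, url) pairs."""
    counts = Counter()
    for host, url in queue:
        if counts[host] < limit:
            counts[host] += 1
            yield host, url
```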
• A single-threaded Scrapy spider gives 1200 pages/min. from about 100 websites in parallel.
• The spiders-to-workers ratio is 4:1 (without content).
• 1 GB of RAM for every SW (state cache, tunable).
• 12 spiders ~ 14.4K pages/min.,
• 3 SW and 3 DB workers,
• 18 cores in total.
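The sizing numbers above are simple arithmetic:

```python
# 12 single-threaded spiders at 1200 pages/min each:
spiders = 12
pages_per_spider_per_min = 1200
total_pages_per_min = spiders * pages_per_spider_per_min  # 14400, i.e. ~14.4K

# 12 spider cores + 3 strategy workers + 3 DB workers:
total_cores = spiders + 3 + 3  # 18
```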
Maintaining Cloudera Hadoop on AWS
• CDH is very sensitive to free space on the root partition, used for parcels and Cloudera Manager storage.
• We moved those to a separate EBS volume using symbolic links.
• The EBS volume should be at least 30 GB; base IOPS turned out to be enough.
• The initial hardware was 3 × m3.xlarge (4 CPU, 15 GB RAM, 2×40 GB SSD).
• After one week of crawling we ran out of space and started moving DataNodes to d2.xlarge (4 CPU, 30.5 GB RAM, 3×2 TB HDD).
Spanish (.es) internet crawl results
• fnac.es, rakuten.es and adidas.es are the biggest websites,
• 68.7K domains found (~600K
• 46.5M pages crawled overall,
• 1.5 months,
• 22 websites with more than
A. Broder et al. Graph structure in the Web. Computer Networks 33 (2000), 309–320.
Y. Hirate, S. Kato, H. Yamana. Web Structure in 2005.
R. Meusel, S. Vigna et al. Graph Structure in the Web — Revisited. WWW 2014.
• Online operation: scheduling of new batches, updating DB state.
• Storage abstraction: write your own backend (SQLAlchemy and HBase backends are included).
• Canonical URL resolution abstraction: each document has many URLs; which one should be used?
• Scrapy ecosystem: good documentation, big community, ease of customization.
• The communication layer is Apache Kafka: topic partitioning, offsets mechanism.
• Crawling strategy abstraction: the crawling goal, URL ordering and scoring model are coded in a separate module.
• Polite by design: each website is downloaded by at most one spider.
• Python: workers, spiders.
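As an illustration of the crawling strategy abstraction, the breadth-first strategy from this talk could be expressed roughly like this (method names are hypothetical, not Frontera's actual API):

```python
class BreadthFirstStrategy:
    """Sketch: the crawling goal, URL ordering and scoring model live
    together in one pluggable class."""

    def get_score(self, url: str, depth: int) -> float:
        # Breadth-first ordering: shallower documents score higher.
        return 1.0 / (depth + 1)

    def finished(self, docs_per_host: dict) -> bool:
        # Stop once no host has fewer than 100 crawled documents.
        return all(count >= 100 for count in docs_per_host.values())
```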
Distributed Frontera features
• A lighter version, without HBase and Kafka. Communicating
• Revisiting strategy out of the box.
• Watchdog solution: tracking website content changes.
• PageRank or HITS strategy.
• Own HTML and URL parsers.
• Integration into the Scrapinghub platform.
• Testing on larger volumes.
• Distributed Frontera is historically the first attempt to implement a web-scale web crawler in Python.
• A truly resource-intensive task: CPU, network, disks.
• Made at Scrapinghub, the company where Scrapy was created.
• It plans to become an Apache Software Foundation project.