Frontera: open source, large scale web
crawling framework
Alexander Sibiryakov, May 20, 2016, PyData Berlin 2016
sibiryakov@scrapinghub.com
About myself
• Software Engineer @ Scrapinghub
• Born in Yekaterinburg, RU
• 5 years at Yandex, search quality department: social and QA search, snippets.
• 2 years at Avast! antivirus, research team: automatic false positive solving, large scale prediction of malicious download attempts.
We help turn web content into useful data
• Over 2 billion requests per month (~800/sec.)
• Focused crawls & Broad crawls
{
  "content": [
    {
      "title": {
        "text": "'Extreme poverty' to fall below 10% of world population for first time",
        "href": "http://www.theguardian.com/society/2015/oct/05/world-bank-extreme-poverty-to-fall-below-10-of-world-population-for-first-time"
      },
      "points": "9 points",
      "time_ago": {
        "text": "2 hours ago",
        "href": "https://news.ycombinator.com/item?id=10352189"
      },
      "username": {
        "text": "hliyan",
        "href": "https://news.ycombinator.com/user?id=hliyan"
      }
    }
  ]
}
Broad crawl usages
Lead generation
(extracting contact
information)
News analysis
Topical crawling
Plagiarism detection
Sentiment analysis
(popularity, likability)
Due diligence (profile/
business data)
Track criminal activity & find lost persons (DARPA)
Saatchi Global Gallery Guide
www.globalgalleryguide.com
• Discover 11K online
galleries.
• Extract general
information, art samples,
descriptions.
• NLP-based extraction.
• Find more galleries on the
web.
Frontera recipes
• Multiple websites data collection automation
• «Grep» of the internet segment
• Topical crawling
• Extracting data from arbitrary documents
Multiple websites data collection
automation
• Scrapers from multiple websites.
• Data items collected and updated.
• Frontera can be used to
• crawl in parallel and scale the process,
• schedule revisiting (within fixed time),
• prioritize the URLs during crawling.
«Grep» of the internet segment
• alternative to Google,
• collect the zone files from registrars
(.com/.net/.org),
• set up Frontera in distributed mode,
• implement text processing in spider code,
• output items with matched pages.
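The text-processing step in the spider can be as small as a regular-expression match over the page body. A minimal sketch using only the standard library; `grep_page`, the item fields, and the sample pattern are hypothetical names for illustration and are not part of Frontera or Scrapy.

```python
import re


def grep_page(url, body, pattern):
    """Return an item describing the matches found in one page, or None."""
    matches = [m.group(0) for m in pattern.finditer(body)]
    if not matches:
        return None
    return {"url": url, "matches": matches}


# Hypothetical pattern: look for e-mail-like strings.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

# Stand-in for downloaded pages (URL -> page text).
pages = {
    "http://example.com/a": "Contact: info@example.com",
    "http://example.com/b": "Nothing to see here",
}

items = []
for url, body in pages.items():
    item = grep_page(url, body, EMAIL)
    if item is not None:
        items.append(item)  # only pages with a match produce an item
```

In a real spider the same function would be called from the response callback, and only matching pages would be emitted as items.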
Topical crawling
• document topic classifier & seed URL list,
• if a document is classified as positive, the crawler follows its extracted links,
• Frontera in distributed mode,
• topic classifier code put in the spider.
Extensions: link classifier, follow/final classifiers.
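The gating logic of topical crawling can be sketched in a few lines. The keyword counter below is a toy stand-in for a real topic classifier, and `is_on_topic`, `links_to_follow`, and `KEYWORDS` are hypothetical names chosen for this example.

```python
def is_on_topic(text, keywords, min_hits=2):
    """Toy stand-in for a document topic classifier: count keyword hits."""
    words = text.lower().split()
    hits = sum(1 for w in words if w.strip(".,;:!?") in keywords)
    return hits >= min_hits


def links_to_follow(text, extracted_links, keywords):
    """Follow extracted links only from documents classified as positive."""
    return list(extracted_links) if is_on_topic(text, keywords) else []


KEYWORDS = {"gallery", "exhibition", "painting"}

on_topic = links_to_follow(
    "A new painting exhibition opens at the gallery.",
    ["http://example.com/artists"],
    KEYWORDS,
)
off_topic = links_to_follow(
    "Quarterly earnings rose.",
    ["http://example.com/ir"],
    KEYWORDS,
)
```

A link classifier, as mentioned in the extensions, would apply a second filter to each extracted link instead of accepting all of them.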
Extracting data from arbitrary documents
• Tough problem. Can’t be solved completely.
• Can be seen as a structured prediction problem:
• Conditional Random Fields (CRF) or
• Hidden Markov Models (HMM).
• Tagged sequence of tokens and HTML tags can be
used to predict the data fields boundaries.
• Webstruct and WebAnnotator Firefox extension.
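Once a sequence model (CRF or HMM) has assigned a tag to every token, the field boundaries are recovered by grouping consecutive tags. The sketch below shows only that decoding step. It assumes an IOB tagging scheme (`B-`, `I-`, `O`), which the slides do not specify, and the tags are written by hand rather than predicted by a trained model.

```python
def decode_fields(tokens, tags):
    """Group tokens into (field, text) spans from IOB tags such as B-ORG, I-ORG, O."""
    fields = []
    current_name, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_name
        )
        if starts_new:
            if current_tokens:
                fields.append((current_name, " ".join(current_tokens)))
            current_name, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)  # continue the open field
        else:  # "O": token is outside any field, close the open one
            if current_tokens:
                fields.append((current_name, " ".join(current_tokens)))
            current_name, current_tokens = None, []
    if current_tokens:
        fields.append((current_name, " ".join(current_tokens)))
    return fields


# HTML tags are part of the token sequence, as on the slide.
tokens = ["<p>", "Saatchi", "Gallery", ",", "London", "</p>"]
tags = ["O", "B-ORG", "I-ORG", "O", "B-CITY", "O"]
fields = decode_fields(tokens, tags)
```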
Task
• Spanish web: hosts and
their sizes statistics.
• Only .es ccTLD.
• Breadth-first strategy:
• first the 1-click neighborhood (pages one click away from the seeds),
• then 2,
• 3,
• …
• Finishing condition: at most 100 docs per host, for all hosts.
• Low costs.
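The task definition maps onto a plain breadth-first traversal with a per-host quota. A minimal single-process sketch over an in-memory link graph; `breadth_first_crawl`, `get_links`, and the toy `GRAPH` are hypothetical, and a real crawl would fetch pages instead of reading a dict.

```python
from collections import Counter, deque
from urllib.parse import urlparse


def breadth_first_crawl(seeds, get_links, max_docs_per_host=100, suffix=".es"):
    """Visit URLs level by level, at most max_docs_per_host pages per host."""
    queue = deque(seeds)
    seen = set(seeds)
    docs_per_host = Counter()
    visited = []
    while queue:
        url = queue.popleft()
        host = urlparse(url).hostname or ""
        if not host.endswith(suffix):
            continue  # outside the target ccTLD
        if docs_per_host[host] >= max_docs_per_host:
            continue  # host quota reached
        docs_per_host[host] += 1
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited, docs_per_host


# Toy link graph standing in for the web.
GRAPH = {
    "http://a.es/": ["http://a.es/1", "http://a.es/2", "http://b.es/", "http://c.com/"],
    "http://a.es/1": ["http://a.es/3"],
    "http://b.es/": [],
}

visited, counts = breadth_first_crawl(
    ["http://a.es/"], lambda u: GRAPH.get(u, []), max_docs_per_host=2
)
```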
Spanish, Russian, German and world Web in 2012

                          Domains   Web servers   Hosts    DMOZ*
Spanish (.es)             1.5M      280K          4.2M     122K
Russian (.ru, .рф, .su)   4.8M      2.6M          ?        105K
German (.de)              15.0M     3.7M          20.4M    466K
World                     233M      62M           890M     3.9M

Sources: OECD Communications Outlook 2013, statdom.ru
* current period (October 2015)
Solution
• Scrapy (based on Twisted) - network operations.
• Apache Kafka - data bus (offsets, partitioning).
• Apache HBase - storage (random access, linear
scanning, scalability).
• twisted.internet - library of async primitives for use in workers.
• Snappy - efficient compression algorithm for I/O-bound applications.
Architecture
[Figure: architecture diagram; labels: Kafka topic, SW, crawling strategy workers, storage workers, DB]
1. Big and small hosts
problem
• Queue is flooded with
URLs from the same
host.
• → underuse of spider
resources.
• additional per-host
(per-IP) queue and
metering algorithm.
• URLs from big hosts
are cached in memory.
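One way to picture the per-host queue with metering: keep a separate FIFO per host and cap how many URLs any single host may contribute to a batch. This is a simplified single-process sketch under that assumption, not Frontera's implementation; `PerHostQueue` and its parameters are hypothetical.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse


class PerHostQueue:
    """Keep one FIFO per host and cap each host's share of a batch."""

    def __init__(self, max_per_host_in_batch=2):
        self.max_per_host = max_per_host_in_batch
        self.queues = defaultdict(deque)

    def push(self, url):
        self.queues[urlparse(url).hostname].append(url)

    def next_batch(self, batch_size):
        batch = []
        for host in list(self.queues):
            queue = self.queues[host]
            taken = 0
            while queue and taken < self.max_per_host and len(batch) < batch_size:
                batch.append(queue.popleft())
                taken += 1
            if not queue:
                del self.queues[host]  # drop exhausted hosts
            if len(batch) >= batch_size:
                break
        return batch


q = PerHostQueue(max_per_host_in_batch=2)
for i in range(5):
    q.push("http://big.es/%d" % i)  # one big host floods the queue
q.push("http://small1.es/")
q.push("http://small2.es/")

batch = q.next_batch(4)
```

Without the cap, the first batch of four would contain only big.es URLs and the spiders would spend their time waiting on one host.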
2. DDoS of the Amazon AWS DNS service
Breadth-first strategy →
unknown hosts are visited first →
a huge number of DNS requests is generated.
Recursive DNS server
• on every spider node,
• upstream to Verizon &
OpenDNS.
We used dnsmasq.
3. Tuning the Scrapy thread pool
for efficient DNS resolution
• OS DNS resolver,
• blocking calls,
• thread pool to resolve
DNS name to IP.
• numerous errors and
timeouts 🆘
• A patch for thread
pool size and
timeout adjustment.
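The patch itself is not shown in the slides. In current Scrapy releases, comparable knobs are exposed as project settings; the fragment below is a hedged example of a `settings.py`, and the values are illustrative assumptions, not the ones used in this crawl.

```python
# settings.py fragment (illustrative values, not taken from the talk)

# Size of the Twisted reactor thread pool, which Scrapy uses for
# blocking DNS lookups through the OS resolver.
REACTOR_THREADPOOL_MAXSIZE = 20

# Seconds to wait for a DNS query before giving up.
DNS_TIMEOUT = 10

# Keep resolved names in Scrapy's in-memory DNS cache.
DNSCACHE_ENABLED = True
```

A broad crawl that touches many unknown hosts usually benefits from a larger pool and a shorter timeout than the defaults, but the right numbers depend on the resolver setup.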
4. Overloaded HBase region
servers during state check
• 10^3 links per doc,
• state check: CRAWLED/NOT
CRAWLED/ERROR,
• HDDs.
• Small volume 🆗
• With ⬆table size, response
times ⬆
• Disk queue ⬆
• Host-local fingerprint
function for keys in HBase.
• Tuning HBase block cache to
fit average host states into
one block.
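The idea behind a host-local fingerprint is that the row key starts with a value derived from the host only, so all URLs of one host sort next to each other and a state check for one document's links touches few blocks. A minimal sketch, assuming a CRC32 host prefix followed by a SHA-1 of the full URL; the exact key layout used in the crawl is not given in the slides, and `host_local_fingerprint` here is a hypothetical helper.

```python
import hashlib
import zlib
from urllib.parse import urlparse


def host_local_fingerprint(url):
    """Build a key whose prefix depends only on the host."""
    host = urlparse(url).hostname or ""
    # 8 hex chars from the host, shared by every URL of that host.
    host_part = "%08x" % (zlib.crc32(host.encode("utf-8")) & 0xFFFFFFFF)
    # 40 hex chars identifying the individual URL.
    url_part = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return host_part + url_part


k1 = host_local_fingerprint("http://a.es/page1")
k2 = host_local_fingerprint("http://a.es/page2")
k3 = host_local_fingerprint("http://b.es/page1")
```

Sorted lexicographically, `k1` and `k2` end up adjacent because they share the same 8-character prefix, while `k3` lands elsewhere in the key space.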
3Tb of metadata.
URLs, timestamps,…
275 b/doc
5. Intensive network traffic
from workers to services
• Throughput between workers and Kafka/HBase ~ 1 Gbit/s.
• Thrift compact
protocol for HBase
• Message
compression in
Kafka with Snappy
6. Further query and traffic
optimizations to HBase
• State check: lots of
reqs and network
• Consistency
• Local state cache
in strategy worker.
• For consistency,
spider log was
partitioned by
host.
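Partitioning by host means every message about a given host lands in the same partition, so exactly one strategy worker (and its local cache) owns that host's state. A small sketch of such a partitioner, assuming CRC32 of the hostname modulo the partition count; `partition_for` is a hypothetical helper, not a Frontera or Kafka API.

```python
import zlib
from urllib.parse import urlparse


def partition_for(url, num_partitions):
    """Map a URL to a partition using only its hostname."""
    host = urlparse(url).hostname or ""
    return (zlib.crc32(host.encode("utf-8")) & 0xFFFFFFFF) % num_partitions


p1 = partition_for("http://a.es/page1", 4)
p2 = partition_for("http://a.es/other/page", 4)
```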
State cache
• All ops are batched:
– If no key in cache→
read HBase
– every ~4K docs →
flush
• Close to 3M (~1Gb)
elms → flush & cleanup
• Least-Recently-Used
(LRU) 👍
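The cache behaviour on this slide can be sketched with an `OrderedDict`: misses are read from the backing store, writes are flushed in batches, and the least recently used entries are dropped when the cache grows too large. The class below is a simplified illustration with tiny limits; `StateCache`, its parameters, and the dict standing in for HBase are assumptions made for this example.

```python
from collections import OrderedDict


class StateCache:
    """LRU cache of document states with batched write-back."""

    def __init__(self, backend, max_size=3, flush_every=2):
        self.backend = backend          # dict-like stand-in for the HBase table
        self.max_size = max_size        # the slide mentions about 3M elements
        self.flush_every = flush_every  # the slide mentions about 4K docs
        self.cache = OrderedDict()
        self.dirty = set()
        self.updates = 0

    def get(self, key):
        if key not in self.cache:
            # Cache miss: read the state from the backend.
            self.cache[key] = self.backend.get(key, "NOT CRAWLED")
        self.cache.move_to_end(key)
        value = self.cache[key]
        self._evict()
        return value

    def set(self, key, state):
        self.cache[key] = state
        self.cache.move_to_end(key)
        self.dirty.add(key)
        self.updates += 1
        if self.updates % self.flush_every == 0:
            self.flush()
        self._evict()

    def flush(self):
        for key in self.dirty:
            self.backend[key] = self.cache[key]
        self.dirty.clear()

    def _evict(self):
        while len(self.cache) > self.max_size:
            key, state = self.cache.popitem(last=False)  # least recently used
            if key in self.dirty:
                self.backend[key] = state  # never lose an unflushed update
                self.dirty.discard(key)


backend = {"a": "CRAWLED"}
cache = StateCache(backend, max_size=2, flush_every=2)
state_a = cache.get("a")             # miss, read from backend
cache.set("b", "CRAWLED")            # buffered, not yet written
b_written_early = "b" in backend
cache.set("c", "ERROR")              # second update triggers a flush
```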
Spider priority queue (slot)
• Cell:
Array of:
- fingerprint,
- Crc32(hostname),
- URL,
- score
• Dequeueing top N.
• Prone to huge hosts
• Scoring model: document
count per host.
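The cell layout and the "dequeue top N" operation can be illustrated with a named tuple and `heapq`. This is an in-memory sketch only; the real queue lives in HBase, the scores below are made-up numbers, and `QueueItem` and `dequeue_top_n` are hypothetical names.

```python
import heapq
import zlib
from collections import namedtuple

# One queue entry, mirroring the four fields listed on the slide.
QueueItem = namedtuple("QueueItem", "fingerprint host_crc32 url score")


def host_crc32(hostname):
    return zlib.crc32(hostname.encode("utf-8")) & 0xFFFFFFFF


def dequeue_top_n(cell, n):
    """Return the n highest-scoring items and the remaining cell contents."""
    top = heapq.nlargest(n, cell, key=lambda item: item.score)
    top_fps = {item.fingerprint for item in top}
    rest = [item for item in cell if item.fingerprint not in top_fps]
    return top, rest


cell = [
    QueueItem("fp1", host_crc32("a.es"), "http://a.es/1", 0.9),
    QueueItem("fp2", host_crc32("a.es"), "http://a.es/2", 0.1),
    QueueItem("fp3", host_crc32("b.es"), "http://b.es/", 0.5),
]
top, rest = dequeue_top_n(cell, 2)
```

Because selection looks only at the score, a host with many high-scoring URLs can fill the whole batch, which is the weakness the slide points out.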
7. Problem of big and small
hosts (strikes back!)
• Discovered a few very
large hosts (>20M
docs)
• All queue partitions
were flooded with huge
hosts,
• Two MapReduce jobs:
– queue shuffling,
– limit all hosts to
100 docs MAX.
Spanish (.es) internet crawl results
• fnac.es, rakuten.es, adidas.es,
equiposdefutbol2014.es,
druni.es,
docentesconeducacion.es -
are the biggest websites
• 68.7K domains found (~600K
expected),
• 46.5M crawled pages overall,
• 1.5 months,
• 22 websites with more than
50M pages
Where are the rest of
the web servers?!
Bow-tie model
A. Broder et al. / Computer Networks 33 (2000) 309-320
Y. Hirate, S. Kato, and H. Yamana, Web Structure in 2005
12 years dynamics
Graph Structure in the Web — Revisited, Meusel, Vigna, WWW 2014
Hardware requirements
(distributed backend + spiders)
• A single-threaded Scrapy spider gives 1200 pages/min.
from about 100 websites in parallel.
• Spiders to workers ratio is 4:1 (without content).
• 1 Gb of RAM for every SW (state cache, tunable).
• Example:
• 12 spiders ~ 14.4K pages/min.,
• 3 SW and 3 DB workers,
• Total 18 cores.
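The sizing example can be reproduced with a few lines of arithmetic. The sketch assumes that the 4:1 ratio applies to each worker type separately and that every process occupies one core; both assumptions are inferred from the example figures rather than stated in the slides.

```python
pages_per_spider_per_min = 1200
spiders = 12
spiders_per_worker = 4  # 4:1 ratio, assumed to hold per worker type

total_pages_per_min = spiders * pages_per_spider_per_min
strategy_workers = spiders // spiders_per_worker
db_workers = spiders // spiders_per_worker

# Assumption: one core per spider or worker process.
cores = spiders + strategy_workers + db_workers
```

Here `total_pages_per_min` is 14400, `strategy_workers` and `db_workers` are 3 each, and `cores` is 18, matching the slide.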
Software requirements

Single process          Distributed spiders       Distributed backend
Python 2.7+, Scrapy 1.0.4+ (all modes)
sqlite or any other RDBMS                         HBase/RDBMS
-                       ZeroMQ or Kafka
-                       -                         DNS Service
Main features
• Online operation: scheduling of new batches,
updating of DB state.
• Storage abstraction: write your own backend
(SQLAlchemy and HBase backends are included).
• Run modes: single process, distributed spiders,
distributed backend.
• Scrapy ecosystem: good documentation, big
community, ease of customization.
Main features
• Message bus abstraction (ZeroMQ and Kafka are
available out of the box).
• Crawling strategy abstraction: crawling goal, URL
ordering, and scoring model are coded in a separate module.
• Polite by design: each website is downloaded by at most
one spider.
• Canonical URL resolution abstraction: each document
has many URLs; which one to use?
• Python: workers, spiders.
References
GitHub: https://github.com/scrapinghub/frontera
RTD: http://frontera.readthedocs.org/
Google groups: Frontera (https://goo.gl/ak9546)
Future plans
• Python 3 support,
• Docker images,
• Web UI,
• Watchdog solution: tracking
website content changes.
• PageRank or HITS strategy.
• Own HTML and URL parsers.
• Integration into Scrapinghub
services.
• Testing on larger volumes.
Run your business using Frontera
• SCALABLE
• OPEN
• CUSTOMIZABLE
Made in Scrapinghub
(authors of Scrapy)
Contribute!
• Web scale crawler,
• Historically first
attempt in Python,
• Truly resource-
intensive task: CPU,
network, disks.
We’re hiring!
http://scrapinghub.com/jobs/
Mandatory sales slide
Crawl the web, at scale
• cloud-based platform
• smart proxy rotator
Get data, hassle-free
• off-the-shelf datasets
• turn-key web scraping
try.scrapinghub.com/PDB16
Questions?
Thank you!
Alexander Sibiryakov,
sibiryakov@scrapinghub.com

➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 

Alexander Sibiryakov - Frontera

  • 1. Frontera: open source, large scale web crawling framework Alexander Sibiryakov, May 20, 2016, PyData Berlin 2016 sibiryakov@scrapinghub.com
  • 2. About myself • Software Engineer @ Scrapinghub • Born in Yekaterinburg, RU • 5 years at Yandex, search quality department: social and QA search, snippets. • 2 years at Avast! antivirus, research team: automatic false positive solving, large scale prediction of malicious download attempts.
  • 3. We help turn web content into useful data • Over 2 billion requests per month (~800/sec.) • Focused crawls & Broad crawls { "content": [ { "title": { "text": "'Extreme poverty' to fall below 10% of world population for first time", "href": "http://www.theguardian.com/society/2015/oct/05/world-bank-extreme-poverty-to-fall-below-10-of-world-population-for-first-time" }, "points": "9 points", "time_ago": { "text": "2 hours ago", "href": "https://news.ycombinator.com/item?id=10352189" }, "username": { "text": "hliyan", "href": "https://news.ycombinator.com/user?id=hliyan" } },
  • 4. Broad crawl usages • Lead generation (extracting contact information) • News analysis • Topical crawling • Plagiarism detection • Sentiment analysis (popularity, likability) • Due diligence (profile/business data) • Track criminal activity & find lost persons (DARPA)
  • 5. Saatchi Global Gallery Guide www.globalgalleryguide.com • Discover 11K online galleries. • Extract general information, art samples, descriptions. • NLP-based extraction. • Find more galleries on the web.
  • 6. Frontera recipes • Multiple websites data collection automation • «Grep» of the internet segment • Topical crawling • Extracting data from arbitrary document
  • 7. Multiple websites data collection automation • Scrapers from multiple websites. • Data items collected and updated. • Frontera can be used to • crawl in parallel and scale the process, • schedule revisiting (within fixed time), • prioritize the URLs during crawling.
  • 8. «Grep» of the internet segment • an alternative to Google, • collect the zone files from registrars (.com/.net/.org), • set up Frontera in distributed mode, • implement text processing in spider code, • output items with matched pages.
  • 9. Topical crawling • document topic classifier & seed URL list, • if a document is classified as positive, the crawler follows its extracted links, • Frontera in distributed mode, • topic classifier code put in the spider. Extensions: link classifier, follow/final classifiers.
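The gating rule on the topical crawling slide can be sketched in a few lines. This is only an illustration: `links_to_follow` and `keyword_classifier` are hypothetical names, and the keyword rule stands in for a real trained topic classifier.

```python
from typing import Callable, Iterable, List


def keyword_classifier(text: str) -> bool:
    """Toy stand-in for a trained topic classifier (assumption: keyword rule)."""
    return "gallery" in text.lower()


def links_to_follow(
    text: str,
    extracted_links: Iterable[str],
    is_on_topic: Callable[[str], bool],
) -> List[str]:
    """Return the extracted links only if the document is classified as positive."""
    return list(extracted_links) if is_on_topic(text) else []
```

In a spider, the returned list would be the set of links turned into new requests; off-topic documents contribute nothing to the frontier.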
  • 10. Extracting data from arbitrary document • Tough problem. Can’t be solved completely. • Can be seen as a structured prediction problem: • Conditional Random Fields (CRF) or • Hidden Markov Models (HMM). • Tagged sequence of tokens and HTML tags can be used to predict the data fields boundaries. • Webstruct and WebAnnotator Firefox extension.
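To make the "tagged sequence" idea concrete, the sketch below shows how field boundaries could be recovered once a sequence model (CRF or HMM) has labelled each token. It assumes a BIO tagging scheme (`B-NAME`, `I-NAME`, `O`), which the slides do not specify, and `decode_fields` is a hypothetical helper, not part of Webstruct.

```python
from typing import List, Optional, Tuple


def decode_fields(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group tokens into (field, text) pairs from BIO tags (assumed scheme)."""
    fields: List[Tuple[str, List[str]]] = []
    current: Optional[Tuple[str, List[str]]] = None  # field currently being built
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new field.
            current = (tag[2:], [token])
            fields.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            # An I- tag continues the open field of the same type.
            current[1].append(token)
        else:
            # "O" (or an inconsistent I- tag) closes the open field.
            current = None
    return [(name, " ".join(parts)) for name, parts in fields]
```

The model's job is to predict `tags`; the decoding step above is what turns those predictions into extracted data fields.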
  • 11. Task • Spanish web: statistics on hosts and their sizes. • Only the .es ccTLD. • Breadth-first strategy: first the 1-click neighborhood, then 2, 3, … • Finishing condition: at most 100 docs per host, all hosts. • Low costs.
  • 12. Spanish, Russian, German and world Web in 2012
                                Domains   Web servers   Hosts   DMOZ*
      Spanish (.es)             1.5M      280K          4.2M    122K
      Russian (.ru, .рф, .su)   4.8M      2.6M          ?       105K
      German (.de)              15.0M     3.7M          20.4M   466K
      World                     233M      62M           890M    3.9M
      Sources: OECD Communications Outlook 2013, statdom.ru
      * - current period (October 2015)
  • 13. Solution • Scrapy (based on Twisted) - network operations. • Apache Kafka - data bus (offsets, partitioning). • Apache HBase - storage (random access, linear scanning, scalability). • Twisted.Internet - library of async primitives for use in workers. • Snappy - efficient compression algorithm for I/O-bound applications.
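The task definition maps onto a small breadth-first loop. The sketch below is a single-process illustration under stated assumptions: `get_links` is a hypothetical callable standing in for fetching a page and extracting its links, and the real crawl was distributed rather than run from one queue.

```python
from collections import Counter, deque
from urllib.parse import urlparse


def breadth_first_crawl(seeds, get_links, max_per_host=100):
    """Visit URLs level by level, restricted to .es hosts and capped per host."""
    per_host = Counter()   # host -> number of documents visited
    seen = set(seeds)      # URLs already scheduled
    queue = deque(seeds)   # FIFO queue gives breadth-first order
    order = []
    while queue:
        url = queue.popleft()
        host = urlparse(url).hostname or ""
        # Skip anything outside the .es ccTLD and hosts that reached the cap.
        if not host.endswith(".es") or per_host[host] >= max_per_host:
            continue
        per_host[host] += 1
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order, per_host
```

The loop ends on its own once every reachable host is either exhausted or at the cap, which is the finishing condition stated on the slide.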
  • 15. 1. Big and small hosts problem • The queue is flooded with URLs from the same host, → underuse of spider resources. • Solution: an additional per-host (per-IP) queue and a metering algorithm. • URLs from big hosts are cached in memory.
  • 16. 2. DDoS of the Amazon AWS DNS service • Breadth-first strategy → first visits to unknown hosts → a huge amount of DNS requests. • Solution: a recursive DNS server on every spider node, upstream to Verizon & OpenDNS. • We used dnsmasq.
  • 17. 3. Tuning the Scrapy thread pool for efficient DNS resolution • OS DNS resolver, • blocking calls, • thread pool to resolve DNS names to IPs, • numerous errors and timeouts 🆘 • A patch for thread pool size and timeout adjustment.
  • 18. 4. Overloaded HBase region servers during state check • 10^3 links per doc, • state check: CRAWLED/NOT CRAWLED/ERROR, • HDDs. • Small volume 🆗 • With ⬆ table size, response times ⬆ • Disk queue ⬆ • Host-local fingerprint function for keys in HBase. • Tuning HBase block cache to fit average host states into one block. • 3 TB of metadata (URLs, timestamps, …), 275 b/doc.
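One way to picture the per-host queue with metering is a set of small FIFO queues, one per host, from which each batch takes only a bounded number of URLs. This is a minimal sketch, not the code used in the crawl; `PerHostQueue` and `max_per_host` are hypothetical names.

```python
from collections import OrderedDict, deque
from urllib.parse import urlparse


class PerHostQueue:
    """Keeps one FIFO queue per host and meters how many URLs each host gets."""

    def __init__(self):
        self._queues = OrderedDict()  # host -> deque of pending URLs

    def push(self, url):
        host = urlparse(url).hostname or ""
        self._queues.setdefault(host, deque()).append(url)

    def next_batch(self, size, max_per_host=1):
        """Return up to `size` URLs, taking at most `max_per_host` from each host."""
        batch = []
        for host in list(self._queues):
            queue = self._queues[host]
            take = min(max_per_host, len(queue), size - len(batch))
            batch.extend(queue.popleft() for _ in range(take))
            if not queue:
                del self._queues[host]  # drop exhausted hosts
            if len(batch) >= size:
                break
        return batch
```

With this shape a host holding millions of URLs can no longer fill a whole batch, so small hosts still get fetched while the big one drains slowly.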
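The idea behind a host-local fingerprint is that row keys for URLs of the same host share a prefix, so their states sit next to each other in HBase and a state check for one document's links touches few blocks. The sketch below illustrates that property only; the choice of CRC32 for the prefix and SHA-1 for the document part is an assumption, and `host_local_fingerprint` is a hypothetical name rather than Frontera's own function.

```python
import hashlib
import zlib
from urllib.parse import urlparse


def host_local_fingerprint(url: str) -> str:
    """Build a key whose first 8 hex chars depend only on the hostname."""
    host = urlparse(url).hostname or ""
    # Host part: identical for every URL of the same host, so keys sort together.
    host_prefix = format(zlib.crc32(host.encode("utf-8")), "08x")
    # Document part: distinguishes URLs within the host.
    doc_part = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return host_prefix + doc_part
```

Because HBase stores rows sorted by key, the shared prefix turns many scattered random reads into reads from a small neighbouring key range.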
  • 19. 5. Intensive network traffic from workers to services • Throughput between workers and Kafka/HBase ~ 1 Gbit/s. • Thrift compact protocol for HBase • Message compression in Kafka with Snappy
  • 20. 6. Further query and traffic optimizations to HBase • State check: lots of requests and network traffic • Consistency • Local state cache in the strategy worker. • For consistency, the spider log was partitioned by host.
  • 21. State cache • All ops are batched: – if no key in cache → read HBase, – every ~4K docs → flush. • Close to 3M (~1 GB) elements → flush & cleanup • Least-Recently-Used (LRU) 👍
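Partitioning the spider log by host means every message about a given host lands in the same partition, so a single strategy worker owns that host's state and its local cache stays consistent. A minimal sketch, assuming a CRC32-modulo scheme (the slides do not name the hash) and a hypothetical `partition_for` helper:

```python
import zlib
from urllib.parse import urlparse


def partition_for(url: str, num_partitions: int) -> int:
    """Map a URL to a partition using only its hostname."""
    host = urlparse(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % num_partitions
```

Two URLs from the same host always yield the same partition number, regardless of path or query string.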
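The behaviour listed on the state cache slide (read through on a miss, periodic batched flush, LRU cleanup near the size limit) can be sketched as follows. A plain dict stands in for the HBase table, and `StateCache`, `max_size` and `flush_every` are hypothetical names chosen for the illustration.

```python
from collections import OrderedDict


class StateCache:
    """Write-back LRU cache of document states in front of a slower store."""

    def __init__(self, store, max_size=3_000_000, flush_every=4096):
        self._store = store           # dict standing in for the HBase table
        self._cache = OrderedDict()   # fingerprint -> state, least recent first
        self._dirty = set()           # fingerprints changed since the last flush
        self._max_size = max_size
        self._flush_every = flush_every
        self._updates = 0

    def get(self, fingerprint):
        if fingerprint not in self._cache:
            # Cache miss: read through to the backing store.
            self._cache[fingerprint] = self._store.get(fingerprint, "NOT CRAWLED")
        self._cache.move_to_end(fingerprint)
        state = self._cache[fingerprint]
        if len(self._cache) > self._max_size:
            self.flush()
        return state

    def set(self, fingerprint, state):
        self._cache[fingerprint] = state
        self._cache.move_to_end(fingerprint)
        self._dirty.add(fingerprint)
        self._updates += 1
        if self._updates % self._flush_every == 0 or len(self._cache) > self._max_size:
            self.flush()

    def flush(self):
        # Write all pending changes in one batch, then evict least recently used keys.
        for fingerprint in self._dirty:
            self._store[fingerprint] = self._cache[fingerprint]
        self._dirty.clear()
        while len(self._cache) > self._max_size:
            self._cache.popitem(last=False)
```

Flushing before eviction is what keeps the cleanup safe: an entry is only dropped from memory after its latest state has reached the store.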
  • 22. Spider priority queue (slot) • Cell: array of: fingerprint, Crc32(hostname), URL, score • Dequeueing top N. • Prone to huge hosts • Scoring model: document count per host.
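A small sketch shows why plain top-N dequeueing is prone to huge hosts and how a per-host limit changes the result. The record mirrors the cell fields from the slide; `QueueItem`, `dequeue_top_n` and the `max_per_host` option are hypothetical and not taken from Frontera.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Optional


class QueueItem(NamedTuple):
    fingerprint: str
    host_crc32: int
    url: str
    score: float


def dequeue_top_n(
    items: Iterable[QueueItem], n: int, max_per_host: Optional[int] = None
) -> List[QueueItem]:
    """Pick the n highest-scored items, optionally capping items per host."""
    picked: List[QueueItem] = []
    per_host = Counter()
    for item in sorted(items, key=lambda i: i.score, reverse=True):
        if len(picked) == n:
            break
        if max_per_host is not None and per_host[item.host_crc32] >= max_per_host:
            continue  # this host already has its share of the batch
        per_host[item.host_crc32] += 1
        picked.append(item)
    return picked
```

Without the cap, a host whose URLs all score highly takes every slot in the batch; with it, lower-scored URLs from other hosts get through.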
  • 23. 7. Problem of big and small hosts (strikes back!) • Discovered a few huge hosts (>20M docs). • All queue partitions were flooded with huge hosts. • Two MapReduce jobs: – queue shuffling, – limit all hosts to 100 docs MAX.
  • 24. Spanish (.es) internet crawl results • Biggest websites: fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es, docentesconeducacion.es, • 68.7K domains found (~600K expected), • 46.5M crawled pages overall, • 1.5 months, • 22 websites with more than 50M pages
  • 25. Where are the rest of the web servers?!
  • 26. Bow-tie model. A. Broder et al. / Computer Networks 33 (2000) 309-320
  • 27. Y. Hirate, S. Kato, and H. Yamana, Web Structure in 2005
  • 28. 12-year dynamics. Graph Structure in the Web — Revisited, Meusel, Vigna, WWW 2014
  • 29. Hardware requirements (distributed backend + spiders) • A single-thread Scrapy spider gives 1200 pages/min. from about 100 websites in parallel. • Spiders to workers ratio is 4:1 (without content). • 1 GB of RAM for every SW (state cache, tunable). • Example: • 12 spiders ~ 14.4K pages/min., • 3 SW and 3 DB workers, • Total 18 cores.
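The example figures on the hardware slide follow from simple arithmetic, reproduced below. Two assumptions go beyond the slide: one core per process, and as many DB workers as strategy workers (SW), which is taken from the single example given. `estimate_cluster` is a hypothetical helper.

```python
import math


def estimate_cluster(spiders, pages_per_spider_min=1200, spiders_per_sw=4, ram_per_sw_gb=1):
    """Rough sizing from the slide's ratios (assumes one core per process)."""
    strategy_workers = math.ceil(spiders / spiders_per_sw)
    db_workers = strategy_workers  # assumption: same count, as in the slide's example
    return {
        "pages_per_min": spiders * pages_per_spider_min,
        "strategy_workers": strategy_workers,
        "db_workers": db_workers,
        "cores": spiders + strategy_workers + db_workers,
        "sw_ram_gb": strategy_workers * ram_per_sw_gb,
    }
```

For 12 spiders this gives 14,400 pages/min., 3 SW, 3 DB workers and 18 cores, matching the example.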
  • 30. Software requirements
      Single process              Distributed spiders         Distributed backend
      Python 2.7+, Scrapy 1.0.4+ (all modes)
      sqlite or any other RDBMS   sqlite or any other RDBMS   HBase/RDBMS
      -                           ZeroMQ or Kafka             ZeroMQ or Kafka
      -                           -                           DNS Service
  • 31. Main features • Online operation: scheduling of new batch, updating of DB state. • Storage abstraction: write your own backend (sqlalchemy, HBase is included). • Run modes: single process, distributed spiders, distributed backend. • Scrapy ecosystem: good documentation, big community, ease of customization.
  • 32. Main features • Message bus abstraction (ZeroMQ and Kafka are available out of the box). • Crawling strategy abstraction: crawling goal, URL ordering, scoring model are coded in a separate module. • Polite by design: each website is downloaded by at most one spider. • Canonical URLs resolution abstraction: each document has many URLs, which to use? • Python: workers, spiders.
  • 34. Future plans • Python 3 support, • Docker images, • Web UI, • Watchdog solution: tracking website content changes. • PageRank or HITS strategy. • Own HTML and URL parsers. • Integration into Scrapinghub services. • Testing on larger volumes.
  • 35. Run your business using Frontera  SCALABLE  OPEN  CUSTOMIZABLE Made in Scrapinghub (authors of Scrapy)
  • 36. Contribute! • Web scale crawler, • Historically first attempt in Python, • Truly resource-intensive task: CPU, network, disks.
  • 38. Mandatory sales slide • Crawl the web, at scale: cloud-based platform, smart proxy rotator • Get data, hassle-free: off-the-shelf datasets, turn-key web scraping • try.scrapinghub.com/PDB16