An introduction to Storm Crawler
Julien Nioche
julien@digitalpebble.com
@digitalpebble
Java Meetup – Bristol 10/03/2015
2 / 31
About myself
➢ DigitalPebble Ltd, Bristol (UK)
➢ Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Machine Learning
➢ Strong focus on Open Source & Apache ecosystem
➢ User | Contributor | Committer
– Nutch
– Tika
– SOLR, Lucene
– GATE, UIMA
– Behemoth
3 / 31
Storm-Crawler : what is it?
➢ Collection of resources (SDK) for building web crawlers on Apache Storm
➢ https://github.com/DigitalPebble/storm-crawler
➢ Artefacts available from Maven Central
➢ Apache License v2
➢ Active and growing fast
=> Scalable
=> Low latency
=> Easily extensible
=> Polite
4 / 31
What it is not
➢ A ready-to-use, feature-complete web crawler
– Will provide that as a separate project using S/C later
➢ e.g. No PageRank or explicit ranking of pages
– Build your own
➢ No fancy UI, dashboards, etc...
– Build your own or integrate in your existing ones
➢ Compare with Apache Nutch later
5 / 31
Apache Storm
➢ Distributed real time computation system
➢ http://storm.apache.org/
➢ Scalable, fault-tolerant, polyglot and fun!
➢ Implemented in Clojure + Java
➢ “Realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more”
6 / 31
Storm Cluster Layout
http://cdn.oreillystatic.com/en/assets/1/event/85/Realtime%20Processing%20with%20Storm%20Presentation.pdf
7 / 31
Topology = spouts + bolts + tuples + streams
8 / 31
Basic Crawl Topology
Some Spout → URLPartitioner → Fetcher → Parser → Indexer (default stream)
(1) Some Spout : pull URLs from an external source
(2) URLPartitioner : group per host / domain / IP
(3) Fetcher : fetch URLs
(4) Parser : parse content
(5) Indexer : index text and metadata
9 / 31
Basic Topology : Tuple perspective
Some Spout → URLPartitioner → Fetcher → Parser → Indexer (default stream)
(1) Some Spout → URLPartitioner : <String url, Metadata m>
(2) URLPartitioner → Fetcher : <String url, String key, Metadata m>
(3) Fetcher → Parser : <String url, byte[] content, Metadata m>
(4) Parser → Indexer : <String url, byte[] content, Metadata m, String text>
10 / 31
Basic scenario
➢ Spout : simple queue e.g. RabbitMQ, AWS SQS, etc...
➢ Indexer : any form of storage but often an index e.g. SOLR, ElasticSearch
➢ Components in grey : default ones from SC
➢ Use case : non-recursive crawls (i.e. no links discovered), URLs known in advance. Failures don't matter too much.
➢ “Great! What about recursive crawls and / or failure handling?”
11 / 31
Frontier expansion
➢ Manual “discovery”
– Adding new URLs by hand, “seeding”
➢ Automatic discovery of new resources (frontier expansion)
– Not all outlinks are equally useful - control
– Requires content parsing and link extraction
[Diagram: expansion from the seed in successive rounds i = 1, i = 2, i = 3]
[Slide courtesy of A. Bialecki]
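The rounds i = 1, 2, 3 can be read as a breadth-first expansion from the seeds. A minimal sketch, assuming a hypothetical in-memory link graph (`outlinks`) that stands in for fetching, parsing and link extraction, and a hypothetical `keep` predicate that plays the role of outlink control. None of these names come from Storm-Crawler.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative only: a toy frontier, not a Storm-Crawler class.
final class FrontierSketch {

    // Returns each discovered URL mapped to the round in which it was found
    // (seeds are round 0).
    static Map<String, Integer> expand(List<String> seeds,
                                       Map<String, List<String>> outlinks,
                                       Predicate<String> keep,
                                       int maxRounds) {
        Map<String, Integer> discovered = new LinkedHashMap<>();
        List<String> current = new ArrayList<>();
        for (String seed : seeds) {
            if (discovered.putIfAbsent(seed, 0) == null) {
                current.add(seed);
            }
        }
        for (int round = 1; round <= maxRounds && !current.isEmpty(); round++) {
            List<String> next = new ArrayList<>();
            for (String url : current) {
                // Stand-in for "fetch, parse, extract links".
                for (String link : outlinks.getOrDefault(url, List.of())) {
                    // Not all outlinks are equally useful: filter, then deduplicate.
                    if (keep.test(link) && discovered.putIfAbsent(link, round) == null) {
                        next.add(link);
                    }
                }
            }
            current = next;
        }
        return discovered;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
                "http://a.com/", List.of("http://a.com/1", "http://ads.com/x"),
                "http://a.com/1", List.of("http://a.com/2", "http://a.com/"));
        System.out.println(expand(List.of("http://a.com/"), graph,
                u -> !u.contains("ads."), 3));
    }
}
```

Deduplicating at discovery time keeps a page that links back to the seed from re-entering the frontier.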
12 / 31
Recursive Topology
[Diagram. Components: Some Spout, URLPartitioner, Fetcher, Parser, Indexer, Switch, Status Updater, DB. Streams: default stream, status stream.]
13 / 31
Status stream
➢ Used for handling errors and updating info about URLs
➢ Newly discovered URLs from parser
public enum Status {
    DISCOVERED, FETCHED, FETCH_ERROR, REDIRECTION, ERROR;
}
➢ StatusUpdater : writes to storage
➢ Can/should extend AbstractStatusUpdaterBolt
14 / 31
Error vs failure in SC
➢ Failure : structural issue e.g. dead node
– Storm's mechanism based on time-out
– Spout methods ack() / fail() receive the message ID
• can decide whether to replay tuple or not
• message ID only => limited info about what went wrong
➢ Error : data issue
– Some temporary e.g. target server temporarily unavailable
– Some fatal e.g. document not parsable
– Handled via Status stream
– StatusUpdater can decide what to do with them
• Change status? Change scheduling?
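One way to picture the "change scheduling?" decision is a function from status to the delay before the next fetch. A minimal sketch: the enum repeats the Status values from the slides, while the class name and every interval are assumptions chosen for illustration, not Storm-Crawler defaults.

```java
import java.time.Duration;
import java.util.Optional;

// Illustrative only: the scheduling policy below is hypothetical.
final class StatusPolicySketch {

    enum Status { DISCOVERED, FETCHED, FETCH_ERROR, REDIRECTION, ERROR }

    // Returns the delay before the URL should be fetched again,
    // or empty if it should not be refetched at all.
    static Optional<Duration> nextFetchDelay(Status status) {
        switch (status) {
            case DISCOVERED:
                return Optional.of(Duration.ZERO);        // new URL: fetch as soon as possible
            case FETCHED:
            case REDIRECTION:
                return Optional.of(Duration.ofDays(1));   // assumed revisit interval
            case FETCH_ERROR:
                return Optional.of(Duration.ofHours(2));  // temporary problem: retry sooner
            case ERROR:
            default:
                return Optional.empty();                  // fatal, e.g. document not parsable
        }
    }

    public static void main(String[] args) {
        for (Status s : Status.values()) {
            System.out.println(s + " -> " + nextFetchDelay(s));
        }
    }
}
```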
15 / 31
Topology in code
Grouping! Politeness and data locality
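Why grouping matters for politeness: if tuples are grouped on a key derived from the host, every URL of a given host reaches the same fetcher instance, which can then pace requests to that host. The sketch below is a simplified model of that property in plain Java, not Storm's actual routing code; `hostKey` and `taskFor` are hypothetical names.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;

// Illustrative only: models "same key => same task", nothing more.
final class GroupingSketch {

    // Partition key: the hostname of the URL (the slides also mention domain or IP).
    static String hostKey(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host.toLowerCase(Locale.ROOT);
    }

    // Equal keys always map to the same task index in [0, numTasks).
    static int taskFor(String key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "http://www.example.com/a",
                "http://www.example.com/b",
                "http://other.example.org/");
        for (String url : urls) {
            String key = hostKey(url);
            System.out.println(url + " -> key=" + key + " task=" + taskFor(key, 4));
        }
    }
}
```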
16 / 31
Launching the Storm Crawler
➢ Modify resources in src/main/resources
– URLFilters, ParseFilters etc...
➢ Build uber-jar with Maven
– mvn clean package
➢ Set the configuration
– YAML file (e.g. crawler-conf.yaml)
➢ Call Storm
storm jar target/storm-crawler-core-0.5-SNAPSHOT-jar-with-dependencies.jar com.digitalpebble.storm.crawler.CrawlTopology -conf crawler-conf.yaml
17 / 31
18 / 31
Comparison with Nutch
19 / 31
Comparison with Nutch
➢ Nutch is batch driven : little control on when URLs are fetched
– Potential issue for use cases that need sessions
– latency++
➢ Fetching is only one of the steps in Nutch
– SC : 'always be fetching' (Ken Krugler); better use of resources
➢ Make it even more flexible
– Typical case : a few custom classes (at least a Topology), the rest are just dependencies and standard S/C components
➢ Not as ready-to-use as Nutch : it's an SDK
➢ Would not have existed without it
– Borrowed code and concepts
20 / 31
Overview of resources
https://www.flickr.com/photos/dipster1/1403240351/
21 / 31
Resources
➢ Separated in core vs external
➢ Core
– URLPartitioner
– Fetcher
– Parser
– SitemapParser
– …
➢ External
– Metrics-related : inc. connector for Librato
– ElasticSearch : Indexer, Spout, StatusUpdater, connector for metrics
➢ Also user maintained external resources
22 / 31
FetcherBolt
➢ Multi-threaded
➢ Polite
– Puts incoming tuples into internal queues based on IP/domain/hostname
– Sets delay between requests from same queue
– Respects robots.txt
➢ Protocol-neutral
– Protocol implementations are pluggable
– HTTP implementation taken from Nutch
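The queue-per-host idea with a delay between requests can be sketched in a few lines. This is a single-threaded toy under stated assumptions: the class and method names are hypothetical, time is passed in explicitly so the behaviour is easy to follow, and robots.txt handling and multi-threading are left out. It is not the FetcherBolt implementation.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

// Illustrative only: per-key queues with a minimum delay between polls of the same queue.
final class PoliteQueuesSketch {

    private final long delayMillis;
    private final Map<String, Queue<String>> queues = new LinkedHashMap<>();
    private final Map<String, Long> nextAllowed = new LinkedHashMap<>();

    PoliteQueuesSketch(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    // key is the host, domain or IP the URL was grouped on.
    void add(String key, String url) {
        queues.computeIfAbsent(key, k -> new ArrayDeque<>()).add(url);
    }

    // Returns a URL from the first queue that may be fetched at nowMillis,
    // or null if every non-empty queue is still within its delay.
    String poll(long nowMillis) {
        for (Map.Entry<String, Queue<String>> entry : queues.entrySet()) {
            long allowedAt = nextAllowed.getOrDefault(entry.getKey(), 0L);
            if (!entry.getValue().isEmpty() && nowMillis >= allowedAt) {
                nextAllowed.put(entry.getKey(), nowMillis + delayMillis);
                return entry.getValue().poll();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        PoliteQueuesSketch queues = new PoliteQueuesSketch(1000);
        queues.add("a.com", "http://a.com/1");
        queues.add("a.com", "http://a.com/2");
        queues.add("b.com", "http://b.com/1");
        for (long now = 0; now <= 1000; now += 500) {
            System.out.println("t=" + now + " -> " + queues.poll(now));
        }
    }
}
```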
23 / 31
ParserBolt
➢ Based on Apache Tika
➢ Supports most commonly used doc formats
– HTML, PDF, DOC etc...
➢ Calls ParseFilters on document
– e.g. scrape info with XPathFilter
➢ Calls URLFilters on outlinks
– e.g. normalize and / or blacklist URLs based on RegExps
Other resources
➢ URLPartitionerBolt
– Generates a key based on the hostname / domain / IP of URL
– Output :
• String URL
• String key
• Metadata metadata
– Useful for fieldsGrouping
➢ URLFilters
– Control crawl expansion
– Delete or rewrite URLs
– Configured in JSON file
– Interface
• String filter(URL sourceUrl, Metadata sourceMetadata, String urlToFilter)
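A chain of URL filters can be sketched against an interface shaped like the one above. Assumptions, stated loudly: `Metadata` is replaced by a plain `Map<String, String>` so the example needs no Storm-Crawler dependency, returning `null` is taken to mean "delete this URL", and the two filters and `applyAll` are hypothetical helpers.

```java
import java.net.URL;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative only: a stand-in interface, not the Storm-Crawler URLFilter.
final class UrlFilterSketch {

    interface UrlFilter {
        // Returns the (possibly rewritten) URL, or null to drop it (assumed convention).
        String filter(URL sourceUrl, Map<String, String> sourceMetadata, String urlToFilter);
    }

    // Drops any URL matching the regular expression.
    static UrlFilter rejectMatching(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return (src, md, url) -> pattern.matcher(url).find() ? null : url;
    }

    // Rewrites the URL by removing its fragment, a simple form of normalisation.
    static UrlFilter stripFragment() {
        return (src, md, url) -> {
            int hash = url.indexOf('#');
            return hash < 0 ? url : url.substring(0, hash);
        };
    }

    // Applies the filters in order and stops as soon as one of them drops the URL.
    static String applyAll(List<UrlFilter> filters, URL src,
                           Map<String, String> md, String url) {
        String current = url;
        for (UrlFilter filter : filters) {
            current = filter.filter(src, md, current);
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<UrlFilter> filters = List.of(stripFragment(), rejectMatching("\\.(jpg|png)$"));
        // The source URL is unused by these two filters, so null is passed for brevity.
        System.out.println(applyAll(filters, null, Map.of(), "http://a.com/page#top"));
        System.out.println(applyAll(filters, null, Map.of(), "http://a.com/logo.png"));
    }
}
```

Running the normaliser before the blacklist means the regular expressions only ever see the cleaned-up form of a URL.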
25 / 31
Other resources
➢ ParseFilters
– Called from ParserBolt
– Extracts information from document
– Configured in JSON file
– Interface
• filter(String URL, byte[] content, DocumentFragment doc, Metadata metadata)
– com.digitalpebble.storm.crawler.parse.filter.XpathFilter.java
• XPath expressions
• Info extracted stored in metadata
• Used later for indexing
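The XPath-to-metadata idea can be illustrated with the JDK's own XML APIs. This is a sketch, not the XpathFilter class: metadata is a plain `Map<String, String>`, the input must be well-formed markup because the JDK parser is used in place of Tika, and `extract` and `parse` are hypothetical helper names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Illustrative only: evaluates XPath expressions and stores the results as metadata.
final class XPathFilterSketch {

    // For each (key, expression) rule, stores the trimmed string value of the
    // expression under that key; rules that yield nothing are skipped.
    static Map<String, String> extract(Document doc, Map<String, String> keyToXPath) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> metadata = new HashMap<>();
        try {
            for (Map.Entry<String, String> rule : keyToXPath.entrySet()) {
                String value = xpath.evaluate(rule.getValue(), doc).trim();
                if (!value.isEmpty()) {
                    metadata.put(rule.getKey(), value);
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("Invalid XPath rule", e);
        }
        return metadata;
    }

    // Parses well-formed markup into a DOM document.
    static Document parse(String wellFormedMarkup) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            wellFormedMarkup.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><head><title>Red shoes</title></head>"
                + "<body><span class=\"price\">42</span></body></html>");
        System.out.println(extract(doc, Map.of(
                "title", "//title",
                "price", "//span[@class='price']")));
    }
}
```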
26 / 31
Integrate it!
➢ Write the Spout for your use case
– Will work fine with existing resources as long as it emits URL, Metadata
➢ Typical scenario
– Group URLs to fetch into separate external queues based on host or domain (AWS SQS, Apache Kafka)
– Write Spout for it and throttle with topology.max.spout.pending
– So that politeness can be enforced without tuples timing out (→ fail)
– Parse and extract
– Send new URLs to queues
➢ Can use various forms of persistence for URLs
– ElasticSearch, DynamoDB, HBase, etc...
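The effect of capping the number of pending tuples can be modelled without Storm. In a real topology the cap is applied by Storm itself through topology.max.spout.pending; the class below is a hypothetical stand-in that folds the cap into a toy spout to show why it helps. URLs are only emitted while few enough are in flight, so they do not pile up behind the fetcher's politeness delay and time out.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Illustrative only: a toy spout with a cap on un-acked tuples.
final class MaxPendingSketch {

    private final int maxPending;
    private final Queue<String> source = new ArrayDeque<>();
    private final Set<String> pending = new HashSet<>();

    MaxPendingSketch(int maxPending) {
        this.maxPending = maxPending;
    }

    void enqueue(String url) {
        source.add(url);
    }

    // Emits the next URL, or null when the cap is reached or the source is empty.
    String nextTuple() {
        if (pending.size() >= maxPending || source.isEmpty()) {
            return null;
        }
        String url = source.poll();
        pending.add(url);
        return url;
    }

    // The tuple was fully processed: free a slot.
    void ack(String url) {
        pending.remove(url);
    }

    // The tuple failed or timed out: free the slot and queue the URL for replay.
    void fail(String url) {
        if (pending.remove(url)) {
            source.add(url);
        }
    }

    public static void main(String[] args) {
        MaxPendingSketch spout = new MaxPendingSketch(2);
        spout.enqueue("http://a.com/1");
        spout.enqueue("http://a.com/2");
        spout.enqueue("http://a.com/3");
        System.out.println(spout.nextTuple());
        System.out.println(spout.nextTuple());
        System.out.println(spout.nextTuple()); // cap reached, nothing emitted
        spout.ack("http://a.com/1");
        System.out.println(spout.nextTuple());
    }
}
```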
27 / 31
Some use cases (prototype stage)
➢ Processing of streams of URLs (natural fit for Storm)
– http://www.weborama.com
➢ Monitoring of finite set of URLs
– http://www.ontopic.io (more on them later)
– http://www.shopstyle.com : scraping + indexing
➢ One-off non-recursive crawling
– http://www.stolencamerafinder.com/ : scraping + indexing
➢ Recursive crawler
– http://www.shopstyle.com : discovery of product pages
28 / 31
What's next?
➢ 0.5 released soon?
➢ All-in-one crawler based on ElasticSearch
– Also a good example of how to use SC
– Separate project
➢ Additional Parse/URLFilters
➢ Facilitate writing IndexingBolts
➢ More tests and documentation
➢ A nice logo or better name?
29 / 31
Resources
➢ https://storm.apache.org/
➢ “Storm Applied” http://www.manning.com/sallen/
➢ https://github.com/DigitalPebble/storm-crawler
➢ http://nutch.apache.org/
30 / 31
Questions?
31 / 31

Cyber security and its impact on E commerce
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 

An introduction to Storm Crawler

  • 1. An introduction to Storm Crawler
      Julien Nioche, julien@digitalpebble.com, @digitalpebble
      Java Meetup – Bristol, 10/03/2015
  • 2. About myself
      ▪ DigitalPebble Ltd, Bristol (UK)
      ▪ Specialised in Text Engineering
          – Web Crawling
          – Natural Language Processing
          – Information Retrieval
          – Machine Learning
      ▪ Strong focus on Open Source & Apache ecosystem
      ▪ User | Contributor | Committer
          – Nutch
          – Tika
          – SOLR, Lucene
          – GATE, UIMA
          – Behemoth
  • 3. Storm-Crawler: what is it?
      ▪ Collection of resources (SDK) for building web crawlers on Apache Storm
      ▪ https://github.com/DigitalPebble/storm-crawler
      ▪ Artefacts available from Maven Central
      ▪ Apache License v2
      ▪ Active and growing fast
      => Scalable
      => Low latency
      => Easily extensible
      => Polite
  • 4. What it is not
      ▪ A ready-to-use, feature-complete web crawler
          – Will provide that as a separate project using S/C later
      ▪ e.g. no PageRank or explicit ranking of pages
          – Build your own
      ▪ No fancy UI, dashboards, etc.
          – Build your own or integrate in your existing ones
      ▪ Compare with Apache Nutch later
  • 5. Apache Storm
      ▪ Distributed real-time computation system
      ▪ http://storm.apache.org/
      ▪ Scalable, fault-tolerant, polyglot and fun!
      ▪ Implemented in Clojure + Java
      ▪ “Realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more”
  • 6. Storm Cluster Layout
      [Figure; source: http://cdn.oreillystatic.com/en/assets/1/event/85/Realtime%20Processing%20with%20Storm%20Presentation.pdf]
  • 7. Topology = spouts + bolts + tuples + streams
  • 8. Basic Crawl Topology
      Some Spout → URLPartitioner → Fetcher → Parser → Indexer (default stream)
      (1) pull URLs from an external source
      (2) group per host / domain / IP
      (3) fetch URLs
      (4) parse content
      (5) index text and metadata
  • 9. Basic Topology: tuple perspective
      Some Spout –(1)→ URLPartitioner –(2)→ Fetcher –(3)→ Parser –(4)→ Indexer (default stream)
      (1) <String url, Metadata m>
      (2) <String url, String key, Metadata m>
      (3) <String url, byte[] content, Metadata m>
      (4) <String url, byte[] content, Metadata m, String text>
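The step from tuple (1) to tuple (2) adds a partition key to each URL. A minimal sketch of that idea follows, assuming the key is simply the lower-cased hostname. The class `PartitionKeys` and its methods are hypothetical names for illustration and are not part of Storm-Crawler. The `taskFor` method is only a simplified stand-in for what a fields grouping achieves (equal keys reach the same task), not Storm's actual routing code.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper (not a Storm-Crawler class): derives a per-host partition key. */
final class PartitionKeys {
    private PartitionKeys() {}

    /** Returns the lower-cased hostname, or the raw URL if no host can be parsed. */
    static String hostKey(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? url : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return url; // malformed URL: fall back to the raw string
        }
    }

    /** Simplified model of a fields grouping: equal keys always map to the same task index. */
    static int taskFor(String key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }
}
```

Because every URL of a host yields the same key, all of them land on the same task, which is what makes per-host politeness possible downstream.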
  • 10. Basic scenario
      ▪ Spout: simple queue, e.g. RabbitMQ, AWS SQS, etc.
      ▪ Indexer: any form of storage, but often an index, e.g. SOLR, ElasticSearch
      ▪ Components in grey: default ones from SC
      ▪ Use case: non-recursive crawls (i.e. no links discovered), URLs known in advance. Failures don't matter too much.
      ▪ “Great! What about recursive crawls and / or failure handling?”
  • 11. Frontier expansion
      ▪ Manual “discovery”
          – Adding new URLs by hand, “seeding”
      ▪ Automatic discovery of new resources (frontier expansion)
          – Not all outlinks are equally useful, so expansion needs to be controlled
          – Requires content parsing and link extraction
      [Figure: seed, then i = 1, i = 2, i = 3. Slide courtesy of A. Bialecki]
  • 12. Recursive Topology
      [Diagram] Components: Some Spout, URLPartitioner, Fetcher, Parser, Indexer, Switch, Status Updater, DB
      Streams: default stream, status stream
  • 13. Status stream
      ▪ Used for handling errors and updating info about URLs
      ▪ Newly discovered URLs from parser
      public enum Status { DISCOVERED, FETCHED, FETCH_ERROR, REDIRECTION, ERROR; }
      ▪ StatusUpdater: writes to storage
      ▪ Can/should extend AbstractStatusUpdaterBolt
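To make the role of a status updater concrete, here is a minimal in-memory sketch. It reuses the `Status` values from the slide, but `InMemoryStatusUpdater`, its methods and the scheduling intervals are all assumptions made for illustration. It does not show the `AbstractStatusUpdaterBolt` API, and a real updater would write to external storage.

```java
import java.util.HashMap;
import java.util.Map;

enum Status { DISCOVERED, FETCHED, FETCH_ERROR, REDIRECTION, ERROR }

/** Hypothetical in-memory status store; a real updater would write to a database or index. */
final class InMemoryStatusUpdater {
    // Assumed scheduling intervals in minutes (illustrative values only).
    static final long REFETCH_MINUTES = 24 * 60;
    static final long RETRY_MINUTES = 120;
    static final long NEVER = Long.MAX_VALUE;

    static final class Entry {
        final Status status;
        final long nextFetchMinute;

        Entry(Status status, long nextFetchMinute) {
            this.status = status;
            this.nextFetchMinute = nextFetchMinute;
        }
    }

    private final Map<String, Entry> store = new HashMap<>();

    /** Records a status; a DISCOVERED update never overwrites a URL that is already known. */
    void update(String url, Status status, long nowMinute) {
        if (status == Status.DISCOVERED && store.containsKey(url)) {
            return;
        }
        store.put(url, new Entry(status, nextFetch(status, nowMinute)));
    }

    /** Decides when the URL becomes eligible for fetching again. */
    static long nextFetch(Status status, long nowMinute) {
        switch (status) {
            case DISCOVERED:  return nowMinute;                   // fetch as soon as possible
            case FETCH_ERROR: return nowMinute + RETRY_MINUTES;   // temporary problem: retry sooner
            case ERROR:       return NEVER;                       // fatal problem: do not retry
            default:          return nowMinute + REFETCH_MINUTES; // FETCHED, REDIRECTION
        }
    }

    Entry get(String url) {
        return store.get(url);
    }
}
```

The rule that DISCOVERED never overwrites an existing entry is what keeps a recursive crawl from resetting pages it has already fetched each time they are linked again.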
  • 14. Error vs failure in SC
      ▪ Failure: structural issue, e.g. dead node
          – Storm's mechanism based on time-out
          – Spout methods ack() / fail() receive the message ID
              • can decide whether to replay tuple or not
              • message ID only => limited info about what went wrong
      ▪ Error: data issue
          – Some temporary, e.g. target server temporarily unavailable
          – Some fatal, e.g. document not parsable
          – Handled via Status stream
          – StatusUpdater can decide what to do with them
              • Change status? Change scheduling?
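The failure path can be sketched without the Storm API: a source remembers which URL belongs to each message ID, so that a later `fail(id)` can decide whether to replay it. `ReplayingUrlSource` and the `maxRetries` policy are hypothetical, chosen only to illustrate that the message ID is the sole information available at that point.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical spout-like URL source, independent of the Storm API. */
final class ReplayingUrlSource {
    private final Deque<String> queue = new ArrayDeque<>();
    private final Map<Long, String> pending = new HashMap<>();     // message ID -> URL
    private final Map<String, Integer> failures = new HashMap<>(); // URL -> failure count
    private final int maxRetries;
    private long nextId = 0;

    ReplayingUrlSource(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    void add(String url) {
        queue.addLast(url);
    }

    /** Emits the next URL and returns its message ID, or -1 when nothing is queued. */
    long next() {
        String url = queue.pollFirst();
        if (url == null) {
            return -1;
        }
        long id = nextId++;
        pending.put(id, url);
        return id;
    }

    /** Success: forget the message and its failure history. */
    void ack(long id) {
        String url = pending.remove(id);
        if (url != null) {
            failures.remove(url);
        }
    }

    /** Failure: re-queue the URL unless it has failed more than maxRetries times. Returns true if replayed. */
    boolean fail(long id) {
        String url = pending.remove(id);
        if (url == null) {
            return false;
        }
        int count = failures.merge(url, 1, Integer::sum);
        if (count > maxRetries) {
            failures.remove(url);
            return false;
        }
        queue.addLast(url);
        return true;
    }

    int pendingCount() {
        return pending.size();
    }
}
```

Note that `fail` has no way of knowing why the tuple failed, which is why data-level errors are better reported through the status stream.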
  • 15. Topology in code
      Grouping! Politeness and data locality
  • 16. Launching the Storm Crawler
      ▪ Modify resources in src/main/resources
          – URLFilters, ParseFilters etc.
      ▪ Build uber-jar with Maven
          – mvn clean package
      ▪ Set the configuration
          – YAML file (e.g. crawler-conf.yaml)
      ▪ Call Storm
          storm jar target/storm-crawler-core-0.5-SNAPSHOT-jar-with-dependencies.jar com.digitalpebble.storm.crawler.CrawlTopology -conf crawler-conf.yaml
  • 18. Comparison with Nutch
  • 19. Comparison with Nutch
      ▪ Nutch is batch driven: little control on when URLs are fetched
          – Potential issue for use cases that need sessions
          – latency++
      ▪ Fetching is only one of the steps in Nutch
          – SC: 'always be fetching' (Ken Krugler); better use of resources
      ▪ Make it even more flexible
          – Typical case: few custom classes (at least a Topology); the rest are just dependencies and standard S/C components
      ▪ Not as ready-to-use as Nutch: it's an SDK
      ▪ Would not have existed without it
          – Borrowed code and concepts
  • 20. Overview of resources
      [Image: https://www.flickr.com/photos/dipster1/1403240351/]
  • 21. Resources
      ▪ Separated in core vs external
      ▪ Core
          – URLPartitioner
          – Fetcher
          – Parser
          – SitemapParser
          – …
      ▪ External
          – Metrics-related: incl. connector for Librato
          – ElasticSearch: Indexer, Spout, StatusUpdater, connector for metrics
      ▪ Also user-maintained external resources
  • 22. FetcherBolt
      ▪ Multi-threaded
      ▪ Polite
          – Puts incoming tuples into internal queues based on IP/domain/hostname
          – Sets delay between requests from same queue
          – Respects robots.txt
      ▪ Protocol-neutral
          – Protocol implementations are pluggable
          – HTTP implementation taken from Nutch
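The politeness mechanism described for the FetcherBolt (one internal queue per key, with a delay between requests from the same queue) can be sketched as follows. This is not the FetcherBolt's code: `PoliteQueues` is a hypothetical, single-threaded model, and the clock is passed in explicitly so that the behaviour is easy to follow. It ignores robots.txt and multi-threading.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical per-key fetch queues with a fixed delay between requests to the same key. */
final class PoliteQueues {
    private static final class HostQueue {
        final Deque<String> urls = new ArrayDeque<>();
        long nextAllowedMillis = 0; // earliest time the next URL of this queue may be fetched
    }

    private final Map<String, HostQueue> queues = new LinkedHashMap<>();
    private final long delayMillis;

    PoliteQueues(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /** Queues a URL under its partition key (hostname, domain or IP). */
    void add(String key, String url) {
        queues.computeIfAbsent(key, k -> new HostQueue()).urls.addLast(url);
    }

    /** Returns a URL from the first queue that may be fetched at nowMillis, or null if none may. */
    String poll(long nowMillis) {
        for (HostQueue q : queues.values()) {
            if (!q.urls.isEmpty() && nowMillis >= q.nextAllowedMillis) {
                q.nextAllowedMillis = nowMillis + delayMillis;
                return q.urls.pollFirst();
            }
        }
        return null;
    }
}
```

With a one-second delay, two URLs from the same host are released at least a second apart, while a URL from another host can be fetched in between.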
  • 23. ParserBolt
      ▪ Based on Apache Tika
      ▪ Supports most commonly used doc formats
          – HTML, PDF, DOC etc.
      ▪ Calls ParseFilters on document
          – e.g. scrape info with XPathFilter
      ▪ Calls URLFilters on outlinks
          – e.g. normalize and / or blacklist URLs based on RegExps
  • 24. Other resources
      ▪ URLPartitionerBolt
          – Generates a key based on the hostname / domain / IP of URL
          – Output:
              • String URL
              • String key
              • String metadata
          – Useful for fieldGrouping
      ▪ URLFilters
          – Control crawl expansion
          – Delete or rewrite URLs
          – Configured in JSON file
          – Interface
              • String filter(URL sourceUrl, Metadata sourceMetadata, String urlToFilter)
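A small sketch of how URL filters with the signature above might be chained, assuming (this is not stated on the slide) that a filter returns `null` to discard a URL and a possibly rewritten string otherwise. `UrlFilter`, `UrlFilterChain` and the two example filters are hypothetical, and `Map<String, String>` stands in for the project's `Metadata` class.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Mirrors the signature on the slide; Map<String, String> is a stand-in for Metadata. */
interface UrlFilter {
    /** Returns the (possibly rewritten) URL, or null to discard it (assumed convention). */
    String filter(URL sourceUrl, Map<String, String> sourceMetadata, String urlToFilter);
}

/** Hypothetical chain: applies filters in order and stops at the first one that discards the URL. */
final class UrlFilterChain implements UrlFilter {
    private final List<UrlFilter> filters;

    UrlFilterChain(List<UrlFilter> filters) {
        this.filters = filters;
    }

    @Override
    public String filter(URL sourceUrl, Map<String, String> sourceMetadata, String urlToFilter) {
        String current = urlToFilter;
        for (UrlFilter f : filters) {
            current = f.filter(sourceUrl, sourceMetadata, current);
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    /** Example normaliser: removes the fragment ("#...") from a URL. */
    static UrlFilter stripFragment() {
        return (src, md, url) -> {
            int hash = url.indexOf('#');
            return hash < 0 ? url : url.substring(0, hash);
        };
    }

    /** Example blacklist: discards URLs in which the regular expression is found. */
    static UrlFilter rejectMatching(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return (src, md, url) -> pattern.matcher(url).find() ? null : url;
    }

    /** Convenience: builds a URL, converting the checked exception into an unchecked one. */
    static URL toUrl(String s) {
        try {
            return URI.create(s).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

Order matters in such a chain: normalising before blacklisting means the regular expression sees the cleaned-up URL.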
  • 25. Other resources
      ▪ ParseFilters
          – Called from ParserBolt
          – Extracts information from document
          – Configured in JSON file
          – Interface
              • filter(String URL, byte[] content, DocumentFragment doc, Metadata metadata)
          – com.digitalpebble.storm.crawler.parse.filter.XpathFilter.java
              • XPath expressions
              • Info extracted stored in metadata
              • Used later for indexing
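The idea behind an XPath-based parse filter (evaluate expressions against the parsed document and store the results in the metadata) can be sketched with the JDK's own XML and XPath APIs. `XPathExtractor` is a hypothetical class and not the project's `XpathFilter`. It takes a `Document` rather than a `DocumentFragment`, uses `Map<String, String>` in place of `Metadata`, and its `parse` helper only accepts well-formed XML or XHTML.

```java
import java.io.ByteArrayInputStream;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

/** Hypothetical XPath-based extractor; Map<String, String> is a stand-in for Metadata. */
final class XPathExtractor {
    private final Map<String, String> rules; // metadata key -> XPath expression

    XPathExtractor(Map<String, String> rules) {
        this.rules = rules;
    }

    /** Evaluates each expression against the document and stores non-empty results in metadata. */
    void filter(String url, byte[] content, Document doc, Map<String, String> metadata) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            try {
                String value = xpath.evaluate(rule.getValue(), doc).trim();
                if (!value.isEmpty()) {
                    metadata.put(rule.getKey(), value);
                }
            } catch (XPathExpressionException e) {
                throw new IllegalArgumentException("Bad XPath: " + rule.getValue(), e);
            }
        }
    }

    /** Parses well-formed XML/XHTML bytes into a DOM; real-world HTML needs a more tolerant parser. */
    static Document parse(byte[] content) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(content));
        } catch (Exception e) {
            throw new IllegalArgumentException("Content is not well-formed XML", e);
        }
    }
}
```

Keeping the rules as key/expression pairs mirrors the slide's point that such filters are driven by configuration rather than code.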
  • 26. Integrate it!
      ▪ Write the Spout for your use case
          – Will work fine with existing resources as long as it generates URL, Metadata
      ▪ Typical scenario
          – Group URLs to fetch into separate external queues based on host or domain (AWS SQS, Apache Kafka)
          – Write Spout for it and throttle with topology.max.spout.pending
          – So that you can enforce politeness without getting time-outs on Tuples → fail
          – Parse and extract
          – Send new URLs to queues
      ▪ Can use various forms of persistence for URLs
          – ElasticSearch, DynamoDB, HBase, etc.
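The effect of capping the number of pending tuples can be modelled in a few lines. In Storm the cap is enforced by the framework through `topology.max.spout.pending`; the hypothetical `ThrottledSource` below only illustrates the resulting behaviour: no new URL is handed out while the number of tuples in flight has reached the cap.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model of a source capped at maxPending tuples in flight. */
final class ThrottledSource {
    private final Deque<String> queue = new ArrayDeque<>();
    private final int maxPending;
    private int inFlight = 0;

    ThrottledSource(int maxPending) {
        this.maxPending = maxPending;
    }

    void add(String url) {
        queue.addLast(url);
    }

    /** Returns the next URL, or null if the queue is empty or the cap has been reached. */
    String next() {
        if (inFlight >= maxPending || queue.isEmpty()) {
            return null;
        }
        inFlight++;
        return queue.pollFirst();
    }

    /** Called once a tuple has been acked or failed; frees one slot. */
    void done() {
        if (inFlight > 0) {
            inFlight--;
        }
    }
}
```

Keeping the cap low means URLs wait in the external queue rather than inside the fetcher's politeness queues, where a long wait could exceed the tuple time-out.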
  • 27. Some use cases (prototype stage)
      ▪ Processing of streams of URLs (natural fit for Storm)
          – http://www.weborama.com
      ▪ Monitoring of finite set of URLs
          – http://www.ontopic.io (more on them later)
          – http://www.shopstyle.com : scraping + indexing
      ▪ One-off non-recursive crawling
          – http://www.stolencamerafinder.com/ : scraping + indexing
      ▪ Recursive crawler
          – http://www.shopstyle.com : discovery of product pages
  • 28. What's next?
      ▪ 0.5 released soon?
      ▪ All-in-one crawler based on ElasticSearch
          – Also a good example of how to use SC
          – Separate project
      ▪ Additional Parse/URLFilters
      ▪ Facilitate writing IndexingBolts
      ▪ More tests and documentation
      ▪ A nice logo or better name?
  • 29. Resources
      ▪ https://storm.apache.org/
      ▪ “Storm Applied” http://www.manning.com/sallen/
      ▪ https://github.com/DigitalPebble/storm-crawler
      ▪ http://nutch.apache.org/