SlideShare utilise les cookies pour améliorer les fonctionnalités et les performances, et également pour vous montrer des publicités pertinentes. Si vous continuez à naviguer sur ce site, vous acceptez l’utilisation de cookies. Consultez nos Conditions d’utilisation et notre Politique de confidentialité.
SlideShare utilise les cookies pour améliorer les fonctionnalités et les performances, et également pour vous montrer des publicités pertinentes. Si vous continuez à naviguer sur ce site, vous acceptez l’utilisation de cookies. Consultez notre Politique de confidentialité et nos Conditions d’utilisation pour en savoir plus.
Building a Scalable Web Crawler with Hadoop by Ahad Rana from CommonCrawl Ahad Rana, engineer at CommonCrawl, will go over CommonCrawl’s extensive use of Hadoop to fulfill their mission of building an open, and accessible Web-Scale crawl. He will discuss their Hadoop data processing pipeline, including their PageRank implementation, describe techniques they use to optimize Hadoop, discuss the design of their URL Metadata service, and conclude with details on how you can leverage the crawl (using Hadoop) today.
Building an open Web-Scale crawl using Hadoop.
Architect / Engineer at CommonCrawl
Who is CommonCrawl ?
• A 501(c)3 non-profit “dedicated to building, maintaining and
making widely available a comprehensive crawl of the
Internet for the purpose of enabling a new wave of
innovation, education and research.”
• Funded through a grant by Gil Elbaz, former Googler and
founder of Applied Semantics, and current CEO of Factual Inc.
• Board members include Carl Malamud and Nova Spivack.
Motivations Behind CommonCrawl
• Internet is a massively disruptive force.
• Exponential advances in computing capacity, storage and
bandwidth are creating constant flux and disequilibrium in the IT
• Cloud computing makes large scale, on-demand computing
affordable for even the smallest startup.
• Hadoop provides the technology stack that enables us to crunch
massive amounts of data.
• Having the ability to “Map-Reduce the Internet” opens up lots of
new opportunities for disruptive innovation and we would like to
reduce the cost of doing this by an order of magnitude, at least.
• White list only the major search engines trend by Webmasters puts
the future of the Open Web at risk and stifles future search
innovation and evolution.
• Crawl broadly and frequently across all TLDs.
• Prioritize the crawl based on simplified criteria (rank and
• Upload the crawl corpus to S3.
• Make our S3 bucket widely accessible to as many users as
• Build support libraries to facilitate access to the S3 data via
• Focus on doing a few things really well.
• Listen to customers and open up more metadata and services
• We are not a comprehensive crawl, and may never be
• URLs in Crawl DB – 14 billion
• URLs with inverse link graph – 1.6 billion
• URLS with content in S3 – 2.5 billion
• Recent crawled documents – 500 million
• Uploaded documents after Deduping 300 million.
• Newly discovered URLs – 1.9 billion
• # of Vertices in Page Rank (recent caclulation) – 3.5 billion
• # of Edges in Page Rank Graph (recent caclulation) – 17 billion
Current System Design
• Batch oriented crawl list generation.
• High volume crawling via independent crawlers.
• Crawlers dump data into HDFS.
• Map-Reduce jobs parse, extract metadata from crawled
documents in bulk independently of crawlers.
• Periodically, we ‘checkpoint’ the crawl, which involves, among
– Post processing of crawled documents (deduping etc.)
– ARC file generation
– Link graph updates
– Crawl database updates.
– Crawl list regeneration.
Our Cluster Config
• Modest internal cluster consisting of 24 Hadoop nodes,4
crawler nodes, and 2 NameNode / Database servers.
• Each Hadoop node has 6 x 1.5 TB drives and Dual-QuadCore
Xeons with 24 or 32 GB of RAM.
• 9 Map Tasks per node, avg 4 Reducers per node, BLOCK
compression using LZO.
Crawler Design Details
• Java codebase.
• Asynchronous IO model using custom NIO based HTTP stack.
• Lots of worker threads that synchronize with main thread via
Asynchronous message queues.
• Can sustain a crawl rate of ~250 URLS per second.
• Up to 500 active HTTP connections at any one time.
• Currently, no document parsing in crawler process.
• We currently run 8 crawlers and crawl on average ~100 million
URLs per day, when crawling.
• During post processing phase, on average we process 800
• After Deduping, we package and upload on average
approximately 500 million documents to S3.
• Primary Keys are 128 bit URL fingerprints, consisting of 64 bit
domain fingerprint, and 64 bit URL fingerprint (Rabin-Hash).
• Keys are distributed via modulo operation of URL portion of
• Currently, we run 4 reducers per node, and there is one node
down, so we have 92 unique shards.
• Keys in each shard are sorted by Domain FP, then URL FP.
• We like the 64 bit domain id, since it is a generated key, but it
• We may move to a 32 bit root domain id / 32 bit domain id +
64 URL fingerprint key scheme in the future, and then sort by
root domain, domain, and then FP per shard.
Crawl Database – Continued
• Values in the Crawl Database consist of extensible Metadata
• We currently use our own DDL and compiler for generating
structures (vs. using Thrift/ProtoBuffers/Avro).
• Avro / ProtoBufs were not available when we started, and we
added lots of Hadoop friendly stuff to our version (multipart [key]
attributes lead to auto WritableComparable derived classes, with
built-in Raw Comparator support etc.).
• Our compiler also generates RPC stubs, with Google ProtoBuf style
message passing semantics (Message w/ optional Struct In, optional
Struct Out) instead of Thrift style semantics (Method with multiple
arguments and a return type).
• We prefer the former because it is better attuned to our preference
towards the asynchronous style of RPC programming.
Map-Reduce Pipeline – Link Graph Construction
Link Graph Construction
Inverse Link Graph Construction
Map-Reduce Pipeline – PageRank Edge Graph Construction
Page Rank Process
Generate Page Rank Values
The Need For a Smarter Merge
• Pipelining nature of HDFS means each Reducer writes it’s
output to local disk first, then to Repl Level – 1 other nodes.
• If intermediate data record sets are already sorted, the need
to run an Identity Mapper/Shuffle/Merge Sort phase to join to
sorted record sets is very expensive.