8. How We Do
Crawl the Web
~25-30 billion pages per month
20 Crawler machines
~256 MB/sec aggregate download rate
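A rough back-of-the-envelope check of the crawl figures above (assuming a 30-day month and the upper bound of 30 billion pages; the per-page size that falls out is an implication, not a number stated on the slide):

    # Sanity check of the crawl numbers: 30B pages/month, 256 MB/sec aggregate, 20 machines.
    pages_per_month = 30e9
    seconds_per_month = 30 * 24 * 3600            # ~2.59 million seconds
    aggregate_rate_mb = 256.0                     # MB/sec across the whole fleet
    machines = 20

    pages_per_sec = pages_per_month / seconds_per_month     # ~11,600 pages/sec
    per_machine_mb = aggregate_rate_mb / machines            # ~12.8 MB/sec per crawler
    avg_page_kb = aggregate_rate_mb * 1024 / pages_per_sec   # ~23 KB per page on average

    print(f"{pages_per_sec:,.0f} pages/sec, "
          f"{per_machine_mb:.1f} MB/sec per machine, "
          f"~{avg_page_kb:.0f} KB per page")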
11. How We Do
Compute Aggregates and Metrics
1:5 to 1:50 Compression Ratios
Aggregates are Parallelized Linear Scans
Communication Avoided where Possible
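A minimal sketch of what "parallelized linear scans with communication avoided" can look like in practice: each worker scans its own compressed shard and emits only a tiny partial aggregate, so the only data that crosses process boundaries is the final reduce. The shard layout, file names, and the out-link metric here are assumptions, not the talk's actual format.

    import glob
    import gzip
    from multiprocessing import Pool

    def scan_shard(path):
        """Linear scan over one compressed shard; returns a small partial aggregate."""
        links = pages = 0
        with gzip.open(path, "rt") as f:
            for line in f:                           # one record per line (assumed layout)
                pages += 1
                links += int(line.split("\t")[1])    # hypothetical out-link count column
        return links, pages                          # only these two numbers are communicated

    if __name__ == "__main__":
        shards = glob.glob("shards/*.gz")            # hypothetical shard naming
        with Pool() as pool:
            partials = pool.map(scan_shard, shards)
        total_links = sum(l for l, _ in partials)
        total_pages = sum(p for _, p in partials)
        print("avg out-links per page:", total_links / max(total_pages, 1))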
14. How We Do
Surface with a Read-Only API
~12 TB per Release in Amazon S3
6 m2.4xlarge Instances for Cache
~28k Requests per Minute
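One way to picture the serving path described above: a release sits as immutable files in S3 and cache nodes answer reads, going to S3 only on a miss. Because releases never change, caching needs no invalidation. The bucket name, key scheme, and in-process LRU cache below are placeholders for illustration; the real layout is not described on the slide.

    from functools import lru_cache
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-linkdata-release"      # hypothetical bucket name
    RELEASE = "release-2011-06"              # releases are immutable, so caching is safe

    @lru_cache(maxsize=100_000)              # in-process stand-in for the cache tier
    def fetch_block(column, block_id):
        """Fetch one block of a column file from the current release."""
        key = f"{RELEASE}/{column}/block-{block_id:08d}"   # hypothetical key scheme
        resp = s3.get_object(Bucket=BUCKET, Key=key)
        return resp["Body"].read()

    def handle_request(column, block_id):
        # Read-only: the API never writes, it only locates and returns a block.
        return fetch_block(column, block_id)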
15. Observations and Strategy
Billions of Small, Similar Records
De-normalization Avoids Complex Joins
Batch-style Emphasizes Spatial Locality
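A toy illustration of the de-normalized, batch-friendly layout: instead of joining a pages table against a metrics table at query time, every field a record needs is packed into one fixed-width row, and rows are written sorted by ID so a linear scan reads the file front to back. The field names and widths are invented for the sketch.

    import struct

    # Hypothetical de-normalized record: everything the API needs in one row,
    # so serving never performs a join.
    RECORD = struct.Struct("<QIIff")   # url_id, inlinks, outlinks, page_metric, domain_metric

    def write_release(records, path):
        # Rows go out sorted by url_id; a batch scan then touches memory and
        # disk sequentially, which is what "spatial locality" buys.
        with open(path, "wb") as f:
            for rec in sorted(records):
                f.write(RECORD.pack(*rec))

    def scan_release(path):
        with open(path, "rb") as f:
            while chunk := f.read(RECORD.size):
                yield RECORD.unpack(chunk)

    write_release([(42, 10, 3, 5.5, 7.1), (7, 2, 1, 1.0, 3.2)], "pages.bin")
    print(list(scan_release("pages.bin")))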
20. Indexing
Columns have BDBs indexing by ID
Subset of IDs map to Compression Runs
Decompress Run and Scan to find Record
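The lookup path on this slide is essentially a sparse index: only the first ID of each compression run is indexed, so a point read finds the right run, decompresses it, and scans for the exact record. The sketch below stands in a sorted list plus bisect for the Berkeley DB and zlib for the run compression; both substitutions are assumptions made for illustration.

    import bisect
    import zlib

    def build_runs(records, run_size=4):
        """records: sorted (id, value) pairs -> list of (first_id_in_run, compressed_bytes)."""
        runs = []
        for i in range(0, len(records), run_size):
            chunk = records[i:i + run_size]
            payload = "\n".join(f"{rid}\t{val}" for rid, val in chunk).encode()
            runs.append((chunk[0][0], zlib.compress(payload)))
        return runs

    def lookup(runs, target_id):
        run_starts = [start for start, _ in runs]
        # Find the run whose first ID is the largest one <= target_id (the index's job).
        idx = bisect.bisect_right(run_starts, target_id) - 1
        if idx < 0:
            return None
        # Decompress that one run and linearly scan it for the record.
        for line in zlib.decompress(runs[idx][1]).decode().splitlines():
            rid, val = line.split("\t")
            if int(rid) == target_id:
                return val
        return None

    runs = build_runs([(i, f"record-{i}") for i in range(0, 100, 3)])
    print(lookup(runs, 42))   # -> "record-42"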
21. Physical Deployment
Crawlers run in Colo for white-listed IPs
Batch Process and API layer in EC2
The API might be in a colo too, but ELB + Autoscaling are nice
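As an illustration of why EC2 is convenient for the API tier, here is a minimal sketch of attaching an Auto Scaling group to a classic ELB with boto3. The group name, launch configuration, ELB name, sizes, and availability zones are all placeholders, and the talk itself predates boto3; this only shows the shape of the setup.

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Hypothetical setup: the API tier scales between 6 and 12 instances behind an
    # existing classic ELB named "api-elb"; "api-launch-config" must already exist.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="api-asg",
        LaunchConfigurationName="api-launch-config",
        MinSize=6,
        MaxSize=12,
        DesiredCapacity=6,
        LoadBalancerNames=["api-elb"],
        AvailabilityZones=["us-east-1a", "us-east-1b"],
    )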