Mozscape: NoSQL at Terabyte Scale

Phil Smith
Software Engineer
What We Do




   SEO & Inbound Marketing Metrics

   www.opensiteexplorer.org
What We Do

www.opensiteexplorer.org

Collect back links across the web

Compute metrics estimating value

Serve links and metrics with API and OSE
How We Do

Crawl the Web

 ~25-30 billion pages per month

 20 Crawler machines

 ~256 MB/sec aggregate download rate
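
As a rough back-of-the-envelope check on the crawl figures above, the sketch below converts the monthly volume into per-second and per-crawler rates. It assumes a 30-day month, reads "MB" as 2**20 bytes, and treats the aggregate download rate as sustained; none of these assumptions come from the slides, and every name is illustrative.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # assumption: 30-day month
CRAWLERS = 20                           # from the slide
DOWNLOAD_BYTES_PER_SEC = 256 * 2**20    # assumption: MB means 2**20 bytes


def crawl_rates(pages_per_month):
    """Derive per-second rates from a monthly page count."""
    pages_per_sec = pages_per_month / SECONDS_PER_MONTH
    return {
        "pages_per_sec": pages_per_sec,
        "pages_per_sec_per_crawler": pages_per_sec / CRAWLERS,
        # Only meaningful if the download rate is sustained all month.
        "avg_bytes_per_page": DOWNLOAD_BYTES_PER_SEC / pages_per_sec,
    }


for volume in (25e9, 30e9):
    rates = crawl_rates(volume)
    print(f"{volume:.0e} pages/month:",
          {name: round(value) for name, value in rates.items()})
```

Under these assumptions the fleet fetches on the order of 10,000 pages per second, roughly 500 per crawler, at a few tens of kilobytes per page.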
How We Do

Compute Aggregates and Metrics

 1:5 to 1:50 Compression Ratios

 Aggregates are Parallelized Linear Scans

 Communication Avoided where Possible
How We Do

Surface with a Read-Only API

 ~12 TB per Release in Amazon S3

 6 m2.4xlarge Instances for Cache

 ~28k Requests per Minute
Observations and Strategy


Billions of Small, Similar Records

De-normalization Avoids Complex Joins

Batch-style Emphasizes Spatial Locality
Data Layout


Column-Orientation exploits Locality

Broken into 5GB chunks for S3

~64KB Compression Runs within
Compression


Tuned to Overcome Disk Read Bound

By-Column, Run & Gap Encoding on LZO

Customized Pipelines per Column
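
One plausible reading of "run & gap encoding" layered on a general-purpose compressor is sketched below. It is not the Mozscape implementation: the function names and the fixed 8-byte integer layout are hypothetical, and zlib stands in for LZO because LZO is not in the Python standard library.

```python
import zlib
from itertools import groupby


def gap_encode(sorted_ids):
    """Replace each ID with its distance from the previous ID."""
    gaps, prev = [], 0
    for value in sorted_ids:
        gaps.append(value - prev)
        prev = value
    return gaps


def gap_decode(gaps):
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids


def run_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_decode(pairs):
    return [value for value, count in pairs for _ in range(count)]


def compress_column(sorted_ids):
    """Gap-encode, run-encode, then pass the bytes to the compressor."""
    pairs = run_encode(gap_encode(sorted_ids))
    raw = b"".join(v.to_bytes(8, "little") + c.to_bytes(8, "little")
                   for v, c in pairs)
    return zlib.compress(raw)  # stand-in for LZO


def decompress_column(blob):
    raw = zlib.decompress(blob)
    pairs = [(int.from_bytes(raw[i:i + 8], "little"),
              int.from_bytes(raw[i + 8:i + 16], "little"))
             for i in range(0, len(raw), 16)]
    return gap_decode(run_decode(pairs))


ids = list(range(1000, 2000))  # dense sorted IDs: every gap after the first is 1
blob = compress_column(ids)
assert decompress_column(blob) == ids
print(len(ids) * 8, "raw bytes ->", len(blob), "compressed bytes")
```

A per-column pipeline would pick whichever of these stages suits the column: gap encoding pays off on sorted IDs, run encoding on columns with long repeats.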
Job Control


Each Stage has Parallel, Idempotent Tasks

Tasks are Procs with easy Command Line

stdout, exit code are logged to track state
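
The job-control idea above can be sketched as a small runner: each task is a child process, its stdout and exit code are recorded, and a task that already succeeded is skipped on a rerun. This is a minimal illustration, not the actual Mozscape tooling; the task names, the in-memory `log` dictionary, and the commands are all hypothetical.

```python
import subprocess
import sys


def run_task(name, argv, log):
    """Run one task as a child process unless the log shows it already succeeded."""
    previous = log.get(name)
    if previous is not None and previous["exit_code"] == 0:
        return previous  # idempotent task already done: nothing to redo
    result = subprocess.run(argv, capture_output=True, text=True)
    log[name] = {"exit_code": result.returncode, "stdout": result.stdout}
    return log[name]


log = {}  # assumption: a real system would persist this between runs
ok_task = [sys.executable, "-c", "print('scanned 42 records')"]
bad_task = [sys.executable, "-c", "import sys; sys.exit(3)"]

run_task("scan-part-0", ok_task, log)
run_task("scan-part-1", bad_task, log)

failed = [name for name, entry in log.items() if entry["exit_code"] != 0]
print("needs retry:", failed)
```

Because tasks are idempotent, retrying everything in `failed` is safe even when a task died partway through.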
Checkpoints


Table Scan -> Barrier -> Checkpoint (S3) -> Barrier -> Table Scan

(left to right: time)
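
The scan, barrier, checkpoint sequence in the diagram above might look like the sketch below. It is an illustration under stated assumptions: threads stand in for separate task processes, a temporary local directory stands in for S3, and the stage names and JSON format are made up for the example.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_stage(task, partitions):
    """Run one task per partition in parallel; returning acts as the barrier."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(task, partitions))  # blocks until every task is done


def checkpoint(store, stage_name, results):
    """Persist a stage's output so a later failure can restart from here."""
    path = store / f"{stage_name}.json"
    path.write_text(json.dumps(results))
    return path


def load_checkpoint(store, stage_name):
    path = store / f"{stage_name}.json"
    return json.loads(path.read_text()) if path.exists() else None


store = Path(tempfile.mkdtemp())  # assumption: local directory in place of S3
partitions = [[1, 2, 3], [4, 5], [6]]

# First table scan: reuse the checkpoint if one exists, otherwise compute it.
counts = load_checkpoint(store, "scan-1")
if counts is None:
    counts = run_stage(len, partitions)
    checkpoint(store, "scan-1", counts)

# Second table scan starts only after the checkpoint is written.
totals = run_stage(lambda n: n * 10, counts)
print(totals)
```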
Indexing


Columns have BDBs indexing by ID

Subset of IDs map to Compression Runs

Decompress Run and Scan to find Record
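
The lookup path described above (sparse index, then decompress a run, then scan) can be sketched as follows. A sorted Python list searched with `bisect` stands in for the Berkeley DB index, zlib stands in for the real compressor, and the run size is shrunk to four records so the example stays readable; all of these are assumptions for illustration only.

```python
import bisect
import json
import zlib

RUN_SIZE = 4  # records per run; tiny here, the slides describe runs of ~64KB


def build_column(records):
    """Split sorted (id, value) records into compressed runs plus a sparse index."""
    first_ids, runs = [], []
    for start in range(0, len(records), RUN_SIZE):
        chunk = records[start:start + RUN_SIZE]
        first_ids.append(chunk[0][0])  # only the first ID of each run is indexed
        runs.append(zlib.compress(json.dumps(chunk).encode()))
    return first_ids, runs


def lookup(first_ids, runs, record_id):
    """Find the run that could hold record_id, decompress it, and scan it."""
    position = bisect.bisect_right(first_ids, record_id) - 1
    if position < 0:
        return None  # smaller than every indexed ID
    for rid, value in json.loads(zlib.decompress(runs[position])):
        if rid == record_id:
            return value
    return None


records = [(i, f"page-{i}") for i in range(0, 40, 3)]  # sorted by ID
first_ids, runs = build_column(records)
print(lookup(first_ids, runs, 21), lookup(first_ids, runs, 22))
```

Indexing only a subset of IDs keeps the index small; the cost is one run decompression and a short linear scan per lookup.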
Physical Deployment


Crawlers run in Colo for white-listed IPs

Batch Process and API layer in EC2

The API might be in a colo too, but ELB + Autoscaling are nice
Questions?

We’re Hiring!
