8. How We Do
Crawl the Web
~25-30 billion pages per month
20 Crawler machines
~256 MB/sec aggregate download rate
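A rough back-of-the-envelope check of the crawl figures above (assuming a 30-day month and the upper bound of 30 billion pages; the per-page size that falls out is an implication, not a number stated on the slide):

    # Sanity check of the crawl numbers: 30B pages/month, 256 MB/sec aggregate, 20 machines.
    pages_per_month = 30e9
    seconds_per_month = 30 * 24 * 3600            # ~2.59 million seconds
    aggregate_rate_mb = 256.0                     # MB/sec across the whole fleet
    machines = 20

    pages_per_sec = pages_per_month / seconds_per_month     # ~11,600 pages/sec
    per_machine_mb = aggregate_rate_mb / machines            # ~12.8 MB/sec per crawler
    avg_page_kb = aggregate_rate_mb * 1024 / pages_per_sec   # ~23 KB per page on average

    print(f"{pages_per_sec:,.0f} pages/sec, "
          f"{per_machine_mb:.1f} MB/sec per machine, "
          f"~{avg_page_kb:.0f} KB per page")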
11. How We Do
Compute Aggregates and Metrics
1:5 to 1:50 Compression Ratios
Aggregates are Parallelized Linear Scans
Communication Avoided where Possible
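A minimal sketch of what "parallelized linear scans with communication avoided" can look like in practice: each worker scans its own compressed shard and emits only a tiny partial aggregate, so the only data that crosses process boundaries is the final reduce. The shard layout, file names, and the out-link metric here are assumptions, not the talk's actual format.

    import glob
    import gzip
    from multiprocessing import Pool

    def scan_shard(path):
        """Linear scan over one compressed shard; returns a small partial aggregate."""
        links = pages = 0
        with gzip.open(path, "rt") as f:
            for line in f:                           # one record per line (assumed layout)
                pages += 1
                links += int(line.split("\t")[1])    # hypothetical out-link count column
        return links, pages                          # only these two numbers are communicated

    if __name__ == "__main__":
        shards = glob.glob("shards/*.gz")            # hypothetical shard naming
        with Pool() as pool:
            partials = pool.map(scan_shard, shards)
        total_links = sum(l for l, _ in partials)
        total_pages = sum(p for _, p in partials)
        print("avg out-links per page:", total_links / max(total_pages, 1))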
14. How We Do
Surface with a Read-Only API
~12 TB per Release in Amazon S3
6 m2.4xlarge Instances for Cache
~28k Requests per Minute
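One way to picture the serving path described above: a release sits as immutable files in S3 and cache nodes answer reads, going to S3 only on a miss. Because releases never change, caching needs no invalidation. The bucket name, key scheme, and in-process LRU cache below are placeholders for illustration; the real layout is not described on the slide.

    from functools import lru_cache
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-linkdata-release"      # hypothetical bucket name
    RELEASE = "release-2011-06"              # releases are immutable, so caching is safe

    @lru_cache(maxsize=100_000)              # in-process stand-in for the cache tier
    def fetch_block(column, block_id):
        """Fetch one block of a column file from the current release."""
        key = f"{RELEASE}/{column}/block-{block_id:08d}"   # hypothetical key scheme
        resp = s3.get_object(Bucket=BUCKET, Key=key)
        return resp["Body"].read()

    def handle_request(column, block_id):
        # Read-only: the API never writes, it only locates and returns a block.
        return fetch_block(column, block_id)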
15. Observations and Strategy
Billions of Small, Similar Records
De-normalization Avoids Complex Joins
Batch-style Emphasizes Spatial Locality
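A toy illustration of the de-normalized, batch-friendly layout: instead of joining a pages table against a metrics table at query time, every field a record needs is packed into one fixed-width row, and rows are written sorted by ID so a linear scan reads the file front to back. The field names and widths are invented for the sketch.

    import struct

    # Hypothetical de-normalized record: everything the API needs in one row,
    # so serving never performs a join.
    RECORD = struct.Struct("<QIIff")   # url_id, inlinks, outlinks, page_metric, domain_metric

    def write_release(records, path):
        # Rows go out sorted by url_id; a batch scan then touches memory and
        # disk sequentially, which is what "spatial locality" buys.
        with open(path, "wb") as f:
            for rec in sorted(records):
                f.write(RECORD.pack(*rec))

    def scan_release(path):
        with open(path, "rb") as f:
            while chunk := f.read(RECORD.size):
                yield RECORD.unpack(chunk)

    write_release([(42, 10, 3, 5.5, 7.1), (7, 2, 1, 1.0, 3.2)], "pages.bin")
    print(list(scan_release("pages.bin")))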
20. Indexing
Columns have BDBs indexing by ID
Subset of IDs map to Compression Runs
Decompress Run and Scan to find Record
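The lookup path on this slide is essentially a sparse index: only the first ID of each compression run is indexed, so a point read finds the right run, decompresses it, and scans for the exact record. The sketch below stands in a sorted list plus bisect for the Berkeley DB and zlib for the run compression; both substitutions are assumptions made for illustration.

    import bisect
    import zlib

    def build_runs(records, run_size=4):
        """records: sorted (id, value) pairs -> list of (first_id_in_run, compressed_bytes)."""
        runs = []
        for i in range(0, len(records), run_size):
            chunk = records[i:i + run_size]
            payload = "\n".join(f"{rid}\t{val}" for rid, val in chunk).encode()
            runs.append((chunk[0][0], zlib.compress(payload)))
        return runs

    def lookup(runs, target_id):
        run_starts = [start for start, _ in runs]
        # Find the run whose first ID is the largest one <= target_id (the index's job).
        idx = bisect.bisect_right(run_starts, target_id) - 1
        if idx < 0:
            return None
        # Decompress that one run and linearly scan it for the record.
        for line in zlib.decompress(runs[idx][1]).decode().splitlines():
            rid, val = line.split("\t")
            if int(rid) == target_id:
                return val
        return None

    runs = build_runs([(i, f"record-{i}") for i in range(0, 100, 3)])
    print(lookup(runs, 42))   # -> "record-42"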
21. Physical Deployment
Crawlers run in Colo for white-listed IPs
Batch Process and API layer in EC2
The API might be in a colo too, but ELB + Autoscaling are nice
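As an illustration of why EC2 is convenient for the API tier, here is a minimal sketch of attaching an Auto Scaling group to a classic ELB with boto3. The group name, launch configuration, ELB name, sizes, and availability zones are all placeholders, and the talk itself predates boto3; this only shows the shape of the setup.

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Hypothetical setup: the API tier scales between 6 and 12 instances behind an
    # existing classic ELB named "api-elb"; "api-launch-config" must already exist.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="api-asg",
        LaunchConfigurationName="api-launch-config",
        MinSize=6,
        MaxSize=12,
        DesiredCapacity=6,
        LoadBalancerNames=["api-elb"],
        AvailabilityZones=["us-east-1a", "us-east-1b"],
    )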