Scaling Out
Hadoop and NoSQL


    Age Mooij
An Introduction to Dealing with
Big Data
About me...




              @agemooij
Big Data
  ...and me
My Current Project...




           IP Address Registration for
           Europe, Middle East, Russia

           IPv4: 2^32 (4.3×10^9) addresses
           IPv6: 2^128 (3.4×10^38) addresses
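
As a quick check on those magnitudes, the address-space sizes follow directly from the address width in bits. This is only a worked calculation; the variable names are made up for the illustration.

```python
# Address-space size = 2 ** (address width in bits).
ipv4_addresses = 2 ** 32
ipv6_addresses = 2 ** 128

print(f"IPv4: {ipv4_addresses:.1e} addresses")  # IPv4: 4.3e+09 addresses
print(f"IPv6: {ipv6_addresses:.1e} addresses")  # IPv6: 3.4e+38 addresses
```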
Challenge

10 years of historical registration/routing data in flat files
200+ billion (!) historical data records (25 TB)

                30 billion records per year (4 TB)
                80 million per day / 1,000 per second




        Make it searchable...
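
The per-day and per-second figures can be derived from the yearly volume. The sketch below assumes a 365-day year and decimal terabytes (10^12 bytes); both are assumptions for the illustration, not stated on the slide.

```python
RECORDS_PER_YEAR = 30_000_000_000
BYTES_PER_YEAR = 4 * 10**12          # 4 TB, assuming decimal terabytes

records_per_day = RECORDS_PER_YEAR / 365
records_per_second = records_per_day / (24 * 60 * 60)
bytes_per_record = BYTES_PER_YEAR / RECORDS_PER_YEAR

print(f"{records_per_day / 1e6:.0f} million records per day")
print(f"{records_per_second:.0f} records per second")
print(f"{bytes_per_record:.0f} bytes per record on average")
```

The results land close to the rounded numbers on the slide: roughly 80 million records per day and roughly 1,000 per second.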
Big Data
  ...and you
Google, Yahoo, Amazon, eBay, Wikipedia, YouTube, Digg, Hyves

Facebook:    300M users
MySpace:     264M users
LinkedIn:    50M users
Twitter:     45M users
Flickr:      32M users
Marktplaats: 6.5M users, 5.5M ads
Scalability:

         Handling more load / requests
             Handling more data
          Handling more types of data



  ...without anything breaking or falling over
         ...and without going bankrupt
Scaling Up vs. Scaling Out
Scaling Out, Part 1

Processing Data
  a.k.a. Data Crunching
Map/Reduce

 Parallel Batch Processing of Data
     Break the data into chunks
       Distribute the chunks
    Process the chunks in parallel
         Merge the results
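
The four steps above can be sketched in a few lines. This is a toy illustration, not Hadoop: a thread pool stands in for the machines a real cluster would distribute chunks to, and the sample records and function names are invented for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Break the data into chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Process one chunk: count records per key (the first field of each line)."""
    return Counter(line.split()[0] for line in chunk)


def merge(partials):
    """Merge the per-chunk counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run(records, chunk_size=2, workers=4):
    chunks = split_into_chunks(records, chunk_size)
    # Distribute the chunks and process them in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return merge(partials)


records = [
    "193.0.0.1 seen",
    "193.0.0.2 seen",
    "193.0.0.1 seen",
    "193.0.0.3 seen",
    "193.0.0.1 seen",
]
counts = run(records)
print(counts.most_common(1))
```

Because each chunk is processed independently and merging is a simple sum, adding more workers does not change the result, only how long it takes.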
Hadoop

Reliable, Scalable, Distributed Computing

           (written in Java)
Distributed File System (DFS)

    Foundation for all Hadoop projects
        Automatic file replication
Automatic checksumming / error correction
   Based on Google’s File System (GFS)
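
To make replication and checksumming concrete, here is a deliberately simplified sketch. It is not how the Hadoop DFS is implemented: plain dictionaries stand in for storage nodes, and the replication factor of 3 and the CRC32 checksum are assumptions chosen for the illustration.

```python
import zlib

REPLICATION_FACTOR = 3  # assumed value for this sketch


def write_block(data, nodes):
    """Store the block and its checksum on the first few nodes."""
    checksum = zlib.crc32(data)
    for node in nodes[:REPLICATION_FACTOR]:
        node["block"] = data
        node["checksum"] = checksum


def read_block(nodes):
    """Return the first replica whose contents still match its checksum."""
    for node in nodes:
        if "block" in node and zlib.crc32(node["block"]) == node["checksum"]:
            return node["block"]
    raise IOError("no intact replica left")


nodes = [{}, {}, {}, {}]
write_block(b"route 193.0.0.0/21", nodes)

# Simulate silent corruption of one replica on disk.
nodes[0]["block"] = b"route 193.0.0.0/2X"

recovered = read_block(nodes)
print(recovered)
```

The corrupted replica is detected by its checksum mismatch and skipped, so the reader still gets the original bytes from another copy.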
Map / Reduce

Simple Java API
Powerful supporting framework
Powerful tools
Good support for non-Java languages
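
One route for non-Java code is Hadoop Streaming, where the mapper and reducer are ordinary programs that read lines and write tab-separated key/value lines. The sketch below shows that contract as two Python functions and runs them locally; the sample input is invented, and `sorted()` merely stands in for the sort the framework performs between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'key<TAB>1' line per non-empty input record."""
    for line in lines:
        fields = line.split()
        if fields:
            yield f"{fields[0]}\t1"


def reducer(sorted_lines):
    """Sum the values of consecutive lines that share a key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Local dry run with made-up input.
sample = ["193.0.0.1 a", "193.0.0.2 b", "193.0.0.1 c"]
result = list(reducer(sorted(mapper(sample))))
print(result)
```

In a real streaming job the two functions would live in separate scripts that read from standard input and print to standard output.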
4TB of raw image TIFF data (stored in S3)
       100 Amazon EC2 instances
          Hadoop Map/Reduce
        11 million finished PDFs
         24 hours, about $240
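
The cost figure is easier to appreciate per instance-hour. A small worked calculation from the numbers above (variable names are illustrative only):

```python
instances = 100
hours = 24
total_cost_usd = 240
finished_pdfs = 11_000_000

instance_hours = instances * hours
cost_per_instance_hour = total_cost_usd / instance_hours
pdfs_per_instance_hour = finished_pdfs / instance_hours

print(f"{instance_hours} instance-hours")
print(f"${cost_per_instance_hour:.2f} per instance-hour")
print(f"{pdfs_per_instance_hour:.0f} PDFs per instance-hour")
```

That works out to about ten cents per instance-hour, which is what makes renting 100 machines for a day cheaper than buying one large one.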
Scaling Out, Part 2

Storing & Retrieving Data
       Reads and Writes
Relational Databases
are hard to scale out
Ways to Scale out an RDBMS (1)


    Replication

     Master-Slave      Good for scaling reads
                       Single point of failure
                       Single bottleneck
     Master-Master     Limited scaling of writes
                       Complicated
Ways to Scale out an RDBMS (2)


                           Partitioning
Vertical   : by function / table
Horizontal : by key / id (Sharding)


     Not truly Relational anymore (application joins)
      Limited Scalability (relocating, resharding)
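
The resharding problem can be shown with a naive scheme that picks a shard as hash(key) modulo the number of shards. This is a sketch under that assumption; the key names, shard counts and the use of MD5 as a stable hash are all chosen for the illustration.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard using a stable hash modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


keys = [f"user-{i}" for i in range(10_000)]

# Grow the cluster from 4 to 5 shards and count keys that must be relocated.
moved = sum(1 for key in keys if shard_for(key, 4) != shard_for(key, 5))
moved_fraction = moved / len(keys)

print(f"{moved_fraction:.0%} of keys change shard")
```

With this scheme roughly four out of five keys end up on a different shard after adding a single node, which is why relocating and resharding limit how far a sharded RDBMS can grow.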
Why are RDBMSs so hard to scale out?
Brewer’s CAP Theorem

Consistency
Availability
Partition Tolerance   ...pick any two
Relational   Non-Relational

ACID    vs   BASE

Atomic       Basically Available
Consistent   Soft State
Isolated     Eventual Consistency
Durable
NoSQL             NO-SQL

 Non-Relational Databases

    Better Different
Types of NoSQL

  (Distributed) Key-Value
        Redis
        Voldemort
        Scalaris (D)

  Document Oriented
        CouchDB
        MongoDB
        Riak (D)

  Column Oriented
        Cassandra (D)
        HBase (D)

  Graph Oriented
        Neo4J

  (D) = Distributed (automatic scaling out)
RIPE NCC
Experiences so far...
Those Big Numbers Again...


10 years of historical data in flat files
200+ billion (!) historical data records (25 TB)

                  30 billion records per year (4 TB)
                  80 million per day / 1,000 per second




                       Make it searchable...
~ 200 000 000 000 records  ->  Map / Reduce  ->  ~ 15 000 000 000 records
Our Data is 3D

IP Address  1 --- 0..*  Record  1 --- 0..*  Timestamp

       Best fit & performance:
                   Column Oriented

 Row             Column Name (!)               Values (!)
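
One way to picture the three dimensions in a column-oriented layout is as nested maps. The mapping used here (row key = IP address, column name = record, values = timestamps) is an assumption read off the diagram, and the record names and timestamps are made up; this is plain Python, not the API of any particular database.

```python
from collections import defaultdict

# row key -> column name -> list of values
table = defaultdict(lambda: defaultdict(list))


def put(ip_address, record, timestamp):
    """Add one timestamp under the (row, column) pair."""
    table[ip_address][record].append(timestamp)


def history(ip_address):
    """All records for one IP address, each with its timestamps in order."""
    row = table.get(ip_address, {})
    return {record: sorted(stamps) for record, stamps in row.items()}


put("193.0.0.1", "record-A", 1262304000)
put("193.0.0.1", "record-A", 1230768000)
put("193.0.0.1", "record-B", 1246406400)

print(history("193.0.0.1"))
```

Note that the column name itself carries data (the record), so a lookup by IP address returns the whole history in a single row read.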
Cassandra

  Used by: Facebook, Twitter, Digg

  Tunable: Availability vs Consistency
  Very active community
  Version 0.4.1
  No documentation
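
The trade-off between availability and consistency is commonly explained with quorums: with N replicas, W write acknowledgements and R read acknowledgements, a read is guaranteed to touch at least one up-to-date replica when R + W > N. The sketch below checks that general rule by brute force; the function and parameter names are hypothetical and are not Cassandra's API.

```python
from itertools import combinations


def quorums_always_overlap(replicas, write_acks, read_acks):
    """True if every possible write set shares a node with every read set."""
    nodes = range(replicas)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, write_acks)
        for read_set in combinations(nodes, read_acks)
    )


print(quorums_always_overlap(3, 2, 2))  # True: 2 + 2 > 3
print(quorums_always_overlap(3, 1, 1))  # False: 1 + 1 <= 3
```

Lower values of R and W keep the system available when nodes are down, at the price of reads that may return stale data.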
HBase

  Used by: Yahoo, Adobe, Meetup, Tumblr, StumbleUpon, Streamy

  Built on top of Hadoop DFS
  Very active community
  Version 0.20.1
  Good documentation
Initial Results:
   Tested on an EC2 cluster of 8 XLarge instances

   3.8 B records (23 GB)  ->  33 M records (1 GB)      5 hours
   33 M records (1 GB)    ->  15 GB                    75 minutes, 44,000 inserts/second
                                                       Record duplication: 6x

   “Needle in a haystack” full on-disk table scan: 0.5 M records/second
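
The insert rate is consistent with the other figures if each of the 33 M records is written six times. That reading of "Record duplication: 6x" is an assumption; the check below simply shows the arithmetic.

```python
records = 33_000_000
duplication_factor = 6      # assumed to mean six inserts per record
minutes = 75

total_inserts = records * duplication_factor
inserts_per_second = total_inserts / (minutes * 60)

print(f"{total_inserts:,} inserts in {minutes} minutes")
print(f"{inserts_per_second:,.0f} inserts per second")
```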
In order to choose the right
  scaling tools, you need to:
       Understand your data
Know what you want to query and how
Big Data
   ...Be Prepared !
val shameless = <SelfPromotion>




    Try some Scala in the basement !



        </SelfPromotion>
