SlideShare une entreprise Scribd logo
1  sur  26
Whither Hadoop?

  (but not wither)
Questions I Have Been Asked
• I have a system that currently has 10 billion
  files
• Files range from 10K to 100MB and average
  about 1MB
• Access is from a legacy parallel app currently
  running on thousands of machines?
• I want to expand to 200 billion files of about
  the same size range
World-wide Grid
• I have 100,000 machines spread across 10
  data centers around the world
• I have a current grid architecture local to each
  data center (it works OK)
• But I want to load-level by mirroring data from
  data center to data center
• And I want to run both legacy and Hadoop
  programs
Little Records in Real Time
• I have a high rate stream of incoming small
  records
• I need to have several years of data on-line
  – about 10-30 PB
• But I have to have aggregates and alarms
  based in real-time
  – maximum response should be less than 10s end to
    end
  – typical response should be much faster (200ms)
Model Deployment
• I am building recommendation models off-line
  and would like to scale that using Hadoop
• But these models need to be deployed
  transactionally to machines around the world
• And I need to keep reference copies of every
  model and the exact data it was trained on
• But I can’t afford to keep many copies of my
  entire training data
Video Repository
• I have video output from thousands of sources
  indexed by source and time
• Each source wanders around a lot, I know where
• Each source has about 100,000 video
  snippets, each of which is 10-100MB
• I need to run map-reduce-like programs on all
  video for particular locations
• 10,000 x 100,000 x 100MB = 10^17 B = 100PB
• 10,000 x 100,000 = 10^9 files
And then they say
• Oh, yeah… we need to have backups, too

• Say, every 10 minutes for the last day or so

• And every hour for the last month

• You know… like Time Machine on my Mac
So what’s a poor guy to do?
Scenario 1
• 10 billion files x 1MB average
  – 100 federated name nodes?
• Legacy code access
• Expand to 200 billion files
  – 2,000 name nodes?! Let’s not talk about HA


• Or 400 node MapR – no special adaptation
World Wide Grid
•   10 x 10,000 machines + legacy code
•   10 x 10,000 or 10 x 10 x 1000 node cluster
•   NFS for legacy apps
•   Scheduled mirrors move data at end of shift
Little Records in Real Time
• Real-time + 10-30 PB
• Storm with pluggable services
• Or Ultra-messaging on the commercial side

• Key requirement is that real-time processors
  need distributed, mutable state storage
Model Deployment
• Snapshots
  – of training data at start of training
  – of models at the end of training
• Mirrors and data placement allow precise
  deployment
• Redundant snapshots require no space
Video Repository
• Media repositories typically have low average
  bandwidth
• This allows very high density machines
  – 36 x 3 TB = 100TB per 4U = 25TB net per 4U
• 100PB = 4,000 nodes = 400 racks
• Can be organized as one cluster or several
  pods
And Backups?
• Snapshots can be scheduled at high frequency
• Expiration allows complex retention schedules



• You know… like Time Machine on my Mac

• (and off-site backups work as well)
How does this work?
MapR Areas of Development

                   HBase    Map
                           Reduce
    Ecosystem


        Storage            Management
        Services
MapR Improvements
• Faster file system
  – Fewer copies
  – Multiple NICS
  – No file descriptor or page-buf competition
• Faster map-reduce
  – Uses distributed file system
  – Direct RPC to receiver
  – Very wide merges
MapR Innovations
• Volumes
  – Distributed management
  – Data placement
• Read/write random access file system
  – Allows distributed meta-data
  – Improved scaling
  – Enables NFS access
• Application-level NIC bonding
• Transactionally correct snapshots and mirrors
MapR's Containers
              Files/directories are sharded into blocks, which
              are placed into mini NNs (containers ) on disks
                                            Each container contains
                                               Directories & files
                                                Data blocks
                                            Replicated on servers
Containers are 16-
                                            No need to manage
32 GB segments of
                                             directly
disk, placed on
nodes
MapR's Containers




           Each container has a
            replication chain
           Updates are transactional
           Failures are handled by
            rearranging replication
Container locations and replication

           N1, N2              N1
           N3, N2
           N1, N2
           N1, N3              N2

           N3, N2

    CLDB
                               N3
 Container location database
 (CLDB) keeps track of nodes
 hosting each container and
 replication chain order
MapR Scaling
Containers represent 16 - 32GB of data
      Each can hold up to 1 Billion files and directories
      100M containers = ~ 2 Exabytes (a very large cluster)
250 bytes DRAM to cache a container
      25GB to cache all containers for 2EB cluster
          But not necessary, can page to disk
      Typical large 10PB cluster needs 2GB
Container-reports are 100x - 1000x < HDFS block-reports
      Serve 100x more data-nodes
      Increase container size to 64G to serve 4EB cluster
          Map/reduce not affected
Export to the world


               NFS
                 NFS
              Server
                  NFS
               Server
                    NFS
                 Server
 NFS              Server
Client
Local server

  Application

          NFS
         Server
Client




                  Cluster
                  Nodes
Universal export to self
                     Cluster Nodes




         Task

             NFS
    Cluster Server
    Node
Nodes are identical
     Task
                             Task
         NFS
                                 NFS
Cluster Server
Node                    Cluster Server
                        Node



             Task

                NFS
       Cluster Server
       Node

Contenu connexe

Tendances

Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Databricks
 
Ceph Month 2021: RADOS Update
Ceph Month 2021: RADOS UpdateCeph Month 2021: RADOS Update
Ceph Month 2021: RADOS UpdateCeph Community
 
Evaluation of RBD replication options @CERN
Evaluation of RBD replication options @CERNEvaluation of RBD replication options @CERN
Evaluation of RBD replication options @CERNCeph Community
 
Flash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with CoscoFlash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with CoscoDatabricks
 
Hands on MapR -- Viadea
Hands on MapR -- ViadeaHands on MapR -- Viadea
Hands on MapR -- Viadeaviadea
 
OpenStack and Ceph case study at the University of Alabama
OpenStack and Ceph case study at the University of AlabamaOpenStack and Ceph case study at the University of Alabama
OpenStack and Ceph case study at the University of AlabamaKamesh Pemmaraju
 
Performance tuning in BlueStore & RocksDB - Li Xiaoyan
Performance tuning in BlueStore & RocksDB - Li XiaoyanPerformance tuning in BlueStore & RocksDB - Li Xiaoyan
Performance tuning in BlueStore & RocksDB - Li XiaoyanCeph Community
 
RBD: What will the future bring? - Jason Dillaman
RBD: What will the future bring? - Jason DillamanRBD: What will the future bring? - Jason Dillaman
RBD: What will the future bring? - Jason DillamanCeph Community
 
Performance optimization for all flash based on aarch64 v2.0
Performance optimization for all flash based on aarch64 v2.0Performance optimization for all flash based on aarch64 v2.0
Performance optimization for all flash based on aarch64 v2.0Ceph Community
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBScyllaDB
 
An introduction and evaluations of a wide area distributed storage system
An introduction and evaluations of  a wide area distributed storage systemAn introduction and evaluations of  a wide area distributed storage system
An introduction and evaluations of a wide area distributed storage systemHiroki Kashiwazaki
 
Tobias Oetiker: RRDtool - how to make it sit up and beg
Tobias Oetiker: RRDtool - how to make it sit up and begTobias Oetiker: RRDtool - how to make it sit up and beg
Tobias Oetiker: RRDtool - how to make it sit up and begFriprogsenteret
 
Designing for High Performance Ceph at Scale
Designing for High Performance Ceph at ScaleDesigning for High Performance Ceph at Scale
Designing for High Performance Ceph at ScaleJames Saint-Rossy
 
Case Study - DR on Demand
Case Study - DR on DemandCase Study - DR on Demand
Case Study - DR on DemandCTRLS
 
GlusterFS CTDB Integration
GlusterFS CTDB IntegrationGlusterFS CTDB Integration
GlusterFS CTDB IntegrationEtsuji Nakai
 
Celi @Codemotion 2014 - Roberto Franchini GlusterFS
Celi @Codemotion 2014 - Roberto Franchini GlusterFSCeli @Codemotion 2014 - Roberto Franchini GlusterFS
Celi @Codemotion 2014 - Roberto Franchini GlusterFSCELI
 
2021.02 new in Ceph Pacific Dashboard
2021.02 new in Ceph Pacific Dashboard2021.02 new in Ceph Pacific Dashboard
2021.02 new in Ceph Pacific DashboardCeph Community
 

Tendances (20)

Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
Ceph Month 2021: RADOS Update
Ceph Month 2021: RADOS UpdateCeph Month 2021: RADOS Update
Ceph Month 2021: RADOS Update
 
Evaluation of RBD replication options @CERN
Evaluation of RBD replication options @CERNEvaluation of RBD replication options @CERN
Evaluation of RBD replication options @CERN
 
Flash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with CoscoFlash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with Cosco
 
Ceph on arm64 upload
Ceph on arm64   uploadCeph on arm64   upload
Ceph on arm64 upload
 
Hands on MapR -- Viadea
Hands on MapR -- ViadeaHands on MapR -- Viadea
Hands on MapR -- Viadea
 
OpenStack and Ceph case study at the University of Alabama
OpenStack and Ceph case study at the University of AlabamaOpenStack and Ceph case study at the University of Alabama
OpenStack and Ceph case study at the University of Alabama
 
Performance tuning in BlueStore & RocksDB - Li Xiaoyan
Performance tuning in BlueStore & RocksDB - Li XiaoyanPerformance tuning in BlueStore & RocksDB - Li Xiaoyan
Performance tuning in BlueStore & RocksDB - Li Xiaoyan
 
Shignled disk
Shignled diskShignled disk
Shignled disk
 
RBD: What will the future bring? - Jason Dillaman
RBD: What will the future bring? - Jason DillamanRBD: What will the future bring? - Jason Dillaman
RBD: What will the future bring? - Jason Dillaman
 
Performance optimization for all flash based on aarch64 v2.0
Performance optimization for all flash based on aarch64 v2.0Performance optimization for all flash based on aarch64 v2.0
Performance optimization for all flash based on aarch64 v2.0
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDB
 
Ceph on Windows
Ceph on WindowsCeph on Windows
Ceph on Windows
 
An introduction and evaluations of a wide area distributed storage system
An introduction and evaluations of  a wide area distributed storage systemAn introduction and evaluations of  a wide area distributed storage system
An introduction and evaluations of a wide area distributed storage system
 
Tobias Oetiker: RRDtool - how to make it sit up and beg
Tobias Oetiker: RRDtool - how to make it sit up and begTobias Oetiker: RRDtool - how to make it sit up and beg
Tobias Oetiker: RRDtool - how to make it sit up and beg
 
Designing for High Performance Ceph at Scale
Designing for High Performance Ceph at ScaleDesigning for High Performance Ceph at Scale
Designing for High Performance Ceph at Scale
 
Case Study - DR on Demand
Case Study - DR on DemandCase Study - DR on Demand
Case Study - DR on Demand
 
GlusterFS CTDB Integration
GlusterFS CTDB IntegrationGlusterFS CTDB Integration
GlusterFS CTDB Integration
 
Celi @Codemotion 2014 - Roberto Franchini GlusterFS
Celi @Codemotion 2014 - Roberto Franchini GlusterFSCeli @Codemotion 2014 - Roberto Franchini GlusterFS
Celi @Codemotion 2014 - Roberto Franchini GlusterFS
 
2021.02 new in Ceph Pacific Dashboard
2021.02 new in Ceph Pacific Dashboard2021.02 new in Ceph Pacific Dashboard
2021.02 new in Ceph Pacific Dashboard
 

En vedette

Difference Between Crawling, Indexing and Caching
Difference Between Crawling, Indexing and Caching Difference Between Crawling, Indexing and Caching
Difference Between Crawling, Indexing and Caching Laxman Kotte
 
Blogging With Word Press -Social Media Bootcamp
Blogging With Word Press -Social Media BootcampBlogging With Word Press -Social Media Bootcamp
Blogging With Word Press -Social Media Bootcampwesleyzhao
 
Search Engine Optimization - Social Media Bootcamp
Search Engine Optimization - Social Media BootcampSearch Engine Optimization - Social Media Bootcamp
Search Engine Optimization - Social Media Bootcampwesleyzhao
 
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)Carlos Castillo (ChaTo)
 
Crawling and Indexing
Crawling and IndexingCrawling and Indexing
Crawling and IndexingHimani Tyagi
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneSwapnil & Patil
 
Crawling, indexing, ranking: Make the search engine crawlers and algorithms y...
Crawling, indexing, ranking: Make the search engine crawlers and algorithms y...Crawling, indexing, ranking: Make the search engine crawlers and algorithms y...
Crawling, indexing, ranking: Make the search engine crawlers and algorithms y...SEO monitor
 
The search engine index
The search engine indexThe search engine index
The search engine indexCJ Jenkins
 
Mis vacaciones
Mis vacacionesMis vacaciones
Mis vacacionesVEGETAL777
 

En vedette (9)

Difference Between Crawling, Indexing and Caching
Difference Between Crawling, Indexing and Caching Difference Between Crawling, Indexing and Caching
Difference Between Crawling, Indexing and Caching
 
Blogging With Word Press -Social Media Bootcamp
Blogging With Word Press -Social Media BootcampBlogging With Word Press -Social Media Bootcamp
Blogging With Word Press -Social Media Bootcamp
 
Search Engine Optimization - Social Media Bootcamp
Search Engine Optimization - Social Media BootcampSearch Engine Optimization - Social Media Bootcamp
Search Engine Optimization - Social Media Bootcamp
 
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
 
Crawling and Indexing
Crawling and IndexingCrawling and Indexing
Crawling and Indexing
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
 
Crawling, indexing, ranking: Make the search engine crawlers and algorithms y...
Crawling, indexing, ranking: Make the search engine crawlers and algorithms y...Crawling, indexing, ranking: Make the search engine crawlers and algorithms y...
Crawling, indexing, ranking: Make the search engine crawlers and algorithms y...
 
The search engine index
The search engine indexThe search engine index
The search engine index
 
Mis vacaciones
Mis vacacionesMis vacaciones
Mis vacaciones
 

Similaire à Ted Dunning - Whither Hadoop

Seattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRSeattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRclive boulton
 
OpenNebulaConf2017EU: Hyper converged infrastructure with OpenNebula and Ceph...
OpenNebulaConf2017EU: Hyper converged infrastructure with OpenNebula and Ceph...OpenNebulaConf2017EU: Hyper converged infrastructure with OpenNebula and Ceph...
OpenNebulaConf2017EU: Hyper converged infrastructure with OpenNebula and Ceph...OpenNebula Project
 
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsJason Shao
 
Scalable Storage for Massive Volume Data Systems
Scalable Storage for Massive Volume Data SystemsScalable Storage for Massive Volume Data Systems
Scalable Storage for Massive Volume Data SystemsLars Nielsen
 
Designs, Lessons and Advice from Building Large Distributed Systems
Designs, Lessons and Advice from Building Large Distributed SystemsDesigns, Lessons and Advice from Building Large Distributed Systems
Designs, Lessons and Advice from Building Large Distributed SystemsDaehyeok Kim
 
Intro to big data choco devday - 23-01-2014
Intro to big data   choco devday - 23-01-2014Intro to big data   choco devday - 23-01-2014
Intro to big data choco devday - 23-01-2014Hassan Islamov
 
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...In-Memory Computing Summit
 
MapR, Implications for Integration
MapR, Implications for IntegrationMapR, Implications for Integration
MapR, Implications for Integrationtrihug
 
Architectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop DistributionArchitectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop Distributionmcsrivas
 
VDI storage and storage virtualization
VDI storage and storage virtualizationVDI storage and storage virtualization
VDI storage and storage virtualizationSisimon Soman
 
Storage virtualization citrix blr wide tech talk
Storage virtualization citrix blr wide tech talkStorage virtualization citrix blr wide tech talk
Storage virtualization citrix blr wide tech talkSisimon Soman
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Databricks
 
MySQL NDB Cluster 8.0 SQL faster than NoSQL
MySQL NDB Cluster 8.0 SQL faster than NoSQL MySQL NDB Cluster 8.0 SQL faster than NoSQL
MySQL NDB Cluster 8.0 SQL faster than NoSQL Bernd Ocklin
 
High performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHigh performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHungWei Chiu
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.pptvijayapraba1
 
Memory, Big Data, NoSQL and Virtualization
Memory, Big Data, NoSQL and VirtualizationMemory, Big Data, NoSQL and Virtualization
Memory, Big Data, NoSQL and VirtualizationBigstep
 
Shift into High Gear: Dramatically Improve Hadoop & NoSQL Performance
Shift into High Gear: Dramatically Improve Hadoop & NoSQL PerformanceShift into High Gear: Dramatically Improve Hadoop & NoSQL Performance
Shift into High Gear: Dramatically Improve Hadoop & NoSQL PerformanceMapR Technologies
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewKonstantin V. Shvachko
 

Similaire à Ted Dunning - Whither Hadoop (20)

Seattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRSeattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapR
 
Cmu 2011 09.pptx
Cmu 2011 09.pptxCmu 2011 09.pptx
Cmu 2011 09.pptx
 
OpenNebulaConf2017EU: Hyper converged infrastructure with OpenNebula and Ceph...
OpenNebulaConf2017EU: Hyper converged infrastructure with OpenNebula and Ceph...OpenNebulaConf2017EU: Hyper converged infrastructure with OpenNebula and Ceph...
OpenNebulaConf2017EU: Hyper converged infrastructure with OpenNebula and Ceph...
 
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
 
Scalable Storage for Massive Volume Data Systems
Scalable Storage for Massive Volume Data SystemsScalable Storage for Massive Volume Data Systems
Scalable Storage for Massive Volume Data Systems
 
Designs, Lessons and Advice from Building Large Distributed Systems
Designs, Lessons and Advice from Building Large Distributed SystemsDesigns, Lessons and Advice from Building Large Distributed Systems
Designs, Lessons and Advice from Building Large Distributed Systems
 
Intro to big data choco devday - 23-01-2014
Intro to big data   choco devday - 23-01-2014Intro to big data   choco devday - 23-01-2014
Intro to big data choco devday - 23-01-2014
 
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
 
MapR, Implications for Integration
MapR, Implications for IntegrationMapR, Implications for Integration
MapR, Implications for Integration
 
Architectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop DistributionArchitectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop Distribution
 
VDI storage and storage virtualization
VDI storage and storage virtualizationVDI storage and storage virtualization
VDI storage and storage virtualization
 
Storage virtualization citrix blr wide tech talk
Storage virtualization citrix blr wide tech talkStorage virtualization citrix blr wide tech talk
Storage virtualization citrix blr wide tech talk
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
MySQL NDB Cluster 8.0 SQL faster than NoSQL
MySQL NDB Cluster 8.0 SQL faster than NoSQL MySQL NDB Cluster 8.0 SQL faster than NoSQL
MySQL NDB Cluster 8.0 SQL faster than NoSQL
 
Llnl talk
Llnl talkLlnl talk
Llnl talk
 
High performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHigh performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User Group
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 
Memory, Big Data, NoSQL and Virtualization
Memory, Big Data, NoSQL and VirtualizationMemory, Big Data, NoSQL and Virtualization
Memory, Big Data, NoSQL and Virtualization
 
Shift into High Gear: Dramatically Improve Hadoop & NoSQL Performance
Shift into High Gear: Dramatically Improve Hadoop & NoSQL PerformanceShift into High Gear: Dramatically Improve Hadoop & NoSQL Performance
Shift into High Gear: Dramatically Improve Hadoop & NoSQL Performance
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 

Dernier

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 

Dernier (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Ted Dunning - Whither Hadoop

  • 1. Whither Hadoop? (but not wither)
  • 2. Questions I Have Been Asked • I have a system that currently has 10 billion files • Files range from 10K to 100MB and average about 1MB • Access is from a legacy parallel app currently running on thousands of machines? • I want to expand to 200 billion files of about the same size range
  • 3. World-wide Grid • I have 100,000 machines spread across 10 data centers around the world • I have a current grid architecture local to each data center (it works OK) • But I want to load-level by mirroring data from data center to data center • And I want to run both legacy and Hadoop programs
  • 4. Little Records in Real Time • I have a high rate stream of incoming small records • I need to have several years of data on-line – about 10-30 PB • But I have to have aggregates and alarms based in real-time – maximum response should be less than 10s end to end – typical response should be much faster (200ms)
  • 5. Model Deployment • I am building recommendation models off-line and would like to scale that using Hadoop • But these models need to be deployed transactionally to machines around the world • And I need to keep reference copies of every model and the exact data it was trained on • But I can’t afford to keep many copies of my entire training data
  • 6. Video Repository • I have video output from thousands of sources indexed by source and time • Each source wanders around a lot, I know where • Each source has about 100,000 video snippets, each of which is 10-100MB • I need to run map-reduce-like programs on all video for particular locations • 10,000 x 100,000 x 100MB = 10^17 B = 100PB • 10,000 x 100,000 = 10^9 files
  • 7. And then they say • Oh, yeah… we need to have backups, too • Say, every 10 minutes for the last day or so • And every hour for the last month • You know… like Time Machine on my Mac
  • 8. So what’s a poor guy to do?
  • 9. Scenario 1 • 10 billion files x 1MB average – 100 federated name nodes? • Legacy code access • Expand to 200 billion files – 2,000 name nodes?! Let’s not talk about HA • Or 400 node MapR – no special adaptation
  • 10. World Wide Grid • 10 x 10,000 machines + legacy code • 10 x 10,000 or 10 x 10 x 1000 node cluster • NFS for legacy apps • Scheduled mirrors move data at end of shift
  • 11. Little Records in Real Time • Real-time + 10-30 PB • Storm with pluggable services • Or Ultra-messaging on the commercial side • Key requirement is that real-time processors need distributed, mutable state storage
  • 12. Model Deployment • Snapshots – of training data at start of training – of models at the end of training • Mirrors and data placement allow precise deployment • Redundant snapshots require no space
  • 13. Video Repository • Media repositories typically have low average bandwidth • This allows very high density machines – 36 x 3 TB = 100TB per 4U = 25TB net per 4U • 100PB = 4,000 nodes = 400 racks • Can be organized as one cluster or several pods
  • 14. And Backups? • Snapshots can be scheduled at high frequency • Expiration allows complex retention schedules • You know… like Time Machine on my Mac • (and off-site backups work as well)
  • 15. How does this work?
  • 16. MapR Areas of Development HBase Map Reduce Ecosystem Storage Management Services
  • 17. MapR Improvements • Faster file system – Fewer copies – Multiple NICS – No file descriptor or page-buf competition • Faster map-reduce – Uses distributed file system – Direct RPC to receiver – Very wide merges
  • 18. MapR Innovations • Volumes – Distributed management – Data placement • Read/write random access file system – Allows distributed meta-data – Improved scaling – Enables NFS access • Application-level NIC bonding • Transactionally correct snapshots and mirrors
  • 19. MapR's Containers Files/directories are sharded into blocks, which are placed into mini NNs (containers ) on disks  Each container contains  Directories & files  Data blocks  Replicated on servers Containers are 16-  No need to manage 32 GB segments of directly disk, placed on nodes
  • 20. MapR's Containers  Each container has a replication chain  Updates are transactional  Failures are handled by rearranging replication
  • 21. Container locations and replication N1, N2 N1 N3, N2 N1, N2 N1, N3 N2 N3, N2 CLDB N3 Container location database (CLDB) keeps track of nodes hosting each container and replication chain order
  • 22. MapR Scaling Containers represent 16 - 32GB of data  Each can hold up to 1 Billion files and directories  100M containers = ~ 2 Exabytes (a very large cluster) 250 bytes DRAM to cache a container  25GB to cache all containers for 2EB cluster But not necessary, can page to disk  Typical large 10PB cluster needs 2GB Container-reports are 100x - 1000x < HDFS block-reports  Serve 100x more data-nodes  Increase container size to 64G to serve 4EB cluster  Map/reduce not affected
  • 23. Export to the world NFS NFS Server NFS Server NFS Server NFS Server Client
  • 24. Local server Application NFS Server Client Cluster Nodes
  • 25. Universal export to self Cluster Nodes Task NFS Cluster Server Node
  • 26. Nodes are identical Task Task NFS NFS Cluster Server Node Cluster Server Node Task NFS Cluster Server Node

Notes de l'éditeur

  1. ----- Meeting Notes (3/13/12 19:23) -----