Introduction: MapR and Hadoop
  7/6/2012

© 2012 MapR Technologies   Introduction 1
Introduction
   Agenda
   • Hadoop Overview
   • MapReduce Overview
   • Hadoop Ecosystem
   • How is MapR Different?
   • Summary




Introduction
   Objectives
   At the end of this module you will be able to:
   • Explain why Hadoop is an important technology for effectively working with
     Big Data
   • Describe the phases of a MapReduce job
   • Identify some of the tools used with Hadoop
   • List the similarities and differences between MapR and other Hadoop
     distributions




Hadoop Overview




Data is Growing Faster than Moore’s Law

        Business Analytics Requires a New Approach

   • Data volume growing 44x
   • 2010: 1.2 zettabytes
   • 2020: 35.2 zettabytes
     (IDC Digital Universe Study 2011)

Source: IDC Digital Universe Study, sponsored by EMC, May 2010
Before Hadoop
  Web crawling to power search engines
  •    Must be able to handle gigantic data
  •    Must be fast!
  Problem: databases (B-Tree) not so fast, and do not scale
  Solution: Sort and Merge
  •    Eliminate the pesky seek time!
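
The sort-and-merge idea can be sketched as follows. This is a minimal illustration, not the method any particular search engine used: it assumes both the stored records and the pending updates are already sorted by a unique key, and the function name `merge_updates` is hypothetical. Each stream is read once, front to back, so no per-record seek is needed.

```python
def merge_updates(records, updates):
    """Apply sorted updates to sorted records in one sequential pass.

    Both inputs are iterables of (key, value) pairs sorted by key.
    Assumption for this sketch: keys are unique within each input.
    """
    updates = iter(updates)
    pending = next(updates, None)
    for key, value in records:
        # Drop updates whose key does not exist in the record stream.
        while pending is not None and pending[0] < key:
            pending = next(updates, None)
        if pending is not None and pending[0] == key:
            value = pending[1]
            pending = next(updates, None)
        yield key, value


records = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
updates = [(2, "B"), (4, "D")]
print(list(merge_updates(records, updates)))
```

Because both inputs are consumed in key order, the same loop works on streams far larger than memory.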




How to Scale?
  Big Data has Big Problems
  •   Petabytes of data
  •   MTBF on 1000s of nodes is < 1 day
  •   Something is always broken
  •   There are limits to scaling Big Iron
  •   Sequential and random access just don’t scale
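
A rough calculation shows why the cluster-wide MTBF falls below a day. This sketch assumes nodes fail independently at a constant rate, in which case the failure rates add and the cluster MTBF is the per-node MTBF divided by the node count. The per-node figure of 1000 days is a hypothetical value chosen for illustration, not a number from the slides.

```python
def cluster_mtbf_days(node_mtbf_days: float, nodes: int) -> float:
    """Expected time between failures anywhere in the cluster.

    Assumes independent nodes with a constant failure rate, so the
    cluster MTBF is the node MTBF divided by the number of nodes.
    """
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return node_mtbf_days / nodes


# Hypothetical per-node MTBF of 1000 days (about 2.7 years).
for nodes in (1, 100, 1000, 5000):
    days = cluster_mtbf_days(1000, nodes)
    print(f"{nodes:>5} nodes: one failure every {days:g} days")
```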




Example: Update 1% of 1TB

  •   Data consists of 10 billion records, each 100 bytes
  •   Task: Update 1% of these records




Approach 1: Just Do It

  •   Each update involves read, modify and write
      –   t = 1 seek + 2 disk rotations = 20 ms
      –   1% x 10^10 records x 20 ms = 2 megaseconds, or about 23 days (roughly 556 hours)
  •   Total time dominated by seek and rotation times
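
The arithmetic for this approach can be checked with a few lines of Python. The constants are the figures from the slides; the function name is only illustrative.

```python
RECORDS = 10**10             # 10 billion records of 100 bytes each (1 TB)
UPDATE_PERCENT = 1           # update 1% of the records
SECONDS_PER_UPDATE = 0.020   # 1 seek + 2 disk rotations = 20 ms


def random_update_seconds() -> float:
    """Total time when every update pays its own seek and rotations."""
    updates = RECORDS * UPDATE_PERCENT // 100
    return updates * SECONDS_PER_UPDATE


total = random_update_seconds()
print(f"{total:,.0f} s = {total / 86400:.1f} days = {total / 3600:.0f} hours")
```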




Approach 2: The “Hard” Way

  •   Copy the entire database 1 GB at a time
  •   Update records sequentially
      –   t = 2 x 1 GB / 100 MB/s + 20 ms = 20 s
      –   10^3 x 20 s = 20,000 s = 5.6 hours
  •   100x faster to move 100x more data!
  •   Moral: Read data sequentially even if you only want 1% of it
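
The same comparison as a short calculation, using the slide's figures (1 TB database, 1 GB chunks, 100 MB/s sequential throughput, 20 ms per seek). The random-access baseline of 10^8 updates at 20 ms each is restated so the block runs on its own; the names are illustrative only.

```python
DB_BYTES = 10**12           # 1 TB database
CHUNK_BYTES = 10**9         # copy 1 GB at a time
THROUGHPUT = 100 * 10**6    # 100 MB/s sequential transfer
SEEK_SECONDS = 0.020        # one 20 ms seek per chunk


def sequential_rewrite_seconds() -> float:
    """Read and rewrite the whole database chunk by chunk."""
    chunks = DB_BYTES // CHUNK_BYTES                          # 10^3 chunks
    per_chunk = 2 * CHUNK_BYTES / THROUGHPUT + SEEK_SECONDS   # read + write
    return chunks * per_chunk


def random_update_seconds() -> float:
    """Baseline: 1% of 10^10 records, 20 ms per update."""
    return (10**10 // 100) * SEEK_SECONDS


seq = sequential_rewrite_seconds()
rnd = random_update_seconds()
print(f"sequential: {seq / 3600:.1f} hours, random: {rnd / 86400:.1f} days")
print(f"speedup: about {rnd / seq:.0f}x")
```

The 20 ms seek per chunk is negligible next to the 20 s transfer, which is why the slide rounds the per-chunk time to 20 s.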




Introducing Hadoop!
  •   Now imagine you have thousands of disks on hundreds of
      machines with near linear scaling
      –   Commodity hardware – thousands of nodes!
      –   Handles Big Data – Petabytes and more!
      –   Sequential file access – all spindles at once!
      –   Sharding – data distributed evenly across cluster
      –   Reliability – self-healing, self-balancing
      –   Redundancy – data replicated across multiple hosts and disks
      –   MapReduce
          • Parallel computing framework
          • Moves the computation to the data




Hadoop Architecture
   • MapReduce: Parallel computing
           –   Move the computation to the data
           –   Minimizes network utilization

   • Distributed storage layer: Keeping track of data and metadata
           –   Data is sharded across the cluster

   • Cluster management tools
   • Applications and tools




What’s Driving Hadoop Adoption?


        “Simple algorithms and lots of data
            trump complex models ”



                                             Halevy, Norvig, and Pereira, Google
                                                         IEEE Intelligent Systems

MapReduce Overview




MapReduce
   •     A programming model for processing very large data sets
       ― A framework for processing parallel problems across huge datasets using
         a large number of nodes
       ― Brute force parallel computing paradigm

   •     Phases
       ― Map
            •    Job partitioned into “splits”

       ― Shuffle and sort
            •    Map output sent to reducer(s) using a hash

       ― Reduce
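
Routing map output "using a hash" can be sketched as below. Hadoop's default partitioner takes the key's hash code modulo the number of reduce tasks; this Python stand-in uses `zlib.crc32` instead of Java's `hashCode()`, so the bucket numbers will differ from a real cluster. The function name `partition` is an assumption made for this example.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Pick the reducer for a key: stable hash modulo reducer count.

    zlib.crc32 is a stand-in for the framework's own hash function;
    it is used here because it is deterministic across runs.
    """
    if num_reducers <= 0:
        raise ValueError("num_reducers must be positive")
    return zlib.crc32(key.encode("utf-8")) % num_reducers


for word in ("the", "time", "has", "come", "the"):
    print(word, "-> reducer", partition(word, 3))
```

Every occurrence of the same key lands on the same reducer, which is what lets a single reducer see all values for that key.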


Inside Map-Reduce




   Input
              "The time has come," the Walrus said,
              "To talk of many things:
              Of shoes—and ships—and sealing-wax

   Map
              the, 1     time, 1     has, 1     come, 1     …

   Shuffle and sort
              come, [3,2,1]     has, [1,5,2]     the, [1,2,1]     time, [10,1,3]     …

   Reduce (output)
              come, 6     has, 8     the, 4     time, 14     …
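
The word-count flow above can be imitated on a single machine. This is a toy sketch of the three phases, not Hadoop code: the function names are hypothetical and everything runs in one process. The counts on the slide come from a larger input, so this example will not reproduce them.

```python
import re
from collections import defaultdict


def map_phase(line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in re.findall(r"[a-z]+", line.lower()):
        yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key, then order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle_and_sort(pairs))


lines = ['"The time has come," the Walrus said,', '"To talk of many things:']
print(word_count(lines))
```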




JobTracker
   • Sends out tasks
   • Co-locates tasks with data
   • Gets data location
   • Manages TaskTrackers
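
Co-locating tasks with data can be illustrated with a much-simplified scheduler. This sketch is an assumption-laden toy, not the JobTracker's actual algorithm: the function `assign_tasks`, its inputs, and the host names are all hypothetical. It prefers a host that already holds a replica of the split and falls back to any host with a free slot.

```python
def assign_tasks(split_locations, free_slots):
    """Assign each input split to a host, preferring data-local hosts.

    split_locations: dict of split id -> list of hosts holding a replica.
    free_slots: dict of host -> number of free task slots.
    Returns a dict of split id -> (host, is_data_local).
    """
    slots = dict(free_slots)  # work on a copy; do not mutate the caller's dict
    assignment = {}
    for split, hosts in split_locations.items():
        local = [h for h in hosts if slots.get(h, 0) > 0]
        if local:
            host, is_local = local[0], True
        else:
            candidates = [h for h, n in slots.items() if n > 0]
            if not candidates:
                break  # no capacity left; remaining splits must wait
            host, is_local = candidates[0], False
        slots[host] -= 1
        assignment[split] = (host, is_local)
    return assignment


splits = {"s1": ["node-a", "node-b"], "s2": ["node-a"], "s3": ["node-a"]}
print(assign_tasks(splits, {"node-a": 2, "node-c": 1}))
```

In this run the third split goes to a non-local host because the only host holding its data has no free slots left.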




TaskTracker
   •     Performs tasks (Map, Reduce)
   •     Slots determine number of concurrent tasks
   •     Notifies JobTracker of completed tasks
   •     Heartbeats to the JobTracker
   •     Each task is a separate Java process




Hadoop Ecosystem




Hadoop Ecosystem
   • PIG: It will eat anything
     –   High level language, set algebra, careful semantics
     –   Filter, transform, co-group, generate, flatten
     –   PIG generates and optimizes map-reduce programs
   • Hive: Busy as a bee
     –   High level language, more ad hoc than PIG
     –   SQL-ish
     –   Has central meta-data service
     –   Loves external scripts
   • HBase: NoSQL for your cluster
   • Mahout: distributed/scalable machine learning algorithms


How is MapR Different?




Mostly, It’s Not!

  •   API-compatible
      –   Move code over without modifications
      –   Use the familiar Hadoop Shell
  •   Supports popular tools and applications
      –   Hive, Pig, HBase—Flume, if you want it




Very Different Where It Counts
   • No single point of failure
   • Faster shuffle, faster file creation
   • Read/write storage layer
   • NFS-mountable
   • Management tools: MCS, REST API, CLI
   • Data placement, protection, backup
   • HA at all layers (Naming, NFS, JobTracker, MCS)




Summary




Questions





Contenu connexe

Tendances

Syncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreSyncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreModern Data Stack France
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Modern Data Stack France
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with YarnDavid Kaiser
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopRan Ziv
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Managementrightsize
 
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsJason Shao
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesDataWorks Summit
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewKonstantin V. Shvachko
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresDataWorks Summit
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14John Sing
 

Tendances (20)

Syncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreSyncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScore
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Hadoop 1.x vs 2
Hadoop 1.x vs 2Hadoop 1.x vs 2
Hadoop 1.x vs 2
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
Hadoop
Hadoop Hadoop
Hadoop
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Hadoop Fundamentals I
Hadoop Fundamentals IHadoop Fundamentals I
Hadoop Fundamentals I
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Management
 
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenches
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 
Spark vstez
Spark vstezSpark vstez
Spark vstez
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value Stores
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14
 

En vedette

Hadoop: Revolutionizing Analytics AND Operations
Hadoop: Revolutionizing Analytics AND OperationsHadoop: Revolutionizing Analytics AND Operations
Hadoop: Revolutionizing Analytics AND OperationsMapR Technologies
 
Powering the "As it Happens" Business
Powering the "As it Happens" BusinessPowering the "As it Happens" Business
Powering the "As it Happens" BusinessMapR Technologies
 
Hadoop as a Platform for Genomics
Hadoop as a Platform for GenomicsHadoop as a Platform for Genomics
Hadoop as a Platform for GenomicsMapR Technologies
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
Baptist Health: Solving Healthcare Problems with Big Data
Baptist Health: Solving Healthcare Problems with Big DataBaptist Health: Solving Healthcare Problems with Big Data
Baptist Health: Solving Healthcare Problems with Big DataMapR Technologies
 
Real Time and Big Data – It’s About Time
Real Time and Big Data – It’s About TimeReal Time and Big Data – It’s About Time
Real Time and Big Data – It’s About TimeMapR Technologies
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRDouglas Bernardini
 
Big Data Hadoop Briefing Hosted by Cisco, WWT and MapR: MapR Overview Present...
Big Data Hadoop Briefing Hosted by Cisco, WWT and MapR: MapR Overview Present...Big Data Hadoop Briefing Hosted by Cisco, WWT and MapR: MapR Overview Present...
Big Data Hadoop Briefing Hosted by Cisco, WWT and MapR: MapR Overview Present...ervogler
 

En vedette (9)

Hadoop: Revolutionizing Analytics AND Operations
Hadoop: Revolutionizing Analytics AND OperationsHadoop: Revolutionizing Analytics AND Operations
Hadoop: Revolutionizing Analytics AND Operations
 
Powering the "As it Happens" Business
Powering the "As it Happens" BusinessPowering the "As it Happens" Business
Powering the "As it Happens" Business
 
Hadoop as a Platform for Genomics
Hadoop as a Platform for GenomicsHadoop as a Platform for Genomics
Hadoop as a Platform for Genomics
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Baptist Health: Solving Healthcare Problems with Big Data
Baptist Health: Solving Healthcare Problems with Big DataBaptist Health: Solving Healthcare Problems with Big Data
Baptist Health: Solving Healthcare Problems with Big Data
 
TeraSort
TeraSortTeraSort
TeraSort
 
Real Time and Big Data – It’s About Time
Real Time and Big Data – It’s About TimeReal Time and Big Data – It’s About Time
Real Time and Big Data – It’s About Time
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
 
Big Data Hadoop Briefing Hosted by Cisco, WWT and MapR: MapR Overview Present...
Big Data Hadoop Briefing Hosted by Cisco, WWT and MapR: MapR Overview Present...Big Data Hadoop Briefing Hosted by Cisco, WWT and MapR: MapR Overview Present...
Big Data Hadoop Briefing Hosted by Cisco, WWT and MapR: MapR Overview Present...
 

Similaire à 10c introduction

Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010Gavin Heavyside
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft PlatformJesus Rodriguez
 
Processing Drone data @Scale
Processing Drone data @ScaleProcessing Drone data @Scale
Processing Drone data @ScaleDr Hajji Hicham
 
Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010Gavin Heavyside
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
David Loureiro - Presentation at HP's HPC & OSL TES
David Loureiro - Presentation at HP's HPC & OSL TESDavid Loureiro - Presentation at HP's HPC & OSL TES
David Loureiro - Presentation at HP's HPC & OSL TESSysFera
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reducePaladion Networks
 
Agile analytics applications on hadoop
Agile analytics applications on hadoopAgile analytics applications on hadoop
Agile analytics applications on hadoopHortonworks
 
Hortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics ApplicationsHortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics Applicationsrussell_jurney
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick viewRajesh Nadipalli
 
HdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft PlatformHdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft Platformnvvrajesh
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick viewRajesh Nadipalli
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over YarnInMobi Technology
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing enginebigdatagurus_meetup
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoopinside-BigData.com
 

Similaire à 10c introduction (20)

Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 
Processing Drone data @Scale
Processing Drone data @ScaleProcessing Drone data @Scale
Processing Drone data @Scale
 
MHUG - YARN
MHUG - YARNMHUG - YARN
MHUG - YARN
 
Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
David Loureiro - Presentation at HP's HPC & OSL TES
David Loureiro - Presentation at HP's HPC & OSL TESDavid Loureiro - Presentation at HP's HPC & OSL TES
David Loureiro - Presentation at HP's HPC & OSL TES
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reduce
 
Agile analytics applications on hadoop
Agile analytics applications on hadoopAgile analytics applications on hadoop
Agile analytics applications on hadoop
 
Hortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics ApplicationsHortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics Applications
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
HdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft PlatformHdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft Platform
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over Yarn
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
 
Yarnthug2014
Yarnthug2014Yarnthug2014
Yarnthug2014
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoop
 

Plus de mapr-academy

80a disaster recovery
80a disaster recovery80a disaster recovery
80a disaster recoverymapr-academy
 
70a monitoring & troubleshooting
70a monitoring & troubleshooting70a monitoring & troubleshooting
70a monitoring & troubleshootingmapr-academy
 
55a remote cluster
55a remote cluster55a remote cluster
55a remote clustermapr-academy
 
42 lab-managing services
42 lab-managing services42 lab-managing services
42 lab-managing servicesmapr-academy
 
41a managing services
41a managing services41a managing services
41a managing servicesmapr-academy
 
30a accessing your cluster
30a accessing your cluster30a accessing your cluster
30a accessing your clustermapr-academy
 
3 map r installation & setup administration course description
3 map r installation & setup administration course description3 map r installation & setup administration course description
3 map r installation & setup administration course descriptionmapr-academy
 

Plus de mapr-academy (18)

52 nfs
52 nfs52 nfs
52 nfs
 
80a disaster recovery
80a disaster recovery80a disaster recovery
80a disaster recovery
 
70a monitoring & troubleshooting
70a monitoring & troubleshooting70a monitoring & troubleshooting
70a monitoring & troubleshooting
 
58a migration
58a migration58a migration
58a migration
 
55a remote cluster
55a remote cluster55a remote cluster
55a remote cluster
 
53 lab-nfs
53 lab-nfs53 lab-nfs
53 lab-nfs
 
51 lab-volumes
51 lab-volumes51 lab-volumes
51 lab-volumes
 
50a volumes
50a volumes50a volumes
50a volumes
 
48a tuning
48a tuning48a tuning
48a tuning
 
42 lab-managing services
42 lab-managing services42 lab-managing services
42 lab-managing services
 
41a managing services
41a managing services41a managing services
41a managing services
 
30a accessing your cluster
30a accessing your cluster30a accessing your cluster
30a accessing your cluster
 
22 configuration
22 configuration22 configuration
22 configuration
 
14 lab-planing
14 lab-planing14 lab-planing
14 lab-planing
 
20a installation
20a installation20a installation
20a installation
 
13c planning
13c planning13c planning
13c planning
 
12a architecture
12a architecture12a architecture
12a architecture
 
3 map r installation & setup administration course description
3 map r installation & setup administration course description3 map r installation & setup administration course description
3 map r installation & setup administration course description
 

Dernier

Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
Q4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxQ4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxnelietumpap1
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxChelloAnnAsuncion2
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceSamikshaHamane
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 

Dernier (20)

Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
Q4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxQ4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptx
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...

10c introduction

  • 1. Introduction: MapR and Hadoop 7/6/2012 © 2012 MapR Technologies Introduction 1
  • 2. Introduction Agenda • Hadoop Overview • MapReduce Overview • Hadoop Ecosystem • How is MapR Different? • Summary
  • 3. Introduction Objectives At the end of this module you will be able to: • Explain why Hadoop is an important technology for effectively working with Big Data • Describe the phases of a MapReduce job • Identify some of the tools used with Hadoop • List the similarities and differences between MapR and other Hadoop distributions
  • 4. Hadoop Overview
  • 5. Data is Growing Faster than Moore’s Law: Business Analytics Requires a New Approach [Figure: data volume growing 44x; 2010: 1.2 zettabytes; 2020: 35.2 zettabytes; label: IDC Digital Universe Study 2011] Source: IDC Digital Universe Study, sponsored by EMC, May 2010
  • 6. Before Hadoop Web crawling to power search engines • Must be able to handle gigantic data • Must be fast! Problem: databases (B-Tree) not so fast, and do not scale Solution: Sort and Merge • Eliminate the pesky seek time!
  • 7. How to Scale? Big Data has Big Problems • Petabytes of data • MTBF on 1000s of nodes is < 1 day • Something is always broken • There are limits to scaling Big Iron • Sequential and random access just don’t scale
  • 8. Example: Update 1% of 1 TB • Data consists of 10 billion records, each 100 bytes • Task: Update 1% of these records
  • 9. Approach 1: Just Do It • Each update involves read, modify and write – t = 1 seek + 2 disk rotations = 20 ms – 1% x 10^10 x 20 ms = 2 mega-seconds ≈ 23 days (552 hours) • Total time dominated by seek and rotation times
  • 10. Approach 2: The “Hard” Way • Copy the entire database 1 GB at a time • Update records sequentially – t = 2 x 1 GB / 100 MB/s + 20 ms ≈ 20 s – 10^3 x 20 s = 20,000 s ≈ 5.6 hours • 100x faster to move 100x more data! • Moral: Read data sequentially even if you only want 1% of it
  • 11. Introducing Hadoop! • Now imagine you have thousands of disks on hundreds of machines with near linear scaling – Commodity hardware – thousands of nodes! – Handles Big Data – Petabytes and more! – Sequential file access – all spindles at once! – Sharding – data distributed evenly across cluster – Reliability – self-healing, self-balancing – Redundancy – data replicated across multiple hosts and disks – MapReduce • Parallel computing framework • Moves the computation to the data
  • 12. Hadoop Architecture • MapReduce: Parallel computing – Move the computation to the data – Minimizes network utilization • Distributed storage layer: Keeping track of data and metadata – Data is sharded across the cluster • Cluster management tools • Applications and tools
  • 13. What’s Driving Hadoop Adoption? “Simple algorithms and lots of data trump complex models” Halevy, Norvig, and Pereira, Google, IEEE Intelligent Systems
  • 14. MapReduce Overview
  • 15. MapReduce • A programming model for processing very large data sets ― A framework for processing parallel problems across huge datasets using a large number of nodes ― Brute force parallel computing paradigm • Phases ― Map • Job partitioned into “splits” ― Shuffle and sort • Map output sent to reducer(s) using a hash ― Reduce
  • 16. Inside Map-Reduce [Figure: word-count data flow through Input, Map, Shuffle and sort, Reduce, Output. Input: "The time has come," the Walrus said, / "To talk of many things: / Of shoes—and ships—and sealing-wax … Map output: (the, 1), (time, 1), (has, 1), (come, 1), … Shuffle and sort: come, [3,2,1]; has, [1,5,2]; the, [1,2,1]; time, [10,1,3]; … Reduce output: come, 6; has, 8; the, 4; time, 14; …]
  • 17. JobTracker • Sends out tasks • Co-locates tasks with data • Gets data location • Manages TaskTrackers
  • 18. TaskTracker • Performs tasks (Map, Reduce) • Slots determine number of concurrent tasks • Notifies JobTracker of completed jobs • Heartbeats to the JobTracker • Each task is a separate Java process
  • 19. Hadoop Ecosystem
  • 20. Hadoop Ecosystem • PIG: It will eat anything – High level language, set algebra, careful semantics – Filter, transform, co-group, generate, flatten – PIG generates and optimizes map-reduce programs • Hive: Busy as a bee – High level language, more ad hoc than PIG – SQL-ish – Has central meta-data service – Loves external scripts • HBase: NoSQL for your cluster • Mahout: distributed/scalable machine learning algorithms
  • 21. How is MapR Different?
  • 22. Mostly, It’s Not! • API-compatible – Move code over without modifications – Use the familiar Hadoop Shell • Supports popular tools and applications – Hive, Pig, HBase; Flume, if you want it
  • 23. Very Different Where It Counts • No single point of failure • Faster shuffle, faster file creation • Read/write storage layer • NFS-mountable • Management tools: MCS, REST API, CLI • Data placement, protection, backup • HA at all layers (Naming, NFS, JobTracker, MCS)
  • 24. Summary
  • 25. Questions
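
The arithmetic on slides 8 to 10 can be checked with a few lines of Python. This is only a sketch: the constants are the figures quoted on the slides (20 ms per random update, 100 MB/s sequential throughput, 1 GB chunks), while the variable and function names are made up for illustration.

```python
# Figures from slides 8-10; all names here are illustrative, not from the deck.
RECORDS = 10**10            # 10 billion records
RECORD_BYTES = 100          # 100 bytes each, 1 TB in total
UPDATE_PERCENT = 1          # update 1% of the records
RANDOM_IO_S = 0.020         # 1 seek + 2 disk rotations per update
CHUNK_BYTES = 10**9         # copy the database 1 GB at a time
THROUGHPUT_BPS = 100 * 10**6  # 100 MB/s sequential read or write


def random_update_seconds():
    """Approach 1: one random read-modify-write per updated record."""
    updates = RECORDS * UPDATE_PERCENT // 100
    return updates * RANDOM_IO_S


def sequential_copy_seconds():
    """Approach 2: read and rewrite every chunk, paying one seek per chunk."""
    chunks = RECORDS * RECORD_BYTES // CHUNK_BYTES
    per_chunk = 2 * CHUNK_BYTES / THROUGHPUT_BPS + RANDOM_IO_S
    return chunks * per_chunk


if __name__ == "__main__":
    slow = random_update_seconds()
    fast = sequential_copy_seconds()
    print(f"random updates:  {slow / 86400:.1f} days")
    print(f"sequential copy: {fast / 3600:.1f} hours")
    print(f"speedup:         {slow / fast:.0f}x")
```

Under these assumptions the random-update approach takes roughly 23 days and the full sequential copy about 5.6 hours, which is the roughly 100x gap the slides describe.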

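The three phases from slides 15 and 16 (map, shuffle and sort, reduce) can be sketched as a single-process word count. This is not Hadoop code and does not use the Hadoop API; the function names are hypothetical, and CRC32 is chosen here only as an assumed stand-in for whatever hash a real cluster uses to route keys to reducers.

```python
import re
import zlib
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in re.findall(r"[a-z]+", split.lower())]


def partition(key, num_reducers):
    """Choose a reducer for a key with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(mapped, num_reducers):
    """Shuffle: route each pair to its reducer, group values by key, sort keys."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped:
        for key, value in pairs:
            partitions[partition(key, num_reducers)][key].append(value)
    return [sorted(p.items()) for p in partitions]


def reduce_phase(grouped):
    """Reduce: sum the list of counts collected for each key."""
    return [(key, sum(values)) for key, values in grouped]


def word_count(splits, num_reducers=2):
    """Run all three phases in one process and merge the reducer outputs."""
    mapped = [map_phase(split) for split in splits]
    counts = {}
    for grouped in shuffle_and_sort(mapped, num_reducers):
        counts.update(reduce_phase(grouped))
    return counts


if __name__ == "__main__":
    splits = ['"The time has come," the Walrus said,', '"To talk of many things:']
    print(word_count(splits))
```

Because every occurrence of a key hashes to the same reducer, the final counts do not depend on how many reducers are used; only the distribution of work changes.
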
Editor's notes

  1. Problem: Scaling reliably is hard. What you need is a fault-tolerant store and a fault-tolerant framework. Handle hardware faults transparently and efficiently. High availability: not dependent on any one component. Even on a big cluster, some things take days. Even simple things are complicated in a failure-rich environment. Every point is a point where things can fail, and that failure has to be managed. With many computers and many disks, failures are common. With 1000 computers x 10 disks, we can have 1 node failure and 10 disk failures per day. Some failures are intermittent or difficult to detect. Computation must succeed and not run slower in these conditions.
  2. Apache Hadoop: a new paradigm. Scales to thousands of commodity computers. Can effectively use all cores and spindles simultaneously. If you buy hardware, you want to maximize its use. New software stack built on a different foundation. Not very mature yet. In use by most Web 2.0 companies and many Fortune 500 companies.
  3. The first is “simple algorithms and lots of data trump complex models”. This comes from an IEEE article written by three research directors at Google. The article was titled “The Unreasonable Effectiveness of Data”. It was a reaction to an article called “The Unreasonable Effectiveness of Mathematics in the Natural Sciences”. That paper made the point that simple formulas can explain the complex natural world, the most famous example being E = mc² in physics. The Google paper talked about how economists were jealous, since they lacked similar models to neatly explain human behavior. But the authors found that in natural language processing, a notoriously complex area that has been studied for years with many AI attempts at addressing it, relatively simple approaches on massive data produced stunning results. They cited an example of scene completion. An algorithm is used to eliminate something in a picture, a car for instance, and to fill in the missing background based on a corpus of thousands of pictures. This algorithm did rather poorly until the corpus was increased to millions of photos; with that amount of data the same algorithm performed extremely well. While not a direct example from financial services, I think it’s a great analogy. After all, aren’t you looking for an approach that can fill in the missing pieces of a picture or pattern?