2 December 2011

Hadoop in Three Use Cases
Joey Echeverria | Solutions Architect
joey@cloudera.com | @fwiffo
About Joey

    •   Solutions Architect
    •   6 months
    •   3+ years
    •   Local




                       ©2011 Cloudera, Inc. All Rights Reserved.
Cloudera’s Distribution including Apache Hadoop


 • File System Mount: FUSE-DFS
 • UI Framework: HUE
 • SDK: HUE SDK
 • Workflow: Apache Oozie*
 • Scheduling: Apache Oozie*
 • Metadata: Apache Hive
 • Languages / Compilers: Apache Pig, Apache Hive
 • Data Integration: Apache Flume*, Apache Sqoop*
 • Fast Read/Write Access: Apache HBase
 • Coordination: Apache ZooKeeper

 *currently under incubation in the Apache Software Foundation


Extract, Transform, and Load




ETL before Hadoop
    Difficult to maintain, not scalable




                        Relational
                        Databases




      Logs
                     Custom ETL                                                  Enterprise Data
                       Scripts                                                    Warehouse




                          Files




ETL before Hadoop
    May be scalable, expensive




                      Relational
                      Databases




      Logs          Enterprise Data
                     Warehouse
                                                           SQL:
                                                           raw table → warehouse tables


                         Files




ETL with Hadoop
    Managed, flexible, scalable




                          Relational
                          Databases




      Logs                                                                 Enterprise Data
                                                                            Warehouse




                            Files




Steps


    1. In

    2. Process

    3. Out


Flume




ETL with Hadoop
 Managed, flexible, scalable


                       Relational
                       Databases




                                                                         Enterprise Data
     Logs   Flume                                                         Warehouse




                         Files




HDFS
                       02, 06, 10
                                                               NameNode


open(“file.txt”)      DataNode                                   DataNode          DataNode
                         01                                         05                09

                      DataNode                                   DataNode          DataNode
                         02                                         06                10
      Client
                                                data                        data
                      DataNode                                   DataNode          DataNode
               data      03                                         07                11

                      DataNode                                   DataNode          DataNode
                         04                                         08                12




HDFS

 •   Distributed
 •   Replication
 •   Bulk I/O
 •   Fault tolerant
 •   Scalable
 •   Append only
 •   Not POSIX
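
The read path in the HDFS diagram can be sketched as a toy model in Python. This is not the real HDFS client API: the `NameNode`, `DataNode`, `read_file`, and `build_demo` names are hypothetical, and treating "02, 06, 10" as the replica locations of a single block is an assumption about the diagram. The sketch shows the NameNode answering only with metadata, the client reading bytes directly from a DataNode, and replication letting the read survive a dead node.

```python
class NameNode:
    """Holds metadata only: which DataNodes store each block of a file."""

    def __init__(self):
        self.block_map = {}  # path -> list of (block_id, [datanode ids])

    def add_file(self, path, blocks):
        self.block_map[path] = blocks

    def open(self, path):
        # Returns block locations; no file data passes through the NameNode.
        return self.block_map[path]


class DataNode:
    """Stores block contents keyed by block id."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


def read_file(namenode, datanodes, path, dead=frozenset()):
    """Read every block of `path` from the first live replica."""
    chunks = []
    for block_id, replicas in namenode.open(path):
        live = [node for node in replicas if node not in dead]
        if not live:
            raise IOError("no live replica for block %s" % block_id)
        chunks.append(datanodes[live[0]].read(block_id))
    return b"".join(chunks)


def build_demo():
    """Twelve DataNodes; one block of file.txt replicated on 02, 06, 10."""
    datanodes = {"%02d" % i: DataNode("%02d" % i) for i in range(1, 13)}
    for node_id in ("02", "06", "10"):
        datanodes[node_id].store("blk_1", b"hello hdfs")
    namenode = NameNode()
    namenode.add_file("file.txt", [("blk_1", ["02", "06", "10"])])
    return namenode, datanodes
```

Calling `read_file(namenode, datanodes, "file.txt", dead={"02"})` still returns the file contents, because the client falls back to the next replica in the list.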


ETL with Hadoop
 Managed, flexible, scalable


                        Relational
                        Databases




                                                                          Enterprise Data
     Logs   Flume      HDFS                                                Warehouse




                          Files




FUSE-DFS

 • FUSE
     – User space
     – File systems
 • FUSE-DFS
     – /hdfs
     – Mostly transparent




ETL with Hadoop
 Managed, flexible, scalable


                       Relational
                       Databases




                                                                         Enterprise Data
     Logs   Flume      HDFS                                               Warehouse




                     FUSE-DFS

                         Files




Sqoop

 • SQL to Hadoop
 • Parallel import
 • File formats
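
A parallel import can be pictured as splitting a table's key range into disjoint slices, one per map task. The Python sketch below illustrates that idea only; it is not Sqoop's code, and the `split_ranges` and `slice_queries` helpers, the `orders` table, and the `id` column are hypothetical names chosen for illustration.

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive key range [lo, hi] into contiguous sub-ranges."""
    total = hi - lo + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = lo
    for i in range(num_mappers):
        # The first `extra` slices take one additional key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more mappers than keys: skip empty slices
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def slice_queries(table, column, lo, hi, num_mappers):
    """Build one bounded SELECT per slice; each could run in its own task."""
    return [
        "SELECT * FROM %s WHERE %s >= %d AND %s <= %d" % (table, column, a, column, b)
        for a, b in split_ranges(lo, hi, num_mappers)
    ]
```

For example, `split_ranges(1, 10, 4)` yields four slices that together cover every key from 1 to 10 exactly once, so no row is imported twice.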




ETL with Hadoop
 Managed, flexible, scalable


                       Relational
                       Databases



                       Sqoop


                                                                         Enterprise Data
     Logs   Flume      HDFS                                               Warehouse




                     FUSE-DFS

                         Files




Pig

   •   Scripting language
   •   Generates MapReduce jobs
   •   Perl for Hadoop
   •   Great for ETL
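-- Load three integer columns, group the rows by f1, and count the rows in each group.
A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int);
B = GROUP A BY f1;
-- Each row of B is (group, bag of A rows); COUNT needs the bag, not $0 (the group key).
C = FOREACH B GENERATE group, COUNT(A);
DUMP C;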
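
For readers who do not know Pig, the intent of the script on this slide (group rows by `f1`, then count each group) can be expressed as a single-machine Python sketch. The `count_by_first_field` function and the sample rows are hypothetical; Pig would compile the same logic into MapReduce jobs rather than run it in one process.

```python
from collections import Counter


def count_by_first_field(rows):
    """Group (f1, f2, f3) rows by f1 and count the rows in each group."""
    return dict(Counter(f1 for f1, _f2, _f3 in rows))


# Hypothetical input standing in for the 'data' file loaded by the script.
SAMPLE_ROWS = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
```

`count_by_first_field(SAMPLE_ROWS)` returns one entry per distinct `f1` value, which corresponds to one output tuple per group in the Pig relation `C`.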


ETL with Hadoop
 Managed, flexible, scalable


                       Relational
                       Databases                               Pig

                       Sqoop


                                                                         Enterprise Data
     Logs   Flume      HDFS                                               Warehouse




                     FUSE-DFS

                         Files




Sqoop with connectors

 •   MySQL*
 •   PostgreSQL*
 •   Teradata*
 •   Netezza*
 •   Oracle*
 •   Couchbase*
 •   Microsoft SQL Server
 •   VoltDB
                                                                     *Cloudera certified connector


ETL with Hadoop
 Managed, flexible, scalable


                       Relational
                       Databases                               Pig

                      Sqoop


                                                                         Enterprise Data
     Logs   Flume      HDFS                                               Warehouse




                    FUSE-DFS                                              Sqoop

                         Files




Recommendations




Recommendations with Hadoop


                                                              CUSTOMERS
            Relational
            Databases


                                                                Web
                                                              Application




     Logs




Flume




Recommendations with Hadoop


                                                                      CUSTOMERS
                    Relational
                    Databases


                                                                        Web
                                                                      Application




     Logs   Flume




HDFS




Recommendations with Hadoop


                                                                      CUSTOMERS
                    Relational
                    Databases


                                                                        Web
                                                                      Application




     Logs   Flume   HDFS




Sqoop




Recommendations with Hadoop


                                                                      CUSTOMERS
                    Relational
                    Databases


                                                                        Web
                    Sqoop                                             Application




     Logs   Flume   HDFS




Pig




Recommendations with Hadoop


                                                                      CUSTOMERS
                    Relational
                    Databases


                                                                        Web
                    Sqoop                                             Application




     Logs   Flume   HDFS




 Pig




Mahout

 • Scalable machine learning algorithms
     – Collaborative Filtering
     – User and Item based recommenders
     – K-Means, Fuzzy K-Means clustering
     – Mean Shift clustering
     – Singular value decomposition
     – Complementary Naive Bayes classifier
       …
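
As a rough illustration of what an item-based recommender does, the Python sketch below scores unseen items by how often they co-occur with items a user already has. It is a simplified stand-in, not Mahout's API or its similarity measures; the `cooccurrence` and `recommend` functions and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(baskets):
    """Count how often each pair of items appears for the same user."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in baskets.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(user, baskets, top_n=3):
    """Rank items the user has not seen by summed co-occurrence counts."""
    seen = set(baskets[user])
    counts = cooccurrence(baskets)
    scores = defaultdict(int)
    for item in seen:
        for other, c in counts[item].items():
            if other not in seen:
                scores[other] += c
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


# Hypothetical user -> items data, e.g. derived from web application logs.
SAMPLE_BASKETS = {
    "ann": ["a", "b", "c"],
    "bob": ["a", "b"],
    "cat": ["b", "c", "d"],
    "dan": ["a"],
}
```

On a cluster, the pair counting is the expensive step, which is why libraries such as Mahout distribute it across many machines.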


Recommendations with Hadoop


                                                                        CUSTOMERS
                      Relational
                      Databases


                                                                          Web
                      Sqoop                                             Application




     Logs   Flume     HDFS




 Pig         Mahout




MapReduce
        map            shuffle                                         reduce
               :1
     toOne()
               :1

               :1                                        :[1,1,1,1] count()        :4
                                                        :[1,1]                     :2
               :1
     toOne()
               :1

               :1                                        :[1,1]          count()   :2
               :1                                        :[1]                      :1
     toOne()
               :1

               :1
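
The three phases in the diagram can be sketched in plain Python. The function names `to_one` and `count` mirror the `toOne()` and `count()` labels on the slide; the `shuffle` and `run_job` helpers and the sample keys are assumptions, and a real job would run these phases on separate machines rather than in one process.

```python
from collections import defaultdict


def to_one(record):
    """Map: emit the record as a key paired with the value 1."""
    return (record, 1)


def shuffle(pairs):
    """Shuffle: collect all values that share a key into one list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def count(key, values):
    """Reduce: sum the ones for a key."""
    return (key, sum(values))


def run_job(splits):
    """Run map over every input split, then shuffle, then reduce."""
    mapped = [to_one(record) for split in splits for record in split]
    return dict(count(key, values) for key, values in shuffle(mapped).items())


# Three hypothetical input splits of three records each, one per mapper.
SAMPLE_SPLITS = [["a", "b", "a"], ["c", "a", "b"], ["a", "c", "d"]]
```

With `SAMPLE_SPLITS`, the shuffle produces value lists of lengths 4, 2, 2, and 1, matching the shape of the lists shown in the diagram.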


MapReduce

 •   Distributed
 •   Code to data
 •   Reliable
 •   Scalable




Recommendations with Hadoop


                                                                        CUSTOMERS
                      Relational
                      Databases


                                                                          Web
                      Sqoop                                             Application




     Logs   Flume     HDFS




 Pig         Mahout         MapReduce                                     Pig




Oozie

 • Workflows
 • Coordinator
     – Triggers




Recommendations with Hadoop


                                                                        CUSTOMERS
                      Relational
                      Databases


                                                                          Web
                      Sqoop                                             Application




     Logs   Flume     HDFS
                                                                                      Oozie




 Pig         Mahout         MapReduce                                     Pig




HBase

 • Key/value store
 • Data stored in HDFS
 • Access model is get/put/del
     – Plus range scans and versions
 • Random reads and writes for Hadoop
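
The access model listed above (get/put/del, range scans, versions) can be illustrated with a small in-memory Python class. This is a toy, not the HBase client API: `ToyTable` and its method signatures are hypothetical, and it ignores column families, regions, and persistence in HDFS.

```python
import bisect


class ToyTable:
    """Sorted key/value store that keeps a few timestamped versions per key."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._keys = []   # row keys kept in sorted order for range scans
        self._cells = {}  # row key -> list of (timestamp, value), newest first

    def put(self, key, value, ts):
        if key not in self._cells:
            bisect.insort(self._keys, key)
            self._cells[key] = []
        versions = self._cells[key]
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # drop versions beyond the limit

    def get(self, key, versions=1):
        """Return up to `versions` values for the key, newest first."""
        return [value for _ts, value in self._cells.get(key, [])[:versions]]

    def delete(self, key):
        if key in self._cells:
            del self._cells[key]
            self._keys.remove(key)

    def scan(self, start, stop):
        """Return (key, newest value) for keys in [start, stop), in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._cells[k][0][1]) for k in self._keys[lo:hi]]
```

Keeping the row keys sorted is what makes the range scan cheap: a scan is a contiguous slice of the key list rather than a search over every row.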




Recommendations with Hadoop


                                                                        CUSTOMERS
                      Relational
                      Databases


                                                                          Web
                      Sqoop                                             Application




     Logs   Flume     HDFS
                                                                        HBase         Oozie




 Pig         Mahout         MapReduce                                     Pig




Business Intelligence




Business Intelligence with Hadoop


                                                               ANALYSTS
            Relational
            Databases



                                                              BI / Analytics




     Logs




Flume




Business Intelligence with Hadoop


                                                                       ANALYSTS
                    Relational
                    Databases



                                                                      BI / Analytics




     Logs   Flume




HDFS




Business Intelligence with Hadoop


                                                                       ANALYSTS
                    Relational
                    Databases



                                                                      BI / Analytics




     Logs   Flume   HDFS




Sqoop




Business Intelligence with Hadoop


                                                                       ANALYSTS
                    Relational
                    Databases



                    Sqoop                                             BI / Analytics




     Logs   Flume   HDFS




Hive

 • Data warehouse
 • Ad-hoc queries
     – Not real-time (minutes)
 • SQL
 • Tables
 • Joins



Business Intelligence with Hadoop


                                                                       ANALYSTS
                    Relational
                    Databases



                    Sqoop                                             BI / Analytics




     Logs   Flume   HDFS




                                                     Hive


MapReduce




Business Intelligence with Hadoop


                                                                          ANALYSTS
                    Relational
                    Databases



                    Sqoop                                                BI / Analytics




     Logs   Flume   HDFS




                                                     Hive             MapReduce


Oozie




Business Intelligence with Hadoop


                                                                               ANALYSTS
                    Relational
                    Databases



                    Sqoop                                                     BI / Analytics




     Logs   Flume   HDFS

                                                                      Oozie



                                                     Hive                 MapReduce


HBase




Business Intelligence with Hadoop


                                                                               ANALYSTS
                    Relational
                    Databases



                    Sqoop                                                     BI / Analytics




     Logs   Flume   HDFS                                                                       HBase
                                                                      Oozie



                                                     Hive                 MapReduce


Hive




Hive for Business Intelligence

 • JDBC
     – JasperReports*
     – Pentaho*
 • ODBC
     – MicroStrategy*^



                                                                       * Vendor certified connector
                                                                       ^ Cloudera certified connector


Business Intelligence with Hadoop


                                                                               ANALYSTS
                    Relational
                    Databases



                    Sqoop                                                     BI / Analytics




     Logs   Flume   HDFS                             Hive                                      HBase
                                                                      Oozie



                                                     Hive                 MapReduce


CDH

 • File System Mount: FUSE-DFS
 • UI Framework: HUE
 • SDK: HUE SDK
 • Workflow: Apache Oozie*
 • Scheduling: Apache Oozie*
 • Metadata: Apache Hive
 • Languages / Compilers: Apache Pig, Apache Hive
 • Data Integration: Apache Flume*, Apache Sqoop*
 • Fast Read/Write Access: Apache HBase
 • Coordination: Apache ZooKeeper

 *currently under incubation in the Apache Software Foundation


What’s next?

 •   Cloudera Training Videos
 •   CDH Virtual Machines
 •   Hadoop: The Definitive Guide, 2nd Edition
 •   Cloudera University
     – Developer Training in Columbia, MD
        • Dec 13-16, Feb 13-16
     – Administrator Training in Herndon, VA
        • Jan 4-6
     – Private Training


We’re Hiring!
 • http://www.cloudera.com/company/careers/
 • Customer Operations
     – Customer Operations Engineer
     – Customer Operations Tools Developer
 • Customer Solutions
     – Solutions Architect
 • Engineering
     –   Senior Data Integration Developer
     –   Senior Distributed Systems Engineer
     –   Senior UI Engineer
     –   Software Quality Engineer
     –   Technical Writer
 • IT/Operations
     – Systems Administrator


73
                             Copyright 2011 Cloudera Inc. All rights reserved
74

Contenu connexe

Tendances

Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...Edureka!
 
Boston Hadoop Meetup, April 26 2012
Boston Hadoop Meetup, April 26 2012Boston Hadoop Meetup, April 26 2012
Boston Hadoop Meetup, April 26 2012Daniel Abadi
 
Hadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced AnalyticsHadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced Analyticsjoshwills
 
Azure_Business_Opportunity
Azure_Business_OpportunityAzure_Business_Opportunity
Azure_Business_OpportunityNojan Emad
 
Introduction to Apache hadoop
Introduction to Apache hadoopIntroduction to Apache hadoop
Introduction to Apache hadoopOmar Jaber
 
Big data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructureBig data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructuredatastack
 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprisesmarkgrover
 
Hadoop Architecture
Hadoop Architecture Hadoop Architecture
Hadoop Architecture Ganesh B
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architectureHarikrishnan K
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Big data vahidamiri-datastack.ir
Big data vahidamiri-datastack.irBig data vahidamiri-datastack.ir
Big data vahidamiri-datastack.irdatastack
 
Scalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex GryzlovScalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex GryzlovVasil Remeniuk
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache HadoopAjit Koti
 

Tendances (20)

Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Boston Hadoop Meetup, April 26 2012
Boston Hadoop Meetup, April 26 2012Boston Hadoop Meetup, April 26 2012
Boston Hadoop Meetup, April 26 2012
 
Hadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced AnalyticsHadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced Analytics
 
Azure_Business_Opportunity
Azure_Business_OpportunityAzure_Business_Opportunity
Azure_Business_Opportunity
 
Introduction to Apache hadoop
Introduction to Apache hadoopIntroduction to Apache hadoop
Introduction to Apache hadoop
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Big data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructureBig data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructure
 
Hadoop
Hadoop Hadoop
Hadoop
 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprises
 
HDFS Architecture
HDFS ArchitectureHDFS Architecture
HDFS Architecture
 
Hadoop Architecture
Hadoop Architecture Hadoop Architecture
Hadoop Architecture
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 
Hadoop
HadoopHadoop
Hadoop
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Big data vahidamiri-datastack.ir
Big data vahidamiri-datastack.irBig data vahidamiri-datastack.ir
Big data vahidamiri-datastack.ir
 
Scalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex GryzlovScalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex Gryzlov
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
SQL Server 2012 and Big Data
SQL Server 2012 and Big DataSQL Server 2012 and Big Data
SQL Server 2012 and Big Data
 

En vedette

Payment Gateway Live hadoop project
Payment Gateway Live hadoop projectPayment Gateway Live hadoop project
Payment Gateway Live hadoop projectKamal A
 
Bigdata Hadoop project payment gateway domain
Bigdata Hadoop project payment gateway domainBigdata Hadoop project payment gateway domain
Bigdata Hadoop project payment gateway domainKamal A
 
An example of a successful proof of concept
An example of a successful proof of conceptAn example of a successful proof of concept
An example of a successful proof of conceptETLSolutions
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecasesudhakara st
 

En vedette (6)

NYE Stock analysis
NYE Stock analysisNYE Stock analysis
NYE Stock analysis
 
Hadoop
HadoopHadoop
Hadoop
 
Payment Gateway Live hadoop project
Payment Gateway Live hadoop projectPayment Gateway Live hadoop project
Payment Gateway Live hadoop project
 
Bigdata Hadoop project payment gateway domain
Bigdata Hadoop project payment gateway domainBigdata Hadoop project payment gateway domain
Bigdata Hadoop project payment gateway domain
 
An example of a successful proof of concept
An example of a successful proof of conceptAn example of a successful proof of concept
An example of a successful proof of concept
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 

Similaire à Hadoop in three use cases

Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Cloudera, Inc.
 
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopBusiness Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopCloudera, Inc.
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...Amr Awadallah
 
Data Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataData Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataCloudera, Inc.
 
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Hadoop in three use cases

  • 1. Hadoop in Three Use Cases. 2 December 2011. Joey Echeverria | Solutions Architect | joey@cloudera.com | @fwiffo
  • 2. About Joey • Solutions Architect • 6 months • 3+ years • Local. ©2011 Cloudera, Inc. All Rights Reserved.
  • 3. Cloudera’s Distribution including Apache Hadoop. File System Mount: FUSE-DFS; UI Framework: HUE; SDK: HUE SDK; Workflow: Apache Oozie*; Scheduling: Apache Oozie*; Metadata: Apache Hive; Languages / Compilers: Apache Pig, Apache Hive; Data Integration: Apache Flume*, Apache Sqoop*; Fast Read/Write Access: Apache HBase; Coordination: Apache ZooKeeper. *currently under incubation in the Apache Software Foundation
  • 4. Extract, Transform, and Load
  • 5. ETL before Hadoop: difficult to maintain, not scalable. Diagram components: Relational Databases; Logs; Files; Custom ETL Scripts; Enterprise Data Warehouse.
  • 6. ETL before Hadoop: may be scalable, expensive. Diagram components: Relational Databases; Logs; Files; Enterprise Data Warehouse (SQL: raw table → warehouse tables).
  • 7. ETL with Hadoop: managed, flexible, scalable. Diagram components: Relational Databases; Logs; Files; Enterprise Data Warehouse.
  • 8. Steps: 1. In 2. Process 3. Out
  • 9. Flume
  • 10. Flume
  • 11. ETL with Hadoop: managed, flexible, scalable. Diagram components: Relational Databases; Logs; Files; Flume; Enterprise Data Warehouse.
  • 12. HDFS
  • 13. HDFS. Diagram: the Client sends open(“file.txt”) to the NameNode, which answers 02, 06, 10; data then flows to the Client from three DataNodes. Twelve DataNodes (01 through 12) are shown.
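The read path on this slide can be sketched as a toy model. This is only an illustration, not the HDFS client API: the class names, the `read_file` helper, and the sample contents are hypothetical, and it assumes the ids 02, 06, 10 identify the DataNodes holding the blocks of file.txt.

```python
# Toy model of the HDFS read path; all names here are hypothetical.

class NameNode:
    """Holds metadata only: which DataNodes store each block of a file."""

    def __init__(self):
        self.block_map = {}  # file name -> ordered list of DataNode ids

    def open(self, name):
        # Return block locations, never the data itself.
        return list(self.block_map[name])


class DataNode:
    """Stores the actual bytes of the blocks assigned to it."""

    def __init__(self):
        self.blocks = {}  # file name -> bytes of the block held here

    def read(self, name):
        return self.blocks[name]


def read_file(name, namenode, datanodes):
    """Ask the NameNode where the blocks live, then read each from its DataNode."""
    locations = namenode.open(name)
    return b"".join(datanodes[node_id].read(name) for node_id in locations)


def build_demo_cluster():
    """Twelve DataNodes (01..12); file.txt has one block on each of 02, 06, 10."""
    datanodes = {f"{i:02d}": DataNode() for i in range(1, 13)}
    namenode = NameNode()
    namenode.block_map["file.txt"] = ["02", "06", "10"]
    chunks = [b"hello ", b"hadoop ", b"world"]  # made-up sample contents
    for node_id, chunk in zip(["02", "06", "10"], chunks):
        datanodes[node_id].blocks["file.txt"] = chunk
    return namenode, datanodes
```

The point of the split is that the NameNode only answers the metadata question, while the bulk data moves directly between the Client and the DataNodes.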
  • 14. HDFS • Distributed • Replication • Bulk I/O • Fault tolerant • Scalable • Append only • Not POSIX
  • 15. ETL with Hadoop: managed, flexible, scalable. Diagram components: Relational Databases; Logs; Files; Flume; HDFS; Enterprise Data Warehouse.
  • 16. FUSE-DFS
  • 17. FUSE-DFS • FUSE – User space – File systems • FUSE-DFS – /hdfs – Mostly transparent
  • 18. ETL with Hadoop: managed, flexible, scalable. Diagram components: Relational Databases; Logs; Files; Flume; FUSE-DFS; HDFS; Enterprise Data Warehouse.
  • 19. Sqoop
  • 20. Sqoop • SQL to Hadoop • Parallel import • File formats
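The "parallel import" bullet can be illustrated with a small sketch of the general idea: divide a table's key range into contiguous slices so that several tasks can each import one slice. This is not Sqoop's code; the function name `split_ranges` and the sample numbers are hypothetical.

```python
def split_ranges(min_id, max_id, num_tasks):
    """Divide the inclusive key range [min_id, max_id] into contiguous slices.

    Each slice could back one query of the form
    WHERE id >= lo AND id <= hi, run by a separate import task.
    """
    total = max_id - min_id + 1
    base, extra = divmod(total, num_tasks)
    ranges = []
    start = min_id
    for i in range(num_tasks):
        # The first `extra` slices take one additional key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more tasks than keys: skip empty slices
        ranges.append((start, start + size - 1))
        start += size
    return ranges
```

For example, `split_ranges(1, 10, 4)` yields four slices that together cover ids 1 through 10 without overlap.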
  • 21. ETL with Hadoop: managed, flexible, scalable. Diagram components: Relational Databases; Logs; Files; Sqoop; Flume; FUSE-DFS; HDFS; Enterprise Data Warehouse.
  • 22. Pig
  • 23. Pig • Scripting language • Generates MapReduce jobs • Perl for Hadoop • Great for ETL. Example script: A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int); B = GROUP A BY f1; C = FOREACH B GENERATE COUNT(A); DUMP C;
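For readers who do not know Pig Latin, the group-and-count pattern in the script on this slide can be sketched in plain Python. This is a single-machine analogy with made-up sample rows; Pig would compile the script into MapReduce jobs instead. Returning the group key next to each count is an addition for readability, and `group_and_count` is a hypothetical name.

```python
from collections import defaultdict


def group_and_count(rows):
    """Group rows by their first field (f1) and count the rows in each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)  # GROUP A BY f1
    # FOREACH group, emit the number of rows it holds
    return {key: len(members) for key, members in groups.items()}


# Made-up (f1, f2, f3) sample rows standing in for the 'data' file.
rows = [(1, 2, 3), (4, 2, 1), (1, 5, 6), (8, 3, 4), (1, 7, 7)]
counts = group_and_count(rows)
print(counts)
```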
  • 24. ETL with Hadoop: managed, flexible, scalable. Diagram components: Relational Databases; Logs; Files; Sqoop; Flume; FUSE-DFS; HDFS; Pig; Enterprise Data Warehouse.
  • 25. Sqoop with connectors
  • 26. Sqoop with connectors • MySQL* • PostgreSQL* • Teradata* • Netezza* • Oracle* • Couchbase* • Microsoft SQL Server • VoltDB. *Cloudera certified connector
  • 27. ETL with Hadoop: managed, flexible, scalable. Diagram components: Relational Databases; Logs; Files; Sqoop (shown twice); Flume; FUSE-DFS; HDFS; Pig; Enterprise Data Warehouse.
  • 28. Recommendations
  • 29. Recommendations with Hadoop. Diagram components: Customers; Web Application; Relational Databases; Logs.
  • 30. Flume
  • 31. Recommendations with Hadoop. Diagram components: Customers; Web Application; Relational Databases; Logs; Flume.
  • 32. HDFS
  • 33. Recommendations with Hadoop. Diagram components: Customers; Web Application; Relational Databases; Logs; Flume; HDFS.
  • 34. Sqoop
  • 35. Recommendations with Hadoop. Diagram components: Customers; Web Application; Relational Databases; Sqoop; Logs; Flume; HDFS.
  • 36. Pig
  • 37. Recommendations with Hadoop. Diagram components: Customers; Web Application; Relational Databases; Sqoop; Logs; Flume; HDFS; Pig.
  • 38. Mahout
  • 39. Mahout • Scalable machine learning algorithms – Collaborative Filtering – User and Item based recommenders – K-Means, Fuzzy K-Means clustering – Mean Shift clustering – Singular value decomposition – Complementary Naive Bayes classifier …
  • 40. Recommendations with Hadoop. Diagram components: Customers; Web Application; Relational Databases; Sqoop; Logs; Flume; HDFS; Pig; Mahout.
  • 41. MapReduce
  • 42. MapReduce: map, shuffle, reduce. Diagram: toOne() map tasks emit :1 for each input record; the shuffle groups the values into lists (:[1,1,1,1], :[1,1], :[1,1], :[1]); count() reduce tasks output :4, :2, :2, and :1. The keys are not present in the transcript.
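The three phases in this diagram can be sketched on a single machine as follows. The names `to_one` and `count` mirror the slide's toOne() and count() labels; `shuffle`, `run_job`, and the sample letters are hypothetical stand-ins, since the slide's keys are not in the transcript. Real Hadoop runs these phases across many nodes.

```python
from collections import defaultdict


def to_one(record):
    """Map: emit the pair (key, 1) for one input record."""
    return (record, 1)


def shuffle(pairs):
    """Shuffle: collect all values emitted for the same key into one list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def count(key, values):
    """Reduce: sum the list of ones for a key."""
    return (key, sum(values))


def run_job(splits):
    """Run map over every record of every split, then shuffle, then reduce."""
    mapped = [to_one(record) for split in splits for record in split]
    grouped = shuffle(mapped)
    return dict(count(key, values) for key, values in grouped.items())


# Three input splits, one per map task, with made-up keys.
splits = [["a", "b", "a"], ["a", "c", "b"], ["a", "c", "d"]]
result = run_job(splits)
print(result)
```

With these nine sample records the grouped lists have lengths 4, 2, 2, and 1, matching the shape of the slide's example.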
  • 43. MapReduce • Distributed • Code to data • Reliable • Scalable
  • 44. Recommendations with Hadoop. Diagram components: Customers; Web Application; Relational Databases; Sqoop; Logs; Flume; HDFS; Pig; Mahout; MapReduce; Pig.
  • 45. Oozie
  • 46. Oozie • Workflows • Coordinator – Triggers
  • 47. Recommendations with Hadoop. Diagram components: Customers; Web Application; Relational Databases; Sqoop; Logs; Flume; HDFS; Oozie; Pig; Mahout; MapReduce; Pig.
  • 48. HBase
  • 49. HBase • Key/value store • Data stored in HDFS • Access model is get/put/del – Plus range scans and versions • Random reads and writes for Hadoop
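The access model on this slide (get/put/del plus range scans and versions) can be illustrated with an in-memory toy. It is not the HBase client API and it ignores column families, persistence in HDFS, and distribution; the class `ToyTable` and its method signatures are hypothetical.

```python
import bisect


class ToyTable:
    """Sorted key/value store that keeps several timestamped versions per row."""

    def __init__(self):
        self._keys = []   # row keys kept in sorted order for range scans
        self._cells = {}  # row key -> list of (timestamp, value), newest first

    def put(self, row, value, timestamp):
        if row not in self._cells:
            bisect.insort(self._keys, row)
            self._cells[row] = []
        self._cells[row].append((timestamp, value))
        self._cells[row].sort(reverse=True)  # newest version first

    def get(self, row, versions=1):
        """Return up to `versions` values for the row, newest first."""
        return [value for _, value in self._cells.get(row, [])[:versions]]

    def delete(self, row):
        if row in self._cells:
            del self._cells[row]
            self._keys.remove(row)

    def scan(self, start, stop):
        """Return (row, newest value) for start <= row < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(key, self.get(key)[0]) for key in self._keys[lo:hi]]
```

Keeping the row keys sorted is what makes a range scan a contiguous slice rather than a search over every row.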
  • 50. Recommendations with Hadoop. Diagram components: Customers; Web Application; Relational Databases; Sqoop; Logs; Flume; HDFS; HBase; Oozie; Pig; Mahout; MapReduce; Pig.
  • 51. Business Intelligence
  • 52. Business Intelligence with Hadoop. Diagram components: Analysts; BI / Analytics; Relational Databases; Logs.
  • 53. Flume
  • 54. Business Intelligence with Hadoop. Diagram components: Analysts; BI / Analytics; Relational Databases; Logs; Flume.
  • 55. HDFS
  • 56. Business Intelligence with Hadoop. Diagram components: Analysts; BI / Analytics; Relational Databases; Logs; Flume; HDFS.
  • 57. Sqoop
  • 58. Business Intelligence with Hadoop. Diagram components: Analysts; BI / Analytics; Relational Databases; Sqoop; Logs; Flume; HDFS.
  • 59. Hive
  • 60. Hive • Data warehouse • Ad-hoc queries – Not real-time (minutes) • SQL • Tables • Joins
  • 61. Business Intelligence with Hadoop. Diagram components: Analysts; BI / Analytics; Relational Databases; Sqoop; Logs; Flume; HDFS; Hive.
  • 62. MapReduce
  • 63. Business Intelligence with Hadoop. Diagram components: Analysts; BI / Analytics; Relational Databases; Sqoop; Logs; Flume; HDFS; Hive; MapReduce.
  • 64. Oozie
  • 65. Business Intelligence with Hadoop. Diagram components: Analysts; BI / Analytics; Relational Databases; Sqoop; Logs; Flume; HDFS; Oozie; Hive; MapReduce.
  • 66. HBase
  • 67. Business Intelligence with Hadoop. Diagram components: Analysts; BI / Analytics; Relational Databases; Sqoop; Logs; Flume; HDFS; HBase; Oozie; Hive; MapReduce.
  • 68. Hive
  • 69. Hive for Business Intelligence • JDBC – JasperReports* – Pentaho* • ODBC – MicroStrategy*^. * Vendor certified connector. ^ Cloudera certified connector
  • 70. Business Intelligence with Hadoop. Diagram components: Analysts; BI / Analytics; Relational Databases; Sqoop; Logs; Flume; HDFS; Hive; HBase; Oozie; Hive; MapReduce.
  • 71. CDH. File System Mount: FUSE-DFS; UI Framework: HUE; SDK: HUE SDK; Workflow: Apache Oozie*; Scheduling: Apache Oozie*; Metadata: Apache Hive; Languages / Compilers: Apache Pig, Apache Hive; Data Integration: Apache Flume*, Apache Sqoop*; Fast Read/Write Access: Apache HBase; Coordination: Apache ZooKeeper. *currently under incubation in the Apache Software Foundation
  • 72. What’s next? • Cloudera Training Videos • CDH Virtual Machines • Hadoop: The Definitive Guide, 2nd Edition • Cloudera University – Developer Training in Columbia, MD (Dec 13-16, Feb 13-16) – Administrator Training in Herndon, VA (Jan 4-6) – Private Training
  • 73. We’re Hiring! • http://www.cloudera.com/company/careers/ • Customer Operations – Customer Operations Engineer – Customer Operations Tools Developer • Customer Solutions – Solutions Architect • Engineering – Senior Data Integration Developer – Senior Distributed Systems Engineer – Senior UI Engineer – Software Quality Engineer – Technical Writer • IT/Operations – Systems Administrator