Extending Your Data Infrastructure with Hadoop

Jonathan Seidman | Solutions Architect
Big Data TechCon
April 10, 2013

©2013 Cloudera, Inc. All Rights Reserved.
Who I Am
    • Solutions Architect, Partner Engineering Team.
    • Co-founder/organizer of Chicago Hadoop User Group and
      Chicago Big Data.
    • jseidman@cloudera.com
    • @jseidman
    • cloudera.com/careers




What I’ll be Talking About
    •   Big data challenges with current data integration approaches.
    •   How is Hadoop being leveraged with existing data infrastructures?
    •   Hadoop integration – the big picture.
    •   Deeper dive into tool categories.
          •   Data import/export
          •   Data Integration
          •   BI/Analytics
    •   Putting the pieces together.
    •   BI/Analytics with Hadoop.
    •   New approaches to data analysis with Hadoop.

What is Apache Hadoop?

Apache Hadoop is an open source platform for data storage and processing that is:
  • Scalable
  • Fault tolerant
  • Distributed

CORE HADOOP SYSTEM COMPONENTS
  • Hadoop Distributed File System (HDFS) – self-healing, high-bandwidth clustered storage.
  • MapReduce – distributed computing framework.

 Has the Flexibility to Store and Mine Any Type of Data
   • Ask questions across structured and unstructured data that were previously impossible to ask or solve.
   • Not bound by a single schema.
 Excels at Processing Complex Data
   • Scale-out architecture divides workloads across multiple nodes.
   • Flexible file system eliminates ETL bottlenecks.
 Scales Economically
   • Can be deployed on commodity hardware.
   • Open source platform guards against vendor lock-in.
Current Challenges
    Limitations of Existing Data Management Systems




The Transforming of Transformation

Enterprise Applications, OLTP, ODS → Extract/Transform/Load → Data Warehouse (Query, Transform) → Business Intelligence
Volume, Velocity, Variety Cause Capacity Problems

1  Slow Data Transformations = Missed ETL SLAs.
2  Slow Queries = Frustrated Business Users.

Enterprise Applications, OLTP → Extract/Transform/Load (1) → Data Warehouse (Query (2), Transform (1)) → Business Intelligence
Data Warehouse Optimization

Enterprise Applications, OLTP, ODS → ETL → Hadoop (Store, Transform, Query) → Data Warehouse (Query – high $/byte) → Business Intelligence
The Key Benefit: Agility/Flexibility

Schema-on-Write (RDBMS):
• Prescriptive Data Modeling:
    • Create static DB schema
    • Transform data into RDBMS
    • Query data in RDBMS format
• New columns must be added explicitly before new data can propagate into the system.
• Good for Known Unknowns (Repetition)

Schema-on-Read (Hadoop):
• Descriptive Data Modeling:
    • Copy data in its native format
    • Create schema + parser
    • Query data in its native format
• New data can start flowing any time and will appear retroactively once the schema/parser properly describes it.
• Good for Unknown Unknowns (Exploration)
Not Just Transformation
     Other Ways Hadoop is Being Leveraged




Data Archiving Before Hadoop

Data Warehouse → Tape Archive
Active Archiving with Hadoop

Data Warehouse → Hadoop
Offloading Analysis

Data Warehouse → Business Intelligence ← Hadoop
Exploratory Analysis

Developers, Business Users, Analysts → Hadoop ← Data Warehouse
The Common Themes?


     1   Offload expensive storage and processing
         to Hadoop
         • Complement, not replace


     2   Reduce strain on the data warehouse
         • Let it focus on what it was designed to do:
            • High speed queries on high value relational data
         • Increase ROI of existing relational stores



Economics: Return on Byte

Return on Byte (ROB) = Value of Data / Cost of Storing Data

• High ROB data
• Low ROB data (but still a ton of aggregate value)
Use Case: A Major Financial Institution

The Challenge:
• Current EDW at capacity; cannot support growing data depth and width.
• Performance issues in business critical apps; little room for innovation.

Before – data warehouse workload: Operational (44%), ELT Processing (42%), Analytics (11%).
After – data warehouse: Operational (50%), Analytics (50%); Hadoop: analytics processing and storage.

The Solution:
• Hadoop offloads data storage (S), processing (T) & some analytics (Q) from the EDW.
• EDW resources can now be focused on repeatable operational analytics.
• Month data scan in 4 secs vs. 4 hours.
Hadoop Integration
     Some Definitions




Data Integration
     • Process in which heterogeneous data from multiple sources is
       retrieved and transformed to provide a unified view.
     • ETL (Extract, transform and load) is a central component of DI.




ETL – The Wikipedia Definition
     •   Extract, transform and load (ETL) is a process in database
         usage and especially in data warehousing that involves:
           • Extracting data from outside sources
           • Transforming it to fit operational needs
           • Loading it into the end target (DB or data warehouse)




                   http://en.wikipedia.org/wiki/Extract,_transform,_load

BI – The Forrester Research Definition

     "Business Intelligence is a set of methodologies, processes,
     architectures, and technologies that transform raw data into
     meaningful and useful information used to enable more effective
     strategic, tactical, and operational insights and decision-making.”
     *




     * http://en.wikipedia.org/wiki/Business_intelligence

Hadoop Integration
     The Big Picture




[Diagram: the big picture – Data Warehouse/RDBMS and Streaming Data flow through Data Import/Export tools into Hadoop; Data Integration tools, BI/Analytics tools, and NoSQL systems integrate alongside the cluster.]
Example Use Case

• Online retailer.
• Customer, order data stored in data warehouse.
Example Use Case

• Now wants to leverage behavioral (non-transactional) data, e.g. products viewed on-line, to drive recommendations, etc.
So Where is This Data?
     • Record of page views is stored in session logs as users browse
       site.
     • So how do we get it out?


    [2002/11/27 18:58:28.294 -0600] "GET /products/view/952 HTTP/1.1" 200 701 "-"
    "Mozilla/5.0 (PlayBook; U; RIM Tablet OS 2.0.1; en-US) AppleWebKit/535.8+
    (KHTML, like Gecko) Version/7.2.0.1 Safari/535.8+" "age=63&gender=0&
    incomeCategory=4&session=51620033&user=-2118869394&region=9&userType=0"



Load Raw Logs into Data Warehouse?

• Very expensive to store.
• Difficult to model and process semi-structured data.
• Oh, and also, very expensive.

[Diagram: Web Servers → Logs → DWH]
ETL In/Into Data Warehouse?

• Time and resource intensive with larger log sizes.
• No archive of raw logs – potentially valuable data is thrown away.
    • How do you decide which fields have value?
• Still, some companies are doing things like this.

[Diagram: Web Servers → Logs → ETL → DWH]
Hadoop Integration
     Data Import/Export Tools




Data Import/Export Tools

[Diagram: Data Warehouse/RDBMS and Streaming Data flow through Data Import/Export tools into Hadoop.]
Flume in 2 Minutes
          Or, why you shouldn’t be using scripts for data movement.


     • Reliable, distributed, and available system for efficient
       collection, aggregation and movement of streaming data, e.g.
       logs.
     • Open-source, Apache project.




Flume in 2 Minutes

A Flume agent is a JVM process hosting components:

External Source (web server, Twitter, JMS, system logs, …) → Source → Channel → Sink → Destination

• Source – consumes events and forwards them to channels.
• Channel – stores events until consumed by sinks (file, memory, JDBC).
• Sink – removes events from the channel and puts them into the external destination.
Flume in 2 Minutes
     • Reliable – events are stored in channel until delivered to next stage.
     • Recoverable – events can be persisted to disk and recovered in the
       event of failure.

Flume Agent: Source → Channel → Sink → Destination
Flume in 2 Minutes
     • Supports multi-hop flows for more complex processing.
     • Also fan-out, fan-in.




Flume Agent (Source → Channel → Sink) → Flume Agent (Source → Channel → Sink)
Flume in 2 Minutes

     • Declarative
        • No coding required.
        • Configuration specifies
          how components are
          wired together.




Flume in 2 Minutes
     •   Similar systems:
           • Scribe
           • Chukwa




Sqoop Overview
     • Apache project designed to ease import and export of data
       between Hadoop and relational databases.
     • Provides functionality to do bulk imports and exports of data
       with HDFS, Hive and HBase.
     • Java based. Leverages MapReduce to transfer data in parallel.




Sqoop Overview
     • Uses a “connector” abstraction.
     • Two types of connectors
           • Standard connectors are JDBC based.
           • Direct connectors use native database interfaces to improve
             performance.
     •   Direct connectors are available for many open-source and
         commercial databases – MySQL, PostgreSQL, Oracle, SQL
         Server, Teradata, etc.
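
As an illustrative sketch (database, table, and paths here are hypothetical), a direct-mode import from MySQL might look like:

```shell
# Hypothetical connection string and table; --direct switches from the
# generic JDBC path to the native MySQL connector for faster transfers.
sqoop import \
  --connect jdbc:mysql://db.example.com/retail \
  --table orders \
  --direct \
  --target-dir /data/orders
```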

Sqoop Import Flow

1. Client runs the import; Sqoop collects metadata from the database.
2. Sqoop generates code and executes a MapReduce job.
3. Map tasks pull data from the database in parallel and write to Hadoop.
Sqoop Limitations
     Sqoop has some limitations, including:
     •   Poor support for security.
                     $ sqoop import --username scott --password tiger…
           •   Sqoop can read command line options from an option file, but this still
               has holes.
      • Error-prone syntax.
     • Tight coupling to JDBC model – not a good fit for non-RDBMS
       systems.
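
One partial mitigation (file name and connection details are hypothetical): keep credentials out of the command line with an options file, and use -P to prompt for the password instead of passing it as an argument:

```shell
# import-opts.txt would contain one option per line, e.g.:
#   import
#   --connect
#   jdbc:mysql://db.example.com/retail
#   --username
#   scott
#   -P
sqoop --options-file import-opts.txt --table orders
```

The options file still needs filesystem permissions to protect it, which is why the slide calls this approach incomplete.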


Fortunately…
     Sqoop 2 (incubating) will address many of these
     limitations:
     • Adds a web-based GUI.
     • Centralized configuration.
     • More flexible model.
     • Improved security model.




MapReduce For Transformation
     •   Standard interface is Java, but higher-level interfaces are
         commonly used:
            • Apache Hive – provides a SQL-like interface to data in Hadoop.
            • Apache Pig – dataflow language for declaring a sequence of
              transformations.
     •   Both Hive and Pig convert queries into MapReduce jobs and
         submit to Hadoop for execution.
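
As an example of the Hive approach (table and column names are hypothetical), a simple aggregation like this is compiled into MapReduce jobs behind the scenes:

```sql
-- Count page views per product; Hive plans and runs this as MapReduce.
SELECT product_id, COUNT(*) AS views
FROM page_views
GROUP BY product_id;
```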


Example Implementation with OSS Tools
     All the tools we need for moving and transforming data:
     •   Hadoop provides:
           • HDFS for storage
           • MapReduce for Processing
     •   Also components for process orchestration:
           •   Oozie, Azkaban
     •   And higher-level abstractions:
           •   Pig, Hive, etc.
Data Flow with OSS Tools

Web Servers → Raw Logs (Flume, etc.) → Hadoop (Transform) → Load (Sqoop, etc.)

Process orchestration: Oozie, etc.
Flume Configuration for Example Use Case

• Spooling source watches a directory for new files and moves them into channels. Renames files when processed.
• HDFS sink ingests files into HDFS.
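
The configuration itself was shown as an image in the original slide; a minimal sketch of an agent wiring a spooling directory source to an HDFS sink (agent name and paths are assumptions) might look like:

```
# Components of agent "a1"
a1.sources = src1
a1.channels = ch1
a1.sinks = snk1

# Spooling directory source: watches for new files, renames them once ingested
a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /var/spool/weblogs
a1.sources.src1.channels = ch1

# File channel buffers events until the sink consumes them
a1.channels.ch1.type = file

# HDFS sink writes events into Hadoop
a1.sinks.snk1.type = hdfs
a1.sinks.snk1.hdfs.path = /data/weblogs/raw
a1.sinks.snk1.channel = ch1
```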
Pig Code for Example Use Case




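
The Pig script was shown as an image in the original slide; a hedged sketch of the kind of transformation it performs (field extraction and paths are assumptions based on the log and output formats shown nearby) might be:

```
-- Load raw log lines and emit timestamp|user|product for product views
raw   = LOAD '/data/weblogs/raw' USING TextLoader() AS (line:chararray);
views = FOREACH raw GENERATE
          REGEX_EXTRACT(line, '\\[(.+?)\\]', 1)           AS ts,
          REGEX_EXTRACT(line, 'user=(-?\\d+)', 1)         AS user,
          REGEX_EXTRACT(line, '/products/view/(\\d+)', 1) AS product;
hits  = FILTER views BY product IS NOT NULL;
STORE hits INTO '/data/weblogs/views' USING PigStorage('|');
```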
Importing Final Data into DWH
     Output from Pig script stored in HDFS:

        2012-09-16T23:03:16.294Z|1461333428|290
        2012-09-20T04:48:52.294Z|772136124|749
        2012-09-24T03:51:16.294Z|1144520081|222
        2012-09-24T12:29:40.294Z|628304774|407

     Moved into destination table with Sqoop:
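
The Sqoop command was shown as an image; a sketch (connection string and table name are hypothetical, the delimiter follows the output above) might look like:

```shell
sqoop export \
  --connect jdbc:mysql://dwh.example.com/retail \
  --table product_views \
  --export-dir /data/weblogs/views \
  --input-fields-terminated-by '|'
```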




But…
     •   Some DI services are not provided in this stack:
           • Metadata repository
           • Master Data Management
           • Data lineage
           • …




Also…
• …very low level:
    • Requires knowledgeable developers to implement transformations. Not a whole lot of these right now.

[Diagram: Hadoop Developers vs. Data Modelers, ETL Developers, etc.]
Hadoop Integration
     Data Integration Tools




Data Integration Tools




Pentaho
     • Existing BI tools extended to support Hadoop.
     • Provides data import/export, transformation, job orchestration,
       reporting, and analysis functionality.
      • Supports integration with HDFS, Hive and HBase.
     • Community and Enterprise Editions offered.




Pentaho
     • Primary component is Pentaho
       Data Integration (PDI), also
       known as Kettle.
     • PDI Provides a graphical drag-
       and-drop environment for
       defining ETL jobs, which interface
       with Java MapReduce to execute
       in-cluster transformations.


Pentaho/Cloudera Demo
     •   Ingest data into HDFS using Flume
     •   Pre-process the reference data
     •   Copy reference files into Hadoop
     •   Execute transformations in-cluster
     •   Load Hive
     •   Query Hive
     •   Discover, Analyze and Visualize

Pentaho MapReduce

Raw web log event:

96.239.76.17 - - [31/Dec/2000:14:11:59 -0800] "GET /rate?movie=1207&rating=4 HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11" "USER=1"

Transformed, joined record:

5|Monty Python's Life of Brian|1979|5794|M|35-44|Salesman|53703|Madison|WI|2000|5|5|43295|20th|false|false|false|false|true|false|false|false|false|false|false|false|false|false|false|false|false|false
Structure → Analysis & Visualization

5|Monty Python's Life of Brian|1979|5794|M|35-44|Salesman|53703|Madison|WI|2000|5|5|43295|20th|false|false|false|false|true|false|false|false|false|false|false|false|false|false|false|false|false|false
...
Informatica

• Data import/export
• Metadata services
• Data lineage
• Transformation
• …
Informatica – Data Import

Access Data → Pre-Process → Ingest Data

• Sources: web servers; databases and data warehouses; message queues, email, social media; ERP, CRM; mainframe.
• PowerExchange accesses the data (batch, CDC, real-time); PowerCenter pre-processes it (e.g. filter, join, cleanse).
• Data is ingested into HDFS and Hive.
Informatica – Data Export

Extract Data → Post-Process → Deliver Data

• Data is extracted from HDFS; PowerCenter post-processes it (e.g. transform to target schema).
• PowerExchange delivers the data (batch, real-time) to web servers, databases and data warehouses, ERP, CRM, and mainframe systems.
Informatica Data Import/Export

1. Create ingest or extract mapping.
2. Create Hadoop connection.
3. Configure workflow.
4. Configure Hive properties.
Informatica – Data Transformation




Hadoop Integration
     Business Intelligence/Analytic Tools




Business Intelligence/Analytics Tools




Business Intelligence/Analytics Tools

[Diagram: BI/analytics tools connecting to relational databases, data warehouses, …]
ODBC Driver

• Most of these tools use the ODBC standard.
• Since Hive is a SQL-like system it’s a good fit for ODBC.
• Several vendors, including Cloudera, make ODBC drivers available for Hadoop.
• JDBC is also used by some products for Hive integration.

[Diagram: BI/Analytics Tools → ODBC driver → HiveQL → Hive Server → Hive]
Hive Integration

HiveServer1
• No support for concurrent queries. Requires running multiple HiveServers for multiple users.
• No support for security.
• The Thrift API in the Hive Server doesn’t support common JDBC/ODBC calls.

HiveServer2
• Adds support for concurrent queries. Can support multiple users.
• Adds security support with Kerberos.
• Better support for JDBC and ODBC.
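
For example, a client can connect to HiveServer2 over JDBC with the Beeline CLI (host, port, and user here are hypothetical):

```shell
beeline -u "jdbc:hive2://hiveserver.example.com:10000/default" -n analyst
```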
Still Some Limitations With This Model
     • Hive does not have full SQL support.
     • Dependent on Hive – data must be loaded in Hive to be
       available.
     • Queries are high-latency.




Hadoop Integration
     Next Generation BI/Analytics Tools




New “Hadoop Native” Tools

You can think of Hadoop as becoming a shared execution environment supporting new data analysis tools…

[Diagram: BI/Analytics tools and New Query Engines running on Hadoop alongside MapReduce.]
Hadoop Native Tools – Advantages
     •   New data analysis tools:
           • Designed and optimized for working with Hadoop data and large
             data sets.
           • Remove reliance on Hive for accessing data – can work with any
             data in Hadoop.
     •   New query engines:
           • Provide the ability to run low-latency queries against Hadoop data.
           • Make it possible to do ad hoc, exploratory analysis of data in
             Hadoop.
Datameer




Datameer




New Query Engines – Impala
     •   Fast, interactive queries on data stored in Hadoop (HDFS and HBase).
           •   But also designed to support long running queries.
     •   Uses familiar Hive Query Language and shares metastore.
     •   Tight integration with Hadoop.
           •   Reads common Hadoop file formats.
           •   Runs on Hadoop DataNodes.
     •   High Performance
           •   C++, not Java.
           •   Runtime code generation.
           •   Entirely re-designed execution engine bypasses MapReduce.
     •   Currently in beta, GA expected in April.
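     Because Impala speaks HiveQL and shares the Hive metastore, existing JDBC-based tooling can often be pointed at an impalad with little more than a URL change. The sketch below illustrates this (hostnames are hypothetical; 21050 is Impala's default JDBC port, and early releases required disabling SASL via ";auth=noSasl" on unsecured clusters – check the docs for your version):

```java
// Sketch: the same HiveQL statement addressed to Hive vs. Impala over JDBC.
// Hostnames are hypothetical; 21050 is Impala's default JDBC port.
public class ImpalaJdbcExample {

    static String hiveUrl(String host) {
        return "jdbc:hive2://" + host + ":10000/default";
    }

    // On an unsecured cluster, early Impala releases required ";auth=noSasl"
    // when connecting with the Hive JDBC driver.
    static String impalaUrl(String host) {
        return "jdbc:hive2://" + host + ":21050/;auth=noSasl";
    }

    public static void main(String[] args) {
        // One HiveQL statement: batch execution via Hive, interactive
        // execution via Impala, because both read the same tables from
        // the shared metastore.
        String query = "SELECT product, SUM(sales) FROM orders GROUP BY product";
        System.out.println(hiveUrl("gateway.example.com") + " -> " + query);
        System.out.println(impalaUrl("impalad01.example.com") + " -> " + query);
        // With a live cluster:
        //   Connection conn = DriverManager.getConnection(impalaUrl(host));
        //   conn.createStatement().executeQuery(query);
    }
}
```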

Impala Architecture
     [Diagram: a common Hive SQL/ODBC interface (SQL App → ODBC) and unified
     metadata and scheduling (Hive Metastore, YARN, HDFS NameNode, State Store)
     sit above a fully MPP, distributed layer in which every node runs a Query
     Planner, Query Coordinator, and Query Exec Engine alongside an HDFS
     DataNode and HBase]
Cloudera Impala Details
Client submits query through ODBC
     [Diagram: the SQL request flows from the SQL App through the ODBC driver
     to the Query Planner on one impalad]
Cloudera Impala Details
Planner turns request into a collection of plan fragments.
Coordinator initiates execution on remote impalads.
     [Diagram: plan fragments fan out from the coordinating impalad to the
     Query Exec Engines on every node – fully MPP, distributed]
Cloudera Impala Details
Impalads participating in query access local data in HDFS or HBase
     [Diagram: each participating impalad performs local direct reads from its
     co-located HDFS DataNode or HBase]
Cloudera Impala Details
Intermediate results are streamed between impalads.
Final results are streamed back to the client.
     [Diagram: in-memory transfers move intermediate results between impalads;
     the coordinator streams SQL results back through ODBC to the SQL App]
BI Example – Tableau with Impala




Development Challenges
     •   According to TDWI research*:
             • 28% of users feel software tools are few and immature.
             • 25% note the lack of metadata management.




         *TDWI Best Practices Report: Integrating Hadoop Into Business Intelligence and Data Warehousing, Philip Russom, TDWI Research:
         http://tdwi.org/research/2013/04/tdwi-best-practices-report-integrating-hadoop-into-business-intelligence-and-data-warehousing.aspx


The Cloudera Developer Kit
     •   The CDK is an open-source collection of libraries, tools, examples, and
         documentation targeted at simplifying the most common tasks when
         working with Hadoop.
     •   The first module released is the CDK Data module – APIs that drastically
         simplify working with datasets in Hadoop filesystems. The Data module
         provides:
           •   Automatic serialization and deserialization of Java POJOs as well as
               Avro Records.
           •   Automatic compression.
           •   File and directory layout and management.
           •   Automatic partitioning based on configurable functions.
           •   A metadata provider plugin interface to integrate with centralized
               metadata management systems.
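     To make the partitioning point concrete, here is an illustrative sketch of what "automatic partitioning based on configurable functions" means: a function maps each record to a directory path, e.g. by hashing a field into a fixed number of buckets. This shows the concept only – it is not the CDK Data API itself (see the GitHub repository for the real interfaces):

```java
// Illustrative sketch of hash-based partitioning – the concept behind the
// CDK Data module's "automatic partitioning", not the CDK API itself.
public class PartitionSketch {

    // A configurable partition function: bucket records by a hash of a field.
    // Records with the same key deterministically land in the same directory,
    // so queries restricted to a key can skip the other buckets entirely.
    static String partitionPath(String dataset, String key, int buckets) {
        int bucket = Math.floorMod(key.hashCode(), buckets);
        return dataset + "/bucket=" + bucket;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition directory:
        System.out.println(partitionPath("/data/users", "alice", 16));
        System.out.println(partitionPath("/data/users", "alice", 16));
        System.out.println(partitionPath("/data/users", "bob", 16));
    }
}
```

     A framework that owns this mapping can also own the file layout and metadata bullets above, which is why bundling them into one Data module removes so much boilerplate from Hadoop applications.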

Cloudera Developer Kit
     •   Source code, examples, documentation, etc.:
           •   https://github.com/cloudera/cdk




Questions?
     •   Or see me at the Cloudera booth – 11:00-1:00.





Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011
 
Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011
 
Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010
 
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
 

Dernier

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Dernier (20)

DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

Extending data infrastructure with Hadoop

  • 1. Extending Your Data Infrastructure with Hadoop. Jonathan Seidman | Solutions Architect. Big Data TechCon, April 10, 2013. ©2013 Cloudera, Inc. All Rights Reserved.
  • 2. Who I Am • Solutions Architect, Partner Engineering Team. • Co-founder/organizer of Chicago Hadoop User Group and Chicago Big Data. • jseidman@cloudera.com • @jseidman • cloudera.com/careers ©2013 Cloudera, Inc. All Rights 2 Reserved.
  • 3. What I’ll be Talking About • Big data challenges with current data integration approaches. • How is Hadoop being leveraged with existing data infrastructures? • Hadoop integration – the big picture. • Deeper dive into tool categories. • Data import/export • Data integration • BI/analytics • Putting the pieces together. • BI/analytics with Hadoop. • New approaches to data analysis with Hadoop. ©2013 Cloudera, Inc. All Rights Reserved.
  • 4. What is Apache Hadoop? Apache Hadoop is an open source platform for data storage and processing that is scalable, fault tolerant, and distributed. Core Hadoop system components: the Hadoop Distributed File System (HDFS), a self-healing, high-bandwidth clustered storage layer, and MapReduce, a distributed computing framework. Has the flexibility to store and mine any type of data: ask questions across structured and unstructured data that were previously impossible to ask or solve; not bound by a single schema. Excels at processing complex data: scale-out architecture divides workloads across multiple nodes; a flexible file system eliminates ETL bottlenecks. Scales economically: can be deployed on commodity hardware; an open source platform guards against vendor lock-in. ©2013 Cloudera, Inc. All Rights Reserved.
  • 5. Current Challenges Limitations of Existing Data Management Systems ©2013 Cloudera, Inc. All Rights 5 Reserved.
  • 6. The Transforming of Transformation (Diagram: Enterprise Applications, OLTP, and ODS sources feed Extract, Transform, and Load steps into the Data Warehouse, where a further Transform runs; Business Intelligence queries the warehouse.) ©2013 Cloudera, Inc. All Rights Reserved.
  • 7. Volume, Velocity, Variety Cause Capacity Problems: 1 Slow data transformations = missed ETL SLAs. 2 Slow queries = frustrated business users. (Diagram: the slide-6 pipeline from Enterprise Applications/OLTP through Extract-Transform-Load into the Data Warehouse and on to Business Intelligence, with the transform steps marked 1 and the BI query marked 2.) ©2013 Cloudera, Inc. All Rights Reserved.
  • 8. Data Warehouse Optimization (Diagram: Enterprise Applications, OLTP, and ODS feed Hadoop, which stores raw data and runs the ETL transform, and which can itself be queried; results load into the high-$/byte Data Warehouse, which Business Intelligence queries.) ©2013 Cloudera, Inc. All Rights Reserved.
  • 9. The Key Benefit: Agility/Flexibility Schema-on-Write (RDBMS): Schema-on-Read (Hadoop): • Prescriptive Data Modeling: • Descriptive Data Modeling: • Create static DB schema • Copy data in its native format • Transform data into RDBMS • Create schema + parser • Query data in RDBMS format • Query Data in its native format • New columns must be added explicitly • New data can start flowing any time and before new data can propagate into will appear retroactively once the the system. schema/parser properly describes it. • Good for Known Unknowns • Good for Unknown Unknowns (Repetition) (Exploration) ©2013 Cloudera, Inc. All Rights 9 Reserved.
  • 10. Not Just Transformation Other Ways Hadoop is Being Leveraged ©2013 Cloudera, Inc. All Rights 10 Reserved.
  • 11. Data Archiving Before Hadoop Data Tape Warehouse Archive ©2013 Cloudera, Inc. All Rights 11 Reserved.
  • 12. Active Archiving with Hadoop Data Hadoop Warehouse ©2013 Cloudera, Inc. All Rights 12 Reserved.
  • 13. Offloading Analysis Data Warehouse Business Intelligence Hadoop ©2013 Cloudera, Inc. All Rights 13 Reserved.
  • 14. Exploratory Analysis Developers Business Analysts Users Hadoop Data Warehouse ©2013 Cloudera, Inc. All Rights 14 Reserved.
  • 15. The Common Themes? 1 Offload expensive storage and processing to Hadoop • Complement, not replace 2 Reduce strain on the data warehouse • Let it focus on what it was designed to do: • High speed queries on high value relational data • Increase ROI of existing relational stores ©2013 Cloudera, Inc. All Rights 15 Reserved.
  • 16. Economics: Return on Byte. Return on Byte (ROB) = Value of Data / Cost of Storing Data. High ROB vs. low ROB (but still a ton of aggregate value). ©2013 Cloudera, Inc. All Rights Reserved.
  • 17. Use Case: A Major Financial Institution. The Challenge: • Current EDW at capacity; cannot support growing data depth and width. • Performance issues in business-critical apps; little room for innovation. The Solution: • Hadoop offloads data storage (S), processing (T) and some analytics (Q) from the EDW. • EDW resources can now be focused on repeatable operational analytics. • A month of data scanned in 4 secs vs. 4 hours. (Workload chart: the warehouse's roughly even operational/analytics split shifts as ELT processing and analytics storage move to Hadoop.) ©2013 Cloudera, Inc. All Rights Reserved.
  • 18. Hadoop Integration Some Definitions ©2013 Cloudera, Inc. All Rights 18 Reserved.
  • 19. Data Integration • Process in which heterogeneous data from multiple sources is retrieved and transformed to provide a unified view. • ETL (Extract, transform and load) is a central component of DI. ©2013 Cloudera, Inc. All Rights 19 Reserved.
  • 20. ETL – The Wikipedia Definition • Extract, transform and load (ETL) is a process in database usage and especially in data warehousing that involves: • Extracting data from outside sources • Transforming it to fit operational needs • Loading it into the end target (DB or data warehouse) http://en.wikipedia.org/wiki/Extract,_transform,_load ©2013 Cloudera, Inc. All Rights 20 Reserved.
  • 21. BI – The Forrester Research Definition "Business Intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making.” * * http://en.wikipedia.org/wiki/Business_intelligence ©2013 Cloudera, Inc. All Rights 21 Reserved.
  • 22. Hadoop Integration The Big Picture ©2013 Cloudera, Inc. All Rights 22 Reserved.
  • 23. BI/Analytics Tools Data Warehouse /RDBMS Streaming Data Data Import/Export Data Integration Tools NoSQL ©2013 Cloudera, Inc. All Rights 23 Reserved.
  • 24. Example Use Case ©2013 Cloudera, Inc. All Rights 24 Reserved.
  • 25. Example Use Case • Online retailer. • Customer, order data stored in data warehouse. ©2013 Cloudera, Inc. All Rights 25 Reserved.
  • 26. Example Use Case • Now wants to leverage behavioral (non- transactional) data, e.g. products viewed on-line to drive recommendations, etc. ©2013 Cloudera, Inc. All Rights 26 Reserved.
  • 27. So Where is This Data? • Record of page views is stored in session logs as users browse site. • So how do we get it out? [2002/11/27 18:58:28.294 -0600] "GET /products/view/952 HTTP/1.1" 200 701 "-" "Mozilla/5.0 (PlayBook; U; RIM Tablet OS 2.0.1; en-US) AppleWebKit/535.8+ (KHTML, like Gecko) Version/7.2.0.1 Safari/535.8+" ”age=63&gender=0& incomeCategory=4&session=51620033&user=-2118869394&region=9&userType=0" ©2013 Cloudera, Inc. All Rights 27 Reserved.
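To make this concrete, here is a sketch (not from the deck) of pulling the behavioral fields out of a session-log line like the one above with plain Python. The regex and field names are assumptions based on the sample line, and the user-agent string is shortened for brevity:

```python
import re
from urllib.parse import parse_qs

# The session-log line from the slide (user-agent string shortened here).
line = ('[2002/11/27 18:58:28.294 -0600] "GET /products/view/952 HTTP/1.1" '
        '200 701 "-" "Mozilla/5.0" '
        '"age=63&gender=0&incomeCategory=4&session=51620033'
        '&user=-2118869394&region=9&userType=0"')

# Timestamp, request, status, byte count, referrer, user agent, then the
# trailing key=value payload that carries the behavioral fields.
pattern = (r'\[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
           r'(?P<status>\d+) (?P<bytes>\d+) "[^"]*" "[^"]*" "(?P<payload>[^"]*)"')
fields = re.match(pattern, line).groupdict()
payload = {k: v[0] for k, v in parse_qs(fields['payload']).items()}

# A product-view line reduces to a (user, session, product id) tuple.
product_id = fields['path'].rsplit('/', 1)[-1]
event = (payload['user'], payload['session'], product_id)
```

Each page view collapses into a small tuple that the transformation stages later in the deck can aggregate.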
  • 28. Load Raw Logs into Data Warehouse? • Very expensive to store. • Difficult to model and process semi-structured Web Logs DWH data. Servers • Oh, and also, very expensive. ©2013 Cloudera, Inc. All Rights 28 Reserved.
  • 29. ETL In/Into Data Warehouse? • Time and resource intensive with larger log sizes. • No archive of raw logs – potentially valuable data is Logs ETL DWH Web thrown away. Servers • How do you decide which fields have value? • Still, some companies are doing things like this. ©2013 Cloudera, Inc. All Rights 29 Reserved.
  • 30. Hadoop Integration Data Import/Export Tools ©2013 Cloudera, Inc. All Rights 30 Reserved.
  • 31. BI/Analytics Tools Data Warehouse /RDBMS Streaming Data Data Import/Export Data Integration Tools NoSQL ©2013 Cloudera, Inc. All Rights 31 Reserved.
  • 32. Data Import/Export Tools Data Warehouse /RDBMS Streaming Data Data Import/Export ©2013 Cloudera, Inc. All Rights 32 Reserved.
  • 33. Flume in 2 Minutes Or, why you shouldn’t be using scripts for data movement. • Reliable, distributed, and available system for efficient collection, aggregation and movement of streaming data, e.g. logs. • Open-source, Apache project. ©2013 Cloudera, Inc. All Rights 33 Reserved.
  • 34. Flume in 2 Minutes A Flume Agent is a JVM process hosting components: the Source consumes events from an external source (web server, Twitter, JMS, system logs, …) and forwards them to channels; the Channel (file, memory, JDBC) stores events until consumed by sinks; the Sink removes events from the channel and puts them into an external destination. ©2013 Cloudera, Inc. All Rights Reserved.
  • 35. Flume in 2 Minutes • Reliable – events are stored in channel until delivered to next stage. • Recoverable – events can be persisted to disk and recovered in the event of failure. Flume Agent Source Channel Sink Destination ©2013 Cloudera, Inc. All Rights 35 Reserved.
  • 36. Flume in 2 Minutes • Supports multi-hop flows for more complex processing. • Also fan-out, fan-in. Flume Agent Flume Agent Sourc Channel Sink Sourc Channel Sink e e ©2013 Cloudera, Inc. All Rights 36 Reserved.
  • 37. Flume in 2 Minutes • Declarative • No coding required. • Configuration specifies how components are wired together. ©2013 Cloudera, Inc. All Rights 37 Reserved.
  • 38. Flume in 2 Minutes • Similar systems: • Scribe • Chukwa ©2013 Cloudera, Inc. All Rights 38 Reserved.
  • 39. Sqoop Overview • Apache project designed to ease import and export of data between Hadoop and relational databases. • Provides functionality to do bulk imports and exports of data with HDFS, Hive and HBase. • Java based. Leverages MapReduce to transfer data in parallel. ©2012 Cloudera, Inc. All Rights 39 Reserved.
  • 40. Sqoop Overview • Uses a “connector” abstraction. • Two types of connectors • Standard connectors are JDBC based. • Direct connectors use native database interfaces to improve performance. • Direct connectors are available for many open-source and commercial databases – MySQL, PostgreSQL, Oracle, SQL Server, Teradata, etc. ©2012 Cloudera, Inc. All Rights 40 Reserved.
  • 41. Sqoop Import Flow: the client runs the import; Sqoop collects metadata from the database, generates code, and executes a MapReduce job; the parallel map tasks pull data from the database and write it to Hadoop. ©2012 Cloudera, Inc. All Rights Reserved.
  • 42. Sqoop Limitations Sqoop has some limitations, including: • Poor support for security: $ sqoop import --username scott --password tiger… puts credentials on the command line. • Sqoop can read command line options from an option file, but this still has holes. • Error-prone syntax. • Tight coupling to the JDBC model – not a good fit for non-RDBMS systems. ©2012 Cloudera, Inc. All Rights Reserved.
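One way to keep credentials off the command line, as the slide above notes, is an options file read with `--options-file`; newer Sqoop 1.x releases can also read the password from a file with `--password-file`. A hypothetical sketch (the connection string, file names, and paths are placeholders, not values from the deck):

```
# import.options: one option or value per line
import
--connect
jdbc:mysql://dbhost/sales
--username
scott
--password-file
/home/scott/.sqoop-password
```

This would be invoked as something like `sqoop --options-file import.options --table orders`, keeping the password out of shell history, though the password file itself still needs tight permissions, hence the slide's "still has holes".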
  • 43. Fortunately… Sqoop 2 (incubating) will address many of these limitations: • Adds a web-based GUI. • Centralized configuration. • More flexible model. • Improved security model. ©2012 Cloudera, Inc. All Rights 43 Reserved.
  • 44. MapReduce For Transformation • Standard interface is Java, but higher-level interfaces are commonly used: • Apache Hive – provides an SQL like interface to data in Hadoop. • Apache Pig – declarative language providing functionality to declare a sequence of transformations. • Both Hive and Pig convert queries into MapReduce jobs and submit to Hadoop for execution. ©2013 Cloudera, Inc. All Rights 44 Reserved.
  • 45. Example Implementation with OSS Tools All the tools we need for moving and transforming data: • Hadoop provides: • HDFS for storage • MapReduce for Processing • Also components for process orchestration: • Oozie, Azkaban • And higher-level abstractions: • Pig, Hive, etc. ©2013 Cloudera, Inc. All Rights 45 Reserved.
  • 46. Data Flow with OSS Tools Transform Raw Logs Hadoop Load Web Servers Flume, etc. Sqoop, etc. Process Orchestration Oozie, etc. ©2013 Cloudera, Inc. All Rights 46 Reserved.
  • 47. Flume Configuration for Example Use Case • Spooling source watches directory for new files and moves into channels. Renames files when processed. • HDFS sink ingests files into HDFS. ©2013 Cloudera, Inc. All Rights 47 Reserved.
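The configuration itself isn't reproduced in the transcript; a minimal sketch of what such a flume.conf might look like follows. The agent name, directories, and HDFS path are hypothetical placeholders, not values from the deck:

```properties
# Hypothetical flume.conf: spooling-directory source -> file channel -> HDFS sink
agent1.sources = logsrc
agent1.channels = ch1
agent1.sinks = hdfssink

# Spooling source: watches a directory and renames files once processed.
agent1.sources.logsrc.type = spooldir
agent1.sources.logsrc.spoolDir = /var/log/web/spool
agent1.sources.logsrc.channels = ch1

# File channel: persists events to disk for recoverability.
agent1.channels.ch1.type = file
agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
agent1.channels.ch1.dataDir = /var/flume/data

# HDFS sink: ingests the events into HDFS as plain text.
agent1.sinks.hdfssink.type = hdfs
agent1.sinks.hdfssink.channel = ch1
agent1.sinks.hdfssink.hdfs.path = /data/weblogs/raw
agent1.sinks.hdfssink.hdfs.fileType = DataStream
```

The agent would be started with something like `flume-ng agent -n agent1 -f flume.conf`; note the declarative style from slide 37: components are wired together entirely in configuration, with no coding required.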
  • 48. Pig Code for Example Use Case ©2013 Cloudera, Inc. All Rights 48 Reserved.
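The Pig code on this slide appears only as a screenshot, so it isn't in the transcript. As a rough stand-in (plain Python, not the original script, with sample values that merely echo the rows shown on slide 49), the transformation it performs might look like:

```python
# Hypothetical stand-in for the slide's Pig script: raw parsed page-view
# events in, pipe-delimited (timestamp, user, product) records out.
raw = [
    ('2012-09-16T23:03:16.294Z', '1461333428', '/products/view/290'),
    ('2012-09-20T04:48:52.294Z', '772136124', '/products/view/749'),
    ('2012-09-20T04:49:01.000Z', '772136124', '/home'),  # not a product view
]

# FILTER: keep only product views; GENERATE: project (timestamp, user, product).
records = [
    (ts, user, path.rsplit('/', 1)[-1])
    for ts, user, path in raw
    if path.startswith('/products/view/')
]

# Pipe-delimited lines, ready for a Sqoop export into the warehouse table.
lines = ['|'.join(r) for r in records]
```

In real Pig this would be a LOAD, a FILTER, a FOREACH … GENERATE, and a STORE; the point is only to show the shape of the in-cluster transformation.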
  • 49. Importing Final Data into DWH Output from Pig script stored in HDFS: 2012-09-16T23:03:16.294Z|1461333428|290 2012-09-20T04:48:52.294Z|772136124|749 2012-09-24T03:51:16.294Z|1144520081|222 2012-09-24T12:29:40.294Z|628304774|407 Moved into destination table with Sqoop: ©2013 Cloudera, Inc. All Rights 49 Reserved.
  • 50. But… • Some DI services are not provided in this stack: • Metadata repository • Master Data Management • Data lineage • … ©2013 Cloudera, Inc. All Rights 50 Reserved.
  • 51. Also… • …very low level: requires knowledgeable developers to implement transformations, and there are not a whole lot of these right now. (Diagram: the skills gap between existing data modelers/ETL developers and Hadoop developers.) ©2013 Cloudera, Inc. All Rights Reserved.
  • 52. Hadoop Integration Data Integration Tools ©2013 Cloudera, Inc. All Rights 52 Reserved.
  • 53. BI/Analytics Tools Data Warehouse /RDBMS Streaming Data Data Import/Export Data Integration Tools NoSQL ©2013 Cloudera, Inc. All Rights 53 Reserved.
  • 54. Data Integration Tools ©2013 Cloudera, Inc. All Rights 54 Reserved.
  • 55. Pentaho • Existing BI tools extended to support Hadoop. • Provides data import/export, transformation, job orchestration, reporting, and analysis functionality. • Supports integration with HDFS, Hive and HBase. • Community and Enterprise Editions offered. ©2012 Cloudera, Inc. All Rights Reserved.
  • 56. Pentaho • Primary component is Pentaho Data Integration (PDI), also known as Kettle. • PDI Provides a graphical drag- and-drop environment for defining ETL jobs, which interface with Java MapReduce to execute in-cluster transformations. ©2012 Cloudera, Inc. All Rights 56 Reserved.
  • 57. Pentaho/Cloudera Demo • Ingest data into HDFS using Flume • Pre-process the reference data • Copy reference files into Hadoop • Execute transformations in-cluster • Load Hive • Query Hive • Discover, Analyze and Visualize 57
  • 58. Pentaho MapReduce 96.239.76.17 - - [31/Dec/2000:14:11:59 -0800] "GET /rate?movie=1207&rating=4 HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11" "USER=1" 5|Monty Python's Life of Brian|1979|5794|M|35-44|Salesman|53703|Madison|WI|2000|5|5|43295|20th|false|false|false|false|true|false|false|false|false|false|false|false|false|false|false|false|false|false 58
  • 59. Structure → Analysis & Visualization 5|Monty Python's Life of Brian|1979|5794|M|35-44|Salesman|53703|Madison|WI|2000|5|5|43295|20th|false|false|false|false|true|false|false|false|false|false|false|false|false|false|false|false|false|false ... 59
  • 60. Informatica • Data import/export • Metadata services • Data lineage • Transformation • … ©2013 Cloudera, Inc. All Rights Reserved.
  • 61. Informatica – Data Import: PowerExchange accesses source data (web servers, databases, data warehouses, message queues, email, social media, ERP, CRM, mainframe) in batch, CDC, or real-time mode; PowerCenter pre-processes it (e.g. filter, join, cleanse); the data is then ingested into HDFS or Hive. ©2013 Cloudera, Inc. All Rights Reserved.
  • 62. Informatica – Data Export: PowerCenter extracts data from HDFS and post-processes it (e.g. transforms to the target schema); PowerExchange delivers it to databases, data warehouses, ERP, CRM, or mainframe targets in batch or real-time mode. ©2013 Cloudera, Inc. All Rights Reserved.
  • 63. Informatica Data Import/Export 1. Create Ingest or Extract Mapping 2. Create Hadoop Connection 3. Configure Workflow 4. Configure Hive Properties ©2013 Cloudera, Inc. All Rights 63 Reserved.
  • 64. Informatica – Data Transformation ©2013 Cloudera, Inc. All Rights 64 Reserved.
  • 65. Hadoop Integration Business Intelligence/Analytic Tools ©2013 Cloudera, Inc. All Rights 65 Reserved.
  • 66. BI/Analytics Tools Data Warehouse /RDBMS Streaming Data Data Import/Export Data Integration Tools NoSQL ©2013 Cloudera, Inc. All Rights 66 Reserved.
  • 67. Business Intelligence/Analytics Tools ©2013 Cloudera, Inc. All Rights 67 Reserved.
  • 68. Business Intelligence/Analytics Tools Relational Data … Databases Warehouses ©2013 Cloudera, Inc. All Rights 68 Reserved.
  • 69. ODBC Driver • Most of these tools use the ODBC standard. • Since Hive is an SQL-like system, it's a good fit for ODBC. • Several vendors, including Cloudera, make ODBC drivers available for Hadoop. • JDBC is also used by some products for Hive integration. (Stack: BI/Analytics Tools → ODBC driver → HiveQL → Hive Server → Hive.) ©2013 Cloudera, Inc. All Rights Reserved.
  • 70. Hive Integration. HiveServer1: • No support for concurrent queries; requires running multiple HiveServers for multiple users. • No support for security. • The Thrift API in the Hive Server doesn't support common JDBC/ODBC calls. HiveServer2: • Adds support for concurrent queries; can support multiple users. • Adds security support with Kerberos. • Better support for JDBC and ODBC. ©2013 Cloudera, Inc. All Rights Reserved.
  • 71. Still Some Limitations With This Model • Hive does not have full SQL support. • Dependent on Hive – data must be loaded in Hive to be available. • Queries are high-latency. ©2013 Cloudera, Inc. All Rights 71 Reserved.
  • 72. Hadoop Integration Next Generation BI/Analytics Tools ©2013 Cloudera, Inc. All Rights 72 Reserved.
  • 73. New “Hadoop Native” Tools You can think of Hadoop as becoming a shared execution environment supporting new data analysis tools: BI/analytics tools sitting on top of new query engines alongside MapReduce. ©2013 Cloudera, Inc. All Rights Reserved.
  • 74. Hadoop Native Tools – Advantages • New data analysis tools: • Designed and optimized for working with Hadoop data and large data sets. • Remove reliance on Hive for accessing data – can work with any data in Hadoop. • New query engines: • Provide ability to do low latency queries against Hadoop data. • Make it possible to do ad-hoc, exploratory analysis of data in Hadoop. ©2013 Cloudera, Inc. All Rights 74 Reserved.
  • 75. Datameer ©2013 Cloudera, Inc. All Rights 75 Reserved.
  • 76. Datameer ©2013 Cloudera, Inc. All Rights 76 Reserved.
  • 77. New Query Engines – Impala • Fast, interactive queries on data stored in Hadoop (HDFS and HBase). • But also designed to support long running queries. • Uses familiar Hive Query Language and shares metastore. • Tight integration with Hadoop. • Reads common Hadoop file formats. • Runs on Hadoop DataNodes. • High Performance • C++, not Java. • Runtime code generation. • Entirely re-designed execution engine bypasses MapReduce. • Currently in beta, GA expected in April. Confidential. ©2012 Cloudera, Inc. All 77 Rights Reserved.
  • 78. Impala Architecture Common Hive SQL and interface Unified metadata and scheduler SQL App Hive State Metastore YARN HDFS NN Store ODBC Query Planner Query Planner Fully MPP Query Planner Query Coordinator Query Coordinator Distributed Query Coordinator Query Exec Engine Query Exec Engine Query Exec Engine HDFS DN HBase HDFS DN HBase HDFS DN HBase 78
  • 79. Cloudera Impala Details Client submits query through ODBC SQL App Hive State Metastore YARN HDFS NN Store ODBC SQL Request Query Planner Query Planner Query Planner Query Coordinator Query Coordinator Query Coordinator Query Exec Engine Query Exec Engine Query Exec Engine HDFS DN HBase HDFS DN HBase HDFS DN HBase Confidential. ©2012 Cloudera, Inc. All Rights Reserved.
  • 80. Cloudera Impala Details Planner turns request into collection of plan fragments. Coordinator initiates execution on remote impalad’s SQL App Hive State Metastore YARN HDFS NN Store ODBC Query Planner Query Planner Fully MPP Query Planner Query Coordinator Query Coordinator Distributed Query Coordinator Query Exec Engine Query Exec Engine Query Exec Engine HDFS DN HBase HDFS DN HBase HDFS DN HBase Confidential. ©2012 Cloudera, Inc. All Rights Reserved.
  • 81. Cloudera Impala Details Impalads participating in query access local data in HDFS or HBase SQL App Hive State Metastore YARN HDFS NN Store ODBC Query Planner Query Planner Query Planner Query Coordinator Query Coordinator Query Coordinator Query Exec Engine Query Exec Engine Query Exec Engine HDFS DN HBase HDFS DN HBase HDFS DN HBase Local Direct Reads Confidential. ©2012 Cloudera, Inc. All Rights Reserved.
  • 82. Cloudera Impala Details • Intermediate results are streamed between impalad's (in-memory transfers). • Final results are streamed back to the client. [architecture diagram]
  • 83. BI Example – Tableau with Impala [screenshot]
  • 84. Development Challenges • According to TDWI research*: • 28% of users feel software tools are few and immature. • And 25% note the lack of metadata management. *TDWI Best Practices Report: Integrating Hadoop Into Business Intelligence and Data Warehousing, Philip Russom, TDWI Research: http://tdwi.org/research/2013/04/tdwi-best-practices-report-integrating-hadoop-into-business-intelligence-and-data-warehousing.aspx
  • 85. The Cloudera Developer Kit • The CDK is an open-source collection of libraries, tools, examples, and documentation targeted at simplifying the most common tasks when working with Hadoop. • The first module released is the CDK Data module – APIs that drastically simplify working with datasets in Hadoop filesystems. The Data module provides: • Automatic serialization and deserialization of Java POJOs as well as Avro Records. • Automatic compression. • File and directory layout and management. • Automatic partitioning based on configurable functions. • A metadata provider plugin interface to integrate with centralized metadata management systems.
  • 86. Cloudera Developer Kit • Source code, examples, documentation, etc.: • https://github.com/cloudera/cdk
  • 87. Questions? • Or see me at the Cloudera booth – 11:00-1:00.

Editor's notes

  1. Open source system implemented (mostly) in Java. Provides reliable and scalable storage and processing of extremely large volumes of data (TB to PB) on commodity hardware. Two primary components, HDFS and MapReduce. Based on software originally developed at Google. An important aspect of Hadoop in the context of this talk is that it allows for inexpensive storage (<10% of other solutions) of any type of data, and places no constraints on how that data is processed. Allows companies to begin storing data that was previously thrown away. Hadoop has value as a stand-alone analytics system, but many if not most organizations will be looking to integrate it into existing data infrastructure.
  2. Current Architecture (build) In the beginning, there were enterprise applications backed by relational databases. These databases were optimized for processing transactions, or Online Transaction Processing (OLTP), which required high-speed transactional reading and writing. Given the valuable data in these databases, business users wanted to be able to query them in order to ask questions. They used Business Intelligence tools that provided features like reports, dashboards, scorecards, alerts, and more. But these queries put a tremendous burden on the OLTP systems, which were not optimized to be queried like this. So architects introduced another database, called a data warehouse – you may also hear about data marts or operational data stores (ODS) – that was optimized for answering user queries. The data warehouse was loaded with data from the source systems. Specialized tools Extracted the source data, applied some Transformations to it – such as parsing, cleansing, validating, matching, translating, encoding, sorting, or aggregating – and then Loaded it into the data warehouse. For short we call these ETL. As it matured, the data warehouse incorporated additional data sources. Since the data warehouse was typically a very powerful database, some organizations also began performing some transformation workloads right in the database, choosing to load raw data for speed and letting the database do the heavy lifting of transformations. This model is called ELT. Many organizations perform both ETL and ELT for data integration.
  3. Issues (build) As data volumes and business complexity grow, ETL and ELT processing is unable to keep up. Critical business windows are missed. Databases are designed to load and query data, not transform it. Transforming data in the database consumes valuable CPU, making queries run slower.
  4. Solution (build) Offload slow or complex ETL/ELT transformation workloads to Cloudera in order to meet SLAs. Cloudera processes raw data to feed the warehouse with high-value cleansed, conformed data. Reclaim valuable EDW capacity for the high-value data and query workloads it was designed for, accelerating query performance and enabling new projects. Gain the ability to query ALL your data through your existing BI tools and practices.
  5. Conventional databases are expensive to scale as data volumes grow. Therefore most organizations are unable to keep all the data they would like to query directly in the data warehouse. They have to archive the data to more affordable offline systems, such as a storage grid or tape backup. A typical strategy is to define a “time window” for data retention beyond which data is archived. Of course, this data is not in the warehouse so business users cannot benefit from it.
  6. Bank of America: A multinational bank saves millions by optimizing their EDW for analytics and reducing data storage costs by 99%. Background: A multinational bank has traditionally relied on a Teradata enterprise data warehouse for its data storage, processing and analytics. With the movement from in-person to online banking, the number of transactions and the data each transaction generates has ballooned. Challenge: The bank wanted to make effective use of all the data being generated, but their Teradata system quickly became maxed out. It could no longer handle current workloads and the bank's business-critical applications were hitting performance issues. The system was spending 44% of its resources on operational functions and 42% on ELT processing, leaving only 11% for analytics and discovery of ROI from new opportunities. The bank was forced to either expand the Teradata system, which would be very expensive; restrict user access to the system in order to lessen the workload; or offload raw data to tape backup and rely on small data samples and aggregations for analytics in order to reduce the data volume on Teradata. Solution: The bank deployed Cloudera to offload data processing, storage and some analytics from the Teradata system, allowing the EDW to focus on its real purpose: performing operational functions and analytics. Results: By offloading data processing and storage onto Cloudera, which runs on industry-standard hardware, the bank avoided spending millions to expand their Teradata infrastructure. Expensive CPU is no longer consumed by data processing, and storage costs are a mere 1% of what they were before. Meanwhile, data processing is 42% faster and data center power consumption has been reduced by 25%. The bank can now process 10TB of data every day.
  7. This is a simple example, but close to how a number of companies are using Hadoop now.
  8. Full history of users' browsing is stored in web logs. This is semi-structured data.
  9. Most companies aren’t going to store raw logs into their DWH because of expense and low value of much of the data. This goes back to the ROB discussion – This data might have value in aggregate, but may be very difficult to justify storing in the typical data warehouse.
  10. This is a very quick overview and glosses over much of the capabilities and functionality offered by Flume. This is describing 1.3 or “Flume NG”.
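Flume NG agents described in the note above are wired together in a properties file naming each agent's sources, channels, and sinks. A minimal sketch under assumed names — the agent name, log path, and HDFS directory are all illustrative:

```properties
# agent1 tails a web server log and delivers events to HDFS
# through an in-memory channel.
agent1.sources = logsrc
agent1.channels = memch
agent1.sinks = hdfssink

agent1.sources.logsrc.type = exec
agent1.sources.logsrc.command = tail -F /var/log/httpd/access_log
agent1.sources.logsrc.channels = memch

agent1.channels.memch.type = memory
agent1.channels.memch.capacity = 10000

agent1.sinks.hdfssink.type = hdfs
agent1.sinks.hdfssink.hdfs.path = /data/raw/weblogs
agent1.sinks.hdfssink.channel = memch
```

The agent would then be started with something like `flume-ng agent --conf conf --conf-file example.conf --name agent1`.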
  11. Client executes Sqoop job. Sqoop interrogates the DB for column names, types, etc. Based on the extracted metadata, Sqoop creates source code for a table class, and then kicks off an MR job. This table class can be used for processing on extracted records. Sqoop by default will guess at a column for splitting data for distribution across the cluster. This can also be specified by the client.
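The steps above correspond to a single command-line invocation. A minimal sketch, assuming a running cluster and a reachable database — the connection string, table, and split column are hypothetical:

```shell
# Import one table from MySQL into HDFS as a MapReduce job.
# Sqoop reads the table's metadata, generates the table class,
# and splits the import across mappers on the --split-by column.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table orders \
  --split-by order_id \
  --target-dir /data/raw/orders
```

Omitting `--split-by` leaves Sqoop to guess a split column (typically the primary key), as the note describes.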
  12. Should be emphasized that with this system we maintain the raw logs in Hadoop, allowing new transformations as needed.
  13. This works well, and is representative of how most companies are doing these types of tasks now.
  14. Very few database/ETL devs have Java, etc. backgrounds. Many organizations do have ETL and SQL developers, though, who are familiar with common tools such as Informatica.
  15. Pentaho also has integration with NoSQL DBs (Mongo, Cassandra, etc.)
  16. Pentaho orchestrates the entire flow. Ratings data is ingested via a PDI job. Reference data is pre-processed – combined, cleansed, etc. Reference data is then copied into HDFS. Pentaho MapReduce is then used to do extensive transformations – joins, aggregations, etc. – to create final data sets to drive analysis. Resulting data sets are loaded into Hive. Hive queries drive analysis and reporting. All processing, reporting, etc. in this example is performed in Hadoop.
  17. This provides an example of transforming raw input data into final records through the Pentaho UI.
  18. That output then drives a number of reports and visualizations.
  19. Not a promotion for Informatica, but an example of how the largest enterprise vendors are adapting their products for Hadoop. Also shows out-of-cluster transformations.
  20. Uses the same interface as existing PowerCenter. Transformations are converted to HQL. Existing Informatica jobs can be re-used with Hadoop. Also provides data profiling, data lineage, etc.
  21. Most of these tools integrate to existing data stores using the ODBC standard.
  22. MSTR and Tableau are tested and certified now with the Cloudera driver, but other standard ODBC based tools should also work, and more integrations will be supported soon.
  23. JDBC/ODBC support: The HiveServer1 Thrift API lacks support for asynchronous query execution, the ability to cancel running queries, and methods for retrieving information about the capabilities of the remote server.
  24. Performing queries in Hive is basically the equivalent of a full table scan in a standard database. Not a good fit with most BI tools.
  25. Showing a definite bias here, but Impala is available now in beta, soon to be GA, and supported by major BI and analytics vendors. Also the system that I'm familiar with. Systems like Impala provide important new capabilities for performing data analysis with Hadoop, so well worth covering in this context. According to TDWI, lack of real-time query capabilities is an obstacle to Hadoop adoption for many companies.
  26. Impalad’scomponsed of 3 components – planner, coordinator, and execution engine.State Store Daemon isn’t shown here, but maintains information on impala daemons running in system
  27. Queries get sent to a single impalad, which is different from the HiveServer architecture.
  28. Changes in CDH4 allow for short-circuit reads – this allows impalad's to read directly from the file system rather than going through DataNodes. Another change allows Impala to know which disk data blocks are on.
  29. Impala makes it more practical to perform analysis with popular BI tools. You can now do exploratory analysis and quickly generate reports and visualizations with common tools. Integration with MSTR, QlikView, Pentaho, etc.
  30. The data module provides logical abstractions on top of storage subsystems (e.g. HDFS) that let users think and operate in terms of records, datasets, and dataset repositories