SlideShare une entreprise Scribd logo
1  sur  26
Télécharger pour lire hors ligne
Cloudera Impala:
A Modern SQL Engine for Apache Hadoop
Mark Grover
Software Engineer, Cloudera
January 16, 2013
What is Impala?
●   General-purpose SQL engine
●   Real-time queries in Apache Hadoop
●   Beta version released since October 2012
●   General availability (GA) release slated for April 2013
Overview
●   User View of Impala
●   Architecture of Impala
●   Comparing Impala with Dremel
●   Comparing Impala with Hive
●   Impala Roadmap
Impala Overview: Goals
●   General-purpose SQL query engine:
     ●   should work both for analytical and transactional workloads
     ●   will support queries that take from milliseconds to hours
●   Runs directly within Hadoop:
     ●   reads widely used Hadoop file formats
     ●   talks to widely used Hadoop storage managers
     ●   runs on same nodes that run Hadoop processes
●   High performance:
     ●   C++ instead of Java
     ●   runtime code generation
     ●   completely new execution engine that doesn't build on MapReduce
User View of Impala: Overview
●   Runs as a distributed service in cluster: one Impala daemon on each
    node with data
●   User submits query via ODBC/Beeswax Thrift API to any of the
    daemons
●   Query is distributed to all nodes with relevant data
●   If any node fails, the query fails
●   Impala uses Hive's metadata interface, connects to Hive's metastore
●   Supported file formats:
     ●   text files (GA: with compression, including lzo)
     ●   sequence files with snappy/gzip compression
     ●   GA: Avro data files
     ●   GA: Trevni (columnar format; more on that later)
User View of Impala: SQL
●   SQL support:
     ●   patterned after Hive's version of SQL
     ●   limited to Select, Project, Join, Union, Subqueries, Aggregation and
         Insert
     ●   only equi-joins; no non-equi joins, no cross products
     ●   Order By only with Limit
     ●   GA: DDL support (CREATE, ALTER)
●   Functional limitations:
     ●   no custom UDFs, file formats, SerDes
     ●   no beyond SQL (buckets, samples, transforms, arrays, structs, maps,
         xpath, json)
     ●   only hash joins; joined table has to fit in memory:
           ● beta: of single node

           ● GA: aggregate memory of all (executing) nodes
User View of Impala: Apache HBase
●   HBase functionality:
     ●   uses Hive's mapping of HBase table into metastore table
     ●   predicates on rowkey columns are mapped into start/stop
         row
     ●   predicates on other columns are mapped into
         SingleColumnValueFilters
●   HBase functional limitations:
     ●   no nested-loop joins
     ●   all data stored as text
Impala Architecture
●   Two binaries: impalad and statestored
●   Impala daemon (impalad)
     ●   handles client requests and all internal requests related to
         query execution
     ●   exports Thrift services for these two roles
●   State store daemon (statestored)
     ●   provides name service and metadata distribution
     ●   also exports a Thrift service
Impala Architecture
●   Query execution phases
     ●   request arrives via odbc/beeswax Thrift API
     ●   planner turns request into collections of plan fragments
     ●   coordinator initiates execution on remote impalad's
     ●   during execution
           ● intermediate results are streamed between executors

           ● query results are streamed back to client

           ● subject to limitations imposed to blocking operators

             (top-n, aggregation)
Impala Architecture: Planner
●   join order = FROM clause order GA target: rudimentary cost-
    based optimizer
●   2-phase planning process:
     ●   single-node plan: left-deep tree of plan operators
     ●   plan partitioning: partition single-node plan to maximize scan locality,
         minimize data movement
●   plan operators: Scan, HashJoin, HashAggregation, Union,
    TopN, Exchange
●   distributed aggregation: pre-aggregation in all nodes, merge
    aggregation in single node. GA: hash-partitioned aggregation:
    re-partition aggregation input on grouping columns in order to
    reduce per-node memory requirement
Impala Architecture: Planner
●   Example: query with join and aggregation
    SELECT state, SUM(revenue)
    FROM HdfsTbl h JOIN HbaseTbl b ON (...)
    GROUP BY 1 ORDER BY 2 desc LIMIT 10

       TopN
                                                  Agg
                          TopN
       Agg                                       Hash
       Hash                Agg                   Join
       Join                                HDFS                   HBase
                           Exch                         Exch
                                           Scan                    Scan
    HDFS      HBase     at coordinator   at DataNodes          at region servers
    Scan       Scan
Impala Architecture: Query Execution
Request arrives via odbc/beeswax Thrift API

      SQL App                               Hive
                                                      HDFS NN   Statestore
       ODBC                             Metastore
                          SQL
                        request

     Query Planner                 Query Planner           Query Planner
    Query Coordinator             Query Coordinator      Query Coordinator
     Query Executor               Query Executor          Query Executor
   HDFS DN      HBase             HDFS DN     HBase      HDFS DN    HBase
Impala Architecture: Query Execution
Planner turns request into collections of plan fragments
Coordinator initiates execution on remote impalad's
      SQL App                     Hive
                                            HDFS NN   Statestore
       ODBC                   Metastore




     Query Planner       Query Planner           Query Planner
    Query Coordinator   Query Coordinator      Query Coordinator
     Query Executor      Query Executor         Query Executor
    HDFS DN     HBase   HDFS DN     HBase      HDFS DN    HBase
Impala Architecture: Query Execution
Intermediate results are streamed between impalad's Query
results are streamed back to client

     SQL App                          Hive
                                                HDFS NN   Statestore
      ODBC                        Metastore

                   query
                  results

     Query Planner           Query Planner           Query Planner
   Query Coordinator        Query Coordinator      Query Coordinator
    Query Executor           Query Executor         Query Executor
   HDFS DN     HBase        HDFS DN     HBase      HDFS DN    HBase
Impala Architecture
●   Metadata handling:
     ●   utilizes Hive's metastore
     ●   caches metadata: no synchronous metastore API calls
         during query execution
     ●   beta: impalad's read metadata from metastore at startup
     ●   GA: metadata distribution through statestore
     ●   post-GA: HCatalog and AccessServer
Impala Architecture
●   Execution engine
     ●   written in C++
     ●   runtime code generation for "big loops"
          ●   example: insert batch of rows into hash table
          ●   code generation with llvm
          ●   inlines all expressions; no function calls inside loop
     ●   all data is copied into canonical in-memory tuple format;
         all fixed-width data is located at fixed offsets
     ●   uses intrinsics/special cpu instructions for text parsing,
         crc32 computation, etc.
Impala's Statestore
●   Central system state repository
     ●   name service (membership)
     ●   GA: metadata
     ●   GA: other scheduling-relevant or diagnostic state
●   Soft-state
     ●   all impalad's register at startup
     ●   impalad's re-register after losing connection
     ●   Impala service continues to function in absence of statestored (but:
         with increasingly stale state)
     ●   State pushed to impalad's periodically
     ●   Repeated failed heartbeat means impalad evicted from cluster view
●   Thrift API for service / subscription registration
Statestore: Why not ZooKeeper?
●   ZK is not a good pub-sub system
     ●   Watch API is awkward and requires a lot of client logic
     ●   multiple round-trips required to get data for changes to
         node's children
     ●   push model is more natural for our use case
●   Don't need all the guarantees ZK provides:
     ●   serializability
     ●   persistence
     ●   prefer to avoid complexity where possible
●   ZK is bad at the things we care about and good at the
    things we don't
Comparing Impala to Dremel
●   What is Dremel?
     ●   columnar storage for data with nested structures
     ●   distributed scalable aggregation on top of that
●   Columnar storage in Hadoop: joint project between Cloudera
    and Twitter
     ●   new columnar format, derived from Doug Cutting's Trevni
     ●   stores data in appropriate native/binary types
     ●   can also store nested structures similar to Dremel's ColumnIO
●   Distributed aggregation: Impala
●   Impala plus columnar format: a superset of the published
    version of Dremel (which didn't support joins)
Comparing Impala to Hive
●   Hive: MapReduce as an execution engine
     ●   High latency, low throughput queries
     ●   Fault-tolerance model based on MapReduce's on-disk
         checkpointing; materializes all intermediate results
     ●   Java runtime allows for easy late-binding of functionality:
         file formats and UDFs.
     ●   Extensive layering imposes high runtime overhead
●   Impala:
     ●   direct, process-to-process data exchange
     ●   no fault tolerance
     ●   an execution engine designed for low runtime overhead
Comparing Impala to Hive
●   Impala's performance advantage over Hive: no hard
    numbers, but
     ●   Impala can get full disk throughput (~100MB/sec/disk);
         I/O-bound workloads often faster by 3-4x
     ●   queries that require multiple map-reduce phases in Hive
         see a higher speedup
     ●   queries that run against in-memory data see a higher
         speedup (observed up to 100x)
Impala Roadmap: GA – Q2 2013
●   New data formats:
     ●   lzo-compressed text
     ●   Avro
     ●   columnar
●   Better metadata handling:
     ●   automatic metadata distribution through statestore
●   Connectivity: jdbc
●   Improved query execution: partitioned joins
●   Further performance improvements
Impala Roadmap: GA – Q2 2013
●   Guidelines for production deployment:
     ●   load balancing across impalad's
     ●   resource isolation within MR cluster
●   Additional packages: RHEL 5.7, Ubuntu, Debian
Impala Roadmap: 2013
●   Improved HBase support:
     ●   composite keys, Avro data in columns,
         index nested-loop joins,
         INSERT/UPDATE/DELETE
●   Additional SQL:
     ●   UDFs
     ●   SQL authorization and DDL
     ●   ORDER BY without LIMIT
     ●   window functions
     ●   support for structured data types
Impala Roadmap: 2013
●   Runtime optimizations:
     ●   straggler handling
     ●   join order optimization
     ●   improved cache management
     ●   data collocation for improved join performance
●   Resource management:
     ●   cluster-wide quotas
     ●   Teradata-style policies ("user x can never have more than 5
         concurrent queries running", etc.)
     ●   goal: run exploratory and production workloads in same
         cluster, against same data, w/o impacting production jobs
Questions
●   Further questions          Download Impala (beta)
    impala-user@cloudera.org
                                 cloudera.com/downloads

●   My email:                  Learn more about Cloudera
     mgrover@cloudera.com       Enterprise RTQ, Powered
                                       by Impala
Thank you for attending!           cloudera.com/impala


                               Join the Impala Hackathon,
                               Monday, Feb 25 @ Cloudera
                                     HQ in Palo Alto
                                Register: bit.ly/DataHackDay

Contenu connexe

Tendances

Apachecon Europe 2012: Operating HBase - Things you need to know
Apachecon Europe 2012: Operating HBase - Things you need to knowApachecon Europe 2012: Operating HBase - Things you need to know
Apachecon Europe 2012: Operating HBase - Things you need to knowChristian Gügi
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopCloudera, Inc.
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentationhadooparchbook
 
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala InternalsDavid Groozman
 
Real-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaReal-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaJason Shih
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime
 
Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseNick Dimiduk
 
HBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the ArtHBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the ArtMichael Stack
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentationpunesparkmeetup
 
Jags Ramnarayan's presentation
Jags Ramnarayan's presentationJags Ramnarayan's presentation
Jags Ramnarayan's presentationpunesparkmeetup
 
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Chicago Hadoop Users Group
 
How Impala Works
How Impala WorksHow Impala Works
How Impala WorksYue Chen
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLArseny Chernov
 
Hadoop For Enterprises
Hadoop For EnterprisesHadoop For Enterprises
Hadoop For Enterprisesnvvrajesh
 
HBaseCon 2012 | HBase Filtering - Lars George, Cloudera
HBaseCon 2012 | HBase Filtering - Lars George, ClouderaHBaseCon 2012 | HBase Filtering - Lars George, Cloudera
HBaseCon 2012 | HBase Filtering - Lars George, ClouderaCloudera, Inc.
 
Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013Cloudera, Inc.
 

Tendances (20)

Apachecon Europe 2012: Operating HBase - Things you need to know
Apachecon Europe 2012: Operating HBase - Things you need to knowApachecon Europe 2012: Operating HBase - Things you need to know
Apachecon Europe 2012: Operating HBase - Things you need to know
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in Hadoop
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
 
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
 
Real-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaReal-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using Impala
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
 
Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBase
 
HBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the ArtHBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the Art
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentation
 
Jags Ramnarayan's presentation
Jags Ramnarayan's presentationJags Ramnarayan's presentation
Jags Ramnarayan's presentation
 
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
 
How Impala Works
How Impala WorksHow Impala Works
How Impala Works
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
 
Hadoop For Enterprises
Hadoop For EnterprisesHadoop For Enterprises
Hadoop For Enterprises
 
SQL on Hadoop in Taiwan
SQL on Hadoop in TaiwanSQL on Hadoop in Taiwan
SQL on Hadoop in Taiwan
 
HBaseCon 2012 | HBase Filtering - Lars George, Cloudera
HBaseCon 2012 | HBase Filtering - Lars George, ClouderaHBaseCon 2012 | HBase Filtering - Lars George, Cloudera
HBaseCon 2012 | HBase Filtering - Lars George, Cloudera
 
Apache phoenix
Apache phoenixApache phoenix
Apache phoenix
 
Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013
 

En vedette

BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case
BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use caseBDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case
BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use caseDavid Lauzon
 
Fighting cyber fraud with hadoop
Fighting cyber fraud with hadoopFighting cyber fraud with hadoop
Fighting cyber fraud with hadoopNiel Dunnage
 
AWS Big Data Platform - Pop-up Loft Tel Aviv
AWS Big Data Platform - Pop-up Loft Tel AvivAWS Big Data Platform - Pop-up Loft Tel Aviv
AWS Big Data Platform - Pop-up Loft Tel AvivAmazon Web Services
 
BDM8 - Near-realtime Big Data Analytics using Impala
BDM8 - Near-realtime Big Data Analytics using ImpalaBDM8 - Near-realtime Big Data Analytics using Impala
BDM8 - Near-realtime Big Data Analytics using ImpalaDavid Lauzon
 
Making Big Data Projects Successful - Data Science Pop-up Seattle
Making Big Data Projects Successful - Data Science Pop-up SeattleMaking Big Data Projects Successful - Data Science Pop-up Seattle
Making Big Data Projects Successful - Data Science Pop-up SeattleDomino Data Lab
 
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)Uri Laserson
 
How To Plan a Successful Big Data Pilot
How To Plan a Successful Big Data PilotHow To Plan a Successful Big Data Pilot
How To Plan a Successful Big Data Pilotfarsitegroup
 
How to start big data projects?
How to start big data projects?How to start big data projects?
How to start big data projects?Agnieszka Zdebiak
 
Compiled Python UDFs for Impala
Compiled Python UDFs for ImpalaCompiled Python UDFs for Impala
Compiled Python UDFs for ImpalaCloudera, Inc.
 
Combat Cyber Threats with Cloudera Impala & Apache Hadoop
Combat Cyber Threats with Cloudera Impala & Apache HadoopCombat Cyber Threats with Cloudera Impala & Apache Hadoop
Combat Cyber Threats with Cloudera Impala & Apache HadoopCloudera, Inc.
 
Hue architecture in the Hadoop ecosystem and SQL Editor
Hue architecture in the Hadoop ecosystem and SQL EditorHue architecture in the Hadoop ecosystem and SQL Editor
Hue architecture in the Hadoop ecosystem and SQL EditorRomain Rigaux
 
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyHakka Labs
 
October 2016 HUG: Pulsar,  a highly scalable, low latency pub-sub messaging s...
October 2016 HUG: Pulsar,  a highly scalable, low latency pub-sub messaging s...October 2016 HUG: Pulsar,  a highly scalable, low latency pub-sub messaging s...
October 2016 HUG: Pulsar,  a highly scalable, low latency pub-sub messaging s...Yahoo Developer Network
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopCloudera, Inc.
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideDanairat Thanabodithammachari
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
Ibis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and ImpalaIbis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and ImpalaWes McKinney
 
How to Make Awesome SlideShares: Tips & Tricks
How to Make Awesome SlideShares: Tips & TricksHow to Make Awesome SlideShares: Tips & Tricks
How to Make Awesome SlideShares: Tips & TricksSlideShare
 
Getting Started With SlideShare
Getting Started With SlideShareGetting Started With SlideShare
Getting Started With SlideShareSlideShare
 

En vedette (19)

BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case
BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use caseBDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case
BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case
 
Fighting cyber fraud with hadoop
Fighting cyber fraud with hadoopFighting cyber fraud with hadoop
Fighting cyber fraud with hadoop
 
AWS Big Data Platform - Pop-up Loft Tel Aviv
AWS Big Data Platform - Pop-up Loft Tel AvivAWS Big Data Platform - Pop-up Loft Tel Aviv
AWS Big Data Platform - Pop-up Loft Tel Aviv
 
BDM8 - Near-realtime Big Data Analytics using Impala
BDM8 - Near-realtime Big Data Analytics using ImpalaBDM8 - Near-realtime Big Data Analytics using Impala
BDM8 - Near-realtime Big Data Analytics using Impala
 
Making Big Data Projects Successful - Data Science Pop-up Seattle
Making Big Data Projects Successful - Data Science Pop-up SeattleMaking Big Data Projects Successful - Data Science Pop-up Seattle
Making Big Data Projects Successful - Data Science Pop-up Seattle
 
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
 
How To Plan a Successful Big Data Pilot
How To Plan a Successful Big Data PilotHow To Plan a Successful Big Data Pilot
How To Plan a Successful Big Data Pilot
 
How to start big data projects?
How to start big data projects?How to start big data projects?
How to start big data projects?
 
Compiled Python UDFs for Impala
Compiled Python UDFs for ImpalaCompiled Python UDFs for Impala
Compiled Python UDFs for Impala
 
Combat Cyber Threats with Cloudera Impala & Apache Hadoop
Combat Cyber Threats with Cloudera Impala & Apache HadoopCombat Cyber Threats with Cloudera Impala & Apache Hadoop
Combat Cyber Threats with Cloudera Impala & Apache Hadoop
 
Hue architecture in the Hadoop ecosystem and SQL Editor
Hue architecture in the Hadoop ecosystem and SQL EditorHue architecture in the Hadoop ecosystem and SQL Editor
Hue architecture in the Hadoop ecosystem and SQL Editor
 
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
 
October 2016 HUG: Pulsar,  a highly scalable, low latency pub-sub messaging s...
October 2016 HUG: Pulsar,  a highly scalable, low latency pub-sub messaging s...October 2016 HUG: Pulsar,  a highly scalable, low latency pub-sub messaging s...
October 2016 HUG: Pulsar,  a highly scalable, low latency pub-sub messaging s...
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for Hadoop
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Ibis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and ImpalaIbis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and Impala
 
How to Make Awesome SlideShares: Tips & Tricks
How to Make Awesome SlideShares: Tips & TricksHow to Make Awesome SlideShares: Tips & Tricks
How to Make Awesome SlideShares: Tips & Tricks
 
Getting Started With SlideShare
Getting Started With SlideShareGetting Started With SlideShare
Getting Started With SlideShare
 

Similaire à Jan 2013 HUG: Impala - Real-time Queries for Apache Hadoop

Cloudera Impala presentation
Cloudera Impala presentationCloudera Impala presentation
Cloudera Impala presentationmarkgrover
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Modern Data Stack France
 
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfssusere05ec21
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera, Inc.
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad ranaData Con LA
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwielerlucenerevolution
 
Fabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on FlinkFabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on FlinkFlink Forward
 
Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)Yukinori Suda
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAsLuis Marques
 
Savanna - Elastic Hadoop on OpenStack
Savanna - Elastic Hadoop on OpenStackSavanna - Elastic Hadoop on OpenStack
Savanna - Elastic Hadoop on OpenStackSergey Lukjanov
 
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseOct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseYahoo Developer Network
 

Similaire à Jan 2013 HUG: Impala - Real-time Queries for Apache Hadoop (20)

Cloudera Impala presentation
Cloudera Impala presentationCloudera Impala presentation
Cloudera Impala presentation
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
 
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for Hadoop
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Impala for PhillyDB Meetup
Impala for PhillyDB MeetupImpala for PhillyDB Meetup
Impala for PhillyDB Meetup
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad rana
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Fabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on FlinkFabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on Flink
 
Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAs
 
Hbase: an introduction
Hbase: an introductionHbase: an introduction
Hbase: an introduction
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop Introduction
 
Incredible Impala
Incredible Impala Incredible Impala
Incredible Impala
 
Savanna - Elastic Hadoop on OpenStack
Savanna - Elastic Hadoop on OpenStackSavanna - Elastic Hadoop on OpenStack
Savanna - Elastic Hadoop on OpenStack
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hbase 20141003
Hbase 20141003Hbase 20141003
Hbase 20141003
 
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseOct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
 

Plus de Yahoo Developer Network

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaYahoo Developer Network
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Yahoo Developer Network
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanYahoo Developer Network
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Yahoo Developer Network
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuYahoo Developer Network
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolYahoo Developer Network
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Yahoo Developer Network
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Yahoo Developer Network
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathYahoo Developer Network
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Yahoo Developer Network
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathYahoo Developer Network
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsYahoo Developer Network
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondYahoo Developer Network
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network
 

Plus de Yahoo Developer Network (20)

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
 
CICD at Oath using Screwdriver
CICD at Oath using ScrewdriverCICD at Oath using Screwdriver
CICD at Oath using Screwdriver
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, Oath
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI Applications
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step Beyond
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 

Jan 2013 HUG: Impala - Real-time Queries for Apache Hadoop

  • 1. Cloudera Impala: A Modern SQL Engine for Apache Hadoop Mark Grover Software Engineer, Cloudera January 16, 2013
  • 2. What is Impala? ● General-purpose SQL engine ● Real-time queries in Apache Hadoop ● Beta version released since October 2012 ● General availability (GA) release slated for April 2013
  • 3. Overview ● User View of Impala ● Architecture of Impala ● Comparing Impala with Dremel ● Comparing Impala with Hive ● Impala Roadmap
  • 4. Impala Overview: Goals ● General-purpose SQL query engine: ● should work both for analytical and transactional workloads ● will support queries that take from milliseconds to hours ● Runs directly within Hadoop: ● reads widely used Hadoop file formats ● talks to widely used Hadoop storage managers ● runs on same nodes that run Hadoop processes ● High performance: ● C++ instead of Java ● runtime code generation ● completely new execution engine that doesn't build on MapReduce
  • 5. User View of Impala: Overview ● Runs as a distributed service in cluster: one Impala daemon on each node with data ● User submits query via ODBC/Beeswax Thrift API to any of the daemons ● Query is distributed to all nodes with relevant data ● If any node fails, the query fails ● Impala uses Hive's metadata interface, connects to Hive's metastore ● Supported file formats: ● text files (GA: with compression, including lzo) ● sequence files with snappy/gzip compression ● GA: Avro data files ● GA: Trevni (columnar format; more on that later)
  • 6. User View of Impala: SQL ● SQL support: ● patterned after Hive's version of SQL ● limited to Select, Project, Join, Union, Subqueries, Aggregation and Insert ● only equi-joins; no non-equi joins, no cross products ● Order By only with Limit ● GA: DDL support (CREATE, ALTER) ● Functional limitations: ● no custom UDFs, file formats, SerDes ● no beyond SQL (buckets, samples, transforms, arrays, structs, maps, xpath, json) ● only hash joins; joined table has to fit in memory: ● beta: of single node ● GA: aggregate memory of all (executing) nodes
  • 7. User View of Impala: Apache HBase ● HBase functionality: ● uses Hive's mapping of HBase table into metastore table ● predicates on rowkey columns are mapped into start/stop row ● predicates on other columns are mapped into SingleColumnValueFilters ● HBase functional limitations: ● no nested-loop joins ● all data stored as text
  • 8. Impala Architecture ● Two binaries: impalad and statestored ● Impala daemon (impalad) ● handles client requests and all internal requests related to query execution ● exports Thrift services for these two roles ● State store daemon (statestored) ● provides name service and metadata distribution ● also exports a Thrift service
  • 9. Impala Architecture ● Query execution phases ● request arrives via odbc/beeswax Thrift API ● planner turns request into collections of plan fragments ● coordinator initiates execution on remote impalad's ● during execution ● intermediate results are streamed between executors ● query results are streamed back to client ● subject to limitations imposed to blocking operators (top-n, aggregation)
  • 10. Impala Architecture: Planner ● join order = FROM clause order GA target: rudimentary cost- based optimizer ● 2-phase planning process: ● single-node plan: left-deep tree of plan operators ● plan partitioning: partition single-node plan to maximize scan locality, minimize data movement ● plan operators: Scan, HashJoin, HashAggregation, Union, TopN, Exchange ● distributed aggregation: pre-aggregation in all nodes, merge aggregation in single node. GA: hash-partitioned aggregation: re-partition aggregation input on grouping columns in order to reduce per-node memory requirement
  • 11. Impala Architecture: Planner ● Example: query with join and aggregation SELECT state, SUM(revenue) FROM HdfsTbl h JOIN HbaseTbl b ON (...) GROUP BY 1 ORDER BY 2 desc LIMIT 10 TopN Agg TopN Agg Hash Hash Agg Join Join HDFS HBase Exch Exch Scan Scan HDFS HBase at coordinator at DataNodes at region servers Scan Scan
  • 12. Impala Architecture: Query Execution Request arrives via odbc/beeswax Thrift API SQL App Hive HDFS NN Statestore ODBC Metastore SQL request Query Planner Query Planner Query Planner Query Coordinator Query Coordinator Query Coordinator Query Executor Query Executor Query Executor HDFS DN HBase HDFS DN HBase HDFS DN HBase
  • 13. Impala Architecture: Query Execution Planner turns request into collections of plan fragments Coordinator initiates execution on remote impalad's SQL App Hive HDFS NN Statestore ODBC Metastore Query Planner Query Planner Query Planner Query Coordinator Query Coordinator Query Coordinator Query Executor Query Executor Query Executor HDFS DN HBase HDFS DN HBase HDFS DN HBase
  • 14. Impala Architecture: Query Execution Intermediate results are streamed between impalad's Query results are streamed back to client SQL App Hive HDFS NN Statestore ODBC Metastore query results Query Planner Query Planner Query Planner Query Coordinator Query Coordinator Query Coordinator Query Executor Query Executor Query Executor HDFS DN HBase HDFS DN HBase HDFS DN HBase
  • 15. Impala Architecture ● Metadata handling: ● utilizes Hive's metastore ● caches metadata: no synchronous metastore API calls during query execution ● beta: impalad's read metadata from metastore at startup ● GA: metadata distribution through statestore ● post-GA: HCatalog and AccessServer
  • 16. Impala Architecture ● Execution engine ● written in C++ ● runtime code generation for "big loops" ● example: insert batch of rows into hash table ● code generation with llvm ● inlines all expressions; no function calls inside loop ● all data is copied into canonical in-memory tuple format; all fixed-width data is located at fixed offsets ● uses intrinsics/special cpu instructions for text parsing, crc32 computation, etc.
  • 17. Impala's Statestore ● Central system state repository ● name service (membership) ● GA: metadata ● GA: other scheduling-relevant or diagnostic state ● Soft-state ● all impalad's register at startup ● impalad's re-register after losing connection ● Impala service continues to function in absence of statestored (but: with increasingly stale state) ● State pushed to impalad's periodically ● Repeated failed heartbeat means impalad evicted from cluster view ● Thrift API for service / subscription registration
  • 18. Statestore: Why not ZooKeeper? ● ZK is not a good pub-sub system ● Watch API is awkward and requires a lot of client logic ● multiple round-trips required to get data for changes to node's children ● push model is more natural for our use case ● Don't need all the guarantees ZK provides: ● serializability ● persistence ● prefer to avoid complexity where possible ● ZK is bad at the things we care about and good at the things we don't
  • 19. Comparing Impala to Dremel ● What is Dremel? ● columnar storage for data with nested structures ● distributed scalable aggregation on top of that ● Columnar storage in Hadoop: joint project between Cloudera and Twitter ● new columnar format, derived from Doug Cutting's Trevni ● stores data in appropriate native/binary types ● can also store nested structures similar to Dremel's ColumnIO ● Distributed aggregation: Impala ● Impala plus columnar format: a superset of the published version of Dremel (which didn't support joins)
  • 20. Comparing Impala to Hive ● Hive: MapReduce as an execution engine ● High latency, low throughput queries ● Fault-tolerance model based on MapReduce's on-disk checkpointing; materializes all intermediate results ● Java runtime allows for easy late-binding of functionality: file formats and UDFs. ● Extensive layering imposes high runtime overhead ● Impala: ● direct, process-to-process data exchange ● no fault tolerance ● an execution engine designed for low runtime overhead
  • 21. Comparing Impala to Hive ● Impala's performance advantage over Hive: no hard numbers, but ● Impala can get full disk throughput (~100MB/sec/disk); I/O-bound workloads often faster by 3-4x ● queries that require multiple map-reduce phases in Hive see a higher speedup ● queries that run against in-memory data see a higher speedup (observed up to 100x)
  • 22. Impala Roadmap: GA – Q2 2013 ● New data formats: ● lzo-compressed text ● Avro ● columnar ● Better metadata handling: ● automatic metadata distribution through statestore ● Connectivity: jdbc ● Improved query execution: partitioned joins ● Further performance improvements
  • 23. Impala Roadmap: GA – Q2 2013 ● Guidelines for production deployment: ● load balancing across impalad's ● resource isolation within MR cluster ● Additional packages: RHEL 5.7, Ubuntu, Debian
  • 24. Impala Roadmap: 2013 ● Improved HBase support: ● composite keys, Avro data in columns, index nested-loop joins, INSERT/UPDATE/DELETE ● Additional SQL: ● UDFs ● SQL authorization and DDL ● ORDER BY without LIMIT ● window functions ● support for structured data types
  • 25. Impala Roadmap: 2013 ● Runtime optimizations: ● straggler handling ● join order optimization ● improved cache management ● data collocation for improved join performance ● Resource management: ● cluster-wide quotas ● Teradata-style policies ("user x can never have more than 5 concurrent queries running", etc.) ● goal: run exploratory and production workloads in same cluster, against same data, w/o impacting production jobs
  • 26. Questions ● Further questions Download Impala (beta) impala-user@cloudera.org cloudera.com/downloads ● My email: Learn more about Cloudera mgrover@cloudera.com Enterprise RTQ, Powered by Impala Thank you for attending! cloudera.com/impala Join the Impala Hackathon, Monday, Feb 25 @ Cloudera HQ in Palo Alto Register: bit.ly/DataHackDay