SlideShare une entreprise Scribd logo
1  sur  26
Télécharger pour lire hors ligne
Impala: A Modern SQL
  Engine for Hadoop
     Marcel Kornacker
      Cloudera, Inc.
Agenda
● Goals; user view of Impala
● Impala internals
● Comparing Impala to other systems
Impala Overview: Goals
● General-purpose SQL query engine:
    ○ should work both for analytical and transactional
       workloads
    ○ will support queries that take from milliseconds to
       hours
●   Runs directly within Hadoop:
    ○ reads widely used Hadoop file formats
    ○ talks to widely used Hadoop storage managers
    ○ runs on same nodes that run Hadoop processes
●   High performance:
    ○ C++ instead of Java
    ○ runtime code generation
    ○ completely new execution engine that doesn't build
       on MapReduce
User View of Impala: Overview
● Runs as a distributed service in cluster: one Impala
    daemon on each node with data
●   User submits query via ODBC/JDBC to any of the
    daemons
●   Query is distributed to all nodes with relevant data
●   If any node fails, the query fails
●   Impala uses Hive's metadata interface, connects to
    Hive's metastore
●   Supported file formats:
     ○ uncompressed/lzo-compressed text files
     ○ sequence files and RCfile with snappy/gzip
        compression
     ○ GA: Avro data files
     ○ GA: columnar format (more on that later)
User View of Impala: SQL
● SQL support:
    ○ patterned after Hive's version of SQL
    ○ essentially SQL-92, minus correlated subqueries
    ○ Insert Into ... Select ...
    ○ only equi-joins; no non-equi joins, no cross products
    ○ Order By requires Limit
    ○ Post-GA: DDL support (CREATE, ALTER)
●   Functional limitations:
    ○ no custom UDFs, file formats, SerDes
    ○ no beyond SQL (buckets, samples, transforms,
       arrays, structs, maps, xpath, json)
    ○ only hash joins; joined table has to fit in memory:
       ■ beta: of single node
       ■ GA: aggregate memory of all (executing) nodes
User View of Impala: HBase
● HBase functionality:
    ○ uses Hive's mapping of HBase table into metastore
      table
    ○ predicates on rowkey columns are mapped into
      start/stop row
    ○ predicates on other columns are mapped into
      SingleColumnValueFilters
●   HBase functional limitations:
    ○ no nested-loop joins
    ○ all data stored as text
Impala Architecture
● Two binaries: impalad and statestored
● Impala daemon (impalad)
    a. handles client requests and all internal requests
       related to query execution
    b. exports Thrift services for these two roles
●   State store daemon (statestored)
    a. provides name service and metadata distribution
    b. also exports a Thrift service
Impala Architecture
● Query execution phases
    ○ request arrives via odbc/jdbc
    ○ planner turns request into collections of plan
       fragments
    ○ coordinator initiates execution on remote impalad's
●   During execution
    ○ intermediate results are streamed between
       executors
    ○ query results are streamed back to client
    ○ subject to limitations imposed to blocking operators
       (top-n, aggregation)
Impala Architecture: Planner
● 2-phase planning process:
    ○ single-node plan: left-deep tree of plan operators
    ○ plan partitioning: partition single-node plan to
       maximize scan locality, minimize data movement
●   Plan operators: Scan, HashJoin, HashAggregation,
    Union, TopN, Exchange
●   Distributed aggregation: pre-aggregation in all nodes,
    merge aggregation in single node
    GA: hash-partitioned aggregation: re-partition
    aggregation input on grouping columns in order to
    reduce per-node memory requirement
●   Join order = FROM clause order
    Post-GA: rudimentary cost-based optimizer
Impala Architecture: Planner
● Example: query with join and aggregation
   SELECT state, SUM(revenue)
   FROM HdfsTbl h JOIN HbaseTbl b ON (...)
   GROUP BY 1 ORDER BY 2 desc LIMIT 10

    TopN
                                                Agg
                          TopN
        Agg                                     Hash
                           Agg                  Join
        Hash
        Join
                                         Hdfs                     Hbase
                          Exch                         Exch
                                         Scan                     Scan
 Hdfs          Hbase   at coordinator   at DataNodes          at region servers
 Scan          Scan
Impala Architecture: Query Execution
Request arrives via odbc/jdbc


     SQL App                             Hive
                                                    HDFS NN       Statestore
      ODBC                             Metastore


                  SQL request



     Query Planner                Query Planner           Query Planner
    Query Coordinator           Query Coordinator       Query Coordinator
     Query Executor              Query Executor          Query Executor

   HDFS DN      HBase           HDFS DN     HBase       HDFS DN       HBase
Impala Architecture: Query Execution
Planner turns request into collections of plan fragments
Coordinator initiates execution on remote impalad's

     SQL App                     Hive
                                            HDFS NN       Statestore
      ODBC                     Metastore




     Query Planner        Query Planner           Query Planner
    Query Coordinator   Query Coordinator       Query Coordinator
     Query Executor      Query Executor          Query Executor

   HDFS DN      HBase   HDFS DN     HBase       HDFS DN       HBase
Impala Architecture: Query Execution
Intermediate results are streamed between impalad's
Query results are streamed back to client

     SQL App                         Hive
                                                HDFS NN     Statestore
      ODBC                         Metastore

                   query
                  results


     Query Planner            Query Planner           Query Planner
    Query Coordinator       Query Coordinator       Query Coordinator
     Query Executor          Query Executor          Query Executor

   HDFS DN      HBase       HDFS DN     HBase       HDFS DN     HBase
Metadata Handling
● Utilizes Hive's metastore
● Caches metadata: no synchronous metastore API calls
    during query execution
●   Beta: impalad's read metadata from metastore at
    startup
●   Post-GA: metadata distribution through statestore
●   Post-GA: HCatalog and AccessServer
Impala Execution Engine
● Written in C++
● Runtime code generation for "big loops"
● Internal in-memory tuple format puts fixed-width data at
  fixed offsets
● Uses intrinsics/special cpu instructions for text parsing,
  crc32 computation, etc.
Impala Execution Engine
● More on runtime code generation
   ○ example of "big loop": insert batch of rows into hash
     table
   ○ known at query compile time: # of tuples in batch,
     tuple layout, column types, etc.
   ○ generate at compile time: unrolled loop that inlines
     all function calls, contains no dead code, minimizes
     branches
   ○ code generated using llvm
Impala's Statestore
● Central system state repository
    ○ name service (membership)
    ○ Post-GA: metadata
    ○ Post-GA: other scheduling-relevant or diagnostic
       state
●   Soft-state
    ○ all data can be reconstructed from the rest of the
       system
    ○ cluster continues to function when statestore fails,
       but per-node state becomes increasingly stale
●   Sends periodic heartbeats
    ○ pushes new data
    ○ checks for liveness
Statestore: Why not ZooKeeper
● ZK is not a good pub-sub system
     ○ Watch API is awkward and requires a lot of client
        logic
     ○ multiple round-trips required to get data for changes
        to node's children
     ○ push model is more natural for our use case
●   Don't need all the guarantees ZK provides:
     ○ serializability
     ○ persistence
     ○ prefer to avoid complexity where possible
●   ZK is bad at the things we care about and good at the
    things we don't
Comparing Impala to Dremel
● What is Dremel:
     ○ columnar storage for data with nested structures
     ○ distributed scalable aggregation on top of that
●   Columnar storage in Hadoop: joint project between
    Cloudera and Twitter
     ○ new columnar format: Parquet;
       derived from Doug Cutting's Trevni
     ○ stores data in appropriate native/binary types
     ○ can also store nested structures similar to Dremel's
       ColumnIO
●   Distributed aggregation: Impala
●   Impala plus Parquet: a superset of the published
    version of Dremel (which didn't support joins)
More about Parquet
● What it is:
    ○ columnar container format for all popular
       serialization formats: Avro, Thrift, Protocol Buffers
    ○ successor to Trevni
    ○ jointly developed between Cloudera and Twitter
    ○ open source; hosted on github
●   Features:
    ○ rowgroup format: file contains multiple horiz. slices
    ○ supports storing each column in separate file
    ○ supports fully shredded nested data; repetition and
       definition levels similar to Dremel's ColumnIO
    ○ column values stored in native types (bool, int<x>,
       float, double, byte array)
    ○ support for index pages for fast lookup
    ○ extensible value encodings
Comparing Impala to Hive
● Hive: MapReduce as an execution engine
     ○ High latency, low throughput queries
     ○ Fault-tolerance model based on MapReduce's on-
       disk checkpointing; materializes all intermediate
       results
     ○ Java runtime allows for easy late-binding of
       functionality: file formats and UDFs.
     ○ Extensive layering imposes high runtime overhead
●   Impala:
     ○ direct, process-to-process data exchange
     ○ no fault tolerance
     ○ an execution engine designed for low runtime
       overhead
Comparing Impala to Hive
● Impala's performance advantage over Hive: no hard
   numbers, but
   ○ Impala can get full disk throughput
     (~100MB/sec/disk); I/O-bound workloads often faster
     by 3-4x
   ○ queries that require multiple map-reduce phases in
     Hive see a higher speedup
   ○ queries that run against in-memory data see a
     higher speedup (observed up to 100x)
Impala Roadmap: GA (April 2013)
● New data formats:
   ○ Avro
   ○ Parquet
● Improved query execution: partitioned joins
● Further performance improvements
● Guidelines for production deployment:
   ○ load balancing across impalad's
   ○ resource isolation within MR cluster
● Additional packages: Debian, SLES
Impala Roadmap: 2013
● Additional SQL:
   ○ UDFs
   ○ SQL authorization and DDL
   ○ ORDER BY without LIMIT
   ○ window functions
   ○ support for structured data types
● Improved HBase support:
   ○ composite keys, complex types in columns,
     index nested-loop joins,
     INSERT/UPDATE/DELETE
● Runtime optimizations:
   ○ straggler handling
   ○ join order optimization
   ○ improved cache management
   ○ data collocation for improved join performance
Impala Roadmap: 2013
● Better metadata handling:
  ○ automatic metadata distribution through statestore
● Resource management:
  ○ goal: run exploratory and production workloads in
     same cluster, against same data, without impacting
     production jobs
Try it out!
● Beta version available since 10/24/12
● Latest version is 0.6 (as of 2/26)
● We have packages for:
    ○ RHEL6.2/5.7
    ○ Ubuntu 10.04 and 12.04
    ○ SLES11
    ○ Debian6
●   We're targeting GA for April 2013
●   Questions/comments? impala-user@cloudera.org
●   My email address: marcel@cloudera.com

Contenu connexe

Tendances

Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7Ted Dunning
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceUwe Printz
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Sumeet Singh
 
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDataWorks Summit
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresDataWorks Summit
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingDataWorks Summit
 
Architectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop DistributionArchitectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop Distributionmcsrivas
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with YarnDavid Kaiser
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APImcsrivas
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesDataWorks Summit
 
From docker to kubernetes: running Apache Hadoop in a cloud native way
From docker to kubernetes: running Apache Hadoop in a cloud native wayFrom docker to kubernetes: running Apache Hadoop in a cloud native way
From docker to kubernetes: running Apache Hadoop in a cloud native wayDataWorks Summit
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Managementrightsize
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuningVitthal Gogate
 
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...Cloudera, Inc.
 

Tendances (20)

Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduce
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
 
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value Stores
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
Architectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop DistributionArchitectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop Distribution
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase API
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenches
 
From docker to kubernetes: running Apache Hadoop in a cloud native way
From docker to kubernetes: running Apache Hadoop in a cloud native wayFrom docker to kubernetes: running Apache Hadoop in a cloud native way
From docker to kubernetes: running Apache Hadoop in a cloud native way
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Management
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
 

En vedette

Hug france-2012-12-04
Hug france-2012-12-04Hug france-2012-12-04
Hug france-2012-12-04Ted Dunning
 
Analyse prédictive en assurance santé par Julien Cabot
Analyse prédictive en assurance santé par Julien CabotAnalyse prédictive en assurance santé par Julien Cabot
Analyse prédictive en assurance santé par Julien CabotModern Data Stack France
 
Talend Open Studio for Big Data (powered by Apache Hadoop)
Talend Open Studio for Big Data (powered by Apache Hadoop)Talend Open Studio for Big Data (powered by Apache Hadoop)
Talend Open Studio for Big Data (powered by Apache Hadoop)Modern Data Stack France
 
Cassandra Hadoop Best Practices by Jeremy Hanna
Cassandra Hadoop Best Practices by Jeremy HannaCassandra Hadoop Best Practices by Jeremy Hanna
Cassandra Hadoop Best Practices by Jeremy HannaModern Data Stack France
 
Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
Cassandra Hadoop Integration at HUG France by Piotr KołaczkowskiCassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
Cassandra Hadoop Integration at HUG France by Piotr KołaczkowskiModern Data Stack France
 
Paris HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on HadoopParis HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on HadoopHortonworks
 
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Cedric CARBONE
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connectorDuyhai Doan
 
Hadoop HPC, calcul de VAR sur Hadoop vs GridGain
Hadoop HPC, calcul de VAR sur Hadoop vs GridGainHadoop HPC, calcul de VAR sur Hadoop vs GridGain
Hadoop HPC, calcul de VAR sur Hadoop vs GridGainModern Data Stack France
 

En vedette (20)

Hadoop on Azure
Hadoop on AzureHadoop on Azure
Hadoop on Azure
 
M7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal HausenblasM7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal Hausenblas
 
Hug france-2012-12-04
Hug france-2012-12-04Hug france-2012-12-04
Hug france-2012-12-04
 
Analyse prédictive en assurance santé par Julien Cabot
Analyse prédictive en assurance santé par Julien CabotAnalyse prédictive en assurance santé par Julien Cabot
Analyse prédictive en assurance santé par Julien Cabot
 
Cascalog présenté par Bertrand Dechoux
Cascalog présenté par Bertrand DechouxCascalog présenté par Bertrand Dechoux
Cascalog présenté par Bertrand Dechoux
 
IBM Stream au Hadoop User Group
IBM Stream au Hadoop User GroupIBM Stream au Hadoop User Group
IBM Stream au Hadoop User Group
 
Talend Open Studio for Big Data (powered by Apache Hadoop)
Talend Open Studio for Big Data (powered by Apache Hadoop)Talend Open Studio for Big Data (powered by Apache Hadoop)
Talend Open Studio for Big Data (powered by Apache Hadoop)
 
Cassandra Hadoop Best Practices by Jeremy Hanna
Cassandra Hadoop Best Practices by Jeremy HannaCassandra Hadoop Best Practices by Jeremy Hanna
Cassandra Hadoop Best Practices by Jeremy Hanna
 
Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
Cassandra Hadoop Integration at HUG France by Piotr KołaczkowskiCassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
 
Paris HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on HadoopParis HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on Hadoop
 
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connector
 
Dépasser map() et reduce()
Dépasser map() et reduce()Dépasser map() et reduce()
Dépasser map() et reduce()
 
Hadoop HPC, calcul de VAR sur Hadoop vs GridGain
Hadoop HPC, calcul de VAR sur Hadoop vs GridGainHadoop HPC, calcul de VAR sur Hadoop vs GridGain
Hadoop HPC, calcul de VAR sur Hadoop vs GridGain
 
Hadoop chez Kobojo
Hadoop chez KobojoHadoop chez Kobojo
Hadoop chez Kobojo
 
Big Data et SEO, par Vincent Heuschling
Big Data et SEO, par Vincent HeuschlingBig Data et SEO, par Vincent Heuschling
Big Data et SEO, par Vincent Heuschling
 
HCatalog
HCatalogHCatalog
HCatalog
 
Hadopp Vue d'ensemble
Hadopp Vue d'ensembleHadopp Vue d'ensemble
Hadopp Vue d'ensemble
 
Hadoop Graph Analysis par Thomas Vial
Hadoop Graph Analysis par Thomas VialHadoop Graph Analysis par Thomas Vial
Hadoop Graph Analysis par Thomas Vial
 
Retour Hadoop Summit 2012
Retour Hadoop Summit 2012Retour Hadoop Summit 2012
Retour Hadoop Summit 2012
 

Similaire à Marcel Kornacker: Impala tech talk Tue Feb 26th 2013

An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopChicago Hadoop Users Group
 
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera, Inc.
 
Cloudera Impala presentation
Cloudera Impala presentationCloudera Impala presentation
Cloudera Impala presentationmarkgrover
 
Jan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
Jan 2013 HUG: Impala - Real-time Queries for Apache HadoopJan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
Jan 2013 HUG: Impala - Real-time Queries for Apache HadoopYahoo Developer Network
 
Cloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for HadoopCloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for HadoopCloudera, Inc.
 
Impala presentation
Impala presentationImpala presentation
Impala presentationtrihug
 
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfssusere05ec21
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentationhadooparchbook
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impalamarkgrover
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad ranaData Con LA
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAsLuis Marques
 
HBase, crazy dances on the elephant back.
HBase, crazy dances on the elephant back.HBase, crazy dances on the elephant back.
HBase, crazy dances on the elephant back.Roman Nikitchenko
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera, Inc.
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive Omid Vahdaty
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h basehdhappy001
 

Similaire à Marcel Kornacker: Impala tech talk Tue Feb 26th 2013 (20)

An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache Hadoop
 
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
 
Cloudera Impala presentation
Cloudera Impala presentationCloudera Impala presentation
Cloudera Impala presentation
 
Jan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
Jan 2013 HUG: Impala - Real-time Queries for Apache HadoopJan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
Jan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
 
Cloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for HadoopCloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for Hadoop
 
Impala presentation
Impala presentationImpala presentation
Impala presentation
 
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad rana
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAs
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
HBase, crazy dances on the elephant back.
HBase, crazy dances on the elephant back.HBase, crazy dances on the elephant back.
HBase, crazy dances on the elephant back.
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for Hadoop
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
 
Hbase: an introduction
Hbase: an introductionHbase: an introduction
Hbase: an introduction
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h base
 

Plus de Modern Data Stack France

Talend spark meetup 03042017 - Paris Spark Meetup
Talend spark meetup 03042017 - Paris Spark MeetupTalend spark meetup 03042017 - Paris Spark Meetup
Talend spark meetup 03042017 - Paris Spark MeetupModern Data Stack France
 
Paris Spark Meetup - Trifacta - 03_04_2017
Paris Spark Meetup - Trifacta - 03_04_2017Paris Spark Meetup - Trifacta - 03_04_2017
Paris Spark Meetup - Trifacta - 03_04_2017Modern Data Stack France
 
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Modern Data Stack France
 
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...Modern Data Stack France
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with sparkModern Data Stack France
 
HUG France - 20160114 industrialisation_process_big_data CanalPlus
HUG France -  20160114 industrialisation_process_big_data CanalPlusHUG France -  20160114 industrialisation_process_big_data CanalPlus
HUG France - 20160114 industrialisation_process_big_data CanalPlusModern Data Stack France
 
HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)
HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)
HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)Modern Data Stack France
 
Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015
Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015
Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015Modern Data Stack France
 
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...Modern Data Stack France
 
Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015
Record linkage, a real use case with spark ml  - Paris Spark meetup Dec 2015Record linkage, a real use case with spark ml  - Paris Spark meetup Dec 2015
Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015Modern Data Stack France
 
June Spark meetup : search as recommandation
June Spark meetup : search as recommandationJune Spark meetup : search as recommandation
June Spark meetup : search as recommandationModern Data Stack France
 
Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Modern Data Stack France
 
Paris Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamiel
Paris Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamielParis Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamiel
Paris Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamielModern Data Stack France
 

Plus de Modern Data Stack France (20)

Stash - Data FinOPS
Stash - Data FinOPSStash - Data FinOPS
Stash - Data FinOPS
 
Vue d'ensemble Dremio
Vue d'ensemble DremioVue d'ensemble Dremio
Vue d'ensemble Dremio
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
 
Talend spark meetup 03042017 - Paris Spark Meetup
Talend spark meetup 03042017 - Paris Spark MeetupTalend spark meetup 03042017 - Paris Spark Meetup
Talend spark meetup 03042017 - Paris Spark Meetup
 
Paris Spark Meetup - Trifacta - 03_04_2017
Paris Spark Meetup - Trifacta - 03_04_2017Paris Spark Meetup - Trifacta - 03_04_2017
Paris Spark Meetup - Trifacta - 03_04_2017
 
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
 
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with spark
 
Hug janvier 2016 -EDF
Hug   janvier 2016 -EDFHug   janvier 2016 -EDF
Hug janvier 2016 -EDF
 
HUG France - 20160114 industrialisation_process_big_data CanalPlus
HUG France -  20160114 industrialisation_process_big_data CanalPlusHUG France -  20160114 industrialisation_process_big_data CanalPlus
HUG France - 20160114 industrialisation_process_big_data CanalPlus
 
Hugfr SPARK & RIAK -20160114_hug_france
Hugfr  SPARK & RIAK -20160114_hug_franceHugfr  SPARK & RIAK -20160114_hug_france
Hugfr SPARK & RIAK -20160114_hug_france
 
HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)
HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)
HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)
 
Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015
Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015
Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015
 
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
 
Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015
Record linkage, a real use case with spark ml  - Paris Spark meetup Dec 2015Record linkage, a real use case with spark ml  - Paris Spark meetup Dec 2015
Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015
 
Spark dataframe
Spark dataframeSpark dataframe
Spark dataframe
 
June Spark meetup : search as recommandation
June Spark meetup : search as recommandationJune Spark meetup : search as recommandation
June Spark meetup : search as recommandation
 
Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)
 
Spark meetup at viadeo
Spark meetup at viadeoSpark meetup at viadeo
Spark meetup at viadeo
 
Paris Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamiel
Paris Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamielParis Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamiel
Paris Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamiel
 

Marcel Kornacker: Impala tech talk Tue Feb 26th 2013

  • 1. Impala: A Modern SQL Engine for Hadoop Marcel Kornacker Cloudera, Inc.
  • 2. Agenda ● Goals; user view of Impala ● Impala internals ● Comparing Impala to other systems
  • 3. Impala Overview: Goals ● General-purpose SQL query engine: ○ should work both for analytical and transactional workloads ○ will support queries that take from milliseconds to hours ● Runs directly within Hadoop: ○ reads widely used Hadoop file formats ○ talks to widely used Hadoop storage managers ○ runs on same nodes that run Hadoop processes ● High performance: ○ C++ instead of Java ○ runtime code generation ○ completely new execution engine that doesn't build on MapReduce
  • 4. User View of Impala: Overview ● Runs as a distributed service in cluster: one Impala daemon on each node with data ● User submits query via ODBC/JDBC to any of the daemons ● Query is distributed to all nodes with relevant data ● If any node fails, the query fails ● Impala uses Hive's metadata interface, connects to Hive's metastore ● Supported file formats: ○ uncompressed/lzo-compressed text files ○ sequence files and RCfile with snappy/gzip compression ○ GA: Avro data files ○ GA: columnar format (more on that later)
  • 5. User View of Impala: SQL ● SQL support: ○ patterned after Hive's version of SQL ○ essentially SQL-92, minus correlated subqueries ○ Insert Into ... Select ... ○ only equi-joins; no non-equi joins, no cross products ○ Order By requires Limit ○ Post-GA: DDL support (CREATE, ALTER) ● Functional limitations: ○ no custom UDFs, file formats, SerDes ○ no beyond SQL (buckets, samples, transforms, arrays, structs, maps, xpath, json) ○ only hash joins; joined table has to fit in memory: ■ beta: of single node ■ GA: aggregate memory of all (executing) nodes
  • 6. User View of Impala: HBase ● HBase functionality: ○ uses Hive's mapping of HBase table into metastore table ○ predicates on rowkey columns are mapped into start/stop row ○ predicates on other columns are mapped into SingleColumnValueFilters ● HBase functional limitations: ○ no nested-loop joins ○ all data stored as text
  • 7. Impala Architecture ● Two binaries: impalad and statestored ● Impala daemon (impalad) a. handles client requests and all internal requests related to query execution b. exports Thrift services for these two roles ● State store daemon (statestored) a. provides name service and metadata distribution b. also exports a Thrift service
  • 8. Impala Architecture ● Query execution phases ○ request arrives via odbc/jdbc ○ planner turns request into collections of plan fragments ○ coordinator initiates execution on remote impalad's ● During execution ○ intermediate results are streamed between executors ○ query results are streamed back to client ○ subject to limitations imposed to blocking operators (top-n, aggregation)
  • 9. Impala Architecture: Planner ● 2-phase planning process: ○ single-node plan: left-deep tree of plan operators ○ plan partitioning: partition single-node plan to maximize scan locality, minimize data movement ● Plan operators: Scan, HashJoin, HashAggregation, Union, TopN, Exchange ● Distributed aggregation: pre-aggregation in all nodes, merge aggregation in single node GA: hash-partitioned aggregation: re-partition aggregation input on grouping columns in order to reduce per-node memory requirement ● Join order = FROM clause order Post-GA: rudimentary cost-based optimizer
  • 10. Impala Architecture: Planner ● Example: query with join and aggregation SELECT state, SUM(revenue) FROM HdfsTbl h JOIN HbaseTbl b ON (...) GROUP BY 1 ORDER BY 2 desc LIMIT 10 TopN Agg TopN Agg Hash Agg Join Hash Join Hdfs Hbase Exch Exch Scan Scan Hdfs Hbase at coordinator at DataNodes at region servers Scan Scan
  • 11. Impala Architecture: Query Execution Request arrives via odbc/jdbc SQL App Hive HDFS NN Statestore ODBC Metastore SQL request Query Planner Query Planner Query Planner Query Coordinator Query Coordinator Query Coordinator Query Executor Query Executor Query Executor HDFS DN HBase HDFS DN HBase HDFS DN HBase
  • 12. Impala Architecture: Query Execution Planner turns request into collections of plan fragments Coordinator initiates execution on remote impalad's SQL App Hive HDFS NN Statestore ODBC Metastore Query Planner Query Planner Query Planner Query Coordinator Query Coordinator Query Coordinator Query Executor Query Executor Query Executor HDFS DN HBase HDFS DN HBase HDFS DN HBase
  • 13. Impala Architecture: Query Execution Intermediate results are streamed between impalad's Query results are streamed back to client SQL App Hive HDFS NN Statestore ODBC Metastore query results Query Planner Query Planner Query Planner Query Coordinator Query Coordinator Query Coordinator Query Executor Query Executor Query Executor HDFS DN HBase HDFS DN HBase HDFS DN HBase
  • 14. Metadata Handling ● Utilizes Hive's metastore ● Caches metadata: no synchronous metastore API calls during query execution ● Beta: impalad's read metadata from metastore at startup ● Post-GA: metadata distribution through statestore ● Post-GA: HCatalog and AccessServer
  • 15. Impala Execution Engine ● Written in C++ ● Runtime code generation for "big loops" ● Internal in-memory tuple format puts fixed-width data at fixed offsets ● Uses intrinsics/special cpu instructions for text parsing, crc32 computation, etc.
  • 16. Impala Execution Engine ● More on runtime code generation ○ example of "big loop": insert batch of rows into hash table ○ known at query compile time: # of tuples in batch, tuple layout, column types, etc. ○ generate at compile time: unrolled loop that inlines all function calls, contains no dead code, minimizes branches ○ code generated using llvm
  • 17. Impala's Statestore ● Central system state repository ○ name service (membership) ○ Post-GA: metadata ○ Post-GA: other scheduling-relevant or diagnostic state ● Soft-state ○ all data can be reconstructed from the rest of the system ○ cluster continues to function when statestore fails, but per-node state becomes increasingly stale ● Sends periodic heartbeats ○ pushes new data ○ checks for liveness
  • 18. Statestore: Why not ZooKeeper ● ZK is not a good pub-sub system ○ Watch API is awkward and requires a lot of client logic ○ multiple round-trips required to get data for changes to node's children ○ push model is more natural for our use case ● Don't need all the guarantees ZK provides: ○ serializability ○ persistence ○ prefer to avoid complexity where possible ● ZK is bad at the things we care about and good at the things we don't
  • 19. Comparing Impala to Dremel ● What is Dremel: ○ columnar storage for data with nested structures ○ distributed scalable aggregation on top of that ● Columnar storage in Hadoop: joint project between Cloudera and Twitter ○ new columnar format: Parquet; derived from Doug Cutting's Trevni ○ stores data in appropriate native/binary types ○ can also store nested structures similar to Dremel's ColumnIO ● Distributed aggregation: Impala ● Impala plus Parquet: a superset of the published version of Dremel (which didn't support joins)
  • 20. More about Parquet ● What it is: ○ columnar container format for all popular serialization formats: Avro, Thrift, Protocol Buffers ○ successor to Trevni ○ jointly developed between Cloudera and Twitter ○ open source; hosted on github ● Features: ○ rowgroup format: file contains multiple horiz. slices ○ supports storing each column in separate file ○ supports fully shredded nested data; repetition and definition levels similar to Dremel's ColumnIO ○ column values stored in native types (bool, int<x>, float, double, byte array) ○ support for index pages for fast lookup ○ extensible value encodings
  • 21. Comparing Impala to Hive ● Hive: MapReduce as an execution engine ○ High latency, low throughput queries ○ Fault-tolerance model based on MapReduce's on- disk checkpointing; materializes all intermediate results ○ Java runtime allows for easy late-binding of functionality: file formats and UDFs. ○ Extensive layering imposes high runtime overhead ● Impala: ○ direct, process-to-process data exchange ○ no fault tolerance ○ an execution engine designed for low runtime overhead
  • 22. Comparing Impala to Hive ● Impala's performance advantage over Hive: no hard numbers, but ○ Impala can get full disk throughput (~100MB/sec/disk); I/O-bound workloads often faster by 3-4x ○ queries that require multiple map-reduce phases in Hive see a higher speedup ○ queries that run against in-memory data see a higher speedup (observed up to 100x)
  • 23. Impala Roadmap: GA (April 2013) ● New data formats: ○ Avro ○ Parquet ● Improved query execution: partitioned joins ● Further performance improvements ● Guidelines for production deployment: ○ load balancing across impalad's ○ resource isolation within MR cluster ● Additional packages: Debian, SLES
  • 24. Impala Roadmap: 2013 ● Additional SQL: ○ UDFs ○ SQL authorization and DDL ○ ORDER BY without LIMIT ○ window functions ○ support for structured data types ● Improved HBase support: ○ composite keys, complex types in columns, index nested-loop joins, INSERT/UPDATE/DELETE ● Runtime optimizations: ○ straggler handling ○ join order optimization ○ improved cache management ○ data collocation for improved join performance
  • 25. Impala Roadmap: 2013 ● Better metadata handling: ○ automatic metadata distribution through statestore ● Resource management: ○ goal: run exploratory and production workloads in same cluster, against same data, without impacting production jobs
  • 26. Try it out! ● Beta version available since 10/24/12 ● Latest version is 0.6 (as of 2/26) ● We have packages for: ○ RHEL6.2/5.7 ○ Ubuntu 10.04 and 12.04 ○ SLES11 ○ Debian6 ● We're targeting GA for April 2013 ● Questions/comments? impala-user@cloudera.org ● My email address: marcel@cloudera.com