SlideShare une entreprise Scribd logo
1  sur  27
Télécharger pour lire hors ligne
Cloudera Impala:
A Modern SQL Engine for Apache Hadoop
Mark Grover
Software Engineer, Cloudera
February 27, 2013
What is Impala?
●   General-purpose SQL engine
●   Real-time queries in Apache Hadoop
●   Beta version released since October 2012
●   General availability (GA) release slated for April 2013
●   Open source under Apache license
Overview
●   User View of Impala
●   Architecture of Impala
●   Comparing Impala with Dremel
●   Comparing Impala with Hive
●   Impala Roadmap
Impala Overview: Goals
●   General-purpose SQL query engine:
     ●   should work both for analytical and transactional workloads
     ●   will support queries that take from milliseconds to hours
●   Runs directly within Hadoop:
     ●   reads widely used Hadoop file formats
     ●   talks to widely used Hadoop storage managers
     ●   runs on same nodes that run Hadoop processes
●   High performance:
     ●   C++ instead of Java
     ●   runtime code generation
     ●   completely new execution engine that doesn't build on MapReduce
User View of Impala: Overview
●   Runs as a distributed service in cluster: one Impala daemon on each
    node with data
●   User submits query via ODBC/JDBC to any of the daemons
●   Query is distributed to all nodes with relevant data
●   If any node fails, the query fails
●   Impala uses Hive's metadata interface, connects to Hive's metastore
●   Supported file formats:
     ●   uncompressed/lzo-compressed text files
     ●   sequence files and RCFile with snappy/gzip compression
     ●   GA: Avro data files
     ●   GA: columnar format (more on that later)
User View of Impala: SQL
●   SQL support:
     ●   patterned after Hive's version of SQL
     ●   essentially SQL-92, minus correlated subqueries
     ●   limited to Select, Project, Join, Union, Subqueries, Aggregation and
         Insert
     ●   only equi-joins; no non-equi joins, no cross products
     ●   Order By only with Limit
     ●   GA: DDL support (CREATE, ALTER)
●   Functional limitations:
     ●   no custom UDFs, file formats, SerDes
     ●   no beyond SQL (buckets, samples, transforms, arrays, structs, maps,
         xpath, json)
     ●   only hash joins; joined table has to fit in memory:
           ● beta: of single node

           ● GA: aggregate memory of all (executing) nodes
User View of Impala: Apache HBase
●   HBase functionality:
     ●   uses Hive's mapping of HBase table into metastore table
     ●   predicates on rowkey columns are mapped into start/stop
         row
     ●   predicates on other columns are mapped into
         SingleColumnValueFilters
●   HBase functional limitations:
     ●   no nested-loop joins
     ●   all data stored as text
Impala Architecture
●   Two binaries: impalad and statestored
●   Impala daemon (impalad)
     ●   handles client requests and all internal requests related to
         query execution
     ●   exports Thrift services for these two roles
●   State store daemon (statestored)
     ●   provides name service and metadata distribution
     ●   also exports a Thrift service
Impala Architecture
●   Query execution phases
     ●   request arrives via odbc/jdbc
     ●   planner turns request into collections of plan fragments
     ●   coordinator initiates execution on remote impalad's
     ●   during execution
           ● intermediate results are streamed between executors

           ● query results are streamed back to client

           ● subject to limitations imposed to blocking operators

             (top-n, aggregation)
Impala Architecture: Planner
●   2-phase planning process:
     ●   single-node plan: left-deep tree of plan operators
     ●   plan partitioning: partition single-node plan to maximize scan locality,
         minimize data movement
●   Plan operators: Scan, HashJoin, HashAggregation, Union, TopN,
    Exchange
●   Distributed aggregation: pre-aggregation in all nodes, merge
    aggregation in single node.
    GA: hash-partitioned aggregation: re-partition aggregation
    input on grouping columns in order to reduce per-node
    memory requirement
●   Join order = FROM clause order
    GA target: rudimentary cost-based optimizer
Impala Architecture: Planner
●   Example: query with join and aggregation
    SELECT state, SUM(revenue)
    FROM HdfsTbl h JOIN HbaseTbl b ON (...)
    GROUP BY 1 ORDER BY 2 desc LIMIT 10

       TopN
                                                  Agg
                          TopN
       Agg                                       Hash
       Hash                Agg                   Join
       Join                                HDFS                   HBase
                           Exch                         Exch
                                           Scan                    Scan
    HDFS      HBase     at coordinator   at DataNodes          at region servers
    Scan       Scan
Impala Architecture: Query Execution
Request arrives via odbc/jdbc

      SQL App                               Hive
                                                      HDFS NN   Statestore
       ODBC                             Metastore
                          SQL
                        request

     Query Planner                 Query Planner           Query Planner
    Query Coordinator             Query Coordinator      Query Coordinator
     Query Executor               Query Executor          Query Executor
   HDFS DN      HBase             HDFS DN     HBase      HDFS DN    HBase
Impala Architecture: Query Execution
Planner turns request into collections of plan fragments
Coordinator initiates execution on remote impalad's
      SQL App                     Hive
                                            HDFS NN   Statestore
       ODBC                   Metastore




     Query Planner       Query Planner           Query Planner
    Query Coordinator   Query Coordinator      Query Coordinator
     Query Executor      Query Executor         Query Executor
    HDFS DN     HBase   HDFS DN     HBase      HDFS DN    HBase
Impala Architecture: Query Execution
Intermediate results are streamed between impalad's Query
results are streamed back to client

     SQL App                          Hive
                                                HDFS NN   Statestore
      ODBC                        Metastore

                   query
                  results

     Query Planner           Query Planner           Query Planner
   Query Coordinator        Query Coordinator      Query Coordinator
    Query Executor           Query Executor         Query Executor
   HDFS DN     HBase        HDFS DN     HBase      HDFS DN    HBase
Impala Architecture
●   Metadata handling:
     ●   utilizes Hive's metastore
     ●   caches metadata: no synchronous metastore API calls
         during query execution
     ●   beta: impalad's read metadata from metastore at startup
     ●   Post-GA: metadata distribution through statestore
     ●   Post-GA: HCatalog
Impala Architecture
●   Execution engine
     ●   written in C++
     ●   runtime code generation for "big loops"
     ●   internal in-memory tuple format plus fixed-width data at
         fixed offsets
     ●   uses intrinsics/special cpu instructions for text parsing,
         crc32 computation, etc.
Impala Execution Engine
●   More on runtime code generation
     ●   example of "big loop": insert batch of rows into hash table
     ●   known at query compile time: # of tuples in a batch, tuple
         layout, column types, etc.
     ●   generate at compile time: unrolled loop that inlines all
         function calls, contains no dead code, minimizes branches
     ●   code generated using llvm
Impala's Statestore
●   Central system state repository
     ●   name service (membership)
     ●   Post-GA: metadata
     ●   Post-GA: other scheduling-relevant or diagnostic state
●   Soft-state
     ●   all data can be reconstructed from the rest of the system
     ●   cluster continues to function when statestore fails, but per-node state
         becomes increasingly stale
●   Sends periodic heartbeats
     ●   pushes new data
     ●   checks for liveness
Statestore: Why not ZooKeeper?
●   ZK is not a good pub-sub system
     ●   Watch API is awkward and requires a lot of client logic
     ●   multiple round-trips required to get data for changes to
         node's children
     ●   push model is more natural for our use case
●   Don't need all the guarantees ZK provides:
     ●   serializability
     ●   persistence
     ●   prefer to avoid complexity where possible
●   ZK is bad at the things we care about and good at the
    things we don't
Comparing Impala to Dremel
●   What is Dremel?
     ●   columnar storage for data with nested structures
     ●   distributed scalable aggregation on top of that
●   Columnar storage in Hadoop: joint project between Cloudera
    and Twitter
     ●   new columnar format: Parquet; derived from Doug Cutting's Trevni
     ●   stores data in appropriate native/binary types
     ●   can also store nested structures similar to Dremel's ColumnIO
●   Distributed aggregation: Impala
●   Impala plus Parquet: a superset of the published version of
    Dremel (which didn't support joins)
More about Parquet
●   What is it:
     ●   container format for all popular serialization formats: Avro, Thrift,
         Protocol Buffers
     ●   successor to Trevni
     ●   jointly developed between Cloudera and Twitter
     ●   open source; hosted on github
●   Features
     ●   rowgroup format: file contains multiple horiz. slices
     ●   supports storing each column in separate file
     ●   supports fully shredded nested data; repetition and definition levels
         similar to Dremel's ColumnIO
     ●   column values stored in native types (bool, int<x>, float, double, byte
         array)
     ●   support for index pages for fast lookup
     ●   extensible value encodings
Comparing Impala to Hive
●   Hive: MapReduce as an execution engine
     ●   High latency, low throughput queries
     ●   Fault-tolerance model based on MapReduce's on-disk
         checkpointing; materializes all intermediate results
     ●   Java runtime allows for easy late-binding of functionality:
         file formats and UDFs.
     ●   Extensive layering imposes high runtime overhead
●   Impala:
     ●   direct, process-to-process data exchange
     ●   no fault tolerance
     ●   an execution engine designed for low runtime overhead
Comparing Impala to Hive
●   Impala's performance advantage over Hive: no hard
    numbers, but
     ●   Impala can get full disk throughput (~100MB/sec/disk);
         I/O-bound workloads often faster by 3-4x
     ●   queries that require multiple map-reduce phases in Hive
         see a higher speedup
     ●   queries that run against in-memory data see a higher
         speedup (observed up to 100x)
Impala Roadmap: GA – April 2013
●   New data formats:
     ●   Avro
     ●   Parquet
●   Improved query execution: partitioned joins
●   Further performance improvements
●   Guidelines for production deployment:
     ●   load balancing across impalad's
     ●   resource isolation within MR cluster
Impala Roadmap: 2013
●   Additional SQL:
     ●   UDFs
     ●   SQL authorization and DDL
     ●   ORDER BY without LIMIT
     ●   window functions
     ●   support for structured data types
●   Improved HBase support:
     ●   composite keys, complex types in columns,
         index nested-loop joins,
         INSERT/UPDATE/DELETE
Impala Roadmap: 2013
●   Runtime optimizations:
     ●   straggler handling
     ●   join order optimization
     ●   improved cache management
     ●   data collocation for improved join performance
●   Better metadata handling:
     ●   automatic metadata distribution through statestore
●   Resource management:
     ●   goal: run exploratory and production workloads in same
         cluster, against same data, w/o impacting production jobs
Try it out!
●   Beta version available since 10/24/12
●   Latest version is 0.6
●   We have packages for:
●   RHEL 6.2/5.7
●   Ubuntu 10.04 and 12.04
●   SLES 11
●   Debian 6
●   We are targeting GA for April 2013
●   Questions/comments? impala-user@cloudera.org
●   My email address: mgrover@cloudera.com
●   My twitter handle: mark_grover

Contenu connexe

Tendances

Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File Systemelliando dias
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
Apache Arrow Flight Overview
Apache Arrow Flight OverviewApache Arrow Flight Overview
Apache Arrow Flight OverviewJacques Nadeau
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataDataWorks Summit
 
Query Compilation in Impala
Query Compilation in ImpalaQuery Compilation in Impala
Query Compilation in ImpalaCloudera, Inc.
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Flink Forward Berlin 2018: Stefan Richter - "Tuning Flink for Robustness and ...
Flink Forward Berlin 2018: Stefan Richter - "Tuning Flink for Robustness and ...Flink Forward Berlin 2018: Stefan Richter - "Tuning Flink for Robustness and ...
Flink Forward Berlin 2018: Stefan Richter - "Tuning Flink for Robustness and ...Flink Forward
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeDatabricks
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesYoshinori Matsunobu
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Large Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphLarge Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphP. Taylor Goetz
 
Sharding Methods for MongoDB
Sharding Methods for MongoDBSharding Methods for MongoDB
Sharding Methods for MongoDBMongoDB
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High PerformanceInderaj (Raj) Bains
 

Tendances (20)

Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Apache Arrow Flight Overview
Apache Arrow Flight OverviewApache Arrow Flight Overview
Apache Arrow Flight Overview
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
 
Data stage
Data stageData stage
Data stage
 
Query Compilation in Impala
Query Compilation in ImpalaQuery Compilation in Impala
Query Compilation in Impala
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Flink Forward Berlin 2018: Stefan Richter - "Tuning Flink for Robustness and ...
Flink Forward Berlin 2018: Stefan Richter - "Tuning Flink for Robustness and ...Flink Forward Berlin 2018: Stefan Richter - "Tuning Flink for Robustness and ...
Flink Forward Berlin 2018: Stefan Richter - "Tuning Flink for Robustness and ...
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Large Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphLarge Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraph
 
Sharding Methods for MongoDB
Sharding Methods for MongoDBSharding Methods for MongoDB
Sharding Methods for MongoDB
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
HDFS Erasure Coding in Action
HDFS Erasure Coding in Action HDFS Erasure Coding in Action
HDFS Erasure Coding in Action
 

En vedette

Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)Yukinori Suda
 
Securing the Hadoop Ecosystem
Securing the Hadoop EcosystemSecuring the Hadoop Ecosystem
Securing the Hadoop EcosystemDataWorks Summit
 
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"Dr. Mirko Kämpf
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera, Inc.
 
Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Cloudera, Inc.
 
Taming Operations in the Hadoop Ecosystem
Taming Operations in the Hadoop EcosystemTaming Operations in the Hadoop Ecosystem
Taming Operations in the Hadoop EcosystemCloudera, Inc.
 
Introducing the TPCx-HS Benchmark for Big Data
Introducing the TPCx-HS Benchmark for Big DataIntroducing the TPCx-HS Benchmark for Big Data
Introducing the TPCx-HS Benchmark for Big Datainside-BigData.com
 
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark Summit
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala InternalsDavid Groozman
 
In-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great TasteIn-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great TasteDataWorks Summit
 
Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...
Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...
Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...Cloudera, Inc.
 
Debugging (Docker) containers in production
Debugging (Docker) containers in productionDebugging (Docker) containers in production
Debugging (Docker) containers in productionbcantrill
 
Nested Types in Impala
Nested Types in ImpalaNested Types in Impala
Nested Types in ImpalaCloudera, Inc.
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationAlex Moundalexis
 
Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8Cloudera, Inc.
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopCloudera, Inc.
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemCloudera, Inc.
 
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Chris Fregly
 

En vedette (20)

The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)
 
Securing the Hadoop Ecosystem
Securing the Hadoop EcosystemSecuring the Hadoop Ecosystem
Securing the Hadoop Ecosystem
 
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
 
Hadoop Puzzlers
Hadoop PuzzlersHadoop Puzzlers
Hadoop Puzzlers
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for Hadoop
 
Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015
 
Taming Operations in the Hadoop Ecosystem
Taming Operations in the Hadoop EcosystemTaming Operations in the Hadoop Ecosystem
Taming Operations in the Hadoop Ecosystem
 
Introducing the TPCx-HS Benchmark for Big Data
Introducing the TPCx-HS Benchmark for Big DataIntroducing the TPCx-HS Benchmark for Big Data
Introducing the TPCx-HS Benchmark for Big Data
 
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
 
In-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great TasteIn-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great Taste
 
Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...
Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...
Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...
 
Debugging (Docker) containers in production
Debugging (Docker) containers in productionDebugging (Docker) containers in production
Debugging (Docker) containers in production
 
Nested Types in Impala
Nested Types in ImpalaNested Types in Impala
Nested Types in Impala
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux Configuration
 
Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
 
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
 

Similaire à Cloudera Impala: A Modern SQL Engine for Apache Hadoop

Jan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
Jan 2013 HUG: Impala - Real-time Queries for Apache HadoopJan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
Jan 2013 HUG: Impala - Real-time Queries for Apache HadoopYahoo Developer Network
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Modern Data Stack France
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopChicago Hadoop Users Group
 
Impala presentation
Impala presentationImpala presentation
Impala presentationtrihug
 
Cloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for HadoopCloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for HadoopCloudera, Inc.
 
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfssusere05ec21
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentationhadooparchbook
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impalamarkgrover
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad ranaData Con LA
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwielerlucenerevolution
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAsLuis Marques
 
Fabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on FlinkFabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on FlinkFlink Forward
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetCloudera, Inc.
 
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseOct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseYahoo Developer Network
 

Similaire à Cloudera Impala: A Modern SQL Engine for Apache Hadoop (20)

Jan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
Jan 2013 HUG: Impala - Real-time Queries for Apache HadoopJan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
Jan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache Hadoop
 
Impala presentation
Impala presentationImpala presentation
Impala presentation
 
Cloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for HadoopCloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for Hadoop
 
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad rana
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
Impala for PhillyDB Meetup
Impala for PhillyDB MeetupImpala for PhillyDB Meetup
Impala for PhillyDB Meetup
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAs
 
Fabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on FlinkFabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on Flink
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
 
Hbase: an introduction
Hbase: an introductionHbase: an introduction
Hbase: an introduction
 
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseOct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
 

Plus de Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

Plus de Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Dernier

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAnitaRaj43
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard37
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMKumar Satyam
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 

Dernier (20)

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 

Cloudera Impala: A Modern SQL Engine for Apache Hadoop

  • 1. Cloudera Impala: A Modern SQL Engine for Apache Hadoop Mark Grover Software Engineer, Cloudera February 27, 2013
  • 2. What is Impala? ● General-purpose SQL engine ● Real-time queries in Apache Hadoop ● Beta version released since October 2012 ● General availability (GA) release slated for April 2013 ● Open source under Apache license
  • 3. Overview ● User View of Impala ● Architecture of Impala ● Comparing Impala with Dremel ● Comparing Impala with Hive ● Impala Roadmap
  • 4. Impala Overview: Goals ● General-purpose SQL query engine: ● should work both for analytical and transactional workloads ● will support queries that take from milliseconds to hours ● Runs directly within Hadoop: ● reads widely used Hadoop file formats ● talks to widely used Hadoop storage managers ● runs on same nodes that run Hadoop processes ● High performance: ● C++ instead of Java ● runtime code generation ● completely new execution engine that doesn't build on MapReduce
  • 5. User View of Impala: Overview ● Runs as a distributed service in cluster: one Impala daemon on each node with data ● User submits query via ODBC/JDBC to any of the daemons ● Query is distributed to all nodes with relevant data ● If any node fails, the query fails ● Impala uses Hive's metadata interface, connects to Hive's metastore ● Supported file formats: ● uncompressed/lzo-compressed text files ● sequence files and RCFile with snappy/gzip compression ● GA: Avro data files ● GA: columnar format (more on that later)
  • 6. User View of Impala: SQL ● SQL support: ● patterned after Hive's version of SQL ● essentially SQL-92, minus correlated subqueries ● limited to Select, Project, Join, Union, Subqueries, Aggregation and Insert ● only equi-joins; no non-equi joins, no cross products ● Order By only with Limit ● GA: DDL support (CREATE, ALTER) ● Functional limitations: ● no custom UDFs, file formats, SerDes ● no beyond SQL (buckets, samples, transforms, arrays, structs, maps, xpath, json) ● only hash joins; joined table has to fit in memory: ● beta: of single node ● GA: aggregate memory of all (executing) nodes
  • 7. User View of Impala: Apache HBase ● HBase functionality: ● uses Hive's mapping of HBase table into metastore table ● predicates on rowkey columns are mapped into start/stop row ● predicates on other columns are mapped into SingleColumnValueFilters ● HBase functional limitations: ● no nested-loop joins ● all data stored as text
  • 8. Impala Architecture ● Two binaries: impalad and statestored ● Impala daemon (impalad) ● handles client requests and all internal requests related to query execution ● exports Thrift services for these two roles ● State store daemon (statestored) ● provides name service and metadata distribution ● also exports a Thrift service
  • 9. Impala Architecture ● Query execution phases ● request arrives via odbc/jdbc ● planner turns request into collections of plan fragments ● coordinator initiates execution on remote impalad's ● during execution ● intermediate results are streamed between executors ● query results are streamed back to client ● subject to limitations imposed to blocking operators (top-n, aggregation)
  • 10. Impala Architecture: Planner ● 2-phase planning process: ● single-node plan: left-deep tree of plan operators ● plan partitioning: partition single-node plan to maximize scan locality, minimize data movement ● Plan operators: Scan, HashJoin, HashAggregation, Union, TopN, Exchange ● Distributed aggregation: pre-aggregation in all nodes, merge aggregation in single node. GA: hash-partitioned aggregation: re-partition aggregation input on grouping columns in order to reduce per-node memory requirement ● Join order = FROM clause order GA target: rudimentary cost-based optimizer
  • 11. Impala Architecture: Planner ● Example: query with join and aggregation SELECT state, SUM(revenue) FROM HdfsTbl h JOIN HbaseTbl b ON (...) GROUP BY 1 ORDER BY 2 desc LIMIT 10 TopN Agg TopN Agg Hash Hash Agg Join Join HDFS HBase Exch Exch Scan Scan HDFS HBase at coordinator at DataNodes at region servers Scan Scan
  • 12. Impala Architecture: Query Execution Request arrives via odbc/jdbc SQL App Hive HDFS NN Statestore ODBC Metastore SQL request Query Planner Query Planner Query Planner Query Coordinator Query Coordinator Query Coordinator Query Executor Query Executor Query Executor HDFS DN HBase HDFS DN HBase HDFS DN HBase
  • 13. Impala Architecture: Query Execution Planner turns request into collections of plan fragments Coordinator initiates execution on remote impalad's SQL App Hive HDFS NN Statestore ODBC Metastore Query Planner Query Planner Query Planner Query Coordinator Query Coordinator Query Coordinator Query Executor Query Executor Query Executor HDFS DN HBase HDFS DN HBase HDFS DN HBase
  • 14. Impala Architecture: Query Execution Intermediate results are streamed between impalad's Query results are streamed back to client SQL App Hive HDFS NN Statestore ODBC Metastore query results Query Planner Query Planner Query Planner Query Coordinator Query Coordinator Query Coordinator Query Executor Query Executor Query Executor HDFS DN HBase HDFS DN HBase HDFS DN HBase
  • 15. Impala Architecture ● Metadata handling: ● utilizes Hive's metastore ● caches metadata: no synchronous metastore API calls during query execution ● beta: impalad's read metadata from metastore at startup ● Post-GA: metadata distribution through statestore ● Post-GA: HCatalog
  • 16. Impala Architecture ● Execution engine ● written in C++ ● runtime code generation for "big loops" ● internal in-memory tuple format plus fixed-width data at fixed offsets ● uses intrinsics/special cpu instructions for text parsing, crc32 computation, etc.
  • 17. Impala Execution Engine ● More on runtime code generation ● example of "big loop": insert batch of rows into hash table ● known at query compile time: # of tuples in a batch, tuple layout, column types, etc. ● generate at compile time: unrolled loop that inlines all function calls, contains no dead code, minimizes branches ● code generated using llvm
  • 18. Impala's Statestore ● Central system state repository ● name service (membership) ● Post-GA: metadata ● Post-GA: other scheduling-relevant or diagnostic state ● Soft-state ● all data can be reconstructed from the rest of the system ● cluster continues to function when statestore fails, but per-node state becomes increasingly stale ● Sends periodic heartbeats ● pushes new data ● checks for liveness
  • 19. Statestore: Why not ZooKeeper? ● ZK is not a good pub-sub system ● Watch API is awkward and requires a lot of client logic ● multiple round-trips required to get data for changes to node's children ● push model is more natural for our use case ● Don't need all the guarantees ZK provides: ● serializability ● persistence ● prefer to avoid complexity where possible ● ZK is bad at the things we care about and good at the things we don't
  • 20. Comparing Impala to Dremel ● What is Dremel? ● columnar storage for data with nested structures ● distributed scalable aggregation on top of that ● Columnar storage in Hadoop: joint project between Cloudera and Twitter ● new columnar format: Parquet; derived from Doug Cutting's Trevni ● stores data in appropriate native/binary types ● can also store nested structures similar to Dremel's ColumnIO ● Distributed aggregation: Impala ● Impala plus Parquet: a superset of the published version of Dremel (which didn't support joins)
  • 21. More about Parquet ● What is it: ● container format for all popular serialization formats: Avro, Thrift, Protocol Buffers ● successor to Trevni ● jointly developed between Cloudera and Twitter ● open source; hosted on github ● Features ● rowgroup format: file contains multiple horiz. slices ● supports storing each column in separate file ● supports fully shredded nested data; repetition and definition levels similar to Dremel's ColumnIO ● column values stored in native types (bool, int<x>, float, double, byte array) ● support for index pages for fast lookup ● extensible value encodings
  • 22. Comparing Impala to Hive ● Hive: MapReduce as an execution engine ● High latency, low throughput queries ● Fault-tolerance model based on MapReduce's on-disk checkpointing; materializes all intermediate results ● Java runtime allows for easy late-binding of functionality: file formats and UDFs. ● Extensive layering imposes high runtime overhead ● Impala: ● direct, process-to-process data exchange ● no fault tolerance ● an execution engine designed for low runtime overhead
  • 23. Comparing Impala to Hive ● Impala's performance advantage over Hive: no hard numbers, but ● Impala can get full disk throughput (~100MB/sec/disk); I/O-bound workloads often faster by 3-4x ● queries that require multiple map-reduce phases in Hive see a higher speedup ● queries that run against in-memory data see a higher speedup (observed up to 100x)
  • 24. Impala Roadmap: GA – April 2013 ● New data formats: ● Avro ● Parquet ● Improved query execution: partitioned joins ● Further performance improvements ● Guidelines for production deployment: ● load balancing across impalad's ● resource isolation within MR cluster
  • 25. Impala Roadmap: 2013 ● Additional SQL: ● UDFs ● SQL authorization and DDL ● ORDER BY without LIMIT ● window functions ● support for structured data types ● Improved HBase support: ● composite keys, complex types in columns, index nested-loop joins, INSERT/UPDATE/DELETE
  • 26. Impala Roadmap: 2013 ● Runtime optimizations: ● straggler handling ● join order optimization ● improved cache management ● data collocation for improved join performance ● Better metadata handling: ● automatic metadata distribution through statestore ● Resource management: ● goal: run exploratory and production workloads in same cluster, against same data, w/o impacting production jobs
  • 27. Try it out! ● Beta version available since 10/24/12 ● Latest version is 0.6 ● We have packages for: ● RHEL 6.2/5.7 ● Ubuntu 10.04 and 12.04 ● SLES 11 ● Debian 6 ● We are targeting GA for April 2013 ● Questions/comments? impala-user@cloudera.org ● My email address: mgrover@cloudera.com ● My twitter handle: mark_grover