SlideShare une entreprise Scribd logo
1  sur  25
Impala
A high level overview.
By Ahad Rana
Director of Engineering @ Factual
What is Impala ?
● Real-time SQL Query Engine from Cloudera.
● Built on top of Hadoop.
● Open Source (Apache License).
Hadoop Ecosystem Keeps Growing/Maturing
● High Availability NameNodes
● Quorum based NameNode Journaling.
● Zookeeper integration.
● YARN support for Linux Containers/ Multi-Tenancy.
● HDFS optimizations - direct reads, mmap reads.
● Kerberos based security throughout, ACLs (in CDH5)
● HBase improvements.
● New file formats: Parquet (based on Google Dremel).
MapReduce:Not the Solution for Everything
● Batch oriented, high latency.
● Joins are complex.
● DSLs are complex.
● Only really accessible to developers.
Needed:Better Accessibility to “Big Data”
● SQL is still the lingua franca of the Business Analytics community.
● Real-time / faster query response times can have a dramatic affect on
productivity.
● The previous generation of enabling technologies (like Hive) are
batch oriented and too slow.
The Solution:MPP Query Engines on Hadoop
● That is, MPP SQL Query Engines on Hadoop.
● Proprietary vendors have been in this game for a while and have been
charging a lot for custom hardware and software.
● New engines, like Impala, are game changers:
○ Built from the ground up to integrate with Hadoop - Your data is
already in Hadoop - co-locate the queries next to your data!
○ Read data from existing Hadoop File Formats: Text,LZO-P,
Sequence File,AVRO etc.
○ No time consuming ETL cycle or indexing.
○ Open Source, non-proprietary, and they run on the same commodity
hardware!
○ Bypass MapReduce framework, tuned for speed.
○ All support similar subset of SQL-92 as Hive.
Benchmarks: Impala vs. Hive 0.12
Source: Cloudera
● Interactive - 6x-68x faster.
● Reports - 8x-29x faster.
● Analytics - 10x - 69x faster.
Impala - One of Many in a Crowded Field
● Many parallel efforts to address this need:
○ Presto (Facebook).
○ Shark (AmpLab).
○ Hive on Tez (Hortonworks).
○ Impala (Cloudera).
○ Redshift (Amazon/proprietary).
○ etc...
Impala vs. Other MPP Engines
● 21 Nodes (2 CPU - 12 Core, 12 disks, 384GB RAM)
● TPC-DS derived benchmark scaled to 15TB.
Source: Cloudera
Impala Technology Highlights
● Designed with performance in mind - backend is C++.
● Operates on native types in memory wherever possible.
● Uses JIT compiler (LLVM) to generate optimized expression evaluation
code (minimal branching, no v-table calls) on the fly.
● Asynchronously streams results between various stages of the
pipelines.
● Keeps all intermediate results in memory.
● Parallelizes scans and locates them near data (DataNodes),and
directly reads HDFS blocks from disk.
● Supports Broadcast Joins (Fact table is Scanned in a partitioned
manner, Smaller Dimension table(s) are broadcast to all nodes)
● Also supports Partitioned Joins (both Fact/Dimension tables are
partitioned on join columns / expressions).
● Supports a variety of file formats (Text,Text compressed, Seq File,
RCFile, AVRO, HFile, Parquet) AND Scans against HBase.
Impala Deployment Environment
● Three binaries
○ impalad - one on each Hadoop Datanode, handles client requests,
runs query execution pipelines.
○ statestored - single instance, name service (like Zookeeper),
used to coordinate catalog changes, impalad availability etc.
○ catalogd - single instance, talks to Hive metastore, caches
table metadata, distributes metadata to impalads, broadcasts
metadata changes via statestored.
Impala - Life of a Query
1. Query is routed to arbitrary impalad in cluster.
2. Java code running in impalad creates parse tree from SQL statement.
3. Planner evaluates parse tree and create query Plan Tree.
4. Data Sources and Sinks (Tables) are validated against the metadata
catalog (via catalogd).
5. Planner uses table metadata to make optimal placement decisions,
establish sharding strategy,memory requirements per query stage,and
constructs Plan Fragments.
6. catalogd caches HDFS block locations for every HDFS file, and Planner
attaches block locations to Query Plan metadata.
7. Planner converts Plan Fragments into Thrift objects, and passes them
on to the Query Coordinator (C++).
8. Coordinator runs root Fragment locally and distributes remaining
Fragments to other backends.
9. Coordinator monitors Fragment execution until completion,cancellation
or failure.
10. Fragments stream data to other Fragments or to the primary Fragment
(User or Table).
Life of a Query (Planning)
Life of a Query (Execution)
Simple Query Plan (Broadcast Join)
Simple Query Plan (Shuffle)
Simple Query Plan (Aggregation)
Performance Checklist
● Fact to Fact table joins are expensive!
● Scalar data types (INT,BIGINT) better than generic (String).
● Coerce types in metadata during load.
● Push Filter Predicates downstream as much as possible (Avoid upstream
IO).
● Use Impala’s Partitioning support to pre-partition tables.
Compression
● Compression means less network / disk IO.
● Always use compression.
● Choose codec wisely - make the appropriate size vs. cpu cost
tradeoff.
● Snappy - Low CPU requirements, Low compression ratio.
● GZip - High CPU,High compression ratio.
● LZO - Somewhere in the middle.
● Snappy is generally a good default choice.
File Formats
● Use a Columnar format (like Parquet) if possible.
● Columnar formats give you:
○ Less IO for Impala - Only read relevant column data.
○ Better compression, more efficient encoding.
● Parquet is the new preferred file format for Impala.
Parquet Format Detail
● Row Group - group of rows in columnar format
○ A Row Group’s worth of data is buffered in RAM when writing.
○ Impala prefers large (1GB row group sizes)
○ Potentially multiple row groups per split (when reading).
● Column Chunk / Page(s)
○ Each column’s data is stored in a separate chunk.
○ Each chunk is further divided into Page(s) worth of data.
○ Each page is ~1MB in size.
○ Compression is applied at Page level.
○ Smallest level of granularity for read is the Page.
● Good for writing data in bulk, not good for small writes (Insert
Values … ).
● Supports nested columns (see Google Dremel Paper).
● Map-Reduce InputFormat can materialize data back as Thrift objects.
HBase Integration
● Impala currently only uses HBase RPCs to scan HBase tables.
● Map HBase binary key/columns to types using Hive/Impala metadata.
● Best to only use HBase key column in predicates, otherwise you will
have to scan entire table.
● Best to do large fact table scans in Impala first and then do a late
stage join to a limited number of hbase rows.
● HBase now has snapshotting support, so perhaps bulk scans are
possible in the future ?
Impala Codebase Commentary
● Good use of best practices throughout codebase
○ Robust Comments
○ Modern C++ standards (STL,Boost, aggressive smart pointer
utilization to avoid leaks)
○ Good instrumentation framework throughout codebase.
○ Smart memory management.
○ Thrift RPC and Thrift data structures for exchange of data /
messages between services and layers.
○ Smart use of Java for front-end query processing/parsing, DDL,
DML integration with Hive etc, and then transfer of Query Plan
from Java layer to C++ via Thrift objects.
○ Very efficient use of memory in pipeline
■ Tuple (collection of typed column data, allocated using
pooled memory)
■ Awareness of various distinct tuples flowing through various
stages of pipeline means they can generate super efficient
just-in-time code to read/write these tuples.
Potential Gotchas
● High, potentially unconstrained demand on Hadoop nodes
○ Containerization (in YARN) helps, but still, nodes can consume a
lot of RAM and CPU.
● Allocate enough memory to support your needs, avoid swap!
● Potentially large impact on shared network IO when doing running
inefficient queries that broadcast lots of data upstream.
● No quota management, resource tracking by user.
ahad@factual.com
P.S. Factual is always looking for good engineers!
Thank you.

Contenu connexe

Tendances

Big Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneBig Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneDouglas Moore
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Omid Vahdaty
 
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...Data Con LA
 
2011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v52011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v5Samuel Rash
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data Omid Vahdaty
 
What Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database ScalabilityWhat Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database Scalabilityjbellis
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | EnglishOmid Vahdaty
 
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft..."Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...Dataconomy Media
 
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Spark Summit
 
Kafka website activity architecture
Kafka website activity architectureKafka website activity architecture
Kafka website activity architectureOmid Vahdaty
 
HBase at Mendeley
HBase at MendeleyHBase at Mendeley
HBase at MendeleyDan Harvey
 
Distributed deep learning
Distributed deep learningDistributed deep learning
Distributed deep learningMehdi Shibahara
 
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016Alluxio, Inc.
 
Introducing Kafka Connect and Implementing Custom Connectors
Introducing Kafka Connect and Implementing Custom ConnectorsIntroducing Kafka Connect and Implementing Custom Connectors
Introducing Kafka Connect and Implementing Custom ConnectorsItai Yaffe
 
Semi-Supervised Learning In An Adversarial Environment
Semi-Supervised Learning In An Adversarial EnvironmentSemi-Supervised Learning In An Adversarial Environment
Semi-Supervised Learning In An Adversarial EnvironmentDataWorks Summit
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...Omid Vahdaty
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Sparkdatamantra
 
Navigating NoSQL in cloudy skies
Navigating NoSQL in cloudy skiesNavigating NoSQL in cloudy skies
Navigating NoSQL in cloudy skiesshnkr_rmchndrn
 

Tendances (20)

Big Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneBig Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIne
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...
 
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
 
2011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v52011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v5
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
 
What Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database ScalabilityWhat Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database Scalability
 
NoSql
NoSqlNoSql
NoSql
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft..."Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
 
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
 
Kafka website activity architecture
Kafka website activity architectureKafka website activity architecture
Kafka website activity architecture
 
HBase at Mendeley
HBase at MendeleyHBase at Mendeley
HBase at Mendeley
 
Distributed deep learning
Distributed deep learningDistributed deep learning
Distributed deep learning
 
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016
 
Spark 101
Spark 101Spark 101
Spark 101
 
Introducing Kafka Connect and Implementing Custom Connectors
Introducing Kafka Connect and Implementing Custom ConnectorsIntroducing Kafka Connect and Implementing Custom Connectors
Introducing Kafka Connect and Implementing Custom Connectors
 
Semi-Supervised Learning In An Adversarial Environment
Semi-Supervised Learning In An Adversarial EnvironmentSemi-Supervised Learning In An Adversarial Environment
Semi-Supervised Learning In An Adversarial Environment
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Navigating NoSQL in cloudy skies
Navigating NoSQL in cloudy skiesNavigating NoSQL in cloudy skies
Navigating NoSQL in cloudy skies
 

En vedette

Hybrid architecture integrateduserviewdata-peyman_mohajerian
Hybrid architecture integrateduserviewdata-peyman_mohajerianHybrid architecture integrateduserviewdata-peyman_mohajerian
Hybrid architecture integrateduserviewdata-peyman_mohajerianData Con LA
 
VoltDB Big Data Camp LA 2014 - Scott Jar
VoltDB  Big Data Camp LA 2014 - Scott JarVoltDB  Big Data Camp LA 2014 - Scott Jar
VoltDB Big Data Camp LA 2014 - Scott JarData Con LA
 
Aziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaAziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaData Con LA
 
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Data Con LA
 
Big Data Day LA 2016/ Big Data Track - Real Time Analytics with Druid - Guill...
Big Data Day LA 2016/ Big Data Track - Real Time Analytics with Druid - Guill...Big Data Day LA 2016/ Big Data Track - Real Time Analytics with Druid - Guill...
Big Data Day LA 2016/ Big Data Track - Real Time Analytics with Druid - Guill...Data Con LA
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksData Con LA
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Data Con LA
 
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...Data Con LA
 
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Data Con LA
 
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big AnalyticsReal time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big AnalyticsData Con LA
 
Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...
Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...
Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...Data Con LA
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...Data Con LA
 
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...Data Con LA
 
Big Data Day LA 2016/ Data Science Track - Affinity Marketing Leveraging Crow...
Big Data Day LA 2016/ Data Science Track - Affinity Marketing Leveraging Crow...Big Data Day LA 2016/ Data Science Track - Affinity Marketing Leveraging Crow...
Big Data Day LA 2016/ Data Science Track - Affinity Marketing Leveraging Crow...Data Con LA
 
Big Data Day LA 2015 - Building a Big Data Culture in the Entertainment Indus...
Big Data Day LA 2015 - Building a Big Data Culture in the Entertainment Indus...Big Data Day LA 2015 - Building a Big Data Culture in the Entertainment Indus...
Big Data Day LA 2015 - Building a Big Data Culture in the Entertainment Indus...Data Con LA
 
Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...
Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...
Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...Data Con LA
 
Big Data Day LA 2016 Keynote - Reynold Xin/ Databricks
Big Data Day LA 2016 Keynote - Reynold Xin/ DatabricksBig Data Day LA 2016 Keynote - Reynold Xin/ Databricks
Big Data Day LA 2016 Keynote - Reynold Xin/ DatabricksData Con LA
 
Apache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesApache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesData Con LA
 
Big Data Day LA 2016/ Data Science Track - The Right Tool for the Job: Guidel...
Big Data Day LA 2016/ Data Science Track - The Right Tool for the Job: Guidel...Big Data Day LA 2016/ Data Science Track - The Right Tool for the Job: Guidel...
Big Data Day LA 2016/ Data Science Track - The Right Tool for the Job: Guidel...Data Con LA
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Data Con LA
 

En vedette (20)

Hybrid architecture integrateduserviewdata-peyman_mohajerian
Hybrid architecture integrateduserviewdata-peyman_mohajerianHybrid architecture integrateduserviewdata-peyman_mohajerian
Hybrid architecture integrateduserviewdata-peyman_mohajerian
 
VoltDB Big Data Camp LA 2014 - Scott Jar
VoltDB  Big Data Camp LA 2014 - Scott JarVoltDB  Big Data Camp LA 2014 - Scott Jar
VoltDB Big Data Camp LA 2014 - Scott Jar
 
Aziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaAziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jha
 
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
 
Big Data Day LA 2016/ Big Data Track - Real Time Analytics with Druid - Guill...
Big Data Day LA 2016/ Big Data Track - Real Time Analytics with Druid - Guill...Big Data Day LA 2016/ Big Data Track - Real Time Analytics with Druid - Guill...
Big Data Day LA 2016/ Big Data Track - Real Time Analytics with Druid - Guill...
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...
 
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
 
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big AnalyticsReal time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
 
Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...
Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...
Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...
 
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
 
Big Data Day LA 2016/ Data Science Track - Affinity Marketing Leveraging Crow...
Big Data Day LA 2016/ Data Science Track - Affinity Marketing Leveraging Crow...Big Data Day LA 2016/ Data Science Track - Affinity Marketing Leveraging Crow...
Big Data Day LA 2016/ Data Science Track - Affinity Marketing Leveraging Crow...
 
Big Data Day LA 2015 - Building a Big Data Culture in the Entertainment Indus...
Big Data Day LA 2015 - Building a Big Data Culture in the Entertainment Indus...Big Data Day LA 2015 - Building a Big Data Culture in the Entertainment Indus...
Big Data Day LA 2015 - Building a Big Data Culture in the Entertainment Indus...
 
Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...
Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...
Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...
 
Big Data Day LA 2016 Keynote - Reynold Xin/ Databricks
Big Data Day LA 2016 Keynote - Reynold Xin/ DatabricksBig Data Day LA 2016 Keynote - Reynold Xin/ Databricks
Big Data Day LA 2016 Keynote - Reynold Xin/ Databricks
 
Apache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesApache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use Cases
 
Big Data Day LA 2016/ Data Science Track - The Right Tool for the Job: Guidel...
Big Data Day LA 2016/ Data Science Track - The Right Tool for the Job: Guidel...Big Data Day LA 2016/ Data Science Track - The Right Tool for the Job: Guidel...
Big Data Day LA 2016/ Data Science Track - The Right Tool for the Job: Guidel...
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
 

Similaire à Impala presentation ahad rana

Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAsLuis Marques
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriDemi Ben-Ari
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scalesamthemonad
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedHostedbyConfluent
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 
SnappyData Overview Slidedeck for Big Data Bellevue
SnappyData Overview Slidedeck for Big Data Bellevue SnappyData Overview Slidedeck for Big Data Bellevue
SnappyData Overview Slidedeck for Big Data Bellevue SnappyData
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanNarayana B
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Ganesh Raju
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterLinaro
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterLinaro
 
Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013Cloudera, Inc.
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Modern Data Stack France
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impalamarkgrover
 
Scalability broad strokes
Scalability   broad strokesScalability   broad strokes
Scalability broad strokesGagan Bajpai
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemCloudera, Inc.
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDataWorks Summit
 

Similaire à Impala presentation ahad rana (20)

Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAs
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
Hadoop
HadoopHadoop
Hadoop
 
SnappyData Overview Slidedeck for Big Data Bellevue
SnappyData Overview Slidedeck for Big Data Bellevue SnappyData Overview Slidedeck for Big Data Bellevue
SnappyData Overview Slidedeck for Big Data Bellevue
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Hbase: an introduction
Hbase: an introductionHbase: an introduction
Hbase: an introduction
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_Plan
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
 
Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Scalability broad strokes
Scalability   broad strokesScalability   broad strokes
Scalability broad strokes
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUs
 
Hadoop - Introduction to HDFS
Hadoop - Introduction to HDFSHadoop - Introduction to HDFS
Hadoop - Introduction to HDFS
 

Plus de Data Con LA

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA
 

Plus de Data Con LA (20)

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup Showcase
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendations
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI Ethics
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learning
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentation
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWS
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data Science
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with Kafka
 

Dernier

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 

Dernier (20)

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 

Impala presentation ahad rana

  • 1. Impala A high level overview. By Ahad Rana Director of Engineering @ Factual
  • 2. What is Impala ? ● Real-time SQL Query Engine from Cloudera. ● Built on top of Hadoop. ● Open Source (Apache License).
  • 3. Hadoop Ecosystem Keeps Growing/Maturing ● High Availability NameNodes ● Quorum based NameNode Journaling. ● Zookeeper integration. ● YARN support for Linux Containers/ Multi-Tenancy. ● HDFS optimizations - direct reads, mmap reads. ● Kerberos based security throughout, ACLs (in CDH5) ● HBase improvements. ● New file formats: Parquet (based on Google Dremel).
  • 4. MapReduce:Not the Solution for Everything ● Batch oriented, high latency. ● Joins are complex. ● DSLs are complex. ● Only really accessible to developers.
  • 5. Needed:Better Accessibility to “Big Data” ● SQL is still the lingua franca of the Business Analytics community. ● Real-time / faster query response times can have a dramatic affect on productivity. ● The previous generation of enabling technologies (like Hive) are batch oriented and too slow.
  • 6. The Solution:MPP Query Engines on Hadoop ● That is, MPP SQL Query Engines on Hadoop. ● Proprietary vendors have been in this game for a while and have been charging a lot for custom hardware and software. ● New engines, like Impala, are game changers: ○ Built from the ground up to integrate with Hadoop - Your data is already in Hadoop - co-locate the queries next to your data! ○ Read data from existing Hadoop File Formats: Text,LZO-P, Sequence File,AVRO etc. ○ No time consuming ETL cycle or indexing. ○ Open Source, non-proprietary, and they run on the same commodity hardware! ○ Bypass MapReduce framework, tuned for speed. ○ All support similar subset of SQL-92 as Hive.
  • 7. Benchmarks: Impala vs. Hive 0.12 Source: Cloudera ● Interactive - 6x-68x faster. ● Reports - 8x-29x faster. ● Analytics - 10x - 69x faster.
  • 8. Impala - One of Many in a Crowded Field ● Many parallel efforts to address this need: ○ Presto (Facebook). ○ Shark (AmpLab). ○ Hive on Tez (Hortonworks). ○ Impala (Cloudera). ○ Redshift (Amazon/proprietary). ○ etc...
  • 9. Impala vs. Other MPP Engines ● 21 Nodes (2 CPU - 12 Core, 12 disks, 384GB RAM) ● TPC-DS derived benchmark scaled to 15TB. Source: Cloudera
  • 10. Impala Technology Highlights ● Designed with performance in mind - backend is C++. ● Operates on native types in memory wherever possible. ● Uses JIT compiler (LLVM) to generate optimized expression evaluation code (minimal branching, no v-table calls) on the fly. ● Asynchronously streams results between various stages of the pipelines. ● Keeps all intermediate results in memory. ● Parallelizes scans and locates them near data (DataNodes),and directly reads HDFS blocks from disk. ● Supports Broadcast Joins (Fact table is Scanned in a partitioned manner, Smaller Dimension table(s) are broadcast to all nodes) ● Also supports Partitioned Joins (both Fact/Dimension tables are partitioned on join columns / expressions). ● Supports a variety of file formats (Text,Text compressed, Seq File, RCFile, AVRO, HFile, Parquet) AND Scans against HBase.
  • 11. Impala Deployment Environment ● Three binaries ○ impalad - one on each Hadoop Datanode, handles client requests, runs query execution pipelines. ○ statestored - single instance, name service (like Zookeeper), used to coordinate catalog changes, impalad availability etc. ○ catalogd - single instance, talks to Hive metastore, caches table metadata, distributes metadata to impalads, broadcasts metadata changes via statestored.
  • 12. Impala - Life of a Query 1. Query is routed to arbitrary impalad in cluster. 2. Java code running in impalad creates parse tree from SQL statement. 3. Planner evaluates parse tree and create query Plan Tree. 4. Data Sources and Sinks (Tables) are validated against the metadata catalog (via catalogd). 5. Planner uses table metadata to make optimal placement decisions, establish sharding strategy,memory requirements per query stage,and constructs Plan Fragments. 6. catalogd caches HDFS block locations for every HDFS file, and Planner attaches block locations to Query Plan metadata. 7. Planner converts Plan Fragments into Thrift objects, and passes them on to the Query Coordinator (C++). 8. Coordinator runs root Fragment locally and distributes remaining Fragments to other backends. 9. Coordinator monitors Fragment execution until completion,cancellation or failure. 10. Fragments stream data to other Fragments or to the primary Fragment (User or Table).
  • 13. Life of a Query (Planning)
  • 14. Life of a Query (Execution)
  • 15. Simple Query Plan (Broadcast Join)
  • 16. Simple Query Plan (Shuffle)
  • 17. Simple Query Plan (Aggregation)
  • 18. Performance Checklist ● Fact to Fact table joins are expensive! ● Scalar data types (INT,BIGINT) better than generic (String). ● Coerce types in metadata during load. ● Push Filter Predicates downstream as much as possible (Avoid upstream IO). ● Use Impala’s Partitioning support to pre-partition tables.
  • 19. Compression ● Compression means less network / disk IO. ● Always use compression. ● Choose codec wisely - make the appropriate size vs. cpu cost tradeoff. ● Snappy - Low CPU requirements, Low compression ratio. ● GZip - High CPU,High compression ratio. ● LZO - Somewhere in the middle. ● Snappy is generally a good default choice.
  • 20. File Formats ● Use a Columnar format (like Parquet) if possible. ● Columnar formats give you: ○ Less IO for Impala - Only read relevant column data. ○ Better compression, more efficient encoding. ● Parquet is the new preferred file format for Impala.
  • 21. Parquet Format Detail ● Row Group - group of rows in columnar format ○ A Row Group’s worth of data is buffered in RAM when writing. ○ Impala prefers large (1GB row group sizes) ○ Potentially multiple row groups per split (when reading). ● Column Chunk / Page(s) ○ Each column’s data is stored in a separate chunk. ○ Each chunk is further divided into Page(s) worth of data. ○ Each page is ~1MB in size. ○ Compression is applied at Page level. ○ Smallest level of granularity for read is the Page. ● Good for writing data in bulk, not good for small writes (Insert Values … ). ● Supports nested columns (see Google Dremel Paper). ● Map-Reduce InputFormat can materialize data back as Thrift objects.
  • 22. HBase Integration ● Impala currently only uses HBase RPCs to scan HBase tables. ● Map HBase binary key/columns to types using Hive/Impala metadata. ● Best to only use HBase key column in predicates, otherwise you will have to scan entire table. ● Best to do large fact table scans in Impala first and then do a late stage join to a limited number of hbase rows. ● HBase now has snapshotting support, so perhaps bulk scans are possible in the future ?
  • 23. Impala Codebase Commentary ● Good use of best practices throughout codebase ○ Robust Comments ○ Modern C++ standards (STL,Boost, aggressive smart pointer utilization to avoid leaks) ○ Good instrumentation framework throughout codebase. ○ Smart memory management. ○ Thrift RPC and Thrift data structures for exchange of data / messages between services and layers. ○ Smart use of Java for front-end query processing/parsing, DDL, DML integration with Hive etc, and then transfer of Query Plan from Java layer to C++ via Thrift objects. ○ Very efficient use of memory in pipeline ■ Tuple (collection of typed column data, allocated using pooled memory) ■ Awareness of various distinct tuples flowing through various stages of pipeline means they can generate super efficient just-in-time code to read/write these tuples.
  • 24. Potential Gotchas ● High, potentially unconstrained demand on Hadoop nodes ○ Containerization (in YARN) helps, but still, nodes can consume a lot of RAM and CPU. ● Allocate enough memory to support your needs, avoid swap! ● Potentially large impact on shared network IO when doing running inefficient queries that broadcast lots of data upstream. ● No quota management, resource tracking by user.
  • 25. ahad@factual.com P.S. Factual is always looking for good engineers! Thank you.