SlideShare a Scribd company logo
1 of 32
1© Cloudera, Inc. All rights reserved.
Greg Rahn | @gregrahn | Director of Product Management
Apache Impala:
Recent advancements, performance
benchmarks, and pro-tips
2© Cloudera, Inc. All rights reserved.
Agenda
• Improvements in recent Impala versions
• Latest performance comparisons
• Performance recommendations and considerations
3© Cloudera, Inc. All rights reserved.
Meet Impala
4© Cloudera, Inc. All rights reserved.
Apache Impala: Open Source & Open Standard
1 Millions of downloads since GA in May 2013
2 Majority adoption across Cloudera customers
3 Certification across key application partners:
4 De facto standard with multi-vendor support:
and others
5© Cloudera, Inc. All rights reserved.
The Leader in Interactive SQL Analytics for Hadoop
Multi-User Performance
& Usability
✔ • 10x vs. alternatives with latest benchmarks
• Cost-based optimization allows for more users and tools to
run a broader range of queries
Compatibility
✔ • Provides both ANSI SQL and vendor-specific extensions
• Compatibility with the leading BI partners
Impala delivers the best of both worlds
Flexibility
✔ • Supports the common native Hadoop file formats
• Parquet provides best-of-breed columnar performance across
Hadoop frameworks
Native Integration
✔ • Unified with Hadoop resource management, metadata,
security, and management
6© Cloudera, Inc. All rights reserved.
Key Benefits
An analytic database designed for Hadoop
High-Performance BI and SQL Analytics
Flexibility for Data and Use Case Variety
Cost-effective Scale for Today and Tomorrow
Go Beyond SQL with an Open Architecture
7© Cloudera, Inc. All rights reserved.
Advancements in recent Impala versions
8© Cloudera, Inc. All rights reserved.
• Enables faster and more
complex queries
• 5.4x speedup in standard
benchmarks over the last
14 months
• Track record of
constantly improving
performance
Impala keeps getting faster
100% 100% 100%
214%
226%
544%
0%
100%
200%
300%
400%
500%
600%
TPC-H Nested TPC-H TPC-DS
Impala Relative Performance
Impala 2.3 Impala 2.5 Impala 2.6 Impala 2.7 Impala 2.8
9© Cloudera, Inc. All rights reserved.
Major themes across releases
Performance Real-time Update Support
Cloud-Native Capabilities
v2.8
•Intra-node parallelism + codegen for
TIMESTAMP = 8x faster compute stats
•Constant folding predicate expressions = 3x
speedup
•Optimizations for long IN list predicates = 7.5x
speedup
•Optimizations for metadata loading for tables
w/ large # of blocks = 9x speedup
v2.6
•3x faster performance on secure clusters
(beneficial for large data extractions)
v2.8
•GA of Impala/Kudu integration
•“Fast analytics on fast data”
•INSERT, UPDATE, DELETE
v2.6
•Support reads/writes on S3
•Enable fast, flexible ETL and BI analytics
across all data in any environment
•Performance Metrics: Analytics and BI on
Amazon S3 with Apache Impala
10© Cloudera, Inc. All rights reserved.
Latest Impala performance comparisons
April 2017
vs. analytic database (Greenplum) & other SQL-on-Hadoop
(Hive LLAP, Spark SQL, Presto)
11© Cloudera, Inc. All rights reserved.
Testing approach
• Start with TPC-DS 10TB scale factor
• Use unmodified TPC-DS v2.4 query templates
• Casting syntax
• CONCAT() vs. ||
• Aliases for in-line views or columns
• Determine common set of queries
• Eliminate unsupported language and excessively long executions
• Single-user test
• Multi-user tests
• Leverage best practices for all systems
12© Cloudera, Inc. All rights reserved.
Software:
• Impala 2.8 from CDH 5.10
• Greenplum Database 4.3.9.1
• Spark 2.1
• Presto 0.160
• Hive 2.1 with LLAP from HDP v2.5
Hardware:
• 7 worker nodes each with
• CPU: 2 x E5-2698 v4 @ 2.20GHz
• Storage: 8 x 2TB HDDs
• Memory: 256GB RAM
Workload:
• Data: TPC-DS 10TB & 1TB stored in
• Parquet for Impala & Spark SQL
• ORCFile for LLAP & Presto
• Columnar storage for Greenplum
• Partitioned fact tables
Queries:
• 77 of the 99 TPC-DS queries using only
permitted minor query modifications*
• 22 queries not used that required non-
minor modifications (valid alternates
not used either)
Environment details
13© Cloudera, Inc. All rights reserved.
• Impala outperforms on
both single and multi-
user tests at 10TB
• Impala lead expands
with concurrency
• Other SQL on Hadoop
engines failed at 10TB
Streams Impala QpH Greenplum QpH Impala Times Better
2 41 20 2.05x
4 75 20 3.75x
8 133 16 8.31x
Impala outperforms analytical database
Metric Impala Greenplum Impala Times Better
Total seconds 11,898 21,093 1.77x
Geometric mean 33 92 2.79x
Single-user Test – 10TB
Multi-user Test – 10TB
14© Cloudera, Inc. All rights reserved.
• The analytical db cohort
(Impala / GP) leads SQL
on Hadoop
• Impala outperforms
• Presto by 8.3x
• Hive w/ LLAP by 4x
• Spark SQL by 2.8x
TPC-DS 1TB: Single-user
Impala Greenplum Spark SQL Hive with LLAP Presto
Total 100% 97% 283% 405% 826%
Geomean 100% 250% 367% 450% 1317%
0%
200%
400%
600%
800%
1000%
1200%
1400%
RelativePerformance
TPC-DS 1TB Single-user (Lower is Better)
15© Cloudera, Inc. All rights reserved.
• At 1TB the analytic db
cohort (Impala / GP)
expands lead with
concurrency
• Presto failed to complete
• With 16 streams Impala
outperforms
• Spark by ~22x
• Hive by ~20x
• Greenplum by ~3x
TPC-DS 1TB: Multi-user
4 streams 8 streams 16 streams
Impala 495 865 1,315
Greenplum 381 417 462
Spark SQL 76 70 61
Hive with LLAP 58 62 66
0
200
400
600
800
1000
1200
1400
QueriesperHour
Multi-user TPC-DS 1TB Queries/Hour (Higher is Better)
20x
16© Cloudera, Inc. All rights reserved.
Impala leads analytic database performance
• Impala leads in performance against the traditional analytic database, including
over 8x better performance for high concurrency workloads
• Even greater difference compared to other SQL-on-Hadoop engines with Impala
nearly 22x faster for multi-user workloads
• Other SQL-on-Hadoop engines also required a simplified, smaller scale
benchmark (with Hive even still requiring modifications and Presto unable to
complete multi-user tests)
• Impala delivers analytic database performance as well as:
• Flexibility
• Cost-effective scale
• Open architecture
17© Cloudera, Inc. All rights reserved.
Impala performance optimizations and tips
Data types, partitioning, and runtime filters
18© Cloudera, Inc. All rights reserved.
Data type selection
Data type selection affects performance:
• Computation: numerical types allow direct computation, string types require
conversion
• On-disk storage size: numerical types are more compact
• More compact types also require less network traffic
General guidelines:
• Choose numerical types over character types for numerical data
• Use smallest data type that will accommodate the largest possible value
19© Cloudera, Inc. All rights reserved.
Data type selection
Picking the incorrect data type can result in:
• Increase in on-disk storage by 40%
• 80% slower scans
• 80% slower aggregations
• 150% slower joins
• Increase in runtime memory utilization
Data set: L_ORDERKEY column from TPC-H 3TB
Values domain: 1 - 18,000,000,000
Number of distinct values: 4,500,000,000
20© Cloudera, Inc. All rights reserved.
Data type selection
100% 100% 100% 100%
107%
100%
106%
164%
100%
124%
104%
159%
144%
178% 179%
242%
144%
180% 177%
252%
Size Scan Group by Join
Numeric Data Type Performance
Bigint Double Decimal String Varchar
21© Cloudera, Inc. All rights reserved.
Partitioning design considerations
• Date based partitioning is the most common
• WHERE event_dt BETWEEN 20000101 AND 20000201
• WHERE event_year = 2000 AND event_month = 1
• Partition keys can be column groups / multi-level. Eg.
• By date, by hour
• By date, by region
22© Cloudera, Inc. All rights reserved.
CREATE TABLE sales (...)
PARTITIONED BY (INT year, INT month)
SELECT ... FROM sales
WHERE year >= 2012
AND month IN (1, 2, 3)
CREATE TABLE sales (...)
PARTITIONED BY (INT date_key)
SELECT ...
FROM sales s
JOIN date_dim d USING (date_key)
WHERE d.year >= 2012
AND d.month IN (1, 2, 3)
Partitioning examples
23© Cloudera, Inc. All rights reserved.
Partition wisely
Too few partitions (granularity is too large):
• Data elimination is not effective
• Increases minimum unit of work
Too many partitions (granularity is too small):
• Many small data files hurt large queries; scans less efficient; limits parallelism
• Large number of files can cause metadata bloat and create bottlenecks on HDFS
NameNode, Hive Metastore, Impala catalog service
General guidelines:
• Regularly compact tables to keep the number of files per partition under control and
improve scan and compression efficiency
• Keep number of partitions under 20K (not a hard limit, mileage will vary)
24© Cloudera, Inc. All rights reserved.
Business question:
How much was sold in June?
CREATE TABLE store_sales (...)
PARTITIONED BY (INT ss_sold_date_sk);
SELECT
d_year,
sum(ss_ext_sales_price) sum_agg
FROM date_dim d
JOIN store_sales ON
(d_date_sk = ss_sold_date_sk)
WHERE d_moy = 6
GROUP BY d_year;
Dynamic partition pruning with runtime filters
STORE_SALES
28 billion rows
DATE_DIM
6,000 rows
Broadcast
Join #1
1.3 Billion rows
Aggregate
25© Cloudera, Inc. All rights reserved.
Business question:
How much was sold in June?
CREATE TABLE store_sales (...)
PARTITIONED BY (INT ss_sold_date_sk);
SELECT
d_year,
sum(ss_ext_sales_price) sum_agg
FROM date_dim d
JOIN store_sales ON
(d_date_sk = ss_sold_date_sk)
WHERE d_moy = 6
GROUP BY d_year;
Dynamic partition pruning with runtime filters
STORE_SALES
28 billion rows
DATE_DIM
6,000 rows
Broadcast
Join #1
1.3 Billion rows
AggregateThe query planner does not know what
values for d_date_sk will be returned and
what fact table partitions need to be
scanned (or eliminated).
But there’s clearly an opportunity to save
some work - why bother sending 28 billion of
those rows to the joins?
Runtime filters construct the partition
pruning predicate at runtime.
26© Cloudera, Inc. All rights reserved.
SELECT
d_year,
sum(ss_ext_sales_price) sum_agg
FROM date_dim d
JOIN store_sales ON
(d_date_sk = ss_sold_date_sk)
WHERE d_moy = 6
GROUP BY d_year;
Dynamic partition pruning with runtime filters
STORE_SALES
28 billion rows
DATE_DIM
6,000 rows
Broadcast
Join #1
1.3 Billion rows
Aggregate
Step 1: Planner tells Join #1
to produce Bloom filter for
qualifying distinct values of
d_date_sk
Bloom filter: compact, probabilistic
representation of a data set
Essentially a sophisticated bitmap
27© Cloudera, Inc. All rights reserved.
SELECT
d_year,
sum(ss_ext_sales_price) sum_agg
FROM date_dim d
JOIN store_sales ON
(d_date_sk = ss_sold_date_sk)
WHERE d_moy = 6
GROUP BY d_year;
Dynamic partition pruning with runtime filters
STORE_SALES
28 billion rows
DATE_DIM
6,000 rows
Broadcast
Join #1
1.3 Billion rows
Aggregate
Step 2: Join reads all rows
from build side (right
input), and populates
Bloom filter containing all
distinct values of
d_date_sk
28© Cloudera, Inc. All rights reserved.
SELECT
d_year,
sum(ss_ext_sales_price) sum_agg
FROM date_dim d
JOIN store_sales ON
(d_date_sk = ss_sold_date_sk)
WHERE d_moy = 6
GROUP BY d_year;
Dynamic partition pruning with runtime filters
STORE_SALES
28 billion rows
DATE_DIM
6,000 rows
Broadcast
Join #1
1.3 Billion rows
Aggregate
Step 3: Query coordinator
sends filter to store_sales
scan before the scan starts.
29© Cloudera, Inc. All rights reserved.
SELECT
d_year,
sum(ss_ext_sales_price) sum_agg
FROM date_dim d
JOIN store_sales ON
(d_date_sk = ss_sold_date_sk)
WHERE d_moy = 6
GROUP BY d_year;
Dynamic partition pruning with runtime filters
STORE_SALES
28 billion rows
DATE_DIM
6,000 rows
Broadcast
Join #1
1.3 Billion rows
Aggregate
Step 4: Scan eliminates all
partitions that don’t have a
match in the Bloom filter.
Only 150 out of the 1824
partitions are read from
store_sales.
30© Cloudera, Inc. All rights reserved.
SELECT
d_year,
sum(ss_ext_sales_price) sum_agg
FROM date_dim d
JOIN store_sales ON
(d_date_sk = ss_sold_date_sk)
WHERE d_moy = 6
GROUP BY d_year;
Dynamic partition pruning with runtime filters
Step 5: Rows coming out of
the scan is reduced from
28 Billion to 1.3 Billion.
STORE_SALES
1.3 billion rows
DATE_DIM
6,000 rows
Broadcast
Join #1
1.3 Billion rows
Aggregate
31© Cloudera, Inc. All rights reserved.
In summary
• Impala offers
• Flexibility
• Cost-effective scale
• Open architecture
• Leading performance
• Recent improvements/enhancements:
• Performance
• Real-time and cloud capabilities
• Be sure to check out the Impala Cookbook (SlideShare) for more performance
protips
32© Cloudera, Inc. All rights reserved.
Thank you
Downloads:
https://cloudera.com/downloads
Interested in contributing?
http://impala.io

More Related Content

What's hot

How to Build Multi-disciplinary Analytics Applications on a Shared Data Platform
How to Build Multi-disciplinary Analytics Applications on a Shared Data PlatformHow to Build Multi-disciplinary Analytics Applications on a Shared Data Platform
How to Build Multi-disciplinary Analytics Applications on a Shared Data PlatformCloudera, Inc.
 
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Cloudera, Inc.
 
Apache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateApache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateCloudera, Inc.
 
Kudu Forrester Webinar
Kudu Forrester WebinarKudu Forrester Webinar
Kudu Forrester WebinarCloudera, Inc.
 
Driving Better Products with Customer Intelligence

Driving Better Products with Customer Intelligence
Driving Better Products with Customer Intelligence

Driving Better Products with Customer Intelligence
Cloudera, Inc.
 
Supercharge Splunk with Cloudera

Supercharge Splunk with Cloudera
Supercharge Splunk with Cloudera

Supercharge Splunk with Cloudera
Cloudera, Inc.
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduCloudera, Inc.
 
The Big Picture: Learned Behaviors in Churn
The Big Picture: Learned Behaviors in ChurnThe Big Picture: Learned Behaviors in Churn
The Big Picture: Learned Behaviors in ChurnCloudera, Inc.
 
Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18Cloudera, Inc.
 
Big data journey to the cloud rohit pujari 5.30.18
Big data journey to the cloud   rohit pujari 5.30.18Big data journey to the cloud   rohit pujari 5.30.18
Big data journey to the cloud rohit pujari 5.30.18Cloudera, Inc.
 
A Community Approach to Fighting Cyber Threats
A Community Approach to Fighting Cyber ThreatsA Community Approach to Fighting Cyber Threats
A Community Approach to Fighting Cyber ThreatsCloudera, Inc.
 
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the CloudPart 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the CloudCloudera, Inc.
 
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...Cloudera, Inc.
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Cloudera, Inc.
 
Part 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduPart 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduCloudera, Inc.
 
End to End Streaming Architectures
End to End Streaming ArchitecturesEnd to End Streaming Architectures
End to End Streaming ArchitecturesCloudera, Inc.
 
Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)Cloudera, Inc.
 
Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Cloudera, Inc.
 
RecordService for Unified Access Control
RecordService for Unified Access ControlRecordService for Unified Access Control
RecordService for Unified Access ControlCloudera, Inc.
 
Moving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache KuduMoving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache KuduCloudera, Inc.
 

What's hot (20)

How to Build Multi-disciplinary Analytics Applications on a Shared Data Platform
How to Build Multi-disciplinary Analytics Applications on a Shared Data PlatformHow to Build Multi-disciplinary Analytics Applications on a Shared Data Platform
How to Build Multi-disciplinary Analytics Applications on a Shared Data Platform
 
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
 
Apache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateApache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance Update
 
Kudu Forrester Webinar
Kudu Forrester WebinarKudu Forrester Webinar
Kudu Forrester Webinar
 
Driving Better Products with Customer Intelligence

Driving Better Products with Customer Intelligence
Driving Better Products with Customer Intelligence

Driving Better Products with Customer Intelligence

 
Supercharge Splunk with Cloudera

Supercharge Splunk with Cloudera
Supercharge Splunk with Cloudera

Supercharge Splunk with Cloudera

 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache Kudu
 
The Big Picture: Learned Behaviors in Churn
The Big Picture: Learned Behaviors in ChurnThe Big Picture: Learned Behaviors in Churn
The Big Picture: Learned Behaviors in Churn
 
Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18
 
Big data journey to the cloud rohit pujari 5.30.18
Big data journey to the cloud   rohit pujari 5.30.18Big data journey to the cloud   rohit pujari 5.30.18
Big data journey to the cloud rohit pujari 5.30.18
 
A Community Approach to Fighting Cyber Threats
A Community Approach to Fighting Cyber ThreatsA Community Approach to Fighting Cyber Threats
A Community Approach to Fighting Cyber Threats
 
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the CloudPart 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
 
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Part 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduPart 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache Kudu
 
End to End Streaming Architectures
End to End Streaming ArchitecturesEnd to End Streaming Architectures
End to End Streaming Architectures
 
Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)
 
Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?
 
RecordService for Unified Access Control
RecordService for Unified Access ControlRecordService for Unified Access Control
RecordService for Unified Access Control
 
Moving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache KuduMoving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache Kudu
 

Similar to New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Analytic Database

Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016StampedeCon
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupMike Percy
 
Technical Introduction to PostgreSQL and PPAS
Technical Introduction to PostgreSQL and PPASTechnical Introduction to PostgreSQL and PPAS
Technical Introduction to PostgreSQL and PPASAshnikbiz
 
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerUnlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerCloudera, Inc.
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform WebinarCloudera, Inc.
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataMike Percy
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
 
Save money with Postgres on IBM PowerLinux
Save money with Postgres on IBM PowerLinuxSave money with Postgres on IBM PowerLinux
Save money with Postgres on IBM PowerLinuxEDB
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Powering GIS Application with PostgreSQL and Postgres Plus
Powering GIS Application with PostgreSQL and Postgres Plus Powering GIS Application with PostgreSQL and Postgres Plus
Powering GIS Application with PostgreSQL and Postgres Plus Ashnikbiz
 
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...DataStax
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLhuguk
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Ashnik EnterpriseDB PostgreSQL - A real alternative to Oracle
Ashnik EnterpriseDB PostgreSQL - A real alternative to Oracle Ashnik EnterpriseDB PostgreSQL - A real alternative to Oracle
Ashnik EnterpriseDB PostgreSQL - A real alternative to Oracle Ashnikbiz
 
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Datamichaelguia
 
Unifying your data management with Hadoop
Unifying your data management with HadoopUnifying your data management with Hadoop
Unifying your data management with HadoopJayant Shekhar
 

Similar to New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Analytic Database (20)

Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
Technical Introduction to PostgreSQL and PPAS
Technical Introduction to PostgreSQL and PPASTechnical Introduction to PostgreSQL and PPAS
Technical Introduction to PostgreSQL and PPAS
 
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerUnlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator Optimizer
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
Save money with Postgres on IBM PowerLinux
Save money with Postgres on IBM PowerLinuxSave money with Postgres on IBM PowerLinux
Save money with Postgres on IBM PowerLinux
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Powering GIS Application with PostgreSQL and Postgres Plus
Powering GIS Application with PostgreSQL and Postgres Plus Powering GIS Application with PostgreSQL and Postgres Plus
Powering GIS Application with PostgreSQL and Postgres Plus
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 
Ashnik EnterpriseDB PostgreSQL - A real alternative to Oracle
Ashnik EnterpriseDB PostgreSQL - A real alternative to Oracle Ashnik EnterpriseDB PostgreSQL - A real alternative to Oracle
Ashnik EnterpriseDB PostgreSQL - A real alternative to Oracle
 
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Data
 
Unifying your data management with Hadoop
Unifying your data management with HadoopUnifying your data management with Hadoop
Unifying your data management with Hadoop
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noidabntitsolutionsrishis
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfLivetecs LLC
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 

Recently uploaded (20)

Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 

New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Analytic Database

  • 1. 1© Cloudera, Inc. All rights reserved. Greg Rahn | @gregrahn | Director of Product Management Apache Impala: Recent advancements, performance benchmarks, and pro-tips
  • 2. 2© Cloudera, Inc. All rights reserved. Agenda • Improvements in recent Impala versions • Latest performance comparisons • Performance recommendations and considerations
  • 3. 3© Cloudera, Inc. All rights reserved. Meet Impala
  • 4. 4© Cloudera, Inc. All rights reserved. Apache Impala: Open Source & Open Standard 1 Millions of downloads since GA in May 2013 2 Majority adoption across Cloudera customers 3 Certification across key application partners: 4 De facto standard with multi-vendor support: and others
  • 5. 5© Cloudera, Inc. All rights reserved. The Leader in Interactive SQL Analytics for Hadoop Multi-User Performance & Usability ✔ • 10x vs. alternatives with latest benchmarks • Cost-based optimization allows for more users and tools to run a broader range of queries Compatibility ✔ • Provides both ANSI SQL and vendor-specific extensions • Compatibility with the leading BI partners Impala delivers the best of both worlds Flexibility ✔ • Supports the common native Hadoop file formats • Parquet provides best-of-breed columnar performance across Hadoop frameworks Native Integration ✔ • Unified with Hadoop resource management, metadata, security, and management
  • 6. 6© Cloudera, Inc. All rights reserved. Key Benefits An analytic database designed for Hadoop High-Performance BI and SQL Analytics Flexibility for Data and Use Case Variety Cost-effective Scale for Today and Tomorrow Go Beyond SQL with an Open Architecture
  • 7. 7© Cloudera, Inc. All rights reserved. Advancements in recent Impala versions
  • 8. 8© Cloudera, Inc. All rights reserved. • Enables faster and more complex queries • 5.4x speedup in standard benchmarks over the last 14 months • Track record of constantly improving performance Impala keeps getting faster 100% 100% 100% 214% 226% 544% 0% 100% 200% 300% 400% 500% 600% TPC-H Nested TPC-H TPC-DS Impala Relative Performance Impala 2.3 Impala 2.5 Impala 2.6 Impala 2.7 Impala 2.8
  • 9. 9© Cloudera, Inc. All rights reserved. Major themes across releases Performance Real-time Update Support Cloud-Native Capabilities v2.8 •Intra-node parallelism + codegen for TIMESTAMP = 8x faster compute stats •Constant folding predicate expressions = 3x speedup •Optimizations for long IN list predicates = 7.5x speedup •Optimizations for metadata loading for tables w/ large # of blocks = 9x speedup v2.6 •3x faster performance on secure clusters (beneficial for large data extractions) v2.8 •GA of Impala/Kudu integration •“Fast analytics on fast data” •INSERT, UPDATE, DELETE v2.6 •Support reads/writes on S3 •Enable fast, flexible ETL and BI analytics across all data in any environment •Performance Metrics: Analytics and BI on Amazon S3 with Apache Impala
  • 10. 10© Cloudera, Inc. All rights reserved. Latest Impala performance comparisons April 2017 vs. analytic database (Greenplum) & other SQL-on-Hadoop (Hive LLAP, Spark SQL, Presto)
  • 11. 11© Cloudera, Inc. All rights reserved. Testing approach • Start with TPC-DS 10TB scale factor • Use unmodified TPC-DS v2.4 query templates • Casting syntax • CONCAT() vs. || • Aliases for in-line views or columns • Determine common set of queries • Eliminate unsupported language and excessively long executions • Single-user test • Multi-user tests • Leverage best practices for all systems
  • 12. 12© Cloudera, Inc. All rights reserved. Software: • Impala 2.8 from CDH 5.10 • Greenplum Database 4.3.9.1 • Spark 2.1 • Presto 0.160 • Hive 2.1 with LLAP from HDP v2.5 Hardware: • 7 worker nodes each with • CPU: 2 x E5-2698 v4 @ 2.20GHz • Storage: 8 x 2TB HDDs • Memory: 256GB RAM Workload: • Data: TPC-DS 10TB & 1TB stored in • Parquet for Impala & Spark SQL • ORCFile for LLAP & Presto • Columnar storage for Greenplum • Partitioned fact tables Queries: • 77 of the 99 TPC-DS queries using only permitted minor query modifications* • 22 queries not used that required non- minor modifications (valid alternates not used either) Environment details
  • 13. 13© Cloudera, Inc. All rights reserved. • Impala outperforms on both single and multi- user tests at 10TB • Impala lead expands with concurrency • Other SQL on Hadoop engines failed at 10TB Streams Impala QpH Greenplum QpH Impala Times Better 2 41 20 2.05x 4 75 20 3.75x 8 133 16 8.31x Impala outperforms analytical database Metric Impala Greenplum Impala Times Better Total seconds 11,898 21,093 1.77x Geometric mean 33 92 2.79x Single-user Test – 10TB Multi-user Test – 10TB
  • 14. 14© Cloudera, Inc. All rights reserved. • The analytical db cohort (Impala / GP) leads SQL on Hadoop • Impala outperforms • Presto by 8.3x • Hive w/ LLAP by 4x • Spark SQL by 2.8x TPC-DS 1TB: Single-user Impala Greenplum Spark SQL Hive with LLAP Presto Total 100% 97% 283% 405% 826% Geomean 100% 250% 367% 450% 1317% 0% 200% 400% 600% 800% 1000% 1200% 1400% RelativePerformance TPC-DS 1TB Single-user (Lower is Better)
  • 15. 15© Cloudera, Inc. All rights reserved. • At 1TB the analytic db cohort (Impala / GP) expands lead with concurrency • Presto failed to complete • With 16 streams Impala outperforms • Spark by ~22x • Hive by ~20x • Greenplum by ~3x TPC-DS 1TB: Multi-user 4 streams 8 streams 16 streams Impala 495 865 1,315 Greenplum 381 417 462 Spark SQL 76 70 61 Hive with LLAP 58 62 66 0 200 400 600 800 1000 1200 1400 QueriesperHour Multi-user TPC-DS 1TB Queries/Hour (Higher is Better) 20x
  • 16. 16© Cloudera, Inc. All rights reserved. Impala leads analytic database performance • Impala leads in performance against the traditional analytic database, including over 8x better performance for high concurrency workloads • Even greater difference compared to other SQL-on-Hadoop engines with Impala nearly 22x faster for multi-user workloads • Other SQL-on-Hadoop engines also required a simplified, smaller scale benchmark (with Hive even still requiring modifications and Presto unable to complete multi-user tests) • Impala delivers analytic database performance as well as: • Flexibility • Cost-effective scale • Open architecture
  • 17. 17© Cloudera, Inc. All rights reserved. Impala performance optimizations and tips Data types, partitioning, and runtime filters
  • 18. 18© Cloudera, Inc. All rights reserved. Data type selection Data type selection affects performance: • Computation: numerical types allow direct computation, string types require conversion • On-disk storage size: numerical types are more compact • More compact types also require less network traffic General guidelines: • Choose numerical types over character types for numerical data • Use smallest data type that will accommodate the largest possible value
  • 19. 19© Cloudera, Inc. All rights reserved. Data type selection Picking the incorrect data type can result in: • Increase in on-disk storage by 40% • 80% slower scans • 80% slower aggregations • 150% slower joins • Increase in runtime memory utilization Data set: L_ORDERKEY column from TPC-H 3TB Values domain: 1 - 18,000,000,000 Number of distinct values: 4,500,000,000
  • 20. 20© Cloudera, Inc. All rights reserved. Data type selection 100% 100% 100% 100% 107% 100% 106% 164% 100% 124% 104% 159% 144% 178% 179% 242% 144% 180% 177% 252% Size Scan Group by Join Numeric Data Type Performance Bigint Double Decimal String Varchar
  • 21. 21© Cloudera, Inc. All rights reserved. Partitioning design considerations • Date based partitioning is the most common • WHERE event_dt BETWEEN 20000101 AND 20000201 • WHERE event_year = 2000 AND event_month = 1 • Partition keys can be column groups / multi-level. Eg. • By date, by hour • By date, by region
  • 22. 22© Cloudera, Inc. All rights reserved. CREATE TABLE sales (...) PARTITIONED BY (INT year, INT month) SELECT ... FROM sales WHERE year >= 2012 AND month IN (1, 2, 3) CREATE TABLE sales (...) PARTITIONED BY (INT date_key) SELECT ... FROM sales s JOIN date_dim d USING (date_key) WHERE d.year >= 2012 AND d.month IN (1, 2, 3) Partitioning examples
  • 23. 23© Cloudera, Inc. All rights reserved. Partition wisely Too few partitions (granularity is too large): • Data elimination is not effective • Increases minimum unit of work Too many partitions (granularity is too small): • Many small data files hurt large queries; scans less efficient; limits parallelism • Large number of files can cause metadata bloat and create bottlenecks on HDFS NameNode, Hive Metastore, Impala catalog service General guidelines: • Regularly compact tables to keep the number of files per partition under control and improve scan and compression efficiency • Keep number of partitions under 20K (not a hard limit, mileage will vary)
  • 24. 24© Cloudera, Inc. All rights reserved. Business question: How much was sold in June? CREATE TABLE store_sales (...) PARTITIONED BY (INT ss_sold_date_sk); SELECT d_year, sum(ss_ext_sales_price) sum_agg FROM date_dim d JOIN store_sales ON (d_date_sk = ss_sold_date_sk) WHERE d_moy = 6 GROUP BY d_year; Dynamic partition pruning with runtime filters STORE_SALES 28 billion rows DATE_DIM 6,000 rows Broadcast Join #1 1.3 Billion rows Aggregate
  • 25. 25© Cloudera, Inc. All rights reserved. Business question: How much was sold in June? CREATE TABLE store_sales (...) PARTITIONED BY (INT ss_sold_date_sk); SELECT d_year, sum(ss_ext_sales_price) sum_agg FROM date_dim d JOIN store_sales ON (d_date_sk = ss_sold_date_sk) WHERE d_moy = 6 GROUP BY d_year; Dynamic partition pruning with runtime filters STORE_SALES 28 billion rows DATE_DIM 6,000 rows Broadcast Join #1 1.3 Billion rows AggregateThe query planner does not know what values for d_date_sk will be returned and what fact table partitions need to be scanned (or eliminated). But there’s clearly an opportunity to save some work - why bother sending 28 billion of those rows to the joins? Runtime filters construct the partition pruning predicate at runtime.
  • 26. 26© Cloudera, Inc. All rights reserved. SELECT d_year, sum(ss_ext_sales_price) sum_agg FROM date_dim d JOIN store_sales ON (d_date_sk = ss_sold_date_sk) WHERE d_moy = 6 GROUP BY d_year; Dynamic partition pruning with runtime filters STORE_SALES 28 billion rows DATE_DIM 6,000 rows Broadcast Join #1 1.3 Billion rows Aggregate Step 1: Planner tells Join #1 to produce Bloom filter for qualifying distinct values of d_date_sk Bloom filter: compact, probabilistic representation of a data set Essentially a sophisticated bitmap
  • 27. 27© Cloudera, Inc. All rights reserved. SELECT d_year, sum(ss_ext_sales_price) sum_agg FROM date_dim d JOIN store_sales ON (d_date_sk = ss_sold_date_sk) WHERE d_moy = 6 GROUP BY d_year; Dynamic partition pruning with runtime filters STORE_SALES 28 billion rows DATE_DIM 6,000 rows Broadcast Join #1 1.3 Billion rows Aggregate Step 2: Join reads all rows from build side (right input), and populates Bloom filter containing all distinct values of d_date_sk
  • 28. 28© Cloudera, Inc. All rights reserved. SELECT d_year, sum(ss_ext_sales_price) sum_agg FROM date_dim d JOIN store_sales ON (d_date_sk = ss_sold_date_sk) WHERE d_moy = 6 GROUP BY d_year; Dynamic partition pruning with runtime filters STORE_SALES 28 billion rows DATE_DIM 6,000 rows Broadcast Join #1 1.3 Billion rows Aggregate Step 3: Query coordinator sends filter to store_sales scan before the scan starts.
  • 29. 29© Cloudera, Inc. All rights reserved. SELECT d_year, sum(ss_ext_sales_price) sum_agg FROM date_dim d JOIN store_sales ON (d_date_sk = ss_sold_date_sk) WHERE d_moy = 6 GROUP BY d_year; Dynamic partition pruning with runtime filters STORE_SALES 28 billion rows DATE_DIM 6,000 rows Broadcast Join #1 1.3 Billion rows Aggregate Step 4: Scan eliminates all partitions that don’t have a match in the Bloom filter. Only 150 out of the 1824 partitions are read from store_sales.
  • 30. 30© Cloudera, Inc. All rights reserved. SELECT d_year, sum(ss_ext_sales_price) sum_agg FROM date_dim d JOIN store_sales ON (d_date_sk = ss_sold_date_sk) WHERE d_moy = 6 GROUP BY d_year; Dynamic partition pruning with runtime filters Step 5: Rows coming out of the scan is reduced from 28 Billion to 1.3 Billion. STORE_SALES 1.3 billion rows DATE_DIM 6,000 rows Broadcast Join #1 1.3 Billion rows Aggregate
  • 31. 31© Cloudera, Inc. All rights reserved. In summary • Impala offers • Flexibility • Cost-effective scale • Open architecture • Leading performance • Recent improvements/enhancements: • Performance • Real-time and cloud capabilities • Be sure to check out the Impala Cookbook (SlideShare) for more performance protips
  • 32. 32© Cloudera, Inc. All rights reserved. Thank you Downloads: https://cloudera.com/downloads Interested in contributing? http://impala.io