SQL on Hadoop

© Cloudera, Inc. All rights reserved.
Choosing the Right Tool for the Right Job
Overview of Cloudera’s SQL-on-Hadoop Technologies
Jianwei Li
Jarred@cloudera.com

Hadoop Ecosystem
• OPERATIONS: Cloudera Manager, Cloudera Director
• DATA MANAGEMENT: Cloudera Navigator, Encrypt and KeyTrustee, Optimizer
• INTEGRATE
• STRUCTURED: Sqoop
• UNSTRUCTURED: Kafka, Flume
• PROCESS, ANALYZE, SERVE
• BATCH: Spark, Hive, Pig, MapReduce
• STREAM: Spark
• SQL: Impala
• SEARCH: Solr
• SDK: Kite
• UNIFIED SERVICES
• RESOURCE MANAGEMENT: YARN
• SECURITY: Sentry, RecordService
• STORE
• FILESYSTEM: HDFS
• RELATIONAL: Kudu
• NoSQL: HBase

Choosing the Right SQL Engine
Know Your Audience, Know Your Use Case
Hive, Spark SQL, or Impala

Hive
SQL on Hadoop

What is Hive?
• Data warehouse system for Hadoop
• Enables Extract/Transform/Load (ETL)
• Associates structure with a variety of data formats
• Integrates with HDFS, HBase, MongoDB, etc.
• Query execution in MapReduce

Hive Architecture

Hive features
• Create table, create view, create index (DDL)
• Select, where clause, group by, order by, joins
• Pluggable User Defined Functions: UDFs (e.g. from_unixtime)
• Pluggable User Defined Aggregate Functions: UDAFs (e.g. count, avg)
• Pluggable User Defined Table Generating Functions: UDTFs (e.g. explode)
• Pluggable custom Input/Output formats
• Pluggable Serialization/Deserialization libraries (SerDes)
• Pluggable custom map and reduce scripts
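The three function kinds listed above differ in how many rows go in and come out. The following is a minimal Python sketch of that shape only, not Hive's Java UDF API; the function bodies and the sample row are illustrative assumptions, and this from_unixtime formats in UTC rather than in a session time zone.

```python
from datetime import datetime, timezone


def from_unixtime(ts):
    """UDF shape: one input value in, one output value out."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


def avg(values):
    """UDAF shape: many input rows in, one aggregate value out."""
    values = list(values)
    return sum(values) / len(values)


def explode(row, array_column):
    """UDTF shape: one input row in, zero or more output rows out."""
    for item in row[array_column]:
        yield {**row, array_column: item}


# Hypothetical sample row with an array-valued column.
row = {"user": "a", "tags": ["x", "y"]}
exploded = list(explode(row, "tags"))

print(from_unixtime(0))
print(avg([1, 2, 3]))
print(exploded)
```

In Hive itself these would be registered as pluggable classes and called from HiveQL; the sketch only shows why explode changes the row count while a plain UDF does not.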

What Hive does NOT support
• OLTP workloads (low latency)
• Small data sets: per-query overhead makes Hive slow on small amounts of data
• How much data do you need to call it “Big Data”?

The Future of Hive
• Hive is designed for batch processing and does it well
• Hive is not architected for low-latency and multi-user interactive queries
• Hive-on-DAG (Spark) provides incrementally faster batch processing

Spark SQL
SQL on Hadoop

DataFrames
• Distributed collection of data organized as named, typed columns
• Like RDDs, they consist of partitions, can be cached, and have fault tolerance via lineage
• Can be constructed from:
• Structured data files: JSON, Avro, Parquet, etc.
• Tables in Hive
• Tables in a RDBMS
• Existing RDDs by programmatically applying schema

Spark SQL
• SQL statements to process DataFrames
• Embed SQL statements in your Scala, Java, or Python Spark application
• Queries can also be issued via JDBC/ODBC

Spark SQL Performance
SQL processed by Query Optimizer -> Automatic Optimizations
• Compressed memory format (as opposed to Java serialized objects in RDDs)
• Predicate pushdown (read less data to reduce I/O)
• Optimal pipelining of operations
• Cost-based optimizer
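The benefit of applying a predicate early can be shown with a toy example. This is a plain Python sketch of a filter pushed below a join, not Spark's optimizer; the sample rows and the hash_join helper are hypothetical.

```python
# Hypothetical sample data, shaped like the movies/ratings example in this deck.
movies = [{"titleId": 1, "title": "A"}, {"titleId": 2, "title": "B"}]
ratings = [
    {"titleId": 1, "month": "Nov", "rating": 5},
    {"titleId": 1, "month": "Oct", "rating": 3},
    {"titleId": 2, "month": "Nov", "rating": 4},
    {"titleId": 2, "month": "Nov", "rating": 2},
]


def hash_join(left, right, key):
    """Inner join: build a hash index on the left side, probe with the right."""
    index = {}
    for l_row in left:
        index.setdefault(l_row[key], []).append(l_row)
    return [{**l_row, **r_row} for r_row in right for l_row in index.get(r_row[key], [])]


def is_nov(row):
    return row["month"] == "Nov"


# Plan as written: join every rating, then filter.
naive = [row for row in hash_join(movies, ratings, "titleId") if is_nov(row)]

# Rewritten plan: filter first, so fewer rows reach the join.
pushed_input = [r for r in ratings if is_nov(r)]
pushed = hash_join(movies, pushed_input, "titleId")

print(len(ratings), len(pushed_input), naive == pushed)
```

Both plans return the same rows; the rewritten one simply feeds the join fewer of them, which is the kind of saving an optimizer looks for when it can push the predicate all the way to the data source.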

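SparkSQL Example (Scala)
// Assumes Movie and Rating are case classes that can be built from one line of text,
// and that sqlContext.implicits._ is imported so that toDF() is available.
import org.apache.spark.sql.functions.count

val movies = sc.textFile("movies.txt")
  .map(Movie(_))
  .toDF()
val ratings = sc.textFile("ratings.txt")
  .map(Rating(_))
  .toDF()
// Count November ratings per movie title.
movies.join(ratings, "titleId")
  .filter("month = 'Nov'")
  .groupBy(movies("title"))
  .agg(count(ratings("rating")))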

Mixing SparkSQL and Machine Learning

Why Spark SQL
• Ease of embedding SQL into Java, Scala, or Python applications
• Easy language for common operations (e.g., aggregations, filters, samples)
• Seamlessly mix SQL and Spark code within a single application
• Improved performance with automatic optimizations (Intelligent Query Engine)

Impala
SQL on Hadoop

What’s Impala?
• Interactive SQL
• Typically 5-70x faster than the latest Hive
• Responses in seconds instead of minutes (sometimes sub-second)
• ANSI-92 standard SQL queries with HiveQL
• Compatible SQL interface for existing Hadoop/CDH applications
• Based on industry standard SQL
• Natively on Hadoop/HBase storage and metadata
• Flexibility, scale, and cost advantages of Hadoop
• No duplication/synchronization of data and metadata
• Local processing to avoid network bottlenecks
• Separate runtime from batch processing
• Hive, Pig, and MapReduce are designed for batch and do it well
• Impala is purpose-built for low-latency SQL queries on Hadoop

Business Intelligence with Impala

Where does the Performance Come From?
• No MapReduce; No JVM; All Native
• In-Memory Data Transfers
• Optimized File Format (i.e., Parquet)
• In-Memory HDFS Caching
• Cost-Based Join Order Optimization: frees the user from having to guess the correct join order

Impala Architecture
• Impala daemon (impalad)
• One Impala daemon on each node that holds data
• Handles external client requests and all internal requests related to query execution
• State store daemon (statestored)
• Checks the health of the impalad daemons; one process per cluster
• Not part of the query execution path
• Catalog service (catalogd)
• Relays metadata changes to the Impala daemons on all nodes
• One process per cluster

Impala Architecture

Impala Architecture: Query Execution Phases
• Client SQL arrives via ODBC/JDBC/Hue GUI/Shell
• Planner turns request into collections of plan fragments
• Coordinator initiates execution on the impalad instances local to the data
• During execution:
• intermediate results are streamed between executors
• query results are streamed back to client

Impala Architecture: Planner
• Example: query with join and aggregation
SELECT state, SUM(revenue)
FROM HdfsTbl h JOIN HbaseTbl b ON (...)
GROUP BY 1 ORDER BY 2 desc LIMIT 10
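[Figure: the single-node plan next to its distributed counterpart. Operators shown: HDFS Scan, HBase Scan, Hash Join, Agg, TopN, and Exch (exchange). The fragments of the distributed plan are labeled "at coordinator", "at DataNodes", and "at region servers".]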
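The split between a local aggregation and a merge step after an exchange can be sketched in a few lines. This is a simplified Python illustration, not Impala's planner or runtime: the join is omitted, each node is assumed to already hold joined (state, revenue) pairs, and the function names and sample rows are hypothetical.

```python
from collections import Counter
import heapq


def partial_aggregate(rows):
    """Per-node step: SUM(revenue) GROUP BY state over the node's local rows."""
    sums = Counter()
    for state, revenue in rows:
        sums[state] += revenue
    return sums


def merge_and_top_n(partials, n):
    """Post-exchange step: merge partial sums, then ORDER BY 2 DESC LIMIT n."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds the sums per key
    return heapq.nlargest(n, total.items(), key=lambda kv: kv[1])


# Hypothetical rows held by two nodes.
node_rows = [
    [("CA", 10), ("NY", 5), ("CA", 7)],
    [("NY", 20), ("TX", 1)],
]
partials = [partial_aggregate(rows) for rows in node_rows]
top = merge_and_top_n(partials, 2)
print(top)
```

Only the small per-state partial sums cross the network, never the raw rows, which is why the plan places an Agg on both sides of the Exch.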

Impala Architecture: Query Execution
• Request arrives via ODBC/JDBC/Hue GUI/Shell
[Figure: a SQL App sends a SQL request over ODBC to one of three impalad nodes. Each node runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DataNode (DN) and HBase. Shared services: Hive Metastore, HDFS NameNode (NN), Statestore.]

• Planner turns request into collections of plan fragments
• Coordinator initiates execution on the impalad instances local to the data
[Figure: three impalad nodes, each with a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DataNode (DN) and HBase; the SQL App connects over ODBC. Shared services: Hive Metastore, HDFS NameNode (NN), Statestore.]

• Intermediate results are streamed between impalad instances
• Query results are streamed back to the client
[Figure: three impalad nodes, each with a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DataNode (DN) and HBase; query results flow back to the SQL App over ODBC. Shared services: Hive Metastore, HDFS NameNode (NN), Statestore.]

Impala and Hive
• Everything Client-Facing is Shared with Hive:
• Metadata (table definitions)
• ODBC/JDBC drivers
• Hue GUI
• SQL syntax (HiveQL)
• Flexible file formats
• Machine pool
• Internal Improvements:
• Purpose-built query engine direct on HDFS and HBase
• No JVM startup and no MapReduce
• In-memory data transfers
• Native distributed relational query engine

Parquet Overview
• State-of-the-art, open source columnar file format that’s available for (most) Hadoop processing frameworks:
• Impala, Hive, Pig, MapReduce, Spark, Cascading, Crunch, Drill, Tajo, …
• Offers both high compression and high scan efficiency
• Co-developed by Twitter and Cloudera
• Contributions from Criteo, Stripe, Berkeley AMPLab, LinkedIn
• Top-Level Apache Project

Columnar storage
Tweet_id:   {25059873, 22309487, 23059861, 23010982}
User_name:  {newsycbot, RideImpala, fastly, llvmorg}
Created_at: {1442865158, 1442828307, 1442865156, 1442865155}
text:       {Visual exp…, Introducing .., Missing July…, LLVM 3.7….}

Columnar storage
Tweet_id (1GB):   {25059873, 22309487, 23059861, 23010982}
User_name (2GB):  {newsycbot, RideImpala, fastly, llvmorg}
Created_at (1GB): {1442865158, 1442828307, 1442865156, 1442865155}
text (200GB):     {Visual exp…, Introducing .., Missing July…, LLVM 3.7….}
SELECT COUNT(*) FROM tweets WHERE user_name = 'newsycbot';
Only read 1 column (User_name)
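The column-pruning effect on this slide can be mimicked with a toy column store. This is a minimal Python sketch, not Parquet's reader; the dictionary layout and the count_where helper are assumptions, and the truncated text column is left out.

```python
# Toy column store: each column is kept as its own list (values from the slide).
columns = {
    "tweet_id": [25059873, 22309487, 23059861, 23010982],
    "user_name": ["newsycbot", "RideImpala", "fastly", "llvmorg"],
    "created_at": [1442865158, 1442828307, 1442865156, 1442865155],
}


def count_where(store, column, value):
    """COUNT(*) with an equality predicate; returns the count and the columns read."""
    touched = [column]
    matches = sum(1 for v in store[column] if v == value)
    return matches, touched


matches, touched = count_where(columns, "user_name", "newsycbot")
print(matches, touched)
```

A row-oriented layout would have to read every field of every tweet to answer the same question, including the large text column.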

Columnar compression
Created_at     Diff(created_at)
1442865158     n/a
1442828307     -36851
1442865156     36849
1442865155     -1
64 bits each   17 bits each
• Many columns can compress to a few bits per row!
• Especially:
• Timestamps
• Time series values
• Low-cardinality strings
• Massive space savings and throughput increase!
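The numbers on this slide can be reproduced with a short delta-encoding sketch. This is plain Python for illustration, not Parquet's actual encoder; the helper names are hypothetical, and the bit count assumes a simple fixed-width layout with one sign bit.

```python
def delta_encode(values):
    """Return the first value plus the successive differences."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    return values[0], deltas


def delta_decode(first, deltas):
    """Rebuild the original values by accumulating the differences."""
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out


def signed_bits(deltas):
    """Bits per delta at fixed width: magnitude bits of the largest delta plus a sign bit."""
    return max(abs(d) for d in deltas).bit_length() + 1


created_at = [1442865158, 1442828307, 1442865156, 1442865155]
first, deltas = delta_encode(created_at)
print(deltas)               # [-36851, 36849, -1]
print(signed_bits(deltas))  # 17
```

Sorted or slowly changing columns give even smaller differences, which is why timestamps and time series compress so well.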

Impala Performance
• Impala’s latest milestone:
• Speed comparable to a commercial MPP DBMS
• Natively on Hadoop
• Three result sets:
• Impala vs Hive (Impala 6-70x faster)
• Impala vs “DBMS-Y” (Impala average of 2x faster)
• Impala scalability (Impala achieves linear scale)
• Background:
• 20 pre-selected, diverse TPC-DS queries (modified to remove unsupported language)
• Sufficient data scale for realistic comparison (3 TB, 15 TB, and 30 TB)
• Methodical testing (multiple runs, reviewed fairness for competition, etc.)
• Details: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/

Impala vs Hive (Lower bars are better)

Impala vs “DBMS-Y” (Lower bars are better)

Impala Scalability: 2x the Hardware
(Expectation: Cut response times in half)

Impala Scalability: 2x the Hardware and 2x Users/Data
(Expectation: Constant response times)
2x the Users, 2x the Hardware
2x the Data, 2x the Hardware

Impala Roadmap
2H 2015 1H 2016 2016
• SQL Support & Usability
• Nested structures
• Kudu updates (beta)
• Management & Security
• Record reader service (beta)
• Finer-grained security (Sentry)
• Integration
• Isilon support
• Python interface (Ibis)
• Performance & Scale
• Improved predictability under concurrency
• Continued scalability and concurrency
• Initial perf/scale improvements
• Improved admission control
• Resource utilization and showback
• Dynamic partitioning
• Improved timestamp compatibility
• >20x performance
• Multi-threaded joins/aggregations
• Continued scale work
• Improved YARN integration
• Automated metadata
• Integration
• S3 support
• Nested types with Avro
• Date type
• Added SQL extensions

Typical Usage Scenarios
SQL on Hadoop

Apache Hive
Batch Processing
• User:
• SQL-based ETL developers
• Designed for:
• Handful of concurrent, very long-running batch jobs
• Strengths:
• Custom file formats
• Very long-running ETL, data preparation, or batch processing
• Massive ETL sorts with joins
• Existing Hive jobs

Apache Impala
BI and Analytics
User:
• Data Analysts
• BI Users
Designed for:
• Interactive SQL for a large number of BI users and analysts
Strengths:
• Multi-user scale
• Interactive latency
• Compatibility (BI tools, ANSI SQL, and vendor-specific SQL)
• Usability

Apache Spark SQL
Machine Learning Applications
User:
• Data Engineers
• Data Scientists
Designed for:
• Ease of development for Spark developers
• Handful of concurrent Spark jobs
Strengths:
• Ease of embedding SQL into Java or Scala applications
• SQL for common functionality in developer flow (e.g., aggregations, filters, samples)

SQL-on-Hadoop Benchmark
Impala, Spark SQL, Hive-on-Tez
Versions:
• Impala 2.3
• Hive 2.0 on Tez 0.5.2 (aka “Stinger”)
• Spark SQL 1.5 with Tungsten
• Benchmark Details
• Based on industry standards (TPC)
• Repeatable (https://github.com/cloudera/impala-tpcds-kit)
• Methodical testing with multiple runs on same hardware
• Help competing software do well
• Run on optimal file formats for each
• Tune query engines appropriately
Full Details: http://blog.cloudera.com/blog/2016/02/new-sql-benchmarks-apache-impala-incubating-2-3-uniquely-delivers-analytic-database-performance/

Impala Multi-User Performance: Over 7x Faster with Just 10 Users
Single User vs 10 User Response Time, in seconds (lower is better), with each engine's time relative to Impala

Engine                      Single User   10 Users   Single User vs Impala   10 Users vs Impala
Impala                      4             12.8
Spark SQL (with Tungsten)   32            97         7.2x                    7.6x
Hive-on-Tez                 59            210        13.4x                   16.4x

Impala Enables Nearly 7x Throughput
More Work Done in Less Time
Query Throughput in queries per hour (higher is better), with Impala's throughput relative to each engine

Engine                      Queries per Hour   Impala Times Faster
Impala                      2045
Spark SQL (with Tungsten)   302                6.8x
Hive-on-Tez                 136.0              15x
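The "times faster" labels on this chart follow from dividing Impala's throughput by each engine's. Here is a small Python check; the pairing of each value with an engine is an assumption read from the chart order, and the speedup_over helper is hypothetical.

```python
# Queries per hour as read from the chart; the engine-to-value pairing is an assumption.
throughput = {"Impala": 2045, "Spark SQL": 302, "Hive-on-Tez": 136.0}


def speedup_over(baseline, qph):
    """Ratio of the baseline engine's throughput to every other engine's."""
    return {
        engine: round(qph[baseline] / value, 1)
        for engine, value in qph.items()
        if engine != baseline
    }


ratios = speedup_over("Impala", throughput)
print(ratios)  # {'Spark SQL': 6.8, 'Hive-on-Tez': 15.0}
```

The 6.8x figure against Spark SQL is the "nearly 7x" in the slide title.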

Performance Benchmark Takeaways
• Impala unlocks BI usage directly on Hadoop
• Meets BI low-latency and multi-user requirements
• Advantage widens when going from a single user to just 10 users
• Hive is designed (and still great) for batch processing
• Most Impala customers use Hive for data preparation
• Hive is the most commonly used ETL framework
• Spark SQL enables easier Spark application development
• Enables mixed procedural Spark (Java/Scala) and SQL job development
• Mid-term trends will further favor Impala’s design approach for latency and concurrency
• More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
• CPU efficiency will increase in importance
• Native code enables easy optimizations for CPU instruction sets
• Intel joint roadmap supports these opportunities

IBM Research Validation
• VLDB academic paper compares Impala and Hive (both MR and Tez) for SQL-on-Hadoop
• http://www.vldb.org/pvldb/vol7/p1295-floratou.pdf
• Impala is significantly more efficient than Hive/Tez or Hive/MR
• Impala’s lead is due to CPU efficiency, I/O manager, and overall architecture that resembles a shared-nothing parallel database
• Parquet more efficient than ORC
• Additional Notes:
• Impala 1.4 and higher is significantly faster on selective joins than Impala 1.2.2 used in the paper
• Impala 2.0 has disk-based joins and aggregations
• The paper compares single-user workloads only; Impala's advantage would be even larger with multiple users
“Impala’s database-like architecture provides significant performance gains, compared to Hive’s MapReduce or Tez-based runtime”
“The Parquet format skips data more efficiently than ORC, which tends to pre-fetch unnecessary data, especially when a table contains a large number of columns”

Choosing the Right SQL Engine
Know Your Audience, Know Your Use Case
Use Case                 Tool        User
Batch Processing         Hive        SQL-Based ETL Developer
BI and SQL Analytics     Impala      Data Analyst
Procedural Development   Spark SQL   Data Engineer/Data Scientist

Thank You
