1. Impala: A Modern, Open-Source SQL Engine for Hadoop
Mark Grover
Software Engineer, Cloudera
May 22nd, 2014
Twitter: mark_grover
Slides at slideshare.net/markgrover/introduction-to-impala
2. Agenda
• What is Hadoop?
• What is Impala?
• Use cases for Impala
• Architecture of Impala
• Impala comparisons and performance
• Demo (time permitting)
4. What is Apache Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is:
• Distributed
• Fault tolerant
• Scalable
Core Hadoop system components:
• Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
• MapReduce: distributed computing framework
Has the flexibility to store and mine any type of data
• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema
Excels at processing complex data
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks
Scales economically
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in
5. MapReduce - the good and the bad
The Good
• Versatile
• Flexible
• Scalable
The Bad
• High latency
• Batch oriented
• Not all paradigms fit very well
• Only for developers
6. What are Hive and Pig?
• MR is hard and only for developers
• Higher-level platforms for converting declarative syntax to MapReduce
  • SQL - Hive
  • Workflow language - Pig
• Built on top of MapReduce (although they are being made more pluggable now)
• But jobs are still as slow as MapReduce
8. What is Impala?
• General-purpose SQL engine
• Real-time queries in Apache Hadoop
• Beta version available since October 2012
• General availability (v1.0) release out since April 2013
• Open source under Apache license
• Latest release (v1.3.1) released on May 1st, 2014
9. Impala Overview: Goals
• General-purpose SQL query engine:
  • Works for both analytical and transactional/single-row workloads
  • Supports queries that take from milliseconds to hours
• Runs directly within Hadoop:
  • reads widely used Hadoop file formats
  • talks to widely used Hadoop storage managers
  • runs on same nodes that run Hadoop processes
• High performance:
  • C++ instead of Java
  • runtime code generation
  • completely new execution engine - no MapReduce
10. User View of Impala: Overview
• Runs as a distributed service in cluster: one Impala daemon on each node with data
• Highly available: no single point of failure
11. User View of Impala: Overview
• There is no "Impala format"!
• Supported file formats:
  • uncompressed/LZO-compressed text files
  • sequence files and RCFile with Snappy/gzip compression
  • Avro data files
  • Parquet columnar format (more on that later)
  • HBase
12. User View of Impala: SQL
• SQL support:
  • essentially SQL-92, minus correlated subqueries
  • only equi-joins; no non-equi joins, no cross products
  • Order By requires Limit
  • (Limited) DDL support
  • SQL-style authorization via Apache Sentry (incubating)
  • UDFs and UDAFs are supported
13. User View of Impala: SQL
• Functional limitations:
  • No custom file formats or SerDes
  • Nothing beyond SQL (buckets, samples, transforms, arrays, structs, maps, xpath, json)
• Broadcast joins and partitioned hash joins supported
  • Smaller table has to fit in aggregate memory of all executing nodes
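The two join strategies named on this slide can be illustrated with a toy Python sketch. It is not Impala's implementation: the function names, the row dictionaries, and the way "nodes" are modelled as plain lists are all assumptions made for illustration. It also shows why the smaller table's size matters, since a broadcast join hands every node a full copy of it.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Build a hash table on build_rows, then probe it with probe_rows."""
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)
    return [{**b, **p} for p in probe_rows for b in table.get(p[probe_key], [])]

def broadcast_join(small, big_by_node, small_key, big_key):
    """Each node gets a full copy of the small table and joins its local rows."""
    out = []
    for local_big in big_by_node:
        out.extend(hash_join(small, local_big, small_key, big_key))
    return out

def partitioned_join(left, right, left_key, right_key, num_nodes):
    """Hash-partition both inputs on the join key so matching rows meet on one node."""
    left_parts = [[] for _ in range(num_nodes)]
    right_parts = [[] for _ in range(num_nodes)]
    for row in left:
        left_parts[hash(row[left_key]) % num_nodes].append(row)
    for row in right:
        right_parts[hash(row[right_key]) % num_nodes].append(row)
    out = []
    for l_part, r_part in zip(left_parts, right_parts):
        out.extend(hash_join(l_part, r_part, left_key, right_key))
    return out
```

Both strategies return the same rows. They differ in what moves over the network: the whole small table to every node, or a slice of both tables to each node.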
15. Impala Use Cases
Cost-effective, ad hoc query environment that offloads/replaces the data warehouse for:
• Interactive BI/analytics on more data
• Asking new questions - exploration, ML
• Data processing with tight SLAs
• Query-able archive w/full fidelity
16. Global Financial Services Company
Saved 90% on incremental EDW spend & improved performance by 5x
• Offload data warehouse for query-able archive
• Store decades of data cost-effectively
• Process & analyze on the same system
• Improved capabilities through interactive query on more data
17. Digital Media Company
20x performance improvement for exploration & data discovery
• Easily identify new data sets for modeling
• Interact with raw data directly to test hypotheses
• Avoid expensive DW schema changes
• Accelerate "time to answer"
19. Impala Architecture
• Three binaries: impalad, statestored, catalogd
• Impala daemon (impalad) - N instances
  • handles client requests and all internal requests related to query execution
• State store daemon (statestored) - 1 instance
  • Provides name service and metadata distribution
• Catalog daemon (catalogd) - 1 instance
  • Relays metadata changes to all impalads
20. Impala Architecture: Query Execution
Request arrives via ODBC/JDBC
[Diagram: a SQL app sends a SQL request over ODBC to one of three impalad nodes. Each node has a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase. Shared services: Hive Metastore, HDFS NN, Statestore + Catalogd.]
21. Impala Architecture: Query Execution
Planner turns request into collections of plan fragments
Coordinator initiates execution on remote impalads
[Diagram: a SQL app connected over ODBC to one of three impalad nodes. Each node has a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase. Shared services: Hive Metastore, HDFS NN, Statestore + Catalogd.]
22. Impala Architecture: Query Execution
Intermediate results are streamed between impalads
Query results are streamed back to client
[Diagram: query results returned to a SQL app over ODBC from one of three impalad nodes. Each node has a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase. Shared services: Hive Metastore, HDFS NN, Statestore + Catalogd.]
23. Query Planning: Overview
• 2-phase planning process:
  • single-node plan: left-deep tree of plan operators
  • plan partitioning: partition single-node plan to maximize scan locality, minimize data movement
• Parallelization of operators:
  • All query operators are fully distributed
24. Query Planning: Single-Node Plan
• Plan operators: Scan, HashJoin, HashAggregation, Union, TopN, Exchange
25. Single-Node Plan: Example Query
SELECT t1.custid,
       SUM(t2.revenue) AS revenue
FROM LargeHdfsTable t1
JOIN LargeHdfsTable t2 ON (t1.id1 = t2.id)
JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
WHERE t3.category = 'Online'
GROUP BY t1.custid
ORDER BY revenue DESC LIMIT 10;
26. Query Planning: Single-Node Plan
• Single-node plan for example:
[Diagram: operator tree with TopN on top of Agg, on top of two HashJoin operators over Scan: t1, Scan: t2, and Scan: t3]
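A left-deep plan for the example query can be written down as a small tree structure. The sketch below is only illustrative: `PlanNode`, `left_deep_joins`, and `render` are hypothetical names, and the operator labels are taken from the slide. The join order follows the order of the tables in the query.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class PlanNode:
    op: str
    children: Tuple["PlanNode", ...] = ()

def scan(table):
    return PlanNode(f"Scan: {table}")

def left_deep_joins(tables):
    """Fold tables left to right: the left input of each join is the plan
    built so far, the right input is always a single scan."""
    plan = scan(tables[0])
    for table in tables[1:]:
        plan = PlanNode("HashJoin", (plan, scan(table)))
    return plan

def render(node, depth=0):
    """Return the tree as indented lines, parent before children."""
    lines = ["  " * depth + node.op]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines

# Joins first, then aggregation, then the ORDER BY ... LIMIT as TopN.
plan = PlanNode("TopN", (PlanNode("Agg", (left_deep_joins(["t1", "t2", "t3"]),)),))
print("\n".join(render(plan)))
```

In a left-deep tree every join's right child is a base-table scan, which the nested `HashJoin` structure makes visible.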
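27. Query Planning: Distributed Plans
[Diagram: distributed plan for the example query. Operators: Scan: t1, Scan: t2, Scan: t3, two HashJoin operators, Pre-Agg, MergeAgg, and two TopN operators. Exchange labels: Broadcast (twice), hash t2.id, hash t1.id1, hash t1.custid. Placement labels: at HDFS DN, at HBase RS, at coordinator.]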
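The Pre-Agg, MergeAgg, and TopN stages in the distributed plan can be sketched as follows. This is a simplified model under stated assumptions, not Impala's code: nodes are plain lists, `sum_by_key` and `distributed_top_n` are hypothetical names, and the exchange is simulated by hashing on `custid` into an assumed number of merge nodes.

```python
import heapq

def sum_by_key(pairs):
    """SUM(value) GROUP BY key over an iterable of (key, value) pairs."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals

def distributed_top_n(rows_by_node, n, num_merge_nodes=2):
    # Pre-Agg: every node sums only the rows it holds locally.
    partials = [sum_by_key(rows) for rows in rows_by_node]

    # Exchange: hash-partition partial sums on custid, so all partial sums
    # for one customer arrive at the same merge node.
    buckets = [[] for _ in range(num_merge_nodes)]
    for partial in partials:
        for custid, subtotal in partial.items():
            buckets[hash(custid) % num_merge_nodes].append((custid, subtotal))

    # MergeAgg + local TopN: each merge node finishes its customers and
    # keeps only its n largest totals.
    local_tops = [
        heapq.nlargest(n, sum_by_key(bucket).items(), key=lambda kv: kv[1])
        for bucket in buckets
    ]

    # Final TopN at the coordinator over at most n * num_merge_nodes rows.
    candidates = (kv for top in local_tops for kv in top)
    return heapq.nlargest(n, candidates, key=lambda kv: kv[1])
```

Because each customer is finished on exactly one merge node, taking the top n of the per-node top n lists gives the same answer as a global sort, while the coordinator sees only a handful of rows.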
28. Metadata Handling
• Impala metadata:
  • Hive's metastore: logical metadata (table definitions, columns, CREATE TABLE parameters)
  • HDFS NameNode: directory contents and block replica locations
  • HDFS DataNode: block replicas' volume IDs
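Block replica locations are what make locality-aware scans possible. The sketch below shows one simple way such metadata could be used: a greedy assignment that sends each block to the least-loaded node holding a replica. The slides do not describe Impala's actual scheduling policy, so the heuristic, the function name `assign_scan_ranges`, and the input shape are all assumptions.

```python
def assign_scan_ranges(block_replicas):
    """Map {block_id: [nodes holding a replica]} to {node: [block_ids]}.

    Every block is read by a node that stores it locally; among the
    replica holders, the one with the fewest blocks so far is chosen
    (ties broken by node name to keep the result deterministic).
    """
    assignment = {}
    for block_id, nodes in sorted(block_replicas.items()):
        target = min(nodes, key=lambda node: (len(assignment.get(node, [])), node))
        assignment.setdefault(target, []).append(block_id)
    return assignment

replicas = {"b1": ["n1", "n2"], "b2": ["n1", "n3"], "b3": ["n1", "n2"]}
print(assign_scan_ranges(replicas))
```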
29. Impala Execution Engine
• Written in C++ for minimal execution overhead
• Internal in-memory tuple format puts fixed-width data at fixed offsets
• Uses intrinsics/special CPU instructions for text parsing, crc32 computation, etc.
• Runtime code generation for "big loops"
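The idea of fixed-width data at fixed offsets can be illustrated with Python's `struct` module. The row layout here (an int32 id, a float64 revenue, an int64 timestamp) is a made-up example, not Impala's tuple format. The point is that a field's position is a simple sum of a row offset and a constant, so no earlier field has to be parsed to reach it.

```python
import struct

# Hypothetical row: int32 id, float64 revenue, int64 ts.
# "<" means little-endian with no padding, so each offset is just the
# running sum of the preceding field sizes.
LAYOUT = struct.Struct("<idq")
OFFSETS = {"id": 0, "revenue": 4, "ts": 12}

def pack_rows(rows):
    """Store rows back to back in one contiguous buffer."""
    buf = bytearray(LAYOUT.size * len(rows))
    for i, row in enumerate(rows):
        LAYOUT.pack_into(buf, i * LAYOUT.size, *row)
    return buf

def read_revenue(buf, row_index):
    """Jump straight to one field: row start plus a constant offset."""
    position = row_index * LAYOUT.size + OFFSETS["revenue"]
    return struct.unpack_from("<d", buf, position)[0]

buf = pack_rows([(1, 9.5, 100), (2, 3.25, 200)])
print(read_revenue(buf, 1))
```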
30. Impala Execution Engine
• More on runtime code generation
  • example of "big loop": insert batch of rows into hash table
  • known at query compile time: # of tuples in a batch, tuple layout, column types, etc.
  • generate at compile time: unrolled loop that inlines all function calls, contains no dead code, minimizes branches
  • code generated using LLVM
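As a loose analogy for runtime code generation, the sketch below specializes a row-hashing loop once the key columns are known. Impala emits native code through LLVM; here Python source text stands in for that. The hash formula and function names are invented for illustration. The generic version loops over the column list for every row, while the generated version has that inner loop unrolled.

```python
def interpreted_hash_rows(rows, key_columns):
    """Generic path: re-walks the key column list for every row."""
    out = []
    for row in rows:
        h = 17
        for col in key_columns:
            h = (h * 31 + hash(row[col])) & 0xFFFFFFFF
        out.append(h)
    return out

def generate_hash_rows(key_columns):
    """'Codegen' path: emit a function with one straight-line statement
    per key column, then compile it once per query."""
    lines = [
        "def hash_rows(rows):",
        "    out = []",
        "    for row in rows:",
        "        h = 17",
    ]
    for col in key_columns:
        lines.append(f"        h = (h * 31 + hash(row[{col!r}])) & 0xFFFFFFFF")
    lines += ["        out.append(h)", "    return out"]
    namespace = {}
    exec("\n".join(lines), namespace)  # compile the generated source
    return namespace["hash_rows"]

rows = [(1, 2, 3), (4, 5, 6)]
specialized = generate_hash_rows([0, 2])
print(specialized(rows) == interpreted_hash_rows(rows, [0, 2]))
```

The generated function computes exactly what the generic one does. The difference is that decisions fixed for the whole query (which columns, how many) are made once at generation time instead of once per row.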
31. Comparing Impala to Dremel
• What is Dremel?
  • columnar storage for data with nested structures
  • distributed scalable aggregation on top of that
• Columnar storage in Hadoop: Parquet
  • stores data in appropriate native/binary types
  • can also store nested structures similar to Dremel's ColumnIO
  • Parquet is open source: github.com/parquet
• Distributed aggregation: Impala
• Impala plus Parquet: a superset of the published version of Dremel (which didn't support joins)
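A minimal sketch of what "columnar" means, using flat rows only. Parquet's real encoding, including nested structures, is far more involved; the `to_columns` helper and the sample data are assumptions for illustration. Once values are grouped by column, an aggregate over one column never touches the others.

```python
def to_columns(rows, schema):
    """Pivot row-oriented records into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(schema)}

rows = [(1, "Online", 10.0), (2, "Store", 4.5), (3, "Online", 2.5)]
columns = to_columns(rows, ("id", "category", "revenue"))

# SUM(revenue) reads a single contiguous list and skips id and category.
total_revenue = sum(columns["revenue"])
print(total_revenue)
```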
33. Impala Performance Results
• Impala's Latest Milestone:
  • Speed comparable to a commercial MPP DBMS
  • Natively on Hadoop
• Three Result Sets:
  • Impala vs Hive 0.12 (Impala 6-70x faster)
  • Impala vs "DBMS-Y" (Impala average of 2x faster)
  • Impala scalability (Impala achieves linear scale)
• Background
  • 20 pre-selected, diverse TPC-DS queries (modified to remove unsupported language)
  • Sufficient data scale for realistic comparison (3 TB, 15 TB, and 30 TB)
  • Realistic nodes (e.g. 8-core CPU, 96GB RAM, 12x2TB disks)
  • Methodical testing (multiple runs, reviewed fairness for competition, etc.)
• Details: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/
37. Impala Scalability: 2x the Hardware and 2x Users/Data
(Expectation: Constant Response Times)
[Charts: 2x the Users, 2x the Hardware; 2x the Data, 2x the Hardware]
38. Demo
• Uses Cloudera's Quickstart VM: http://tiny.cloudera.com/quick-start
• Dataset/queries from https://github.com/markgrover/cloudcon-hive
39. I am co-authoring an O'Reilly book
Hadoop Application Architectures
How to build end-to-end solutions using Apache Hadoop and related tools
@hadooparchbook
www.hadooparchitecturebook.com
40. Try it out!
• Open source! Available at cloudera.com, AWS EMR!
• Packages for many different Linux flavours
• Questions/comments? community.cloudera.com
• My Twitter handle: mark_grover
• Slides at: slideshare.net/markgrover/introduction-to-impala