Cloudera Impala provides a fast, ad hoc query capability to Apache Hadoop, complementing traditional MapReduce batch processing. Learn the design choices and architecture behind Impala, and how to use near-ubiquitous SQL to explore your own data at scale.
As presented to Portland Big Data User Group on July 23rd 2014.
http://www.meetup.com/Hadoop-Portland/events/194930422/
2. Thirty Seconds About Alex
• Solutions Architect
• aka consultant
• government
• infrastructure
• former coder of Perl
• former administrator
• fan of Portland
3. What Does Cloudera Do?
• product
• distribution of Hadoop components, Apache licensed
• enterprise tooling
• support
• training
• services (aka consulting)
• community
4. Disclaimer
• Cloudera builds software
• most donated to Apache
• some closed-source
• Cloudera “products” I reference are open source
• Apache Licensed
• source code is on GitHub
• https://github.com/cloudera
5. What This Talk Isn’t About
• deploying
• Puppet, Chef, Ansible, homegrown scripts, intern labor
• sizing & tuning
• depends heavily on data and workload
• coding
• unless you count XML or CSV or SQL
• algorithms
10. Why “Ecosystem?”
• In the beginning, just Hadoop
• HDFS
• MapReduce
• Today, dozens of interrelated components
• I/O
• Processing
• Specialty Applications
• Configuration
• Workflow
11. HDFS
• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
• http://research.google.com/archive/gfs.html
13. MapReduce (MR)
• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
• http://research.google.com/archive/mapreduce.html
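The paradigm on the MapReduce slide can be illustrated with a toy word count. This is a single-process sketch in plain Python, not Hadoop code: the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample input lines are made up for illustration, and a real job would run the map and reduce steps in parallel across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Run the three phases one after another (a cluster would parallelize them).
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data in batch"])
print(counts)
```

Because each map call sees only one line and each reduce call sees only one key, the work splits naturally across machines, which is why the model suits distributed, batch-oriented processing.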
20. Super Shady SQL Supplement
I am not a SQL wizard by any means…
21. A Simple Relational Database
name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011
22. Interacting with Relational Data
name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011
SELECT * FROM people;
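To try this query without a cluster, the table can be loaded into an in-memory SQLite database using Python's built-in sqlite3 module. This is only a sketch: SQLite stands in for any relational database (it is not Impala), and the column types are an assumption, since the slide shows only column names and values.

```python
import sqlite3

# In-memory SQLite database standing in for any relational store.
conn = sqlite3.connect(":memory:")

# Column types are assumed; the slide gives only names and sample values.
conn.execute(
    "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)

# SELECT * returns every column of every row.
all_rows = conn.execute("SELECT * FROM people").fetchall()
for row in all_rows:
    print(row)
```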
24. Requesting Specific Fields
name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011
SELECT name, state FROM people;
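Naming columns in the SELECT list narrows the result to just those fields. A minimal sketch using Python's sqlite3 module as a stand-in database (not Impala); the column types are assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types are an assumption; the slide shows only names and values.
conn.execute(
    "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)

# Only the two requested columns come back; employer and year are left out.
name_state = conn.execute("SELECT name, state FROM people").fetchall()
print(name_state)
```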
26. Requesting Specific Rows
name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011
SELECT name, state FROM people WHERE year 2012;
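A WHERE clause keeps only the rows that satisfy a condition. The comparison operator in the query above is not legible, so this sketch assumes `>` purely for illustration; with `<` the other two rows would be returned instead. SQLite (via Python's sqlite3 module) stands in for the database, and the column types are assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types are an assumption; the slide shows only names and values.
conn.execute(
    "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)

# Assumed operator: ">" (the original operator is unknown).
# The threshold is bound as a parameter rather than pasted into the SQL text.
recent = conn.execute(
    "SELECT name, state FROM people WHERE year > ?", (2012,)
).fetchall()
print(recent)
```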
28. Two Simple Tables
owner  species  name
Alex   Cactus   Marvin
Joey   Cat      Brain
Sean   None
Paris  Unknown

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011
29. Joining Two Tables
owner  species  name
Alex   Cactus   Marvin
Joey   Cat      Brain
Sean   None
Paris  Unknown

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

SELECT people.name AS owner, people.state AS state, pets.name AS pet
FROM people LEFT JOIN pets ON people.name = pets.owner
32. Joining Two Tables
SELECT people.name AS owner, people.state AS state, pets.name AS pet
FROM people LEFT JOIN pets ON people.name = pets.owner

owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas
Paris  Maryland
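The join can be reproduced end to end with Python's sqlite3 module. This is a sketch under two assumptions: SQLite stands in for the database (it is not Impala), and the blank pet names on the slide are stored as NULL, which Python shows as `None`. Column types are also assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER);
    CREATE TABLE pets (owner TEXT, species TEXT, name TEXT);
    """
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)
# Assumption: blank pet names are NULL (None in Python).
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        ("Alex", "Cactus", "Marvin"),
        ("Joey", "Cat", "Brain"),
        ("Sean", "None", None),
        ("Paris", "Unknown", None),
    ],
)

# LEFT JOIN keeps every row of people, matched or not.
joined = conn.execute(
    """
    SELECT people.name AS owner, people.state AS state, pets.name AS pet
    FROM people LEFT JOIN pets ON people.name = pets.owner
    """
).fetchall()
pet_by_owner = {owner: pet for owner, _state, pet in joined}
print(pet_by_owner)
```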
33. Varying Implementation of JOIN
SELECT people.name AS owner, people.state AS state, pets.name AS pet
FROM people LEFT JOIN pets ON people.name = pets.owner

owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas     ?
Paris  Maryland  ?
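One way to read the question marks is that the value shown for an owner with no pet depends on the database and the client. The sketch below shows how SQLite behaves (it says nothing about other engines): unmatched rows come back as NULL, and wrapping the column in COALESCE makes the placeholder explicit. Here the pets table is assumed to hold rows only for owners who have a pet, the placeholder text `(no pet)` is made up, and column types are assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER);
    CREATE TABLE pets (owner TEXT, species TEXT, name TEXT);
    """
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)
# Assumption for this sketch: owners without a pet have no row in pets at all.
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [("Alex", "Cactus", "Marvin"), ("Joey", "Cat", "Brain")],
)

# Plain LEFT JOIN: SQLite pads unmatched rows with NULL (None in Python).
raw = dict(
    conn.execute(
        "SELECT people.name, pets.name "
        "FROM people LEFT JOIN pets ON people.name = pets.owner"
    ).fetchall()
)

# COALESCE replaces NULL with an explicit, hypothetical placeholder.
explicit = dict(
    conn.execute(
        "SELECT people.name, COALESCE(pets.name, '(no pet)') "
        "FROM people LEFT JOIN pets ON people.name = pets.owner"
    ).fetchall()
)
print(raw)
print(explicit)
```

Spelling out the placeholder in the query removes the dependence on how a particular engine or client renders a missing value.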
35. Cloudera Impala
• Interactive query on Hadoop
• think seconds, not minutes
• Nearly ANSI-92 standard SQL
• compatible with HiveQL
• Native MPP query engine
• built for low-latency queries
36. Cloudera Impala – Design Choices
• Native daemons, written in C/C++
• No JVM, no MapReduce
• Saturate disks on reads
• Uses in-memory HDFS caching
• Re-uses Hive metastore
• Not as fault-tolerant as MapReduce
37. Cloudera Impala – Architecture
• Impala Daemon
• runs on every node
• handles client requests
• handles query planning and execution
• State Store Daemon
• provides name service
• metadata distribution
• used for finding data
39. Impala Query Execution
[Diagram: SQL App / ODBC; Hive Metastore; HDFS NN; Statestore; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase]
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
40. Impala Query Execution
[Diagram: SQL App / ODBC; Hive Metastore; HDFS NN; Statestore; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase; arrow labeled "Query results"]
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
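The execution steps on the two slides above (fragments run where the data lives, partial results flow back, the coordinator returns the answer) can be mimicked with a toy scatter-gather in plain Python. Everything here is hypothetical: the node names, the data placement, and the functions `execute_fragment` and `coordinate` are invented for illustration, the loop runs sequentially rather than in parallel, and none of it reflects Impala's actual code.

```python
from collections import Counter

# Hypothetical data placement: each "node" holds some rows (name, state) locally.
NODE_DATA = {
    "node1": [("Alex", "Maryland"), ("Joey", "Maryland")],
    "node2": [("Sean", "Texas")],
    "node3": [("Paris", "Maryland")],
}


def execute_fragment(rows):
    """Executor: partial COUNT(*) ... GROUP BY state over one node's local rows."""
    return Counter(state for _name, state in rows)


def coordinate(node_data):
    """Coordinator: run one fragment per node, then merge the partial counts."""
    total = Counter()
    for _node, rows in node_data.items():
        # In a real engine the fragments run in parallel and stream results back.
        total.update(execute_fragment(rows))
    return dict(total)


result = coordinate(NODE_DATA)
print(result)
```

Each fragment touches only the rows on its own node, so only the small partial counts need to travel to the coordinator.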
41. Cloudera Impala – Results
• Allows for fast iteration/discovery
• How much faster?
• 3-4x faster on I/O bound workloads
• up to 45x faster on multi-MR queries
• up to 90x faster on in-memory cache