Cloudera Impala

Portland Big Data User Group, July 2014

Alex Moundalexis
@technmsg

Thirty Seconds About Alex

•  Solutions Architect
•  aka consultant
•  government
•  infrastructure
•  former coder of Perl
•  former administrator
•  fan of Portland

What Does Cloudera Do?

•  product
    •  distribution of Hadoop components, Apache licensed
    •  enterprise tooling
•  support
•  training
•  services (aka consulting)
•  community

Disclaimer

•  Cloudera builds software
    •  most donated to Apache
    •  some closed-source
•  Cloudera “products” I reference are open source
    •  Apache Licensed
    •  source code is on GitHub
    •  https://github.com/cloudera

What This Talk Isn’t About

•  deploying
    •  Puppet, Chef, Ansible, homegrown scripts, intern labor
•  sizing & tuning
    •  depends heavily on data and workload
•  coding
    •  unless you count XML or CSV or SQL
•  algorithms

Image: CC BY-SA Lilian De Cassai

cloud·e·ra im·pal·a
/kloudˈi(ə)rə imˈpalə/

noun

a modern, open source, MPP SQL query engine for Apache Hadoop.

“Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”

The Apache Hadoop Ecosystem
Quick and dirty, for context.

Why “Ecosystem?”

•  In the beginning, just Hadoop
    •  HDFS
    •  MapReduce
•  Today, dozens of interrelated components
    •  I/O
    •  Processing
    •  Specialty Applications
    •  Configuration
    •  Workflow

HDFS

•  Distributed, highly fault-tolerant filesystem
•  Optimized for large streaming access to data
•  Based on Google File System
    •  http://research.google.com/archive/gfs.html
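
To make "distributed" and "fault-tolerant" a little more concrete, here is a minimal Python sketch of a file being split into blocks that are each copied to several machines. The block size, replica count, node names, and the round-robin placement are assumptions chosen for illustration; real HDFS placement is more sophisticated than this.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (128 MB)
REPLICATION = 3                 # assumed number of copies per block


def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [node, ...]} using naive round-robin placement."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = max(1, math.ceil(file_size / block_size))
    placement = {}
    for b in range(n_blocks):
        # Each block gets `replication` distinct nodes, shifted by one per block.
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_after_failure(placement, failed_node):
    """A file stays readable if every block still has a surviving replica."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


nodes = ["node1", "node2", "node3", "node4"]      # hypothetical machine names
layout = place_blocks(1024 * 1024 * 1024, nodes)  # a 1 GB file
```

With these assumed values the 1 GB file becomes eight blocks, and losing any single node leaves every block with at least two remaining copies.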

Lots of Commodity Machines

Image: Yahoo! Hadoop cluster [ OSCON ’07 ]

MapReduce (MR)

•  Programming paradigm
•  Batch oriented, not realtime
•  Works well with distributed computing
•  Lots of Java, but other languages supported
•  Based on Google’s paper
    •  http://research.google.com/archive/mapreduce.html

Under the Covers

You specify map() and reduce() functions.
The framework does the rest.
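
As a rough illustration of that division of labor, the sketch below counts words with a user-written map function and reduce function, plus a toy single-process "framework" that does the grouping in between. The names `map_fn`, `reduce_fn`, and `run_job` are made up for this example; this is plain Python, not Hadoop's API.

```python
from collections import defaultdict


def map_fn(line):
    """User-supplied map(): emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """User-supplied reduce(): sum the counts collected for one word."""
    return word, sum(counts)


def run_job(records, mapper, reducer):
    """Toy stand-in for the framework: map, shuffle by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle": group values by key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


lines = ["big data user group", "big data big cluster"]
word_counts = run_job(lines, map_fn, reduce_fn)
```

In a real cluster the map calls run on many machines and the grouped keys are shipped across the network to the reducers, but the two user-supplied functions keep the same shape.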

Apache Hive

•  Abstraction of Hadoop’s Java API
•  HiveQL “compiles” down to MR
    •  a “SQL-like” language
•  Eases analysis using MapReduce
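
One way to picture a query "compiling" down to MapReduce: a GROUP BY can be thought of as a map step that emits the grouping key and a reduce step that aggregates per key. The Python below is a conceptual sketch only, using the `people` rows that appear later in this deck; it does not reproduce Hive's actual query plans.

```python
from collections import defaultdict

# Rows of the deck's example `people` table: (name, state, employer, year).
PEOPLE = [
    ("Alex", "Maryland", "Cloudera", 2013),
    ("Joey", "Maryland", "Cloudera", 2011),
    ("Sean", "Texas", "Cloudera", 2013),
    ("Paris", "Maryland", "AOL", 2011),
]


def map_phase(rows):
    """Map: emit the GROUP BY key with a count of 1 for each row."""
    for _name, state, _employer, _year in rows:
        yield state, 1


def reduce_phase(pairs):
    """Shuffle and reduce: sum the emitted values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Conceptually equivalent to:
#   SELECT state, COUNT(*) FROM people GROUP BY state;
counts_by_state = reduce_phase(map_phase(PEOPLE))
```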

Apache Hive Metastore

•  Maps HDFS files to DB-like resources
    •  Databases
    •  Tables
    •  Column/field names, data types
    •  Roles/users
    •  InputFormat/OutputFormat
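
The kind of record this implies can be sketched as a small Python data structure. `Column` and `TableEntry` are hypothetical names invented for this example, the path is illustrative, and the real metastore schema holds far more than this.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    data_type: str


@dataclass
class TableEntry:
    """Hypothetical, simplified stand-in for one metastore table record."""
    database: str
    name: str
    location: str       # HDFS directory that holds the table's files
    input_format: str   # how files are read into rows
    output_format: str  # how rows are written back to files
    columns: list = field(default_factory=list)

    def schema(self):
        """Return the column names and types as (name, type) pairs."""
        return [(c.name, c.data_type) for c in self.columns]


people = TableEntry(
    database="default",
    name="people",
    location="hdfs:///user/hive/warehouse/people",  # illustrative path
    input_format="org.apache.hadoop.mapred.TextInputFormat",
    output_format="org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    columns=[
        Column("name", "string"),
        Column("state", "string"),
        Column("employer", "string"),
        Column("year", "int"),
    ],
)
```

The point of the sketch is that the files themselves stay in HDFS; the metastore only records where they are and how to interpret them as a table.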

But wait…
WHY DO WE NEED THIS?

Super Shady SQL Supplement
I am not a SQL wizard by any means…

A Simple Relational Database

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

Interacting with Relational Data

SELECT * FROM people;

Requesting Specific Fields

SELECT name, state FROM people;

Requesting Specific Rows

SELECT name, state FROM people WHERE year [operator missing] 2012;
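
The three kinds of query shown so far can be tried against the example table with Python's built-in sqlite3 module, used here only as a convenient stand-in for any SQL engine. The comparison operator in the slide's WHERE clause did not survive in this transcript, so the `>` below is an assumption, and the ORDER BY clauses are added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)

# Every column of every row.
all_rows = conn.execute("SELECT * FROM people").fetchall()

# Only the requested fields.
names_states = conn.execute(
    "SELECT name, state FROM people ORDER BY name"
).fetchall()

# Only the rows that satisfy the predicate; the ">" is an assumption.
after_2012 = conn.execute(
    "SELECT name, state FROM people WHERE year > 2012 ORDER BY name"
).fetchall()
```

If the lost operator was `<` instead, the same query shape would return Joey and Paris rather than Alex and Sean.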

Two Simple Tables

pets:

owner  species  name
Alex   Cactus   Marvin
Joey   Cat      Brain
Sean   None
Paris  Unknown

people:

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

Joining Two Tables

SELECT people.name AS owner, people.state AS state, pets.name AS pet
FROM people LEFT JOIN pets ON people.name = pets.owner

owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas
Paris  Maryland

Varying Implementation of JOIN

owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas     ?
Paris  Maryland  ?
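
To see how one particular engine reports the missing pet names, the join can be run with Python's built-in sqlite3 module. This is a sketch: storing the absent names as NULL and ordering by insertion (`people.rowid`) are assumptions made for the example, and other engines or clients may display the same gaps differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER);
CREATE TABLE pets (owner TEXT, species TEXT, name TEXT);
INSERT INTO people VALUES
  ('Alex', 'Maryland', 'Cloudera', 2013),
  ('Joey', 'Maryland', 'Cloudera', 2011),
  ('Sean', 'Texas', 'Cloudera', 2013),
  ('Paris', 'Maryland', 'AOL', 2011);
-- Pet names missing from the slide are stored as NULL (an assumption).
INSERT INTO pets VALUES
  ('Alex', 'Cactus', 'Marvin'),
  ('Joey', 'Cat', 'Brain'),
  ('Sean', 'None', NULL),
  ('Paris', 'Unknown', NULL);
""")

joined = conn.execute("""
    SELECT people.name AS owner, people.state AS state, pets.name AS pet
    FROM people LEFT JOIN pets ON people.name = pets.owner
    ORDER BY people.rowid
""").fetchall()

# sqlite3 hands NULL back to Python as None.
owners_without_pet_name = [owner for owner, _state, pet in joined if pet is None]
```

Here the gaps come back as `None`; the slides show the same positions as blank in one rendering and as "?" in another.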

Cloudera Impala
Familiar interface, but more powerful.

Cloudera Impala

•  Interactive query on Hadoop
    •  think seconds, not minutes
•  Nearly ANSI-92 standard SQL
    •  compatible with HiveQL
•  Native MPP query engine
    •  built for low-latency queries

Cloudera Impala – Design Choices

•  Native daemons, written in C/C++
•  No JVM, no MapReduce
•  Saturate disks on reads
•  Uses in-memory HDFS caching
•  Re-uses Hive metastore
•  Not as fault-tolerant as MapReduce

Cloudera Impala – Architecture

•  Impala Daemon
    •  runs on every node
    •  handles client requests
    •  handles query planning and execution
•  State Store Daemon
    •  provides name service
    •  metadata distribution
    •  used for finding data

Impala Query Execution

[Diagram: SQL App (ODBC); Hive Metastore, HDFS NN, Statestore; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, HBase]

1) Request arrives via ODBC/JDBC/HUE/Shell
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
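
The five steps can be mimicked in a few lines of Python to show the overall shape: plan per-node fragments, run each where its data lives, then merge the partial results. Everything here (the `NODE_DATA` layout, the function names, the single count-by-column query) is a hypothetical toy that runs in one process; it is not how impalad is implemented.

```python
from collections import Counter

# Hypothetical data layout: which (name, employer) rows live on which node.
NODE_DATA = {
    "node1": [("Alex", "Cloudera"), ("Joey", "Cloudera")],
    "node2": [("Sean", "Cloudera")],
    "node3": [("Paris", "AOL")],
}


def plan(column):
    """Planner: one plan fragment per node that holds data."""
    return [{"node": node, "column": column} for node in NODE_DATA]


def execute_fragment(fragment):
    """Executor: partial COUNT(*) ... GROUP BY over the node's local rows."""
    index = {"name": 0, "employer": 1}[fragment["column"]]
    return Counter(row[index] for row in NODE_DATA[fragment["node"]])


def coordinate(column):
    """Coordinator: start fragments, merge partial results, return to client."""
    total = Counter()
    for fragment in plan(column):
        total += execute_fragment(fragment)  # merge intermediate results
    return dict(total)


# Roughly: SELECT employer, COUNT(*) FROM people GROUP BY employer;
result = coordinate("employer")
```

The toy runs fragments one after another, whereas the slides describe them executing on separate daemons with results streamed between them.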

Cloudera Impala – Results

•  Allows for fast iteration/discovery
•  How much faster?
    •  3-4x faster on I/O bound workloads
    •  up to 45x faster on multi-MR queries
    •  up to 90x faster on in-memory cache

Demo
Hold onto something, folks.

What’s Next?

•  Download Hadoop!
    •  CDH available at www.cloudera.com
•  Already done that? Contribute…
•  Cloudera provides pre-loaded VMs
    •  http://tiny.cloudera.com/quickstartvm
•  Clone our repos!
    •  https://github.com/cloudera

Special thanks:
PORTLAND

Questions?
Preferably related to the talk… or not.

Thank You!

Alex Moundalexis
@technmsg

We’re hiring, kids! Well, not kids.
