Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop

Jean-Daniel Cryans, on behalf of the Kudu team
© Cloudera, Inc. All rights reserved.

Myself
• Software Engineer at Cloudera
• On the Kudu team for 2 years
• Apache HBase committer and PMC member since 2008
• Previously at StumbleUpon

Kudu
Storage for Fast Analytics on Fast Data
• New updating column store for Hadoop
• Apache-licensed open source
• Beta now available
[Figure labels: Columnar Store, Kudu]

Motivation and Goals
Why build Kudu?

Motivating Questions
• Are there user problems that we can’t address because of gaps in Hadoop ecosystem storage technologies?
• Are we positioned to take advantage of advancements in the hardware landscape?

Current Storage Landscape in Hadoop
HDFS excels at:
• Efficiently scanning large amounts of data
• Accumulating data with high throughput
HBase excels at:
• Efficiently finding and writing individual rows
• Making data mutable
Gaps exist when these properties are needed simultaneously

Kudu Design Goals
• High throughput for big scans (columnar storage and replication)
  Goal: Within 2x of Parquet
• Low-latency for short accesses (primary key indexes and quorum replication)
  Goal: 1ms read/write on SSD
• Database-like semantics (initially single-row ACID)
• Relational data model
  • SQL query
  • “NoSQL” style scan/insert/update (Java client)

Changing Hardware landscape
• Spinning disk -> solid state storage
  • NAND flash: Up to 450k read 250k write iops, about 2GB/sec read and 1.5GB/sec write throughput, at a price of less than $3/GB and dropping
  • 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
• RAM is cheaper and more abundant:
  • 64->128->256GB over last few years
• Takeaway 1: The next bottleneck is CPU, and current storage systems weren’t designed with CPU efficiency in mind.
• Takeaway 2: Column stores are feasible for random access

Kudu Usage
• Table has a SQL-like schema
  • Finite number of columns (unlike HBase/Cassandra)
  • Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
  • Some subset of columns makes up a possibly-composite primary key
  • Fast ALTER TABLE
• Java and C++ “NoSQL” style APIs
  • Insert(), Update(), Delete(), Scan()
• Integrations with MapReduce, Spark, and Impala
  • more to come!
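
To make the usage model above concrete, here is a minimal sketch of a table with a fixed schema, a composite primary key, and insert/update/delete/scan operations. It is a toy in-memory model written for illustration only: `ToyTable` and its methods are hypothetical names, not the Kudu Java or C++ client API.

```python
# Toy in-memory model of a table with a fixed schema and a composite
# primary key. Hypothetical names throughout; NOT the Kudu client API.
class ToyTable:
    def __init__(self, columns, key_columns):
        self.columns = list(columns)          # finite set of columns, declared up front
        self.key_columns = list(key_columns)  # subset of columns forming the primary key
        self.rows = {}                        # primary key tuple -> row dict

    def _key(self, row):
        return tuple(row[c] for c in self.key_columns)

    def insert(self, row):
        if set(row) != set(self.columns):
            raise ValueError("row does not match the schema")
        key = self._key(row)
        if key in self.rows:
            raise KeyError("duplicate primary key")
        self.rows[key] = dict(row)

    def update(self, key, changes):
        if any(c in self.key_columns or c not in self.columns for c in changes):
            raise ValueError("can only update existing non-key columns")
        self.rows[key].update(changes)  # raises KeyError if the row is missing

    def delete(self, key):
        del self.rows[key]

    def scan(self, predicate=lambda row: True):
        # Return matching rows in primary key order.
        return [self.rows[k] for k in sorted(self.rows) if predicate(self.rows[k])]


table = ToyTable(["host", "metric", "ts", "value"], ["host", "metric", "ts"])
table.insert({"host": "a", "metric": "cpu", "ts": 1, "value": 0.5})
table.insert({"host": "a", "metric": "cpu", "ts": 2, "value": 0.7})
table.update(("a", "cpu", 2), {"value": 0.9})
table.delete(("a", "cpu", 1))
result = table.scan(lambda r: r["value"] > 0.8)
```

The column names reuse the (host, metric, timestamp) example that appears later in the deck; the `value` column is an assumption added so there is something to update.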

Use cases and architectures

Kudu Use Cases
Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes
● Time Series
  ○ Examples: Stream market data; fraud detection & prevention; risk monitoring
  ○ Workload: Inserts, updates, scans, lookups
● Machine Data Analytics
  ○ Examples: Network threat detection
  ○ Workload: Inserts, scans, lookups
● Online Reporting
  ○ Examples: Operational Data Store (ODS)
  ○ Workload: Inserts, updates, scans, lookups

Real-Time Analytics in Hadoop Today
Fraud Detection in the Real World = Storage Complexity
Considerations:
● How do I handle failure during this process?
● How often do I reorganize data streaming in into a format appropriate for reporting?
● When reporting, how do I see data that has not yet been reorganized?
● How do I ensure that important jobs aren’t interrupted by maintenance?
[Diagram: Incoming Data (Messaging System) -> HBase -> "Have we accumulated enough data?" -> Reorganize HBase file into Parquet -> Parquet File -> New Partition. A Reporting Request goes to Impala on HDFS, which reads the Most Recent Partition and Historic Data.]
Steps when adding the new partition:
• Wait for running operations to complete
• Define new Impala partition referencing the newly written Parquet file

Real-Time Analytics in Hadoop with Kudu
Improvements:
● One system to operate
● No cron jobs or background processes
● Handle late arrivals or data corrections with ease
● New data available immediately for analytics or operations
[Diagram: Incoming Data (Messaging System) -> Storage in Kudu (Historical and Real-time Data) <- Reporting Request]

How it works

Tables and Tablets
• Table is horizontally partitioned into tablets
  • Range or hash partitioning
  • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS
• Each tablet has N replicas (3 or 5), with Raft consensus
  • Allow read from any replica, plus leader-driven writes with low MTTR
• Tablet servers host tablets
  • Store data on local disks (no HDFS)
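
As a rough illustration of the hash partitioning clause above, the sketch below maps a row's timestamp to one of 100 buckets, so each bucket corresponds to one tablet. The hash function (CRC32) and the names `bucket_for` and `tablet_for` are assumptions made for this example; the slides do not say which hash Kudu uses.

```python
import zlib

NUM_BUCKETS = 100  # from the slide: INTO 100 BUCKETS


def bucket_for(timestamp: int, num_buckets: int = NUM_BUCKETS) -> int:
    # CRC32 is a stand-in hash chosen for this sketch; it is an assumption,
    # not the hash function Kudu actually uses.
    data = timestamp.to_bytes(8, "big", signed=True)
    return zlib.crc32(data) % num_buckets


def tablet_for(row: dict) -> int:
    # Only the hashed column (timestamp) decides the bucket, even though
    # the primary key is (host, metric, timestamp).
    return bucket_for(row["timestamp"])


row_a = {"host": "web01", "metric": "cpu", "timestamp": 1443398400}
row_b = {"host": "web02", "metric": "mem", "timestamp": 1443398400}
```

In this sketch, rows that share a timestamp land in the same bucket regardless of host or metric, while rows with different timestamps are spread across buckets.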

Metadata
• Replicated master*
  • Acts as a tablet directory (“META” table)
  • Acts as a catalog (table schemas, etc)
  • Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets)
• Caches all metadata in RAM for high performance
  • 80-node load test, GetTableLocations RPC perf:
    • 99th percentile: 68us, 99.99th percentile: 657us
    • <2% peak CPU usage
• Client configured with master addresses
  • Asks master for tablet locations as needed and caches them
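
The last bullet above (ask the master as needed, then cache) can be sketched as a small client-side cache. `ToyMaster` and `ToyClient` are hypothetical classes for illustration; a real client would also need to invalidate entries when tablets move, which this sketch leaves out.

```python
class ToyMaster:
    """Hypothetical stand-in for the master's tablet directory."""

    def __init__(self, locations):
        self.locations = dict(locations)  # tablet id -> list of tablet servers
        self.lookups = 0                  # counts how often the master is asked

    def get_tablet_locations(self, tablet_id):
        self.lookups += 1
        return self.locations[tablet_id]


class ToyClient:
    """Hypothetical client that asks the master once per tablet, then caches."""

    def __init__(self, master):
        self.master = master
        self.cache = {}

    def locate(self, tablet_id):
        if tablet_id not in self.cache:
            self.cache[tablet_id] = self.master.get_tablet_locations(tablet_id)
        return self.cache[tablet_id]


master = ToyMaster({"tablet-1": ["ts-a", "ts-b", "ts-c"]})
client = ToyClient(master)
first = client.locate("tablet-1")
second = client.locate("tablet-1")  # served from the client-side cache
```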

Raft consensus
[Diagram: a Client and three tablet servers, each holding a replica of Tablet 1 and its own WAL: TS A (LEADER), TS B (FOLLOWER), TS C (FOLLOWER)]
1a. Client->Leader: Write() RPC
2a. Leader->Followers: UpdateConsensus() RPC
2b. Leader writes local WAL
3. Follower: write WAL
4. Follower->Leader: success
5. Leader has achieved majority
6. Leader->Client: Success!
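
The write path above boils down to counting acknowledgements until a majority of replicas have the write in their WAL. The sketch below models only that counting step, under the assumption that the leader's own WAL write counts as one acknowledgement; `replicate_write` is a hypothetical helper, and a real Raft implementation also tracks terms, log indexes, and retries.

```python
def replicate_write(follower_acks, num_replicas=3):
    """Return True once a majority of replicas hold the write in their WAL.

    follower_acks: iterable of booleans, one per follower, True if that
    follower wrote its WAL and replied with success.
    """
    majority = num_replicas // 2 + 1
    acks = 1  # step 2b: the leader writes its local WAL
    for ok in follower_acks:  # steps 3 and 4: follower writes WAL and replies
        if ok:
            acks += 1
        if acks >= majority:  # step 5: leader has achieved majority
            return True       # step 6: leader reports success to the client
    return False


one_follower_down = replicate_write([True, False])
both_followers_down = replicate_write([False, False])
```

With three replicas, a single follower acknowledgement is enough, which is why the leader can answer the client without waiting for the slower follower.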

Fault tolerance
• Transient FOLLOWER failure:
  • Leader can still achieve majority
  • Restart follower TS within 5 min and it will rejoin transparently
• Transient LEADER failure:
  • Followers expect to hear a heartbeat from their leader every 1.5 seconds
  • 3 missed heartbeats: leader election!
  • New LEADER is elected from remaining nodes within a few seconds
  • Restart within 5 min and it rejoins as a FOLLOWER
• N replicas handle (N-1)/2 failures
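
The numbers above can be worked through with a few lines of arithmetic. This sketch assumes integer division for (N-1)/2 (which matters only for even N) and assumes that an election is triggered exactly after three consecutive missed heartbeats; the function and constant names are made up for the example.

```python
HEARTBEAT_INTERVAL_S = 1.5             # from the slide
MISSED_HEARTBEATS_BEFORE_ELECTION = 3  # from the slide


def tolerated_failures(n_replicas: int) -> int:
    # (N-1)/2 from the slide, rounded down (assumption for even N).
    return (n_replicas - 1) // 2


def election_trigger_delay_s() -> float:
    # Time without a heartbeat before followers start an election.
    return HEARTBEAT_INTERVAL_S * MISSED_HEARTBEATS_BEFORE_ELECTION


three_replicas = tolerated_failures(3)
five_replicas = tolerated_failures(5)
delay = election_trigger_delay_s()
```

So 3 replicas tolerate 1 failure, 5 replicas tolerate 2, and under these assumptions followers would wait about 4.5 seconds before starting an election.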

Fault tolerance (2)
• Permanent failure:
  • Leader notices that a follower has been dead for 5 minutes
  • Evicts that follower
  • Master selects a new replica
  • Leader copies the data over to the new one, which joins as a new FOLLOWER

Tablet design
• Inserts buffered in an in-memory store (like HBase’s memstore)
• Flushed to disk
  • Columnar layout, similar to Apache Parquet
• Updates use MVCC (updates tagged with timestamp, not in-place)
  • Allow “SELECT AS OF <timestamp>” queries and consistent cross-tablet scans
• Near-optimal read path for “current time” scans
  • No per row branches, fast vectorized decoding and predicate evaluation
• Performance worsens based on number of recent updates
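
The MVCC bullet above (updates tagged with a timestamp rather than applied in place) can be illustrated with a toy versioned cell. `MvccCell` is a hypothetical class for this sketch and says nothing about Kudu's on-disk format; it only shows why an "as of" read is possible and why a row with many recent updates takes more work to read.

```python
class MvccCell:
    """Toy MVCC cell: every write is kept together with its timestamp."""

    def __init__(self):
        self.versions = []  # list of (timestamp, value), sorted by timestamp

    def write(self, timestamp, value):
        # Nothing is overwritten; the new version is recorded next to old ones.
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda v: v[0])

    def read_as_of(self, timestamp):
        # Walk the versions and keep the newest one visible at `timestamp`.
        visible = None
        for ts, value in self.versions:
            if ts <= timestamp:
                visible = value
            else:
                break
        return visible


cell = MvccCell()
cell.write(10, "a")
cell.write(20, "b")
before_any_write = cell.read_as_of(5)
between_writes = cell.read_as_of(15)
latest = cell.read_as_of(25)
```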

LSM vs Kudu
• LSM – Log Structured Merge (Cassandra, HBase, etc)
  • Inserts and updates all go to an in-memory map (MemStore) and later flush to on-disk files (HFile/SSTable)
  • Reads perform an on-the-fly merge of all on-disk HFiles
• Kudu
  • Shares some traits (memstores, compactions)
  • More complex.
  • Slower writes in exchange for faster reads (especially scans)
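
For readers unfamiliar with the LSM side of this comparison, here is a minimal sketch of the pattern described above: writes go to an in-memory map that is flushed to immutable files, and reads consult the memstore plus every flushed file, newest first. `ToyLsm` is a hypothetical model; plain dicts stand in for on-disk HFiles/SSTables and compaction is omitted.

```python
class ToyLsm:
    """Toy LSM store: a memstore plus a list of flushed, immutable 'files'."""

    def __init__(self, flush_threshold=2):
        self.memstore = {}
        self.files = []  # flushed maps, oldest first (stand-ins for HFiles)
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        # Inserts and updates are the same operation: no lookup is needed.
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.files.append(dict(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        # Reads must consider the memstore and every file, newest first.
        if key in self.memstore:
            return self.memstore[key]
        for f in reversed(self.files):
            if key in f:
                return f[key]
        return None


store = ToyLsm()
store.put("a", 1)
store.put("b", 2)  # triggers the first flush
store.put("a", 3)
store.put("c", 4)  # triggers the second flush
store.put("d", 5)
value_of_a = store.get("a")
```

The write path never reads, which is the property the next slide contrasts with Kudu's lookup-before-write behaviour.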

Kudu trade-offs
• Random updates will be slower
  • HBase model allows random updates without incurring a disk seek
  • Kudu requires a key lookup before update, bloom lookup before insert, may incur seeks
• Single-row reads may be slower
  • Columnar design is optimized for scans
  • Especially slow at reading a row that has had many recent updates (e.g. YCSB “zipfian”)
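
The "bloom lookup before insert" step above can be sketched as follows. The idea illustrated is that a negative Bloom filter answer proves the key is new, so only a positive answer requires the more expensive key lookup. `ToyBloom` and `insert` are hypothetical, and the sizes and hash choice (SHA-256) are assumptions picked for the example, not Kudu's implementation.

```python
import hashlib


class ToyBloom:
    """Toy Bloom filter backed by a Python integer used as a bit set."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> p) & 1 for p in self._positions(key))


def insert(bloom, stored_keys, key):
    # Only a positive Bloom answer forces the (possibly seeking) key lookup.
    if bloom.might_contain(key) and key in stored_keys:
        raise KeyError("duplicate primary key")
    stored_keys.add(key)
    bloom.add(key)


bloom = ToyBloom()
stored = set()
insert(bloom, stored, "row-1")
```

Here `stored_keys` is an in-memory set standing in for the on-disk key index that a real system would have to consult.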

Benchmarks

TPC-H (Analytics benchmark)
• 75 TS + 1 master cluster
  • 12 (spinning) disk each, enough RAM to fit dataset
  • Using Kudu 0.5.0, Impala 2.2 with Kudu support, CDH 5.4
• TPC-H Scale Factor 100 (100GB)
• Example query:
  • SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue FROM customer,
    orders, lineitem, supplier, nation, region WHERE c_custkey = o_custkey AND
    l_orderkey = o_orderkey AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey
    AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA'
    AND o_orderdate >= date '1994-01-01' AND o_orderdate < '1995-01-01' GROUP BY
    n_name ORDER BY revenue desc;

- Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
- Parquet likely to outperform Kudu for HDD-resident (larger IO requests)
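
For context on the "geometric mean" figure above, this is a minimal sketch of how a geometric mean over per-query speed ratios is computed. The input ratios in the example are made up purely to show the calculation; they are not the benchmark's per-query results, which the slide does not list.

```python
import math


def geometric_mean(ratios):
    # exp(mean(log(r))) equals the n-th root of the product of the ratios.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Illustrative inputs only (NOT measured data): one query twice as fast,
# one query twice as slow. The geometric mean treats them as cancelling out.
example = geometric_mean([2.0, 0.5])
```

A geometric mean is a common choice for benchmark suites because a 2x win and a 2x loss average to "no change", which an arithmetic mean of ratios would not do.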

What about Apache Phoenix?
• 10 node cluster (9 worker, 1 master)
• HBase 1.0, Phoenix 4.3
• TPC-H LINEITEM table only (6B rows)
[Chart: Time (sec), log scale from 0.01 to 10000]
                    Phoenix   Kudu    Parquet
Load                2152      1918    155
TPCH Q1             219       13.2    9.3
COUNT(*)            76        1.7     1.4
COUNT(*) WHERE…     131       0.7     1.5
single-row lookup   0.04      0.15    1.37

What about NoSQL-style random access? (YCSB)
• YCSB 0.5.0-snapshot
• 10 node cluster (9 worker, 1 master)
• HBase 1.0
• 100M rows, 10M ops

But don’t trust me (a vendor)…

About Xiaomi
Mobile Internet Company, founded in 2010
• Smartphones
• Software
• E-commerce
• MIUI
• Cloud Services
• App Store/Game
• Payment/Finance
• …
• Smart Home
• Smart Devices

Big Data Analytics Pipeline Before Kudu
• Long pipeline
  High latency (1 hour ~ 1 day), data conversion pains
• No ordering
  Log arrival (storage) order is not exactly the logical order,
  e.g. read 2-3 days of log for data in 1 day

Big Data Analysis Pipeline Simplified With Kudu
• ETL Pipeline (0~10s latency)
  Apps that need to prevent backpressure or require ETL
• Direct Pipeline (no latency)
  Apps that don’t require ETL and have no backpressure issues
[Diagram labels: OLAP scan, Side table lookup, Result store]

Use Case 1
Mobile service monitoring and tracing tool
Requirements
• High write throughput
  >5 Billion records/day and growing
• Query latest data and quick response
  Identify and resolve issues quickly
• Can search for individual records
  Easy for troubleshooting
Gather important RPC tracing events from mobile app and backend service.
Service monitoring & troubleshooting tool.

Use Case 1: Benchmark
Environment
• 71 Node cluster
• Hardware
  CPU: E5-2620 2.1GHz * 24 core
  Memory: 64GB
  Network: 1Gb
  Disk: 12 HDD
• Software
  Hadoop2.6/Impala 2.1/Kudu
Data
• 1 day of server side tracing data
  ~2.6 Billion rows
  ~270 bytes/row
  17 columns, 5 key columns

Use Case 1: Benchmark Results
Bulk load using impala (INSERT INTO):
           Total Time (s)   Throughput (total)   Throughput (per node)
Kudu       961.1            2.8M records/s       39.5k records/s
Parquet    114.6            23.5M records/s      331k records/s
Query latency:
[Chart: Q1 to Q6, kudu vs. parquet; data labels in extraction order: 1.4, 2.0, 2.3, 3.1, 1.3, 0.9, 1.3, 2.8, 4.0, 5.7, 7.5, 16.7]
* HDFS parquet file replication = 3, kudu table replication = 3
* Each query run 5 times then take average
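
The bulk-load table above can be sanity-checked with simple arithmetic. This sketch assumes the load ran on the 71-node cluster from the benchmark environment slide, and that per-node throughput is simply total throughput divided by the node count; `per_node` is a name made up for the example.

```python
NODES = 71  # assumption: the 71-node cluster from the environment slide


def per_node(total_records_per_s: float, nodes: int = NODES) -> float:
    # Per-node throughput as total throughput spread evenly over the nodes.
    return total_records_per_s / nodes


kudu_per_node = per_node(2.8e6)      # roughly 39.4k records/s per node
parquet_per_node = per_node(23.5e6)  # roughly 331k records/s per node
load_time_ratio = 961.1 / 114.6      # Kudu load time relative to Parquet
```

Under these assumptions the per-node figures agree with the table to within rounding, and the Kudu bulk load took a bit over 8 times as long as the Parquet load.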

Use Case 1: Result Analysis
• Lazy materialization
  Ideal for search style query
  Q6 returns only a few records (of a single user) with all columns
• Scan range pruning using primary index
  Predicates on primary key
  Q5 only scans 1 hour of data
• Future work
  Primary index: speed-up order by and distinct
  Hash Partitioning: speed-up count(distinct), no need for global shuffle/merge
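
Scan range pruning, as described above, can be sketched with a sorted list of primary keys and a binary search for the bounds of the requested range. The key layout `(hour, record_id)` and the function `pruned_scan` are assumptions for this illustration; the actual key columns of the Xiaomi table are not given in the slides.

```python
import bisect


def pruned_scan(sorted_keys, lower, upper):
    """Return keys in [lower, upper) by locating both bounds with binary search.

    Only the slice between the two bounds is touched; keys outside the
    predicate range are never visited.
    """
    start = bisect.bisect_left(sorted_keys, lower)
    end = bisect.bisect_left(sorted_keys, upper)
    return sorted_keys[start:end]


# Hypothetical primary keys: (hour of day, record id), kept in sorted order.
keys = [(hour, record) for hour in range(24) for record in range(3)]
one_hour = pruned_scan(keys, (5,), (6,))
```

With a predicate on the leading key column, the scan reads one hour's worth of keys instead of the whole day, which is the effect the Q5 bullet describes.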

Use Case 2
OLAP PaaS for ecosystem cloud
• Provide big data service for smart hardware startups (Xiaomi’s ecosystem members)
• OLAP database with some OLTP features
• Manage/Ingest/query your data and serving results in one place
[Diagram labels: Backend/Mobile App/Smart Device/IoT …]

Demo
• Code currently at https://github.com/tmalaska/SparkOnKudu/
• Work being finished in https://issues.cloudera.org/browse/KUDU-1214
[Diagram: Gamer data points -> Ingestion in Kafka -> Processing in Spark Streaming -> Data stored in Kudu -> Querying done in Impala]
Producer sends data points to Kafka
Spark pulls from Kafka
Spark loads base data from Kudu
Aggregates are stored back into Kudu
Live queries come from Impala

Project status
• Public Beta released September 28th 2015, version 0.5.0
  • Not ready for production
  • No security
  • Feedback/jiras/patches welcome
• Next release in November (0.6.0):
  • Mac OSX support for single node deployment
  • Lots of small fixes and improvements
• GA sometime next year (hopefully!)
  • Will have Kerberos integration
  • Ready for production

Getting started as a user
• http://getkudu.io
• kudu-user@googlegroups.com
• http://getkudu-slack.herokuapp.com/
• Quickstart VM
  • Easiest way to get started
  • Impala and Kudu in an easy-to-install VM
• CSD and Parcels
  • For installation on a Cloudera Manager-managed cluster

Getting started as a developer
• http://github.com/cloudera/kudu
  • All commits go here first
• Public gerrit: http://gerrit.cloudera.org
  • All code reviews happening here
• Public JIRA: http://issues.cloudera.org
  • Includes bugs going back to 2013. Come see our dirty laundry!
• kudu-dev@googlegroups.com
• Apache 2.0 license open source
• Contributions are welcome and encouraged!
