Low Latency OLAP with Hadoop and HBase

Low-Latency “OLAP” with Hadoop and HBase
Andrei Dragomir | Software Engineer

© 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Synopsis

§  What are we trying to solve
§  Description of our system
§  How it works
§  Minimizing Latency


In a nutshell

Low-latency OLAP system.
Input data is stored in Hadoop DFS (i.e. log files) or in HBase tables.
The processing loop of the system takes a cube description and processes it (pre-aggregations) using Hadoop Map/Reduce.
The output is written to a statistics HBase table.
To get the data, users query a server, which scans the HBase table, applies the filters, roll-ups or drill-downs, and returns the result.


Vocabulary

Date         Country       City       OS       Browser      Sales
2012-05-12   USA           NY         Win      FF           $ 0.0
2012-05-12   USA           NY         Win      FF           $ 10.0
2012-05-13   USA           SF         OSX      Chrome       $ 25.0
2012-05-13   Canada        Ontario    Linux    Chrome       $ 0.0
2012-05-14   USA           Chicago    OSX      Safari       $ 15.0
...          ...           ...        ...      ...          ...

Totals per column:
5 Visits     2 Countries   4 Cities   3 OS     3 Browsers   $50.0
3 Days       USA: 4        NY: 2      Win: 2   FF: 2        3 sales
             Canada: 1     SF: 1      OSX: 2   Chrome: 2


§  We want to get (mostly) numeric data: metrics
§  These metrics have a set of labels (dimensions)
§  We want to view the metrics by any combination of
dimensions


OLAP Queries

§  Rolling up to country level

    SELECT COUNT(visits), SUM(sales)
    GROUP BY country

    Country   visits   sales
    USA       4        $50
    Canada    1        0

§  “Slicing” by browser

    SELECT COUNT(visits), SUM(sales)
    GROUP BY country
    HAVING browser = 'FF'

    Country   visits   sales
    USA       2        $10
    Canada    0        0

§  Top browsers by sales

    SELECT SUM(sales), COUNT(visits)
    GROUP BY browser
    ORDER BY sales

    Browser   sales   visits
    Chrome    $25     2
    Safari    $15     1
    FF        $10     2
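
The three queries above can be reproduced on the five sample visits with a few lines of Python. This is only an illustrative sketch over in-memory rows, not how the system itself runs queries; the `aggregate` helper and its parameters are names made up for this example.

```python
# Sample rows from the Vocabulary slide: one row per visit.
COLUMNS = ("date", "country", "city", "os", "browser", "sales")
VISITS = [
    ("2012-05-12", "USA", "NY", "Win", "FF", 0.0),
    ("2012-05-12", "USA", "NY", "Win", "FF", 10.0),
    ("2012-05-13", "USA", "SF", "OSX", "Chrome", 25.0),
    ("2012-05-13", "Canada", "Ontario", "Linux", "Chrome", 0.0),
    ("2012-05-14", "USA", "Chicago", "OSX", "Safari", 15.0),
]


def aggregate(rows, group_by, where=lambda record: True):
    """COUNT(visits) and SUM(sales) grouped by one dimension (hypothetical helper)."""
    totals = {}
    for row in rows:
        record = dict(zip(COLUMNS, row))
        if not where(record):
            continue  # row filtered out by the slice condition
        entry = totals.setdefault(record[group_by], {"visits": 0, "sales": 0.0})
        entry["visits"] += 1
        entry["sales"] += record["sales"]
    return totals


# Roll-up to country level.
by_country = aggregate(VISITS, "country")

# Slice: only visits made with the FF browser.
ff_by_country = aggregate(VISITS, "country", where=lambda r: r["browser"] == "FF")

# Top browsers by sales, highest first.
top_browsers = sorted(
    aggregate(VISITS, "browser").items(),
    key=lambda item: item[1]["sales"],
    reverse=True,
)
```

In this sketch a country with no matching visits is simply absent from the sliced result, whereas the slide shows it as a row of zeros.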


Looking inside – physical diagram


Looking inside – logical diagram


Simplifying assumptions: pre-aggregation

§  In most cases...
§  Data needs to be summarized – hard to
draw 1B data points
§  You don’t need to look at all dimensions at
the same time – hard to correlate
§  Not all queries are used with the same
frequency


A timeless CS problem: Optimize...

Time (pre-aggregation)
§  Fast
§  Efficient reads – O(1)
§  Inflexible
§  Processing latency
§  Combinatorial explosion

Space (runtime aggregation)
§  Flexible
§  I/O, CPU intensive
§  Slow – always need to look at all the data
§  Low throughput

Solution ?

§  Just do both !
§  Can tune: pre-aggregate more, or rely on
runtime aggregation
§  Ingestion + process speed vs Query speed

§  Works just like normal queries +
materialized views


Solution ?

§  Process: pre-aggregate all the report
definitions, create an indexed HBase table.
§  Query: use the indexes to get the data
fast. Perform extra aggregation, filtering if
needed at runtime.
§  Platform strengths
§  Parallelism in M/R
§  Fast access and natural key ordering in
HBase

Minimal HBase details

§  Data is stored in tables
§  Each row has a key, and any number of columns (long & wide)
§  Ordered by row keys: clustered indexes built-in
§  Sparse tables. NULLs are free.

Row Key   Columns...
u1        v1    v2    v3
u2        v     X     ...
u3        v     x     ...
u4        x     v2    ...
u5        ...   v3    ...
u6        ...   v5    ...
u7        ...   ...   ...
u8        ...   ...   ...


Minimal HBase details

§  Operations use row key: get(), put()
§  Can scan a range of rows: [start, end)
§  We can use the row key as a built-in indexing mechanism

Row key   Column
aaa       v1
aab       v2
aac       v3    ←
aad       v4    ←
aae       v5    ←
aaf       v6    ←
aba       ...
abb       ...
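
To make the row-key-as-index idea concrete, here is a toy Python model of a table whose rows are kept sorted by key. It is a sketch only and does not use the HBase client API; `SortedTable` is a hypothetical class, and the values `v7` and `v8` are placeholders for the cells the slide leaves elided.

```python
import bisect


class SortedTable:
    """Toy stand-in for a table ordered by row key (not the HBase API)."""

    def __init__(self):
        self._keys = []    # row keys, always kept in sorted order
        self._values = {}  # row key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def scan(self, start, stop):
        """Return rows with start <= key < stop, i.e. the range [start, stop)."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(key, self._values[key]) for key in self._keys[lo:hi]]


table = SortedTable()
for key, value in [("aba", "v7"), ("aaa", "v1"), ("aac", "v3"), ("aab", "v2"),
                   ("aaf", "v6"), ("aad", "v4"), ("abb", "v8"), ("aae", "v5")]:
    table.put(key, value)

marked_rows = table.scan("aac", "aag")  # the four rows marked with arrows
prefix_rows = table.scan("aa", "ab")    # every key that starts with "aa"
```

Because the keys are sorted, both scans locate their range with two binary searches and then read contiguous rows. That is the property the row-key design on the following slides relies on.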


SaasBase vs. SQL Views Comparison


Reports configuration

§  List of Dimensions (with custom classes,
arguments, etc)
§  List of Metrics (with custom classes, arguments,
etc)
§  List of Reports, each containing
§  Dimensions (subset)
§  Metrics (subset)
§  Sorting, etc
§  The reports configuration is used in the
entire system: import, process, query

Solution ?

Date         Country   City   Sales
2012-05-12   USA       NY     3
2012-05-12   USA       NY     10
2012-05-13   USA       SF     25
2012-05-13   CAN       ON     0
2012-05-14   USA       CH     15


Solution ?

Report definitions applied to the input rows:

visits_by_city: {
    dimensions: [country, city],
    metrics: [visits]
},
daily_sales: {
    dimensions: [year, month, day, country],
    metrics: [sales]
}


Solution ?

Statistics HBase output table

ROWKEY                        VALUE
daily_sales/2012+05+12+USA    $13
daily_sales/2012+05+13+CAN    $0
daily_sales/2012+05+13+USA    $25
daily_sales/2012+05+14+USA    $15
visits_by_city/CAN+ON         1
visits_by_city/USA+CH         1
visits_by_city/USA+NY         2
visits_by_city/USA+SF         1
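
The step from input rows and report definitions to the statistics table can be sketched in Python as follows. This is an in-memory illustration rather than the Map/Reduce job itself. The row-key layout (`report/dim+dim+...`) is taken from the slide, while the function names are made up for this example and each report is assumed to carry exactly one metric.

```python
from collections import defaultdict

# Input rows from the slide: (date, country, city, sales).
ROWS = [
    ("2012-05-12", "USA", "NY", 3),
    ("2012-05-12", "USA", "NY", 10),
    ("2012-05-13", "USA", "SF", 25),
    ("2012-05-13", "CAN", "ON", 0),
    ("2012-05-14", "USA", "CH", 15),
]

# Report definitions from the slide.
REPORTS = {
    "visits_by_city": {"dimensions": ["country", "city"], "metrics": ["visits"]},
    "daily_sales": {"dimensions": ["year", "month", "day", "country"],
                    "metrics": ["sales"]},
}


def to_record(row):
    """Expand one input row into named dimensions and metrics."""
    date, country, city, sales = row
    year, month, day = date.split("-")
    return {"year": year, "month": month, "day": day, "country": country,
            "city": city, "sales": sales, "visits": 1}  # each row is one visit


def pre_aggregate(rows, reports):
    """Emit one row key per (report, dimension values) and sum the metric."""
    table = defaultdict(int)
    for row in rows:
        record = to_record(row)
        for name, report in reports.items():
            key = name + "/" + "+".join(record[d] for d in report["dimensions"])
            metric = report["metrics"][0]  # assumption: one metric per report
            table[key] += record[metric]
    return dict(sorted(table.items()))  # sorted, as a key-ordered table would be


stats = pre_aggregate(ROWS, REPORTS)
```

With the sample rows, `stats` contains the same eight row keys and values as the output table above, in the same order.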


HBase natural order: hierarchical filtering



Sorting

§  Add the metrics that you want to sort by to the row key...
§  In a way that preserves the ordering
§  ORDER BY metric DESC == Long.MAX_VALUE – metric

2012+05+USA+0000000000+
2012+05+USA+4294961296+SF = 1000 visits
2012+05+USA+4294961396+NY = 900 visits
...
2012+05+USA+9999999999+
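
A minimal Python sketch of the inverted-metric trick: subtracting the metric from the largest signed 64-bit value (Java's `Long.MAX_VALUE`) and zero-padding to a fixed width makes an ascending key scan return rows in descending metric order. The `sort_key` function and the 19-digit padding are assumptions of this example, and the CH count is invented for illustration. The keys on the slide use shorter illustrative numbers.

```python
LONG_MAX = 2**63 - 1        # same value as Java's Long.MAX_VALUE
WIDTH = len(str(LONG_MAX))  # 19 digits, so every key part has equal length


def sort_key(prefix, metric, label):
    """Build a row key that sorts by descending metric (hypothetical helper)."""
    if not 0 <= metric <= LONG_MAX:
        raise ValueError("metric must fit in a non-negative signed 64-bit value")
    inverted = LONG_MAX - metric
    # Fixed-width zero padding keeps string order equal to numeric order.
    return f"{prefix}+{inverted:0{WIDTH}d}+{label}"


keys = [
    sort_key("2012+05+USA", 900, "NY"),
    sort_key("2012+05+USA", 15, "CH"),    # invented count, for illustration
    sort_key("2012+05+USA", 1000, "SF"),
]
ordered_cities = [key.rsplit("+", 1)[1] for key in sorted(keys)]
```

Sorting the keys as plain strings yields SF, then NY, then CH, which is the `ORDER BY visits DESC` order.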


Minimizing Latency


Minimizing Import Latency

§  Only import the minimal set of changes
§  Map/Reduce input filters:
    §  c.a.s.a.i.FileCache – checks if a file was already processed
    §  c.a.s.a.i.FileDateFilter – checks if a date in the file path falls within a specified interval
        §  e.g. process files from 3 days ago up until now, once
§  HBase scan (from import table) start and stop row
§  Minimize map-task overhead – stitch input splits
    §  400000 files -> 400000 map tasks, slow reduce-copy phase
    §  o.a.h.m.i.CombineFileInputFormat – make 2GB splits
    §  c.a.s.a.m.i.FixedMappersTableInputFormat – stitches multiple HBase regions into the same map task
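
Two of the import-side ideas, filtering files by the date in their path and stitching many small files into roughly 2GB splits, can be sketched in Python. This is not the code of the classes named above: the path layout, function names and greedy packing rule are assumptions, and Hadoop's real split combining also takes data locality into account.

```python
import re
from datetime import date, timedelta

# Assumed path layout: a yyyy/mm/dd or yyyy-mm-dd date somewhere in the path.
DATE_IN_PATH = re.compile(r"(\d{4})[-/](\d{2})[-/](\d{2})")
TWO_GB = 2 * 1024**3


def path_in_window(path, today, days_back=3):
    """True if the date found in the path lies in [today - days_back, today]."""
    match = DATE_IN_PATH.search(path)
    if match is None:
        return False
    try:
        file_date = date(*(int(part) for part in match.groups()))
    except ValueError:  # digits that do not form a real calendar date
        return False
    return today - timedelta(days=days_back) <= file_date <= today


def stitch_splits(files, target_bytes=TWO_GB):
    """Greedily pack (name, size) pairs into splits of at most target_bytes.

    A single file larger than the target still gets a split of its own.
    """
    splits, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > target_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits


GB = 1024**3
example_splits = stitch_splits([("a", GB), ("b", GB), ("c", GB), ("d", 3 * GB)])
```

Packing this way means the number of map tasks follows the data volume rather than the file count, which is the point of the 400000-files bullet.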



§  If warehousing in HBase, use o.a.h.h.m.HFileOutputFormat
    §  ~100 times faster than using the API
    §  No shuffle step! You must use a global order partitioner
    §  Problem: data grows over time
    §  Solution: estimate output partitions based on input data size, and make partitions (regions) using this heuristic
    §  c.a.s.a.m.FileSizeDatePartitioner – inject input file sizes and dates and rebalance regions based on these, and a fixed size (2GB)
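
The partitioning heuristic is only described in one line, so the following Python sketch is a guess at its general shape rather than the actual FileSizeDatePartitioner logic. It assumes the row keys are date-ordered, sums input bytes per day, and starts a new region whenever the running total would exceed the fixed 2GB target. All names here are hypothetical.

```python
TWO_GB = 2 * 1024**3


def region_start_dates(files, target_bytes=TWO_GB):
    """Return the dates at which a new output region should start.

    files: iterable of (iso_date, size_bytes) pairs describing the input.
    Assumption: output keys sort by date, so a date can act as a split point.
    """
    bytes_per_day = {}
    for day, size in files:
        bytes_per_day[day] = bytes_per_day.get(day, 0) + size

    boundaries, filled = [], 0
    for day in sorted(bytes_per_day):  # ISO dates sort chronologically
        if filled and filled + bytes_per_day[day] > target_bytes:
            boundaries.append(day)     # this day opens a new region
            filled = 0
        filled += bytes_per_day[day]
    return boundaries


GB = 1024**3
example_boundaries = region_start_dates([
    ("2012-05-12", GB),
    ("2012-05-13", GB),
    ("2012-05-14", GB),
    ("2012-05-15", 3 * GB),
])
```

Recomputing the boundaries from the current input sizes on every run is one way to keep regions balanced as the data grows over time.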


Minimizing Processing Latency

§  Processing involves reading the input (files, tables, events), pre-aggregating it (reducing cardinality) and generating tables that can be queried in real-time
§  Processing does GROUP BY, COUNT/SUM/AVG, ORDER BY
§  Minimize each M/R step: read, map, partition, combine, copy, sort, reduce, write
§  Read
    §  Filter input data (incremental processing) – differentiate between OPEN and CLOSED data
    §  HBase Scan options: caching, batching, etc.
    §  Ensure HBase table regions are distributed in the cluster

Minimizing Processing Latency

§  c.a.s.a.m.j.SuperProcessor

§  One shot M/R job: for all data, for all reports, emit the
pre-aggregated values in 1 map() call
§  no allocations
§  Simple and tight
§  no system calls (avoid context switches)
§  no String <> byte[] transformations
§  minimize Map > Combine > Reduce I/O
§  NO ALLOCATIONS


Minimizing Query Latency

§  c.a.s.a.m.t.ReportHandler

§  Simple Thrift server
§  Data is already processed and pre-aggregated
§  Query time does HAVING/WHERE (filters), extra
GROUP BY (roll-ups)
§  Calculate an optimal set of HBase scan()s

§  single / multiple scans
§  start / stop rows (prefixes, index positions)
§  Perform extra roll-ups / sorting
§  Assorted sundries: paging, display-time ser/des, etc
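
The query path described above (pick start and stop rows from a key prefix, scan, then do any extra roll-up at runtime) can be sketched in Python against the sample statistics table. This is a toy in-memory model, not the ReportHandler code or the HBase API; the function names are made up for this example.

```python
import bisect
from collections import defaultdict

# Pre-aggregated rows from the statistics table example.
STATS = {
    "daily_sales/2012+05+12+USA": 13,
    "daily_sales/2012+05+13+CAN": 0,
    "daily_sales/2012+05+13+USA": 25,
    "daily_sales/2012+05+14+USA": 15,
    "visits_by_city/CAN+ON": 1,
    "visits_by_city/USA+CH": 1,
    "visits_by_city/USA+NY": 2,
    "visits_by_city/USA+SF": 1,
}
KEYS = sorted(STATS)


def prefix_scan(prefix):
    """Scan [prefix, stop), where stop is the prefix with its last char bumped."""
    stop = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect.bisect_left(KEYS, prefix)
    hi = bisect.bisect_left(KEYS, stop)
    return [(key, STATS[key]) for key in KEYS[lo:hi]]


def monthly_sales_by_country(year, month):
    """Runtime roll-up: collapse the daily rows of one month per country."""
    totals = defaultdict(int)
    for key, value in prefix_scan(f"daily_sales/{year}+{month}+"):
        country = key.rsplit("+", 1)[1]  # last key component is the country
        totals[country] += value
    return dict(totals)


may_sales = monthly_sales_by_country("2012", "05")
usa_cities = prefix_scan("visits_by_city/USA+")
```

Only the rows under the chosen prefix are read, and the roll-up from daily to monthly totals happens on that small result set at query time.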


Flexible

§  Report configuration – the core of the system
§  c.a.s.a.e.Dimension, c.a.s.a.e.Metric

§  Can override ser/des, aggregate functions (for metrics)
§  Can override behavior (only add 1 if X...)
§  Emergent patterns are rolled-up in the reporting core
§  The entire processing loop can be written outside of
M/R for realtime
§  Storm ?
§  Applied in 4 use-cases right now, easy to extend
§  Some programming required

Thank you

adragomi@adobe.com / @adragomir
http://hstack.org

Our team: Adrian Muraru, Andrei Dulvac, Bogdan Dragu,
Bogdan Drutu, Cosmin Lehene, Raluca Podiuc, Tudor Scurtu

