How Hadoop & Hive Change the Data Warehousing Game Forever
Dave Mariani
CEO & Founder of AtScale
@dmariani
Atscale.com
2014 Hadoop Summit
San Jose, CA
June 3, 2014
“We think only 3% of the potentially useful data is tagged, and even less is analyzed.”
Source: IDC Predictions 2013: Big Data, IDC

“90% of the data in the world today has been created in the last two years.”
Source: IBM

In 2012, 2.5 quintillion bytes of data were generated every day.
Source: IBM
SELECT { [Measures].[Counter],
         [Measures].[PreviousPeriodCounter] } ON COLUMNS,
NON EMPTY CROSSJOIN (
    exists( [Date].[Date].[Date].allmembers,
            [Date].[Date].&[2012-05-19T00:00:00] : [Date].[Date].&[2012-06-02T00:00:00] ),
    [Events].[Event].[Event].allmembers
) DIMENSION PROPERTIES MEMBER_CAPTION ON ROWS
FROM [ProductInsight]
WHERE ( { [Projects].[Project].[plusK] } )
TPC-H Query Run Times (Impala vs. HANA)
lineitem table, 60 million rows; all times in seconds

| Select Statement | Records Returned | HANA Small | Impala Small (1 Node, Parquet) | Impala Small (3 Nodes, Parquet) | Impala Small (1 Node, Text) | Impala Small (3 Nodes, Text) |
|---|---|---|---|---|---|---|
| select count(*) from lineitem | 1 | 1 | 3 | 1 | 74 | 31 |
| select count(*), sum(l_extendedprice) from lineitem | 1 | 4 | 12 | 3 | 73 | 29 |
| select l_shipmode, count(*), sum(l_extendedprice) from lineitem group by l_shipmode | 7 | 8 | 23 | 5 | 74 | 28 |
| select l_shipmode, count(*), sum(l_extendedprice) from lineitem where l_shipmode = 'AIR' group by l_shipmode | 1 | 1 | 20 | 4 | 73 | 28 |
| select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem group by l_shipmode, l_linestatus | 14 | 10 | 32 | 7 | 74 | 28 |
| select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' group by l_shipmode, l_linestatus | 1 | 1 | 27 | 5 | 72 | 29 |
| select count(*) from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1 | 45 | 1 | 23 | 5 | 73 | 30 |
| select l_shipmode, l_linestatus, l_extendedprice from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1 | 45 | 1 | 29 | 5 | 73 | 31 |
| select * from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1 | 45 | 1 | 104 | 21 | 73 | 30 |
| Data size | | 1.9Gb (5 partitions) | 3.2Gb (40 files x 80mb) | 3.2Gb (40 files x 80mb) | 7.2Gb (1 file, no compression) | 7.2Gb (1 file, no compression) |
| Est. monthly cost of production environment on AWS (HANA m2.xlarge, Impala m1.medium) | | $1022 | $175 | $350 | $175 | $350 |

Source: Aron MacDonald, http://scn.sap.com/community/developer-center/hana/blog/2013/05/30/big-data-analytics-hana-vs-hadoop-impala-on-aws
Our ability to capture data has far exceeded our ability to analyze it.
Traditional data warehousing tools have not kept pace with the growth of data.
Hadoop allows us to capture and store data economically, but traditional BI tools and approaches don’t work.
IDC: “Currently a quarter of the information in the Digital Universe would be useful for big data if it were tagged and analyzed. We think only 3% of the potentially useful data is tagged, and even less is analyzed.”
Sad panda
Happy panda!
Klout “scores” 400M people every day and collects 12B signals from various social networks that need to factor into the Klout score.
The Klout architecture is made up of open source tools.
Since we could not afford an expensive software/hardware solution, we chose Hadoop and Hive. We built a 150-node Hadoop cluster from $3K commodity nodes to create a 1.5-petabyte warehouse.
We connected SQL Server Analysis Services directly to our Hive data warehouse to provide an interactive query environment.
Google Analytics gave us great page-by-page analysis and great reports, but we couldn’t send it user-identifiable data.
Mixpanel has great support for real-time events, but we couldn’t send all the data needed to draw really interesting conclusions. Joining across data sets was still going to be a huge challenge.
We had all our data, but of course, that was about it.
We couldn’t cross the streams. We wanted to discover really interesting patterns and make advanced recommendations based on who the user was.
At Klout, we used web analytics tools like Google Analytics and Mixpanel to understand how our users interacted with our web site and mobile app. However, we could not join the usage data with our profile data. This made for an incomplete view of our users.
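To make the contrast concrete, here is the kind of question we could not answer with hosted web analytics, but could once both data sets lived in Hive. This is a hypothetical sketch; the table and column names are illustrative, not Klout’s actual schema:

-- Hypothetical HiveQL: join raw usage events with profile data.
-- Table and column names are illustrative, not Klout's actual schema.
select p.score_bucket, e.event, count(*) as events
from events e
join profiles p on e.user_id = p.user_id
where e.dt = '2014-06-01'
group by p.score_bucket, e.event;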
We decided to build a flexible, event-oriented architecture to capture all user activity events. This is the architecture.
First, we invented a simple, JSON-oriented event capture method. This allowed our web and app designers to add instrumentation without regard to how it would affect the downstream analytics applications or the Hive warehouse.
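For illustration, an event in this style might carry a small fixed envelope plus a free-form attribute map. The field names here are assumptions, not Klout’s actual schema:

{
  "event": "profile_view",
  "timestamp": "2014-06-03T18:21:07Z",
  "user_id": 123456,
  "attributes": {
    "source": "mobile_app",
    "viewed_user_id": "654321",
    "referrer": "search"
  }
}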
Next, using Flume, we mapped the semi-structured data stream into time-partitioned files in Hadoop HDFS.
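A minimal Flume 1.x agent along these lines might look like the sketch below. The source type, port, and HDFS path layout are assumptions for illustration, not Klout’s actual configuration:

# Sketch: JSON events in over HTTP, time-partitioned files out to HDFS.
agent.sources = events-in
agent.channels = mem
agent.sinks = hdfs-out

# The built-in HTTP source accepts JSON-formatted events by default.
agent.sources.events-in.type = http
agent.sources.events-in.port = 8081
agent.sources.events-in.channels = mem

agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

# Escape sequences in hdfs.path produce the time-based partitioning.
agent.sinks.hdfs-out.type = hdfs
agent.sinks.hdfs-out.channel = mem
agent.sinks.hdfs-out.hdfs.path = /data/events/dt=%Y-%m-%d/hr=%H
agent.sinks.hdfs-out.hdfs.fileType = DataStream
agent.sinks.hdfs-out.hdfs.useLocalTimeStamp = true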
We then created an EXTERNAL Hive table on top of this file structure. That allowed us to “query” the incoming files in HDFS.
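A sketch of such a table, assuming the time-partitioned layout above and one JSON event per line (the column names and SerDe choice are illustrative):

-- Sketch: an external, time-partitioned Hive table over the raw event files.
-- Column names and the JSON SerDe are assumptions for illustration.
create external table events (
  event      string,
  ts         string,
  user_id    bigint,
  attributes map<string, string>
)
partitioned by (dt string, hr string)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/data/events';

-- Register a newly landed partition so it becomes queryable:
alter table events add partition (dt='2014-06-03', hr='18');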
In order to provide an interactive query environment (OLAP), we connected SQL Server Analysis Services directly to the Hive warehouse and continuously updated a MOLAP cube with the data.
We could then hook up internally developed applications (Event Tracker) to our data by having them generate MDX (the multidimensional query language) queries and run them against our cube.
Or we could use the Hive CLI (command line interface) to execute SQL queries directly against our Hive warehouse.
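For example, an ad hoc aggregation straight from the CLI, using the illustrative events table sketched above:

hive -e "
select event, count(*) as cnt
from events
where dt = '2014-06-03'
group by event
order by cnt desc;"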
Thumbs up!
By leveraging Hive’s non-scalar data types (map, array, struct, union), we can store data in an event/attributes (key/value pairs) format. This allows us to perform transformations at query time (schema on read), keeps our data modeling (pre-structuring) to a minimum, and lets us add new data without affecting the schema. In this way, we can capture all the data we want in the simplest terms (log files) and structure it later, on read. This drastically simplifies data modeling, creates huge flexibility, and reduces cost.
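A sketch of what schema on read looks like against the illustrative events table above: the attributes map is stored untyped and pulled apart at query time, so new attributes require no schema change:

-- Schema on read: new attributes show up in the map with no alter table.
select attributes["source"]   as source,
       attributes["referrer"] as referrer,
       count(*)               as events
from events
where dt = '2014-06-03'
group by attributes["source"], attributes["referrer"];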
Apache Hive’s reliance on MapReduce as its core data processing engine makes it unsuitable for interactive queries, due to job startup times and MapReduce’s batch nature.
Several approaches are emerging to address these deficiencies while remaining Hive-catalog compatible. These developments are what make Hadoop/Hive viable as the world’s least expensive yet scalable data warehousing platform.
Source: Aron MacDonald, http://scn.sap.com/community/developer-center/hana/blog/2013/05/30/big-data-analytics-hana-vs-hadoop-impala-on-aws
Cloudera Impala, which is essentially free, performs almost as well as an expensive alternative that relies on in-memory caching to deliver its performance.
Shark, Impala, and similar engines turn Hive into a real interactive SQL query environment. This is a huge advancement and the missing piece that makes Hadoop the world’s cheapest, most scalable database.
Here’s a query that demonstrates Shark/Hive’s support for non-scalar data types:

use aw_demo;

describe factinternetsales;

-- Aggregate orders per year and style, reading the style out of the
-- product_info map column at query time, then resolve the style name
-- with a join against the dimstyle lookup table.
select a.year, s.stylename, a.num_orders
from (
  select part_year as year,
         product_info["style"] as style,
         sum(orderquantity) as num_orders
  from factinternetsales
  where part_year < 2007
  group by product_info["style"], part_year
) a
left outer join dimstyle s on a.style = s.stylekey
order by year, num_orders desc;