18. Short Batch processing
Presto, Impala, Drill
Target window: seconds - hours (- days)
Total throughput: Normal
Query latency: Small (seconds - mins)
Tuesday, July 8, 2014
19. Stream processing
Storm, Kafka, Esper, Norikra, Fluentd, ....
Spark streaming(?)
Target window: seconds - hours
Total throughput: Normal
Query latency: SMALLEST (milliseconds)
Queries must be written BEFORE the DATA arrives
Once registered, runs forever
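The "register first, then run forever" model can be sketched in a few lines of Python. This is only a toy illustration, not how Storm, Esper, or Norikra are implemented; `StandingQueryEngine`, `register`, and `send` are hypothetical names invented for the sketch.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]


class StandingQueryEngine:
    """Toy engine: queries are registered first, then events are pushed through them."""

    def __init__(self) -> None:
        self._queries: Dict[str, Callable[[Event], bool]] = {}
        self.results: Dict[str, List[Event]] = {}

    def register(self, name: str, predicate: Callable[[Event], bool]) -> None:
        # A registered query stays active until it is explicitly removed.
        self._queries[name] = predicate
        self.results[name] = []

    def send(self, event: Event) -> None:
        # Each incoming event is evaluated once against every registered query.
        for name, predicate in self._queries.items():
            if predicate(event):
                self.results[name].append(event)


engine = StandingQueryEngine()
engine.send({"age": 45})  # arrives before any query exists, so no query ever sees it
engine.register("over_30", lambda e: e["age"] > 30)
engine.send({"age": 25})
engine.send({"age": 31})
```

Only the event sent after registration and matching the predicate ends up in `engine.results["over_30"]`; the earlier event is gone, which is why the query has to exist before the data.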
20. Data flow and latency
[Diagram: data window and query execution for Batch, Short Batch, and Stream; Stream uses incremental query execution]
21. Data window
Target time (or size) range of queries
Batch (or short-batch)
FROM-TO: WHERE dt >= '2014-07-07 00:00:00'
AND dt <= '2014-07-08 23:59:59'
Stream
“Calculate this query every 3 minutes”
Extended SQL required
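The two kinds of data window can be contrasted with a small Python sketch. It assumes windows aligned to multiples of the window size, which is a simplification; `in_range` and `window_start` are hypothetical helper names, not part of any of the tools named here.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def in_range(ts: datetime, start: datetime, end: datetime) -> bool:
    """Batch style: a fixed FROM-TO range chosen when the query is run."""
    return start <= ts <= end


def window_start(ts: datetime, size: timedelta) -> datetime:
    """Stream style: map an event time to the start of its repeating window."""
    # timedelta // timedelta yields the whole number of windows since EPOCH.
    return EPOCH + ((ts - EPOCH) // size) * size


ts = datetime(2014, 7, 8, 10, 4, 30)
batch_hit = in_range(ts, datetime(2014, 7, 7), datetime(2014, 7, 8, 23, 59, 59))
bucket = window_start(ts, timedelta(minutes=3))
```

The batch predicate answers one question about one fixed range, while `window_start` assigns every event to a window that keeps recurring, which is the part plain SQL has no syntax for.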
23. Stream processing with SQL
Esper: a Java library for stream processing
Esper EPL
SELECT param1, param2
FROM tbl
WHERE age > 30
24. Stream processing with SQL
SELECT param, COUNT(*) AS c
FROM tbl
WHERE age > 30
GROUP BY param
Esper: a Java library for stream processing
Esper EPL
25. Stream processing with SQL
SELECT param, COUNT(*) AS c
FROM tbl.win:time_batch(1 hour)
WHERE age > 30
GROUP BY param
Esper: a Java library for stream processing
Esper EPL
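What the windowed query above computes can be approximated in plain Python. This is a rough emulation for intuition only: it assumes timestamps in seconds and aligns batches to multiples of the window size, whereas Esper's own batch boundaries may fall elsewhere. `time_batch_counts` is a hypothetical name.

```python
from collections import Counter, defaultdict


def time_batch_counts(events, window_seconds=3600):
    """Count events per `param` within each batch, keeping only age > 30.

    `events` is an iterable of (timestamp_seconds, event_dict) pairs.
    """
    batches = defaultdict(Counter)
    for ts, event in events:
        if event["age"] > 30:                 # WHERE age > 30
            batch = ts // window_seconds      # which one-hour batch the event falls in
            batches[batch][event["param"]] += 1   # GROUP BY param, COUNT(*)
    return {batch: dict(counts) for batch, counts in sorted(batches.items())}


sample = [
    (10, {"param": "a", "age": 40}),
    (20, {"param": "a", "age": 20}),    # filtered out by the age condition
    (3599, {"param": "b", "age": 31}),
    (3600, {"param": "a", "age": 50}),  # first event of the second batch
]
report = time_batch_counts(sample)
```

A real engine emits each batch's result when the window closes rather than returning everything at the end; the sketch only shows the grouping.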
27. Norikra:
Schema-less Stream Processing with SQL
OSS, based on Esper EPL, GPLv2
Without pre-defined schema
Complex event processing (w/ nested hash/array) w/ SQL
HTTP RPC w/ JSON or MessagePack (fluentd plugin available!)
Dynamic query registration/removal
Ultra fast bootstrap (in 3 minutes!)
UDF plugins by Java/Ruby
http://norikra.github.io/
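One way to picture querying nested events without a pre-defined schema is to flatten each event into dotted field names as it arrives. The sketch below is an illustration of the idea only, not Norikra's actual mechanism; `flatten` is a hypothetical helper, and arrays are left untouched.

```python
def flatten(event, prefix=""):
    """Turn nested dicts into a flat mapping keyed by dotted paths."""
    fields = {}
    for key, value in event.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested hashes, extending the dotted prefix.
            fields.update(flatten(value, name + "."))
        else:
            # Scalars (and, in this sketch, lists) are kept as-is.
            fields[name] = value
    return fields


event = {"type": "click", "member": {"id": 7, "lang": "ja"}}
flat = flatten(event)
```

The set of field names is discovered from the events themselves, so a new field in tomorrow's logs needs no schema change before a query can refer to it.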
28. Distributed processing OR NOT?
Norikra is NOT a distributed processing platform.
Of course, SCALE OUT IS FANTASTIC.
Is non-distributed software useless?
MySQL
MySQL Cluster
Norikra can handle 10k events/sec
on a 2-CPU (8-core) server
34. Hybrid stream processing:
against complexity
Non-SQL stream processing:
for simple, fixed, high-traffic events
SQL stream processing:
for complex, fragile events
35. Case study in LINE
Prompt-report & fixed-report
Norikra + Hive Hybrid
Error detection from application and access logs
Norikra + Fluentd Hybrid
Realtime aggregation for complex and simple (fixed) objects
Norikra + Fluentd Hybrid
37. Hive: fixed-reports
SELECT
yyyymmdd, hh, campaign_id, region, lang,
COUNT(*) AS click,
COUNT(DISTINCT member_id) AS uu
FROM (
SELECT
yyyymmdd,
hh,
get_json_object(log, '$.campaign.id') AS campaign_id,
get_json_object(log, '$.member.region') AS region,
get_json_object(log, '$.member.lang') AS lang,
get_json_object(log, '$.member.id') AS member_id
FROM applog
WHERE service='myservice'
AND yyyymmdd='20140708' AND hh='00'
AND get_json_object(log, '$.type')='click'
) x
GROUP BY yyyymmdd, hh, campaign_id, region, lang
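The inner query pulls fields out of a JSON string column with paths such as `$.campaign.id`. A minimal Python sketch of that kind of lookup follows. It handles only simple dotted paths and returns the parsed value, so it is not a reimplementation of Hive's `get_json_object`; `json_path_get` is a hypothetical name.

```python
import json


def json_path_get(json_text, path):
    """Look up a simple '$.a.b' path in a JSON string; return None if absent."""
    if not path.startswith("$."):
        raise ValueError("only paths of the form '$.key.subkey' are supported")
    try:
        node = json.loads(json_text)
    except ValueError:
        # Malformed JSON yields no value rather than an error.
        return None
    for key in path[2:].split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


log = '{"type": "click", "campaign": {"id": 12}, "member": {"region": "JP"}}'
campaign_id = json_path_get(log, "$.campaign.id")
missing_lang = json_path_get(log, "$.member.lang")
```

Every row's JSON is parsed again at query time, once per extracted field in this sketch, which is part of why this style suits fixed reports over a bounded partition.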
38. Norikra: prompt-reports
SELECT
campaign.id AS campaign_id,
member.region AS region,
member.lang AS lang,
COUNT(*) AS click,
COUNT(DISTINCT member.id) AS uu
FROM myservice.win:time_batch(1 hours)
WHERE type="click"
GROUP BY campaign.id, member.region, member.lang
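For a single one-hour batch, the aggregation in the query above amounts to the following Python sketch. It assumes events already arrive as nested dicts and ignores windowing entirely; `prompt_report` and the sample events are made up for illustration.

```python
from collections import defaultdict


def prompt_report(events):
    """Per (campaign, region, lang): click count and distinct member count."""
    clicks = defaultdict(int)
    members = defaultdict(set)
    for event in events:
        if event.get("type") != "click":      # WHERE type = "click"
            continue
        key = (event["campaign"]["id"],
               event["member"]["region"],
               event["member"]["lang"])
        clicks[key] += 1                      # COUNT(*)
        members[key].add(event["member"]["id"])  # COUNT(DISTINCT member.id)
    return {key: {"click": clicks[key], "uu": len(members[key])} for key in clicks}


events = [
    {"type": "click", "campaign": {"id": 1}, "member": {"id": 7, "region": "JP", "lang": "ja"}},
    {"type": "click", "campaign": {"id": 1}, "member": {"id": 7, "region": "JP", "lang": "ja"}},
    {"type": "view", "campaign": {"id": 1}, "member": {"id": 8, "region": "JP", "lang": "ja"}},
    {"type": "click", "campaign": {"id": 2}, "member": {"id": 9, "region": "TW", "lang": "zh"}},
]
result = prompt_report(events)
```

Unlike the Hive version, there is no JSON extraction step and no partition filter here: nested fields are addressed directly and the time range comes from the window.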