Stream Analysis with Kafka Native way
and Considerations in Monitoring As A service
Andrew Yongjoon Kong
sstrato.open@gmail.com
• Cloud Technical Advisory for Government Broadcast Agency
• Adjunct Prof., Ajou Univ.
• Korea Database Agency, Acting Professor for Big Data
• Member of the National Information Agency Big Data Advisory Committee
• Kakaocorp, Cloud Part Lead
• Talks
• Scalable Loadbalancer with VM orchestrator (2017, netdev, korea)
• Embrace clouds (2017, openstack days, korea)
• Full route based network with linux (2016, netdev, Tokyo)
• SDN without SDN (2015, openstack, Vancouver)
Who am I
Andrew Yongjoon Kong
Supervised the Korean editions of the books shown.
Some Terms
Batch vs. Stream Processing
(Diagram) Applications, IoT sensors, and other sources emit events. In the batch-processing area, events land in a database or a distributed file system and are analyzed later by running queries and updating results; in the stream-processing area, an application consumes the events as a stream and produces a new stream.
What is Real Time?
The term real-time analytics implies practically instant access and use of analytical data.
Relative, Time is
To be continued
Processing
(Diagram) An application consumes events from an existing stream, keeps intermediate results in a state store, and emits events to a new stream.
Popular stream processors
• Apache Flume (too old school)
• Apache Storm
• Apache Spark
• Apache Samza
• Apache NiFi …
Popular stream processors
• e.g. Apache Flume
• Flume comprises sources, channels, and sinks
• A channel connects a source to a sink
• Sources: Avro, Thrift, Exec, JMS, Spooling Directory, NetCat, Sequence Generator, Syslog, HTTP, Twitter
• Sinks: HDFS, Logger, Avro, Thrift, IRC, File Roll, HBase, ElasticSearch
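As a sketch of how these pieces wire together, a minimal Flume agent configuration could look like the following; the agent and component names, the NetCat source, and the logger sink are illustrative choices, not taken from the slide.

# flume.conf (illustrative): NetCat source -> memory channel -> logger sink
agent.sources  = r1
agent.channels = c1
agent.sinks    = k1

# source: accept newline-terminated events on a TCP port
agent.sources.r1.type = netcat
agent.sources.r1.bind = localhost
agent.sources.r1.port = 44444
agent.sources.r1.channels = c1

# channel: buffer events in memory
agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000

# sink: write events to the log
agent.sinks.k1.type = logger
agent.sinks.k1.channel = c1

An agent like this is started with something along the lines of flume-ng agent --name agent --conf-file flume.conf.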
Kafka Streams
• Simple (works only with Kafka)
• Guarantees exactly-once processing
• Provides a local state store
• DSL support
• Kafka Streams comprises source processors, sink processors, and a topology
• Source: reads data from a Kafka topic
• Sink: receives data from other processors
• Topology: the automatically created data pipeline
Kafka Streams Sample 1, pipe
• kafka streams code:
• running:
mvn exec:java -Dexec.mainClass=myapps.Pipe
final StreamsBuilder builder = new StreamsBuilder();
builder.stream("streams-plaintext-input").to("streams-pipe-output");
final Topology topology = builder.build();
final KafkaStreams streams = new KafkaStreams(topology, props);
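The snippet above uses a props object that is not shown on the slide. A minimal, self-contained sketch of the same pipe application follows; the application id streams-pipe and the broker address localhost:9092 are assumptions, not values from the slide.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;

public class Pipe {
    public static void main(String[] args) {
        // Assumed configuration: adjust the application id and broker list for your cluster.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-pipe");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Same topology as on the slide: copy every record from one topic to another.
        final StreamsBuilder builder = new StreamsBuilder();
        builder.stream("streams-plaintext-input").to("streams-pipe-output");
        final Topology topology = builder.build();

        final KafkaStreams streams = new KafkaStreams(topology, props);
        streams.start();
        // Close the streams instance cleanly when the JVM shuts down.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}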
Kafka Streams Sample 1, pipe
vs Apache Samza
• Code:
• running:
• copy the jar or path to the Hadoop cluster
• run the program
• what happens if something goes bad?
Kafka Streams Sample 2, wordcount
• Topology
Kafka Streams Sample 2, wordcount
• Code
builder.<String, String>stream("streams-plaintext-input")
.flatMapValues(new ValueMapper<String, Iterable<String>>() {
@Override
public Iterable<String> apply(String value) {
return Arrays.asList(value.toLowerCase(Locale.getDefault()).split("\\W+"));
}
})
.groupBy(new KeyValueMapper<String, String, String>() {
@Override
public String apply(String key, String value) {
return value;
}
})
.count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store"))
.toStream()
.to("streams-wordcount-output", Produced.with(Serdes.String(), Serdes.Long()));
Kafka Streams
• Demo
Kafka Streams QnA
• Let’s talk
KSQL
Before going into KSQL
• Why do you need SQL?
Productivity Perspective
• Find the five sites most visited by users aged 18 to 25
(Dataflow) load user info and site-visit data → filter by age → join on user ID → group by site → count visits → sort by visit count → top 5 sites

User info:
name   ID        age  gender
길동   kildong   20   M
철수   cheol     25   M
영희   young     15   F
영구   ygu       34   M

Site visits:
site         visitor   time
chosum.com   kildong   08:00
ddanji.com   tiffany   12:00
flickr.com   yuna      11:00
espn.com     ygu       21:34
• When you code the MapReduce program directly:
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;
public class MRExample {
public static class LoadPages extends MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable k, Text val,
OutputCollector<Text, Text> oc,
Reporter reporter) throws IOException {
// Pull the key out
String line = val.toString();
int firstComma = line.indexOf(',');
String key = line.substring(0, firstComma);
String value = line.substring(firstComma + 1);
Text outKey = new Text(key);
// Prepend an index to the value so we know which file
// it came from.
Text outVal = new Text("1" + value);
oc.collect(outKey, outVal);
}
}
public static class LoadAndFilterUsers extends
MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable k, Text val,
OutputCollector<Text, Text> oc,
Reporter reporter) throws IOException {
// Pull the key out
String line = val.toString();
int firstComma = line.indexOf(',');
String value = line.substring(firstComma + 1);
int age = Integer.parseInt(value);
if (age < 18 || age > 25) return;
String key = line.substring(0, firstComma);
Text outKey = new Text(key);
// Prepend an index to the value so we know which file
// it came from.
Text outVal = new Text("2" + value);
oc.collect(outKey, outVal);
}
}
public static class Join extends MapReduceBase
implements Reducer<Text, Text, Text, Text> {
public void reduce(Text key,
Iterator<Text> iter,
OutputCollector<Text, Text> oc,
Reporter reporter) throws IOException {
// For each value, figure out which file it's from and store it
// accordingly.
List<String> first = new ArrayList<String>();
List<String> second = new ArrayList<String>();
while (iter.hasNext()) {
Text t = iter.next();
String value = t.toString();
if (value.charAt(0) == '1')
first.add(value.substring(1));
else second.add(value.substring(1));
reporter.setStatus("OK");
}
// Do the cross product and collect the values
for (String s1 : first) {
for (String s2 : second) {
String outval = key + "," + s1 + "," + s2;
oc.collect(null, new Text(outval));
reporter.setStatus("OK");
}
}
}
}
public static class LoadJoined extends MapReduceBase
implements Mapper<Text, Text, Text, LongWritable> {
public void map(
Text k,
Text val,
OutputCollector<Text, LongWritable> oc,
Reporter reporter) throws IOException {
// Find the url
String line = val.toString();
int firstComma = line.indexOf(',');
int secondComma = line.indexOf(',', firstComma);
String key = line.substring(firstComma, secondComma);
// drop the rest of the record, I don't need it anymore,
// just pass a 1 for the combiner/reducer to sum instead.
Text outKey = new Text(key);
oc.collect(outKey, new LongWritable(1L));
}
}
public static class ReduceUrls extends MapReduceBase
implements Reducer<Text, LongWritable, WritableComparable, Writable> {
public void reduce(
Text key,
Iterator<LongWritable> iter,
OutputCollector<WritableComparable, Writable> oc,
Reporter reporter) throws IOException {
// Add up all the values we see
long sum = 0;
while (iter.hasNext()) {
sum += iter.next().get();
reporter.setStatus("OK");
}
oc.collect(key, new LongWritable(sum));
}
}
public static class LoadClicks extends MapReduceBase
implements Mapper<WritableComparable, Writable, LongWritable, Text> {
public void map(
WritableComparable key,
Writable val,
OutputCollector<LongWritable, Text> oc,
Reporter reporter) throws IOException {
oc.collect((LongWritable)val, (Text)key);
}
}
public static class LimitClicks extends MapReduceBase
implements Reducer<LongWritable, Text, LongWritable, Text> {
int count = 0;
public void reduce(
LongWritable key,
Iterator<Text> iter,
OutputCollector<LongWritable, Text> oc,
Reporter reporter) throws IOException {
// Only output the first 100 records
while (count < 100 && iter.hasNext()) {
oc.collect(key, iter.next());
count++;
}
}
}
public static void main(String[] args) throws IOException {
JobConf lp = new JobConf(MRExample.class);
lp.setJobName("Load Pages");
lp.setInputFormat(TextInputFormat.class);
lp.setOutputKeyClass(Text.class);
lp.setOutputValueClass(Text.class);
lp.setMapperClass(LoadPages.class);
FileInputFormat.addInputPath(lp, new
Path("/user/gates/pages"));
FileOutputFormat.setOutputPath(lp,
new Path("/user/gates/tmp/indexed_pages"));
lp.setNumReduceTasks(0);
Job loadPages = new Job(lp);
JobConf lfu = new JobConf(MRExample.class);
lfu.setJobName("Load and Filter Users");
lfu.setInputFormat(TextInputFormat.class);
lfu.setOutputKeyClass(Text.class);
lfu.setOutputValueClass(Text.class);
lfu.setMapperClass(LoadAndFilterUsers.class);
FileInputFormat.addInputPath(lfu, new
Path("/user/gates/users"));
FileOutputFormat.setOutputPath(lfu,
new Path("/user/gates/tmp/filtered_users"));
lfu.setNumReduceTasks(0);
Job loadUsers = new Job(lfu);
JobConf join = new JobConf(MRExample.class);
join.setJobName("Join Users and Pages");
join.setInputFormat(KeyValueTextInputFormat.class);
join.setOutputKeyClass(Text.class);
join.setOutputValueClass(Text.class);
join.setMapperClass(IdentityMapper.class);
join.setReducerClass(Join.class);
FileInputFormat.addInputPath(join, new
Path("/user/gates/tmp/indexed_pages"));
FileInputFormat.addInputPath(join, new
Path("/user/gates/tmp/filtered_users"));
FileOutputFormat.setOutputPath(join, new
Path("/user/gates/tmp/joined"));
join.setNumReduceTasks(50);
Job joinJob = new Job(join);
joinJob.addDependingJob(loadPages);
joinJob.addDependingJob(loadUsers);
JobConf group = new JobConf(MRExample.class);
group.setJobName("Group URLs");
group.setInputFormat(KeyValueTextInputFormat.class);
group.setOutputKeyClass(Text.class);
group.setOutputValueClass(LongWritable.class);
group.setOutputFormat(SequenceFileOutputFormat.class);
group.setMapperClass(LoadJoined.class);
group.setCombinerClass(ReduceUrls.class);
group.setReducerClass(ReduceUrls.class);
FileInputFormat.addInputPath(group, new
Path("/user/gates/tmp/joined"));
FileOutputFormat.setOutputPath(group, new
Path("/user/gates/tmp/grouped"));
group.setNumReduceTasks(50);
Job groupJob = new Job(group);
groupJob.addDependingJob(joinJob);
JobConf top100 = new JobConf(MRExample.class);
top100.setJobName("Top 100 sites");
top100.setInputFormat(SequenceFileInputFormat.class);
top100.setOutputKeyClass(LongWritable.class);
top100.setOutputValueClass(Text.class);
top100.setOutputFormat(SequenceFileOutputFormat.class);
top100.setMapperClass(LoadClicks.class);
top100.setCombinerClass(LimitClicks.class);
top100.setReducerClass(LimitClicks.class);
FileInputFormat.addInputPath(top100, new
Path("/user/gates/tmp/grouped"));
FileOutputFormat.setOutputPath(top100, new
Path("/user/gates/top100sitesforusers18to25"));
top100.setNumReduceTasks(1);
Job limit = new Job(top100);
limit.addDependingJob(groupJob);
JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
jc.addJob(loadPages);
jc.addJob(loadUsers);
jc.addJob(joinJob);
jc.addJob(groupJob);
jc.addJob(limit);
jc.run();
}
}
Productivity Perspective
MapReduce
Sample Code
• It's for coders, not for users
• duplicated code and effort
• complexity in managing code
Productivity Perspective
High Level Parallel Processing Language
• Parallel-processing languages that make MapReduce easier
• Pig by Yahoo
• Hive by Facebook
Pig example
• Same purpose code in PIG
Users = load 'users' as (name, age);
Fltrd = filter Users by
        age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group,
       COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
code lines: 1/20
coding time: 1/16
easy to code
Apache Pig
• A high-level language for data processing
• An Apache top-level project
• 30% of the Hadoop jobs inside Yahoo
• 2-10x performance improvement since its release in 2007
• 70-80% of the performance of native MapReduce
Why Hive was developed
• To replace a vendor data-warehouse system
• Data-scalability problems (from 10 GB at first to tens of TB)
• To cut licensing and other operating costs
• Decided to migrate from the vendor DBMS to Hadoop
• Built the features that the migration showed were needed
• A CLI for users
• Ad-hoc queries without writing code
• Management of schema information
Data warehousing on Hive
• Hive on a Hadoop cluster
• Bulk-load Scribe & MySQL data into HDFS
• Replace manual Python scripts with Hive
(Diagram) Oracle database, data collection servers, a Scribe server tier, and a MySQL server tier.
What is the key component in Hive?
• metastore
• SerDe
• execution worker
Lambda Architecture
(Diagram) A raw-data topic and a processed-data topic are archived in a long-term data store; batch computation over that store fills batch tables, streaming computation (SUM, ADD, filtering) fills speed tables, and the data-serving area combines both to answer queries.
Lambda Architecture
BTW (not BTS), why is it called the Lambda Architecture?
The Greek letter lambda (λ).
Lambda Architecture e.g.
kakao’s KEMI-stat
http://tech.kakao.com/2016/08/25/kemi/
Kappa Architecture
• The key takeaway is "calculating instantly, not retrieving instantly"
(Diagram) A raw-data topic (with long-term retention) and a processed-data topic (with long-term retention) feed both streaming computation and long-term computation (SUM, ADD, filtering); the data-serving area answers queries from their results.
Kappa Architecture with KSQL
(Diagram) A short-term data topic and a long-term data topic are both computed over with KSQL: one computation runs select * from short_topic, another runs select * from long_topic, and the data-serving area answers queries from the results.
KSQL
(Diagram) A Kafka cluster is fronted by several KSQL servers, each consisting of a KSQL Engine and a REST API, and accessed through the KSQL shell client. Inside a KSQL server, the KSQL metastore holds the metadata and Kafka Streams acts as the execution worker.
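Because each KSQL server exposes its engine through the REST API, statements can also be submitted without the shell client. The sketch below uses Java 11's HttpClient and assumes a KSQL server listening on localhost:8088; the host and port are assumptions.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class KsqlRestClientSketch {
    public static void main(String[] args) throws Exception {
        // One KSQL statement plus (empty) streams properties, as a JSON payload.
        String body = "{ \"ksql\": \"LIST STREAMS;\", \"streamsProperties\": {} }";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8088/ksql"))    // assumed KSQL server address
                .header("Content-Type", "application/vnd.ksql.v1+json; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());                      // JSON listing of registered streams
    }
}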
KSQL DDL (Data Definition Language)
• stream vs. table
• supports only "CREATE/DELETE" for streams and tables
• sample
CREATE TABLE users
(usertimestamp BIGINT, user_id VARCHAR,
gender VARCHAR, region_id VARCHAR)
WITH (VALUE_FORMAT = 'JSON',
KAFKA_TOPIC = 'my-users-topic');
KSQL DML (Data Manipulation Language)
• SELECT, LEFT JOIN
• aggregate functions such as ADD and SUM, and UDFs such as ABS/CONCAT, are supported
Example
(Diagram) Data generators feed the PageViews and Users Kafka topics. From them, KSQL derives the streams pageviews_female and pageviews_female_like_89 and the table pageviews_region, backed by the new Kafka topics PAGEVIEWS_FEMALE, pageviews_enriched_r8_r9, and PAGEVIEWS_REGIONS.
Example, create user/pageview table
(Diagram: the same topology as above, with the PageViews and Users topics, their data generators, and the derived streams, table, and new topics.)
ksql> CREATE STREAM pageviews_original ❶
  (viewtime bigint, userid varchar, pageid varchar) ❷
  WITH (kafka_topic='pageviews', value_format='DELIMITED'); ❸

ksql> CREATE TABLE users_original ❶
  (registertime bigint, gender varchar, regionid varchar, userid varchar) ❷
  WITH (kafka_topic='users', value_format='JSON'); ❸
Example, create table/stream from query
(Diagram: the same topology as above.)
ksql> CREATE STREAM pageviews_female AS ❶
  SELECT users_original.userid AS userid, pageid, regionid, gender
  FROM pageviews_original ❷
  LEFT JOIN users_original ❸
  ON pageviews_original.userid = users_original.userid
  WHERE gender = 'FEMALE'; ❹

ksql> CREATE STREAM pageviews_female_like_89 ❶
  WITH (kafka_topic='pageviews_enriched_r8_r9', value_format='DELIMITED') AS ❷
  SELECT * FROM pageviews_female
  WHERE regionid LIKE '%_8' OR regionid LIKE '%_9'; ❸
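The streams created above are backed by ordinary Kafka topics, so any Kafka consumer can read them. Below is a minimal sketch that tails pageviews_enriched_r8_r9; the broker address and group id are assumptions.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class EnrichedPageviewsReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "pageviews-reader");         // assumed group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("pageviews_enriched_r8_r9"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // DELIMITED format: each value is one comma-separated page-view row
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}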
Considerations about
Monitoring As A Service
What is most important in
Data pipeline?
(Diagram) Kafka acts as an ESB connecting Python apps, plugins, Java apps, an ERP bridge, and web apps, alongside the existing messaging system and EIP-style integrations.
What is the most important in Data
pipeline?
• Performance
• Has to be real-time (or near real-time)
• Data Integrity
• No data loss
• Every piece of data can be consumed
The most important in Data pipeline?
• Provider Perspective
• Service Level Agreement
• Rate
• Format
• ACL
• It’s all about Managed Service
About Data Structure?
• Data Structure defines the Data computing architecture
• It defines API
• It defines Data Storage
• It defines Computing method
• What would you do if the data structure looked like the one below?
data["resource_id"]=”Some Server ID”
data["svc_id"]= “Some Service ID”
data["timestamp"]=str(int(time.time()))
data["statistics"]= stats
response =requests.put( url, data=json.dumps(data), )
QnA
Speaker notes
1. Cover slide
2. wordcount, need global state storage
3. complex