Stream Analysis with Kafka Native way
and Considerations in Monitoring As A service
Andrew Yongjoon Kong
sstrato.open@gmail.com
• Cloud Technical Advisory for Government Broadcast Agency
• Adjunct Prof., Ajou Univ.
• Korea Database Agency, Acting Professor for Big Data
• Member of the National Information Agency Big Data Advisory Committee
• Kakaocorp, Cloud Part Lead
• Talks
• Scalable Loadbalancer with VM orchestrator (2017, netdev, korea)
• Embrace clouds (2017, openstack days, korea)
• Full route based network with linux (2016, netdev, Tokyo)
• SDN without SDN (2015, openstack, Vancouver)
Who am I
Andrew Yongjoon Kong
Supervised the Korean editions of the books shown.
Some Terms
Batch vs. Stream Processing
(Diagram) Applications, IoT sensors, and other sources emit events. In the batch-processing area, events land in a database or a distributed file system and are analyzed later by running queries and updating results; in the stream-processing area, an application consumes the events as a stream and produces a new stream.
What is Real Time?
The term real-time analytics implies practically instant access and use of analytical data.
Relative, Time is
To be continued
Processing
(Diagram) An application consumes events from an existing stream, keeps intermediate results in a state store, and emits events to a new stream.
Popular stream processors
• Apache Flume (too old school)
• Apache Storm
• Apache Spark
• Apache Samza
• Apache NiFi …
Popular stream processors
• e.g. Apache Flume
• Flume comprises sources, channels, and sinks
• A channel connects a source to a sink
• Sources: Avro, Thrift, Exec, JMS, Spooling Directory, NetCat, Sequence Generator, Syslog, HTTP, Twitter
• Sinks: HDFS, Logger, Avro, Thrift, IRC, File Roll, HBase, ElasticSearch
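As a sketch of how these pieces wire together, a minimal Flume agent configuration could look like the following; the agent and component names, the NetCat source, and the logger sink are illustrative choices, not taken from the slide.

# flume.conf (illustrative): NetCat source -> memory channel -> logger sink
agent.sources  = r1
agent.channels = c1
agent.sinks    = k1

# source: accept newline-terminated events on a TCP port
agent.sources.r1.type = netcat
agent.sources.r1.bind = localhost
agent.sources.r1.port = 44444
agent.sources.r1.channels = c1

# channel: buffer events in memory
agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000

# sink: write events to the log
agent.sinks.k1.type = logger
agent.sinks.k1.channel = c1

An agent like this is started with something along the lines of flume-ng agent --name agent --conf-file flume.conf.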
Kafka Streams
• Simple (works only with Kafka)
• Guarantees exactly-once processing
• Provides a local state store
• DSL support
• Kafka Streams comprises source processors, sink processors, and a topology
• Source: reads data from a Kafka topic
• Sink: receives data from other processors
• Topology: the automatically created data pipeline
Kafka Streams Sample 1, pipe
• kafka streams code:
• running:
mvn exec:java -Dexec.mainClass=myapps.Pipe
final StreamsBuilder builder = new StreamsBuilder();
builder.stream("streams-plaintext-input").to("streams-pipe-output");
final Topology topology = builder.build();
final KafkaStreams streams = new KafkaStreams(topology, props);
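The snippet above uses a props object that is not shown on the slide. A minimal, self-contained sketch of the same pipe application follows; the application id streams-pipe and the broker address localhost:9092 are assumptions, not values from the slide.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;

public class Pipe {
    public static void main(String[] args) {
        // Assumed configuration: adjust the application id and broker list for your cluster.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-pipe");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Same topology as on the slide: copy every record from one topic to another.
        final StreamsBuilder builder = new StreamsBuilder();
        builder.stream("streams-plaintext-input").to("streams-pipe-output");
        final Topology topology = builder.build();

        final KafkaStreams streams = new KafkaStreams(topology, props);
        streams.start();
        // Close the streams instance cleanly when the JVM shuts down.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}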
Kafka Streams Sample 1, pipe
vs Apache Samza
• Code:
• running:
• copy the jar or path to the Hadoop cluster
• run the program
• what happens if something goes bad?
Kafka Streams Sample 2, wordcount
• Topology
Kafka Streams Sample 2, wordcount
• Code
builder.<String, String>stream("streams-plaintext-input")
.flatMapValues(new ValueMapper<String, Iterable<String>>() {
@Override
public Iterable<String> apply(String value) {
return Arrays.asList(value.toLowerCase(Locale.getDefault()).split("\\W+"));
}
})
.groupBy(new KeyValueMapper<String, String, String>() {
@Override
public String apply(String key, String value) {
return value;
}
})
.count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store"))
.toStream()
.to("streams-wordcount-output", Produced.with(Serdes.String(), Serdes.Long()));
Kafka Streams
• Demo
Kafka Streams QnA
• Let’s talk
KSQL
Before going into KSQL
• Why do you need SQL?
Productivity Perspective
• Find the five sites most visited by users aged 18 to 25
(Dataflow) load user info and site-visit data → filter by age → join on user ID → group by site → count visits → sort by visit count → top 5 sites

User info:
name   ID        age  gender
길동   kildong   20   M
철수   cheol     25   M
영희   young     15   F
영구   ygu       34   M

Site visits:
site         visitor   time
chosum.com   kildong   08:00
ddanji.com   tiffany   12:00
flickr.com   yuna      11:00
espn.com     ygu       21:34
• When you code the MapReduce program directly:
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;
public class MRExample {
public static class LoadPages extends MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable k, Text val,
OutputCollector<Text, Text> oc,
Reporter reporter) throws IOException {
// Pull the key out
String line = val.toString();
int firstComma = line.indexOf(',');
String key = line.substring(0, firstComma);
String value = line.substring(firstComma + 1);
Text outKey = new Text(key);
// Prepend an index to the value so we know which file
// it came from.
Text outVal = new Text("1" + value);
oc.collect(outKey, outVal);
}
}
public static class LoadAndFilterUsers extends
MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable k, Text val,
OutputCollector<Text, Text> oc,
Reporter reporter) throws IOException {
// Pull the key out
String line = val.toString();
int firstComma = line.indexOf(',');
String value = line.substring(firstComma + 1);
int age = Integer.parseInt(value);
if (age < 18 || age > 25) return;
String key = line.substring(0, firstComma);
Text outKey = new Text(key);
// Prepend an index to the value so we know which file
// it came from.
Text outVal = new Text("2" + value);
oc.collect(outKey, outVal);
}
}
public static class Join extends MapReduceBase
implements Reducer<Text, Text, Text, Text> {
public void reduce(Text key,
Iterator<Text> iter,
OutputCollector<Text, Text> oc,
Reporter reporter) throws IOException {
// For each value, figure out which file it's from and store it
// accordingly.
List<String> first = new ArrayList<String>();
List<String> second = new ArrayList<String>();
while (iter.hasNext()) {
Text t = iter.next();
String value = t.toString();
if (value.charAt(0) == '1')
first.add(value.substring(1));
else second.add(value.substring(1));
reporter.setStatus("OK");
}
// Do the cross product and collect the values
for (String s1 : first) {
for (String s2 : second) {
String outval = key + "," + s1 + "," + s2;
oc.collect(null, new Text(outval));
reporter.setStatus("OK");
}
}
}
}
public static class LoadJoined extends MapReduceBase
implements Mapper<Text, Text, Text, LongWritable> {
public void map(
Text k,
Text val,
OutputCollector<Text, LongWritable> oc,
Reporter reporter) throws IOException {
// Find the url
String line = val.toString();
int firstComma = line.indexOf(',');
int secondComma = line.indexOf(',', firstComma);
String key = line.substring(firstComma, secondComma);
// drop the rest of the record, I don't need it anymore,
// just pass a 1 for the combiner/reducer to sum instead.
Text outKey = new Text(key);
oc.collect(outKey, new LongWritable(1L));
}
}
public static class ReduceUrls extends MapReduceBase
implements Reducer<Text, LongWritable, WritableComparable, Writable> {
public void reduce(
Text key,
Iterator<LongWritable> iter,
OutputCollector<WritableComparable, Writable> oc,
Reporter reporter) throws IOException {
// Add up all the values we see
long sum = 0;
while (iter.hasNext()) {
sum += iter.next().get();
reporter.setStatus("OK");
}
oc.collect(key, new LongWritable(sum));
}
}
public static class LoadClicks extends MapReduceBase
implements Mapper<WritableComparable, Writable, LongWritable, Text> {
public void map(
WritableComparable key,
Writable val,
OutputCollector<LongWritable, Text> oc,
Reporter reporter) throws IOException {
oc.collect((LongWritable)val, (Text)key);
}
}
public static class LimitClicks extends MapReduceBase
implements Reducer<LongWritable, Text, LongWritable, Text> {
int count = 0;
public void reduce(
LongWritable key,
Iterator<Text> iter,
OutputCollector<LongWritable, Text> oc,
Reporter reporter) throws IOException {
// Only output the first 100 records
while (count < 100 && iter.hasNext()) {
oc.collect(key, iter.next());
count++;
}
}
}
public static void main(String[] args) throws IOException {
JobConf lp = new JobConf(MRExample.class);
lp.setJobName("Load Pages");
lp.setInputFormat(TextInputFormat.class);
lp.setOutputKeyClass(Text.class);
lp.setOutputValueClass(Text.class);
lp.setMapperClass(LoadPages.class);
FileInputFormat.addInputPath(lp, new
Path("/user/gates/pages"));
FileOutputFormat.setOutputPath(lp,
new Path("/user/gates/tmp/indexed_pages"));
lp.setNumReduceTasks(0);
Job loadPages = new Job(lp);
JobConf lfu = new JobConf(MRExample.class);
lfu.setJobName("Load and Filter Users");
lfu.setInputFormat(TextInputFormat.class);
lfu.setOutputKeyClass(Text.class);
lfu.setOutputValueClass(Text.class);
lfu.setMapperClass(LoadAndFilterUsers.class);
FileInputFormat.addInputPath(lfu, new
Path("/user/gates/users"));
FileOutputFormat.setOutputPath(lfu,
new Path("/user/gates/tmp/filtered_users"));
lfu.setNumReduceTasks(0);
Job loadUsers = new Job(lfu);
JobConf join = new JobConf(MRExample.class);
join.setJobName("Join Users and Pages");
join.setInputFormat(KeyValueTextInputFormat.class);
join.setOutputKeyClass(Text.class);
join.setOutputValueClass(Text.class);
join.setMapperClass(IdentityMapper.class);
join.setReducerClass(Join.class);
FileInputFormat.addInputPath(join, new
Path("/user/gates/tmp/indexed_pages"));
FileInputFormat.addInputPath(join, new
Path("/user/gates/tmp/filtered_users"));
FileOutputFormat.setOutputPath(join, new
Path("/user/gates/tmp/joined"));
join.setNumReduceTasks(50);
Job joinJob = new Job(join);
joinJob.addDependingJob(loadPages);
joinJob.addDependingJob(loadUsers);
JobConf group = new JobConf(MRExample.class);
group.setJobName("Group URLs");
group.setInputFormat(KeyValueTextInputFormat.class);
group.setOutputKeyClass(Text.class);
group.setOutputValueClass(LongWritable.class);
group.setOutputFormat(SequenceFileOutputFormat.class);
group.setMapperClass(LoadJoined.class);
group.setCombinerClass(ReduceUrls.class);
group.setReducerClass(ReduceUrls.class);
FileInputFormat.addInputPath(group, new
Path("/user/gates/tmp/joined"));
FileOutputFormat.setOutputPath(group, new
Path("/user/gates/tmp/grouped"));
group.setNumReduceTasks(50);
Job groupJob = new Job(group);
groupJob.addDependingJob(joinJob);
JobConf top100 = new JobConf(MRExample.class);
top100.setJobName("Top 100 sites");
top100.setInputFormat(SequenceFileInputFormat.class);
top100.setOutputKeyClass(LongWritable.class);
top100.setOutputValueClass(Text.class);
top100.setOutputFormat(SequenceFileOutputFormat.class);
top100.setMapperClass(LoadClicks.class);
top100.setCombinerClass(LimitClicks.class);
top100.setReducerClass(LimitClicks.class);
FileInputFormat.addInputPath(top100, new
Path("/user/gates/tmp/grouped"));
FileOutputFormat.setOutputPath(top100, new
Path("/user/gates/top100sitesforusers18to25"));
top100.setNumReduceTasks(1);
Job limit = new Job(top100);
limit.addDependingJob(groupJob);
JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
jc.addJob(loadPages);
jc.addJob(loadUsers);
jc.addJob(joinJob);
jc.addJob(groupJob);
jc.addJob(limit);
jc.run();
}
}
Productivity Perspective
MapReduce
Sample Code
• It's for coders, not for users
• duplicated code and effort
• complexity in managing code
Productivity Perspective
High Level Parallel Processing Language
• Parallel-processing languages that make MapReduce easier
• Pig by Yahoo
• Hive by Facebook
Pig example
• Same purpose code in PIG
Users = load 'users' as (name, age);
Fltrd = filter Users by
        age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group,
       COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
code lines: 1/20
coding time: 1/16
easy to code
Apache Pig
• A high-level language for data processing
• An Apache top-level project
• 30% of the Hadoop jobs inside Yahoo
• 2-10x performance improvement since its release in 2007
• 70-80% of the performance of native MapReduce
Why Hive was developed
• To replace a vendor data-warehouse system
• Data-scalability problems (from 10 GB at first to tens of TB)
• To cut licensing and other operating costs
• Decided to migrate from the vendor DBMS to Hadoop
• Built the features that the migration showed were needed
• A CLI for users
• Ad-hoc queries without writing code
• Management of schema information
Data warehousing on Hive
• Hive on a Hadoop cluster
• Bulk-load Scribe & MySQL data into HDFS
• Replace manual Python scripts with Hive
(Diagram) Oracle database, data collection servers, a Scribe server tier, and a MySQL server tier.
What is the key component in Hive?
• metastore
• SerDe
• execution worker
Lambda Architecture
(Diagram) A raw-data topic and a processed-data topic are archived in a long-term data store; batch computation over that store fills batch tables, streaming computation (SUM, ADD, filtering) fills speed tables, and the data-serving area combines both to answer queries.
Lambda Architecture
BTW (not BTS), why is it called the Lambda Architecture?
The Greek letter lambda (λ).
Lambda Architecture e.g.
kakao’s KEMI-stat
http://tech.kakao.com/2016/08/25/kemi/
Kappa Architecture
• The key takeaway is "calculating instantly, not retrieving instantly"
(Diagram) A raw-data topic (with long-term retention) and a processed-data topic (with long-term retention) feed both streaming computation and long-term computation (SUM, ADD, filtering); the data-serving area answers queries from their results.
Kappa Architecture with KSQL
(Diagram) A short-term data topic and a long-term data topic are both computed over with KSQL: one computation runs select * from short_topic, another runs select * from long_topic, and the data-serving area answers queries from the results.
KSQL
(Diagram) A Kafka cluster is fronted by several KSQL servers, each consisting of a KSQL Engine and a REST API, and accessed through the KSQL shell client. Inside a KSQL server, the KSQL metastore holds the metadata and Kafka Streams acts as the execution worker.
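Because each KSQL server exposes its engine through the REST API, statements can also be submitted without the shell client. The sketch below uses Java 11's HttpClient and assumes a KSQL server listening on localhost:8088; the host and port are assumptions.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class KsqlRestClientSketch {
    public static void main(String[] args) throws Exception {
        // One KSQL statement plus (empty) streams properties, as a JSON payload.
        String body = "{ \"ksql\": \"LIST STREAMS;\", \"streamsProperties\": {} }";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8088/ksql"))    // assumed KSQL server address
                .header("Content-Type", "application/vnd.ksql.v1+json; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());                      // JSON listing of registered streams
    }
}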
KSQL DDL (Data Definition Language)
• stream vs. table
• supports only "CREATE/DELETE" for streams and tables
• sample
CREATE TABLE users
(usertimestamp BIGINT, user_id VARCHAR,
gender VARCHAR, region_id VARCHAR)
WITH (VALUE_FORMAT = 'JSON',
KAFKA_TOPIC = 'my-users-topic');
KSQL DML (Data Manipulation Language)
• SELECT, LEFT JOIN
• aggregate functions such as ADD and SUM, and UDFs such as ABS/CONCAT, are supported
Example
(Diagram) Data generators feed the PageViews and Users Kafka topics. From them, KSQL derives the streams pageviews_female and pageviews_female_like_89 and the table pageviews_region, backed by the new Kafka topics PAGEVIEWS_FEMALE, pageviews_enriched_r8_r9, and PAGEVIEWS_REGIONS.
Example, create user/pageview table
(Diagram: the same topology as above, with the PageViews and Users topics, their data generators, and the derived streams, table, and new topics.)
ksql> CREATE STREAM pageviews_original ❶
  (viewtime bigint, userid varchar, pageid varchar) ❷
  WITH (kafka_topic='pageviews', value_format='DELIMITED'); ❸

ksql> CREATE TABLE users_original ❶
  (registertime bigint, gender varchar, regionid varchar, userid varchar) ❷
  WITH (kafka_topic='users', value_format='JSON'); ❸
Example, create table/stream from query
(Diagram: the same topology as above.)
ksql> CREATE STREAM pageviews_female AS ❶
  SELECT users_original.userid AS userid, pageid, regionid, gender
  FROM pageviews_original ❷
  LEFT JOIN users_original ❸
  ON pageviews_original.userid = users_original.userid
  WHERE gender = 'FEMALE'; ❹

ksql> CREATE STREAM pageviews_female_like_89 ❶
  WITH (kafka_topic='pageviews_enriched_r8_r9', value_format='DELIMITED') AS ❷
  SELECT * FROM pageviews_female
  WHERE regionid LIKE '%_8' OR regionid LIKE '%_9'; ❸
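The streams created above are backed by ordinary Kafka topics, so any Kafka consumer can read them. Below is a minimal sketch that tails pageviews_enriched_r8_r9; the broker address and group id are assumptions.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class EnrichedPageviewsReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "pageviews-reader");         // assumed group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("pageviews_enriched_r8_r9"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // DELIMITED format: each value is one comma-separated page-view row
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}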
Considerations about
Monitoring As A Service
What is most important in
Data pipeline?
(Diagram) Kafka acts as an ESB connecting Python apps, plugins, Java apps, an ERP bridge, and web apps, alongside the existing messaging system and EIP-style integrations.
What is the most important in Data
pipeline?
• Performance
• Has to be real-time (or near real-time)
• Data Integrity
• No data loss
• Every piece of data can be consumed
The most important in Data pipeline?
• Provider Perspective
• Service Level Agreement
• Rate
• Format
• ACL
• It’s all about Managed Service
About Data Structure?
• Data Structure defines the Data computing architecture
• It defines API
• It defines Data Storage
• It defines Computing method
• What would you do if the data structure looked like the one below?
data["resource_id"]=”Some Server ID”
data["svc_id"]= “Some Service ID”
data["timestamp"]=str(int(time.time()))
data["statistics"]= stats
response =requests.put( url, data=json.dumps(data), )
QnA
Speaker notes
1. Cover slide
2. wordcount, need global state storage
3. complex