Shuhsi Lin
2017/06/09 at PyconTw 2017
Connect K of SMACK:
pykafka, kafka-python or ?
About Me
Data Software Engineer of EAD
in the manufacturer, Micron
Currently working with
- data and people
- Lurking in PyHug, Taipei.py and various Meetups
Shuhsi Lin
sucitw gmail.com
K in SMACK
http://datastrophic.io/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka/
https://www.linkedin.com/pulse/smack-my-bdaas-why-2017-year-big-data-goes-tom-martin
http://www.natalinobusa.com/2015/11/why-is-smack-stack-all-rage-lately.html
https://dzone.com/articles/short-interview-with-smack-tech-stack-1
https://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka
● Apache Spark: Processing Engine.
● Apache Mesos: The Container.
● Akka: The Model.
● Apache Cassandra: The Storage.
● Apache Kafka: The Broker.
Agenda
» Pipeline to streaming
» What is Apache Kafka
⋄ Overview
⋄ Architecture
⋄ Use cases
» Kafka API
⋄ Python clients
» Conclusion and More about Kafka
What we will not focus on
» Reliability and durability
⋄ Scaling, replication, guarantee
⋄ Zookeeper
» Compact log
» Administration, Configuration, Operations
» Kafka connect
» Kafka Stream
» Apache Kafka vs XXX
⋄ RabbitMQ, AWS Kinesis, GCP Pub/Sub, ActiveMQ,
ZeroMQ, Redis, and ....
What is
Stream Processing
3 Paradigms for Programming
1. Request/response
2. Batch
3. Stream processing
https://qconnewyork.com/ny2016/ny2016/presentation/large-scale-stream-processing-apache-kafka.html
Request/response
Batch
Stream Processing
What is stream processing?
» Data comes from the rise of events
(orders, sales, clicks or trades)
» Databases are event streams
⋄ the process of creating a backup or standby copy
of a database
⋄ publishing the database changes
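The "databases are event streams" idea can be sketched in a few lines: replaying an ordered list of change events rebuilds the table, which is roughly what a standby copy does. The event format (`op`, `key`, `value`) is invented here for illustration and is not taken from any particular database.

```python
def apply_change(table, event):
    """Apply one change event to a dict that stands in for a table."""
    if event["op"] == "upsert":
        table[event["key"]] = event["value"]
    elif event["op"] == "delete":
        table.pop(event["key"], None)
    return table


def replay(changelog):
    """Rebuild the table state by applying the events in order."""
    table = {}
    for event in changelog:
        apply_change(table, event)
    return table


# Hypothetical change events published by a source database.
changelog = [
    {"op": "upsert", "key": "order-1", "value": "created"},
    {"op": "upsert", "key": "order-2", "value": "created"},
    {"op": "upsert", "key": "order-1", "value": "paid"},
    {"op": "delete", "key": "order-2"},
]

replica = replay(changelog)
print(replica)  # {'order-1': 'paid'}
```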
Data pipeline
https://www.linkedin.com/pulse/data-pipeline-hadoop-part-1-2-birender-saini
What often happens in a
complex data pipeline
● Complexity meant that the data
was always unreliable
● Reports were untrustworthy,
● Derived indexes and stores were
questionable
● Everyone spent a lot of time
battling data quality issues of
all kinds.
● Data discrepancy
Data pipeline
Data streaming
Apache Kafka 101
The name, “Kafka”, came from?
https://www.quora.com/What-is-the-relation-between-Kafka-the-writer-and-Apache-Kafka-the-distributed-messaging-system
http://slideplayer.com/slide/4221536/
https://en.wikipedia.org/wiki/Franz_Kafka
What is Apache Kafka?
Apache Kafka is a distributed system designed for streams. It is built to be
fault-tolerant, high-throughput, horizontally scalable, and allows geographically
distributing data streams and processing.
https://kafka.apache.org
Why Apache Kafka
Fast
Scalable
Durable
Distributed
https://pixabay.com/photo-2135057/
Stream data platform (Original mechanism)
https://www.confluent.io/blog/stream-data-platform-1/
Integration mechanism between systems
Kafka as a service
https://www.confluent.io/
What a streaming data platform can provide
» “Data integration” (ETL)
⋄ How to transport data between systems
⋄ Captures streams of events or data changes and
feeds these to other data systems
» “Stream processing” (messaging)
⋄ Continuous, real-time processing and
transformation of these streams and makes the
results available system-wide.
various systems in LinkedIn
https://www.confluent.io/blog/stream-data-platform-1/
Analytical data processing with very low latency
Kafka terminology
» Producer
» Consumer
⋄ Consumer group
⋄ offset
» Broker
» Topic
» Partition
» Message
» Replica
What Kafka Does
Publish & subscribe
● to streams of data like a messaging system
Process
● streams of data efficiently and in real time
Store
● streams of data safely in a distributed replicated cluster
https://kafka.apache.org/
Publish/Subscribe
P14 at https://www.slideshare.net/lucasjellema/amis-sig-introducing-apache-kafka-scalable-reliable-event-bus-message-queue
P15 at https://www.slideshare.net/rahuldausa/real-time-analytics-with-apache-kafka-and-apache-spark
[Figure: who updates the consumer offset in v0.8 versus v0.10 (smart consumer); the diagram labels ports 2181 and 9092]
A modern stream-centric data architecture built around Apache Kafka
https://www.confluent.io/blog/stream-data-platform-1/
500 billion events per day
The key abstraction in Kafka is a
structured commit log of updates
append records to this log
https://www.confluent.io/blog/stream-data-platform-1/
Each of these data consumers
has its own position in the log
and advances independently.
This allows a reliable, ordered stream of updates
to be distributed to each consumer.
The log can be sharded and spread
over a cluster of machines, and
each shard is replicated for
fault-tolerance.
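The log abstraction above can be sketched as a toy in-memory class: records are only appended, and every consumer keeps its own position and advances independently. `CommitLog` and the consumer names are hypothetical; this is not a Kafka API and it ignores sharding and replication.

```python
class CommitLog:
    """Toy append-only log with one read position per consumer."""

    def __init__(self):
        self._records = []
        self._positions = {}  # consumer name -> next offset to read

    def append(self, record):
        """Append a record and return the offset it was stored at."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, consumer, max_records=1):
        """Return the next records for this consumer and advance its position."""
        start = self._positions.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._positions[consumer] = start + len(batch)
        return batch


log = CommitLog()
for update in ["a=1", "b=2", "a=3"]:
    log.append(update)

# Two consumers read the same log at their own pace.
fast = log.read("search-index", max_records=3)
slow = log.read("cache", max_records=1)
print(fast)  # ['a=1', 'b=2', 'a=3']
print(slow)  # ['a=1']
```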
consumers
producers
parallel, ordered consumption
(important to a change capture system
for database updates)
TBs of data
Topics and Partitions
» Topics are split into partitions
» Partitions are strongly ordered & immutable
» Partitions can exist on different servers
» Partitions enable scalability
» Producers assign a message to a partition within the topic
⋄ Either round robin ( simply to balance load)
⋄ or according to the keys
https://kafka.apache.org/documentation/#gettingStarted
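The two assignment strategies (round robin, or by key) can be sketched as follows. `ToyPartitioner` is a hypothetical name, and CRC32 is used only for illustration; real clients use their own hash functions, so the partition numbers will not match a real cluster.

```python
import itertools
import zlib


class ToyPartitioner:
    """Pick a partition by key hash, or round robin when there is no key."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = itertools.count()

    def partition(self, key=None):
        if key is None:
            # No key: spread messages evenly to balance load.
            return next(self._counter) % self.num_partitions
        # Same key always maps to the same partition, which preserves
        # per-key ordering. CRC32 is an illustrative stand-in hash.
        return zlib.crc32(key) % self.num_partitions


partitioner = ToyPartitioner(num_partitions=4)
keyless = [partitioner.partition() for _ in range(5)]
print(keyless)  # [0, 1, 2, 3, 0]
same = partitioner.partition(b"user-42") == partitioner.partition(b"user-42")
print(same)  # True
```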
Offsets
» Messages are assigned an offset in the partition
» Consumers track their position with (offset, partition, topic)
https://kafka.apache.org/documentation/#gettingStarted
A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups
Consumers and Partitions
» A consumer group consumes one topic
» A partition is always sent to the same consumer instance
https://kafka.apache.org/documentation/#gettingStarted
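How four partitions might be shared inside each consumer group can be sketched like this. The function and the consumer names (C1 to C6) are illustrative, and the simple round-robin rule is an assumption; Kafka's real assignors are configurable and more involved.

```python
def assign_partitions(partitions, consumers):
    """Give every partition to exactly one consumer of the group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]

# Each group receives all partitions; within a group they are split up.
group_a = assign_partitions(partitions, ["C1", "C2"])
group_b = assign_partitions(partitions, ["C3", "C4", "C5", "C6"])
print(group_a)  # {'C1': ['P0', 'P2'], 'C2': ['P1', 'P3']}
print(group_b)  # {'C3': ['P0'], 'C4': ['P1'], 'C5': ['P2'], 'C6': ['P3']}
```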
Consumer
● Messages are available to consumers only when they have been
committed
● Kafka does not push
○ Unlike JMS
● Reads by consumers do not destroy messages
○ Unlike JMS Topic
● (some) History available
○ Offline consumers can catch up
○ Consumers can re-consume from the past
● Delivery Guarantees
○ Ordering maintained
○ At-least-once (per consumer) by default; at-most-once and exactly-once can be
implemented
P11 at https://www.slideshare.net/lucasjellema/amis-sig-introducing-apache-kafka-scalable-reliable-event-bus-message-queue
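The difference between at-least-once and at-most-once comes down to whether the offset is committed after or before processing. The simulation below is a sketch with no Kafka involved; `run`, `run_with_restart` and the crash point are hypothetical helpers.

```python
def run(messages, committed, commit_first, crash_at=None):
    """Consume from the committed offset; optionally crash between the two steps."""
    processed = []
    for offset in range(committed, len(messages)):
        if commit_first:
            committed = offset + 1               # at-most-once: commit first
        else:
            processed.append(messages[offset])   # at-least-once: process first
        if offset == crash_at:
            break                                # simulated crash between the steps
        if commit_first:
            processed.append(messages[offset])
        else:
            committed = offset + 1
    return processed, committed


def run_with_restart(messages, commit_first, crash_at):
    """Crash once, then restart from the last committed offset."""
    first, committed = run(messages, 0, commit_first, crash_at)
    second, _ = run(messages, committed, commit_first)
    return first + second


messages = ["m0", "m1", "m2"]
at_least_once = run_with_restart(messages, commit_first=False, crash_at=1)
at_most_once = run_with_restart(messages, commit_first=True, crash_at=1)
print(at_least_once)  # ['m0', 'm1', 'm1', 'm2']  (m1 is processed twice)
print(at_most_once)   # ['m0', 'm2']              (m1 is lost)
```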
ZooKeeper: the coordination interface
between the Kafka broker and consumers
https://hortonworks.com/hadoop-tutorial/realtime-event-processing-nifi-kafka-storm/#section_3
» Stores configuration data for distributed services
» Used primarily by brokers
» Used by consumers in 0.8 but not 0.9
Apache Kafka timeline
2010: Creation in LinkedIn
2011-Nov: Apache Software Foundation incubator
2013-Nov: v0.8 (New Producer, Reassign-partitions)
2014: Confluent
2015-Nov: v0.9 (Kafka Connect, Security, New Consumer)
2016-May: v0.10 (Kafka Stream, rack awareness)
v0.10.2: Single Message Transforms for Kafka Connect
Next version
TLS connection
SSL is supported only for the new Kafka Producer and Consumer (Kafka versions 0.9.0 and higher)
http://kafka.apache.org/documentation.html#security_ssl
http://docs.confluent.io/current/kafka/ssl.html
http://maximilianchrist.com/blog/connect-to-apache-kafka-from-python-using-ssl
https://github.com/edenhill/librdkafka/wiki/Using-SSL-with-librdkafka
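A TLS connection from kafka-python might be configured as in the sketch below. The host, port and file names are placeholders, and the helper function names are hypothetical; `security_protocol`, `ssl_cafile`, `ssl_certfile` and `ssl_keyfile` are keyword arguments of kafka-python's `KafkaProducer`.

```python
def ssl_producer_config(bootstrap_servers, cafile, certfile, keyfile):
    """Build keyword arguments for kafka.KafkaProducer over TLS."""
    return {
        "bootstrap_servers": bootstrap_servers,
        "security_protocol": "SSL",
        "ssl_cafile": cafile,      # CA certificate used to verify the broker
        "ssl_certfile": certfile,  # client certificate, if the broker asks for one
        "ssl_keyfile": keyfile,    # private key of the client certificate
    }


def make_ssl_producer(config):
    """Create the producer; needs kafka-python and a reachable TLS listener."""
    from kafka import KafkaProducer  # imported lazily so the sketch loads without it
    return KafkaProducer(**config)


config = ssl_producer_config(
    "broker.example.com:9093",             # placeholder host and port
    "ca.pem", "client.pem", "client.key",  # placeholder file names
)
print(config["security_protocol"])  # SSL
```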
Apache Kafka is considered as:
Stream data platform
» Commit log service
» Messaging system
» circular buffer
Cons of Apache Kafka
» Consumer Complexity (smart, but poor client)
» Lack of tooling/monitoring (3rd party)
» Still pre 1.0 release
» Operationally, it’s more manual than desired
» Requires ZooKeeper
Sep 26, 2015 http://www.slideshare.net/jimplush/introduction-to-apache-kafka-53225326
Use Cases
» Website Activity Tracking
» Log Aggregation
» Stream Processing
» Event Sourcing
» Commit logs
» Metrics (Performance index streaming)
⋄ CPU/IO/Memory usage
⋄ Application Specific:
⋄ Time taken to load a web-page
⋄ Time taken to build a web-page
⋄ No. of requests
⋄ No. of hits on a particular page/url
Event-driven Applications
» How Kafka is first adopted and how its role
evolves over time in an architecture.
https://aws.amazon.com/tw/kafka/
https://www.slideshare.net/ConfluentInc/iot-data-platforms-processing-iot-data-with-apache-kafka
Conceptual Reference Architecture
for Real-Time Processing in HDP 2.2
https://hortonworks.com/blog/storm-kafka-together-real-time-data-refinery/ February 12, 2015
Event delivery system design in Spotify
https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/
Case: Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spark Streaming
http://helenaedelson.com/?p=1186 (2016/03)
2 + 2 Core APIs
Four Core APIs
» Producer API
» Consumer API
» Connect API
» Streams API
» Legacy APIs
$ cat < in.txt | grep "python" | tr a-z A-Z > out.txt
https://www.slideshare.net/ConfluentInc/apache-kafkaa-distributed-streaming-platform
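The shell pipeline analogy can be mirrored with Python generators, where each stage consumes a stream and yields a new one. This is only a sketch of the idea in plain Python, not the Streams API; the function names and sample lines are made up.

```python
def grep(pattern, lines):
    """Keep only the lines that contain the pattern (like grep)."""
    for line in lines:
        if pattern in line:
            yield line


def upper(lines):
    """Upper-case every line (like tr a-z A-Z)."""
    for line in lines:
        yield line.upper()


# Stand-in for the contents of in.txt.
source = ["i like python", "kafka is a log", "python clients"]

# Stages are chained lazily; nothing runs until the result is consumed.
result = list(upper(grep("python", source)))
print(result)  # ['I LIKE PYTHON', 'PYTHON CLIENTS']
```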
Kafka Clients
» Java (officially maintained)
» C/C++ (librdkafka)
» Go (AKA golang)
» Erlang
» .NET
» Clojure
» Ruby
» Node.js
» Proxy (HTTP REST, etc)
» Perl
» stdin/stdout
» PHP
» Rust
» Alternative Java
» Storm
» Scala DSL
» Clojure
https://cwiki.apache.org/confluence/display/KAFKA/Clients
» Python
⋄ Confluent-kafka-python
⋄ Kafka-python
⋄ pykafka
Kafka Clients survey
https://www.confluent.io/blog/first-annual-state-apache-kafka-client-use-survey (February 14, 2017)
How users choose a Kafka client
Kafka Client: Language Adoption
Results from 187 responses
Reliability:
● Stability should be
priority
● Good error handling
● Good testing
● Good metrics and logging
3rd
Create your own Kafka broker
https://github.com/Landoop/fast-data-dev
See your brokers and topics
● Kafka-topics-ui
○ Demo http://kafka-topics-ui.landoop.com/#/
● Kafka-connect-ui
○ Demo http://kafka-connect-ui.landoop.com/
● Kafka-manager (yahoo)
● Kafka Eagle
● kafka-offset-monitor
Kafka Tool (GUI)
https://www.datadoghq.com/
Kafka Tool
Kafka UI(landoop)
2 + 2 Core APIs
And python clients
Kafka API Documents
https://kafka.apache.org/0102/javadoc/index.html?
Apache Kafka client for Python
» Pykafka
» kafka-python
» Confluent-kafka-python
» Librdkafka
⋄ The Apache Kafka C/C++ library
Pykafka
https://github.com/Parsely/pykafka
http://pykafka.readthedocs.io/en/latest/
» Similar level of abstraction
to the JVM Kafka client
» Optional C extension built on librdkafka
https://blog.parse.ly/post/3886/pykafka-now/ (2016,June)
kafka-python
https://github.com/dpkp/kafka-python/
http://kafka-python.readthedocs.io/
API
● Producer
● Consumer
● Message
● TopicPartition
● KafkaError
● KafkaException
● kafka-python is designed to function
much like the official java client,
with a sprinkling of pythonic
interfaces.
Confluent-kafka-python
Confluent's Python client for Apache Kafka and
the Confluent Platform.
Features:
● High performance
⋄ librdkafka
● Reliability
● Supported
● Future proof
https://github.com/confluentinc/confluent-kafka-python
http://docs.confluent.io/current/clients/confluent-kafka-python/index.html?
Producer API (JAVA)
https://kafka.apache.org/0102/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html
https://www.tutorialspoint.com/apache_kafka/apache_kafka_simple_producer_example.htm
● KafkaProducer – Sync and Async
○ close()
○ flush()
○ metrics()
○ partitionsFor( topic)
○ send(ProducerRecord<K,V> record)
Writing data to Kafka: A client that publishes records to the Kafka cluster.
Class KafkaProducer<K,V>
Class ProducerRecord<K,V>
● ProducerRecord( topic, V value)
● ProducerRecord( topic, Integer partition, K key, V value)
A key/value pair to be sent to Kafka.
Configuration Settings
(configuration is externalized in a property file)
● client.id
● producer.type
● acks
● retries
● bootstrap.servers
● linger.ms
● key.serializer
● value.serializer
● batch.size
● buffer.memory
messages
Producer API -Pykafka
from pykafka import KafkaClient
from settings import ….
client = KafkaClient(hosts=bootstrap_servers)
topic = client.topics[topic.encode('UTF-8')]
producer = topic.get_producer(use_rdkafka=use_rdkafka)
producer.produce(msg_payload)
producer.stop() # Will flush background queue
Class pykafka.producer.Producer()
Class pykafka.topic.Topic(cluster, topic_metadata)
http://pykafka.readthedocs.io/en/latest/api/producer.html
● produce(msg, partition_key=None)
● stop()
● get_producer(use_rdkafka=False,
**kwargs)
Performance assessment
https://blog.parse.ly/post/3886/pykafka-now/
Must be type bytes, or be
serializable to bytes via
configured value_serializer.
Producer API -Kafka-Python
from kafka import KafkaConsumer, KafkaProducer
from settings import BOOTSTRAP_SERVERS, TOPICS, MSG
p = KafkaProducer(bootstrap_servers=BOOTSTRAP_SERVERS)
p.send(TOPICS, MSG.encode('utf-8'))
p.flush()
Class kafka.KafkaProducer(**configs)
https://kafka-python.readthedocs.io/en/master/_modules/kafka/producer/kafka.html#KafkaProducer
● close(timeout=None)
● flush(timeout=None)
● partitions_for(topic)
● send(topic, value=None, key=None,
partition=None, timestamp_ms=None)
http://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html
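Building on `send(topic, value=None, key=None, ...)`, a keyed JSON message could be sent roughly as follows. The helper names are hypothetical, and the broker-facing function is not called here because it needs kafka-python and a running broker.

```python
import json


def encode_value(value):
    """Serialize a Python object to UTF-8 JSON bytes."""
    return json.dumps(value, sort_keys=True).encode("utf-8")


def send_event(bootstrap_servers, topic, key, event):
    """Send one keyed JSON event and wait for the broker acknowledgement."""
    from kafka import KafkaProducer  # lazy import: requires kafka-python

    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        key_serializer=lambda k: k.encode("utf-8"),
        value_serializer=encode_value,
    )
    future = producer.send(topic, key=key, value=event)
    metadata = future.get(timeout=10)  # blocks; raises if the send failed
    producer.close()
    return metadata.partition, metadata.offset


print(encode_value({"user": "u1", "action": "click"}))
# b'{"action": "click", "user": "u1"}'
```

Blocking on `future.get()` makes the send synchronous; dropping it keeps the default asynchronous behaviour at the cost of not seeing errors immediately.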
Producer API - confluent-kafka-python
from confluent_kafka import Producer
from settings import BOOTSTRAP_SERVERS, TOPICS, MSG
p = Producer({'bootstrap.servers': BOOTSTRAP_SERVERS})
p.produce(TOPICS, MSG.encode('utf-8'))
p.flush()
http://docs.confluent.io/current/clients/confluent-kafka-python/#producer
Class confluent_kafka.Producer(*kwargs)
● len()
● flush([timeout])
● poll([timeout])
● produce(topic[, value][, key][, partition][,
on_delivery][, timestamp])
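The `on_delivery` argument and `poll()` work together: the callback runs from `poll()` or `flush()` once the broker has answered. A minimal sketch, assuming confluent-kafka-python is installed and a broker is reachable when `produce_all` is actually called; the function and list names are hypothetical.

```python
delivered = []
failed = []


def delivery_report(err, msg):
    """Called once per message from poll() or flush()."""
    if err is not None:
        failed.append(str(err))
    else:
        delivered.append((msg.topic(), msg.partition(), msg.offset()))


def produce_all(bootstrap_servers, topic, payloads):
    """Produce every payload and wait until all deliveries are reported."""
    from confluent_kafka import Producer  # lazy import: requires confluent-kafka

    p = Producer({"bootstrap.servers": bootstrap_servers})
    for payload in payloads:
        p.produce(topic, payload.encode("utf-8"), on_delivery=delivery_report)
        p.poll(0)  # serve delivery callbacks for earlier messages
    p.flush()      # block until outstanding messages are delivered or fail
```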
Consumer
● Consumer group
○ group.id
○ session.timeout.ms
○ max.poll.records
○ heartbeat.interval.ms
● Offset Management
○ enable.auto.commit
○ auto.commit.interval.ms
○ auto.offset.reset
https://kafka.apache.org/documentation.html#newconsumerconfigs
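In kafka-python these settings are passed as keyword arguments with underscores instead of dots. The sketch below turns auto commit off and commits after each message is handled; the group name and helper names are hypothetical, and `consume_forever` needs a running broker.

```python
def consumer_config(bootstrap_servers, group_id):
    """Keyword arguments for kafka.KafkaConsumer with manual offset commits."""
    return {
        "bootstrap_servers": bootstrap_servers,
        "group_id": group_id,             # group.id
        "enable_auto_commit": False,      # enable.auto.commit
        "auto_offset_reset": "earliest",  # auto.offset.reset
    }


def consume_forever(topic, config, handle):
    """Handle each message, then commit its offset (at-least-once style)."""
    from kafka import KafkaConsumer  # lazy import: requires kafka-python

    consumer = KafkaConsumer(topic, **config)
    try:
        for message in consumer:
            handle(message.value)
            consumer.commit()
    finally:
        consumer.close()


config = consumer_config("localhost:9092", "demo-group")
print(config["enable_auto_commit"])  # False
```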
Consumer API (JAVA)
https://kafka.apache.org/0102/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
● assign(<TopicPartition> partitions)
● assignment()
● beginningOffsets(<TopicPartition> partitions)
● close(long timeout, TimeUnit timeUnit)
● commitAsync(Map<TopicPartition,OffsetAndMetadata> offsets,
OffsetCommitCallback callback)
● commitSync(Map<TopicPartition,OffsetAndMetadata> offsets)
● committed(TopicPartition partition)
● endOffsets(<TopicPartition> partitions)
● listTopics()
● metrics()
● offsetsForTimes(Map<TopicPartition,Long> timestampsToSearch)
● partitionsFor(topic)
● pause(<TopicPartition> partitions)
Reading data from Kafka: A client that consumes records from a Kafka cluster.
Class KafkaConsumer<K,V>
● poll(long timeout)
● position(TopicPartition partition)
● resume(<TopicPartition> partitions)
● seek(TopicPartition partition, long offset)
● seekToBeginning(<TopicPartition> partitions)
● seekToEnd(<TopicPartition> partitions)
● subscribe(topics, ConsumerRebalanceListener
listener)
● subscribe(Pattern pattern,
ConsumerRebalanceListener listener)
● subscription()
● unsubscribe()
● wakeup()
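The same subscribe and poll pattern looks like this with confluent-kafka-python. This is a sketch: `make_consumer` and `drain` are hypothetical helpers, and placing `auto.offset.reset` at the top level of the config is an assumption about the installed client version.

```python
def make_consumer(bootstrap_servers, group_id):
    """Create a confluent_kafka.Consumer; needs the library and a broker."""
    from confluent_kafka import Consumer  # lazy import

    return Consumer({
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        "auto.offset.reset": "earliest",
    })


def drain(consumer, topic, handle, max_messages):
    """Subscribe, then poll until max_messages have been handled."""
    consumer.subscribe([topic])
    seen = 0
    try:
        while seen < max_messages:
            msg = consumer.poll(1.0)  # wait up to one second
            if msg is None:
                continue              # nothing arrived in time
            if msg.error():
                print("consumer error:", msg.error())
                continue
            handle(msg.value())
            seen += 1
    finally:
        consumer.close()
    return seen
```

A call might look like `drain(make_consumer("localhost:9092", "demo-group"), "test", print, 10)`.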
Kafka shell scripts
Create a Kafka Topic
» Let's create a topic named "test" with a single partition and
only one replica:
⋄ kafka-topics.sh --create --zookeeper zhost:2181
--replication-factor 1 --partitions 1 --topic test
» See that topic
⋄ bin/kafka-topics.sh --list --zookeeper zhost:2181
bin/kafka-topics.sh
» Create, delete, describe, or change a topic.
Python Kafka Client Benchmarking
DEMO
1. http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/
2. https://github.com/sucitw/benchmark-python-client-for-kafka
http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/
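A client-agnostic timing harness along these lines can be used to compare producers. It is only a sketch and is not the code from the benchmark links above; the in-memory list stands in for a real `send` or `produce` call, and a real measurement should include the final flush in the timed region.

```python
import time


def measure_throughput(send, payload, count):
    """Call send(payload) count times and return messages per second."""
    start = time.perf_counter()
    for _ in range(count):
        send(payload)
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


# Stand-in sink; swap in e.g. a producer's send function bound to a topic.
sink = []
rate = measure_throughput(sink.append, b"x" * 100, 10_000)
print(f"{rate:,.0f} msgs/sec into an in-memory list")
```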
Python Kafka Client Benchmarking
Conclusion:
pykafka, kafka-python or ?
https://github.com/Parsely/pykafka/issues/559
More about Kafka
More about Kafka
» Reliability and durability
⋄ Scaling, replication, guarantee, Zookeeper
» Compact log
» Administration, Configuration, Operations, Monitoring
» Kafka connect
» Kafka Stream
» Schema Registry
» Rest proxy
» Apache Kafka vs XXX
⋄ RabbitMQ, AWS Kinesis, GCP Pub/Sub, ActiveMQ, ZeroMQ, Redis,
and ....
The Other 2 APIs
» Connect API
○ JDBC, HDFS, S3, ….
» Streams API
○ map, filter, aggregate, join
More references
1. The Log: What every software engineer should know about real-time data's unifying abstraction,
Jay Kreps, 2013
2. Pykafka and Kafka-python? https://github.com/Parsely/pykafka/issues/559
3. Why I am not a fan of Apache Kafka (2015-2016 Sep)
4. Kafka vs RabbitMQ
a. What are the differences between Apache Kafka and RabbitMQ?
b. Understanding When to use RabbitMQ or Apache Kafka
5. Kafka summit (2016~)
6. Future features of Kafka (Kafka Improvement Proposals)
7. Kafka- The Definitive Guide
We’re hiring
(104 link)

Contenu connexe

Tendances

Tendances (16)

Introduction to Linux-wpan and Potential Collaboration
Introduction to Linux-wpan and Potential CollaborationIntroduction to Linux-wpan and Potential Collaboration
Introduction to Linux-wpan and Potential Collaboration
 
Development Boards for Tizen IoT
Development Boards for Tizen IoTDevelopment Boards for Tizen IoT
Development Boards for Tizen IoT
 
IoT: From Arduino Microcontrollers to Tizen Products using IoTivity
IoT: From Arduino Microcontrollers to Tizen Products using IoTivityIoT: From Arduino Microcontrollers to Tizen Products using IoTivity
IoT: From Arduino Microcontrollers to Tizen Products using IoTivity
 
SOSCON 2016 JerryScript
SOSCON 2016 JerryScriptSOSCON 2016 JerryScript
SOSCON 2016 JerryScript
 
Rapid SPi Device Driver Development over USB
Rapid SPi Device Driver Development over USBRapid SPi Device Driver Development over USB
Rapid SPi Device Driver Development over USB
 
JerryScript on RIOT
JerryScript on RIOTJerryScript on RIOT
JerryScript on RIOT
 
DevCon 5 (December 2013) - WebRTC & WebSockets
DevCon 5 (December 2013) - WebRTC & WebSocketsDevCon 5 (December 2013) - WebRTC & WebSockets
DevCon 5 (December 2013) - WebRTC & WebSockets
 
Run Your Own 6LoWPAN Based IoT Network
Run Your Own 6LoWPAN Based IoT NetworkRun Your Own 6LoWPAN Based IoT Network
Run Your Own 6LoWPAN Based IoT Network
 
JerryScript: An ultra-lighteweight JavaScript Engine for the Internet of Things
JerryScript: An ultra-lighteweight JavaScript Engine for the Internet of ThingsJerryScript: An ultra-lighteweight JavaScript Engine for the Internet of Things
JerryScript: An ultra-lighteweight JavaScript Engine for the Internet of Things
 
Introduction to IoT.JS
Introduction to IoT.JSIntroduction to IoT.JS
Introduction to IoT.JS
 
zebra & openconfigd Introduction
zebra & openconfigd Introductionzebra & openconfigd Introduction
zebra & openconfigd Introduction
 
Advanced Kurento Real Time Media Stream Processing
Advanced Kurento Real Time Media Stream ProcessingAdvanced Kurento Real Time Media Stream Processing
Advanced Kurento Real Time Media Stream Processing
 
WPEWebKit, the WebKit port for embedded platforms (Linaro Connect San Diego 2...
WPEWebKit, the WebKit port for embedded platforms (Linaro Connect San Diego 2...WPEWebKit, the WebKit port for embedded platforms (Linaro Connect San Diego 2...
WPEWebKit, the WebKit port for embedded platforms (Linaro Connect San Diego 2...
 
OpenStack Neutron IPv6 Lessons
OpenStack Neutron IPv6 LessonsOpenStack Neutron IPv6 Lessons
OpenStack Neutron IPv6 Lessons
 
HPNFVの取組みとMWC2015 – OpenStack最新情報セミナー 2015年4月
HPNFVの取組みとMWC2015 – OpenStack最新情報セミナー 2015年4月HPNFVの取組みとMWC2015 – OpenStack最新情報セミナー 2015年4月
HPNFVの取組みとMWC2015 – OpenStack最新情報セミナー 2015年4月
 
Embedded Recipes 2019 - Remote update adventures with RAUC, Yocto and Barebox
Embedded Recipes 2019 - Remote update adventures with RAUC, Yocto and BareboxEmbedded Recipes 2019 - Remote update adventures with RAUC, Yocto and Barebox
Embedded Recipes 2019 - Remote update adventures with RAUC, Yocto and Barebox
 

En vedette

En vedette (9)

Python and test
Python and testPython and test
Python and test
 
From Big to Fast Data. How #kafka and #kafka-connect can redefine you ETL and...
From Big to Fast Data. How #kafka and #kafka-connect can redefine you ETL and...From Big to Fast Data. How #kafka and #kafka-connect can redefine you ETL and...
From Big to Fast Data. How #kafka and #kafka-connect can redefine you ETL and...
 
Athens BigData Meetup - Sept 17
Athens BigData Meetup - Sept 17Athens BigData Meetup - Sept 17
Athens BigData Meetup - Sept 17
 
London Apache Kafka Meetup (Jan 2017)
London Apache Kafka Meetup (Jan 2017)London Apache Kafka Meetup (Jan 2017)
London Apache Kafka Meetup (Jan 2017)
 
Kafka Tutorial: Streaming Data Architecture
Kafka Tutorial: Streaming Data ArchitectureKafka Tutorial: Streaming Data Architecture
Kafka Tutorial: Streaming Data Architecture
 
Kafka Tutorial - DevOps, Admin and Ops
Kafka Tutorial - DevOps, Admin and OpsKafka Tutorial - DevOps, Admin and Ops
Kafka Tutorial - DevOps, Admin and Ops
 
Kafka Tutorial Advanced Kafka Consumers
Kafka Tutorial Advanced Kafka ConsumersKafka Tutorial Advanced Kafka Consumers
Kafka Tutorial Advanced Kafka Consumers
 
Laying down the smack on your data pipelines
Laying down the smack on your data pipelinesLaying down the smack on your data pipelines
Laying down the smack on your data pipelines
 
Kafka Tutorial: Advanced Producers
Kafka Tutorial: Advanced ProducersKafka Tutorial: Advanced Producers
Kafka Tutorial: Advanced Producers
 

Similaire à Connect K of SMACK:pykafka, kafka-python or?

Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Timothy Spann
 
OSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming AppsOSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming Apps
Timothy Spann
 

Similaire à Connect K of SMACK:pykafka, kafka-python or? (20)

Apache Kafka with Spark Streaming: Real-time Analytics Redefined
Apache Kafka with Spark Streaming: Real-time Analytics RedefinedApache Kafka with Spark Streaming: Real-time Analytics Redefined
Apache Kafka with Spark Streaming: Real-time Analytics Redefined
 
Building Event-Driven Systems with Apache Kafka
Building Event-Driven Systems with Apache KafkaBuilding Event-Driven Systems with Apache Kafka
Building Event-Driven Systems with Apache Kafka
 
IMC Summit 2016 Breakout - Roman Shtykh - Apache Ignite as a Data Processing Hub
IMC Summit 2016 Breakout - Roman Shtykh - Apache Ignite as a Data Processing HubIMC Summit 2016 Breakout - Roman Shtykh - Apache Ignite as a Data Processing Hub
IMC Summit 2016 Breakout - Roman Shtykh - Apache Ignite as a Data Processing Hub
 
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksOverview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
 
Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksOverview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
 
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksOverview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
Current and Future of Apache Kafka
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache Kafka
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
Stream Processing with Apache Kafka and .NET
Stream Processing with Apache Kafka and .NETStream Processing with Apache Kafka and .NET
Stream Processing with Apache Kafka and .NET
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
 
Cloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azureCloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azure
 
Event Hub (i.e. Kafka) in Modern Data Architecture
Event Hub (i.e. Kafka) in Modern Data ArchitectureEvent Hub (i.e. Kafka) in Modern Data Architecture
Event Hub (i.e. Kafka) in Modern Data Architecture
 
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraReal-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
 
OSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming AppsOSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming Apps
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
 
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache AccumuloReal-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
 

Dernier

CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Dernier (20)

Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf

Connect K of SMACK: pykafka, kafka-python or ?

  • 1. Shuhsi Lin 2017/06/09 at PyconTw 2017 Connect K of SMACK: pykafka, kafka-python or ?
  • 2. About Me Data Software Engineer of EAD in the manufacturer, Micron Currently working with - data and people - Lurking in PyHug, Taipei.py and various Meetups Shuhsi Lin sucitw gmail.com
  • 5. Agenda » Pipeline to streaming » What is Apache Kafka ⋄ Overview ⋄ Architecture ⋄ Use cases » Kafka API ⋄ Python clients » Conclusion and More about Kafka
  • 6. What we will not focus on » Reliability and durability ⋄ Scaling, replication, guarantee ⋄ Zookeeper » Compact log » Administration, Configuration, Operations » Kafka connect » Kafka Stream » Apache Kafka vs XXX ⋄ RabbitMQ, AWS Kinesis, GCP Pub/Sub, ActiveMQ, ZeroMQ, Redis, and ....
  • 8. 3 Paradigms for Programming 1. Request/response 2. Batch 3. Stream processing https://qconnewyork.com/ny2016/ny2016/presentation/large-scale-stream-processing-apache-kafka.html
  • 10. Batch
  • 12. What is stream processing » Data arises from events (orders, sales, clicks or trades) » Databases are event streams ⋄ the process of creating a backup or standby copy of a database ⋄ publishing the database changes
  • 13. Data pipeline https://www.linkedin.com/pulse/data-pipeline-hadoop-part-1-2-birender-saini What often happens in a complex data pipeline ● Complexity meant that the data was always unreliable ● Reports were untrustworthy ● Derived indexes and stores were questionable ● Everyone spent a lot of time battling data quality issues of all kinds ● Data discrepancy
  • 16. The name, “Kafka”, came from? https://www.quora.com/What-is-the-relation-between-Kafka-the-writer-and-Apache-Kafka-the-distributed-messaging-system http://slideplayer.com/slide/4221536/ https://en.wikipedia.org/wiki/Franz_Kafka
  • 17. What is Apache Kafka? Apache Kafka is a distributed system designed for streams. It is built to be fault-tolerant, high-throughput, horizontally scalable, and allows geographically distributing data streams and processing. https://kafka.apache.org
  • 20. Stream data platform (Original mechanism) https://www.confluent.io/blog/stream-data-platform-1/ Integration mechanism between systems
  • 21. Kafka as a service https://www.confluent.io/
  • 22. What a streaming data platform can provide » “Data integration” (ETL) ⋄ How to transport data between systems ⋄ Captures streams of events or data changes and feeds these to other data systems » “Stream processing” (messaging) ⋄ Continuous, real-time processing and transformation of these streams and makes the results available system-wide. various systems in LinkedIn https://www.confluent.io/blog/stream-data-platform-1/ Analytical data processing with very low latency
  • 23. Kafka terminology » Producer » Consumer ⋄ Consumer group ⋄ Offset » Broker » Topic » Partition » Message » Replica
  • 24. What Kafka Does Publish & subscribe ● to streams of data like a messaging system Process ● streams of data efficiently and in real time Store ● streams of data safely in a distributed replicated cluster https://kafka.apache.org/
  • 27. A modern stream-centric data architecture built around Apache Kafka https://www.confluent.io/blog/stream-data-platform-1/ 500 billion events per day
  • 28. The key abstraction in Kafka is a structured commit log of updates. Producers append records to this log. Each of these data consumers has its own position in the log and advances independently. This allows a reliable, ordered stream of updates to be distributed to each consumer. The log can be sharded and spread over a cluster of machines, and each shard is replicated for fault-tolerance. Parallel, ordered consumption is important to a change capture system for database updates. The log can hold TBs of data. https://www.confluent.io/blog/stream-data-platform-1/
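The commit-log idea on this slide can be sketched in a few lines of plain Python. This is only an in-memory illustration, not Kafka code: the names `CommitLog` and `LogReader` are hypothetical, and there is no sharding or replication. It shows that records are only appended, and that each reader keeps its own position and advances independently.

```python
class CommitLog:
    """Append-only log: records are only ever added at the end."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return its offset (its index in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at offset."""
        return self._records[offset:offset + max_records]


class LogReader:
    """A consumer that tracks its own position in a shared log."""

    def __init__(self, log):
        self._log = log
        self.position = 0

    def poll(self, max_records=10):
        """Read the next batch and advance this reader's position only."""
        batch = self._log.read(self.position, max_records)
        self.position += len(batch)
        return batch


log = CommitLog()
for event in ["order-1", "order-2", "order-3"]:
    log.append(event)

fast = LogReader(log)
slow = LogReader(log)

fast_batch = fast.poll()               # reads all three records
slow_batch = slow.poll(max_records=1)  # reads only the first record
```

Reading does not remove anything from the log, so the slow reader can catch up later from its own position.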
  • 29. Topics and Partitions » Topics are split into partitions » Partitions are strongly ordered & immutable » Partitions can exist on different servers » Partitions enable scalability » Producers assign a message to a partition within the topic ⋄ Either round robin (simply to balance load) ⋄ or according to the keys https://kafka.apache.org/documentation/#gettingStarted
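The two assignment strategies on this slide can be sketched as follows. This is an illustration under assumptions, not the partitioner of any real client: `make_partitioner` is a hypothetical helper, and `zlib.crc32` stands in for whatever hash function a given client actually uses.

```python
import itertools
import zlib


def make_partitioner(num_partitions):
    """Return a function that picks a partition for each message."""
    round_robin = itertools.cycle(range(num_partitions))

    def choose_partition(key=None):
        if key is None:
            # No key: rotate through partitions simply to balance load.
            return next(round_robin)
        # With a key: hash it, so the same key always lands in the same
        # partition (crc32 is a stand-in hash chosen for this sketch).
        return zlib.crc32(key) % num_partitions

    return choose_partition


choose = make_partitioner(num_partitions=4)
unkeyed = [choose() for _ in range(6)]
keyed = [choose(b"user-42") for _ in range(3)]
```

Keyed assignment is what preserves per-key ordering: all messages for `user-42` go to one partition, and a partition is strongly ordered.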
  • 30. Offsets » Messages are assigned an offset in the partition » Consumers track their position with (offset, partition, topic) https://kafka.apache.org/documentation/#gettingStarted A two-server Kafka cluster hosting four partitions (P0-P3) with two consumer groups
  • 31. Consumers and Partitions » A consumer group consumes a topic as one logical subscriber » Within a consumer group, a partition is always sent to the same consumer instance https://kafka.apache.org/documentation/#gettingStarted
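A minimal sketch of how partitions might be spread over the members of a consumer group. The function `assign_partitions` and the consumer names are hypothetical, and the simple round-robin rule is an assumption for illustration; real clients offer several assignment strategies.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group.

    Each partition goes to exactly one consumer in the group;
    a consumer may own several partitions.
    """
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    ordered = sorted(consumers)
    assignment = {consumer: [] for consumer in ordered}
    for index, partition in enumerate(sorted(partitions)):
        owner = ordered[index % len(ordered)]
        assignment[owner].append(partition)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]
group_a = assign_partitions(partitions, ["a1", "a2"])
group_b = assign_partitions(partitions, ["b1", "b2", "b3", "b4"])
```

Each group receives every partition, so two groups read the same topic independently, while inside a group the partitions are divided among its members.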
  • 32. Consumer ● Messages are available to consumers only when they have been committed ● Kafka does not push ○ Unlike JMS ● Reads by consumers do not destroy messages ○ Unlike a JMS Topic ● (some) History available ○ Offline consumers can catch up ○ Consumers can re-consume from the past ● Delivery Guarantees ○ Ordering maintained ○ At-least-once (per consumer) by default; at-most-once and exactly-once can be implemented P11 at https://www.slideshare.net/lucasjellema/amis-sig-introducing-apache-kafka-scalable-reliable-event-bus-message-queue
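The difference between at-least-once and at-most-once comes down to whether the consumer commits its offset after or before processing a message. The simulation below is a sketch in plain Python with hypothetical helper names (`consume`, `make_flaky_processor`, `run_until_done`); a `RuntimeError` stands in for a consumer crash, and a restart resumes from the last committed offset.

```python
def consume(log, committed, process, commit_first):
    """Consume from the committed offset; return the new committed offset.

    commit_first=True  -> at-most-once  (commit, then process)
    commit_first=False -> at-least-once (process, then commit)
    """
    offset = committed
    while offset < len(log):
        if commit_first:
            committed = offset + 1
        try:
            process(log[offset])
        except RuntimeError:
            return committed  # simulated crash: restart resumes from here
        committed = offset + 1
        offset += 1
    return committed


def make_flaky_processor(seen, fail_on):
    """Record each message in `seen`, but crash once when `fail_on` arrives."""
    state = {"failed": False}

    def process(message):
        if message == fail_on and not state["failed"]:
            state["failed"] = True
            raise RuntimeError("simulated crash")
        seen.append(message)

    return process


def run_until_done(log, process, commit_first):
    """Restart the consumer after every crash until the log is consumed."""
    committed = 0
    while committed < len(log):
        committed = consume(log, committed, process, commit_first)
    return committed


log = ["a", "b", "c"]
at_least, at_most = [], []
run_until_done(log, make_flaky_processor(at_least, "b"), commit_first=False)
run_until_done(log, make_flaky_processor(at_most, "b"), commit_first=True)
```

With commit-after-processing nothing is lost, because "b" is read again after the restart. With commit-before-processing "b" is skipped. In this sketch the crash happens before the work is recorded; a crash after the work but before the commit would instead produce a duplicate under at-least-once.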
  • 33. ZooKeeper: the coordination interface between the Kafka broker and consumers https://hortonworks.com/hadoop-tutorial/realtime-event-processing-nifi-kafka-storm/#section_3 » Stores configuration data for distributed services » Used primarily by brokers » Used by consumers in 0.8 but not 0.9
  • 35. Apache Kafka timeline » 2010: Creation in LinkedIn » 2011-Nov: Apache Software Foundation incubator » 2013-Nov: v0.8 (New Producer, Reassign-partitions) » 2014: Confluent » 2015-Nov: v0.9 (Kafka Connect, Security, New Consumer) » 2016-May: v0.10 (Kafka Stream, rack awareness) » v0.10.2: Single Message Transforms for Kafka Connect » Next version
  • 36. TLS connection SSL is supported only for the new Kafka Producer and Consumer (Kafka versions 0.9.0 and higher) http://kafka.apache.org/documentation.html#security_ssl http://docs.confluent.io/current/kafka/ssl.html http://maximilianchrist.com/blog/connect-to-apache-kafka-from-python-using-ssl https://github.com/edenhill/librdkafka/wiki/Using-SSL-with-librdkafka
  • 37. Apache Kafka is considered as: » Stream data platform » Commit log service » Messaging system » Circular buffer
  • 38. Cons of Apache Kafka » Consumer Complexity (smart, but poor client) » Lack of tooling/monitoring (3rd party) » Still pre 1.0 release » Operationally, it’s more manual than desired » Requires ZooKeeper (Sep 26, 2015) http://www.slideshare.net/jimplush/introduction-to-apache-kafka-53225326
  • 39. Use Cases » Website Activity Tracking » Log Aggregation » Stream Processing » Event Sourcing » Commit logs » Metrics (Performance index streaming) ⋄ CPU/IO/Memory usage ⋄ Application Specific: ⋄ Time taken to load a web-page ⋄ Time taken to build a web-page ⋄ No. of requests ⋄ No. of hits on a particular page/url
  • 40. Event-driven Applications » How Kafka is first adopted and how its role evolves over time in an architecture. https://aws.amazon.com/tw/kafka/
  • 42. Conceptual Reference Architecture for Real-Time Processing in HDP 2.2 https://hortonworks.com/blog/storm-kafka-together-real-time-data-refinery/ February 12, 2015
  • 43. Event delivery system design in Spotify https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/
  • 44. Case: Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spark Streaming http://helenaedelson.com/?p=1186 (2016/03)
  • 45. 2 + 2 Core APIs
  • 46. Four Core APIs » Producer API » Consumer API » Connect API » Streams API » Legacy APIs $ cat < in.txt | grep "python" | tr a-z A-Z > out.txt https://www.slideshare.net/ConfluentInc/apache-kafkaa-distributed-streaming-platform
  • 47. Kafka Clients » Java (officially maintained) » C/C++ (librdkafka) » Go (AKA golang) » Erlang » .NET » Clojure » Ruby » Node.js » Proxy (HTTP REST, etc) » Perl » stdin/stdout » PHP » Rust » Alternative Java » Storm » Scala DSL » Clojure https://cwiki.apache.org/confluence/display/KAFKA/Clients » Python ⋄ Confluent-kafka-python ⋄ Kafka-python ⋄ pykafka
  • 48. Kafka Clients survey https://www.confluent.io/blog/first-annual-state-apache-kafka-client-use-survey (February 14, 2017) How users choose a Kafka client Kafka Client: Language Adoption Results from 187 responses Reliability: ● Stability should be priority ● Good error handling ● Good testing ● Good metrics and logging 3rd
  • 49. Create your own Kafka broker https://github.com/Landoop/fast-data-dev
  • 50. See your brokers and topics ● Kafka-topics-ui ○ Demo http://kafka-topics-ui.landoop.com/#/ ● Kafka-connect-ui ○ Demo http://kafka-connect-ui.landoop.com/ ● Kafka-manager (yahoo) ● Kafka Eagle ● kafka-offset-monitor Kafka Tool (GUI) https://www.datadoghq.com/
  • 53. 2 + 2 Core APIs And python clients
  • 55. Apache Kafka client for Python » Pykafka » kafka-python » Confluent-kafka-python » Librdkafka ⋄ The Apache Kafka C/C++ library
  • 56. Pykafka https://github.com/Parsely/pykafka http://pykafka.readthedocs.io/en/latest/ » Similar level of abstraction to the JVM Kafka client » Pure-Python producers and consumers, optionally backed by a C extension built on librdkafka https://blog.parse.ly/post/3886/pykafka-now/ (2016, June)
  • 57. kafka-python https://github.com/dpkp/kafka-python/ http://kafka-python.readthedocs.io/ API ● Producer ● Consumer ● Message ● TopicPartition ● KafkaError ● KafkaException ● kafka-python is designed to function much like the official java client, with a sprinkling of pythonic interfaces.
  • 58. Confluent-kafka-python Confluent's Python client for Apache Kafka and the Confluent Platform. Features: ● High performance ⋄ librdkafka ● Reliability ● Supported ● Future proof https://github.com/confluentinc/confluent-kafka-python http://docs.confluent.io/current/clients/confluent-kafka-python/index.html?
  • 59. Producer API (JAVA) Writing data to Kafka. Class KafkaProducer<K,V>: a client that publishes records to the Kafka cluster (sync and async) ○ close() ○ flush() ○ metrics() ○ partitionsFor(topic) ○ send(ProducerRecord<K,V> record). Class ProducerRecord<K,V>: a key/value pair to be sent to Kafka ● ProducerRecord(topic, V value) ● ProducerRecord(topic, Integer partition, K key, V value). Configuration settings (configuration is externalized in a property file) ● client.id ● producer.type ● acks ● retries ● bootstrap.servers ● linger.ms ● key.serializer ● value.serializer ● batch.size ● buffer.memory https://kafka.apache.org/0102/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html https://www.tutorialspoint.com/apache_kafka/apache_kafka_simple_producer_example.htm
  • 60. Producer API - Pykafka
      from pykafka import KafkaClient
      from settings import ….
      client = KafkaClient(hosts=bootstrap_servers)
      topic = client.topics[topic.encode('UTF-8')]
      producer = topic.get_producer(use_rdkafka=use_rdkafka)
      producer.produce(msg_payload)
      producer.stop()  # Will flush background queue
    Class pykafka.producer.Producer() ● produce(msg, partition_key=None) ● stop()
    Class pykafka.topic.Topic(cluster, topic_metadata) ● get_producer(use_rdkafka=False, **kwargs)
    http://pykafka.readthedocs.io/en/latest/api/producer.html
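A slightly fuller sketch of the pykafka producer, assuming pykafka is installed and a broker is reachable. The helper names `to_bytes` and `produce_with_pykafka`, and any host or topic passed to them, are hypothetical. pykafka works with bytes, so the sketch encodes topic names and payloads explicitly, and it uses the producer as a context manager so that it is stopped (and its queue flushed) on exit.

```python
def to_bytes(value, encoding="utf-8"):
    """Encode a value to bytes; bytes are returned unchanged."""
    return value if isinstance(value, bytes) else str(value).encode(encoding)


def produce_with_pykafka(hosts, topic_name, payloads, use_rdkafka=False):
    """Send each payload to the topic.

    Requires the pykafka package and a reachable broker, so the import
    is done lazily inside the function.
    """
    from pykafka import KafkaClient

    client = KafkaClient(hosts=hosts)
    topic = client.topics[to_bytes(topic_name)]
    # Leaving the with-block stops the producer, which flushes any
    # messages still waiting in its background queue.
    with topic.get_producer(use_rdkafka=use_rdkafka) as producer:
        for payload in payloads:
            producer.produce(to_bytes(payload))
```

A call might look like `produce_with_pykafka("127.0.0.1:9092", "test", ["hello", "world"])`, where the address and topic name are placeholders.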
  • 62. Producer API - Kafka-Python
      from kafka import KafkaConsumer, KafkaProducer
      from settings import BOOTSTRAP_SERVERS, TOPICS, MSG
      p = KafkaProducer(bootstrap_servers=BOOTSTRAP_SERVERS)
      p.send(TOPICS, MSG.encode('utf-8'))
      p.flush()
    The value must be type bytes, or be serializable to bytes via the configured value_serializer.
    Class kafka.KafkaProducer(**configs) ● close(timeout=None) ● flush(timeout=None) ● partitions_for(topic) ● send(topic, value=None, key=None, partition=None, timestamp_ms=None)
    https://kafka-python.readthedocs.io/en/master/_modules/kafka/producer/kafka.html#KafkaProducer
    http://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html
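Instead of calling `.encode()` on every message, kafka-python lets the producer be configured with a `value_serializer`. The sketch below assumes kafka-python is installed and a broker is reachable; `json_serializer` and `send_json` are hypothetical helper names. It also waits on the future returned by `send()` to obtain the record's topic, partition and offset.

```python
import json


def json_serializer(value):
    """Turn a JSON-compatible Python object into UTF-8 bytes."""
    return json.dumps(value, sort_keys=True).encode("utf-8")


def send_json(bootstrap_servers, topic, payload, timeout=10):
    """Send one JSON message and wait for the broker's acknowledgement.

    Requires the kafka-python package and a reachable broker, so the
    import is done lazily inside the function.
    """
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=json_serializer,
    )
    try:
        future = producer.send(topic, value=payload)
        metadata = future.get(timeout=timeout)  # blocks; raises on failure
        return metadata.topic, metadata.partition, metadata.offset
    finally:
        producer.close()


encoded = json_serializer({"event": "click", "user": 42})
```

Blocking on every send trades throughput for simplicity; for bulk sending, calling `send()` repeatedly and `flush()` once, as on the slide, is the usual pattern.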
  • 63. Producer API - Confluent-kafka-python
      from confluent_kafka import Producer
      from settings import BOOTSTRAP_SERVERS, TOPICS, MSG
      p = Producer({'bootstrap.servers': BOOTSTRAP_SERVERS})
      p.produce(TOPICS, MSG.encode('utf-8'))
      p.flush()
    Class confluent_kafka.Producer(*kwargs) ● len() ● flush([timeout]) ● poll([timeout]) ● produce(topic[, value][, key][, partition][, on_delivery][, timestamp])
    http://docs.confluent.io/current/clients/confluent-kafka-python/#producer
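`produce()` in confluent-kafka-python is asynchronous, so the `on_delivery` argument listed above is the way to learn whether a message actually reached the broker. A minimal sketch, assuming confluent-kafka is installed and a broker is reachable; the names `on_delivery`, `produce_all`, `delivered` and `failed` are choices made for this example.

```python
delivered, failed = [], []


def on_delivery(err, msg):
    """Delivery report, invoked once per message from poll() or flush()."""
    if err is not None:
        failed.append(str(err))
    else:
        delivered.append((msg.topic(), msg.partition(), msg.offset()))


def produce_all(bootstrap_servers, topic, payloads):
    """Produce every payload and wait until all delivery reports are in.

    Requires the confluent-kafka package and a reachable broker, so the
    import is done lazily inside the function.
    """
    from confluent_kafka import Producer

    producer = Producer({"bootstrap.servers": bootstrap_servers})
    for payload in payloads:
        producer.produce(topic, payload.encode("utf-8"), on_delivery=on_delivery)
        producer.poll(0)  # serve pending delivery callbacks without blocking
    producer.flush()      # block until outstanding messages are delivered
```

The callbacks run only inside `poll()` or `flush()`, which is why the loop calls `poll(0)` after each `produce()`.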
  • 64. Consumer ● Consumer group ○ group.id ○ session.timeout.ms ○ max.poll.records ○ heartbeat.interval.ms ● Offset Management ○ enable.auto.commit ○ auto.commit.interval.ms ○ auto.offset.reset https://kafka.apache.org/documentation.html#newconsumerconfigs
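In kafka-python these consumer settings are passed as keyword arguments, with the dots of the Java-style names replaced by underscores. The sketch below assumes kafka-python is installed and a broker is reachable; the helper names and the concrete values (group name, timeouts) are illustrative only. Auto-commit is switched off so that the offset is committed after each message has been handled.

```python
JAVA_STYLE_CONFIG = {
    "group.id": "demo-group",
    "session.timeout.ms": 10000,
    "heartbeat.interval.ms": 3000,
    "max.poll.records": 500,
    "enable.auto.commit": False,
    "auto.commit.interval.ms": 5000,
    "auto.offset.reset": "earliest",
}


def to_kafka_python_kwargs(config):
    """Map dotted config names to kafka-python keyword arguments."""
    return {name.replace(".", "_"): value for name, value in config.items()}


def consume_forever(topic, bootstrap_servers, handle):
    """Handle each message, then commit its offset.

    Requires the kafka-python package and a reachable broker, so the
    import is done lazily inside the function.
    """
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap_servers,
        **to_kafka_python_kwargs(JAVA_STYLE_CONFIG)
    )
    try:
        for message in consumer:
            handle(message.value)
            consumer.commit()  # synchronous commit of consumed offsets
    finally:
        consumer.close()


kwargs = to_kafka_python_kwargs(JAVA_STYLE_CONFIG)
```

Committing after handling gives at-least-once behaviour; committing once per message is simple but slow, so batching commits is a common refinement.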
  • 65. Consumer API (JAVA) Reading data from Kafka. Class KafkaConsumer<K,V>: a client that consumes records from a Kafka cluster. ● assign(<TopicPartition> partitions) ● assignment() ● beginningOffsets(<TopicPartition> partitions) ● close(long timeout, TimeUnit timeUnit) ● commitAsync(Map<TopicPartition,OffsetAndMetadata> offsets, OffsetCommitCallback callback) ● commitSync(Map<TopicPartition,OffsetAndMetadata> offsets) ● committed(TopicPartition partition) ● endOffsets(<TopicPartition> partitions) ● listTopics() ● metrics() ● offsetsForTimes(Map<TopicPartition,Long> timestampsToSearch) ● partitionsFor(topic) ● pause(<TopicPartition> partitions) ● poll(long timeout) ● position(TopicPartition partition) ● resume(<TopicPartition> partitions) ● seek(TopicPartition partition, long offset) ● seekToBeginning(<TopicPartition> partitions) ● seekToEnd(<TopicPartition> partitions) ● subscribe(topics, ConsumerRebalanceListener listener) ● subscribe(Pattern pattern, ConsumerRebalanceListener listener) ● subscription() ● unsubscribe() ● wakeup() https://kafka.apache.org/0102/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
  • 67. Create a Kafka Topic
    » Let's create a topic named "test" with a single partition and only one replica:
      ⋄ bin/kafka-topics.sh --create --zookeeper zhost:2181 --replication-factor 1 --partitions 1 --topic test
    » See that topic
      ⋄ bin/kafka-topics.sh --list --zookeeper zhost:2181
    » bin/kafka-topics.sh: create, delete, describe, or change a topic.
  • 68. Python Kafka Client Benchmarking
  • 74. More about Kafka » Reliability and durability ⋄ Scaling, replication, guarantee, Zookeeper » Compact log » Administration, Configuration, Operations, Monitoring » Kafka connect » Kafka Stream » Schema Registry » Rest proxy » Apache Kafka vs XXX ⋄ RabbitMQ, AWS Kinesis, GCP Pub/Sub, ActiveMQ, ZeroMQ, Redis, and ....
  • 75. The other 2 APIs » Connect API ○ JDBC, HDFS, S3, …. » Streams API ○ map, filter, aggregate, join
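The four Streams operations named here can be illustrated on an in-memory list of events. This is only a plain-Python analogy under assumptions: the Streams API itself is a Java library working on unbounded topics, and the event fields and page titles below are made up for the example.

```python
from collections import Counter

events = [
    {"type": "click", "page": "/home"},
    {"type": "scroll", "page": "/home"},
    {"type": "click", "page": "/docs"},
    {"type": "click", "page": "/home"},
]
titles = {"/home": "Home", "/docs": "Documentation"}  # lookup table

clicks = (e for e in events if e["type"] == "click")       # filter
pages = (e["page"] for e in clicks)                        # map
counts = Counter(pages)                                    # aggregate
report = {titles[page]: n for page, n in counts.items()}   # join
```

The generators process one event at a time, which mirrors how a stream processor handles records as they arrive rather than loading a whole batch.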
  • 76. More references 1. The Log: What every software engineer should know about real-time data's unifying abstraction, Jay Kreps, 2013 2. Pykafka and Kafka-python? https://github.com/Parsely/pykafka/issues/559 3. Why I am not a fan of Apache Kafka (2015-2016 Sep) 4. Kafka vs RabbitMQ a. What are the differences between Apache Kafka and RabbitMQ? b. Understanding When to use RabbitMQ or Apache Kafka 5. Kafka summit (2016~) 6. Future features of Kafka (Kafka Improvement Proposals) 7. Kafka- The Definitive Guide