2. About Me
Shuhsi Lin
sucitw@gmail.com
Data Software Engineer in EAD at Micron
Currently working with
- data and people
- Lurking in PyHug, Taipei.py and various meetups
5. Agenda
» Pipeline to streaming
» What is Apache Kafka
⋄ Overview
⋄ Architecture
⋄ Use cases
» Kafka API
⋄ Python clients
» Conclusion and More about Kafka
6. What we will not focus on
» Reliability and durability
⋄ Scaling, replication, guarantee
⋄ Zookeeper
» Log compaction
» Administration, configuration, operations
» Kafka Connect
» Kafka Streams
» Apache Kafka vs. alternatives
⋄ RabbitMQ, AWS Kinesis, GCP Pub/Sub, ActiveMQ,
ZeroMQ, Redis, and more
12. What is stream processing?
» Data comes from streams of events
(orders, sales, clicks or trades)
» Databases are event streams, too
⋄ creating a backup or standby copy of a database means
replaying its change events
⋄ i.e., publishing the database changes as a stream
16. The name, “Kafka”, came from?
https://www.quora.com/What-is-the-relation-between-Kafka-the-writer-and-Apache-Kafka-the-distributed-messaging-system
http://slideplayer.com/slide/4221536/
https://en.wikipedia.org/wiki/Franz_Kafka
17. What is Apache Kafka?
Apache Kafka is a distributed system designed for streams. It is built to be
fault-tolerant, high-throughput, horizontally scalable, and allows geographically
distributing data streams and processing.
https://kafka.apache.org
22. What a streaming data platform can provide
» “Data integration” (ETL)
⋄ How to transport data between systems
⋄ Captures streams of events or data changes and
feeds these to other data systems
» “Stream processing” (messaging)
⋄ Continuous, real-time processing and
transformation of these streams and makes the
results available system-wide.
(Figure: stream data flowing between the various systems at LinkedIn;
analytical data processing with very low latency)
https://www.confluent.io/blog/stream-data-platform-1/
24. What Kafka Does
Publish & subscribe
● to streams of data like a messaging system
Process
● streams of data efficiently and in real time
Store
● streams of data safely in a distributed replicated cluster
https://kafka.apache.org/
27. A modern stream-centric data architecture built around Apache Kafka
https://www.confluent.io/blog/stream-data-platform-1/
(LinkedIn: 500 billion events per day)
28. The key abstraction in Kafka is a structured commit log of updates
» Producers append records to this log
» Each data consumer has its own position in the log and advances
independently, which allows a reliable, ordered stream of updates
to be distributed to each consumer
» The log can be sharded and spread over a cluster of machines, and
each shard is replicated for fault-tolerance
» This gives parallel, ordered consumption (important to a change
capture system for database updates) over TBs of data
https://www.confluent.io/blog/stream-data-platform-1/
29. Topics and Partitions
» Topics are split into partitions
» Partitions are strongly ordered & immutable
» Partitions can exist on different servers
» Partitions enable scalability
» Producers assign each message to a partition within the topic
⋄ either round robin (simply to balance load)
⋄ or according to the message key
https://kafka.apache.org/documentation/#gettingStarted
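The two assignment strategies can be sketched in a few lines. This is a toy model for illustration only: Kafka's real default partitioner hashes keys with murmur2, not md5, and the partition count comes from the broker.

```python
import itertools
from hashlib import md5

NUM_PARTITIONS = 4
_round_robin = itertools.cycle(range(NUM_PARTITIONS))

def pick_partition(key=None):
    """Choose a partition for a message, mimicking the two strategies."""
    if key is None:
        # No key: spread messages round-robin to balance load
        return next(_round_robin)
    # Keyed: hash the key so the same key always lands in the same
    # partition (a sketch; Kafka itself uses murmur2, not md5)
    return int(md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

# Same key -> same partition, so per-key ordering is preserved
assert pick_partition("user-42") == pick_partition("user-42")
```

Keyed assignment is what makes per-key ordering possible: all messages for one key live in one strongly ordered partition.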
30. Offsets
» Messages are assigned an offset within the partition
» Consumers track their position with (offset, partition, topic)
https://kafka.apache.org/documentation/#gettingStarted
A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups
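The consumer-side bookkeeping is small: each consumer just remembers its own (topic, partition) → offset position. A toy sketch (hypothetical names, not the client API) of why two consumers can read the same partition at different speeds:

```python
# Toy sketch: each consumer tracks one offset per (topic, partition)
class ConsumerPosition:
    def __init__(self):
        self.offsets = {}  # (topic, partition) -> next offset to read

    def poll(self, log, topic, partition):
        """Read the next message from a partition and advance the offset."""
        pos = self.offsets.get((topic, partition), 0)
        if pos >= len(log[(topic, partition)]):
            return None  # caught up with the end of the partition
        msg = log[(topic, partition)][pos]
        self.offsets[(topic, partition)] = pos + 1
        return msg

log = {("clicks", 0): ["a", "b", "c"]}
fast, slow = ConsumerPosition(), ConsumerPosition()
fast.poll(log, "clicks", 0); fast.poll(log, "clicks", 0)
slow.poll(log, "clicks", 0)
# The two consumers advance independently through the same partition
assert fast.offsets[("clicks", 0)] == 2
assert slow.offsets[("clicks", 0)] == 1
```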
31. Consumers and Partitions
» The consumers in a group divide the topic's partitions among themselves
» A given partition is always sent to the same consumer instance in the group
https://kafka.apache.org/documentation/#gettingStarted
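The arrangement in the figure above (four partitions, two consumer groups) follows from one rule: within a group, each partition is owned by exactly one consumer instance. A minimal sketch of such an assignment (round-robin for simplicity; Kafka supports several assignment strategies):

```python
def assign(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin.
    Every partition ends up owned by exactly one consumer instance."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Four partitions, a group with two consumers (as in the figure)
a = assign(["P0", "P1", "P2", "P3"], ["C1", "C2"])
assert a == {"C1": ["P0", "P2"], "C2": ["P1", "P3"]}
```

Because each partition has a single owner, per-partition ordering is preserved while the group as a whole consumes in parallel.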
32. Consumer
● Messages are available to consumers only when they have been
committed
● Kafka does not push
○ Unlike JMS
● Reading does not destroy messages
○ Unlike JMS Topic
● (some) History available
○ Offline consumers can catch up
○ Consumers can re-consume from the past
● Delivery Guarantees
○ Ordering maintained
○ At-least-once (per consumer) by default; at-most-once and exactly-once can be
implemented
P11 at https://www.slideshare.net/lucasjellema/amis-sig-introducing-apache-kafka-scalable-reliable-event-bus-message-queue
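The difference between at-most-once and at-least-once is just where the offset commit happens relative to processing. A toy sketch of the at-least-once pattern (not the Kafka client API; the dict stands in for Kafka's committed offset):

```python
# Toy sketch of delivery semantics: the committed offset lives outside
# the consumer, as it does in Kafka, so it survives a consumer crash.
store = {"offset": 0}  # stands in for Kafka's committed offset
seen = []              # what our processing function has handled

def consume_at_least_once(messages, process, store):
    """Commit the offset only AFTER processing each message.
    A crash between processing and commit means the message is
    re-read on restart: possible duplicates, but no loss."""
    for offset in range(store["offset"], len(messages)):
        process(messages[offset])      # may crash mid-way
        store["offset"] = offset + 1   # commit only after success

def flaky(msg):
    # Simulate a consumer that crashes the first time it sees "b"
    if msg == "b" and "b" not in seen:
        seen.append(msg)
        raise RuntimeError("crash after processing, before commit")
    seen.append(msg)

msgs = ["a", "b", "c"]
try:
    consume_at_least_once(msgs, flaky, store)
except RuntimeError:
    pass                                   # consumer died; offset 1 is committed
consume_at_least_once(msgs, flaky, store)  # restart resumes at offset 1

assert seen == ["a", "b", "b", "c"]        # "b" delivered twice, nothing lost
assert store["offset"] == 3
```

Committing before processing instead would give at-most-once: the crashed message is skipped on restart rather than redelivered.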
33. ZooKeeper: the coordination interface
between the Kafka broker and consumers
https://hortonworks.com/hadoop-tutorial/realtime-event-processing-nifi-kafka-storm/#section_3
» Stores configuration data for distributed services
» Used primarily by brokers
» Used by consumers in 0.8 but not 0.9
35. Apache Kafka timeline
» 2010: created at LinkedIn
» 2011-Nov: entered the Apache Software Foundation incubator
» 2013-Nov: v0.8 (new producer, reassign-partitions)
» 2014: Confluent founded
» 2015-Nov: v0.9 (Kafka Connect, security, new consumer)
» 2016-May: v0.10 (Kafka Streams, rack awareness)
» Next version, v0.10.2: Single Message Transforms for Kafka Connect
36. TLS connection
SSL is supported only for the new Kafka Producer and Consumer (Kafka versions 0.9.0 and higher)
http://kafka.apache.org/documentation.html#security_ssl
http://docs.confluent.io/current/kafka/ssl.html
http://maximilianchrist.com/blog/connect-to-apache-kafka-from-python-using-ssl
https://github.com/edenhill/librdkafka/wiki/Using-SSL-with-librdkafka
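With kafka-python, the TLS setup boils down to a handful of constructor options; a sketch, with the file paths as placeholders for your own certificates:

```python
# TLS options for kafka-python's KafkaProducer / KafkaConsumer
# (paths are placeholders; point them at your own certificate files)
ssl_config = {
    "security_protocol": "SSL",
    "ssl_cafile": "ca-cert.pem",        # CA that signed the broker cert
    "ssl_certfile": "client-cert.pem",  # client cert, if brokers require
    "ssl_keyfile": "client-key.pem",    # client authentication
}
# producer = KafkaProducer(bootstrap_servers="broker:9093", **ssl_config)
```

Remember that brokers expose TLS on a separate listener port (commonly 9093), so the bootstrap address changes too.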
37. Apache Kafka can be considered as:
» Stream data platform
» Commit log service
» Messaging system
» Circular buffer
38. Cons of Apache Kafka
» Consumer Complexity (smart, but poor client)
» Lack of tooling/monitoring (3rd party)
» Still pre 1.0 release
» Operationally, it’s more manual than desired
» Requires ZooKeeper
http://www.slideshare.net/jimplush/introduction-to-apache-kafka-53225326 (Sep 26, 2015)
39. Use Cases
» Website Activity Tracking
» Log Aggregation
» Stream Processing
» Event Sourcing
» Commit logs
» Metrics (performance-index streaming)
⋄ CPU / IO / memory usage
⋄ Application-specific:
  time taken to load a web page,
  time taken to build a web page,
  number of requests,
  number of hits on a particular page/URL
40. Event-driven Applications
» How Kafka is typically first adopted, and how its role
evolves over time in an architecture
https://aws.amazon.com/tw/kafka/
42. Conceptual Reference Architecture
for Real-Time Processing in HDP 2.2
https://hortonworks.com/blog/storm-kafka-together-real-time-data-refinery/ February 12, 2015
43. Event delivery system design in Spotify
https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/
44. Case: Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spark Streaming
http://helenaedelson.com/?p=1186 (2016/03)
62. Producer API - Kafka-Python
from kafka import KafkaProducer
from settings import BOOTSTRAP_SERVERS, TOPICS, MSG

p = KafkaProducer(bootstrap_servers=BOOTSTRAP_SERVERS)
# value must be of type bytes, or be serializable to bytes
# via the configured value_serializer
p.send(TOPICS, MSG.encode('utf-8'))
p.flush()

class kafka.KafkaProducer(**configs)
● close(timeout=None)
● flush(timeout=None)
● partitions_for(topic)
● send(topic, value=None, key=None, partition=None, timestamp_ms=None)
https://kafka-python.readthedocs.io/en/master/_modules/kafka/producer/kafka.html#KafkaProducer
http://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html
63. Producer API - Confluent-Kafka-Python
from confluent_kafka import Producer
from settings import BOOTSTRAP_SERVERS, TOPICS, MSG

p = Producer({'bootstrap.servers': BOOTSTRAP_SERVERS})
p.produce(TOPICS, MSG.encode('utf-8'))
p.flush()

class confluent_kafka.Producer(*kwargs)
● len()
● flush([timeout])
● poll([timeout])
● produce(topic[, value][, key][, partition][, on_delivery][, timestamp])
http://docs.confluent.io/current/clients/confluent-kafka-python/#producer
67. Create a Kafka Topic
» Let's create a topic named "test" with a single partition and
only one replica:
⋄ bin/kafka-topics.sh --create --zookeeper zhost:2181
--replication-factor 1 --partitions 1 --topic test
» See that topic
⋄ bin/kafka-topics.sh --list --zookeeper zhost:2181
bin/kafka-topics.sh
» Create, delete, describe, or change a topic.
74. More about Kafka
» Reliability and durability
⋄ Scaling, replication, guarantee, Zookeeper
» Compact log
» Administration, Configuration, Operations, Monitoring
» Kafka Connect
» Kafka Streams
» Schema Registry
» REST proxy
» Apache Kafka vs. alternatives
⋄ RabbitMQ, AWS Kinesis, GCP Pub/Sub, ActiveMQ, ZeroMQ, Redis,
and more
75. The Other Two APIs
» Connect API
○ JDBC, HDFS, S3, …
» Streams API
○ map, filter, aggregate, join
76. More references
1. The Log: What every software engineer should know about real-time data's unifying abstraction,
Jay Kreps, 2013
2. PyKafka vs kafka-python: https://github.com/Parsely/pykafka/issues/559
3. Why I am not a fan of Apache Kafka (2015-2016 Sep)
4. Kafka vs RabbitMQ
a. What are the differences between Apache Kafka and RabbitMQ?
b. Understanding When to use RabbitMQ or Apache Kafka
5. Kafka summit (2016~)
6. Future features of Kafka (Kafka Improvement Proposals)
7. Kafka: The Definitive Guide