Modern data systems don't just process massive amounts of data; they need to do it very fast. Using fraud detection as a convenient example, this session covers best practices for building real-time data processing applications with Apache Kafka. We'll explain how Kafka makes real-time processing almost trivial, discuss the pros and cons of the famous lambda architecture, help you choose a stream processing framework, and even talk about deployment options.
5. Founded by the creators of Kafka - @jaykreps, @nehanarkhede, @junrao
We help you gather, transport, organize, and analyze all of your stream data
What we offer
• Confluent Platform
• Kafka plus critical bug fixes not yet applied in Apache release
• Kafka ecosystem projects
• Enterprise support
• Training and Professional Services
20. Consumers
[Diagram: a Kafka cluster with one topic split into Partition A, Partition B, and Partition C (each a file), read by the consumers of Consumer Group X and Consumer Group Y]
• Order is retained within a partition, but not across partitions
• Offsets are kept per consumer group
22. Keeping Things Simple
• Consume records from Kafka Topic
• Filter, transform, join, lookups, aggregate
• Write to another Kafka Topic
• https://github.com/confluentinc/examples/tree/master/specific-avro-consumer
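The three bullets above can be sketched as a tiny consume-transform-produce loop. This is a simulation, not real client code: the in-memory lists and the `transform` / `process` names stand in for Kafka topics and a consumer/producer pair.

```python
# Simulated consume -> filter/transform -> produce pipeline.
# In-memory lists stand in for the input and output Kafka topics.

def transform(record):
    """Drop records under a threshold, uppercase the rest (illustrative)."""
    if record["amount"] < 10:
        return None
    return {"amount": record["amount"], "payload": record["payload"].upper()}

def process(input_topic, output_topic):
    for record in input_topic:           # consume records from the topic
        result = transform(record)       # filter / transform
        if result is not None:
            output_topic.append(result)  # write to another topic

input_topic = [
    {"amount": 5, "payload": "small"},
    {"amount": 42, "payload": "large"},
]
output_topic = []
process(input_topic, output_topic)
```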
23. Kafka Makes Streams Easy
• Producers partition the data
• Consumers load balance partitions
• Add / remove consumers any way you want
• Will work with any framework (or none!)
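A sketch of the "producers partition the data" bullet: keyed records are routed with a stable hash so the same key always lands in the same partition. Kafka's default partitioner uses murmur2; a CRC stands in here for the same idea.

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Route a keyed record to a partition with a stable hash."""
    return zlib.crc32(key) % num_partitions

# The same key always maps to the same partition, preserving per-key order:
p1 = partition_for(b"user-42", 3)
p2 = partition_for(b"user-42", 3)
```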
24. Coming Soon to Kafka Near You
• KafkaConnect - Export / Import for Kafka - 0.9.0 (It's here!)
• KStream
• Consumer-Producer client - Processor (0.10.0 - April?)
• DSLs:
• KStream (a bit like Spark) - (0.10.0 - April?)
• SQL - ???
25. KConnect - It's a thing
• Easy to add connectors to Kafka
• Existing connectors
• JDBC
• HDFS
• MySQL * 2
• ElasticSearch * 4
• Cassandra
• S3 * 2
• MQTT
• Twitter
This gives me a lot of perspective regarding the use of Hadoop
Topics are partitioned; each partition is ordered and immutable. Messages in a partition have an ID called an offset, which uniquely identifies a message within the partition.
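A minimal in-memory model of a partition as an append-only log, where the offset is simply the message's position (class and method names are illustrative, not Kafka APIs):

```python
class Partition:
    """An append-only log; the offset is just the index in the log."""

    def __init__(self):
        self._log = []

    def append(self, message) -> int:
        """Append a message and return its offset."""
        self._log.append(message)
        return len(self._log) - 1

    def read(self, offset):
        """An offset uniquely identifies a message within the partition."""
        return self._log[offset]

p = Partition()
first = p.append("payment:17")
second = p.append("payment:18")
```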
Kafka retains all messages for a fixed amount of time. It does not wait for acks from consumers.
The only metadata retained per consumer is the position in the log – the offset
So adding many consumers is cheap
On the other hand, consumers have more responsibility and are more challenging to implement correctly
And “batching” consumers is not a problem
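The notes above can be sketched: the broker tracks one offset per consumer group, so many groups, including a "batching" one, read the same log independently (the `poll` name here is illustrative, not the real client API):

```python
# Simulated log with per-consumer-group offsets.
log = ["m0", "m1", "m2", "m3"]
offsets = {"group-x": 0, "group-y": 0}   # the only per-group metadata

def poll(group, max_records):
    """Read up to max_records from the group's offset, then advance it."""
    start = offsets[group]
    batch = log[start:start + max_records]
    offsets[group] = start + len(batch)
    return batch

one_at_a_time = poll("group-x", 1)  # a record-at-a-time consumer
big_batch = poll("group-y", 10)     # a batching consumer catches up at once
```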
3 partitions, each replicated 3 times.
You choose how many replicas must ACK a message before it's considered committed.
This is the tradeoff between speed and reliability.
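That tradeoff surfaces directly in producer and topic configuration. A sketch of the relevant settings (see Kafka's producer and topic documentation for exact semantics):

```properties
# Producer side: how many replicas must acknowledge a write
acks=all    # wait for all in-sync replicas - slowest, most reliable
# acks=1    # leader only - faster, data loss possible on leader failure
# acks=0    # no acknowledgement - fastest, no delivery guarantee

# Topic side: with acks=all, require at least 2 in-sync replicas
min.insync.replicas=2
```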
A consumer can read from one or more partition leaders. You can't have two consumers in the same group reading the same partition.
Leaders obviously do more work, but they are balanced between nodes.
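A sketch of the assignment rule: within a group, each partition goes to exactly one consumer. Round-robin is shown here for simplicity; Kafka also ships range and sticky assignors. All names are illustrative.

```python
def assign(partitions, consumers):
    """Give each partition to exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

# 3 partitions, 2 consumers in the group: no partition is shared
group = assign(["p0", "p1", "p2"], ["c1", "c2"])
```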
We reviewed the basic components on the system, and it may seem complex. In the next section we’ll see how simple it actually is to get started with Kafka.