GNW03: Stream Processing with Apache Kafka by Gwen Shapira

Gwen Shapira of Confluent presented episode #03 of Gluent New World series and talked about stream processing in modern enterprises using Apache Kafka.

The video recording for this presentation is at: http://vimeo.com/gluent/

Published in: Technology

  1. Apache Kafka and Real Time Stream Processing Gwen Shapira System Architect Confluent @gwenshap
  2. I’ll tell you about • What is stream processing and why it matters • What is Apache Kafka • How Kafka helps stream processing Stay awake for this part
  3. What is Stream Processing?
  4. Data Processing Paradigm Request / Response Batch Stream Processing
  5. Stream Processing Paradigm • Data is generated at its own rate as “Streams” • We can process as much or as little as we want • Continuously • Results are available in real-time • But nothing waits for specific results • Time for data availability? • More than “few ms” • Less than “hours”
  6. This is the world changing bit • Most of the business is… • Not urgent enough to require immediate response • But can’t wait for the next day • “Streams of events” represents something fundamental • Same way relational tables are fundamental
  7. Ok, got the streams part. But what about Apache Kafka?
  8. Cross of messaging system and file system
  9. Kafka is all about LOGS
  10. If you understand logs You understand Kafka
  11. Redo Log: The most crucial structure for recovery operations … store all changes made to the database as they occur.
  12. Important Point The redo log is the only reliable source of information about current state of the database.
  13. But Logs are also a STREAM of events And Kafka stores those logs, allowing you to read the past and keep getting updates on the future
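
The “read the past, then keep getting updates on the future” behavior is just a consumer that starts from the earliest retained offset and keeps polling. A minimal sketch (not from the talk; the broker address, topic name, and group id are placeholder assumptions):

```java
// Sketch only: replay a topic from the beginning, then keep consuming new events.
// "localhost:9092", "purchases", and "replay-demo" are placeholder values.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ReplayThenFollow {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "replay-demo");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // With no committed offsets, start from the earliest retained offset:
        // this is the "read the past" part.
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("purchases"));
            while (true) { // ...and keep getting updates on the future
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```
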
  14. Stream Processing: read a stream, modify it, output another stream
  15. Example: CDC-based ETL
  16. If we use Kafka for CDC, does it mean it is ACID?
  17. Stream Processing is Important Kafka is a collection of logs. How does Kafka help with stream processing?
  18. First, How do we actually do stream processing?
  19. Method 1: Do it yourself (Hipster stream processing)
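
In the do-it-yourself style, the whole pipeline is a hand-rolled consume-transform-produce loop. A rough sketch of what that looks like; the topic names ("raw-events", "error-events"), broker address, and the trivial filter used as the “transformation” are illustrative assumptions, not anything from the slides:

```java
// Sketch only: hand-rolled consume-transform-produce loop ("hipster stream processing").
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class HipsterStreamProcessor {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "diy-processor");
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(Collections.singletonList("raw-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // The "processing": keep only error events (illustrative filter).
                    if (record.value().contains("ERROR")) {
                        producer.send(new ProducerRecord<>("error-events",
                                record.key(), record.value()));
                    }
                }
            }
        }
    }
}
```

This works, but you then own scaling, state, and recovery yourself, which is exactly the gap the frameworks in the next slide try to fill.
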
  20. Method 2: The Stream Processing Frameworks • Storm • Spark • Flink • Samza • Apex • NiFi • StreamBase • InfoSphere Streams • Google DataFlow (AKA Beam) • I can go on for 5 more pages…
  21. Few of those are really popular! • Pro: They handle some hard problems • Con: It can be too complex
  22. What do I mean by too complex? [Architecture diagram: web app clients feeding Flume agents and Kafka; two Hadoop clusters with HBase/memory, Spark Streaming, HDFS, Hive/Impala, MapReduce, Spark, and Solr search; separate near-real-time and batch adjustment paths updating profiles and stats.]
  23. Why so many moving parts? We needed… HBase to handle complex state; HDFS, because Spark requires it; an ingest layer; a batch layer to handle re-calculations
  24. What we really wanted was… Inputs → Kafka → Processor → Output
  25. Enter KafkaStreams 3 Simplifications: 1. Uses Kafka 2. No Framework 3. Unify Tables and Streams
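
Because Kafka Streams is just a library, “Enter KafkaStreams” amounts to adding a dependency and building a topology inside an ordinary Java application; there is no cluster to submit jobs to. A minimal sketch using the Streams DSL (topic names, application id, and the uppercase mapping are illustrative assumptions, not from the slides):

```java
// Sketch only: a complete Kafka Streams application as a plain Java program.
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class UppercasePipe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-pipe");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-events"); // read a stream
        input.mapValues(value -> value.toUpperCase())                   // modify it
             .to("output-events");                                      // output another stream

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```
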
  26. Don’t all stream processing frameworks use Kafka?
  27. We use Kafka for… Partitioning, Scalability, Fault Tolerance [Diagram: Kafka topic partitions fanned out to consumer instances in Group A and Group B.]
  28. Handling Time
  29. No Framework • It is just a library that does transformations • We can add languages on top • Kafka does everything we needed the framework to do • You don’t need a “framework” to run queries, so why do you need one to run queries continuously?
  30. The really important bit: Streams meet Tables
  31. Streams: Things that happen. Events. Tables: State of things as they are.
  32. Databases: Only states. Streams: Only events.
  33. We can convert tables to streams and back: Stream -> Apply -> Table; Table -> Change Capture -> Stream. This is called Table-Stream Duality.
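
The duality maps directly onto the Kafka Streams DSL: a KStream of events can be aggregated into a KTable (Stream -> Apply -> Table), and a KTable can be turned back into its changelog stream (Table -> Change Capture -> Stream). A hedged sketch, with made-up topic names ("page-views", "views-per-user"):

```java
// Sketch only: table-stream duality in the Kafka Streams DSL.
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class PageViewCounts {
    public static StreamsBuilder buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        // Stream of events: one record per page view, keyed by user id.
        KStream<String, String> pageViews = builder.stream("page-views");

        // Stream -> Apply -> Table: fold the event stream into current state.
        KTable<String, Long> viewsPerUser = pageViews
                .groupByKey()
                .count();

        // Table -> Change Capture -> Stream: every update to the table
        // becomes an event on the output topic.
        viewsPerUser.toStream()
                .to("views-per-user", Produced.with(Serdes.String(), Serdes.Long()));

        return builder;
    }
}
```
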
  34. Streams and Tables sometimes work the same. And sometimes are very different. KafkaStreams handles both.
  35. But… Where do streams come from?
  36. We really like streams, so we created a Stream Data Platform
  37. Where can we learn more? • http://www.confluent.io/blog • http://kafka.apache.org/documentation.html • http://docs.confluent.io/current
