BUILDING REALTIME DATA
PIPELINES WITH KAFKA CONNECT
AND SPARK STREAMING
Ewen Cheslack-Postava
Confluent

Spark Summit East Talk


  1. BUILDING REALTIME DATA PIPELINES WITH KAFKA CONNECT AND SPARK STREAMING
     Ewen Cheslack-Postava, Confluent
  2. About Me: Ewen Cheslack-Postava
     • Engineer @ Confluent
     • Kafka Committer
     • Kafka Connect Lead
  3. Traditional ETL
  4. More Data Systems
  5. Stream Processing
  6. Separation of Concerns
  7. Kafka Connect: large-scale streaming data import/export for Kafka
  8. Separation of Concerns
  9. Tasks - Parallelism
  10. Execution Model - Standalone
  11. Execution Model - Distributed
  12. Execution Model - Distributed
  13. Execution Model - Distributed
  14. Data Integration as a Service
  15. Delivery Guarantees
     • Automatic offset checkpointing and recovery
       – Supports at least once
       – Exactly once for connectors that support it (e.g. HDFS)
       – At most once simply swaps write & commit
       – On restart: task checks offsets & rewinds
  16. Spark Streaming
     • Use Direct Kafka streams (1.3+)
       – Better integration, more efficient, better semantics
     • Spark Kafka Writer
       – At least once
       – Kafka community is working on improved producer semantics
  17. Spark Streaming & Kafka Connect
     • Increase # of systems Spark Streaming works with, indirectly
     • Reduce friction to adopt Spark Streaming
     • Reduce need for Spark-specific connectors
     • By leveraging Kafka as de facto streaming data storage
  18. Kafka Connect Summary
     • Designed for large scale stream or batch data integration
     • Community supported and certified way of using Kafka
     • Soon, large repository of open source connectors
     • Easy data pipelines when combined with Spark & Spark Streaming
  19. THANK YOU.
     Follow me on Twitter: @ewencp
     Try it out: http://confluent.io/download
     More like this, but in blog form: http://confluent.io/blog
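
The Delivery Guarantees slide says that at most once "simply swaps write & commit" and that a restarted task checks its offsets and rewinds. The toy simulation below illustrates that ordering argument only. It is a sketch under stated assumptions, not Kafka Connect code: `SinkTask`, `Crash`, `deliver`, and the per-record commit are hypothetical names and simplifications introduced here (real offset checkpointing is not per record), and no Kafka or Connect API is used.

```python
from dataclasses import dataclass, field
from typing import List


class Crash(Exception):
    """Simulated task failure (hypothetical, for illustration only)."""


@dataclass
class SinkTask:
    """Toy sink task that copies records from a source log into a sink list."""
    source: List[str]
    write_first: bool                 # True: write then commit; False: commit then write
    sink: List[str] = field(default_factory=list)
    committed: int = 0                # offset of the next record to process after a restart

    def run(self, crash_at: int = -1) -> None:
        # On (re)start, check the committed offset and rewind to it.
        position = self.committed
        while position < len(self.source):
            record = self.source[position]
            if self.write_first:
                self.sink.append(record)          # write ...
                if position == crash_at:
                    raise Crash()                 # ... fail before the commit
                self.committed = position + 1     # ... then commit
            else:
                self.committed = position + 1     # commit ...
                if position == crash_at:
                    raise Crash()                 # ... fail before the write
                self.sink.append(record)          # ... then write
            position += 1


def deliver(write_first: bool) -> List[str]:
    """Run a task that fails once between the two steps for record "b", then restart it."""
    task = SinkTask(source=["a", "b", "c"], write_first=write_first)
    try:
        task.run(crash_at=1)
    except Crash:
        pass
    task.run()  # restart: resumes from the last committed offset
    return task.sink


print(deliver(write_first=True))   # write then commit: "b" is delivered twice
print(deliver(write_first=False))  # commit then write: "b" is lost
```

With write-then-commit, the failure leaves a written but uncommitted record, so the rewind replays it (a duplicate, never a loss). With the two steps swapped, the record is committed but never written, so it is skipped (a loss, never a duplicate). Exactly once additionally needs a sink that can make the write and the offset update effectively atomic, which this sketch does not model.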
