Kafka connect-london-meetup-2016


Gwen (Chen) Shapira

Kafka is the basis for modern data integration infrastructure, and Kafka Connect makes it much better.


Stream All
Things
Real-time Data Integration at Scale
with Apache Kafka
By Gwen Shapira

[Architecture diagram. Components shown: clients ("Swipe here!"), web app, Flume agents, Kafka, local cache, HBase / memory, Spark Streaming, HDFS, Hive/Impala, MapReduce, Spark, Solr search, HDFSEventSink, and Solr sink, across Hadoop Cluster I and Hadoop Cluster II (storage, processing). Labeled activities: fetching and updating profiles, adjusting NRT stats, batch-time adjustments, automated and manual analytical adjustments and pattern detection, automated and manual review of NRT changes and counters.]

Data Integration
getting data to all the right places

Introducing
Kafka Connect
Large-scale streaming data import/export for Kafka

Delivery Guarantees
Offsets automatically committed and restored
On restart: task checks offsets & rewinds
At-least-once delivery: flush data, then commit
Exactly once for connectors that support it (e.g. HDFS)
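The "flush data, then commit" ordering can be sketched in a few lines. This is a toy model and not the Kafka Connect API: `SinkTask`, `run`, and the `crash_before_commit` flag are hypothetical names used only for illustration. It shows why a crash between the flush and the commit produces duplicates rather than data loss.

```python
class SinkTask:
    """Toy sink task (hypothetical): flushes a batch, then commits the offset."""

    def __init__(self, log, destination, committed_offset=0):
        self.log = log                        # source partition: a list of records
        self.destination = destination        # external system: a list we append to
        self.committed_offset = committed_offset
        self.position = committed_offset      # on (re)start, rewind to the last commit

    def run(self, batch_size, crash_before_commit=False):
        batch = self.log[self.position:self.position + batch_size]
        self.destination.extend(batch)        # 1. flush data to the destination
        self.position += len(batch)
        if crash_before_commit:
            raise RuntimeError("simulated crash after flush, before commit")
        self.committed_offset = self.position  # 2. only then commit the offset
        return self.committed_offset


log = ["r0", "r1", "r2", "r3"]
destination = []

task = SinkTask(log, destination)
committed = task.run(batch_size=2)             # r0, r1 flushed and committed
try:
    task.run(batch_size=2, crash_before_commit=True)  # r2, r3 flushed, not committed
except RuntimeError:
    pass

# Restart: a new task rewinds to the last committed offset and replays r2, r3.
task = SinkTask(log, destination, committed)
committed = task.run(batch_size=2)
```

After the restart every record has reached the destination at least once, and `r2` and `r3` appear twice. A destination that can recognize replayed offsets could discard those duplicates, which is one way a connector might reach exactly-once; the slides do not describe how the HDFS connector does it.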

Converters
Abstract serialization: 1 connector, many serialization formats
Convert between Kafka Connect Data API (Connectors) and serialized bytes (Kafka)
JSON and Avro are currently well supported
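As a rough illustration of "1 connector, many serialization formats", the sketch below keeps the connector's records as plain Python dictionaries and lets a pluggable converter turn them into bytes and back. The class and method names are hypothetical stand-ins, not the real Kafka Connect interfaces, and the second format (`repr` text) exists only to show that the connector code does not change when the format does.

```python
import ast
import json


class JsonConverter:
    """Hypothetical converter: structured record <-> JSON bytes."""

    def from_connect_data(self, value):
        return json.dumps(value, sort_keys=True).encode("utf-8")

    def to_connect_data(self, payload):
        return json.loads(payload.decode("utf-8"))


class ReprConverter:
    """Hypothetical second format: Python literal text instead of JSON."""

    def from_connect_data(self, value):
        return repr(value).encode("utf-8")

    def to_connect_data(self, payload):
        return ast.literal_eval(payload.decode("utf-8"))


def run_source(records, converter):
    """The connector yields structured records; the framework serializes them."""
    return [converter.from_connect_data(record) for record in records]


records = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

json_bytes = run_source(records, JsonConverter())
repr_bytes = run_source(records, ReprConverter())
```

The same `records` list goes through both converters unchanged; only the bytes written to Kafka differ.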

Connectors Today
Confluent Open Source: HDFS, JDBC
Connector Hub: connectors.confluent.io
Examples: MySQL, MongoDB, Twitter, Solr, S3, MQTT, Bloomberg, Apache Ignite, Attunity, Couchbase, Vertica, Cassandra, HBase, Kudu, Mixpanel, Syslog, and more

Connectors from the Hackathon
Jenkins connector – Aravind Yarram (Equifax)
Twitter semantic analysis and visualization – Ashish Singh (Cloudera)
Brain monitoring device connector – Silicon Valley Data Science
DynamoDB, Cassandra, Slack, Splunk, and many more

Coming soon…
Improved connector control via REST API, standardized configs, metrics
Single record transformations
Data pipelines in an app: embedded mode & Kafka Streams integration
Many more connectors

THANK YOU!
Gwen Shapira | gwen@confluent.io | @gwenshap
Visit us in the Confluent Booth (#217)
Kafka: The Definitive Guide = Book Giveaway and Signing
Making Sense of Stream Processing = Book Giveaway
Kafka Training with Confluent University
Kafka Developer and Operations Courses
Visit www.confluent.io/training



Editor's notes

Hi everyone and thanks for coming. Today I want to tell you about Kafka Connect and how it’s helping to address the challenges of real-time data integration.
In the traditional model, a relational database held the data for OLTP, and that data was copied into a data warehouse for OLAP. There was one primary data store for active data and one for offline, batch analysis.
Today there are more types of data stores with specialized functionality, e.g. the rise of NoSQL systems such as document-oriented and columnar stores, and so a lot more sources of data. There are also more secondary data stores and indexes, e.g. Elasticsearch for efficient text-based queries, graph DBs for graph-oriented queries, and time series databases. That means a lot more destinations for data, and a lot of transformations along the way to those destinations. Finally, everything is real-time: data needs to be moved between these systems continuously and at low latency.
Unfortunately, as we build up large, complex data pipelines in an ad hoc fashion, connecting different data systems that need copies of the same data with one-off connectors, or building custom connectors for stream processing frameworks to handle different sources and sinks of streaming data, we end up with a giant, unmaintainable mess. This mess has a huge impact on productivity and agility once you get past just a few systems. Adding any new data storage system or stream processing job requires carefully tracking down all the downstream systems that might be affected, which may require coordinating with dozens of teams and code spread across many repositories. Trying to change one data source’s data format can impact many downstream systems, yet there’s no simple way to discover how these jobs are related. This is a real problem that we’re seeing across a variety of companies today. We need to do something to simplify this picture. While Confluent is working to build out a number of tools to help with these challenges, today I want to focus on how we can standardize and simplify constructing these data pipelines so that, at a minimum, we reduce operational complexity and make it easier to discover and understand the full data pipeline and its dependencies.
We refer to this problem as data integration – by which we broadly mean making sure data gets to all the right places. We need to be able to collect data from a diverse set of sources and then feed it to several downstream applications and systems for processing. This problem isn’t a new one. There were legacy solutions to this problem but the approach of copying data in an ad-hoc way across applications just does not scale anymore. Today data is in motion and it needs to move in real-time and at scale.
I want to start by highlighting some anti-patterns we observe in how people are tackling this problem today.
One-off tools: connect any two given specific systems.
  High complexity, operational overhead
  Designed to be too specific: n^2 connectors
Overly-generic data copying tools: make few assumptions, connect any and all inputs and outputs, and do a bunch of intermediate transformation as well.
  Try to do too much: E, T, and L with weak interfaces
  Too abstract: difficult/impossible to make guarantees even when connecting right pairs of systems
Stream processing tools for data integration
  Overkill for simple EL workloads
  Weaker connector ecosystem: focus is rightly on T
  Generic, weak interfaces as found in generic data copying tools result in difficult to understand semantics and guarantees
When we get too specific, handling everything ad hoc, we end up with a ton of different tools for every connection, oftentimes many different tools for doing transformations, and, probably the worst case, a lot of different tools that do *all* of ETL for specific systems. If we have too little separation of concerns, we end up in situations where we use the stream processing framework for literally every step, even though it uses a specific model that doesn’t map well to ingesting or exporting data from many types of systems. Alternatively, we use overly generic data copying & transformation tools. These tools are so abstract that they can’t provide many guarantees and become overly complex, requiring you to learn a dozen concepts just to set up a simple pipeline. What we really need is a separation of concerns in ET&L.
One step towards a separation of concerns is being able to decouple the E, T, and L steps. Kafka, when used as shown here, can help us do that. The vision for Kafka when it was originally built at LinkedIn was for it to act as a common hub for real-time data. When streaming data from data stores like an RDBMS or K/V store, we produce data into Kafka, making it available to as many downstream consumers as want it. Saving data to other systems, like secondary indexes and batch storage systems, is implemented with consumers. Stream processing frameworks and custom consumer apps fit in by being both consumers and producers: they read data from Kafka, transform it, and then possibly publish derived data back into Kafka. Using this model can simplify the problem, as we’re now always interacting with Kafka.
To set some context, I want to quickly list a few of the features that make it possible for Kafka to handle data at this scale. We’ll come back to many of these properties when looking at Kafka Connect. At its core, Kafka is a pub/sub messaging system rethought as a distributed commit log. It is based on an append-only, sequentially accessed log, which results in very high performance when reading and writing data. It extends this to a *partitioned stream* model for a single logical topic of data, which allows for distribution of data across the brokers and parallelism in both writes and reads. To still provide organization and ordering, it guarantees ordering within each partition and uses keys to determine which partition to put data in. As part of its append-only approach, it decouples data consumption from the data retention policy, e.g. retaining data for 7 days or until we have 1TB in a topic. This both gets rid of individual message acking and allows multiple consumption of the same data, i.e. pub/sub, by simply tracking offsets in the stream. Because data is split across partitions, we can also parallelize consumption and make it elastically scalable with Kafka’s unique automatically balanced consumer groups.
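To make the log properties above concrete, here is a small in-memory sketch of a partitioned, append-only topic. It is an illustration only: `Topic` and `Consumer` are hypothetical classes, and `zlib.crc32` stands in for the hash that Kafka's own partitioner uses. It shows keys mapping to partitions, ordering within a partition, and consumers that each track their own offsets instead of acknowledging individual messages.

```python
import zlib


class Topic:
    """Toy topic: a fixed number of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # A stable hash sends the same key to the same partition every time.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def read(self, partition, offset, max_records=10):
        return self.partitions[partition][offset:offset + max_records]


class Consumer:
    """Toy consumer: remembers one offset per partition, nothing more."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self):
        out = []
        for p in range(len(self.offsets)):
            records = self.topic.read(p, self.offsets[p])
            self.offsets[p] += len(records)
            out.extend(records)
        return out


topic = Topic(num_partitions=3)
for i in range(4):
    topic.append("user-1", i)
topic.append("user-2", 99)

first, second = Consumer(topic), Consumer(topic)
seen_by_first = first.poll()
seen_by_second = second.poll()   # the same data, read independently
```

Both consumers see every record because reading does not remove anything from the log; retention would be a separate policy that this sketch leaves out.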
Given all these properties, it’s easy to see how Kafka can fit this central role as the hub for all your realtime data, and we can simplify the original image of our data pipeline. However, with the regular Kafka clients, we’re still leaving quite a bit on the table – each connection in the image still requires its own tool or Kafka application to get data to or from Kafka. Each tool uses these relatively low-level clients and has to implement many common features.
Today, I want to introduce you to Kafka Connect, Kafka’s new large-scale, streaming data import/export tool that drastically simplifies the construction, maintenance, and monitoring of these data pipelines. Kafka Connect is part of the Apache Kafka project, open source under the Apache license, and ships with Kafka. It’s a framework for building connectors between other data systems and Kafka, and the associated runtime to run these connectors in a distributed, fault tolerant manner at scale.
Goals:
Focus: copying only.
Batteries included: the framework does all the common stuff so connector developers can focus specifically on the details that need to be customized for their system. This covers a lot more than many connector developers realize: beyond managing the producer or consumer, it includes challenges like scalability, recovery from faults and reasoning about delivery guarantees, serialization, connector control, monitoring for ops, and more.
Standardize: configuration, status and connector control, monitoring, etc.
Parallelism, scalability, and fault tolerance built in, without a lot of effort from connector developers or users.
Scale, in two ways. First, scale individual connectors to copy as much data as possible: ingest an entire database rather than one table at a time. Second, scale up to organization-wide data pipelines or down to development, testing, or just copying a single log file into Kafka.
With these goals in mind, let’s explore the design of Kafka Connect to see how it fulfills them.
At its core, Kafka Connect is pretty simple. It has source connectors, which copy data from another system into Kafka, and sink connectors, which copy data from Kafka into a destination system. Here I’ve shown a couple of examples. The source and sink systems don’t necessarily have to naturally match Kafka’s data model exactly. However, we do need to be able to translate data between the two. For example, we might load data from a database in a source connector. By using a timestamp column associated with each row, we can effectively generate an ordered stream of events that are then produced into Kafka. To store data into HDFS, we might load data from one or more topics in Kafka and then write it in sequence to files in an HDFS directory, rotating files periodically. Although Kafka Connect is designed around streaming data, because Kafka acts as a good buffer between streaming and batch systems, we can use it here to load data into HDFS. Neither of these systems maps directly to Kafka’s model, but both can be adapted to the concepts of streams with offsets. More about this in a minute. The most important design point for Kafka Connect is that one half of a connection is always Kafka: the destination for source connectors, or the source of data for sink connectors. This allows the framework to handle the common functionality of connectors while maintaining the ability to automatically provide scalability, fault tolerance, and delivery guarantees without requiring a lot of effort from connector developers. This key assumption is what makes it possible for Kafka Connect to get a better set of tradeoffs than the systems I mentioned earlier.
So now, coming back to the model that connectors need to map to. Just as Kafka’s data model enables certain features around scalability, Kafka Connect’s data model can as well. Kafka Connect requires every connector to map to a “partitioned stream” model. The basic idea is a generalization of Kafka’s data model of topics and partitions. This mapping is defined by the input system for the connector (the source system for source connectors, and Kafka topics for sink connectors) and has the following:
A set of partitions which divide the whole set of data logically. Unlike Kafka, the number of partitions can potentially be very large and may be more dynamic than we would expect with Kafka.
Each partition contains an ordered sequence of events/messages. Under the hood these are key/value pairs with byte[], but Kafka Connect requires that they can be converted into a generic data API.
Each event/message has a unique offset representing its position in the partition. Since the mapping is determined by the input system, these offsets must be meaningful to that system, and they may be quite different from the Kafka offsets you’re used to.
To give a more concrete example, we can revisit the database example from earlier. Previously I only showed a single table, but if we consider the database as a whole, we can apply this model to copy the entire database. We partition by table, delivering each into its own Kafka topic. Each event represents a row that we’ve inserted into the database. The offsets are IDs or timestamps, or even more complex representations like a combination of ID and timestamp. Although there isn’t *actually* a stream for each table, we can effectively construct one by querying the database and ordering results according to specific rules. As a result of this model, we can see a few properties emerging: First, we have a built-in concept of parallelism, a requirement for automatically providing scalable data copying. We’re going to be able to distribute processing of partitions across multiple hosts. Second, this model encourages making copying broad by default – partitioned streams should cover the largest logical collection of data. Finally, offsets provide an easy way to track which data has been processed and which still needs to be copied. In some cases, mapping from the native data model to streams may not be simple; however, a bit of effort in creating this mapping pays off by providing a common framework and implementation for tracking which data has been copied. Again, we’ll revisit this a bit later, but this allows the framework to handle a lot of the heavy lifting with regards to delivery semantics.
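The database example can be sketched with an in-memory SQLite database. This is not the JDBC connector's code; `poll_table`, `poll_database`, the table names, and the use of an auto-incrementing `id` as the offset are all assumptions made for illustration (the notes mention IDs, timestamps, or a combination). Each table is treated as a partition, and the stored offset is what lets a later poll return only new rows.

```python
import sqlite3


def poll_table(conn, table, last_id, limit=100):
    """Return rows with id greater than last_id, in id order, plus the new offset."""
    # The table name comes from a fixed, trusted list; SQLite cannot bind identifiers.
    rows = conn.execute(
        f"SELECT id, payload FROM {table} WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, limit),
    ).fetchall()
    new_offset = rows[-1][0] if rows else last_id
    return rows, new_offset


def poll_database(conn, offsets):
    """One pass over every table (partition); each table feeds its own topic."""
    produced = {}
    for table in offsets:
        rows, offsets[table] = poll_table(conn, table, offsets[table])
        if rows:
            produced[table] = rows
    return produced


conn = sqlite3.connect(":memory:")
for table in ("users", "orders"):
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("INSERT INTO users (payload) VALUES ('alice'), ('bob')")
conn.execute("INSERT INTO orders (payload) VALUES ('order-1')")

offsets = {"users": 0, "orders": 0}          # partition -> last copied id
first_pass = poll_database(conn, offsets)    # copies everything present so far

conn.execute("INSERT INTO users (payload) VALUES ('carol')")
second_pass = poll_database(conn, offsets)   # copies only the new row
```

Persisting `offsets` somewhere durable is what would let a restarted task pick up where it left off.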
Partitioned streams are the logical data model, but they don’t directly map to physical parallelism, or threads, in Kafka Connect. In the case of the database connector, a direct mapping might seem reasonable. However, some connectors will have a much larger number of partitions that are much finer-grained. For example, consider a connector for collecting metrics data: each metric might be considered its own partition, resulting in tens of thousands of partitions for even a small set of application servers. However, we do want to exploit the parallelism provided by partitions. Connectors do this by assigning partitions to tasks. Tasks are, simply, threads of control given to the connector code which perform the actual copying of data. Each connector is given a thread it can use to monitor the input system for the active set of partitions. Remember that this set can be dynamic, so continuous monitoring is sometimes needed to detect changes to the set of partitions. When there are changes, the connector notifies the framework so it can reconfigure the current set of tasks. Then, each task is given a dedicated thread for processing. The connector assigns a subset of partitions to each task, and the task is the one that actually copies the data for those partitions. Given the assignment, the connector implementer handles reading or writing the data for that set of partitions. And how do we decide how many tasks to generate? That’s up to the user, and it’s the primary way to control the total resources used by the connector. Since each task corresponds to a thread, the user can choose to dynamically increase or decrease the maximum number of tasks the connector may create in order to scale resource usage up or down. So now we have some set of threads, but where do they actually execute? Kafka Connect has two modes of execution.
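A minimal sketch of how a connector might spread many fine-grained partitions over a bounded number of tasks. `group_partitions` is a hypothetical helper written for this example, and round-robin is just one reasonable policy; `max_tasks` plays the role of the user-chosen maximum (the `tasks.max` setting in Kafka Connect).

```python
def group_partitions(partitions, max_tasks):
    """Split partitions into at most max_tasks groups, round-robin."""
    num_tasks = min(max_tasks, len(partitions))
    groups = [[] for _ in range(num_tasks)]
    for i, partition in enumerate(partitions):
        groups[i % num_tasks].append(partition)
    return groups


# Ten metrics, each its own partition, copied by at most three tasks.
metrics = [f"metric-{i}" for i in range(10)]
task_assignments = group_partitions(metrics, max_tasks=3)

# Fewer partitions than the maximum: no empty tasks are created.
small = group_partitions(["users", "orders"], max_tasks=8)
```

Raising or lowering `max_tasks` changes how many threads do the copying without changing the logical set of partitions.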
Standalone mode works as a single process. It is really easy to get started with and easy to configure. We like this because it scales down really easily and stays local for testing. It’s also great for connectors that really only make sense on a single node – for example, processing log files, where you need to read the data off the local file system. If you’ve used systems like Logstash or Flume, this mode should look familiar. It’s commonly referred to as either standalone or agent mode.
In contrast, distributed mode can scale up while providing distribution and fault tolerance. Recall that each connector or task is a thread, and we’re considering each to be approximately equal in terms of resource usage. Connectors and tasks are auto-balanced across workers. Failures are automatically handled by redistributing work, and you can easily scale the cluster up or down by adding or removing workers. A cool implementation note: this reuses the group membership functionality of consumer groups. Note how if you replace “worker” with “consumer” and “task” with “topic partition”, the things it is doing look largely the same: assigning tasks to workers, detecting when a worker is added or fails, and rebalancing the work. Kafka already provides support for doing a lot of this, so by leveraging the existing implementation and coordinating through Kafka’s group functionality (with internal data stored in Kafka topics), Kafka Connect can provide this functionality in a relatively small code footprint. All of this functionality can be accessed via the REST API – submit connectors, see their status, update configs, and so on. Finally, note that Kafka Connect does not own the process management at all. We don’t want to make assumptions about using Mesos, YARN, or any other tool because that would unnecessarily limit Kafka Connect’s usage. Kafka Connect will work out of the box in any of these cluster management systems, with orchestration tools, or with your own process management tooling.
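As a rough illustration of the REST workflow, the sketch below builds (but does not send) the HTTP requests for submitting a connector and checking its status. The paths `POST /connectors` and `GET /connectors/{name}/status` and the `tasks.max` setting follow the Kafka Connect REST API as documented in Apache Kafka releases; the host and port, the connector name, the file path, and the topic are assumptions for the example, and the helper functions are hypothetical.

```python
import json
import urllib.request


def build_create_request(base_url, name, config):
    """Build a request that submits a new connector to a Connect worker."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_status_request(base_url, name):
    """Build a request that asks for a connector's current status."""
    return urllib.request.Request(
        url=f"{base_url}/connectors/{name}/status", method="GET"
    )


if __name__ == "__main__":
    base = "http://localhost:8083"  # assumed worker address
    config = {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",              # upper bound on tasks for this connector
        "file": "/tmp/example.log",    # hypothetical input file
        "topic": "example-logs",       # hypothetical destination topic
    }
    req = build_create_request(base, "example-file-source", config)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually submit it, a running worker is needed:
    # urllib.request.urlopen(req)
```

Any worker in the cluster can receive these requests; which worker ends up running the connector's tasks is decided by the rebalancing described above.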
I want to mention two important features that simplify life for both connector developers and users. The first feature is offset management, which provides standardized data delivery guarantees. Delivery guarantees are rarely provided by other systems; they generally offer some sort of best-effort, but unreliable, delivery. Ironically, stream processing frameworks often do a better job than tools specifically designed for data copying. Kafka Connect handles offset checkpointing for connectors, and this fits in as a natural extension to Kafka’s offset commit functionality. For sources, this works with offsets that have complex structure (e.g. timestamps + autoincrementing IDs in a database) and requires no implementation support from the connector beyond defining the offsets and being able to start reading from a saved offset. For sinks, we can leverage Kafka’s existing offset functionality, but in order to ensure data is completely written, sinks must also support a flush operation. Commits are processed automatically at regular intervals. By default, this mode of managing offsets provides at-least-once delivery; internally, both sources and sinks simply flush all data to the output and then commit the offsets. Note that some connectors will opt out of this functionality in order to provide even stronger guarantees. For example, the HDFS connector manages its own offsets because (carefully) tracking them in HDFS along with the data allows for exactly-once delivery.
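The "flush, then commit" ordering and its at-least-once consequence can be seen in a small simulation. This is a toy model, not the framework's code: `ToySink`, `copy_records`, and the simulated crash point are all invented for illustration, and offsets are plain list indices.

```python
class ToySink:
    """Stand-in for a destination system with buffered writes."""

    def __init__(self):
        self.buffer = []
        self.output = []  # what the destination has durably stored

    def put(self, records):
        self.buffer.extend(records)

    def flush(self):
        self.output.extend(self.buffer)
        self.buffer.clear()


def copy_records(source, sink, committed, batch_size=2, crash_at=None):
    """Copy from `source` starting at the committed offset.

    Returns the last committed offset. If `crash_at` equals the position
    reached after a flush, the run stops before that offset is committed.
    """
    position = committed  # on restart: rewind to the last committed offset
    while position < len(source):
        batch = source[position:position + batch_size]
        sink.put(batch)
        sink.flush()             # 1. make the data durable in the destination
        position += len(batch)
        if crash_at == position:
            return committed     # crashed after flush, before commit
        committed = position     # 2. only then commit the offset
    return committed


if __name__ == "__main__":
    source = ["r0", "r1", "r2", "r3", "r4"]
    sink = ToySink()
    committed = copy_records(source, sink, committed=0, crash_at=4)
    committed = copy_records(source, sink, committed=committed)
    print(committed, sink.output)
```

The crash between flush and commit means the second run re-copies the uncommitted batch, so some records appear twice but none are lost. Exactly-once delivery needs the offset to be stored atomically with the data, which is what the HDFS connector arranges by tracking offsets in HDFS itself.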
The second feature I want to mention is converters. Serialization formats may seem like a minor detail, but failing to separate the details of data serialization in Kafka from the details of source or sink systems results in a lot of inefficiency: a lot of code for doing simple data conversions is duplicated across a large number of ad hoc connector implementations, and each connector ultimately accumulates its own set of serialization options as it is used in more environments – JSON, Avro, Thrift, protobufs, and more. Much like the serializers in Kafka’s producer and consumer, Converters abstract away the details of serialization. They differ from serializers in that they guarantee data is transformed to a common data API defined by Kafka Connect. This API supports both schema and schemaless data, common primitive data types, complex types like structs, and logical type extensions. By sharing this API, connectors write one set of translation code and Converters handle format-specific details. For example, the JDBC connector can easily be used to produce either JSON or Avro to Kafka, without any format-specific code in the connector.
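A small sketch of the separation between connector code and Converters follows. It is loosely modelled on the idea only: the `Converter` base class, its method names, the dict-based schema, and the `DelimitedConverter` format (a stand-in for a second format such as Avro) are all assumptions for illustration and not Kafka Connect's actual interfaces.

```python
import json
from abc import ABC, abstractmethod


class Converter(ABC):
    """Translates between the common (schema, value) form and bytes."""

    @abstractmethod
    def from_connect_data(self, schema, value):
        """Serialize a (schema, value) pair to bytes for Kafka."""

    @abstractmethod
    def to_connect_data(self, data):
        """Deserialize bytes from Kafka back to a (schema, value) pair."""


class JsonEnvelopeConverter(Converter):
    """JSON with the schema carried alongside the payload."""

    def from_connect_data(self, schema, value):
        doc = {"schema": schema, "payload": value}
        return json.dumps(doc, sort_keys=True).encode("utf-8")

    def to_connect_data(self, data):
        doc = json.loads(data.decode("utf-8"))
        return doc["schema"], doc["payload"]


class DelimitedConverter(Converter):
    """Hypothetical header-plus-row text format (no commas in values)."""

    CASTS = {"int": int, "string": str}

    def from_connect_data(self, schema, value):
        header = ",".join(f"{name}:{typ}" for name, typ in schema.items())
        row = ",".join(str(value[name]) for name in schema)
        return f"{header}\n{row}".encode("utf-8")

    def to_connect_data(self, data):
        header, row = data.decode("utf-8").split("\n")
        schema = dict(field.split(":") for field in header.split(","))
        value = {
            name: self.CASTS[typ](raw)
            for (name, typ), raw in zip(schema.items(), row.split(","))
        }
        return schema, value


def row_to_connect(row):
    """Connector-side code: written once, unaware of any wire format."""
    schema = {"id": "int", "name": "string"}
    return schema, {"id": row[0], "name": row[1]}


if __name__ == "__main__":
    schema, value = row_to_connect((7, "gwen"))
    for converter in (JsonEnvelopeConverter(), DelimitedConverter()):
        print(converter.from_connect_data(schema, value))
```

Swapping the converter changes the bytes written to Kafka while `row_to_connect` stays untouched, which is the point of keeping format details out of the connector.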
With all these pieces you can see how we can tie together Kafka and Kafka Connect with stream processing frameworks and applications to not only simplify building these data pipelines and solve data integration challenges, but also transform how your company manages its data pipelines. Kafka provides the central hub for real-time data and Kafka Connect simplifies operationalization: one service to maintain, common metrics, common monitoring, and no dependence on your choice of process and cluster management. You can centrally manage a Kafka Connect cluster running in distributed mode and accessed via the REST API, allowing your ops team to provide data integration as a service to your entire organization. Developers who want to build a complex data pipeline can submit jobs to copy data into and out of Kafka – it’s zero coding (assuming a connector is available). Then, they can easily leverage either the traditional clients or stream processing frameworks to transform that data. The output is stored back into another Kafka topic or served up directly. As a side benefit, standardizing on Kafka encourages reuse of existing data (both raw and transformed). Providing this service not only makes it easy to build your *own* complex data pipeline, it encourages other people in the org to build on top of your existing work. Confluent Platform also provides additional tools that make this setup even more powerful. For example, the schema registry controls the format of data in each topic. Besides ensuring data quality and compatibility, it also encourages decoupling of teams by allowing anyone to discover what data is in a topic, grab its schema, and immediately start using that data without adding coordination overhead with another team. A stream data platform built around Kafka and Kafka Connect allows you to scale to handle your entire organization’s real-time data, while maintaining simple management and easy operationalization of your data pipeline.
Kafka Connect provides the framework, but I want to spend a few minutes describing the current state of the connector ecosystem. While the framework ships with Apache Kafka, connectors use a federated approach to development. Confluent helped kick off connector development with a few key open source connectors – JDBC, for importing data from any relational database, and HDFS, for exactly-once delivery of data into HDFS and Hive. Confluent will continue to add more open source connectors. We’ve also started tracking connectors that the community has been developing on a page we’re calling the Connector Hub. We’ve already got a dozen or so connectors, and more are popping up every week. We’ll be working to make this index as useful to users as possible, offering information about the current state of the connector implementations and feature sets.
I also wanted to highlight a few of the winning connectors and applications that came out of the Stream Data Hackathon we held last night to help demonstrate the variety of systems Kafka Connect can be used with.
Finally, the Kafka Connect framework is only a few months old and there are lots more improvements and refinements in the pipeline. I want to highlight a few here. The upcoming release of Kafka will include many improvements to the REST API, including better control over connectors, status tracking, standardized APIs for exposing configs, and improved metrics. I emphasized earlier that copying data should be isolated from transforming it. This is important to do, but there are some small, limited transformations which are sometimes important to perform *before* the data hits Kafka, for example removing personally identifiable information. We’re planning to add support for this in a way that won’t compromise the guarantees Kafka Connect provides and also doesn’t require connectors to duplicate the implementation of this functionality. We’re also looking at leveraging both Kafka Connect and the new Kafka Streams library in a single application to enable a “data pipeline in an app”. The key feature in Kafka Connect that will make this possible is a sort of embedded mode which can scale up and down like a distributed cluster, but which you start and run from your own application. With both Kafka Connect and Kafka Streams able to scale up and down with more or fewer processes, your entire data pipeline can be easily scaled. And, of course, we’re also actively working with the community to develop many more connectors.
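To give a feel for what a "small, limited" single-record transformation might look like, here is a purely hypothetical sketch of masking personally identifiable fields before a record reaches Kafka. The talk does not describe the planned API, so `mask_fields`, `apply_transforms`, and the record layout are invented for illustration only.

```python
def mask_fields(fields, replacement="***"):
    """Return a transform that replaces the given fields in one record."""
    hidden = set(fields)

    def transform(record):
        # Build a new dict so the original record is left untouched.
        return {
            key: (replacement if key in hidden else val)
            for key, val in record.items()
        }

    return transform


def apply_transforms(record, transforms):
    """Apply each single-record transform in order."""
    for transform in transforms:
        record = transform(record)
    return record


if __name__ == "__main__":
    record = {"id": 42, "email": "user@example.com", "plan": "pro"}
    print(apply_transforms(record, [mask_fields(["email"])]))
```

Each transform sees exactly one record and returns one record, so it needs no state and cannot reorder or drop data, which is one way such a feature could avoid weakening the delivery guarantees.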