Kafka Connect
Future of Data: Princeton Meetup
Kaufman Ng
kaufman@confluent.io
Agenda
• About Me
• About Confluent
• Apache Kafka 101
• Kafka Connect
• Logical Architecture
• Core Components and Execution Model
• Connector Configuration, Deployment and Management
• Connector Examples
• Out-of-the-Box
• Community
• Wrap Up and Questions
About Me
• Solutions Architect, Confluent
• Senior Solutions Architect, Cloudera
• Bloomberg
• Contributor to Kafka, Parquet
• kaufman@confluent.io
• Twitter: @kaufmanng
About Confluent
About Confluent and Apache Kafka
• Founded by the creators of Apache Kafka
• Founded September 2014
• Technology developed while at LinkedIn
• 75% of active Kafka committers
Cheryl Dalrymple
CFO
Jay Kreps
CEO
Neha Narkhede
CTO, VP Engineering
Luanne Dauber
CMO
Leadership
Todd Barnett
VP WW Sales
Jabari Norton
VP Business Dev
What is the Confluent Platform?
Confluent’s Offerings
• Kafka Core: Connect, Streams, Java Client
• Confluent Platform adds: Additional Clients (C/C++, Python), REST Proxy, Schema Registry, Pre-Built Connectors
• Confluent Enterprise adds: Stream Monitoring, Message Delivery, Stream Audit, Connector Management
Confluent Platform: Kafka ++
Feature and benefit, across Apache Kafka, Confluent Platform 3.0, and Confluent Enterprise 3.0:
• Apache Kafka: High-throughput, low-latency, highly available, secure distributed message system
• Kafka Connect: Advanced framework for connecting external sources/destinations into Kafka
• Java Client: Provides easy integration into Java applications
• Kafka Streams: Simple library that enables streaming application development within the Kafka framework
• Additional Clients: Supports non-Java clients; C, C++, Python, etc.
• REST Proxy: Provides universal access to Kafka from any network-connected device via HTTP
• Schema Registry: Central registry for the format of Kafka data; guarantees all data is always consumable
• Pre-Built Connectors: HDFS, JDBC, and other connectors fully certified and fully supported by Confluent
• Confluent Control Center: Includes Connector Management and Stream Monitoring; a connection and monitoring command center that provides advanced functionality and control
• Support: Community (Apache Kafka), Community (Confluent Platform 3.0), 24x7x365 (Confluent Enterprise 3.0)
• Pricing: Free (Apache Kafka), Free (Confluent Platform 3.0), Subscription (Confluent Enterprise 3.0)
What’s New in Confluent Platform 3.0
Kafka Streams
• A library for writing stream data applications
using Kafka
• Powerful and lightweight
Confluent’s First Commercial Product
• Confluent Control Center
• A comprehensive management system for Kafka
clusters
• Stream monitoring capability that monitors data
pipelines from end-to-end
(Diagram: Confluent Control Center monitoring producers, consumers, Kafka Connect, Kafka Streams, and topics)
Kafka Connect (version 2.0)
• Expanding community of connectors and
architectures
• Improvements to Schema Registry and REST
proxy
Latest Version: Confluent Platform 3.0.1 (released September 2016)
• Based on latest version of Apache Kafka 0.10.0.1
• Bug fixes + improvements (56 issues)
• Highlights
• Kafka Streams: new Application Reset Tool
• Security fixes
• And many others
Introduction to Real-Time Streaming with Apache Kafka
What does Kafka do?
(Diagram: producers and consumers are your interfaces to the world; Kafka Connect keeps topics connected to your systems in real time)
Kafka is much more than
a pub-sub messaging system
Before: Many Ad Hoc Pipelines
(Diagram: ad hoc point-to-point pipelines connecting User Tracking, Operational Logs, Operational Metrics, Espresso, Cassandra, and Oracle to Hadoop, Search, Monitoring, Data Warehouse, Security, Fraud Detection, and Application systems)
After: Central Hub with Kafka
(Diagram: the same systems connected through Kafka as a central hub: User Tracking, Operational Logs, Operational Metrics, Espresso, Cassandra, and Oracle feed Kafka, which in turn feeds Hadoop, Log Search, Monitoring, Data Warehouse, Security, Fraud Detection, and Application systems)
Kafka Connect
Simplified, scalable data import/export for Kafka
What is Kafka Connect?
Kafka Connect is a distributed, scalable, fault-tolerant service designed to reliably
stream data between Kafka and other data systems.
Data is produced from a source and consumed to a sink.
Example use cases:
• Publishing SQL tables (or an entire SQL database) into Kafka
• Consuming Kafka topics into HDFS for batch processing
• Consuming Kafka topics into Elasticsearch for secondary indexing
• Integrating legacy systems with modern Kafka framework
• … many others
Why Kafka Connect?
Kafka producers and consumers are simple in theory, yet ungainly in practice
Many common requirements
• Data conversion (serialization)
• Parallelism / scaling
• Load balancing
• Fault tolerance / fail-over
• General management
A few data-specific requirements
• Move data to/from sink/source
• Support relevant delivery semantics
Kafka Connect : Separation of Concerns
High-level Model
Source connectors copy data from an external source system into Kafka.
Sink connectors copy data from Kafka into an external destination system.
Kafka Connect : Source & Sink Connectors
(Diagram: several data sources feed Kafka through source connectors running in Kafka Connect, and several data destinations are fed from Kafka through sink connectors)
How is Connect different from a producer or consumer?
• Producers and consumers provide complete flexibility to send any data to Kafka
or process it in any way.
• This flexibility means you do everything yourself.
• Kafka Connect’s simple framework allows:
• developers to create connectors that copy data to/from other systems.
• operators/users to use said connectors just by writing configuration files and
submitting them to Connect -- no code necessary
• community and 3rd-party engineers to build reliable plugins for common data
sources and sinks
• deployments to deliver fault tolerance and automated load balancing out-of-
the-box
Let the Framework Do the Hard Work
• Serialization / de-serialization
• Schema Registry integration
• Fault tolerance / failover
• Partitioning / scale-out
• … and lets the developer focus on domain-specific copying details
Kafka Connect Architecture (low-level)
Connect has three main components: Connectors, Tasks, and Workers
In practice, a stream partition could be a database table, a log file, etc.
Standalone vs Distributed Mode
Connect workers can run in two modes: standalone or distributed
Standalone mode runs in a single process -- ideal for testing and development,
and consumer/producer use cases that need not be distributed (for example,
tailing a log file).
Distributed mode runs in multiple processes on the same machine or spread
across multiple machines. Distributed mode is fault tolerant and scalable (thanks
to Kafka Consumer Group support).
Kafka Connect Architecture : Streams -> Partitions -> Tasks
(Diagram: a stream of records, e.g. a database, is divided into stream partitions, e.g. database tables, which are assigned to Task 1 and Task 2)
Kafka Connect Architecture : Execution Model
(Diagram: Tasks 1-4 distributed across Workers 1-3, which run on Hosts 1 and 2)
Kafka Connect Architecture : Fault Tolerance and Elasticity
(Diagram: three workers on two hosts sharing Tasks 1-4)
Kafka Connect Architecture : Fault Tolerance and Elasticity
(Diagram: after Host 2 fails, the remaining Workers 1 and 2 on Host 1 take over all four tasks)
Execution Model - Distributed
(Diagram sequence: connectors and their tasks are automatically balanced across the workers of a distributed Connect cluster)
Converters
Pluggable API to convert data between native formats and Kafka
• Source connectors: converters are invoked after the data has been fetched from the source and
before it is published to Kafka
• Sink connectors: converters are invoked after the data has been consumed from Kafka and
before it is stored to the sink
• Converters are applied to both key and value data
Apache Kafka ships with JSONConverter
Confluent Platform adds AvroConverter
• AvroConverter allows integration with Confluent Schema Registry to track topic schemas and
ensure compatible schema evolution.
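As an illustration of how this is wired up, a worker might select converters with settings like the following (a minimal sketch; the Schema Registry URL is a placeholder):
# Avro values, with schemas tracked in the Confluent Schema Registry
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
# JSON keys, with embedded schemas disabled
key.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false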
Source Offset Management
Connect tracks the offset that was last consumed for a source,
to restart tasks at the correct “starting point” after a failure.
These offsets are different from Kafka offsets -- they’re based
on the source system (e.g. a database, file, etc.).
In standalone mode, the source offset is tracked in a local file.
In distributed mode, the source offset is tracked in a Kafka
topic.
Connect configuration (1 of 2)
The following discrete configuration dimensions exist for Connect:
• Distributed mode vs standalone mode
• Connectors
• Workers
• Overriding Producer and Consumer settings
Connect configuration (2 of 2)
In standalone mode, Connector configuration is stored in a file and
specified as a command-line argument.
In distributed mode, Connector configuration is set via the REST API, in
the JSON payload.
Connect configuration options are as follows:
Parameter Description
name Connector’s unique name.
connector.class Connector’s Java class.
tasks.max Maximum number of tasks to create. The Connector may create fewer if it cannot achieve this level of parallelism.
topics (sink connectors only) List of input topics to consume from.
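In standalone mode these options end up in a small properties file, one per connector; a sketch of the shape (angle brackets mark values you supply, plus any connector-specific options):
name=<unique connector name>
connector.class=<fully qualified connector class>
tasks.max=1
topics=<topics to consume>        # sink connectors only
In distributed mode the same keys are sent inside the "config" object of the JSON payload, as shown in the REST examples later.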
Configuring workers
Worker configuration is specified in a configuration file that is passed as an
argument to the script starting Connect.
The following slides will go through important configuration options in three
dimensions: standalone mode, distributed mode, and parameters common
to both modes.
See docs.confluent.io for a comprehensive list, along with an example
configuration file.
Configuring Kafka Connect Workers : Core Settings
These parameters define the worker connection to the Kafka cluster.
See docs.confluent.io for a comprehensive list.
Parameter Description
bootstrap.servers A list of host/port pairs to use for establishing the initial connection to the Kafka cluster.
key.converter Converter class for key Connect data.
value.converter Converter class for value Connect data.
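For illustration, the core section of a worker configuration might look like this (broker host names are placeholders):
bootstrap.servers=broker1:9092,broker2:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter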
Configuring Kafka Connect Workers : Task / Storage Control
• Standalone mode
Parameter Description
offset.storage.file.filename The file to store Connector offsets in.
• Distributed mode
Parameter Description
group.id A unique string that identifies the Connect cluster group this worker belongs to.
config.storage.topic The topic to store Connector and task configuration data in. This must be the same for all workers with the same group.id.
offset.storage.topic The topic to store offset data for Connectors in. This must be the same for all workers with the same group.id.
session.timeout.ms The timeout used to detect failures when using Kafka’s group management facilities.
heartbeat.interval.ms The expected time between heartbeats to the group coordinator when using Kafka’s group management facilities. Must be smaller than session.timeout.ms.
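A sketch of the mode-specific additions to the worker configuration (the topic names and file path are illustrative, not required values):
# Standalone worker
offset.storage.file.filename=/tmp/connect.offsets
# Distributed workers (identical values on every worker sharing the group)
group.id=connect-cluster
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets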
Running in standalone mode
Starting Connect in standalone mode involves starting a
process with one or more connect configurations:
connect-standalone worker.properties connector1.properties
[connector2.properties connector3.properties ...]
Each connector instance will be run in its own thread.
Configuration will be covered later.
Distributed Mode
Running in distributed mode
Starting Connect in distributed mode involves starting Connect on each worker node with the following:
connect-distributed worker.properties
Connectors are then added, modified, and deleted via a
REST API. Configuration will be covered later.
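For example, once the workers are up, a connector could be created with a request along these lines (the connector name, topic, and file path are placeholders):
curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors \
  --data '{"name": "local-file-sink", "config": {"connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector", "tasks.max": "1", "topics": "my-topic", "file": "/tmp/my-topic.txt"}}'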
Managing distributed mode
To install a new connector, the connector needs to be packaged in a JAR and placed in the CLASSPATH of each worker process.
A REST API is used to create, list, modify, or delete
connectors in distributed mode. All worker processes
listen for REST requests, by default on port 8083. REST
requests can be made to any worker node.
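For example (the JAR path here is hypothetical), the CLASSPATH can be extended before starting each worker:
export CLASSPATH=$CLASSPATH:/opt/connectors/my-connector.jar
connect-distributed worker.properties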
Kafka Connect: Basic REST API
Method Path Description
GET /connectors Get a list of active connectors.
POST /connectors Create a new connector.
PUT /connectors/(string: name)/config Create a new connector, or update the configuration
of an existing connector.
GET /connectors/(string: name)/config Get configuration info for a connector.
GET /connectors/(string: name)/tasks/<task-id>/ Retrieve details for a specific task.
DELETE /connectors/(string: name) Delete a configured connector from the worker pool.
Learn more about the REST API at docs.confluent.io
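A few illustrative calls against a worker on the default port (the connector name is a placeholder):
curl http://localhost:8083/connectors                              # list active connectors
curl http://localhost:8083/connectors/local-file-sink/config      # show a connector's configuration
curl -X DELETE http://localhost:8083/connectors/local-file-sink   # remove it from the worker pool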
Connector Management
Kafka Connectors
• Confluent-supported connectors (included in Confluent Platform)
• Community-written connectors (just a sampling)
(Connector logos shown, including JDBC)
More Connectors Coming
• Cassandra Sink (Data Mountaineers)
• Elasticsearch Connector (Confluent)
• JDBC source and sink
Connector Hub
http://www.confluent.io/product/connectors
Local File Source and Sink Connector
• The local file source Connector tails a local file
• Each line is published as a Kafka message to the target topic
• The local file sink Connector appends Kafka messages to a local file
• Each message is written as a line to the target file.
• Both the source and sink Connectors need to be run in standalone mode.
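A minimal sketch of the two configurations (file paths and topic names are illustrative):
# File source: tail a file into a topic
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/var/log/app.log
topic=app-log-lines
# File sink: append a topic to a file
name=local-file-sink
connector.class=org.apache.kafka.connect.file.FileStreamSinkConnector
tasks.max=1
topics=app-log-lines
file=/tmp/app-log-copy.txt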
JDBC Source Connector
The JDBC source Connector periodically polls a relational database for new or recently
modified rows, creates an Avro record, and produces the Avro record as a Kafka message.
Records are divided into Kafka topics based on table name.
New and deleted tables are handled automatically by the Connector.
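A sketch of what a JDBC source configuration might look like in incrementing mode (the connection URL, column, and prefix are illustrative; see the connector documentation for the full option list):
name=jdbc-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:postgresql://dbhost:5432/mydb?user=app&password=secret
mode=incrementing
incrementing.column.name=id
topic.prefix=mydb-
With these settings each table would be published to a topic named after the prefix plus the table name, e.g. mydb-orders.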
HDFS Sink Connector
The HDFS Sink Connector writes Kafka messages into target files in an HDFS cluster.
Kafka topics are partitioned into chunks by the connector, and each chunk is stored as an
independent file in HDFS. The nature of the partitioning is configurable (by default, the
partitioning defined for the Kafka topic itself is preserved).
The connector integrates directly with Hive. Configuration options enable the automatic
creation/extension of an external Hive partitioned table for each Kafka topic.
The connector supports exactly-once delivery semantics, so there is no risk of duplicate
data for Hive queries.
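A sketch of an HDFS sink configuration with Hive integration enabled (URLs, sizes, and names are illustrative):
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
topics=my-topic
hdfs.url=hdfs://namenode:8020
flush.size=1000
hive.integration=true
hive.metastore.uris=thrift://hive-metastore:9083
schema.compatibility=BACKWARD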
Connector Development : Best Practices
• Define data stream for reasonable parallelism and utility
• Develop flexible message syntax, leveraging Schema Registry
(users can opt for JSONConverter, so no lock-in)
• Check the community frequently!
Connect Security
• Supports SASL and TLS/SSL
• Uses java keystores/truststores
• Source security settings are prefixed with "producer":
producer.security.protocol=SSL
producer.ssl.truststore.location=/var/private/ssl/kafka.client.truststore.jks
producer.ssl.truststore.password=test1234
• Sink security settings are prefixed with "consumer":
consumer.security.protocol=SSL
consumer.ssl.truststore.location=/var/private/ssl/kafka.client.truststore.jks
consumer.ssl.truststore.password=test1234
Kafka Connect : Benefits for Developers
• Simplifies data flow into and out of Kafka
• Reduces development costs compared to custom connectors
• Integrated schema management
• Task partitioning and rebalancing
• Offset management
• Fault tolerance
• Delivery semantics, operations, and monitoring
References
Confluent docs: http://docs.confluent.io/current/connect/index.html
Connectors: http://www.confluent.io/product/connectors/
Thank You
Editor's Notes
  1. Talking Points: 1. Companies are faced with very complex environments with difficult-to-manage parts. They want to organize large amounts of data into a well-managed, unified stream data platform. 2. Customers use Confluent Platform for real-time, batch, operational, and analytical purposes. Take away the costly and labor-intensive process of developing proprietary data replication practices and allow the Confluent Platform to make data available in real-time streams. 3. Our platform has Kafka at the core (same build as open source Kafka but with additional bug fixes applied) with components and tools that allow you to successfully deploy to production, including: a schema management layer (ensures data compatibility across applications), Java and REST clients that integrate with our schema management layer, Kafka Connect, Kafka Streams, authentication and authorization, and Confluent Control Center.
  2. Apache Kafka provides solutions to all these problems. From a high level, it's "just" a pub/sub message queue. Producers generate events and store them in the queue, and consumers can read them. The same data can be consumed by multiple consumers, which makes it easy to use the same data in multiple systems, a requirement for our data integration problem.
  3. Today, I want to introduce you to Kafka Connect, Kafka’s new large-scale, streaming data import/export tool that drastically simplifies the construction, maintenance, and monitoring of these data pipelines. Kafka Connect is part of the Apache Kafka project, open source under the Apache license, and ships with Kafka. It’s a framework for building connectors between other data systems and Kafka, and the associated runtime to run these connectors in a distributed, fault tolerant manner at scale.
  4. Part of Apache Kafka
  5. Part of Apache Kafka
  6. Coming back to the idea of separation of concerns, we can now see how, using just a couple of tools, we can very easily construct data pipelines without making some of the significant compromises of other systems. The key stages in the ETL terminology help guide the best points to provide key abstractions and to choose different tools. We do this by combining Kafka Connect and stream processing engines to perform streaming ETL (or ELT, as you prefer). Each does the job it is best at, and Kafka acts as the underlying data storage layer that supports them and allows simple integration with a variety of other applications. As you can see here, the basic pattern is to use Kafka Connect to perform Extraction of the data and load it into Kafka as a temporary, scalable, fault tolerant streaming data store. While you can do this with other, more generic data copy tools, you'll commonly lose important semantics such as at least once delivery of data. Once the data is extracted, you use stream processing engines to perform Transformation, and either this is the endpoint for the data or you can deliver it back into Kafka. Finally, Load the data with another Kafka connector into the destination system. Obviously this is a simplified picture and your pipeline will grow much more complex, with multiple stages of transformation (especially if the intermediate data is useful for reuse by multiple applications, including anything downstream that may not be processed by stream processing engines).
  7. A Connector is a Connector class and a configuration. Each connector defines and updates a set of Tasks that actually copy the data. Connect distributes these tasks across Workers. A Connect cluster is made up of one or more Workers that execute Tasks defined by a Connector. Worker processes are not managed by Connect. Other tools can do this: YARN, Mesos, Chef, Puppet, custom scripts, etc. Sink connectors are constrained to a single consumer group … so as to ensure consistency.
  8. Standalone mode works as a single process. This is really easy to get started with, easy to configure. We like this because it scales down really easily and stays local for testing. Much like Spark's local mode, it's great for development and testing. It's also great for connectors that really only make sense on a single node – for example, processing log files that you need to read off the local file system. In contrast, distributed mode can scale up while providing distribution and fault tolerance: tasks are auto-balanced across workers, failures are automatically handled by redistributing work, and elasticity is simple. Cool implementation note: reuses the group membership functionality of consumer groups – the implementation largely came for free.
  9. At its heart, Connect is very simple. You have sources for importing data into Kafka and sinks for exporting data from Kafka. An implementation of a source or sink is called a connector, so there could be more than one connector for a source or sink. Users only interact with connectors to enable data flows on Kafka. Notice that one endpoint of a source or sink is always Kafka. This is a conscious decision for a simple reason: by leveraging Kafka to model the core abstractions required for stream data ingestion/export, Connect can provide strong guarantees around delivery, fault tolerance, load balancing, and offset maintenance that Kafka already supports. A stream is the logical abstraction for data flow in Kafka Connect. For example, if the MySQL connector operated on a complete database, this stream would represent the stream of updates to any table in the database. If it only operated on a single table, this stream would be a stream of updates to that table. Connect is designed for large scale data integration, which means that it has a parallelism model built in. The basic structure is that all Connect source and sink systems must be mapped to partitioned streams of records. This is a generalization of Kafka's concept of topic partitions. The stream is the complete set of records, which are then split into independent partitioned streams of records. Streams may have a dynamic number of partitions. Over time, a stream may grow to add more partitions or shrink to remove them. For example, in this case, you can add or remove database tables on the fly.
  10. Physically, a Connect cluster is a pool of worker processes. All connector tasks are evenly distributed over the pool of worker processes. You can view workers as just containers that hold the task threads that actually do the data copy. And the reason for this is because process and resource management is orthogonal to Connect’s runtime model on purpose. Every company has their own favorite tools for process and container management including YARN, Mesos, Kubernetes, or none if you prefer to manage processes manually or through other orchestration tools. The framework is not responsible for starting, restarting or stopping processes and can be used with any process management strategy.
  11. Fault tolerance is built-in and is completely automated. So if workers fail, connector tasks load balance automatically over the remaining alive set of worker processes. So data copying is resumed automatically and process failures are transparent to the user.
  12. Notice that task 4 migrated to worker 2 on host 1, given that host 2 failed
  13. Partitioned streams are the logical data model, but they don’t directly map to parallelism, or threads, in Kafka Connect. In the case of the database connector, this might seem reasonable. However, some connectors will have a much larger number of partitions that are much more fine-grained. For example, consider a connector for collecting metrics data – each metric might be considered its own partition, resulting in tens of thousands of partitions for even a small set of application servers. However, we do want to exploit the parallelism provided by partitions. Connectors do this by assigning partitions to tasks. Tasks are, simply, threads of control given to the connector code which perform the actual copying of data. Connectors generate tasks to exploit parallelism of the partitioned stream model Each task has a dedicated thread for processing. The connector assigns a subset of partitions to each task and the task is the one that actually copies the data for that partition. If appropriate, connectors may also use a thread to monitor the input system for changes in the input partitions. If a change occurs, the connector can reconfigure tasks (possibly adding or removing some). The connector will also be asked for new task configurations if the user changes the connector’s own config or the maximum number of tasks. Number of tasks can be controlled by the user. This controls total resource usage of the connector.
  14. Standalone mode works as a single process. This is really easy to get started with, easy to configure. We like this because it scales down really easily and stays local for testing. Much like Spark’s local mode, it’s great for development and testing. It’s also great for connectors that really only make sense on a single node – for example, processing log files that you need to read off the local file system.
  15. In contrast, distributed mode can scale up while providing distribution and fault tolerance: tasks are auto-balanced across workers, failures are automatically handled by redistributing work, and elasticity is simple. Cool implementation note: reuses the group membership functionality of consumer groups – the implementation largely came for free.
  16. Provides fault tolerance and elasticity: tasks are auto-balanced across workers, failures are automatically handled by redistributing work, and elasticity is simple. Cool implementation note: reuses the group membership functionality of consumer groups – the implementation largely came for free.
  17. Provides fault tolerance and elasticity: tasks are auto-balanced across workers, failures are automatically handled by redistributing work, and elasticity is simple. Cool implementation note: reuses the group membership functionality of consumer groups – the implementation largely came for free.
  18. Here is an example of a source and sink. The source is a database source for a single table. The sink is an HDFS sink. So end-to-end, this example is about streaming data ingest from a database table into a directory in HDFS. Every source and sink has a concept of offset that marks the position of the connector as it copies data between sources and sinks. Defining and maintaining offsets during data copy is crucial for providing delivery guarantees, which I will talk about in some time. For sources, the connector implementation has to define what an offset means. For example, for the database source as shown here, the offset is a timestamp. For MySQL, it could be the binlog sequence number. For sinks, offsets are the Kafka topic partition offsets and work the same way as a Kafka consumer offset. The way this offset is used is that every source and sink exposes a commit() API that allows checkpointing the offset of the connector in Kafka, so the data copy can resume from where it left off in case of any failures.
  19. The second feature I want to mention is converters. Serialization formats may seem like a minor detail, but not separating the details of data serialization in Kafka from the details of source or sink systems results in a lot of inefficiency: a lot of code for doing simple data conversions is duplicated across a large number of ad hoc connector implementations. Each connector ultimately contains its own set of serialization options as it is used in more environments – JSON, Avro, Thrift, protobufs, and more. Much like the serializers in Kafka's producer and consumer, the Converters abstract away the details of serialization. Converters are different because they guarantee data is transformed to a common data API defined by Kafka Connect. This API supports both schema and schemaless data, common primitive data types, complex types like structs, and logical type extensions. By sharing this API, connectors write one set of translation code and Converters handle format-specific details. For example, the JDBC connector can easily be used to produce either JSON or Avro to Kafka, without any format-specific code in the connector.
  20. As I previously mentioned, checkpointing of offsets as well as recovery on failures is baked into Copycat (the working name for what became Kafka Connect) and transparent to the user. For example, if there are failures on the connector, on restart, Copycat refers to the checkpointed offsets and resumes the data copy from that offset onwards. The concept of committing offsets is related to the delivery guarantees you get from Copycat. Copycat can support three different levels of delivery guarantees between a source and sink system: at least once, at most once, and exactly once. Each of these requires a certain set of functionality from the input and output systems. At least once: each message is delivered to the output system at least one time, but possibly multiple times. This requires the downstream system to support a flush operation. Most connectors would offer this guarantee. At most once: each message is delivered to the downstream system at most once, but may not be delivered at all. You might want this if you have a very large volume data copy where synchronously committing offsets might not be worth the wait. Exactly once: you need some support from the systems. For source connectors, the output system is Kafka and should be able to achieve exactly once semantics with the idempotent producer; for sink connectors, it can be achieved in a system-specific way, e.g. a database transaction where you write offset and data together.
  21. See docs.confluent.io for more details on all configuration options
  22. A three-node Kafka Connect distributed mode cluster. Connectors (monitoring the source or sink system for changes that require reconfiguring tasks) and tasks (copying a subset of a connector’s data) are automatically balanced across the active workers. The division of work between tasks is shown by the partitions that each task is assigned.
  23. NOTE: CLASSPATH automatically includes $CP_HOME/share/java/kafka-connect-*
  24. Exactly once
  25. A source for a system reading thousands of sensors probably doesn’t need one-topic per sensor … but might need 1 task per 1000 sensors
  26. In spite of this being a serious problem for companies in all industries, the current state of tooling for data integration is all over the place. This is an important need for companies, but the current state of the art involves jury-rigging one-off custom solutions designed for specific systems. This works when you have only a couple of systems but is a huge pain as you grow over time. The overhead is that now you need to learn a lot of these different tools that each have very different guarantees and behavior. Operationally, this is a nightmare - the number of different tools that you have to monitor, administer, deploy, and maintain just keeps multiplying. The other end of the spectrum is using generic data copy tools, like enterprise application integration tools. These tools are overly generic and hence the guarantees provided by these tools often reduce to the lowest common denominator abstraction. The tool doesn't do a great job of handling things like fault tolerance, offset management, and load balancing that every connector needs, since the underlying abstraction is typically weak, like an in-memory queue. These tools also try to abstract away the cost of making point-to-point connections between systems. This is doable when the data moved is small and shared between few applications, but it is infeasible and very expensive when the same large data set needs to be copied several times between systems of different types. The third anti-pattern is using stream processing systems for data integration. First, this is overkill. Second, stream processing APIs in these frameworks are not designed for data integration. As such, after the first couple of connectors, the connector quality quickly drops. The impact is that a rich ecosystem of open source connectors simply doesn't emerge. It is simply overkill.