3. Andrew Stevenson
Lead Mountain Goat at DM
Solution Architect
- Fast Data
- Big Data as long as it’s Fast
Contributed to
- Kafka Connect, Sqoop, Kite SDK
4. Kafka Connect Connectors
20+ connectors.
DataStax Certified Cassandra Sink
Professional Services
Implementations, Architecture reviews
DevOps & Tooling
Connector Support
8. Why?
Enterprise pipelines must consider:
Delivery semantics
Offset management
Serialization / de-serialization
Partitioning / scalability
Fault tolerance / failover
Data model integration
CI/CD
Metrics / monitoring
9. Which results in?
Multiple technologies
- Bash wrappers around Sqoop
- Oozie XML
- Custom Java/Scala/C#
- Third party
- Multiple teams hand-rolling similar solutions
Lack of separation of concerns
- Extraction/loading ends up domain-specific
10. What we really care about…
DOMAIN SPECIFIC TRANSFORMATIONS
Focus on adding value
11. Kafka Connect?
✓Delivery semantics
✓Offset management
✓Serialization / de-serialization
✓Partitioning / scalability
✓Fault tolerance / fail-over
✓Data model integration
✓Metrics
Out of the Box – ONE FRAMEWORK
Lets you focus on domain logic
13. Kafka Connect
“a common framework facilitating
data streams
between Kafka and other systems”
14. Ease of use
Deploy flows via configuration files
with no code necessary
Out of the box & Community
15. Configurations
are key-value mappings
name connector’s unique name
connector.class connector’s class
tasks.max maximum number of tasks to create
Option[topics] list of topics to consume (sinks only)
16. Config Example
name = kudu-sink
connector.class = KuduSinkConnector
tasks.max = 1
topics = kudu_test
connect.kudu.master = quickstart
connect.kudu.sink.kcql = INSERT INTO KuduTable
SELECT * FROM kudu_test
17. KCQL
is a SQL-like syntax allowing streamlined
configuration of Kafka Sink Connectors, and
then some more…
Example:
Project fields, rename or ignore them, and further
customise in plain text
INSERT INTO transactions SELECT field1 AS column1, field2 AS column2 FROM TransactionTopic;
INSERT INTO audits SELECT * FROM AuditsTopic;
INSERT INTO logs SELECT * FROM LogsTopic AUTOEVOLVE;
INSERT INTO invoices SELECT * FROM InvoiceTopic PK invoiceID;
20. Modes
Standalone
- single node, for testing, one-off imports/exports
Distributed
- 1 or more workers on 1 or more servers
form a cluster/consumer group
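A rough sketch of how each mode is launched with the stock Kafka scripts (the .properties and .json file names here are hypothetical):

# Standalone: worker config plus one or more connector config files
bin/connect-standalone.sh worker.properties kudu-sink.properties

# Distributed: start workers, then submit connector configs over the REST API (port 8083 by default)
bin/connect-distributed.sh worker.properties
curl -X POST -H "Content-Type: application/json" --data @kudu-sink.json http://localhost:8083/connectors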
23. Connectors
Define the how
- which plugin to use, must be on CLASSPATH
- breaks up work into tasks
- multiple per cluster, unique name
- fault tolerant
26. Connector API
class CoherenceSinkConnector extends SinkConnector {
  // the Task implementation this connector instantiates
  override def taskClass(): Class[_ <: Task]
  // called once with the connector's configuration properties
  override def start(props: util.Map[String, String]): Unit
  // split the work into at most maxTasks task configurations
  override def taskConfigs(maxTasks: Int): util.List[util.Map[String, String]]
  override def stop(): Unit
}
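A minimal sketch of how taskConfigs is commonly implemented (configProps is an assumed field caching the properties received in start(), not part of the slide): the connector hands each task a copy of its own configuration; a real connector might split tables or files across tasks here.

import scala.collection.JavaConverters._

override def taskConfigs(maxTasks: Int): util.List[util.Map[String, String]] = {
  // one config per task; the framework starts maxTasks instances of taskClass()
  val perTask: Seq[util.Map[String, String]] =
    (1 to maxTasks).map(_ => new util.HashMap[String, String](configProps))
  perTask.asJava
}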
27. Tasks
Perform the actual work
- loading / unloading
- single threaded
Used to Scale
- more tasks, more parallelism
- managed as a Kafka consumer group
- if (tasks > partitions) => idle tasks
Contain no state; it lives in Kafka
- started
- stopped
- restarted
- paused
28. Task API
class CoherenceSinkTask extends SinkTask {
  override def start(props: util.Map[String, String]): Unit
  override def stop(): Unit
  // called before offsets are committed
  override def flush(offsets: util.Map[TopicPartition, OffsetAndMetadata]): Unit
  // receives batches of records polled from Kafka
  override def put(records: util.Collection[SinkRecord]): Unit
}
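As an illustrative toy (not from the deck), a sink task that buffers records in put and drains them in flush, the point at which Connect commits offsets:

import java.util
import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer
import org.apache.kafka.clients.consumer.OffsetAndMetadata
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.connect.sink.{SinkRecord, SinkTask}

class PrintlnSinkTask extends SinkTask {
  private val buffer = ArrayBuffer[SinkRecord]()
  override def version(): String = "0.1"
  override def start(props: util.Map[String, String]): Unit = ()
  // the framework polls Kafka and hands us batches of SinkRecords
  override def put(records: util.Collection[SinkRecord]): Unit =
    buffer ++= records.asScala
  // called before offsets are committed: push buffered data to the target system
  override def flush(offsets: util.Map[TopicPartition, OffsetAndMetadata]): Unit = {
    buffer.foreach(r => println(s"${r.topic()}/${r.kafkaPartition()}: ${r.value()}"))
    buffer.clear()
  }
  override def stop(): Unit = ()
}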
First me, hurray; then DataMountaineer, who we are and what we do; then Connect: why, what it is, and how it works. Hopefully time for a demo.
Big Data background: Clearstream, HFTs, investment banks.
We were Big Data experts but saw first-hand that Big Data is no good if it's too slow.
Anyone know Knight Capital? Lost $440 million in 2012 in about 45 minutes; that's roughly $160,000 per second.
Hourly or EOD risk reporting is no good; by then it's too late, you're dead.
Put another way, would you cross the road outside with information that's 10 minutes old?
Everybody will have been involved in data integration; data is the new oil.
It should be easy… right? Just copy the data from that database and put it over here.
We need to think about these items even if we are already using Kafka consumers and producers. Or we should be.
It all results in wasted time and resources.
If you're a CTO of a large organization, this will make your toes curl.
No common framework.
Team A puts logic in its extract component, which means it's not reusable.
Team B re-implements it.
Kafka Connect, scalable and fault-tolerant, addresses developers' needs. One stack, all part of Kafka.
Ingest data, unload data, buffer and persist data, process data, all from one distribution.
SAT4s is an example: a 4-second wind-farm signal flow from a SQL Server database to HDFS and Cassandra, configured with no coding.
Hazelcast requires that data is serializable; JSON and Avro are supported.
The highest-level component you deal with as a developer. You can have multiple instances of connectors per cluster; each must have a unique name.
The config topic is the backing topic of each Connect cluster. It has one partition, which guarantees the order of configuration updates.
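For reference, the backing topics are named in the distributed worker's configuration; the topic names below are common conventions, not mandated:

group.id = connect-cluster
config.storage.topic = connect-configs   (single partition, compacted)
offset.storage.topic = connect-offsets
status.storage.topic = connect-status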
The framework starts up your connector and asks it for task configurations,
then assigns the tasks to members of the cluster/consumer group.
How does it do this? It implements Kafka's AbstractCoordinator.
Kafka notifies the Herder that a rebalance is happening, and the framework rebalances the partitions or tasks across the remaining consumer group / Connect cluster.
Basically a consumer group behind the scenes.
Tasks are single-threaded, so just like a regular consumer: if you have more tasks than partitions in your topics, they'll be idle.
No state, so good for RESTful, microservice architectures; the state is in Kafka.
SinkRecord is a wrapper containing topic information, key and value schemas, and payloads.
Connect is configured with Converters that serialize/deserialize data back and forth between Kafka and Connect.
You could, for example, write your own Protobuf converter and plug that in.
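Converters are wired in through the worker (or per-connector) configuration; for example, with the stock JSON converter:

key.converter = org.apache.kafka.connect.json.JsonConverter
value.converter = org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable = false

A custom converter (Protobuf, say) implements org.apache.kafka.connect.storage.Converter and is configured the same way.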