Seattle Data Science And Data Engineering Meetup
Abhishek Goswami
12/14/2016
abgoswam@gmail.com
https://www.linkedin.com/in/abgoswam
Table of Contents
Introduction
Motivation
What is Kafka
Characteristics
APIs
Demos
Internals
Logs
Logs in Distributed Systems
Design Fundamentals
ZooKeeper Dependency
Replication
Source Code
Summary, Q&A
2
● Introduction
○ Motivation
○ What is Kafka?
○ Characteristics
○ APIs
○ Demos
● Internals
● Summary, Q&A
3
Introduction: Motivation
4
Data integration: moving large volumes of data between many different sources and destinations.
Introduction: What is Kafka?
Distributed, partitioned, replicated commit-log service
Provides the functionality of a messaging system, but with a unique design
5
Competitive Landscape:
● AWS Kinesis, Azure EventHub
Use Cases:
● Messaging
● Website Activity Tracking
● Logging
● Stream Processing
Introduction: Characteristics
6
● Scalability of a filesystem
○ High throughput
○ Many TB per server
● Guarantees of a database
○ Messages strictly ordered
○ All data persistent
● Distributed by default
○ Replication
○ Partitioning
Introduction: APIs
Four core APIs:
Producer API
allows applications to send streams of data to topics in the Kafka cluster.
Consumer API
allows applications to read streams of data from topics in the Kafka cluster.
Connect API
allows implementing connectors that continually pull from some source system or application into
Kafka, or push from Kafka into some sink system or application.
Streams API
allows transforming streams of data from input topics to output topics, generalizing batch processing to a real-time, low-latency environment (see the sketch after this slide).
7
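To make the Producer and Consumer APIs concrete, here is a minimal, hedged Java sketch using the kafka-clients library. The broker address (localhost:9092), topic name (demo-topic), and group id are illustrative assumptions, not part of the original slides, and a recent kafka-clients version is assumed (the API has evolved since the 0.10.x release current at the time of this talk).

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaApiDemo {
    public static void main(String[] args) {
        // Producer: send a few messages to an assumed topic "demo-topic".
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            for (int i = 0; i < 3; i++) {
                producer.send(new ProducerRecord<>("demo-topic", "key-" + i, "message-" + i));
            }
        } // close() flushes any buffered sends

        // Consumer: read the messages back from the same topic.
        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "demo-group");                // assumed consumer group
        c.put("auto.offset.reset", "earliest");         // start from the beginning of the log
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("offset=%d key=%s value=%s%n", r.offset(), r.key(), r.value());
            }
        }
    }
}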
Introduction: Demos
8
● Introduction
● Internals
○ Log
○ Logs in Distributed Systems
○ Design Fundamentals
○ ZooKeeper Dependency
○ Replication
○ Source Code
● Summary, Q&A
9
Internals: Log
10
Internals: Logs in Distributed Systems
11
Internals: Logs in Distributed Systems
12
Internals: Design Fundamentals
13
Internals: ZooKeeper Dependency
Kafka requires ZooKeeper
Kafka uses ZooKeeper to do things like:
Cluster membership
Electing a controller
Topic configuration (which topics exist, who the leader is, etc.)
14
Internals: Replication
15
Internals: Source Code
GitHub repo
https://github.com/apache/kafka
16
● Introduction
● Internals
● Summary, Q&A
17
Summary
18
Kafka solves data integration needs.
Distributed, partitioned, replicated commit-log service
Q&A
19
References:
1. Simplifying data pipelines with Apache Kafka
2. Learning Apache Kafka, 2nd Edition
3. https://www.tutorialspoint.com/apache_kafka/index.htm
4. https://www.infoq.com/articles/apache-kafka
5. https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
abgoswam@gmail.com
https://www.linkedin.com/in/abgoswam

Editor's Notes

  1. Two main challenges in data integration: a large volume of data, and many different sources and destinations (a second challenge is analyzing the collected data). To overcome these challenges, you need a messaging system. Kafka is designed for distributed, high-throughput systems and tends to work very well as a replacement for a more traditional message broker. Compared to other messaging systems, Kafka has better throughput, built-in partitioning, replication, and inherent fault tolerance, which makes it a good fit for large-scale message-processing applications.

What is a messaging system? A messaging system is responsible for transferring data from one application to another, so the applications can focus on the data without worrying about how to share it. Distributed messaging is based on the concept of reliable message queuing: messages are queued asynchronously between client applications and the messaging system. Two messaging patterns are available: point-to-point, and publish-subscribe (pub-sub). Most messaging patterns follow pub-sub.

In a point-to-point system, messages are persisted in a queue. One or more consumers can consume the messages in the queue, but a particular message can be consumed by at most one consumer; once a consumer reads a message from the queue, it disappears from that queue. In a publish-subscribe system, messages are persisted in a topic. Unlike a point-to-point system, consumers can subscribe to one or more topics and consume all the messages in those topics. In a publish-subscribe system, message producers are called publishers and message consumers are called subscribers.

Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data and lets you pass messages from one endpoint to another. Kafka is suitable for both offline and online message consumption. Kafka messages are persisted on disk and replicated within the cluster to prevent data loss. Kafka is built on top of the ZooKeeper synchronization service, and it integrates very well with Apache Storm and Spark for real-time streaming data analysis.

Benefits of Kafka:
- Reliability: Kafka is distributed, partitioned, replicated, and fault tolerant.
- Scalability: the Kafka messaging system scales easily without downtime.
- Durability: Kafka uses a distributed commit log, which means messages persist on disk as fast as possible.
- Performance: Kafka has high throughput for both publishing and subscribing, and maintains stable performance even with many TB of stored messages.

Use cases:
- Metrics: Kafka is often used for operational monitoring data, aggregating statistics from distributed applications to produce centralized feeds of operational data.
- Log aggregation: Kafka can be used across an organization to collect logs from multiple services and make them available in a standard format to multiple consumers.
- Stream processing: popular frameworks such as Storm and Spark Streaming read data from a topic, process it, and write the processed data to a new topic where it becomes available to users and applications. Kafka's strong durability is also very useful in the context of stream processing.

Kafka is a unified platform for handling all real-time data feeds. It supports low-latency message delivery, guarantees fault tolerance in the presence of machine failures, and can handle a large number of diverse consumers. Kafka is very fast, performing on the order of 2 million writes/sec. Kafka persists all data to disk, which essentially means that all writes go to the page cache of the OS (RAM); this makes it very efficient to transfer data from the page cache to a network socket.
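To make the point-to-point vs. publish-subscribe distinction concrete, here is a minimal, hedged Java sketch of how Kafka's consumer groups provide both patterns. The broker address, topic name ("orders"), and group ids are illustrative assumptions, and a recent kafka-clients version is assumed.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerGroupSemantics {
    // Build a consumer with a given group id; connection details are assumed.
    static KafkaConsumer<String, String> consumer(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", groupId);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return new KafkaConsumer<>(props);
    }

    public static void main(String[] args) {
        // Point-to-point (queue) semantics: two consumers in the SAME group
        // split the topic's partitions, so each message goes to only one of them.
        KafkaConsumer<String, String> workerA = consumer("order-workers");
        KafkaConsumer<String, String> workerB = consumer("order-workers");

        // Publish-subscribe semantics: a consumer in a DIFFERENT group
        // independently receives every message on the topic.
        KafkaConsumer<String, String> auditor = consumer("order-auditing");

        for (KafkaConsumer<String, String> c : java.util.List.of(workerA, workerB, auditor)) {
            c.subscribe(Collections.singletonList("orders")); // assumed topic name
        }
        // Each consumer would now poll() in its own thread; KafkaConsumer is not thread-safe.
    }
}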
  2. Kafka includes four core APIs:
- The Producer API allows applications to send streams of data to topics in the Kafka cluster.
- The Consumer API allows applications to read streams of data from topics in the Kafka cluster.
- The Streams API allows transforming streams of data from input topics to output topics.
- The Connect API allows implementing connectors that continually pull from some source system or application into Kafka, or push from Kafka into some sink system or application.

Kafka exposes all its functionality over a language-independent protocol with clients available in many programming languages. However, only the Java clients are maintained as part of the main Kafka project; the others are available as independent open-source projects.

To use the producer or the consumer (examples are given in the javadocs), you can use the following Maven dependency:

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.10.1.0</version>
</dependency>

To use Kafka Streams, you can use the following Maven dependency:

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>0.10.1.0</version>
</dependency>

Many users of Connect won't need to use that API directly; they can use pre-built connectors without writing any code. Those who want to implement custom connectors can see the javadoc.
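As a companion to the note above, here is a minimal, hedged sketch of the Streams API's "input topics to output topics" idea. It uses the modern StreamsBuilder API, which differs from the 0.10.1 API current at the time of this talk; the application id, broker address, and topic names are illustrative assumptions.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsApiDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");      // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read from an input topic, transform each value, write to an output topic.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("input-topic");        // assumed topic
        source.mapValues(v -> v.toUpperCase()).to("output-topic");             // assumed topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}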
  3. Kafka design fundamentals. Kafka is neither a queuing platform, where messages are received by a single consumer out of a consumer pool, nor a publish-subscribe platform, where messages are published to all consumers. In the most basic structure, a producer publishes messages to a Kafka topic (synonymous with "messaging queue"); a topic can also be thought of as a message category or feed name to which messages are published. Kafka topics are created on a Kafka broker acting as a Kafka server, and brokers also store the messages if required. Consumers then subscribe to one or more Kafka topics to get the messages. Brokers and consumers use ZooKeeper to get state information and to track message offsets, respectively. A single-node, single-broker architecture with a topic having four partitions illustrates all five components of a Kafka cluster: ZooKeeper, broker, topic, producer, and consumer.

In Kafka topics, every partition is mapped to a logical log file, represented as a set of segment files of equal size. Every partition is an ordered, immutable sequence of messages; each time a message is published to a partition, the broker appends it to the last segment file. Segment files are flushed to disk after a configurable number of messages has been published or after a certain amount of time has elapsed; once a segment file is flushed, its messages become available to consumers.

All message partitions are assigned a unique sequential number called the offset, which identifies each message within the partition. Each partition can optionally be replicated across a configurable number of servers for fault tolerance. For each partition, one server acts as the leader and zero or more servers act as followers. The leader handles all read and write requests for the partition, while the followers asynchronously replicate data from the leader. Kafka dynamically maintains a set of in-sync replicas (ISR) that are caught up to the leader, and always persists the latest ISR set to ZooKeeper. If the leader fails, one of the followers (in-sync replicas) automatically becomes the new leader. In a Kafka cluster, each server plays a dual role: it acts as a leader for some of its partitions and as a follower for others, which balances load within the cluster.

Kafka builds on what was learned from both traditional platforms with the concept of consumer groups. Each consumer is represented as a process, and these processes are organized into groups called consumer groups. A message within a topic is consumed by a single process (consumer) within a consumer group; if a single message must be consumed by multiple consumers, those consumers need to be placed in different consumer groups. Consumers always consume messages from a particular partition sequentially and acknowledge the message offset; acknowledgement implies that the consumer has consumed all prior messages. Consumers issue an asynchronous pull request containing the offset of the message to be consumed to the broker and get back a buffer of bytes.

In line with Kafka's design, brokers are stateless: the state of any consumed message is maintained by the message consumer, and the broker does not keep a record of what is consumed by whom. (If this is poorly implemented, the consumer can end up reading the same message multiple times.) Since the broker doesn't know whether a message has been consumed, deleting messages is tricky; Kafka instead defines a time-based SLA (service level agreement) as its message retention policy. Under this policy, a message is automatically deleted once it has been retained in the broker longer than the defined SLA period. This retention policy lets consumers deliberately rewind to an old offset and re-consume data, even though, as with traditional messaging systems, this is a violation of the queuing contract with consumers.

On the message delivery semantics Kafka provides between producer and consumer, there are multiple possibilities:
- Messages are never redelivered but may be lost.
- Messages may be redelivered but are never lost.
- Messages are delivered once and only once.

When publishing, a message is committed to the log. If a producer experiences a network error while publishing, it can never be sure whether the error happened before or after the message was committed. Once committed, the message will not be lost as long as any broker replicating the partition to which the message was written remains available. For guaranteed publishing, configurations such as acknowledgements and the waiting time for messages to be committed are provided at the producer's end. From the consumer's point of view, replicas have exactly the same log with the same offsets, and the consumer controls its position in this log. For consumers, Kafka guarantees that a message is delivered at least once: the consumer reads messages, processes them, and finally saves its position. If the consumer process crashes after processing messages but before saving its position, another consumer process takes over the topic partition and may receive the first few messages again, even though they were already processed.

Kafka storage: Kafka has a very simple storage layout. Each partition of a topic corresponds to a logical log; physically, a log is implemented as a set of segment files of equal size. Every time a producer publishes a message to a partition, the broker simply appends the message to the last segment file. The segment file is flushed to disk after a configurable number of messages has been published or after a certain amount of time has elapsed, and messages are exposed to consumers only after the flush. Unlike traditional messaging systems, a message stored in Kafka doesn't have an explicit message id; messages are addressed by their logical offset in the log. This avoids the overhead of maintaining auxiliary, seek-intensive random-access index structures that map message ids to actual message locations. Message ids are incremental but not consecutive: the id of the next message is computed by adding the length of the current message to its logical offset. A consumer always consumes messages from a particular partition sequentially; acknowledging a particular message offset implies that the consumer has consumed all prior messages. The consumer issues asynchronous pull requests to the broker, each containing the offset of the message to consume, to keep a buffer of bytes ready for consumption. Kafka exploits the sendfile API to efficiently deliver bytes from a log segment file on a broker to a consumer.

Kafka brokers: unlike in other messaging systems, Kafka brokers are stateless, which means each consumer has to maintain how much it has consumed; the consumer maintains this itself and the broker does nothing. This design is tricky and innovative in itself: it is hard to delete a message from the broker, because the broker doesn't know whether the consumer has consumed it. Kafka solves this with a simple time-based SLA for the retention policy: a message is automatically deleted once it has been retained in the broker longer than a certain period. This design has a big benefit, as a consumer can deliberately rewind back to an old offset and re-consume data. This violates the common contract of a queue, but proves to be an essential feature for many consumers.
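The at-least-once delivery described above (read, process, then save position) can be made concrete with a minimal, hedged Java consumer sketch: auto-commit is disabled and the offset is committed only after processing, so a crash between processing and commit causes redelivery rather than loss. The broker address, topic, and group id are illustrative assumptions.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "at-least-once-demo");       // assumed consumer group
        props.put("enable.auto.commit", "false");          // we save our position ourselves
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic")); // assumed topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    process(r); // a crash here means the uncommitted records are redelivered
                }
                // Commit only after processing: never lost, possibly redelivered.
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> r) {
        System.out.printf("partition=%d offset=%d value=%s%n", r.partition(), r.offset(), r.value());
    }
}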
  4. Role of ZooKeeper. A critical dependency of Apache Kafka is Apache ZooKeeper, a distributed configuration and synchronization service. ZooKeeper serves as the coordination interface between the Kafka brokers and consumers, and the Kafka servers share information via a ZooKeeper cluster. Kafka stores basic metadata in ZooKeeper, such as information about topics, brokers, and consumer offsets (queue readers). Since all the critical information is stored in ZooKeeper, which normally replicates this data across its ensemble, the failure of a Kafka broker or of a ZooKeeper node does not affect the state of the Kafka cluster; Kafka restores the state once ZooKeeper restarts, giving zero downtime for Kafka. Leader election among Kafka brokers in the event of a leader failure is also done using ZooKeeper.

The ZooKeeper overview on the Hadoop wiki (http://wiki.apache.org/hadoop/ZooKeeper/ProjectDescription) puts it this way: "ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers (we call these registers znodes), much like a file system." The main differences between ZooKeeper and standard filesystems are that every znode can have data associated with it, and znodes are limited in the amount of data they can hold. ZooKeeper was designed to store coordination data: status information, configuration, location information, and so on.

ZooKeeper and Kafka: consider a distributed system with multiple servers, each responsible for holding data and performing operations on that data, such as a distributed search engine, a distributed build system, or a system like Apache Hadoop. A common problem in all such systems is determining which servers are alive and operating at any given point in time, and, most importantly, doing so reliably in the face of the difficulties of distributed computing: network failures, bandwidth limitations, variable-latency connections, security concerns, and anything else that can go wrong in a networked environment, perhaps even across multiple data centers. These questions are the focus of Apache ZooKeeper, a fast, highly available, fault-tolerant, distributed coordination service. Using ZooKeeper you can build reliable, distributed data structures for group membership, leader election, coordinated workflow, and configuration services, as well as generalized distributed data structures like locks, queues, barriers, and latches. Many well-known and successful projects already rely on ZooKeeper, including HBase, Hadoop 2.0, Solr Cloud, Neo4j, Apache Blur (incubating), and Accumulo.

ZooKeeper is a distributed, hierarchical file system that facilitates loose coupling between clients and provides an eventually consistent view of its znodes, which are like files and directories in a traditional file system. It provides basic operations such as creating, deleting, and checking the existence of znodes, along with an event-driven model in which clients can watch for changes to specific znodes, for example when a new child is added to an existing znode. ZooKeeper achieves high availability by running multiple ZooKeeper servers, called an ensemble, with each server holding an in-memory copy of the distributed file system to service client read requests. In a typical ensemble, one server acts as the leader while the rest are followers; on startup of the ensemble, the leader is elected first and all followers replicate their state from the leader. All write requests are routed through the leader and the changes are broadcast to all followers; this change broadcast is termed atomic broadcast.

Usage of ZooKeeper in Kafka: Kafka uses ZooKeeper for exactly this kind of coordination and facilitation of a distributed system. ZooKeeper is used to manage and coordinate the Kafka brokers, and each Kafka broker coordinates with the other brokers through ZooKeeper. Producers and consumers are notified by the ZooKeeper service about the presence of a new broker in the Kafka system, or about the failure of a broker; based on that notification, producers and consumers decide to start coordinating their work with some other broker.
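As a hedged illustration of the Kafka metadata stored in ZooKeeper, here is a short Java sketch using the Apache ZooKeeper client to list broker registrations under /brokers/ids, a znode path Kafka has historically used for live brokers. The connection string is an assumption, and this is an inspection sketch, not something Kafka applications normally need to do.

import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class ListKafkaBrokers {
    public static void main(String[] args) throws Exception {
        // Assumed ZooKeeper address; /brokers/ids is where Kafka registers live brokers.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { });
        List<String> brokerIds = zk.getChildren("/brokers/ids", false);
        for (String id : brokerIds) {
            // Each broker's znode holds its registration data (host, port, etc.) as JSON.
            byte[] data = zk.getData("/brokers/ids/" + id, false, null);
            System.out.println("broker " + id + ": " + new String(data));
        }
        zk.close();
    }
}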