2010
2014
- Error handling is a first-class citizen
schema registry
[Diagram: your app's producer runs a serializer that checks with the schema registry whether the format is acceptable and retrieves the schema ID; schema ID + data are then written to the Kafka topic; incompatible data raises an error.]
producerProps.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
producerProps.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
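The Avro serializer also needs to know where the Schema Registry lives; a minimal addition in the same style (the localhost URL is an assumed example):

producerProps.put("schema.registry.url", "http://localhost:8081");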
[Diagram: a data generator feeds the shipments and sales topics; Spark Streaming aggregates them and writes to the low inventory topic.]
let’s see some code
Define the data contract / schema
in Avro format
generate data
1.9M msg/sec using 1 thread
https://schema-registry-ui.landoop.com
Schemas registered for us :-)
Defining the typed data format
Initiate the streaming from 2 topics
The business logic
[Diagram: the shipments and sales topics flow through Spark Streaming into the low inventory topic, which feeds Elasticsearch and re-ordering.]
Simple is beautiful
landoop.com/blog
github.com/landoop

From Big to Fast Data. How #kafka and #kafka-connect can redefine your ETL and #stream-processing

Editor's Notes

  1. Unbounded, unordered, large-scale data sets are increasingly common in day-to-day business, and IoT, for example, is continuously bringing in more and more data. So "big data" is a buzzword describing the unusually large-scale data systems that we started building to deal with internet-scale data sets. Hadoop is a canonical example of a system built for this purpose, and recently there is a big push towards streaming models and hence faster data.
  2. In this presentation we are going to see the evolution of data technologies: from the early days and the introduction of data warehouses, to the rise of Hadoop. We are going to see how MapReduce changed the way we think about data, and how we reached today's real-time / streaming data technologies and tools.
  3. For a long time we have been using databases to serve online transactions. But when it comes to aggregating, joining, filtering and transforming this data, data warehouses were the technology used for integration and data archival. So in a traditional ETL pipeline we would run overnight batches to produce reports. But isn't the batching mechanism limiting our responsiveness?
  4. A lot of things started changing in the early 2000s: a series of Google papers described things they were doing internally, and people took those ideas and wrote open-source alternatives. One of the most famous is the MapReduce paper (2004), which described a general-purpose distributed compute model that took away the need for end users to have the know-how to break their computations into smaller pieces, distribute them over a cluster, handle failures and reruns, and bring all the data back together for the final result. The framework handles all that for you. This was a very important step forward, and it ultimately became Hadoop.
  5. This resulted in the birth of Hadoop, and everything (!) in this talk is related to MapReduce. Hadoop celebrated its 10th anniversary at the Hadoop Summit last April in Dublin.
  6. Hadoop is a framework that encompasses the MapReduce paradigm and introduces an immutable distributed file system, with many other tools added to the ecosystem over time. By default it replicates data across multiple nodes of the cluster (usually 3) and executes distributed sets of map and reduce tasks by exploiting data locality. Its core philosophy is to send the computation to the data instead of pulling the data to the computation.
  7. So Hadoop proved to be both resilient and very flexible. There are a number of use cases, ranging from social media analytics to data warehousing to machine learning, and it is currently used in production across many industries. However, it's not perfect: it can be difficult to use, it is sometimes slow, and it does not support streaming.
  8. Apache Spark became very popular in early 2014, when it graduated into an Apache top-level project as one of the fastest-growing OSS projects. Unlike MapReduce, which writes to disk at every map and reduce phase, Spark is a lot more efficient, as it keeps intermediate data in memory by introducing new distributed collections.
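A hedged sketch of the idea in Scala (app name and input path are illustrative): once a distributed collection is cached, later actions reuse the in-memory data instead of re-reading from disk.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))
val events = sc.textFile("events.txt").cache()        // placeholder path; cache() keeps partitions in memory
println(events.count())                               // first action reads the file and caches it
println(events.filter(_.contains("ERROR")).count())   // second action reuses the cached partitions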
  9. On top of its API, Spark added the capability to run SQL queries, so it became a natural choice as the compute engine; it still plugs into Hadoop and other systems, and you can now write Spark jobs instead, with better performance. In fact, Spark claims to be up to 100 times faster, but in practice Spark is usually 4-7 times faster than MapReduce, and it's getting better. So while with MapReduce it's common for a computation to take 45 minutes, today we can often optimize it down to 5-15 minutes on Spark.
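A minimal sketch of that SQL capability (Spark 2.x API; the file name is illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("sql-demo").master("local[*]").getOrCreate()
spark.read.json("sales.json").createOrReplaceTempView("sales")   // register the data as a SQL view
spark.sql("SELECT itemId, SUM(count) AS total FROM sales GROUP BY itemId").show()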
  10. How about low latency? What happens when I want to react in real time to events, say, update my search engine based on inventory changes? How can I immediately detect anomalies so I can respond fast, before I get angry customers?
  11. Application requirements have changed dramatically in recent years. Only a few years ago a large application had tens of servers, seconds of response time, hours of offline maintenance and gigabytes of data. Today applications are deployed on everything from mobile devices to cloud-based clusters running thousands of multi-core processors. Users expect millisecond response times and 100% uptime, and data is measured in petabytes. Today's demands are simply not met by yesterday's software architectures. Reactive systems rely on asynchronous message-passing to establish a boundary between components that ensures loose coupling, isolation and location transparency; they scale up and down, and they are resilient because they respond to failures, which makes error handling a first-class citizen.
  12. So we want to apply those universal principles in a fast, cheap and scalable way.
  13. And, of course, not to lose data.
  14. So with data streaming from everywhere…
  15. How do I build scalable, fault-tolerant, distributed data processing systems that can handle massive amounts of data? From diverse sources? With different structures? How about back-pressure? How long do I buffer data in order not to run out of memory?
  16. Kafka is a distributed pub-sub messaging system, started at LinkedIn, that: writes with high throughput and low latency; supports multiple subscribers; replicates data for resilience; uses partitions for sharding.
  17. It's designed to feed data into multiple systems, including batch systems, and it is persistent by default. If the queue is persistent, you don't have to worry about back-pressure; that is solved by definition. This is what Kafka is: a persistent queue that can buffer far more data than can live in an application's memory. So Kafka is becoming the de-facto standard for storing stream events. The key abstraction in Kafka is the topic: producers publish their records to a topic, and consumers subscribe to one or more topics.
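A minimal consumer sketch in Scala (broker address, group id and topic name are assumed from the demo):

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "inventory-app")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("sales"))   // one of the demo topics
while (true) {
  val records = consumer.poll(100)                       // wait up to 100 ms for new records
  records.asScala.foreach(r => println(s"${r.key} -> ${r.value}"))
}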
  18. The key idea of Kafka is the log. The log is: 1. an abstract data structure with particular properties; 2. a structured, ordered layer of messages; 3. immutable, so once written it does not change; 4. not only written in order but also read in order, and sequential access ensures high performance. So the log provides the ordering semantics required for stream processing.
  19. But if you want to scale this log, you shard it into multiple partitions. This is essentially the backend of Kafka: the log is the topic, which physically lives in partitions that are replicated across a whole bunch of brokers. So a Kafka topic is just a sharded write-ahead log; producers append records to these logs and consumers subscribe to changes. ETSY.com, for example, uses a single topic sharded over 200 partitions, distributed over 20+ servers.
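A small sketch of how the sharding surfaces in the producer API: records with the same key are hashed to the same partition, preserving per-key ordering (topic, key and payload are illustrative):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// every record keyed "store-42" lands in the same partition of the sharded log
producer.send(new ProducerRecord("sales", "store-42", """{"itemId":"A1","count":3}"""))
producer.close()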
  20. Microsoft, Netflix and LinkedIn have surpassed the rate of 1 trillion messages per day (!), so it's pretty damned fast. Let's dive a little deeper into Kafka.
  21. Recently Kafka Connect was introduced: a large-scale streaming data import/export tool for Kafka. It's both a framework for building connectors and a tool for copying streaming data to and from Kafka. We normally have source connectors (which can be, for example, your JMS system) and sink connectors (for example, updating indexes in Elasticsearch). What is interesting is that multiple open-source connectors already exist, some of them certified, ensuring: exactly-once semantics; retries and redeliveries; error policies.
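As an illustration, a sink connector is typically just a small config file. The sketch below uses Confluent's Elasticsearch sink connector class; the connector name, topic and URL are assumed examples:

name=low-inventory-es-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=1
topics=low-inventory
connection.url=http://localhost:9200
type.name=kafka-connect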
  22. Designing and maintaining the ETL process is often considered one of the most difficult and resource-intensive parts of a data-warehouse project: each step can introduce errors and risks; tools can cost millions; tight coupling increases complexity; fail-over can introduce data duplication. Kafka enables us to break the E apart from the T and the L, and decouple them. We can now source data from an external system and sink it into another, so the only thing left to define is how to run our data transformations. And by data transformations we mean filtering, aggregations, data enrichment, etc.
  23. Unlike the request-response model, which is synchronous, tightly coupled and latency-sensitive (you send one input and get one output, and the only way to scale such a service is to deploy multiple instances of it), and unlike batch, where you send ALL your data and wait a long time to get ALL the outputs, stream processing is a model where we take some inputs and get some outputs back, and the definition of "some" is left to the program. It is a generalization of request-response and batch. The most prominent streaming frameworks right now are Spark Streaming, Flink and Kafka Streams.
  24. Our available options range from a DIY stream-processing approach (using the Kafka libraries), which seems simple to begin with but requires us to manually take care of many aspects: fault tolerance and fast failover; state, when doing distributed joins and aggregations; reprocessing; time windowing. There are also established stream-processing frameworks, such as Spark Streaming. Spark started with a batch model, like MapReduce, but because it's much more efficient than MapReduce, they realized they could support a streaming model by using a definable window of time down to a few seconds, pretty much a mini-batch. Flink and Kafka Streams use a different model: event-at-a-time processing (not micro-batch) with millisecond latency. The key difference between Spark Streaming and Flink / Kafka Streams is the expected latency, as Spark Streaming runs mini-batches and produces results in a matter of 2-10 seconds.
  25. What happens if a developer introduces a BUG and sends some bad data into a topic? We also mentioned that topics are immutable. So are the multiple consumer applications going to be affected?
  26. We can avoid a lot of suffering by using a Schema Registry. It provides a serving layer for your metadata: a RESTful interface for storing and retrieving Avro schemas, and serializers that plug into Kafka clients and handle schema retrieval for Kafka messages sent in Avro format. In that case our application would configure that serializer on every producer object. The serializer ensures that the schema of messages is both registered and valid. Using the same schema registry in dev and production allows Uber to catch schema mismatches in unit tests, before rolling to production.
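A hedged sketch of a producer using that serializer in Scala (field and topic names follow the demo; broker and registry addresses are assumed):

import java.util.Properties
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val schema = new Schema.Parser().parse(
  """{"type":"record","name":"Sale","fields":[
    |  {"name":"itemId","type":"string"},
    |  {"name":"storeId","type":"string"},
    |  {"name":"count","type":"int"}]}""".stripMargin)

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
props.put("schema.registry.url", "http://localhost:8081")

val sale = new GenericData.Record(schema)
sale.put("itemId", "A1"); sale.put("storeId", "S42"); sale.put("count", 3)

// the serializer registers the schema (if new), validates the record and embeds the schema id
val producer = new KafkaProducer[AnyRef, AnyRef](props)
producer.send(new ProducerRecord[AnyRef, AnyRef]("sales", sale))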
  27. So let's consider a simple model of a retail store. The core streams in retail are sales of products, orders placed for new products, and shipments of products arriving in stores. The inventory on hand is a table computed off the sale and shipment streams, which add to and subtract from our stock of products on hand. Then we reorder products when stock starts to run low and adjust prices based on demand. How do we model real-world things as a combination of streams? This streaming example was presented at the Hadoop Strata conference in London last month, and we were motivated to implement it.
  28. So let's assume you have a shipments topic and a sales topic. For the sake of the example, the message format is ItemID, StoreId and the product count. So messages stream in in real time...
  29. So we implemented this example (the code, by the way, will be available afterwards) by following these steps: generate some synthetic data to capture the problem and continuously feed Kafka; run a long-running Spark Streaming application which aggregates the topics and generates messages for low inventory; applications can then subscribe to the low-inventory topic and react in real time.
  30. So, as we said, it's important to define a schema that validates the messages' format and types. Avro provides 1) rich data structures and 2) a compact, fast, binary data format, and Avro schemas are defined in JSON. It's considered one of the best practices in both streaming and batch processing applications. What is really interesting about Avro is that it supports schema evolution and ensures forward and backward compatibility.
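A sketch of such a schema (field names follow the demo's ItemID/StoreId/count; the last field is an invented example showing how a default value keeps old readers compatible when the schema evolves):

{
  "type": "record",
  "name": "Sale",
  "namespace": "demo",
  "fields": [
    {"name": "itemId",  "type": "string"},
    {"name": "storeId", "type": "string"},
    {"name": "count",   "type": "int"},
    {"name": "channel", "type": "string", "default": "in-store"}
  ]
}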
  31. So we generate both sales and shipment records at about 2M messages/sec, and this is only by utilising one thread.
  32. The serializer automatically registers the new schema with the Schema Registry, and this is how it would look.
  33. To start building a Spark Streaming application you need to add the dependencies, of course, and create the Spark streaming context (a bit of boilerplate required), mainly to set up the window, and pretty much run a Spark job every 4 seconds.
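The boilerplate in question, sketched in Scala (the app name is illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("inventory-stream").setMaster("local[*]")
val ssc  = new StreamingContext(conf, Seconds(4))   // a micro-batch fires every 4 seconds
// ... define the Kafka input streams and the business logic here ...
ssc.start()
ssc.awaitTermination()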
  34. In our Spark Streaming app we define the typed data format. What is interesting here is that even if the producer of this message decides to evolve/change the data schema, this code will still work.
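A sketch of what that typed format might look like (field names assumed from the slides). Reading Avro fields by name is what keeps it working across schema changes:

import org.apache.avro.generic.GenericRecord

case class StockEvent(itemId: String, storeId: String, count: Int)

// if the producer appends new (defaulted) fields, this lookup-by-name still works
def fromRecord(r: GenericRecord): StockEvent =
  StockEvent(r.get("itemId").toString, r.get("storeId").toString,
             r.get("count").asInstanceOf[Int])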
  35. We set up the code to subscribe one consumer to each topic, continuously poll data from Kafka for these topics, and deserialise it into objects.
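A hedged sketch using the Kafka 0.10 direct-stream integration (`ssc` is the streaming context from above; String deserialisers are used here for brevity, whereas the demo uses the Avro ones):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "inventory-app")

// one direct stream per topic; each record is deserialised as it is polled
val sales     = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("sales"), kafkaParams))
val shipments = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("shipments"), kafkaParams))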
  36. We pull shipment records and calculate the "inventory on hand" by increasing the availability of a product from the shipment messages and decreasing it from the sales. If the availability of a product drops below a certain threshold, we generate a "low-availability" event/message to a new topic, for other applications to react to.
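The heart of that logic might look like this. A sketch only: it assumes the two streams above were mapped to the StockEvent case class, that a checkpoint directory was set (updateStateByKey requires one), and the threshold of 10 is an invented example:

val threshold = 10  // assumed low-inventory threshold

// shipments add stock, sales subtract it
val deltas = shipments.map(e => ((e.storeId, e.itemId), e.count))
  .union(sales.map(e => ((e.storeId, e.itemId), -e.count)))

// running inventory per (store, item); requires ssc.checkpoint(...)
val inventory = deltas.updateStateByKey[Int] { (batch: Seq[Int], state: Option[Int]) =>
  Some(state.getOrElse(0) + batch.sum)
}

inventory.filter { case (_, onHand) => onHand < threshold }
  .foreachRDD(_.foreach { case ((store, item), onHand) =>
    // in the demo this event would be produced to the low-inventory topic
    println(s"LOW INVENTORY: item $item at store $store ($onHand left)")
  })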
  37. The idea is for downstream applications (or connectors) to update our catalogue in real time as presented in web and mobile apps, and to re-order products with low inventory at a particular store.
  38. This is the entire spark streaming application you require for this example. Simple?
  39. So we generated 1B messages in less than 30 min.
  40. Obviously, when it comes to production, monitoring is crucial. There are numerous tools out there for this job, and Kafka itself provides a rich set of metrics that applications such as Prometheus and Grafana can expose and visualise.
  41. And if you are the poor devops guy that needs to deploy, configure, scale up or down, view logs, etc., how can you deliver such an infrastructure? Actually this is a hard task. Fortunately some amazing developers provide an integration that sorts this problem out in a matter of minutes. Confluent-on-Cloudera => excellent integration with Hadoop: http://www.landoop.com/blog/2016/07/confluent-on-cloudera/
  42. So these tools we've seen seem to be very commonly used together in the streaming architecture: Spark, and in particular Spark Streaming, and Kafka for ingesting the data in a durable, resilient, scalable way. You normally require some sort of scalable distributed storage, most commonly Cassandra, but in principle it can be almost any other data store. And for the missing parts, "how do I glue things together?" and "what infrastructure do I run them on?", we've seen people adopting the somewhat humorous acronym SMACK stack, which stands for … Mesos is emerging as the next generation of cluster management; still early days, but we really like it. Akka meets the need for microservices and glues things together, especially with Akka Streams. We don't necessarily agree with all of it, but all these technologies fit nicely together, and we need to be wise when choosing our technology stack...
  43. So this would be a high-level architecture for your distributed systems. We can use our Hadoop cluster as a DW and for running our analytics and machine learning, and run our Kafka-based streaming platform on it as well. We will most probably use one or more NoSQL clusters, and for our custom, stateless applications we can utilize a Mesos cluster and run them in Docker containers. Depending on your needs you adjust your architecture, but this is a very common pattern we've seen across many organizations.
  43. So this would be a high level architecture on your distributed systems. We can use our Hadoop cluster, as a DW and for running our Analytics and Machine Learning, And run on it our Kafka based streaming platform as well We will most probably use one or more NoSQL clusters, and for our custom and stateless applications we can utilize a Mesos Cluster and run them within docker containers. Depending on your needs, you adjust your architecture – but this a very common pattern we’ve seen across many organizations.