Egor Kraev, Head of AI, Mosaic Smart Data
PyData, April 4, 2017
Streaming analytics with asynchronous Python and Kafka
Overview
▪ This talk will show what streaming graphs are, why you
really want to use them, and what the pitfalls are
▪ It then presents a simple, lightweight, yet reasonably robust
way of structuring your Python code as a streaming graph,
with the help of asyncio and Kafka
A simple streaming system
▪ The processing nodes are often stateful: they need to process messages in the correct sequence and update their internal state after each message (an exponential average calculator, sketched below, is a basic example)
▪ The graphs often contain cycles, for example A -> B -> C -> A
▪ The graphs nearly always have some nodes consuming and emitting multiple streams
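
For concreteness, here is a minimal sketch of such a stateful node, written as an async generator (my own illustration, not code from the talk): it consumes a stream of numbers and emits the updated exponential average after each message.

import asyncio

async def exponential_average(source, alpha=0.5):
    # Stateful node: keeps the running average as internal state and
    # emits the updated value after every message it consumes.
    state = None
    async for x in source:
        state = x if state is None else alpha * x + (1 - alpha) * state
        yield state

async def numbers():
    # Stand-in for a real data source (e.g. trade prices).
    for x in [1.0, 2.0, 3.0, 10.0]:
        yield x

async def main():
    async for avg in exponential_average(numbers()):
        print(avg)

asyncio.run(main())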
Why structure your system as a streaming graph?
▪ Makes the code clearer
▪ Makes the code more granular and testable
▪ Allows for organic scaling
▪ Start out with the whole graph in one file; you can gradually split it up until each node is a microservice with multiple workers
▪ As the system grows, nodes can run in different languages/frameworks
▪ Makes it easier to run the same code on historic and live data
▪ Treating your historical run as replay also solves some realtime problems such
as aligning different streams correctly
Two key features of a streaming graph framework
1. Language for graph definition
▪ Ideally, the same person who writes the business logic in the processing nodes should
define the graph structure as well
▪ This means the graph definition language must be simple and natural
2. Once the graph is defined, scheduling is an entirely separate, hard
problem
▪ If we have multiple nodes in a complex graph, with branches, cycles, etc., what order do we call them in?
▪ Different consumers of the same message stream, consuming at different rates - what to
do?
▪ If one node has multiple inputs, what order does it receive and process them in?
▪ What if an upstream node produces more data than a downstream node can process?
Popular kinds of scheduling logic
1. Agents
▪ Each node autonomously decides what messages to send
▪ Each node accepts messages sent to it
▪ Logic for buffering and message scheduling needs to be defined in each node
▪ For example, pykka
2. 'Push' approach
▪ A first attempt at event-driven systems tends to be ‘push’
▪ For example, 'reactive' systems such as RxPY (the Python port of ReactiveX, which originated at Microsoft)
▪ When an external event appears, it’s fed to the entry point node
▪ Each node processes what it receives and, once done, triggers its downstream nodes
▪ Benefit: simpler logic in the nodes; each node only needs a list of its downstream nodes to send messages to
Problems with the Push approach
1. What if the downstream can't cope?
▪ Solution: 'backpressure': downstream nodes are allowed to signal upstream when
they're not coping
▪ That limits the amount of buffering we need to do internally, but can bring its own
problems.
▪ Backpressure needs to be implemented well at framework level, else we end up with a
callback nightmare: each node must have callbacks to both upstream and downstream,
and manage these as well as an internal message buffer (RXPy as example)
▪ Backpressure combined with multiple downstreams can lead to processing accidentally locking up
2. Push does really badly at aligning merging streams
▪ Even if individual streams are ordered, different streams are often out of sync
▪ If the graph branches and then re-converges, how do we make sure the 'right' messages from both branches are processed together?
The Pull approach
▪ Let's turn the problem on its head!
▪ Let's say each node doesn't need to know its downstream, only its
parents.
▪ Execution is controlled by the most downstream node: when it's ready, it requests more messages from its parents
▪ No buffering needed
▪ When streams merge, the merging node is in control and decides which stream to consume from first (see the sketch after this list)
Limitations:
▪ The sources must be able to wait until queried
▪ Has problems with two downstream nodes wanting to consume the
same message stream
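
To make the pull idea concrete, here is a minimal illustrative sketch (not from the talk): a merging node pulls from whichever parent has the earliest pending message, so two time-ordered streams come out correctly aligned.

import asyncio

async def next_or_none(ait):
    # Pull the next item from an async iterator, or return None when exhausted.
    try:
        return await ait.__anext__()
    except StopAsyncIteration:
        return None

async def merge_by_time(left, right):
    # The merging node is in control: it decides which parent to pull from
    # next, here by comparing (timestamp, payload) tuples.
    l, r = await next_or_none(left), await next_or_none(right)
    while l is not None or r is not None:
        if r is None or (l is not None and l[0] <= r[0]):
            yield l
            l = await next_or_none(left)
        else:
            yield r
            r = await next_or_none(right)

async def stream(items):
    for item in items:
        yield item

async def main():
    a = stream([(1, "a1"), (4, "a2")])
    b = stream([(2, "b1"), (3, "b2")])
    async for msg in merge_by_time(a, b):
        print(msg)  # (1, 'a1'), (2, 'b1'), (3, 'b2'), (4, 'a2')

asyncio.run(main())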
The challenge
I set out to find or create an architecture with the following properties:
▪ Allows realtime processing
▪ All user-defined logic is in Python with fairly simple syntax
▪ Both processing nodes and graph structure
▪ Lightweight approach, thin layer on top of core Python
▪ Can run on a laptop
▪ Scheduling happens transparently to the user
▪ No need to buffer data inside the Python process (unless you want to)
▪ Must scale gracefully to larger data volumes
What is out there?
▪ In the JVM world, there's no shortage of great streaming systems
▪ Akka Streams: a mature library
▪ Kafka Streams: allows you to treat Kafka logs as database tables, do joins etc
▪ Flink: Stream processing framework that is good at stateful nodes
▪ On the Python side, a couple of frameworks are almost what I want
▪ Google Dataflow only supports streaming Python when running in Google Cloud,
local runner only supports finite datasets
▪ Spark has awesome Python support, but its basic approach is map-reduce on steroids, which doesn't fit that well with stateful nodes and cyclical graphs
Cooperative multitasking, the event loop, and asyncio
▪ The event loop pattern, also known as cooperative multitasking
▪ An ‘event loop’ keeps track of multiple functions that want to be executed
▪ Each function can signal to it whether it’s ready to execute or waiting for input
▪ The event loop runs the next ready function; when that function has nothing more to process, it surrenders control back to the event loop
▪ A great way of running multiple bits of logic ‘simultaneously’ without
worrying about threading – runs well on a single thread
▪ asyncio is Python’s standard-library event loop implementation (a minimal sketch follows)
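
A minimal sketch of cooperative multitasking with asyncio (illustrative only): two coroutines share one thread, and each hands control back to the event loop whenever it awaits.

import asyncio

async def worker(name, delay):
    for i in range(3):
        print(name, "step", i)
        await asyncio.sleep(delay)  # surrender control to the event loop

async def main():
    # Both workers make progress 'simultaneously' on a single thread.
    await asyncio.gather(worker("fast", 0.1), worker("slow", 0.3))

asyncio.run(main())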
Kafka
▪ A simple yet powerful messaging system
▪ A producer client can create a topic in Kafka and write messages to it
▪ Multiple consumer clients can then read these messages in sequence, each
at their own pace
▪ Topics are partitioned: if multiple consumers are in the same group, each sees a distinct subset of the topic's partitions
▪ It's a proper Big Data application with many other nice properties; the only one that concerns us here is that it's designed to deal with lots of data and lots of clients, fast! (a minimal producer/consumer sketch follows this list)
▪ Can spawn an instance locally in seconds, using Docker, eg using the image
at https://hub.docker.com/r/flozano/kafka/
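
Here is a hedged sketch of talking to Kafka from asyncio with the aiokafka package; it assumes a broker on localhost:9092, and the topic and group names are arbitrary examples.

import asyncio
from aiokafka import AIOKafkaProducer, AIOKafkaConsumer

async def produce():
    producer = AIOKafkaProducer(bootstrap_servers="localhost:9092")
    await producer.start()
    try:
        for i in range(5):
            await producer.send_and_wait("demo-topic", f"message {i}".encode())
    finally:
        await producer.stop()

async def consume():
    consumer = AIOKafkaConsumer(
        "demo-topic",
        bootstrap_servers="localhost:9092",
        group_id="demo-group",         # consumers sharing a group split the partitions
        auto_offset_reset="earliest",  # start from the beginning of the topic
    )
    await consumer.start()
    try:
        async for msg in consumer:     # each consumer reads at its own pace
            print(msg.topic, msg.partition, msg.value)
    finally:
        await consumer.stop()

asyncio.run(produce())
# asyncio.run(consume())  # run in a separate process to read the messages back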
Now let’s put it all together!
▪ Structure your graph as a collection of pull-only subgraphs that consume from and publish to multiple Kafka topics (a sketch of one such subgraph follows this list)
▪ Inside each subgraph, you can merge streams; you can also route each message of a stream to one of several downstream branches
▪ Inside each subgraph, each message goes to at most one downstream!
▪ If two consumers want to consume the same stream, push that stream to
Kafka and let them each read from Kafka at their own pace
▪ If you have a 'hot' source that won't wait: just run a separate process that pushes the output of that source into a Kafka topic, then consume at leisure
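
Putting the pieces together, here is a hedged sketch of one pull-only subgraph (the topic names and the exponential-average node are illustrative): it pulls messages from an input topic, runs them through a stateful node, and publishes the results to an output topic.

import asyncio
from aiokafka import AIOKafkaConsumer, AIOKafkaProducer

async def kafka_source(consumer):
    # Wrap a Kafka consumer as an async iterator of floats.
    async for msg in consumer:
        yield float(msg.value.decode())

async def exponential_average(source, alpha=0.1):
    state = None
    async for x in source:
        state = x if state is None else alpha * x + (1 - alpha) * state
        yield state

async def run_subgraph():
    consumer = AIOKafkaConsumer("prices", bootstrap_servers="localhost:9092")
    producer = AIOKafkaProducer(bootstrap_servers="localhost:9092")
    await consumer.start()
    await producer.start()
    try:
        # The bottom of the subgraph drives everything by pulling from its parent.
        async for avg in exponential_average(kafka_source(consumer)):
            await producer.send_and_wait("prices-ema", str(avg).encode())
    finally:
        await consumer.stop()
        await producer.stop()

asyncio.run(run_subgraph())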
Our example streaming graph sliced up according to the
pattern
▪ The ‘active’ nodes (green in the slide's diagram) – exactly one per subgraph
▪ All buffering happens in Kafka – it was built to handle it!
Scaling
▪ Thanks to asyncio, you can run multiple subgraphs in the same Python process and thread, so in principle the whole graph can live in one file (two if you want one dedicated to user input)
▪ Scale using Kafka partitioning to begin with: for slow subgraphs, spawn
multiple nodes each looking at its own partitions of a topic
▪ If that doesn't help, replace the problematic subgraphs with applications in other languages/frameworks
▪ So stateful Python nodes and Spark subgraphs can coexist happily,
communicating via Kafka
Example application
▪ To give users a nice syntax, we implement a thin façade over the AsyncIterator interface, overloading the | and > operators
▪ So a data source is just an async iterator with some operator
overloading on top:
▪ The | operator applies an operator (such as ‘map’) to a source, returning
a new source
▪ The a > b operator creates a coroutine that, when run, will iterate over a and feed the results to b; a can be an iterable or async iterable
▪ The ‘run’ command asks the event loop to run all its arguments
▪ The Kafka interface classes are a bit of syntactic sugar on top of aiokafka (a hypothetical reconstruction of the façade follows)
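
The speaker's library is not reproduced in these slides, so the following is a hypothetical reconstruction of such a façade; the names Source, map_ and run are my own, and the real implementation will differ in detail.

import asyncio

class Source:
    # A thin façade over an async iterator, overloading | and >.
    def __init__(self, ait):
        self._ait = ait

    def __aiter__(self):
        return self._ait.__aiter__()

    def __or__(self, operator):
        # source | operator  ->  a new Source
        return Source(operator(self))

    def __gt__(self, consumer):
        # source > consumer  ->  a coroutine that, when run, feeds the consumer
        async def pump():
            async for item in self:
                await consumer(item)
        return pump()

def map_(fn):
    # An operator: takes a source, returns a new async iterator of mapped values.
    def operator(source):
        async def mapped():
            async for item in source:
                yield fn(item)
        return mapped()
    return operator

async def printer(item):
    print(item)

def run(*coroutines):
    # Ask the event loop to run all its arguments to completion.
    async def main():
        await asyncio.gather(*coroutines)
    asyncio.run(main())

async def numbers():
    for i in range(5):
        yield i

# | binds tighter than >, so this reads as (numbers | map) > printer.
run(Source(numbers()) | map_(lambda x: x * x) > printer)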
Summary
▪ Pull-driven subgraphs
▪ Asyncio and async iterators to run many subgraphs at once
▪ Kafka to glue it all together (and to the world)
Questions? Comments?
▪ Please feel free to contact me at egor@dagon.ai
