1. Stream All the Things
Streaming Architectures for
Data Streams that Never End
Dean Wampler, Ph.D. (@deanwampler)
2. Free, as in 🍺
bit.ly/fastdata-ORbook
My book, published last fall, describes the points in this talk in greater depth. I’ve refined the talk a bit since it was published.
11. [Figure: the example fast data architecture. Sources: logs, sockets, and REST (1, 2); microservices you write (RP, Go, Node.js, …) (3); a ZooKeeper cluster (4); a Kafka cluster at the center (5); stream processing engines (6, 8, 10): low latency (Flink, Kafka Streams, Akka Streams, Beam, …), mini-batch (Spark Streaming), and batch (Spark, …); data exchanged with persistence (7, 9): S3, HDFS, local disks, SQL/NoSQL stores, and search; all running on Mesos, YARN, or a cloud (11).]
Numbering is in the report, so it’s easier for the text to refer back to the figure.
12. [Architecture diagram repeated; see slide 11.]
1 & 2. Data sources include streams of data over sockets and from logs. You might also ingest through REST channels, but unless they are async, the overhead will be too high, so REST might be used only for communicating with the microservices (3) that you write to complete your environment.
13. [Architecture diagram repeated; see slide 11.]
3. A real fast data environment is similar to more general service environments: you’ll have the “heavy hitters” like Spark, Kafka, etc., but you’ll also need to write supporting microservices to complete the environment.
14. [Architecture diagram repeated; see slide 11.]
4. Some services, like Kafka, require ZooKeeper for managing shared state, such as leader election.
15. [Architecture diagram repeated; see slide 11.]
5. Kafka is the core of this architecture, the place where data is ingested into queues, organized into topics. Publishers and consumers are decoupled and N-to-M. Kafka has massive scalability and excellent resiliency and data durability. All services can communicate with each other through Kafka, too, rather than having to manage arbitrary point-to-point connections.
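The decoupling described above can be sketched with a toy, in-memory stand-in for a Kafka topic (a hypothetical `Topic` class, not the Kafka API): publishers append to a shared log, and each consumer tracks its own offset, so N publishers and M consumers never need point-to-point connections.

```python
class Topic:
    """Toy stand-in for a Kafka topic: an append-only log that
    decouples N publishers from M independent consumers."""
    def __init__(self):
        self.log = []        # durable, ordered record log
        self.offsets = {}    # consumer name -> next offset to read

    def publish(self, record):
        self.log.append(record)

    def poll(self, consumer):
        """Each consumer tracks its own offset, so consumers read
        the same log independently and at their own pace."""
        start = self.offsets.get(consumer, 0)
        records = self.log[start:]
        self.offsets[consumer] = len(self.log)
        return records

clicks = Topic()
clicks.publish({"user": "a", "page": "/home"})
clicks.publish({"user": "b", "page": "/cart"})

# Two decoupled consumers each see the full stream:
etl_batch = clicks.poll("etl-service")
dash_batch = clicks.poll("dashboard")
```

The real Kafka adds partitioning, replication, and durable storage on top of this basic log-plus-offsets model.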
16. [Architecture diagram repeated; see slide 11.]
6, 8, and 10. Many pure streaming, mini-batch streaming, and batch engines are vying for your attention. We’ll focus on this area after the overview.
17. [Architecture diagram repeated; see slide 11.]
7, 9, and 10. Data can also be read and written between these compute engines and storage, not just Kafka.
18. [Architecture diagram repeated; see slide 11.]
11. You can run this architecture on Mesos, YARN (Hadoop), on premise or in the cloud. (At Lightbend, we think Mesos is the best choice.)
21. •Low latency? How low?
•…
If you have tight latency requirements, you can’t use a mini-batch or batch engine. If your constraints are more flexible, you can do more sophisticated things, like training ML models, writing to databases, etc.
22. •Low latency? How low?
•High volume? How high?
•…
Some tools handle large volumes better than others. If you have low volumes, you’ll want tools that are efficient at low volume, whereas some high-volume tools are only efficient per record when their overhead is amortized over a large stream.
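The amortization point can be illustrated with back-of-the-envelope arithmetic (the numbers below are made up for illustration): an engine with a large fixed per-batch overhead is only cheap per record when batches are large.

```python
def per_record_cost(fixed_overhead_ms, per_record_ms, batch_size):
    """Effective cost per record when a fixed scheduling overhead
    is amortized over a batch of records."""
    return fixed_overhead_ms / batch_size + per_record_ms

# Hypothetical high-volume engine: 100 ms scheduling overhead per batch.
# Over a million records the overhead all but vanishes...
big = per_record_cost(100.0, 0.001, 1_000_000)   # ~0.0011 ms/record
# ...but over 10 records it dominates:
small = per_record_cost(100.0, 0.001, 10)        # ~10.001 ms/record
```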
23. •Low latency? How low?
•High volume? How high?
•Integration with other tools?
•Which ones?
•…
Connecting everything through Kafka can eliminate this requirement, but often that’s not ideal and you may need some tools to have direct connections to other tools.
24. •…
•Which kinds of data processing, analytics?
•Process individual events?
•Process records in bulk?
Is your data processing essentially data-warehouse-style analytics or similar SQL queries? Are you doing ETL of data? Are you doing aggregations? Who is consuming this data?
Are you doing complex event processing (CEP), where you need to make decisions individually, per event? Or is it a stream processed “en masse”, like typical SQL and ETL processing?
We’ll comment on how these characteristics are realized in the specific tools we discuss next, but if you’re using other tools, consider how they fit these dimensions.
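A minimal sketch of the two styles, using hypothetical rules: a CEP-style check decides on every event individually as it arrives, while SQL/ETL-style processing aggregates records en masse.

```python
# Per-event (CEP-style): decide on each record as it arrives.
def detect_fraud(event, threshold=1000):
    """Hypothetical rule: flag any single transaction over a threshold."""
    return event["amount"] > threshold

# En-masse (SQL/ETL-style): aggregate over a whole batch or window,
# like GROUP BY user, SUM(amount) in SQL.
def totals_by_user(events):
    totals = {}
    for e in events:
        totals[e["user"]] = totals.get(e["user"], 0) + e["amount"]
    return totals

events = [
    {"user": "a", "amount": 250},
    {"user": "b", "amount": 1500},
    {"user": "a", "amount": 50},
]
flagged = [e for e in events if detect_fraud(e)]
totals = totals_by_user(events)
```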
27. [Architecture diagram repeated; see slide 11.]
• Apache Beam
• Based on Google Dataflow
• Requires a “runner”
• (Dataflow is the runner in Google Cloud…)
• Most sophisticated streaming semantics.
Google has spent years thinking about everything a streaming engine must handle to do accurate calculations on streams, accounting for all the contingencies that can arise. The latest incarnation is Google Dataflow, available as a service in Google Cloud Platform. Google recently open-sourced the part of Dataflow that defines “data flows” with these sophisticated semantics as Apache Beam. Beam doesn’t provide a runner itself; third-party tools provide that capability.
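The heart of those semantics is event-time processing: windows are assigned from the timestamps events carry, and a watermark estimates how far event time has progressed, so late-arriving data can still be accounted for. A toy Python sketch of the idea (not the Beam API):

```python
from collections import defaultdict

class EventTimeWindows:
    """Toy sketch of Beam-style event-time windowing: records carry
    their own event timestamps, and a watermark tracks how far event
    time has (probably) progressed, so late records can be detected
    yet still counted in the window they belong to."""
    def __init__(self, width):
        self.width = width
        self.windows = defaultdict(list)   # window start -> values
        self.watermark = 0

    def add(self, event_time, value):
        is_late = event_time < self.watermark
        start = (event_time // self.width) * self.width
        self.windows[start].append(value)  # late data still lands in its window
        self.watermark = max(self.watermark, event_time)
        return is_late

w = EventTimeWindows(width=10)
w.add(3, "a")          # window [0, 10)
w.add(14, "b")         # window [10, 20); watermark advances to 14
late = w.add(7, "c")   # arrives late, but still lands in [0, 10)
```

Real Beam adds triggers and accumulation modes on top of this, to control when and how often a window emits results as late data trickles in.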
30. [Architecture diagram repeated; see slide 11.]
• Akka Streams
• Complex event processing
• Efficient per-event
• Not for large-scale “pipes”
• Low latency
• “Might” become an Apache Beam runner
There is a lesser-known engine called Gearpump (an Intel project) that offers capabilities similar to Flink’s, but it is implemented on Akka, so it could provide a Beam-runner capability on top of Akka Streams. Hence, Akka Streams might be extended to run Beam flows for situations where per-event processing (i.e., CEP) is the preferred model, as opposed to “fat” data pipes. Akka also provides a powerful Actor model for rich concurrency, plus libraries for clustering, state management, and interoperation with many sources and sinks (the “Alpakka” project).
31. [Architecture diagram repeated; see slide 11.]
• Kafka Streams
• Low-overhead processing of Kafka topics.
• Ideal for:
• ETL of raw data
• Last value for a key, aggregations (“KTable” abstraction)
Kafka Streams is a great 80% solution that works closely with Kafka (in terms of production deployment and overhead). It is designed to read Kafka topics; perform processing like transformations, filtering, aggregations, and “last-seen” value per key; then write results back to Kafka. It nicely addresses many common design challenges, but it isn’t designed to be a complete stream-processing solution, e.g., for running SQL queries or training ML models.
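The “last value for a key” idea behind the KTable abstraction can be sketched in a few lines (plain Python, not the Kafka Streams API): a topic’s changelog of key/value updates folds down to a table in which the latest value per key wins.

```python
# A stream of (key, value) updates, as they might arrive on a topic
# (hypothetical sensor readings, for illustration):
changelog = [
    ("sensor-1", 20.1),
    ("sensor-2", 18.4),
    ("sensor-1", 21.7),   # a newer reading replaces the old one
]

# KTable-like view: replay the changelog; the latest value wins per key.
table = {}
for key, value in changelog:
    table[key] = value
```

This is also the model behind Kafka’s log compaction: only the most recent record per key needs to be retained to rebuild the table.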
32. [Architecture diagram repeated; see slide 11.]
• Spark
• Streaming: mini-batch model
• Latency > ~500 msecs
• Ideal for:
• Rich SQL
• ML model training
• Beam support? Coming…
Spark Streaming uses a mini-batch model, although this is slowly being replaced with a true low-latency streaming model. So, today, use it for more expensive calculations, like training ML models, but don’t use it when the lowest-latency processing is required. Lightbend’s Fast Data Platform is building tools to make it easy to train ML models in Spark and serve them with the other tools. Another advantage of Spark is its rich ecosystem of tools for a wide variety of batch and streaming scenarios.
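A toy sketch of the mini-batch model (plain Python, not the Spark API): events are grouped into fixed intervals and each interval is processed as a unit, which is why end-to-end latency can never drop below the batch interval.

```python
def mini_batches(timed_events, interval):
    """Group (timestamp, value) events into fixed mini-batch intervals,
    as Spark Streaming's mini-batch model does. Each batch is then
    processed as a unit, so latency is bounded below by the interval."""
    batches = {}
    for t, value in timed_events:
        batch_start = (t // interval) * interval
        batches.setdefault(batch_start, []).append(value)
    return batches

# Events arriving over ~1.2 seconds, batched into 500 ms intervals:
events = [(0.1, 1), (0.4, 2), (0.6, 3), (1.2, 4)]
batches = mini_batches(events, interval=0.5)
```

An event arriving just after a batch closes waits a full interval before it is even scheduled, which is the cost the true streaming model avoids.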
35. • … like this?
• Image from:
You can get this great report by Jonas Bonér here: lightbend.com/reactive-microservices-architecture
36. • A Data app | microservice:
• Has one responsibility
• …
[Architecture diagram repeated; see slide 11.]
Each data app, streaming or batch, is typically doing one thing, like ETLing this Kafka topic to Cassandra, or computing aggregates for a dashboard. Similarly, as services evolve into microservices, each is supposed to do one thing and communicate with other services for the “help” it needs.
37. • A Data app | microservice:
• Has one responsibility
• Ingests and processes a never-ending stream of data | messages
• …
[Architecture diagram repeated; see slide 11.]
A streaming data app will have a never-ending stream of data to process. Similarly, requests for service will never stop coming to microservices.
38. • A Data app | microservice:
• Has one responsibility
• Ingests and processes a never-ending stream of data | messages
• Must be available, responsive, resilient, scalable, … reactive
[Architecture diagram repeated; see slide 11.]
Hence, both kinds of systems must provide high availability to respond to the stream of data or messages coming in. That means both must be resilient against failure and must scale on demand to meet the load. These are the reactive principles.
39. • Thesis:
• Microservice and Data architectures are converging.
• Similar design problems.
• Data tends to dominate as the environment grows.
[Architecture diagram repeated; see slide 11.]
So, both kinds of systems have similar design problems, hence both should be implemented in similar ways. Also, organizations that aren’t particularly data focused, especially when they are new endeavors, often find that data grows to be a dominant concern, a sign of success! For example, Twitter started as a classic three-tier application,
but now most of their infrastructure looks like a petroleum refinery.
41. Free, as in 🍺
bit.ly/fastdata-ORbook
Once again, the link to my book, published last fall, which describes the points in this talk in greater depth. I’ve refined the talk a bit since it was published.
42. Interested in
Lightbend Fast Data Platform?
lightbend.com/fast-data-platform
This URL takes you to information about the product we’re building, Fast Data Platform, that helps teams succeed with these technologies, from initial installation and configuration, to writing applications, to runtime production monitoring, to advanced machine-learning based automation.