Flink currently features different APIs for bounded/batch (DataSet) and streaming (DataStream) programs. While the DataStream API can handle batch use cases, it does so much less efficiently than the DataSet API. The Table API was built as a unified API on top of both, covering batch and streaming with the same API and delegating under the hood to either DataSet or DataStream.
In this talk, we present the latest on the Flink community's efforts to rework the APIs and the stack for a better unified batch and streaming experience. We will discuss:
- The future roles and interplay of DataSet, DataStream, and Table API
- The new Flink stack and the abstractions on which these APIs will build
- The new unified batch/streaming sources
- How batch and streaming optimizations differ in the runtime, and what the future interplay of batch and streaming execution could look like.
The document discusses implementing reliable, isolated, and unified job submission for a distributed stream processing platform. It proposes:
1) Defining job submission and execution as atomic by requiring the job graph to be persisted before a job is considered submitted, and the job status to be set to DONE before a job is considered completed.
2) Compiling jobs in isolation on the cluster side by packaging user programs and dependencies and executing them in isolated containers to avoid bottlenecks and security risks at the client.
3) Exposing a three-layer unified client interface for deployment, cluster, and job management to provide a programmatic submission approach.
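The atomicity rule in point 1 can be illustrated with a minimal sketch. It is not taken from the document itself: `InMemoryJobStore`, `submit`, `is_submitted`, and `is_completed` are hypothetical names, and the in-memory dictionaries stand in for whatever durable storage the platform actually uses.

```python
from enum import Enum


class JobStatus(Enum):
    SUBMITTED = "SUBMITTED"
    RUNNING = "RUNNING"
    DONE = "DONE"


class InMemoryJobStore:
    """Hypothetical stand-in for a durable store of job graphs and statuses."""

    def __init__(self):
        self.graphs = {}
        self.statuses = {}

    def persist_graph(self, job_id, job_graph):
        self.graphs[job_id] = job_graph

    def set_status(self, job_id, status):
        # A status is only meaningful for a job whose graph was persisted.
        if job_id not in self.graphs:
            raise KeyError(f"no persisted job graph for {job_id}")
        self.statuses[job_id] = status


def submit(store, job_id, job_graph):
    # Persist first: submission is acknowledged only once the graph is stored.
    store.persist_graph(job_id, job_graph)
    store.set_status(job_id, JobStatus.SUBMITTED)
    return job_id


def is_submitted(store, job_id):
    return job_id in store.graphs and job_id in store.statuses


def is_completed(store, job_id):
    # A job counts as completed only once its status is DONE.
    return store.statuses.get(job_id) is JobStatus.DONE
```

With this shape, a client that crashes between persisting the graph and receiving the acknowledgement can re-check `is_submitted` instead of blindly resubmitting.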
Virtual Flink Forward 2020: Integrate Flink with Kubernetes natively - Yang Wang - Flink Forward
Flink currently supports the resource management systems YARN and Mesos. However, they were not designed for fast-moving cloud-native architectures, and they do not support mixed workloads (e.g. batch, streaming, deep learning, web services) well. At the same time, Kubernetes is evolving very fast to fill those gaps and is becoming the de facto orchestration framework, so running Flink on Kubernetes is a basic requirement for many users. In this talk, we will first quickly go through the Kubernetes architecture and the efforts we have made to run Flink on Kubernetes. Then we will dive deep into the technical details of how to make Flink run natively on Kubernetes. Native means that the Flink KubernetesResourceManager directly calls the Kubernetes APIs to allocate and release TaskManager pods. Next, we will share some practices for application lifecycle management and production optimizations (e.g. high availability, storage, network). Finally, we will conclude the talk with the advantages of Flink on Kubernetes and a simple demo. This talk is aimed at users and companies who are looking to run Flink on a Kubernetes cluster. We assume that the listener has some basic knowledge of cluster orchestration and containers.
At Yelp we run hundreds of Flink jobs to power a wide range of applications: push notifications, data replication, ETL, sessionizing and more. Routine operations like deploys, restarts, and savepoints for so many jobs would take quite a bit of developers’ time without the right degree of automation. The latest addition to our toolshed is a Kubernetes operator managing the deployment and the lifetime of Flink clusters on PaaSTA, Yelp’s Platform As A Service.
We replaced our deployment framework, which launched Flink clusters on top of AWS EMR, with a Kubernetes operator managing fully Dockerized Flink clusters. Compared to EMR, this architecture allowed us both to drastically reduce the deployment time of our Flink clusters and to share our hardware resources more efficiently. In addition, we now offer our developers the same interface they are used to for running REST services, batch jobs and many other workloads on PaaSTA.
This talk will give a brief overview of Yelp’s PaaSTA before diving into the details of how the Kubernetes operator has been implemented and how it has been integrated with Yelp developers’ workflow (deploy, logs, savepoints, upgrades, etc), to end with a glimpse of the future features we are planning for the operator (Flink as a library, autoscaling, etc.).
We present a web service named FLOW to let users do FLink On Web. Similar in spirit to Hortonworks Stream Analytics Manager, StreamAnalytix, and Nussknacker, FLOW aims to minimize the effort of handwriting streaming applications by letting users drag and drop graphical icons representing streaming operators on a GUI.
FLOW builds on the Flink Table API and lets users assemble graphical icons associated not only with basic SQL operations but also with advanced SQL operations like window aggregation, temporal join, and pattern recognition (the MATCH_RECOGNIZE clause). Its data preview function lets users observe on screen how sample data changes before and after applying each operation. In addition, FLOW shows the sample data as time-series charts and geographical maps by interacting with Elasticsearch and Kibana. Domain experts with basic knowledge of SQL can therefore design their streaming applications easily on the GUI without understanding the Flink DataStream API or the Flink CEP library.
In this talk, we first present what motivates the development of FLOW, then show how FLOW can be used to solve the "Popular Places" exercise in its own style, and lastly explain how FLOW leverages the Flink Table API.
Flink Connector Development Tips & Tricks - Eron Wright
A look at some of the challenges and techniques for developing a connector for Apache Flink, covering the different types of connectors, lifecycle, metrics, event-time support, and fault tolerance.
Presentation video: https://www.youtube.com/watch?v=ZkbYO5S4z18
Flink Forward San Francisco 2018: Andrew Gao & Jeff Sharpe - "Finding Bad Ac... - Flink Forward
Within fintech, catching fraudsters is one of the primary opportunities for us to use streaming applications to apply ML models in real time. This talk will review our journey to bring fraud decisioning to our tellers at Capital One using Kafka, Flink and AWS Lambda. We will share our learnings and experiences with common problems such as custom windowing, breaking down a monolith app into small queryable state apps, feature engineering with Jython, dealing with back pressure from combining two disparate streams, model/feature validation in a regulatory environment, and running Flink jobs on Kubernetes.
Flink Forward San Francisco 2018: Dave Torok & Sameer Wadkar - "Embedding Fl... - Flink Forward
This document discusses using Apache Flink to operationalize a streaming machine learning lifecycle. It describes Comcast's need to improve customer experiences through predictive analytics over streaming data. Flink is used to orchestrate feature engineering, model training/evaluation, and real-time predictions. Key aspects of the solution include a metadata-driven pipeline, automated deployments, consistent feature stores for training and prediction, and monitoring of multiple models. The document outlines the various components of the ML lifecycle and pipeline implemented on Flink and discusses next steps around UI/UX, continuous monitoring, and supporting multiple feature stores.
Virtual Flink Forward 2020: Keynote: The Evolution of Data Infrastructure at ... - Flink Forward
Over the past few years almost all data processing has moved from batch to stream processing. This isn’t simply driven by a desire for lower latency, but by a fundamental understanding that streams are a more effective primitive for data processing, providing a better impedance match to varied downstream systems and services. Splunk, like many others, has been evolving its core data infrastructure to better provide a simpler and more consistent programming model, address correctness and latency of data, and allow for a more open integration model with our data platform. Throughout this process, we’ve come to view Apache Flink as a critical backbone in our core data infrastructure. Join us to learn more about how our data infrastructure - and how we think about it - has fundamentally changed.
Flink Forward San Francisco 2019: Towards Flink 2.0: Rethinking the stack and... - Flink Forward
Virtual Flink Forward 2020: Everything is connected: How watermarking, scalin... - Flink Forward
The document discusses how Pravega, an open source stream storage system, enables features like watermarking, scaling, and exactly-once processing in stream processing systems. It explains that Pravega stores streams as sequences of events across distributed segments, which allows for watermarking of event timestamps, dynamic scaling of streams, and tracking of event ingestion to enable exactly-once processing. Checkpointing and replay of events from checkpoints also allows stream processors using Pravega to recover from failures while maintaining exactly-once semantics.
Flink Forward San Francisco 2018: Jörg Schad and Biswajit Das - "Operating Fl... - Flink Forward
Flink has officially supported Apache Mesos since the 1.2 release, and many users were using them together even before that. The latest releases, 1.4 and 1.5 (not released at the time of writing), add a deeper integration for resource schedulers such as Mesos, which also resulted in many new features around this integration. But what does that mean in practice for operating large clusters? In this talk, we will discuss operational best practices, along with some pitfalls, for operating large Flink clusters on top of Apache Mesos, including topics such as deployments, monitoring, scaling, upgrades, and debugging.
Distributed stream processing is evolving from a technology on the sidelines of Big Data to a key enabler for businesses to provide more scalable, real-time services to their customers. We at Ververica, the company founded by the original creators of Apache Flink, and other prominent players in the Flink community have witnessed this development from the driver’s seat. Working with our customers and the wider community, we have seen great success stories and we have seen things go wrong. In this talk, I would like to share anecdotes and hard-learned lessons of adopting distributed stream processing, both specific to Apache Flink and across frameworks. Afterwards, you will know how not to model your use cases as a stream processing application, which data structures not to use, how not to deal with failure, how not to approach the topic of monitoring, and much more.
Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ... - Flink Forward
In this session, we will look at how Apache Flink can be used to stream anonymized API request and response data from a production environment to make sure staging environments are up-to-date and reflect the most recent features (and bugs) that comprise a service. The talk will also examine how to deal with issues of data retention, throttling, and persistence, finishing with recommendations for how to use these sandbox environments to rapidly prototype and test new features and fixes.
Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m... - Flink Forward
“Customer experience is the next big battle ground for telcos,” Amit Akhelikar, Global Director of Lynx Analytics, recently proclaimed at TM Forum Live! Asia in Singapore. But how to fight in this battle? A common approach has been to keep some well-known network quality indicators “under control”, like dropped calls, radio access congestion, availability, and so on; but this has proven not to be enough to keep customers happy, just as a siege weapon is not enough to conquer a city. But what if it were possible to know how customers perceive services, at least the most demanded ones, like web browsing or video streaming? That would be like a squad of archers ready for battle. And even with that, how to extract value from it and take action in no time, giving our skilled archers the right targets? Meet CANVAS (Customer And Network Visualization and AnalyticS), one of the first LATAM implementations of a Flink-based stream processing use case for a telco. It successfully combines leading and innovative technologies like Apache Hadoop, YARN, Kafka, NiFi, Druid and advanced visualizations with Flink core features like non-trivial stateful stream processing (joins, windows and aggregations on event time) and CEP capabilities for alarm generation, delivering a next-generation tool for SOC (Service Operation Center) teams.
Deploying Flink on Kubernetes - David Anderson - Ververica
Kubernetes has rapidly established itself as the de facto standard for orchestrating containerized infrastructures. And with the recent completion of the refactoring of Flink's deployment and process model known as FLIP-6, Kubernetes has become a natural choice for Flink deployments. In this talk we will walk through how to get Flink running on Kubernetes.
In the last decade, many distributed stream processing engines (SPEs) were developed to perform continuous queries on massive online data. The central design principle of these engines is to handle queries that potentially run forever on data streams with a query-at-a-time model, i.e., each query is optimized and executed separately. In many real applications, streams are not only processed with long-running queries, but also thousands of short-running ad-hoc queries. To support this efficiently, it is essential to share resources and computation for stream ad-hoc queries in a multi-user environment.
The goal of this talk is to bridge the gap between stream processing and ad-hoc queries in SPEs by sharing computation and resources. We define three main requirements for ad-hoc shared stream processing: (1) Integration: Ad-hoc query processing should be a composable layer which can extend stream operators, such as join, aggregation, and window operators; (2) Consistency: Ad-hoc query creation and deletion must be performed in a consistent manner and ensure exactly-once semantics and correctness; (3) Performance: In contrast to state-of-the-art SPEs, an ad-hoc SPE should maximize not only data throughput but also query throughput via incremental computation and resource sharing. Based on these requirements, we have developed AStream, an ad-hoc, shared computation stream processing framework.
To the best of our knowledge, AStream is the first system that supports distributed ad-hoc stream processing. AStream is built on top of Apache Flink. Our experiments show that AStream achieves results comparable to Flink for single-query deployments and outperforms it by orders of magnitude with multiple queries.
Streaming your Lyft Ride Prices - Flink Forward SF 2019 - Thomas Weise
At Lyft we dynamically price our rides with a combination of various data sources, machine learning models, and streaming infrastructure for low latency, reliability and scalability. Dynamic pricing allows us to quickly adapt to real-world changes and be fair to drivers (by, say, raising rates when there's a lot of demand) and fair to passengers (by, say, offering to return 10 minutes later for a cheaper rate). The streaming platform powers pricing by bringing together the best of two worlds using Apache Beam: ML algorithms in Python and Apache Flink as the streaming engine.
https://sf-2019.flink-forward.org/conference-program#streaming-your-lyft-ride-prices
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach... - Flink Forward
Apache Flink is a popular framework for real-time stream computing. Many stream compute algorithms require trailing data in order to compute the intended result. One example is computing the number of user logins in the last 7 days. This creates a dilemma: the results of the stream program are incomplete until the program has been running for more than 7 days. The alternative is to bootstrap the program using historic data to seed the state before shifting to real-time data. This talk will discuss alternatives for bootstrapping programs in Flink. Some alternatives rely on technologies exogenous to the stream program, such as enhancements to the pub/sub layer, that are more generally applicable to other stream compute engines. Other alternatives include enhancements to Flink source implementations. Lyft is exploring another alternative using orchestration of multiple Flink programs. The talk will cover why Lyft pursued this alternative and future directions to further enhance bootstrapping support in Flink.
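The 7-day login example can be made concrete with a small sketch. It is plain Python rather than Flink code and is not the approach described in the talk; `TrailingLoginCounter` and `bootstrap` are hypothetical names, and timestamps are assumed to be event-time seconds.

```python
from collections import deque

SEVEN_DAYS = 7 * 24 * 60 * 60  # window length in seconds


class TrailingLoginCounter:
    """Counts logins per user over a trailing 7-day window."""

    def __init__(self):
        self.events = {}  # user -> deque of login timestamps, oldest first

    def add(self, user, ts):
        self.events.setdefault(user, deque()).append(ts)

    def count(self, user, now):
        window = self.events.get(user, deque())
        # Evict logins that have fallen out of the trailing window.
        while window and window[0] <= now - SEVEN_DAYS:
            window.popleft()
        return len(window)


def bootstrap(counter, historic_events):
    """Seed state from historic (user, timestamp) records before going live."""
    for user, ts in sorted(historic_events, key=lambda e: e[1]):
        counter.add(user, ts)
    return counter
```

A counter started without `bootstrap` would only see logins that arrive after startup, which is exactly the incompleteness the abstract describes.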
Flink Forward San Francisco 2018 keynote: Srikanth Satya - "Stream Processin... - Flink Forward
Stream Processing in conjunction with a Consistent, Durable, Reliable stream storage is kicking the revolution up a notch in Big Data processing. This modern paradigm is enabling a new generation of data middleware that delivers on the streaming promise of a simplified and unified programming model. From data ingest, transformation, and messaging to search, time series and more, a robust streaming data ecosystem means we’ll all be able to more quickly build applications that solve problems we could not solve before.
Virtual Flink Forward 2020: Data driven matchmaking streaming at Hyperconnect... - Flink Forward
HyperConnect's one-to-one video matchmaking system consists of various machine learning techniques to maximize user satisfaction. Our matchmaking system manages a large user context containing actions from a few seconds ago, and reacts in milliseconds to produce meaningful new results in each user session. This is difficult to do in a traditional way, so distributed streaming is essential in these cases. Topics include:
- Why our team chose Apache Flink over the alternatives
- Matchmaking streaming architecture, with detailed abstraction levels based on Flink operators
- Pairwise scoring microservice management with Flink
- Stateful matchmaking computation with low latency, fault tolerance, and scalability
- How to manage large-scale events: classifying feature types, collecting with a multi-window stream
- Applications: personalization, multi-armed bandit on stream
Scaling stream data pipelines with Pravega and Apache Flink - Till Rohrmann
The document discusses scaling stream data pipelines. It covers how streams from social networks, online shopping, server monitoring and IoT sensors are becoming more common. It discusses how workloads have daily, weekly and seasonal cycles that cause spikes. It then discusses how to scale event processing by adding more processors as the input rate increases. It introduces Pravega as a stream storage system that can store streams permanently while preserving ordering and scaling to varying workloads. Pravega uses segments that can be split or merged to scale a stream. It describes Pravega's auto-scaling policies and how it triggers scaling events based on throughput metrics. Finally, it discusses how reader groups allow Pravega to maintain ordering during scaling.
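The idea of splitting and merging segments based on throughput can be sketched as follows. This is an illustration only and not Pravega's API or its actual policy: `Segment`, `rescale`, the split threshold (the target rate), and the merge threshold (half the target rate) are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    lo: float    # inclusive start of the routing-key range
    hi: float    # exclusive end of the routing-key range
    rate: float  # observed events per second


def rescale(segments, target_rate):
    """Return a new segment list: split hot segments, merge cold neighbours."""
    # Pass 1: split any segment whose rate exceeds the target.
    split = []
    for s in segments:
        if s.rate > target_rate:
            mid = (s.lo + s.hi) / 2
            # Assumes load is spread evenly over the key range.
            split.append(Segment(s.lo, mid, s.rate / 2))
            split.append(Segment(mid, s.hi, s.rate / 2))
        else:
            split.append(s)
    # Pass 2: merge adjacent segments whose combined rate is under half the target.
    merged = []
    for s in split:
        if merged and merged[-1].rate + s.rate < target_rate / 2:
            prev = merged.pop()
            merged.append(Segment(prev.lo, s.hi, prev.rate + s.rate))
        else:
            merged.append(s)
    return merged
```

Because segments always cover contiguous key ranges, events with the same routing key still land in exactly one segment after each scaling step, which is what makes per-key ordering preservable.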
Flink Forward San Francisco 2019: Using Flink to inspect live data as it flow... - Flink Forward
Using Flink to inspect live data as it flows through a data pipeline
One of the hardest challenges with authoring a data pipeline in Flink is understanding what your data looks like at each stage of the pipeline. Pipeline authors would love to answer questions like "why is no data coming through my filter?" Or "why did my regex not extract any fields?" Or "is my pipeline even reading anything from Kafka?" Unit and integration testing pipeline logic goes a long way, and metrics are another great tool to understand what a pipeline is doing, but sometimes you need the data itself to answer why a pipeline is behaving the way it is.
To answer these questions for ourselves and our customers, at Splunk we created a simple yet robust architecture for extracting data as it moves through a pipeline. You'll also learn about our implementation of this architecture, including the lessons learned while creating it, and how you can apply this architecture yourself. You'll hear about how to rewrite your Flink job graph at job submission time, how to retrieve data from all the nodes in the job graph, and how to expose this information to a user interface through a REST API.
One of the biggest challenges in streaming application development is making sure your pipeline does exactly what it is supposed to do. The combination of different data sources, sinks and complex application behavior such as time-based functionality or interaction with external systems doesn’t make the problem of proper testing any easier. In this talk we show you some of the excellent testing utilities built into Flink that can be used to unit-test parts of your application and to integration-test complex data pipelines. We will also look at some external libraries developed by the community that can be used to further improve the testing experience and reduce time to production. Last but not least, we will share some tools and best practices that can help debug problems that managed to fall through the cracks. By the end of this talk you will be familiar with some of the best tools to test and debug your streaming pipelines to give you extra confidence in your applications.
This document discusses Apache Flink version 1.7 and beyond. It summarizes key features of Flink 1.7 including contributions from 112 contributors and over 1,000 commits. It also discusses upcoming features in Flink 1.8 such as support for state schema evolution, dynamic scaling, unifying batch and streaming, an extendable scheduler, and end-to-end SQL-only pipelines. The document encourages participation in the Flink community.
dA Platform is a production-ready platform for stream processing with Apache Flink®. The Platform includes open source Apache Flink, a stateful stream processing and event-driven application framework, and dA Application Manager, a central deployment and management component. dA Platform schedules clusters on Kubernetes, deploys stateful Flink applications, and controls these applications and their state.
Netflix uses Conductor, an open source microservices orchestrator, to manage complex content processing workflows involving ingestion, encoding, localization, and delivery. Conductor provides visibility, control, and reuse of tasks through a task queuing system and workflow definitions. It has scaled to process millions of workflow executions across Netflix's content platform using a stateless architecture with Dynomite for storage and Dyno-Queues for task distribution.
Flink Forward Berlin 2017: Patrick Lucas - Flink in Containerland - Flink Forward
Apache Flink, a powerful distributed stateful stream processing framework, is an especially good fit for deployment on a containerization platform: its storage requirement is primarily external (e.g. HDFS or S3), clusters often share the lifetime of the jobs that run on them, and the flexibility of allocating resources on such a platform allows for scaling jobs up and down as necessary. In this talk I will give a brief introduction to Apache Flink, then describe the journey to making it a first-class citizen of the container world. I will cover my experience preparing to publish the “official repository” of Flink images on Docker Hub, the challenges of fitting a Flink deployment in a Kubernetes-shaped box, and the rough edges of Flink itself that were exposed by this process.
Distributed stream processing is evolving from a technology on the sidelines of Big Data to a key enabler for businesses to provide more scalable, real-time services to their customers. We at Ververica, the company founded by the original creators of Apache Flink, and other prominent players in the Flink community have witnessed this development from the driver’s seat. Working with our customers and the wider community, we have seen great success stories and we have seen things go wrong. In this talk, I would like to share anecdotes and hard-learned lessons of adopting distributed stream processing, both specific to Apache Flink and across frameworks. Afterwards, you will know how not to model your use cases as a stream processing application, which data structures not to use, how not to deal with failure, how not to approach the topic of monitoring, and much more.
Video: https://www.youtube.com/watch?v=F7HQd3KX2TQ&list=PLDX4T_cnKjD207Aa8b5CsZjc7Z_KRezGz&index=48&t=6s
This is a talk that I gave at the Data Council Berlin Meetup on May 16th, 2019.
Abstract:
Stream processing is being rapidly adopted by the enterprise. While in the past, stream processing frameworks mostly provided Java- or Scala-based APIs, stream processing with SQL is growing increasingly popular because it makes stream processing accessible to non-programmers and significantly reduces the effort to solve common tasks.
About three years ago, the Apache Flink community started adding SQL support to process static and streaming data in a unified fashion. Today, Flink SQL powers production systems at Alibaba, Huawei, Lyft, and Uber. Fabian Hueske discusses the current state of Flink’s SQL support and explains the importance of Flink’s unified approach to process static and streaming data. After covering the basics, he shares common real-world use cases ranging from low-latency ETL to pattern detection and demonstrates how easily they can be addressed with Flink SQL.
Virtual Flink Forward 2020: Keynote: The Evolution of Data Infrastructure at ... (Flink Forward)
Over the past few years almost all data processing has moved from batch to stream processing. This isn’t simply driven by a desire for lower latency, but by a fundamental understanding that streams are a more effective primitive for data processing, providing a better impedance match to varied downstream systems and services. Splunk, like many others, has been evolving its core data infrastructure to better provide a simpler and more consistent programming model, address correctness and latency of data, and allow for a more open integration model with our data platform. Throughout this process, we’ve come to view Apache Flink as a critical backbone in our core data infrastructure. Join us to learn more about how our data infrastructure - and how we think about it - has fundamentally changed.
Flink Forward San Francisco 2019: Towards Flink 2.0: Rethinking the stack and... (Flink Forward)
Flink currently features different APIs for bounded/batch (DataSet) and streaming (DataStream) programs. And while the DataStream API can handle batch use cases, it is much less efficient at them than the DataSet API. The Table API was built as a unified API on top of both, to cover batch and streaming with the same API, and under the hood it delegates to either DataSet or DataStream.
In this talk, we present the latest on the Flink community's efforts to rework the APIs and the stack for a better unified batch & streaming experience. We will discuss:
- The future roles and interplay of DataSet, DataStream, and Table API
- The new Flink stack and the abstractions on which these APIs will build
- The new unified batch/streaming sources
- How batch and streaming optimizations differ in the runtime, and what the future interplay of batch and streaming execution could look like
Virtual Flink Forward 2020: Everything is connected: How watermarking, scalin... (Flink Forward)
The document discusses how Pravega, an open source stream storage system, enables features like watermarking, scaling, and exactly-once processing in stream processing systems. It explains that Pravega stores streams as sequences of events across distributed segments, which allows for watermarking of event timestamps, dynamic scaling of streams, and tracking of event ingestion to enable exactly-once processing. Checkpointing and replay of events from checkpoints also allow stream processors using Pravega to recover from failures while maintaining exactly-once semantics.
Flink Forward San Francisco 2018: Jörg Schad and Biswajit Das - "Operating Fl... (Flink Forward)
Flink has supported Apache Mesos officially since the 1.2 release, and many users had been using them together even before that. The latest releases 1.4 and 1.5 (not released at the time of writing) add a deeper integration for resource schedulers, such as Mesos, which also resulted in many new features around this integration. But what does that mean in practice for operating large clusters? In this talk, we will discuss operational best practices, along with some pitfalls, for operating large Flink clusters on top of Apache Mesos, including topics such as:
- Deployments
- Monitoring
- Scaling
- Upgrades
- Debugging
Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ... (Flink Forward)
In this session, we will look at how Apache Flink can be used to stream anonymized API request and response data from a production environment to make sure staging environments are up-to-date and reflect the most recent features (and bugs) that comprise a service. The talk will also examine how to deal with issues of data retention, throttling, and persistence, finishing with recommendations for how to use these sandbox environments to rapidly prototype and test new features and fixes.
Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m... (Flink Forward)
“Customer experience is the next big battle ground for telcos,” Amit Akhelikar, Global Director of Lynx Analytics, recently proclaimed at TM Forum Live! Asia in Singapore. But how do you fight this battle? A common approach has been to keep some well-known network quality indicators “under control”, such as dropped calls, radio access congestion, availability, and so on; but this has proven insufficient to keep customers happy, just as a siege weapon alone is not enough to conquer a city. But what if it were possible to know how customers perceive services, at least the most in-demand ones, such as web browsing or video streaming? That would be like a squad of archers ready for battle. And even with that, how do you extract value from it and act in no time, giving our skilled archers the right targets? Meet CANVAS (Customer And Network Visualization and AnalyticS), one of the first LATAM implementations of a Flink-based stream processing use case for a telco, which successfully combines leading and innovative technologies like Apache Hadoop, YARN, Kafka, Nifi, Druid and advanced visualizations with Flink core features like non-trivial stateful stream processing (joins, windows and aggregations on event time) and CEP capabilities for alarm generation, delivering a next-generation tool for SOC (Service Operation Center) teams.
Deploying Flink on Kubernetes - David Anderson (Ververica)
Kubernetes has rapidly established itself as the de facto standard for orchestrating containerized infrastructures. And with the recent completion of the refactoring of Flink's deployment and process model known as FLIP-6, Kubernetes has become a natural choice for Flink deployments. In this talk we will walk through how to get Flink running on Kubernetes.
In the last decade, many distributed stream processing engines (SPEs) were developed to perform continuous queries on massive online data. The central design principle of these engines is to handle queries that potentially run forever on data streams with a query-at-a-time model, i.e., each query is optimized and executed separately. In many real applications, streams are not only processed with long-running queries, but also thousands of short-running ad-hoc queries. To support this efficiently, it is essential to share resources and computation for stream ad-hoc queries in a multi-user environment.
The goal of this talk is to bridge the gap between stream processing and ad-hoc queries in SPEs by sharing computation and resources. We define three main requirements for ad-hoc shared stream processing: (1) Integration: Ad-hoc query processing should be a composable layer which can extend stream operators, such as join, aggregation, and window operators; (2) Consistency: Ad-hoc query creation and deletion must be performed in a consistent manner and ensure exactly-once semantics and correctness; (3) Performance: In contrast to state-of-the-art SPEs, an ad-hoc SPE should maximize not only data throughput but also query throughput via incremental computation and resource sharing. Based on these requirements, we have developed AStream, an ad-hoc, shared computation stream processing framework.
To the best of our knowledge, AStream is the first system that supports distributed ad-hoc stream processing. AStream is built on top of Apache Flink. Our experiments show that AStream achieves results comparable to Flink for single-query deployments and outperforms it by orders of magnitude with multiple queries.
Streaming your Lyft Ride Prices - Flink Forward SF 2019 (Thomas Weise)
At Lyft we dynamically price our rides with a combination of various data sources, machine learning models, and streaming infrastructure for low latency, reliability and scalability. Dynamic pricing allows us to quickly adapt to real-world changes and be fair to drivers (say, by raising rates when there's a lot of demand) and fair to passengers (say, by offering to return 10 minutes later for a cheaper rate). The streaming platform powers pricing by bringing together the best of two worlds using Apache Beam: ML algorithms in Python and Apache Flink as the streaming engine.
https://sf-2019.flink-forward.org/conference-program#streaming-your-lyft-ride-prices
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach... (Flink Forward)
Apache Flink is a popular stream computing framework for real-time stream computing. Many stream compute algorithms require trailing data in order to compute the intended result. One example is computing the number of user logins in the last 7 days. This creates a dilemma where the results of the stream program are incomplete until the runtime of the program exceeds 7 days. The alternative is to bootstrap the program using historic data to seed the state before shifting to use real-time data. This talk will discuss alternatives to bootstrap programs in Flink. Some alternatives rely on technologies exogenous to the stream program, such as enhancements to the pub/sub layer, that are more generally applicable to other stream compute engines. Other alternatives include enhancements to Flink source implementations. Lyft is exploring another alternative using orchestration of multiple Flink programs. The talk will cover why Lyft pursued this alternative and future directions to further enhance bootstrapping support in Flink.
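The 7-day login example above can be made concrete with a small sketch. This is plain Python rather than Flink code, and the class and function names (`TrailingLoginCounter`, `bootstrap`) are hypothetical; it only illustrates why a freshly started program under-counts until its state has been seeded from historic data. It assumes events for each user arrive in timestamp order.

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)


class TrailingLoginCounter:
    """Counts logins per user over a trailing 7-day window (illustrative only)."""

    def __init__(self):
        # user -> deque of login timestamps, oldest first
        self._logins = {}

    def add(self, user, ts):
        # Assumes timestamps for a given user arrive in increasing order.
        self._logins.setdefault(user, deque()).append(ts)

    def count(self, user, now):
        q = self._logins.get(user, deque())
        # Evict logins that have fallen out of the trailing window.
        while q and q[0] <= now - WINDOW:
            q.popleft()
        return len(q)


def bootstrap(counter, historic_events):
    """Seed state from historic (user, timestamp) events, replayed in time order."""
    for user, ts in sorted(historic_events, key=lambda e: e[1]):
        counter.add(user, ts)


now = datetime(2018, 4, 10, 12, 0)
historic = [("alice", now - timedelta(days=d)) for d in (1, 3, 6, 9)]

cold = TrailingLoginCounter()   # started without history
warm = TrailingLoginCounter()   # state seeded from historic data first
bootstrap(warm, historic)

# The same real-time login arrives at both programs.
cold.add("alice", now)
warm.add("alice", now)

cold_count = cold.count("alice", now)  # sees only the live event
warm_count = warm.count("alice", now)  # also sees the logins from 1, 3 and 6 days ago
print(cold_count, warm_count)
```

The cold counter would only become correct after running for a full window length, which is the dilemma the abstract describes.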
Flink Forward San Francisco 2018 keynote: Srikanth Satya - "Stream Processin... (Flink Forward)
Stream Processing in conjunction with a Consistent, Durable, Reliable stream storage is kicking the revolution up a notch in Big Data processing. This modern paradigm is enabling a new generation of data middleware that delivers on the streaming promise of a simplified and unified programming model. From data ingest, transformation, and messaging to search, time series and more, a robust streaming data ecosystem means we’ll all be able to more quickly build applications that solve problems we could not solve before.
Virtual Flink Forward 2020: Data driven matchmaking streaming at Hyperconnect... (Flink Forward)
HyperConnect's 1-to-1 video matchmaking system combines various machine learning techniques to maximize user satisfaction. Our matchmaking system manages a large user context, including actions from a few seconds ago, and reacts in milliseconds to produce meaningful new results in each user session. This is difficult to achieve in a traditional way, so distributed streaming is essential in these cases. Topics include:
- Why our team chose Apache Flink in comparison with alternatives
- Matchmaking streaming architecture with detailed abstraction levels based on Flink operators
- Pairwise scoring microservice management with Flink
- Stateful matchmaking computation with low latency, fault tolerance, and scalability
- How to manage large-scale events: classifying feature types, collecting with a multi-window stream
- Applications: personalization, multi-armed bandit on stream
Scaling stream data pipelines with Pravega and Apache Flink (Till Rohrmann)
The document discusses scaling stream data pipelines. It covers how streams from social networks, online shopping, server monitoring and IoT sensors are becoming more common. It discusses how workloads have daily, weekly and seasonal cycles that cause spikes. It then discusses how to scale event processing by adding more processors as the input rate increases. It introduces Pravega as a stream storage system that can store streams permanently while preserving ordering and scaling to varying workloads. Pravega uses segments that can be split or merged to scale a stream. It describes Pravega's auto-scaling policies and how it triggers scaling events based on throughput metrics. Finally, it discusses how reader groups allow Pravega to maintain ordering during scaling.
Flink Forward San Francisco 2019: Using Flink to inspect live data as it flow... (Flink Forward)
Using Flink to inspect live data as it flows through a data pipeline
One of the hardest challenges with authoring a data pipeline in Flink is understanding what your data looks like at each stage of the pipeline. Pipeline authors would love to answer questions like "why is no data coming through my filter?" Or "why did my regex not extract any fields?" Or "is my pipeline even reading anything from Kafka?" Unit and integration testing pipeline logic goes a long way, and metrics are another great tool to understand what a pipeline is doing, but sometimes you need the data itself to answer why a pipeline is behaving the way it is.
To answer these questions for ourselves and our customers, at Splunk we created a simple yet robust architecture for extracting data as it moves through a pipeline. You'll also learn about our implementation of this architecture, including the lessons learned while creating it, and how you can apply this architecture yourself. You'll hear about how to rewrite your Flink job graph at job submission time, how to retrieve data from all the nodes in the job graph, and how to expose this information to a user interface through a REST API.
One of the biggest challenges in streaming application development is making sure your pipeline does exactly what it is supposed to do. The combination of different data sources, sinks and complex application behavior such as time-based functionality or interaction with external systems doesn’t make the problem of proper testing any easier. In this talk we show you some of the excellent testing utilities built into Flink that can be used to unit-test parts of our application and to integration-test complex data pipelines. We will also look at some external libraries developed by the community that can be used to further improve the testing experience and reduce time to production. Last but not least, we will share some tools and best practices that can help debug problems that managed to fall through the cracks. By the end of this talk you will be familiar with some of the best tools to test and debug your streaming pipelines to give you extra confidence in your applications.
This document discusses Apache Flink version 1.7 and beyond. It summarizes key features of Flink 1.7 including contributions from 112 contributors and over 1,000 commits. It also discusses upcoming features in Flink 1.8 such as support for state schema evolution, dynamic scaling, unifying batch and streaming, an extendable scheduler, and end-to-end SQL-only pipelines. The document encourages participation in the Flink community.
dA Platform is a production-ready platform for stream processing with Apache Flink®. The Platform includes open source Apache Flink, a stateful stream processing and event-driven application framework, and dA Application Manager, a central deployment and management component. dA Platform schedules clusters on Kubernetes, deploys stateful Flink applications, and controls these applications and their state.
Netflix uses Conductor, an open source microservices orchestrator, to manage complex content processing workflows involving ingestion, encoding, localization, and delivery. Conductor provides visibility, control, and reuse of tasks through a task queuing system and workflow definitions. It has scaled to process millions of workflow executions across Netflix's content platform using a stateless architecture with Dynomite for storage and Dyno-Queues for task distribution.
Flink Forward Berlin 2017: Patrick Lucas - Flink in Containerland (Flink Forward)
Apache Flink, a powerful distributed stateful stream processing framework, is an especially good fit for deployment on a containerization platform: its storage requirement is primarily external (e.g. HDFS or S3), clusters often share the lifetime of the jobs that run on them, and the flexibility of allocating resources on such a platform allows for scaling jobs up and down as necessary. In this talk I will give a brief introduction to Apache Flink, then describe the journey to making it a first-class citizen of the container world. I will cover my experience preparing to publish the “official repository” of Flink images on Docker Hub, the challenges of fitting a Flink deployment in a Kubernetes-shaped box, and the rough edges of Flink itself that were exposed by this process.
What's new for Apache Flink's Table & SQL APIs? (Timo Walther)
About three years ago, the Apache Flink community started adding a Table & SQL API to process static and streaming data in a unified fashion. It makes data processing accessible to non-programmers and significantly reduces the effort to solve common tasks. Today, Flink SQL already powers production systems at Alibaba, Huawei, Lyft, and Uber. But we are only getting started! The community is currently re-shaping the future of data processing.
Even for followers of the Flink mailing lists, it can be quite difficult to keep track of all the developments that happen on Flink's relational APIs. In this talk, we give an overview of recent contributions, such as pluggable optimizers, the new type system with consistent type inference, SQL DDL support, and the Python Table API. We elaborate on how all these efforts interact and discuss the future roadmap.
Apache Kafka’s Transactions in the Wild! Developing an exactly-once KafkaSink... (HostedbyConfluent)
Apache Kafka is one of the most commonly used connectors with Apache Flink for exactly-once streaming use cases. The combination of both systems allows you to build mission-critical systems that require low end-to-end latency and exactly-once processing, e.g., banks processing transactions. In Apache Flink 1.14, we released a new KafkaSink based on Apache Flink’s unified Sink interface that natively supports streaming and batch executions.
However, we needed to stretch Kafka’s transactions API to fully support exactly-once processing in Flink. In this talk, we will start with a quick recap of Apache Kafka’s transactions and Flink’s checkpointing mechanism. Then, we describe the two-phase commit protocol implemented in KafkaSink in-depth and emphasize the difficulties we have overcome when applying Kafka’s transaction API to longer-lasting transactions.
We explain how we ensure performant writing to Apache Kafka and how the KafkaSink recovery works.
In summary, this talk should give users a deep dive into how Apache Flink leverages Apache Kafka’s transactions and developers an overview of what they have to consider when using Apache Kafka’s transactions.
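The interplay of checkpoints and transactions described above can be illustrated with a toy, in-memory model of a two-phase commit sink. This is a sketch under stated assumptions, not the KafkaSink implementation: the class `ToyTransactionalSink` and its method names are hypothetical, pending transactions are kept in a plain dictionary rather than in checkpointed state, and no real Kafka API is involved.

```python
class ToyTransactionalSink:
    """In-memory model of a two-phase commit sink driven by checkpoints."""

    def __init__(self):
        self.committed = []   # records visible to downstream readers
        self._open = []       # records written in the current transaction
        self._pending = {}    # checkpoint id -> pre-committed records

    def write(self, record):
        self._open.append(record)

    def pre_commit(self, checkpoint_id):
        # Phase 1: at a checkpoint, seal the current transaction and open a new one.
        self._pending[checkpoint_id] = self._open
        self._open = []

    def commit(self, checkpoint_id):
        # Phase 2: the checkpoint completed, so sealed transactions become visible.
        for cid in sorted(c for c in self._pending if c <= checkpoint_id):
            self.committed.extend(self._pending.pop(cid))

    def recover(self, last_completed_checkpoint):
        # After a failure: finish commits that belong to completed checkpoints
        # and abort everything else, including the open transaction.
        self._open = []
        for cid in sorted(self._pending):
            records = self._pending.pop(cid)
            if cid <= last_completed_checkpoint:
                self.committed.extend(records)


sink = ToyTransactionalSink()
sink.write("a")
sink.write("b")
sink.pre_commit(1)
sink.commit(1)            # checkpoint 1 completes

sink.write("c")
sink.pre_commit(2)        # checkpoint 2 starts but never completes
sink.write("d")
sink.recover(1)           # failure: roll back to checkpoint 1
after_recovery = list(sink.committed)

# The source replays everything after checkpoint 1.
sink.write("c")
sink.write("d")
sink.pre_commit(2)
sink.commit(2)
print(after_recovery, sink.committed)
```

Because uncommitted records never become visible, replaying "c" and "d" after recovery does not produce duplicates. The longer a checkpoint interval is, the longer each transaction stays open, which is where the long-lasting transactions mentioned in the abstract come from.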
Unified Data Processing with Apache Flink and Apache Pulsar_Seth Wiesman (StreamNative)
Come learn how the combination of Apache Pulsar and Apache Flink is making stateful stream processing even more expressive and flexible to support applications in streaming that were previously not considered streamable. The new world of applications and fast data architectures has broken up the database: Raw data persistence comes in the form of event logs, and the state of the world is computed by a stream processor. Apache Pulsar provides a strong solution for the event log, while Apache Flink forms a powerful foundation for the computation over the event streams.
We will discuss the key concepts behind Apache Flink's approach to stream processing and how it is a powerful abstraction for stateful event-driven applications. We will then see how to use Flink in conjunction with Apache Pulsar to create a unified data processing platform.
Spark Streaming 2.0 introduces Structured Streaming which addresses some areas for improvement in Spark Streaming 1.X. Structured Streaming builds streaming queries on the Spark SQL engine, providing implicit benefits like extending the primary batch API to streaming and gaining an optimizer. It introduces a more seamless API between batch and stream processing, supports event time semantics, and provides end-to-end fault tolerance guarantees through checkpointing. Structured Streaming also aims to simplify streaming application development by managing streaming queries and allowing continuous queries to be started, stopped, and modified more gracefully.
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee... (Timo Walther)
Apache Flink is a distributed, stateful stream processor. It features exactly-once state consistency, sophisticated event-time support, high throughput and low latency processing, and APIs at different levels of abstraction (Java, Scala, SQL). In my talk, I'll give an introduction to Apache Flink, its features and discuss the use cases it solves. I'll explain why batch is just a special case of stream processing, how its community evolves Flink into a truly unified stream and batch processor and what this means for its users.
https://www.meetup.com/de-DE/Bangalore-Apache-Kafka-Group/events/265285812/
https://www.youtube.com/watch?v=Ych5bbmDIoA&list=PLvkUPePDi9sa27SG9eGNXH25cfUeo_WY9&index=2
OSMC 2019 | The Telegraf Toolbelt: It Can Do That, Really? by David McKay (NETWAYS)
Telegraf is an agent for collecting, processing, aggregating, and writing metrics.
With over 200 plugins, Telegraf can fetch metrics from a variety of sources, allowing you to build aggregations and write those metrics to InfluxDB, Prometheus, Kafka, and more.
In this talk, we will take a look at some of the lesser known, but awesome, plugins that are often overlooked; as well as how to use Telegraf for monitoring of Cloud Native systems.
Lessons Learned Building a Connector Using Kafka Connect (Katherine Stanley &... (confluent)
While many companies are embracing Apache Kafka as their core event streaming platform they may still have events they want to unlock in other systems. Kafka Connect provides a common API for developers to do just that and the number of open-source connectors available is growing rapidly. The IBM MQ sink and source connectors allow you to flow messages between your Apache Kafka cluster and your IBM MQ queues. In this session I will share our lessons learned and top tips for building a Kafka Connect connector. I’ll explain how a connector is structured, how the framework calls it and some of the things to consider when providing configuration options. The more Kafka Connect connectors the community creates the better, as it will enable everyone to unlock the events in their existing systems.
This document provides an overview of a tutorial on building an SRv6-enabled fabric with P4 and ONOS. The tutorial consists of 4 exercises: 1) enabling packet I/O between the switch and control plane, 2) adding Ethernet bridging, 3) adding IPv6 routing, and 4) adding Segment Routing (SRv6). It introduces the software tools used, including P4Runtime for runtime control of P4 switches, Stratum as a P4Runtime server, and ONOS as the control plane. The goal is to learn how to program P4 switches and build full-stack network applications from a P4 program to an end-to-end solution.
Photon Controller: An Open Source Container Infrastructure Platform from VMware (Docker, Inc.)
This document summarizes a presentation about VMware's open source Photon Platform, which is optimized for running container workloads at scale. It discusses how Photon Platform uses the Photon Controller distributed management plane and Photon Machine compute hosts to provide a cloud-native platform that can support hundreds of thousands of containers. It also demonstrates deploying a Docker Swarm cluster on top of Photon Platform through the Photon API and controller.
Webinar: Flink SQL in Action - Fabian Hueske (Ververica)
Stream processing is being rapidly adopted by the enterprise. While in the past, stream processing frameworks mostly provided Java- or Scala-based APIs, stream processing with SQL has recently been gaining a lot of attention because it makes stream processing accessible to non-programmers and significantly reduces the effort to solve common tasks.
About three years ago, the Apache Flink community started adding SQL support to process static and streaming data in a unified fashion. Today, Flink SQL powers production systems at Alibaba, Huawei, Lyft, and Uber. In this talk, I will discuss the current state of Flink’s SQL support and explain the importance of Flink’s unified approach to process static and streaming data. Once the basics are covered, I will present common real-world use cases ranging from low-latency ETL to pattern detection and demonstrate how easily they can be addressed by Flink SQL.
We discuss existing and new hardware virtualization features. First, we review the existing hardware features that are not used by Xen today, showing example use cases. 1) "Descriptor-table exiting" should be useful for guest kernels or security agents to enhance security features. 2) The VMX-preemption timer allows the hypervisor to preempt guest VM execution after a specified amount of time, which is useful for implementing fair scheduling. The hardware can save the timer value on each successive VM exit, after setting the initial VM quantum. 3) VMFUNC is an operation provided by the processor that can be invoked from VMX non-root operation without a VM exit. Today, EPTP switching is available, and we discuss how we can use the feature. Second, we talk about new hardware features, especially for interrupt optimizations.
NETCONF & YANG Enablement of Network Devices (Cisco DevNet)
A technical discussion and a demo showing how Tail-f's ConfD management agent can be used to implement NETCONF and YANG, the industry-leading solution for providing a programmable management interface in a network element. ConfD is recognized as the best-in-breed embedded software for implementing management functions in network elements, including physical devices and virtualized network functions (VNF) for NFV.
This Workshop is a best fit for engineers who are involved in the design and development of embedded software for network devices. Attendees will gain a basic understanding of what NETCONF and YANG are and how ConfD provides a solution for embedding this technology in the network devices. More information about ConfD can be found at: https://developer.cisco.com/site/confD/
Watch the DevNet 1216 replay from the Cisco Live On-Demand Library at: https://www.ciscolive.com/online/connect/sessionDetail.ww?SESSION_ID=92703&backBtn=true
Check out more and register for Cisco DevNet: http://ow.ly/jCNV3030OfS
GitHub Actions - using Free Oracle Cloud Infrastructure (OCI) (Phil Wilkins)
This document provides an overview of implementing GitHub Actions pipelines on Oracle Cloud Infrastructure (OCI). It discusses how GitHub Actions works differently than Jenkins by breaking up pipelines into more granular tasks that can run highly parallelized. It also covers how to configure GitHub Actions runners on different platforms including OCI, other clouds, and on-premises. The document demonstrates how to structure a sample Java pipeline in GitHub Actions and discusses some advanced features like retrieving artifacts between jobs and using environment variables. It concludes by highlighting considerations for building GitHub Actions pipelines like security, orchestration approach, and cleanup of runners.
Flink has evolved from a batch processor to a unified stream and batch processing framework. It now supports event-time processing, state, and low-level streaming with ProcessFunction. Looking ahead, Flink aims to improve elasticity, fault tolerance, SQL support, and handling large state through incremental snapshots. It also plans to offer more control over resource allocation and scaling through both active and reactive modes.
Similar to Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Ververica (20)
Building a fully managed stream processing platform on Flink at scale for Lin... (Flink Forward)
Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real-time. At LinkedIn, we developed a fully managed stream processing platform on Flink running on K8s to power hundreds of stream processing pipelines in production. This platform is the backbone for other infra systems like Search, Espresso (internal document store) and feature management etc. We provide a rich authoring and testing environment which allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions that it brings to stream processing at LinkedIn and share the experiences and lessons we have learned.
Evening out the uneven: dealing with skew in FlinkFlink Forward
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by
Jun Qin & Karl Friedrich
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
Flink Forward San Francisco 2022.
To improve Amazon Alexa experiences and support machine learning inference at scale, we built an automated end-to-end solution for incremental model building or fine-tuning machine learning models through continuous learning, continual learning, and/or semi-supervised active learning. Customer privacy is our top concern at Alexa, and as we build solutions, we face unique challenges when operating at scale such as supporting multiple applications with tens of thousands of transactions per second with several dependencies including near-real time inference endpoints at low latencies. Apache Flink helps us transform and discover metrics in near-real time in our solution. In this talk, we will cover the challenges that we faced, how we scale the infrastructure to meet the needs of ML teams across Alexa, and go into how we enable specific use cases that use Apache Flink on Amazon Kinesis Data Analytics to improve Alexa experiences to delight our customers while preserving their privacy.
by
Aansh Shah
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
Flink Forward San Francisco 2022.
Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, this cost may not be avoidable with the current API, e.g., for efficient event-time stream-sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state lets you (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink.
by
Nico Kruber
Introducing the Apache Flink Kubernetes OperatorFlink Forward
Flink Forward San Francisco 2022.
The Apache Flink Kubernetes Operator provides a consistent approach to manage Flink applications automatically, without any human interaction, by extending the Kubernetes API. Given the increasing adoption of Kubernetes-based Flink deployments, the community has been working on a Kubernetes native solution as part of Flink that can benefit from the rich experience of community members and ultimately make Flink easier to adopt. In this talk we give a technical introduction to the Flink Kubernetes Operator and demonstrate the core features and use-cases through in-depth examples.
by
Thomas Weise
Flink Forward San Francisco 2022.
Resource elasticity is a frequently requested feature in Apache Flink: users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced; later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by
Robert Metzger
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by
Mason Chen
One sink to rule them all: Introducing the new Async SinkFlink Forward
Flink Forward San Francisco 2022.
Next time you want to integrate with a new destination for a demo, concept or production application, the Async Sink framework will bootstrap development, allowing you to move quickly without compromise. In Flink 1.15 we introduced the Async Sink base (FLIP-171), with the goal to encapsulate common logic and allow developers to focus on the key integration code. The new framework handles things like request batching, buffering records, applying backpressure, retry strategies, and at least once semantics. It allows you to focus on your business logic, rather than spending time integrating with your downstream consumers. During the session we will dive deep into the internals to uncover how it works, why it was designed this way, and how to use it. We will code up a new sink from scratch and demonstrate how to quickly push data to a destination. At the end of this talk you will be ready to start implementing your own Flink sink using the new Async Sink framework.
by
Steffen Hausmann & Danny Cranmer
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
Flink Forward San Francisco 2022.
In normal situations, the default Kafka consumer and producer configuration options work well. But we all know life is not all roses and rainbows, and in this session we’ll explore a few knobs that can save the day in atypical scenarios. First, we'll take a detailed look at the parameters available when reading from Kafka. We’ll inspect the parameters that help us quickly spot an application lock or crash, the ones that can significantly improve performance, and the ones to touch with gloves since they could cause more harm than benefit. Moreover, we’ll explore the partitioning options and discuss when diverging from the default strategy is needed. Next, we’ll discuss the Kafka Sink. After browsing the available options, we'll dive deep into understanding how to approach use cases like sinking enormous records, managing spikes, and handling small but frequent updates. If you want to understand how to make your application survive when the sky is dark, this session is for you!
by
Olena Babenko
Flink powered stream processing platform at PinterestFlink Forward
Flink Forward San Francisco 2022.
Pinterest is a visual discovery engine that serves over 433MM users. Stream processing allows us to unlock value from realtime data for pinners. At Pinterest, we adopt Flink as the unified stream processing engine. In this talk, we will share our journey in building a stream processing platform with Flink and how we onboard critical use cases to the platform. Pinterest has supported 90+ near-realtime streaming applications. We will cover the problem statement, how we evaluated potential solutions, and our decision to build the framework.
by
Rainie Li & Kanchi Masalia
Flink Forward San Francisco 2022.
This talk will take you on the long journey of Apache Flink into the cloud-native era. It started all the way from where Hadoop and YARN were the standard way of deploying and operating data applications.
We're going to deep dive into the cloud-native set of principles and how they map to the Apache Flink internals and recent improvements. We'll cover fast checkpointing, fault tolerance, resource elasticity, minimal infrastructure dependencies, industry-standard tooling, ease of deployment and declarative APIs.
After this talk you'll get a broader understanding of the operational requirements for a modern streaming application and where the current limits are.
by
David Moravek
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
Flink Forward San Francisco 2022.
In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We’ll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as some tips and Flink features that can speed up checkpointing and recovery times.
by
Piotr Nowojski
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
Flink Forward San Francisco 2022.
Running natively on Kubernetes, using the new Apache Flink Kubernetes Operator is a great way to deploy and manage Flink application and session deployments. In this presentation, we provide:
- A brief overview of Kubernetes operators and their benefits.
- Introduce the five levels of the operator maturity model.
- Introduce the newly released Apache Flink Kubernetes Operator and FlinkDeployment CRs
- Dockerfile modifications you can make to swap out UBI images and Java of the underlying Flink Operator container
- Enhancements we're making in:
  - Versioning/Upgradeability/Stability
  - Security
- Demo of the Apache Flink Operator in-action, with a technical preview of an upcoming product using the Flink Kubernetes Operator.
- Lessons learned
- Q&A
by
James Busche & Ted Chang
Flink Forward San Francisco 2022.
The Table API is one of the most actively developed components of Flink in recent times. Inspired by databases and SQL, it encapsulates concepts many developers are familiar with. It can be used with both bounded and unbounded streams in a unified way. But from afar it can be difficult to keep track of what this API is capable of and how it relates to Flink's other APIs. In this talk, we will explore the current state of the Table API. We will show how it can be used as a batch processor, a changelog processor, or a streaming ETL tool with many built-in functions and operators for deduplicating, joining, and aggregating data. By comparing it to the DataStream API we will highlight differences and elaborate on when to use which API. We will demonstrate hybrid pipelines in which both APIs interact with one another and contribute their unique strengths. Finally, we will take a look at some of the most recent additions as a first step to stateful upgrades.
by
David Andreson
Flink Forward San Francisco 2022.
Based on the new Flink-Pulsar connector, we implemented Flink's TableAPI and Catalog to help users to interact with the Pulsar cluster via Flink SQL easily. We would like to go through the design and implementation of the SQL connector in the following aspects:
1. Two different modes of using Pulsar as a metadata store
2. Data format transformation and management
3. SQL semantics support within Pulsar context
by
Sijie Guo & Neng Lu
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
Flink Forward San Francisco 2022.
At Bloomberg, we deal with high volumes of real-time market data. Our clients expect to be notified of any anomalies in this market data, which may indicate volatile movements in the markets, notable trades, forthcoming events, or system failures. The parameters for these alerts are always evolving and our clients can update them dynamically. In this talk, we'll cover how we utilized the open source Apache Flink and Siddhi SQL projects to build a distributed, scalable, low-latency and dynamic rule-based, real-time alerting system to solve our clients' needs. We'll also cover the lessons we learned along our journey.
by
Ajay Vyasapeetam & Madhuri Jain
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
Flink Forward San Francisco 2022.
At Stripe we have created a complete end to end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power from Flink, Kafka, and Pinot together. The pipeline provides exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset with trillion level rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by
Xiang Zhang & Pratyush Sharma & Xiaoman Dong
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
Flink Forward San Francisco 2022.
What if my data is already in order? Stream Processing has given us an elegant and powerful solution for running analytic queries and logic over high volumes of continuously arriving data. However, in both Apache Flink and Apache Beam, the notion of time-ordering is baked in at a very low level, making it difficult to express computations that are interested in a semantic-, rather than time-ordering of the data. In financial services, what often matters the most about the data moving between systems is not when the data was created, but in what order, to the extent that many institutions engineer a global sequencing over all data entering and produced by their systems to achieve complete determinism. How, then, can financial institutions and others best employ Stream Processing on streams of data that are already ordered? I will cover various techniques that can make this work, as well as seek input from the community on how Flink might be improved to better support these use-cases.
by
Patrick Lucas
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion to Iceberg tables can suffer from two problems: (1) the small files problem, which can hurt read performance, and (2) poor data clustering, which can make file pruning less effective. To address those two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partition. This can reduce the number of concurrent files that every task writes. It can also improve data clustering. In this talk, we will explain the motivations in detail and dive into the design of the shuffling stage. We will also share the evaluation results that demonstrate the effectiveness of smart shuffling.
by
Gang Ye & Steven Wu
Batch Processing at Scale with Flink & IcebergFlink Forward
Flink Forward San Francisco 2022.
Goldman Sachs's Data Lake platform serves as the firm's centralized data platform, ingesting 140K (and growing!) batches per day of Datasets of varying shape and size. Powered by Flink and using metadata configured by platform users, ingestion applications are generated dynamically at runtime to extract, transform, and load data into centralized storage where it is then exported to warehousing solutions such as Sybase IQ, Snowflake, and Amazon Redshift. Data Latency is one of many key considerations as producers and consumers have their own commitments to satisfy. Consumers range from people/systems issuing queries, to applications using engines like Spark, Hive, and Presto to transform data into refined Datasets. Apache Iceberg allows our applications to not only benefit from consistency guarantees important when running on eventually consistent storage like S3, but also allows us the opportunity to improve our batch processing patterns with its scalability-focused features.
by
Andreas Hailu
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features available on those devices, but many of the features provide convenience and capability at the expense of security. This best practices guide outlines steps the users can take to better protect personal devices and information.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdfTechgropse Pvt.Ltd.
In this blog post, we'll delve into the intersection of AI and app development in Saudi Arabia, focusing on the food delivery sector. We'll explore how AI is revolutionizing the way Saudi consumers order food, how restaurants manage their operations, and how delivery partners navigate the bustling streets of cities like Riyadh, Jeddah, and Dammam. Through real-world case studies, we'll showcase how leading Saudi food delivery apps are leveraging AI to redefine convenience, personalization, and efficiency.
CAKE: Sharing Slices of Confidential Data on BlockchainClaudio Di Ciccio
Presented at the CAiSE 2024 Forum, Intelligent Information Systems, June 6th, Limassol, Cyprus.
Synopsis: Cooperative information systems typically involve various entities in a collaborative process within a distributed environment. Blockchain technology offers a mechanism for automating such processes, even when only partial trust exists among participants. The data stored on the blockchain is replicated across all nodes in the network, ensuring accessibility to all participants. While this aspect facilitates traceability, integrity, and persistence, it poses challenges for adopting public blockchains in enterprise settings due to confidentiality issues. In this paper, we present a software tool named Control Access via Key Encryption (CAKE), designed to ensure data confidentiality in scenarios involving public blockchains. After outlining its core components and functionalities, we showcase the application of CAKE in the context of a real-world cyber-security project within the logistics domain.
Paper: https://doi.org/10.1007/978-3-031-61000-4_16
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Speck&Tech
ABSTRACT: At first glance, a Lego brick and the XZ backdoor might have in common the fact that both are building blocks, or dependencies of creative projects and software. In reality, a Lego brick and the XZ backdoor case have much more than that in common.
Join the presentation to dive into a story of interoperability, standards, and open formats, and then discuss the important role that contributors play in a sustainable open source community.
BIO: An advocate of free software and of standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she was involved in several events, migrations, and training activities related to LibreOffice. She previously worked on LibreOffice migrations and training courses for several public administrations and private organizations. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager, and when she is not pursuing her passion for computers and Geeko, she cultivates her curiosity about astronomy (the source of her nickname, deneb_alpha).
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers.
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Things to Consider When Choosing a Website Developer for your Website | FODUUFODUU
Choosing the right website developer is crucial for your business. This article covers essential factors to consider, including experience, portfolio, technical skills, communication, pricing, reputation & reviews, cost and budget considerations and post-launch support. Make an informed decision to ensure your website meets your business goals.
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is a widely used ETL tool for processing, indexing, and ingesting data into the serving stack for search. Milvus is a production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to the Milvus vector database for search serving.
Streaming:
Time in the data stream must be quasi-monotonic to produce time progress (watermarks)
Always have close-to-latest incremental results
Resource requirements change over time
Recovery must catch up very fast
Batch:
Order of time in the data does not matter (parallel unordered reads)
Bulk operations (2-phase hash/sort)
Longer time for recovery (no low-latency SLA)
Resource requirements change fast throughout the execution of a single job
Understanding this difference will help later, when we discuss scheduling changes.
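To make the watermark point concrete, here is a minimal sketch in plain Java of how time progress might be derived from a stream whose timestamps are only quasi-monotonic. It assumes a fixed out-of-orderness bound; the class and method names (`WatermarkSketch`, `BoundedOutOfOrdernessTracker`, `watermarksFor`) are hypothetical and are not Flink APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration in plain Java; none of these names are Flink APIs.
public class WatermarkSketch {

    // Tracks the highest event timestamp seen so far and derives a watermark
    // that trails it by a fixed out-of-orderness bound (an assumed strategy).
    static final class BoundedOutOfOrdernessTracker {
        private final long maxOutOfOrdernessMs;
        private long maxTimestampMs = Long.MIN_VALUE;

        BoundedOutOfOrdernessTracker(long maxOutOfOrdernessMs) {
            this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        }

        void onEvent(long timestampMs) {
            maxTimestampMs = Math.max(maxTimestampMs, timestampMs);
        }

        long currentWatermark() {
            // No events yet means no time progress yet.
            if (maxTimestampMs == Long.MIN_VALUE) {
                return Long.MIN_VALUE;
            }
            return maxTimestampMs - maxOutOfOrdernessMs;
        }
    }

    // Returns the watermark observed after each event, in arrival order.
    static List<Long> watermarksFor(long[] timestampsMs, long maxOutOfOrdernessMs) {
        BoundedOutOfOrdernessTracker tracker =
                new BoundedOutOfOrdernessTracker(maxOutOfOrdernessMs);
        List<Long> watermarks = new ArrayList<>();
        for (long ts : timestampsMs) {
            tracker.onEvent(ts);
            watermarks.add(tracker.currentWatermark());
        }
        return watermarks;
    }

    public static void main(String[] args) {
        // The third event arrives late (1200 after 1500); the watermark does not move back.
        System.out.println(watermarksFor(new long[] {1000, 1500, 1200, 3000}, 500));
    }
}
```

The late event leaves the watermark unchanged, which is why the stream only has to be roughly ordered in time. A batch job has no such requirement because it can read its input in any order.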
Different requirements
Optimization potential for batch and streaming
Also: historic developments and slow-changing organizations
You have to decide between DataSet and DataStream when writing a job
Two (slightly) different APIs, with different capabilities
Different set of supported connectors: no Kafka DataSet connector, no HBase DataStream connector
Different performance characteristics
Different fault-tolerance behavior
Different scheduling logic
With Table API, you only have to learn one API
Still, the set of supported connectors depends on the underlying execution API
Feature set depends on whether there is an implementation for your underlying API
You cannot combine more batch-y with more stream-y sources/sinks
A “soft problem”: with two stacks of everything, less developer effort goes into each individual stack, which means fewer features, worse performance, and more bugs that are fixed more slowly
Recall the earlier processing-styles slide:
batch wants step by step
streaming is all at once
This has been mentioned a lot.
Lyft gave a talk about this at the last Flink Forward
* FLINK-10886: Event-time alignment for sources; Jamie Grier (Lyft) contributed the first parts of this
Batch:
random reads
Coordinated by JM
Streaming:
sequential read
No coordination between sources
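The contrast between the two reading styles can be illustrated with a toy sketch. It does not use any Flink API; `SplitCoordinator`, `run_batch_readers`, and `static_assignment` are hypothetical names, and the round-robin loop only stands in for readers asking for work whenever they are idle.

```python
from collections import deque


class SplitCoordinator:
    """Hypothetical stand-in for the JM: hands out input splits on request."""

    def __init__(self, splits):
        self._pending = deque(splits)

    def request_split(self):
        # Return the next unassigned split, or None when all work is handed out.
        return self._pending.popleft() if self._pending else None


def run_batch_readers(coordinator, reader_names):
    """Batch style: readers pull splits lazily from a central coordinator."""
    assigned = {name: [] for name in reader_names}
    active = list(reader_names)
    while active:
        for name in list(active):
            split = coordinator.request_split()
            if split is None:
                active.remove(name)  # nothing left, this reader finishes
            else:
                assigned[name].append(split)
    return assigned


def static_assignment(partitions, reader_names):
    """Streaming style: each reader owns a fixed set of partitions up front
    and reads them sequentially, without talking to the other readers."""
    n = len(reader_names)
    return {name: partitions[i::n] for i, name in enumerate(reader_names)}


batch = run_batch_readers(SplitCoordinator(["f0", "f1", "f2", "f3", "f4"]), ["r0", "r1"])
streaming = static_assignment(["p0", "p1", "p2"], ["r0", "r1"])
print(batch)
print(streaming)
```

In the batch case a slow reader simply requests fewer splits, so work balances itself; in the streaming case the assignment is fixed and each source advances on its own.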
This must support both batch and streaming use cases, allow Flink to be clever, be able to deal with event-time, watermarks, source idiosyncrasies, and enable snapshotting
This should enable new features: generic idleness detection, event-time alignment*
* FLINK-10886: Event-time alignment for sources; Jamie Grier (Lyft) contributed the first parts of this
Talk about how this will enable event-time alignment for sources in a generic way
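As a rough illustration of what event-time alignment means, the sketch below pauses any source whose event time has run too far ahead of the slowest one. It is not the FLINK-10886 implementation; `SourceState`, `sources_to_pause`, and `max_drift` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class SourceState:
    name: str
    watermark: int  # event-time progress of this source, e.g. epoch millis


def sources_to_pause(sources, max_drift):
    """Return the names of sources whose watermark is more than max_drift
    ahead of the slowest source (hypothetical alignment policy)."""
    if not sources:
        return []
    slowest = min(s.watermark for s in sources)
    return [s.name for s in sources if s.watermark - slowest > max_drift]


sources = [
    SourceState("partition-0", watermark=1_000),
    SourceState("partition-1", watermark=1_200),
    SourceState("partition-2", watermark=9_000),
]
paused = sources_to_pause(sources, max_drift=5_000)
print(paused)
```

Pausing the fast source until the others catch up keeps downstream operators from buffering large amounts of data that cannot yet be emitted.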
Mention here that you can basically build your Job Jar that includes flink-runtime, and execute that any way you want: Put it in docker, Spring boot, just start multiple of these.
As-a-library mode
Note that this nicely jibes with the pull-based model. Enables the things we need for batch.
Mention the dog with the hose. Sources just keep spitting out records as fast as they can.
Possibly put these on separate slides, with fewer words. Or even some graphics.
There are some quirks when you use DataStream for batch
A groupReduce would be a window operation with a GlobalWindow
MapPartition would have to finalize things in close()
Joins would have to specify a global window
Of course, state requirements are bad for the naïve approach, i.e. large state, inefficient access patterns
Joins and grouping can be a lot faster with specific algorithms
Hash Join, Merge join, etc…
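To make the point about specialized algorithms concrete, here is a minimal sketch of a classic build/probe hash join over bounded inputs. It is plain Python, not a Flink operator; `hash_join` and the sample data are made up for illustration. Only the (ideally smaller) build side is held in memory, whereas a naive global-window join would have to keep both inputs in state.

```python
from collections import defaultdict


def hash_join(build_side, probe_side, build_key, probe_key):
    """Build a hash table on one bounded input, then stream the other past it."""
    table = defaultdict(list)
    for row in build_side:  # build phase: needs the complete build input
        table[build_key(row)].append(row)
    for row in probe_side:  # probe phase: one lookup per probe row
        for match in table.get(probe_key(row), ()):
            yield (match, row)


users = [(1, "ann"), (2, "bob")]                    # (user_id, name)
orders = [(100, 1), (101, 2), (102, 1), (103, 3)]   # (order_id, user_id)
joined = list(hash_join(users, orders, lambda u: u[0], lambda o: o[1]))
print(joined)
```

The build phase only works because the build input is bounded, which is exactly the knowledge a batch-aware runtime can exploit.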
For example
different window operator
Different join implementations
The scheduling stuff and networking would be a whole talk on their own. Memory management is another issue.
Pull-based operators are how most databases were/are implemented.
Note how the pull model enables hash join, merge join, …
Side inputs benefit from a pull-based model
Bring the dog-drinking-from-hose example, also for Join operator
This will allow porting batch operators/algorithms to StreamOperator
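A small sketch may help show why the pull model matters for these algorithms. In the merge join below, the operator itself decides which input to read next, which a push-based operator cannot do. This is a toy in plain Python, not a StreamOperator; `merge_join` is a hypothetical name and both inputs are assumed to be already sorted by key.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right, key=itemgetter(0)):
    """Join two key-sorted inputs by pulling from whichever side is behind."""
    left_groups = groupby(left, key)
    right_groups = groupby(right, key)
    l = next(left_groups, None)
    r = next(right_groups, None)
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(left_groups, None)    # left is behind: pull from left only
        elif l[0] > r[0]:
            r = next(right_groups, None)   # right is behind: pull from right only
        else:
            right_rows = list(r[1])        # buffer only the rows of one key
            for left_row in l[1]:
                for right_row in right_rows:
                    yield (left_row, right_row)
            l = next(left_groups, None)
            r = next(right_groups, None)


left = iter([(1, "a"), (2, "b"), (2, "c"), (4, "d")])
right = iter([(2, "x"), (3, "y"), (4, "z"), (4, "w")])
result = list(merge_join(left, right))
print(result)
```

Because the inputs are plain iterators, nothing is read until the join asks for it, so memory use stays bounded by the largest group of equal keys.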