Managing Flink operations at GO-JEK spans resource provisioning, isolation, data quality, monitoring, and failovers across multiple clusters and data centers. GO-JEK uses Flink for use cases such as surge pricing, API health monitoring, and fraud detection. The team built tools to automate cluster provisioning (cutting provisioning time by 90%), isolate resources for security and performance, manage data quality through schema changes, monitor multiple clusters, and enable Kafka input-stream and Flink cluster failovers for resiliency. Chaos-engineering experiments also test the systems through disaster simulations and load testing.
2. ONE APP FOR ALL YOUR NEEDS
Established in 2010 as a phone-based motorcycle ride-hailing service, GO-JEK has evolved into an on-demand provider of transport and other lifestyle services.
3. GO-JEK?
4. As data engineers, we look for patterns in data usage and transformation, and build tools, infrastructure, frameworks, and services.
SCALE
AUTOMATION
PRODUCT MINDSET
5. At GO-JEK, location is built into the fabric of all our products:
2B+ GPS points per day
25M+ booking events per day
3B+ API log events per day
6. Flink use cases at GO-JEK
Surge pricing
API health monitoring
Driver allocation monitoring
Fraud detection, and more …
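Use cases like API health monitoring reduce to keyed, windowed aggregations over event streams. The sketch below is a minimal, Flink-free Python illustration of the tumbling-window error-rate computation such a job would perform; the `ApiLogEvent` shape, the 60-second window, and the 5xx threshold are assumptions, not GO-JEK's actual pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ApiLogEvent:
    timestamp: float   # epoch seconds (assumed event-time field)
    endpoint: str
    status: int        # HTTP status code

def error_rates(events, window_secs=60):
    """Tumbling-window error rate per endpoint, mimicking what a
    Flink keyBy(endpoint) + tumbling-window job would compute."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for e in events:
        key = (e.endpoint, int(e.timestamp // window_secs))
        totals[key] += 1
        if e.status >= 500:   # count server errors only
            errors[key] += 1
    return {k: errors[k] / totals[k] for k in totals}
```

In a real Flink job the same logic would run continuously with event-time watermarks; the batch form here only shows the aggregation shape.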
7. Resource Provisioning
How do we create new clusters as requests increase from more than 15 internal teams?
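The 90% reduction in provisioning time mentioned above comes from automating cluster creation. A toy sketch of the idea: turn a team's request into a declarative cluster spec that tooling can apply, instead of hand-building each cluster. Every field name here is illustrative; GO-JEK's actual spec format is not public.

```python
def render_cluster_spec(team, taskmanagers=2, slots_per_tm=4, memory_mb=4096):
    """Render a declarative Flink cluster spec from a team's request.
    All field names are hypothetical, not GO-JEK's real schema."""
    return {
        "cluster_name": f"flink-{team}",
        "jobmanager": {"count": 1, "memory_mb": memory_mb},
        "taskmanager": {"count": taskmanagers,
                        "slots": slots_per_tm,
                        "memory_mb": memory_mb},
        # default parallelism = total available task slots
        "parallelism_default": taskmanagers * slots_per_tm,
    }
```

With a generator like this, provisioning a cluster for a new team becomes one request plus one automated apply step.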
12. Resource Isolation
How do we isolate resources for security, resilience, and segregation across more than 15 internal teams?
13. Why?
Critical Kafka topics handicapped by low-priority scripts
Non-critical jobs starve critical jobs of resources
Extensive downtime due to issues in Kafka
Human errors
15. Isolation is driven by the nature, timing, and transactional criticality of the data, as well as its sensitivity and volume. Security concerns, team segregation, job loads, and criticality all factor in, at the cost of handling large-volume data replication and maintenance. Isolation applies along two axes:
Kafka input streams
Flink clusters
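One way to make those isolation axes concrete is a placement rule that routes each job to a Kafka tier and a Flink cluster based on its criticality, sensitivity, and volume, so critical topics can never be handicapped by low-priority scripts. The tier names below are invented for illustration.

```python
def place_job(criticality, sensitive, high_volume):
    """Pick a (kafka_tier, flink_cluster) pair for a job.
    Tier names are hypothetical, not GO-JEK's real topology."""
    if criticality == "critical":
        kafka = "kafka-critical"
        # split critical jobs further by expected volume
        flink = "flink-critical-hv" if high_volume else "flink-critical-lv"
    else:
        kafka = "kafka-shared"
        flink = "flink-shared"
    if sensitive:
        flink += "-restricted"   # segregated for security/team access control
    return kafka, flink
```

The point of encoding the rule is that placement stops depending on humans remembering it, which also addresses the "human errors" bullet above.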
19. Data Quality
How do we manage the quality of data for aggregation when more than 15 teams are responsible for producing it?
20. Challenges
JSON is weakly typed
Errors during schema updates
Bad deployment sequencing of an incompatible schema change
No data for real-time actors
21. Measures
Protobuf: strongly typed
Maintains schema versions
Version locking per job
Allows re-running a job with a different schema version
Upstream actors define an `inadequate data points` behavior
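The version-locking measure can be sketched as a toy registry: each job pins one schema version, so an incompatible upstream change cannot break a running job, and fields missing from a message surface explicitly rather than failing silently. This is an illustrative model, not GO-JEK's actual schema infrastructure (which the slides say is Protobuf-based).

```python
class SchemaRegistry:
    """Toy version-locked schema registry."""
    def __init__(self):
        self._versions = {}   # (topic, version) -> set of field names
        self._locks = {}      # job -> (topic, version)

    def register(self, topic, version, fields):
        self._versions[(topic, version)] = set(fields)

    def lock(self, job, topic, version):
        """Pin a job to one schema version; re-locking to an older
        version lets the job be re-run against a different schema."""
        if (topic, version) not in self._versions:
            raise KeyError(f"unknown schema {topic} v{version}")
        self._locks[job] = (topic, version)

    def decode(self, job, message):
        """Project a message onto the job's locked schema; missing
        fields come back as None, modeling the 'inadequate data
        points' behavior the slide mentions."""
        topic, version = self._locks[job]
        return {f: message.get(f) for f in self._versions[(topic, version)]}
```

Real Protobuf gives this for free at the wire level; the sketch only shows why pinning a version per job decouples producers from consumers during schema changes.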
22. Monitoring
How do we manage monitoring for multiple clusters across data centers?
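Flink's REST API exposes a `/jobs/overview` endpoint listing each job and its state, so cross-cluster monitoring can poll every cluster and fold the responses into one health view. The aggregation step is sketched below over already-fetched payloads; how GO-JEK actually collects and alerts on these metrics is not shown in the slides.

```python
def summarize(cluster_payloads):
    """Fold per-cluster Flink /jobs/overview payloads (cluster name ->
    parsed JSON) into one health view across data centers."""
    summary = {}
    for cluster, payload in cluster_payloads.items():
        states = [j["state"] for j in payload.get("jobs", [])]
        summary[cluster] = {
            "running": states.count("RUNNING"),
            "failed": states.count("FAILED"),
            # healthy only if every job on the cluster is RUNNING
            "healthy": bool(states) and all(s == "RUNNING" for s in states),
        }
    return summary
```

A dashboard or alerting rule can then key off `failed > 0` or `healthy == False` per cluster.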
27. Failovers
How do we manage Kafka input-stream failover and Flink cluster failover for multiple clusters?
28. Why failover?
Kafka cluster down
Flink clusters down or in a bad state
Kafka or Flink cluster migration
Job migration for teams
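The cases above imply that the Kafka input stream and the Flink cluster must be able to fail over independently. A minimal selection sketch, assuming a primary/secondary pair per data center (the endpoint names are invented):

```python
def choose_endpoints(kafka_primary_up, flink_primary_up,
                     kafka=("kafka-dc1:9092", "kafka-dc2:9092"),
                     flink=("flink-dc1", "flink-dc2")):
    """Pick which Kafka bootstrap and Flink cluster a job should use,
    failing each one over independently based on health checks."""
    return (kafka[0] if kafka_primary_up else kafka[1],
            flink[0] if flink_primary_up else flink[1])
```

In practice the health inputs would come from monitoring, and restarting a job against the mirror also requires replicated topics and compatible savepoints, which this sketch does not model.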
38. Simulation
Simulate throughput in Flink jobs with Kafka ingestion
Run chaos simulations on Flink or Kafka clusters
Generate reports
Prepare playbooks
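The report-generation step of those simulations can be sketched as: sample observed throughput once per simulated second and compare the average against the target within a tolerance. The function names, tolerance, and report fields are assumptions for illustration.

```python
def load_test_report(target_eps, measure, duration_secs=10, tolerance=0.1):
    """Call measure(second) once per simulated second and report
    whether average throughput (events/sec) stayed within
    `tolerance` of the target during the chaos run."""
    samples = [measure(s) for s in range(duration_secs)]
    avg = sum(samples) / len(samples)
    return {
        "target_eps": target_eps,
        "avg_eps": avg,
        "passed": abs(avg - target_eps) / target_eps <= tolerance,
    }
```

In a real setup `measure` would read consumer-lag or Flink throughput metrics while faults are injected; a failed report feeds directly into the playbook step above.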