Productionize Spark Structured Streaming for Real-Time Analytics

•

0 j'aime•311 vues

Ivan Kosianenko discusses how AppsFlyer has evolved its data pipeline from using MemSQL and Druid to handle 12 billion events daily to now using ClickHouse and Druid to handle 38 billion events daily. He then explains how building a custom data streaming solution with Clojure can become complex due to issues like distributed orchestration, monitoring, and delivery guarantees. As an alternative, Ivan recommends using Spark Structured Streaming for its built-in functionality addressing these issues like write-ahead logging and exactly-once processing. He demonstrates how AppsFlyer leverages Spark Structured Streaming to ingest streaming data from Kafka into ClickHouse while monitoring streaming queries and end-to-end latency.

Données & analyses

Productionize
Spark Structured Streaming
Ivan Kosianenko

Who am I
Software Architect at AppsFlyer
Data Engineering
Machine Learning
Ivan Kosianenko

Time Series Lifetime Value
1.1.2018
1.1.2017
5.1.2018 5.1.2018
...
2.1.2018
3.1.2018
4.1.2018
5.1.2018
1.1.2018
2.1.2018
3.1.2018
4.1.2018
5.1.2018
- Mutable data Immutable data
VS

December 2017: MemSQL + Druid
Clojure
streamers
Live 1 Day Data
MemSQL
Cluster
Dashboard
Middleware
API
Druid
Daily 4 Year
LTV Table
12 billion events daily
KAFKA

Now: ClickHouse + Druid
Live 1 Day Data
ClickHouse
Cluster
Dashboard
Middleware
API
Druid
Daily 5 Year
LTV Table
38 billion events daily
KAFKA
Structured Streaming

How to DIY Clojure streamer?
● take clj-kafka and JDBC driver
● add multi-threaded execution
● write custom business logic

How to DIY Clojure streamer?
● distributed orchestration
● resource allocation
● retry logic
● monitoring
● delivery guarantee
● ???

Kafka Connect
Apache Storm
Heron Streaming
Onyx

Spark Structured Streaming
source
sink
business logic

Custom Foreach Writer
Add row
Write batch to anywhere
Create batch

Spark Structured Streaming
Write-ahead logging

ClickHouse
+
Exactly-once end-to-end processing

Spark Structured Streaming
Streaming query

Streaming Query
● Explain
● Stop
● Await termination
● Get termination error
● Track progress

Checkpoint dir
Track lag between Spark and Kafka
1
2
3
1. Get last offsets from Spark
2. Get last offsets from Kafka
3. Send difference to statd

End-to-end test monitoring
1
2
3
1. Sends test message to
Kafka
2. Wait while message will
appear in Clickhouse
3. Send latency to statsd

Production config
turn off dynamic allocation
or
set min/max executors

Deploy with custom Spark operator
Airflow
Custom operator tries to figure out:
● Is current Spark job running?
● Is it latest version with latest config?
If no -> submit Spark job to YARN with
yarn.tag = hash(jar_version+run config)
Once in a minute

Join us
Thank you
ivan.kosianenko@appsflyer.com
t.me/appsflyerkyiv

Recommandé

Lambda architectureIvan Kosianenko

GraphQL in Kiwi.comMichal Sänger

Building GraphQL Servers with Node.JS & PrismaNikolas Burk

Unreal Engine 4 Blueprints: Odio e amore Roberto De Ioris - Codemotion Rome 2017Codemotion

GraphQL API on a Serverless EnvironmentItai Yaffe

Selenium camp 2017. Alexander ChumakinAlex Chumakin

From logging to monitoring to reactive insights - C Schneidermfrancis

Apache Airflow overviewNikolayGrishchenkov

Recommandé

Lambda architectureIvan Kosianenko

GraphQL in Kiwi.comMichal Sänger

Building GraphQL Servers with Node.JS & PrismaNikolas Burk

Unreal Engine 4 Blueprints: Odio e amore Roberto De Ioris - Codemotion Rome 2017Codemotion

GraphQL API on a Serverless EnvironmentItai Yaffe

Selenium camp 2017. Alexander ChumakinAlex Chumakin

From logging to monitoring to reactive insights - C Schneidermfrancis

Apache Airflow overviewNikolayGrishchenkov

Implementing GraphQL - Without a BackendShowmax Engineering

Data Provision API with BigQuery - Google Cloud Summit Jakarta 18Imre Nagi

From business requirements to working pipelines with apache airflowDerrick Qin

Serverless microservices in the wildRotem Tamir

GraphConnect 2014 SF: How eBay and Shutl Deliver Even Faster Using Neo4jNeo4j

"Smooth Operator" [Bay Area NewSQL meetup]Kevin Xu

Modern Monitoring and processing logsDmitry Lavrinenko

Pracital application logging and monitoringLaurynas Tretjakovas

Athena 0.2.0 - NimbleNimble

Fall in Love with Graphs and Metrics using Grafanatorkelo

Handle insane devices traffic using Google Cloud Platform - Andrea Ulisse - C...Codemotion

OpenStack MagnetoDB. Atlanta Summit 2014Ilya Sviridov

Quick prototyping with VulcanJSMelek Hakim

Spring Cloud KubernetesMauricio (Salaboy) Salatino

Apache Airflow at DailymotionGermain Tanguy

RxJS streams handling for PadawanSeven Peaks Speaks

10 EZ Steps to SOLR Domination - Berlin Buzzwords 2012Alex Pinkin

Leonard Austin (Ravelin) - DevOps in a Machine Learning WorldOutlyer

Airflow presentationAnant Corporation

Apache AirflowKnoldus Inc.

Azure Data Factory v2inovex GmbH

いそがしいひとのための Microsoft Ignite 2018 最新情報 Data 編Miho Yamamoto

Contenu connexe

Tendances

Implementing GraphQL - Without a BackendShowmax Engineering

Data Provision API with BigQuery - Google Cloud Summit Jakarta 18Imre Nagi

From business requirements to working pipelines with apache airflowDerrick Qin

Serverless microservices in the wildRotem Tamir

GraphConnect 2014 SF: How eBay and Shutl Deliver Even Faster Using Neo4jNeo4j

"Smooth Operator" [Bay Area NewSQL meetup]Kevin Xu

Modern Monitoring and processing logsDmitry Lavrinenko

Pracital application logging and monitoringLaurynas Tretjakovas

Athena 0.2.0 - NimbleNimble

Fall in Love with Graphs and Metrics using Grafanatorkelo

Handle insane devices traffic using Google Cloud Platform - Andrea Ulisse - C...Codemotion

OpenStack MagnetoDB. Atlanta Summit 2014Ilya Sviridov

Quick prototyping with VulcanJSMelek Hakim

Spring Cloud KubernetesMauricio (Salaboy) Salatino

Apache Airflow at DailymotionGermain Tanguy

RxJS streams handling for PadawanSeven Peaks Speaks

10 EZ Steps to SOLR Domination - Berlin Buzzwords 2012Alex Pinkin

Leonard Austin (Ravelin) - DevOps in a Machine Learning WorldOutlyer

Airflow presentationAnant Corporation

Apache AirflowKnoldus Inc.

Tendances (20)

Implementing GraphQL - Without a Backend

Data Provision API with BigQuery - Google Cloud Summit Jakarta 18

From business requirements to working pipelines with apache airflow

Serverless microservices in the wild

GraphConnect 2014 SF: How eBay and Shutl Deliver Even Faster Using Neo4j

"Smooth Operator" [Bay Area NewSQL meetup]

Modern Monitoring and processing logs

Pracital application logging and monitoring

Athena 0.2.0 - Nimble

Fall in Love with Graphs and Metrics using Grafana

Handle insane devices traffic using Google Cloud Platform - Andrea Ulisse - C...

OpenStack MagnetoDB. Atlanta Summit 2014

Quick prototyping with VulcanJS

Spring Cloud Kubernetes

Apache Airflow at Dailymotion

RxJS streams handling for Padawan

10 EZ Steps to SOLR Domination - Berlin Buzzwords 2012

Leonard Austin (Ravelin) - DevOps in a Machine Learning World

Airflow presentation

Apache Airflow

Similaire à Productionize Spark Structured Streaming for Real-Time Analytics

Azure Data Factory v2inovex GmbH

いそがしいひとのための Microsoft Ignite 2018 最新情報 Data 編Miho Yamamoto

Airbyte - Seed deckAirbyte

Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...DataWorks Summit

A Trifecta of Real-Time Applications: Apache Kafka, Flink, and DruidHostedbyConfluent

Airbyte - Seed deckAirbyte

Chris D'Agostino | Kafka Summit 2018 Keynote (Building an Enterprise Streamin...confluent

Pivoting Spring XD to Spring Cloud Data Flow with Sabby AnandanPivotalOpenSourceHub

Google Cloud Dataflow Two Worlds Become a Much Better OneDataWorks Summit

Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Demi Ben-Ari

Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...Codemotion

SpringOne 2016 in a nutshellJeroen Resoort

Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Demi Ben-Ari

Unleashing the Potential of GraphQL with Streaming Data - Kishore Banala, Net...Nordic APIs

Why and How SmartNews uses SaaS?Takumi Sakamoto

Data-Driven Software Engineering for Agile TeamsTechWell

Neo4j: The path to success with Graph Database and Graph Data ScienceNeo4j

GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...James Anderson

Spring Boot & Spring Cloud Apps on Pivotal Application Service - Daniel LavoieVMware Tanzu

Neo4j Database and Graph Platform OverviewNeo4j

Similaire à Productionize Spark Structured Streaming for Real-Time Analytics (20)

Azure Data Factory v2

いそがしいひとのための Microsoft Ignite 2018 最新情報 Data 編

Airbyte - Seed deck

Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...

A Trifecta of Real-Time Applications: Apache Kafka, Flink, and Druid

Airbyte - Seed deck

Chris D'Agostino | Kafka Summit 2018 Keynote (Building an Enterprise Streamin...

Pivoting Spring XD to Spring Cloud Data Flow with Sabby Anandan

Google Cloud Dataflow Two Worlds Become a Much Better One

Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...

Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...

SpringOne 2016 in a nutshell

Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017

Unleashing the Potential of GraphQL with Streaming Data - Kishore Banala, Net...

Why and How SmartNews uses SaaS?

Data-Driven Software Engineering for Agile Teams

Neo4j: The path to success with Graph Database and Graph Data Science

GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...

Spring Boot & Spring Cloud Apps on Pivotal Application Service - Daniel Lavoie

Neo4j Database and Graph Platform Overview

Dernier

Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort

Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics

LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter

科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss

专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss

Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen

Vision, Mission, Goals and Objectives ppt..pptxellehsormae

Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2

Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter

Learn How Data Science Changes Our WorldEduminds Learning

办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss

Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics

detection and classification of knee osteoarthritis.pptxAleenaJamil4

Multiple time frame trading analysis -brianshannon.pdfchwongval

RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort

Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7

Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics

DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett

Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann

modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx

Dernier (20)

Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)

Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...

LLMs, LMMs, their Improvement Suggestions and the Path towards AGI

科罗拉多大学波尔得分校毕业证学位证成绩单-可办理

专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改

Data Factory in Microsoft Fabric (MsBIP #82)

Vision, Mission, Goals and Objectives ppt..pptx

Identifying Appropriate Test Statistics Involving Population Mean

Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...

Learn How Data Science Changes Our World

办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree

Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...

detection and classification of knee osteoarthritis.pptx

Multiple time frame trading analysis -brianshannon.pdf

RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi

Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...

Heart Disease Classification Report: A Data Analysis Project

DBA Basics: Getting Started with Performance Tuning.pdf

Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines

modul pembelajaran robotic Workshop _ by Slidesgo.pptx

Productionize Spark Structured Streaming for Real-Time Analytics

1. Productionize Spark Structured Streaming Ivan Kosianenko

2. Who am I Software Architect at AppsFlyer Data Engineering Machine Learning Ivan Kosianenko

3. AppsFlyer in a Nutshell

4. DEMO DATA

5. Time Series Lifetime Value 1.1.2018 1.1.2017 5.1.2018 5.1.2018 ... 2.1.2018 3.1.2018 4.1.2018 5.1.2018 1.1.2018 2.1.2018 3.1.2018 4.1.2018 5.1.2018 - Mutable data Immutable data VS

6. December 2017: MemSQL + Druid Clojure streamers Live 1 Day Data MemSQL Cluster Dashboard Middleware API Druid Daily 4 Year LTV Table 12 billion events daily KAFKA

7. Now: ClickHouse + Druid Live 1 Day Data ClickHouse Cluster Dashboard Middleware API Druid Daily 5 Year LTV Table 38 billion events daily KAFKA Structured Streaming

8. How to DIY Clojure streamer? ● take clj-kafka and JDBC driver ● add multi-threaded execution ● write custom business logic

9. How to DIY Clojure streamer? ● distributed orchestration ● resource allocation ● retry logic ● monitoring ● delivery guarantee ● ???

10. The Law of Leaky Abstractions

11. Kafka Connect Apache Storm Heron Streaming Onyx

12. Spark Structured Streaming source sink business logic

13. Custom Foreach Writer Add row Write batch to anywhere Create batch

14. Be careful! For each row

15. Spark Structured Streaming Write-ahead logging

16. ClickHouse + Exactly-once end-to-end processing

17. Spark Structured Streaming Streaming query

18. Streaming Query ● Explain ● Stop ● Await termination ● Get termination error ● Track progress

19. Last progress

20. Checkpoint dir Track lag between Spark and Kafka 1 2 3 1. Get last offsets from Spark 2. Get last offsets from Kafka 3. Send difference to statd

21. Dashboard and alerts

22. End-to-end test monitoring 1 2 3 1. Sends test message to Kafka 2. Wait while message will appear in Clickhouse 3. Send latency to statsd

23. Dashboard and alerts

24. Production config turn off dynamic allocation or set min/max executors

25. Deploy with custom Spark operator Airflow Custom operator tries to figure out: ● Is current Spark job running? ● Is it latest version with latest config? If no -> submit Spark job to YARN with yarn.tag = hash(jar_version+run config) Once in a minute

26.

27. Join us Thank you ivan.kosianenko@appsflyer.com t.me/appsflyerkyiv