Samza tech talk_2015 - strata

Yi Pan

This is a quick overview of LinkedIn's stream processing tech stack.

Stream Processing @Scale in LinkedIn
Yi Pan
Data Infrastructure
Samza Team @LinkedIn
Databus

Overview
• What is Stream Processing?
• What is Samza?
• Stream Processing @LinkedIn
• Upcoming features

Stream Processing
• What is stream processing?
– Input: an unbounded sequence of events
• E.g. web server logs, user activity tracking events, database changelogs, etc.
– Latency: near real-time
• From milliseconds to minutes, instead of hours to days
– Output: an unbounded sequence of changes to the derived dataset
• The derived dataset is usually the final or partial analytic result, which can live either in another stream or in a serving data store
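
The input/output definition above can be sketched in a few lines of Python. This is only an illustration, not Samza code: the function name `page_view_counts` and the `page` field are assumptions made for the example. It consumes events one at a time and emits a change to the derived dataset (a running count per page) for every event.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def page_view_counts(events: Iterable[dict]) -> Iterator[Tuple[str, int]]:
    """Consume events one by one and yield (page, new_count) changes."""
    counts = Counter()  # the derived dataset: views per page
    for event in events:
        page = event["page"]
        counts[page] += 1
        # Each input event produces one change to the derived dataset.
        yield page, counts[page]


# A finite list stands in for the unbounded stream.
sample_events = [{"page": "/home"}, {"page": "/jobs"}, {"page": "/home"}]
changes = list(page_view_counts(sample_events))
```

Because the function is a generator, it never needs the whole input at once, which is what makes the same logic applicable to an unbounded sequence.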

Stream Processing
[Figure: response latency axis, from "Synchronous" (0 ms) through "Milliseconds to minutes" to "Later. Possibly much later."]

Stream Processing
• What are the application requirements?
– Scalable, fast, stateful stream processing
– What scale should we operate at?
• Traffic volume: 1.4 trillion events/day
• Intermediate state size: multi-TB / colo (*)
– Why is it expensive to run stream processing at scale?
• Intermediate data sets need to be stored to allow low-latency processing
• Large volumes of data need to be pulled and pushed over the network
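
To put the traffic volume in perspective, a quick back-of-the-envelope conversion to an average per-second rate. It assumes traffic is spread evenly over the day, which real traffic is not, so peak rates would be higher.

```python
EVENTS_PER_DAY = 1.4e12          # traffic volume quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

# Average rate under the (unrealistic) assumption of perfectly even traffic.
avg_events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY

# Roughly 16 million events per second on average.
print(f"{avg_events_per_second:,.0f} events/second on average")
```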

What’s Samza
• Samza is a distributed Turing machine
– A single-task Samza job is a stateful Turing machine
[Diagram: Samza Task with input stream, output stream, state, changelog, and checkpoint]
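
A minimal sketch of the single-task picture, in plain Python rather than the Samza API. The class `CountingTask` and its fields are hypothetical names chosen for illustration. Lists stand in for the changelog and output streams, and the checkpoint is updated on every message here, whereas a real job would typically checkpoint periodically.

```python
class CountingTask:
    """Toy stateful task: counts messages per key."""

    def __init__(self):
        self.state = {}       # local state store
        self.changelog = []   # stand-in for the changelog stream
        self.output = []      # stand-in for the output stream
        self.checkpoint = -1  # offset of the last processed input message

    def process(self, offset, key):
        count = self.state.get(key, 0) + 1
        self.state[key] = count
        self.changelog.append((key, count))  # record the state change
        self.output.append((key, count))     # emit the updated count
        self.checkpoint = offset             # remember the input position


task = CountingTask()
for offset, key in enumerate(["a", "b", "a"]):
    task.process(offset, key)
```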

What’s Samza
– Scaling a Samza job: partition the streams
[Diagram: input stream A split into partition 0, partition 1, partition 2, partition 3, ..., partition n]

What’s Samza
– Scaling a Samza job: partition the streams
[Diagram: input stream B split into partition 0, partition 1, partition 2, partition 3, ..., partition n]
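
One way to picture partitioning two input streams is key-based hashing, sketched below. This is an assumption-laden illustration: `partition_for` and `partition_stream` are hypothetical helpers, and CRC32 is only a stand-in for whatever partitioner the messaging system actually uses. The point is that when both streams use the same function and partition count, all messages for a key land in the same partition number of each stream, so one task can handle both.

```python
import zlib

NUM_PARTITIONS = 4


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_stream(messages, num_partitions=NUM_PARTITIONS):
    """Split (key, value) messages into per-partition lists, keeping order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


stream_a = [("member-1", "view"), ("member-2", "click"), ("member-1", "click")]
stream_b = [("member-1", "profile-update"), ("member-2", "profile-update")]

parts_a = partition_stream(stream_a)
parts_b = partition_stream(stream_b)
```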

What’s Samza
– Scaling a Samza job: replicating the state machine
[Diagram: job with shared checkpoint]

What’s Samza
• Samza Execution in Yarn
[Diagram: deploying a Samza job across Host 1, Host 2, and Host 3, showing the Application Master and three Samza containers]
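
As a rough illustration of how a job's tasks could be spread over containers, the sketch below assigns task IDs round-robin. The function `assign_tasks` and the container names are hypothetical, and round-robin is an assumption made for the example, not necessarily the grouping policy Samza uses.

```python
def assign_tasks(num_tasks, containers):
    """Distribute task IDs over containers in round-robin order."""
    assignment = {container: [] for container in containers}
    for task_id in range(num_tasks):
        container = containers[task_id % len(containers)]
        assignment[container].append(task_id)
    return assignment


assignment = assign_tasks(5, ["container-0", "container-1", "container-2"])
```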

What’s Samza
• States in Samza
– Checkpoints
• Offsets per input stream partition
– State Stores
• In-memory or on-disk (RocksDB) derived data set
[Diagram: Samza Task on Host 1 with output stream partitions, state, changelog partitions, and checkpoint]

What’s Samza
• States in Samza
– Checkpoints and local state stores are backed by distributed logs
[Diagram: Samza Task on Host 1 with output stream partitions, state, changelog partitions, and checkpoint]
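
Backing state and checkpoints with logs is what makes recovery possible: a restarted task can rebuild its store by replaying the changelog, then resume input from the checkpointed offset. The sketch below uses hypothetical helpers (`restore_state`, `resume`) and plain lists as logs. It also assumes the checkpoint and changelog are perfectly aligned at the moment of failure. In practice the checkpoint may lag, so some messages can be reprocessed.

```python
def restore_state(changelog):
    """Rebuild the local store by replaying (key, value) changelog entries."""
    state = {}
    for key, value in changelog:
        state[key] = value  # later entries overwrite earlier ones
    return state


def resume(input_partition, checkpoint, state):
    """Continue counting from the first offset after the checkpoint."""
    for offset in range(checkpoint + 1, len(input_partition)):
        key = input_partition[offset]
        state[key] = state.get(key, 0) + 1
    return state


input_partition = ["a", "b", "a", "a"]
changelog = [("a", 1), ("b", 1), ("a", 2)]  # written before the failure
checkpoint = 2                               # last processed input offset

state = resume(input_partition, checkpoint, restore_state(changelog))
```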

Stream Processing @ LinkedIn
[Diagram: WebServers and Monitor Servers, Oracle, Espresso, Kafka (tracking events, metrics), Databus (changelog), Samza Jobs (bootstrap), Voldemort, Derived Data]

Stream Processing @ LinkedIn
• Tracking aggregate/analysis (ACG)

Stream Processing @ LinkedIn
• Kafka Deployment
– 1.1 Trillion messages / day
• Databus Deployment
– 300 Billion messages / day
• Samza Deployment
– multiple colos
– 10+ Yarn clusters
– 200+ nodes
– 100+ Jobs in production

Overview
• What is Stream Processing?
• What is Samza?
• Stream Processing @LinkedIn
• Upcoming features

Upcoming Features
• New features
– Local state store improvements
• RocksDB TTL support
• Fast recovery
– Dynamic configuration
– Easier deployment w/ standalone jobs
– High-level query language for faster development

Contact Us / Get Involved
• Open Source
– Documentation: samza.apache.org
– Mailing list: dev@samza.apache.org
– JIRA: https://issues.apache.org/jira/browse/SAMZA

Slide 17: Stream Processing @ LinkedIn
• Content standardization w/ adjunct data set
[Diagram: Member Profile DB, Bootstrap Job, Databus, Kafka, Content Standardization, Kafka, Kafka]
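
The adjunct-data-set pattern can be read as a stream-table join: first load ("bootstrap") a lookup table from a changelog, then enrich each incoming event against it. The sketch below is a guess at the general shape only. The helper names, the `industry` field, and the whitespace/lowercase normalization are all hypothetical and are not taken from the talk.

```python
def bootstrap_profiles(profile_changelog):
    """Replay the full profile changelog into a local lookup table."""
    profiles = {}
    for member_id, profile in profile_changelog:
        profiles[member_id] = profile  # later entries overwrite earlier ones
    return profiles


def standardize(content_events, profiles):
    """Normalize each event's text and attach data from the adjunct table."""
    for member_id, text in content_events:
        profile = profiles.get(member_id, {})
        yield {
            "member_id": member_id,
            "text": " ".join(text.split()).lower(),  # hypothetical cleanup
            "industry": profile.get("industry", "unknown"),
        }


profile_changelog = [
    (1, {"industry": "software"}),
    (2, {"industry": "finance"}),
    (1, {"industry": "internet"}),  # update for member 1
]
content_events = [(1, "  Senior   Engineer "), (3, "Analyst")]

profiles = bootstrap_profiles(profile_changelog)
standardized = list(standardize(content_events, profiles))
```

Loading the table before processing any events is the design point: an event that arrives before its profile has been read would otherwise be enriched with missing data.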

Editor's notes

RPC is the bulk of what we do: we expect an immediate response (for a web page, a bunch of requests are sent). The other extreme is batch processing, typically done in Hadoop, on the order of hours if not days. Samza fits in the middle: asynchronous but relatively quick, on the order of milliseconds to minutes. For us, stream processing means anything asynchronous that is not batch computed.