Sessionization with Spark Streaming
Ramūnas Urbonas
@ Platform Lunar
Disclosure
• This work was implemented in Adform
• Thanks to the Hadoop team for permission and help
History
• Original idea from Ted Malaska, 2014
How-to: Do Near-Real Time Sessionization with Spark Streaming and Apache Hadoop
• Hands-on in 2016 at Adform
The Problem
• Constant flow of page visits
110 GB average per day, volume variations, catch-up scenario
• Wait for session interrupts
Timeout, specific action, midnight, sanity checks
• Calculate session duration, length, reaction times
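As a minimal sketch of the timeout interrupt and the duration and length calculations listed above, the plain Scala below splits one user's visits into sessions. The names Visit, Session and splitByTimeout are hypothetical, the timeout value is an assumption, and the other interrupts (specific action, midnight, sanity checks) are not modelled.

```scala
// Hypothetical names throughout; this is an illustration, not the Adform implementation.
object SessionSplit {
  final case class Visit(userId: String, epochSeconds: Long, page: String)

  final case class Session(userId: String, visits: List[Visit]) {
    // Session length: number of page visits.
    def length: Int = visits.size
    // Session duration: time between the first and the last visit.
    def durationSeconds: Long = visits.last.epochSeconds - visits.head.epochSeconds
  }

  // A gap longer than timeoutSeconds between consecutive visits starts a new session.
  def splitByTimeout(visits: Seq[Visit], timeoutSeconds: Long): List[Session] = {
    val sorted = visits.sortBy(_.epochSeconds).toList
    sorted.foldLeft(List.empty[List[Visit]]) {
      // The open session is kept newest-first, so its head is the latest visit.
      case (current :: done, v) if v.epochSeconds - current.head.epochSeconds <= timeoutSeconds =>
        (v :: current) :: done
      case (acc, v) =>
        List(v) :: acc
    }.reverse.map(vs => Session(vs.head.userId, vs.reverse))
  }
}
```

Calling SessionSplit.splitByTimeout(visits, 1800L) would treat 30 minutes of inactivity as the end of a session; the right threshold depends on the product.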
The Problem
• Constant ingress / egress
One car enters, car trailer exits
Join for every incoming car
• Some cars loop for hours
• Uncontrollable loop volume
Stream / Not
• Still not 100% sure if it’s worth streaming
People still frown when this topic is brought up
• More frequent ingress means less effective join
Is a 2-minute ingress period still streaming? :)
• Another degree of complexity
Cons
• More complex application
Just like cars: a ride to work vs a trip to Portugal
• Steady pace is required
Throttling is mandatory, volume control is essential, good GC
• Permanently reserved resources
Pros
• Fun
If this one is on your list, you should probably not do it :)
• Speed
This is “result speed”. Do you actually need it?
• Stability
You have to work really hard to get this benefit
Extra context
• User data is partitioned by nature
User ID (range) is obvious partition key
Helps us to control ingress size and most importantly - loop volume
• Loop volume is hard to control
Average flow was around 150 MB, the loop varied from 2 - 8 GB
Algorithm
ingress
state
updateStateByKey
join
Algorithm
complete
incomplete
decision calculate results
store for later
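The join and decision steps can be sketched on plain Scala collections, without Spark. This is only an illustration under assumptions: the record shape, the names fullOuterJoin and decide, and the timeout-only rule are all hypothetical, and the real decision also covers the other session interrupts.

```scala
// Illustration on in-memory maps; a real job would do this on partitioned RDDs.
object JoinAndDecide {
  type Visit = (Long, String) // (epoch seconds, page): hypothetical record shape

  // Full outer join by user id: a key present on either side survives.
  def fullOuterJoin(
      state: Map[String, List[Visit]],
      ingress: Map[String, List[Visit]]
  ): Map[String, (Option[List[Visit]], Option[List[Visit]])] =
    (state.keySet ++ ingress.keySet).map(k => (k, (state.get(k), ingress.get(k)))).toMap

  // Merges stored and new visits, then decides:
  // Right = complete (calculate results), Left = incomplete (store for later).
  def decide(
      joined: (Option[List[Visit]], Option[List[Visit]]),
      now: Long,
      timeoutSeconds: Long
  ): Either[List[Visit], List[Visit]] = {
    val visits = (joined._1.getOrElse(Nil) ++ joined._2.getOrElse(Nil)).sortBy(_._1)
    val timedOut = visits.lastOption.exists(last => now - last._1 > timeoutSeconds)
    if (timedOut) Right(visits) else Left(visits)
  }
}
```

Tagging each session once with Left or Right means both outputs can be produced from a single pass over the joined data.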
Copy & Paste
• Ted's solution relies on updateStateByKey
This method requires checkpointing
• Checkpoints
Are good only on paper
They are meant for soft-recovery
The Thought
val sc = new SparkContext(…)
val ssc = new StreamingContext(sc, Minutes(2))
val ingress = ssc.textFileStream("folder").groupBy(userId)
val checkpoint = sc.textFile("checkpoint").groupBy(userId)
val sessions = checkpoint.fullOuterJoin(ingress)(userId).cache
sessions.filter(complete).map(enrich).saveAsTextFile("output")
sessions.filter(incomplete).saveAsTextFile("checkpoint")
fileStream
• Works based on file timestamp with some memory
Bit fuzzy, ugly for testing
• We wanted to have more control and monitoring
Our file names had meta information (source, oldest record time)
Custom implementation with external state (key-value store)
We could control ingress size
Tip: persist the actual job plan
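One way to picture ingress size control from file-name metadata is the sketch below. It assumes names shaped like source_timestamp.txt.gz with a numeric timestamp, which is a guess at the real format; IngressFile, parse and nextBatch are hypothetical names, and the file sizes are assumed to come from a file listing.

```scala
// Hypothetical planning helper; the real implementation kept its state in a key-value store.
object IngressPlan {
  final case class IngressFile(name: String, source: String, oldestRecord: Long, sizeBytes: Long)

  // Assumed name shape: <source>_<numeric timestamp>.txt.gz
  private val NamePattern = """(.+)_(\d+)\.txt\.gz""".r

  // Extracts the metadata from a file name; returns None when the name does not match.
  def parse(name: String, sizeBytes: Long): Option[IngressFile] = name match {
    case NamePattern(source, ts) =>
      scala.util.Try(ts.toLong).toOption.map(t => IngressFile(name, source, t, sizeBytes))
    case _ => None
  }

  // Oldest records first; stop before the running total would exceed maxBytes.
  def nextBatch(files: Seq[IngressFile], maxBytes: Long): List[IngressFile] = {
    val ordered = files.sortBy(_.oldestRecord).toList
    val runningTotal = ordered.scanLeft(0L)(_ + _.sizeBytes).tail
    ordered.zip(runningTotal).takeWhile(_._2 <= maxBytes).map(_._1)
  }
}
```

Note that this sketch returns an empty batch when the oldest file alone exceeds the budget, so a real planner would need a rule for oversized files.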
Checkpoint
user-1 1477983123 page-26
user-1 1477983256 page-2
user-2 1477982342 home
user-2 1477982947 page-9
user-2 1477984343 home
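Lines of this shape could be read back into per-user groups roughly as follows. This is a sketch under assumptions: fields are taken to be whitespace-separated (user id, epoch seconds, page), and Record, parseLine and groupByUser are hypothetical names rather than the custom input format used in the talk.

```scala
// Plain Scala parsing sketch; malformed lines are skipped instead of failing the batch.
object CheckpointParse {
  final case class Record(epochSeconds: Long, page: String)

  // Parses one "userId epochSeconds page" line.
  def parseLine(line: String): Option[(String, Record)] =
    line.trim.split("\\s+") match {
      case Array(user, ts, page) =>
        scala.util.Try(ts.toLong).toOption.map(t => (user, Record(t, page)))
      case _ => None
    }

  // Groups records per user, each group ordered by time.
  def groupByUser(lines: Seq[String]): Map[String, List[Record]] =
    lines.flatMap(parseLine).groupBy(_._1).map { case (user, rows) =>
      (user, rows.map(_._2).sortBy(_.epochSeconds).toList)
    }
}
```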
Checkpoint
• Custom implementation
We wanted to maintain checkpoint grouping
• Nothing fancy
class SessionInputFormat
extends FileInputFormat[SessionKey, List[Record]]
fullOuterJoin
• Probably the most expensive operation
The average ratio is 1:35, with extremes of 1:100
We found the IndexedRDD contribution
IndexedRDD
• IndexedRDD
https://github.com/amplab/spark-indexedrdd
• Partition control is essential
Avoid extra stage in your job, extra shuffles
Explicit partitioner, even if it is HashPartitioner
Get used to specifying a partitioner for every groupBy / combineByKey
Exact and controllable partition count
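Why an explicit partitioner with a fixed count helps can be shown without Spark. The function below mirrors the rule Spark's HashPartitioner applies (non-negative modulo of the key's hashCode, with null keys going to partition 0); partitionFor is a hypothetical helper name used only for this sketch.

```scala
// Sketch of hash partitioning: the same key and count always give the same partition index.
object PartitionSketch {
  def partitionFor(key: Any, numPartitions: Int): Int =
    if (key == null) 0
    else {
      val raw = key.hashCode % numPartitions
      // hashCode can be negative, so shift the remainder into 0 until numPartitions.
      if (raw < 0) raw + numPartitions else raw
    }
}
```

When both sides of a join use the same rule and the same partition count, records for one user already sit in matching partitions, which is what lets the join skip an extra shuffle stage.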
IndexedRDD
cache & repetition
• Remember?
.cache
.filter(complete).doStuff
.filter(incomplete).doStuff
• You never want to repeat actions when streaming
We had to scan the entire dataset twice
Also… two-phase commit
Multi Output Format
• Custom implementation
We wanted a different format for each output
Not that hard, but lots of copy-paste
Communication via Hadoop configuration
• MultipleOutputFormat
Why did we not use it?
Gotcha
val conf = new JobConf(rdd.context.hadoopConfiguration)
conf.set("mapreduce.job.outputformat.class",
  classOf[SessionMultiOutputFormat].getName)
conf.set(COMPLETE_SESSIONS_PATH, job.outputPath)
conf.set(ONGOING_SESSION_PATH, job.checkpointPath)
sessions.saveAsNewAPIHadoopDataset(conf)
Non-natural partitioning
• Our ingress comes pre-partitioned
File names like server_oldest-record-timestamp.txt.gz
Where server works on a range of user ids
• Just foreachRDD
… or is it? :D
Resource utilisation
[Two charts of resource utilisation; axis scale 0 to 100]
Parallelise
• Just rdds.par.foreach(processOne)
… or is it? :D
• Limit thread pool
val par = rdds.par
par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(10))
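The same idea, a hard cap on how many batches are processed at once, can be sketched with standard-library Futures on a fixed thread pool. This swaps in a different technique from the parallel-collection task support shown above, purely so the example runs without extra modules; runBounded and demo are hypothetical names, and the sleep stands in for real work.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BoundedParallel {
  // Runs work over all items with at most maxThreads in flight.
  // Returns the peak number of items that were running at the same time.
  def runBounded[A](items: Seq[A], maxThreads: Int)(work: A => Unit): Int = {
    val pool = Executors.newFixedThreadPool(maxThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    val lock = new Object
    var running = 0
    var peak = 0
    try {
      val futures = items.map { item =>
        Future {
          lock.synchronized { running += 1; peak = math.max(peak, running) }
          try work(item) finally lock.synchronized { running -= 1 }
        }
      }
      Await.result(Future.sequence(futures), 1.minute)
      lock.synchronized(peak)
    } finally pool.shutdown()
  }

  // Demonstration: 40 simulated batches, never more than 4 at once.
  def demo(): Int = runBounded(1 to 40, maxThreads = 4)(_ => Thread.sleep(5))
}
```

In a Spark driver, each item would be one RDD and the work function would submit its job, so the pool size bounds how many jobs compete for executors at once.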
The Algorithm
val stream = new OurCustomDStream(..)
stream.foreachRDD(processUnion)
…
val par = unionRdd.rdds.par
par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(10))
par.foreach(processOne)
The Algorithm
val delta = one.map(addSessionKey).combineByKey[List[Record]](..., new HashPartitioner(20))
val checkpoint = sc.newAPIHadoopFile[SessionKey, List[Record], SessionInputFormat](...)
val withHash = HashPartitionerRDD(sc, checkpoint, Some(new HashPartitioner(20)))
val sessions = IndexedRDD(withHash).fullOuterJoin(delta)(joinFunc)
val split = sessions.flatMap(splitSessionFunc)
val conf = new JobConf(...)
split.saveAsNewAPIHadoopDataset(conf)
Result
Configuration
• Current configuration
Driver: 6 GB RAM
15 executors: 4 GB RAM and 2 cores each
• Total size not that big
60 GB RAM and 30 cores
Previously it was 52 SQL instances... doing other things too
• Hasn’t changed for half a year already
Metrics
My Pride
Other tips
• -XX:+UseG1GC
For both driver and executors
• Plan & store jobs, repeat if failed
When repeating, the environment may have changed
• Use named RDDs
Helps to read your DAGs
Thanks
