Sessionization with Spark streaming

Ramūnas Urbonas
@ Platform Lunar

Disclosure
• This work was implemented in Adform
• Thanks to the Hadoop team for permission and help

History
• Original idea from Ted Malaska, 2014
How-to: Do Near-Real Time Sessionization with Spark Streaming and Apache Hadoop
• Hands on 2016 at Adform

The Problem
• Constant flow of page visits
110 GB average per day, volume variations, catch-up scenario
• Wait for session interrupts
Timeout, specific action, midnight, sanity checks
• Calculate session duration, length, reaction times
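
The timeout rule and the session metrics can be sketched in plain Scala, without Spark. This is only an illustration: the `Visit` shape, the `timeoutSec` parameter and the names below are assumptions, and the other interrupts on the slide (specific action, midnight, sanity checks) are not handled.

```scala
object SessionSplit {
  // Hypothetical record shape: user id, epoch seconds, page name.
  final case class Visit(userId: String, ts: Long, page: String)

  // Summary of one session: number of visits and time from first to last visit.
  final case class SessionStats(userId: String, length: Int, durationSec: Long)

  // Split one user's visits into sessions: a gap longer than `timeoutSec`
  // between consecutive visits starts a new session.
  def split(visits: Seq[Visit], timeoutSec: Long): List[List[Visit]] =
    visits.sortBy(_.ts).foldLeft(List.empty[List[Visit]]) {
      case (current :: done, v) if v.ts - current.head.ts <= timeoutSec =>
        (v :: current) :: done   // continue the open session (newest visit first)
      case (sessions, v) =>
        List(v) :: sessions      // first visit, or the gap exceeded the timeout
    }.map(_.reverse).reverse     // restore chronological order

  // Assumes a non-empty session, which is all that `split` produces.
  def stats(session: List[Visit]): SessionStats =
    SessionStats(session.head.userId, session.size, session.last.ts - session.head.ts)
}
```

With a 15-minute timeout, three visits spaced 605 and 1396 seconds apart fall into two sessions; with a 30-minute timeout they form one.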

The Problem
• Constant ingress / egress
One car enters, car trailer exits
Join for every incoming car
• Some cars loop for hours
• Uncontrollable loop volume

Stream / Not
• Still not 100% sure if it’s worth streaming
People still frown when this topic is brought up
• More frequent ingress means less effective join
Is a 2-minute ingress period still streaming? :)
• Another degree of complexity

Cons
• More complex application
Just like cars: a ride to work vs. a trip to Portugal
• Steady pace is required
Throttling is mandatory, volume control is essential, good GC
• Permanently reserved resources

Pros
• Fun
If this one is on your list, you should probably not do it :)
• Speed
This is “result speed”. Do you actually need it?
• Stability
You have to work really hard to get this benefit

Extra context
• User data is partitioned by nature
User ID (range) is obvious partition key
Helps us to control ingress size and most importantly - loop volume
• Loop volume is hard to control
Average flow was around 150 MB; the loop varied from 2 to 8 GB

Algorithm
ingress
state
updateStateByKey
join

Algorithm
decision
complete → calculate results
incomplete → store for later
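
The two diagrams (join state with ingress, then decide) might look roughly like this on plain Scala maps. It is a sketch under assumptions: the `Visit` tuple shape, the `now` and `timeoutSec` parameters and the timeout-only completeness rule are illustrative, not taken from the talk.

```scala
object JoinAndDecide {
  // Hypothetical visit shape: (epoch seconds, page).
  type Visit = (Long, String)

  // Full outer join of stored state and new ingress by user id:
  // a user may appear on either side, or on both.
  def fullOuterJoin(state: Map[String, List[Visit]],
                    ingress: Map[String, List[Visit]]): Map[String, List[Visit]] =
    (state.keySet ++ ingress.keySet).map { user =>
      val merged = state.getOrElse(user, Nil) ++ ingress.getOrElse(user, Nil)
      user -> merged.sortBy(_._1)
    }.toMap

  // Decision: complete when the last visit is older than the timeout.
  def isComplete(visits: List[Visit], now: Long, timeoutSec: Long): Boolean =
    visits.lastOption.exists { case (ts, _) => now - ts > timeoutSec }

  // Returns (complete, incomplete): results are calculated from the first,
  // the second is stored for the next round.
  def decide(joined: Map[String, List[Visit]], now: Long, timeoutSec: Long)
      : (Map[String, List[Visit]], Map[String, List[Visit]]) =
    joined.partition { case (_, visits) => isComplete(visits, now, timeoutSec) }
}
```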

Copy & Paste
• Ted's solution relies on updateStateByKey
This method requires checkpointing
• Checkpoints
Are good only on paper
They are meant for soft-recovery

The Thought
// Pseudocode: the idea only, not compilable Spark code
val sc = new SparkContext(…)
val ssc = new StreamingContext(sc, Minutes(2))
val ingress = ssc.textFileStream("folder").groupBy(userId)
val checkpoint = sc.textFile("checkpoint").groupBy(userId)
val sessions = checkpoint.fullOuterJoin(ingress)(userId).cache
sessions.filter(complete).map(enrich).saveAsTextFile("output")
sessions.filter(inComplete).saveAsTextFile("checkpoint")

fileStream
• Works based on file timestamp with some memory
Bit fuzzy, ugly for testing
• We wanted to have more control and monitoring
Our file names had meta information (source, oldest record time)
Custom implementation with external state (key-value store)
We could control ingress size
Tip: persisting actual job plan

Checkpoint
user-1 1477983123 page-26
user-1 1477983256 page-2
user-2 1477982342 home
user-2 1477982947 page-9
user-2 1477984343 home
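
A minimal sketch of reading lines in this shape back into per-user groups, assuming whitespace-separated fields in the order user, epoch seconds, page. The `Record` class and function names are illustrative; the real checkpoint used a custom input format.

```scala
import scala.util.Try

object CheckpointLines {
  final case class Record(userId: String, ts: Long, page: String)

  // Parse one line: <user> <epoch-seconds> <page>. Malformed lines yield None.
  def parse(line: String): Option[Record] = line.trim.split("\\s+") match {
    case Array(user, ts, page) => Try(ts.toLong).toOption.map(Record(user, _, page))
    case _                     => None
  }

  // Group records by user, each user's records in time order.
  def group(lines: Seq[String]): Map[String, List[Record]] =
    lines.flatMap(parse).groupBy(_.userId).map { case (user, records) =>
      user -> records.sortBy(_.ts).toList
    }
}
```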

Checkpoint
• Custom implementation
We wanted to maintain checkpoint grouping
• Nothing fancy
class SessionInputFormat
extends FileInputFormat[SessionKey, List[Record]]

fullOuterJoin
• Probably the most expensive operation
The average ratio is 1:35, with extremes of 1:100
We found the IndexedRDD contribution

IndexedRDD
• IndexedRDD
https://github.com/amplab/spark-indexedrdd
• Partition control is essential
Avoid extra stage in your job, extra shuffles
Explicit partitioner, even if it is HashPartitioner
Get used to specifying partitioner for every groupBy / combineByKey
Exact and controllable partition count
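
Why identical partitioning on both sides matters can be shown without Spark. The sketch below mimics the rule Spark's HashPartitioner applies (non-negative `hashCode` modulo the partition count) on plain collections; the object and function names are illustrative, keys are assumed to be unique per side and non-null.

```scala
object PartitionSketch {
  // hashCode modulo partition count, kept non-negative.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Bucket a keyed dataset into exactly `numPartitions` partitions.
  def partitionBy[V](data: Seq[(String, V)], numPartitions: Int): Vector[Map[String, V]] = {
    val byPartition = data.groupBy { case (key, _) => partitionOf(key, numPartitions) }
    Vector.tabulate(numPartitions)(i => byPartition.getOrElse(i, Seq.empty).toMap)
  }

  // With the same partitioner and count on both sides, the join runs
  // partition by partition and no record has to move between partitions.
  def localFullOuterJoin[A, B](left: Vector[Map[String, A]], right: Vector[Map[String, B]])
      : Vector[Map[String, (Option[A], Option[B])]] = {
    require(left.size == right.size, "both sides need the same partition count")
    left.zip(right).map { case (l, r) =>
      (l.keySet ++ r.keySet).map(k => (k, (l.get(k), r.get(k)))).toMap
    }
  }
}
```

If the two sides used different partition counts, matching keys could land in different buckets and one side would have to be redistributed first, which is the extra stage the slide warns about.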

cache & repetition
• Remember?
.cache .filter(complete).doStuff .filter(incomplete).doStuff
• You never want to repeat actions when streaming
We had to scan entire dataset twice
Also… two phase commit
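
The alternative to two filters is to tag every session with its destination in one traversal and let a single writer dispatch on the tag. A plain-Scala sketch, with hypothetical `Complete` and `Ongoing` types and a list-building `write` standing in for a multi-output writer:

```scala
object SinglePassSplit {
  sealed trait Output
  final case class Complete(userId: String, visits: Int) extends Output
  final case class Ongoing(userId: String, visits: Int) extends Output

  // Tag each session with its destination. An Iterator can only be
  // traversed once, so the input is scanned a single time by construction.
  def route(sessions: Iterator[(String, Int, Boolean)]): Iterator[Output] =
    sessions.map { case (user, visits, done) =>
      if (done) Complete(user, visits) else Ongoing(user, visits)
    }

  // Stand-in for a multi-output writer: dispatches each record on its tag.
  def write(outputs: Iterator[Output]): (List[Complete], List[Ongoing]) = {
    val (complete, ongoing) =
      outputs.foldLeft((List.empty[Complete], List.empty[Ongoing])) {
        case ((c, o), x: Complete) => (x :: c, o)
        case ((c, o), x: Ongoing)  => (c, x :: o)
      }
    (complete.reverse, ongoing.reverse)
  }
}
```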

Multi Output Format
• Custom implementation
We wanted different format for each output
Not that hard, but lots of copy-paste
Communication via Hadoop configuration
• MultipleOutputFormat
Why did we not use it?

Gotcha
val conf = new JobConf(rdd.context.hadoopConfiguration)

conf.set("mapreduce.job.outputformat.class",
  classOf[SessionMultiOutputFormat].getName)

conf.set(COMPLETE_SESSIONS_PATH, job.outputPath)
conf.set(ONGOING_SESSION_PATH, job.checkpointPath)
sessions.saveAsNewAPIHadoopDataset(conf)

Non-natural partitioning
• Our ingress comes pre-partitioned
File names like server_oldest-record-timestamp.txt.gz
Where server works on a range of user ids
• Just foreachRDD
… or is it? :D

Resource utilisation
[Chart: resource utilisation, axis from 0 to 100]

Parallelise
• Just rdds.par.foreach(processOne)
… or is it? :D
• Limit thread pool
val par = rdds.par
par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(10))
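
The same bound on concurrency can also be had with standard-library futures on a fixed-size thread pool. This is a swapped-in technique, not the parallel-collections approach from the slide; `foreachBounded` and the one-hour timeout are illustrative assumptions.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BoundedParallel {
  // Run `work` on every item with at most `maxThreads` running at once.
  // Results come back in input order.
  def foreachBounded[A, B](items: Seq[A], maxThreads: Int)(work: A => B): Seq[B] = {
    val pool = Executors.newFixedThreadPool(maxThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val all = Future.traverse(items)(item => Future(work(item)))
      Await.result(all, 1.hour) // arbitrary placeholder timeout
    } finally pool.shutdown()
  }
}
```

In the streaming job, `items` would be the per-server RDDs and `work` would be `processOne`; each call then submits its Spark job from its own thread.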

The Algorithm
val stream = new OurCustomDStream(..)
stream.foreachRDD(processUnion)
…
val par = unionRdd.rdds.par
par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(10))
par.foreach(processOne) // reuse `par`; a fresh .par would ignore the pool limit

The Algorithm
val delta = one.map(addSessionKey).combineByKey[List[Record]](..., new HashPartitioner(20))
val checkpoint = sc.newAPIHadoopFile[SessionKey, List[Record], SessionInputFormat](...)
val withHash = HashPartitionerRDD(sc, checkpoint, Some(new HashPartitioner(20)))
val sessions = IndexedRDD(withHash).fullOuterJoin(delta)(joinFunc)
val split = sessions.flatMap(splitSessionFunc)
val conf = new JobConf(...)
split.saveAsNewAPIHadoopDataset(conf)

Configuration
• Current configuration
Driver: 6 GB RAM
15 executors: 4 GB RAM and 2 cores each
• Total size not that big
60 GB RAM and 30 cores
Previously it was 52 SQL instances... doing other things too
• Hasn’t changed for half a year already
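
For reference, this is how the numbers above might be expressed as standard Spark properties and rendered as `spark-submit` arguments. The property names are regular Spark settings; whether the original deployment set them this way is an assumption, and the G1 flag is the one mentioned under Other tips.

```scala
object SubmitArgs {
  // Values from the slide, keyed by standard Spark property names.
  val settings: Seq[(String, String)] = Seq(
    "spark.driver.memory"             -> "6g",
    "spark.executor.instances"        -> "15",
    "spark.executor.memory"           -> "4g",
    "spark.executor.cores"            -> "2",
    "spark.driver.extraJavaOptions"   -> "-XX:+UseG1GC",
    "spark.executor.extraJavaOptions" -> "-XX:+UseG1GC"
  )

  // Render as: --conf key=value --conf key=value ...
  def asArgs: Seq[String] = settings.flatMap { case (k, v) => Seq("--conf", s"$k=$v") }

  private def setting(key: String): String = settings.toMap.apply(key)

  // Totals for the executors only, matching "60 GB RAM and 30 cores".
  val executors: Int  = setting("spark.executor.instances").toInt
  val totalRamGb: Int = executors * setting("spark.executor.memory").stripSuffix("g").toInt
  val totalCores: Int = executors * setting("spark.executor.cores").toInt
}
```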

Other tips
• -XX:+UseG1GC
For both driver and executors
• Plan & store jobs, repeat if failed
When repeating, environment changes
• Use named RDDs
Helps to read your DAGs
