11. At LinkedIn
10+ billion writes per day
172k messages per second (average)
55+ billion messages per day to real-time consumers
12. Quick aside…
Kafka: First among (pluggable) equals
LinkedIn: Espresso and Databus
Coming soon? HDFS, ActiveMQ, Amazon SQS
13. Kafka in four bullet points
• Producers send messages to brokers
• Messages are key, value pairs
• Brokers store messages in topics for consumers
• Consumers pull messages from brokers
14. A Kafka Topic
Topic: StatusUpdateEvent
Key: User ID of user who updated the status
Value: Timestamp, new status, geolocation, etc.
[Diagram: messages in the topic, labelled in order “The ref’s blind!” 534, “Car nicked!” 234, “Very sleepy” 755, 534 “Nicked a car!”]
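A minimal sketch of the producer, broker, topic and consumer roles, written as a toy in-memory model in Python. It is not the Kafka client API: `ToyBroker`, `send`, `pull` and `Message` are hypothetical names, and the pairing of user IDs to statuses is illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    key: int      # user ID of the user who updated the status
    value: dict   # timestamp, new status, geolocation, etc.


class ToyBroker:
    """Hypothetical in-memory stand-in for a broker; not the Kafka API."""

    def __init__(self):
        self._topics = defaultdict(list)  # topic name -> append-only list

    def send(self, topic, message):
        """Producer side: append a message and return its offset."""
        log = self._topics[topic]
        log.append(message)
        return len(log) - 1

    def pull(self, topic, offset, max_messages=10):
        """Consumer side: read messages starting at the given offset."""
        return self._topics[topic][offset:offset + max_messages]


broker = ToyBroker()
broker.send("StatusUpdateEvent", Message(534, {"status": "The ref's blind!"}))
broker.send("StatusUpdateEvent", Message(234, {"status": "Car nicked!"}))
broker.send("StatusUpdateEvent", Message(755, {"status": "Very sleepy"}))

# The consumer tracks its own position and asks the broker for more.
offset = 0
batch = broker.pull("StatusUpdateEvent", offset)
offset += len(batch)
statuses = [m.value["status"] for m in batch]
```

The point of the pull model is that the broker only stores and serves; each consumer decides when to read and remembers how far it has got.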
18. What we use YARN for
• Distributing our tasks across multiple machines
• Letting us know when one has died
• Distributing a replacement
• Isolating our tasks from each other
23. Awesome feature: State
MyStreamTask:process()
Samza TaskRunner: Partition 0
Store state
• Generic data store interface
• Key-value store out of the box
– More soon? Bloom filter, Lucene, etc.
• Restored by Samza upon task crash
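The slide does not say how state is restored. One plausible approach, sketched here as an assumption, is to log every write and replay the log after a crash. `ToyStore` and its methods are hypothetical names, not Samza's store interface.

```python
class ToyStore:
    """Hypothetical key-value store whose writes are also logged."""

    def __init__(self, changelog):
        self._data = {}
        self._changelog = changelog  # durable list of (key, value) writes

    def put(self, key, value):
        self._changelog.append((key, value))
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    @classmethod
    def restore(cls, changelog):
        """Rebuild a store by replaying every logged write in order."""
        store = cls(changelog)
        for key, value in list(changelog):
            store._data[key] = value
        return store


changelog = []                      # assumed to survive the task crash
store = ToyStore(changelog)
store.put("user-534", 2)
store.put("user-234", 1)
store.put("user-534", 3)

del store                           # simulate the task crashing
restored = ToyStore.restore(changelog)
```

Replaying in order means the last write for each key wins, so the restored store matches the state just before the crash.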
24. (Pseudo)code snippet: Newsfeed
• Consume StatusUpdateEvent
– Send those updates to all your connections via the NewsUpdatePost topic
• Consume NewConnectionEvent
– Maintain state of connections to know who to send to
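The two bullets above can be sketched as a single task in plain Python. This is a toy under stated assumptions, not the Samza API: `NewsfeedTask`, `process` and `outbox` are hypothetical names, connections are assumed to be symmetric, and "sending" just appends to a list.

```python
from collections import defaultdict


class NewsfeedTask:
    """Hypothetical task; method and field names are illustrative."""

    def __init__(self):
        self.connections = defaultdict(set)  # user ID -> connected user IDs
        self.outbox = []                     # (topic, key, value) tuples "sent"

    def process(self, topic, key, value):
        if topic == "NewConnectionEvent":
            # key: user ID, value: the user they connected with (assumed shape)
            self.connections[key].add(value)
            self.connections[value].add(key)
        elif topic == "StatusUpdateEvent":
            # Fan the update out to every connection of the posting user.
            for friend in sorted(self.connections[key]):
                post = {"from": key, "status": value}
                self.outbox.append(("NewsUpdatePost", friend, post))


task = NewsfeedTask()
task.process("NewConnectionEvent", 534, 234)
task.process("NewConnectionEvent", 534, 755)
task.process("StatusUpdateEvent", 534, "The ref's blind!")
```

The connection map is the state the task has to keep between messages; without it, a status update would carry no information about who should receive it.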
RPC = lots of questions, but very quick and specific
Hadoop = fewer questions, but can take a long time to ponder them
Classic Hadoop, because modern Hadoop also uses YARN and Tez
Samza leverages these existing technologies to build its own framework
Very much a production system, critical to LinkedIn
Log or topic: same thing
At-least-once semantics
Messages kept around on the order of days
Analogous to MapReduce
Input directories =
Pretty standard use of YARN. Came along at exactly the right time for Samza. Nice not to have had to write something ourselves
Gives us distribution, task restart
Guarantee that messages partitioned on the same key will be handled by the same task.
In the same way that MapReduce lets you group on keys, co-partitioning the tasks on the keys lets you group on the message keys.
Very useful feature
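A small sketch of why partitioning on the key gives that guarantee. The hash (CRC32) and the partition count are assumptions chosen for the illustration; this is not Kafka's actual partitioner.

```python
import zlib

NUM_PARTITIONS = 4  # assumed partition count for the sketch


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition with a stable hash (CRC32 here)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


# Group messages by the partition (and therefore the task) that handles them.
messages = [
    (534, "The ref's blind!"),
    (234, "Car nicked!"),
    (534, "Nicked a car!"),
]
by_task = {}
for key, value in messages:
    by_task.setdefault(partition_for(key), []).append((key, value))
```

Because the partition depends only on the key, both messages from user 534 end up in the same list, in the order they were sent.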
Also provide interfaces for windowing tasks that are called after a specific amount of time or number of messages.
Also provide methods for initialization, configuration, etc.
Checkpointing is handled behind the scenes
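A rough sketch of the windowing idea, triggered here by message count only. `CountingTask`, `window` and `run` are hypothetical names for the illustration, not the framework's interfaces.

```python
from collections import Counter


class CountingTask:
    """Hypothetical task with per-message and per-window callbacks."""

    def __init__(self):
        self.counts = Counter()
        self.emitted = []

    def process(self, key, value):
        self.counts[key] += 1

    def window(self):
        # Emit the counts gathered since the last window, then reset.
        self.emitted.append(dict(self.counts))
        self.counts.clear()


def run(task, messages, window_every=2):
    """Toy runner: call window() after every `window_every` messages."""
    for i, (key, value) in enumerate(messages, start=1):
        task.process(key, value)
        if i % window_every == 0:
            task.window()


counting_task = CountingTask()
run(counting_task, [(534, "a"), (534, "b"), (234, "c"), (755, "d")])
```

The task only accumulates in `process` and only emits in `window`, so the runner decides how often results are published.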
Neat feature that’s unique among current streaming
Note: Not how LinkedIn really does this!
One could imagine lots of Samza tasks consuming different events and publishing them to the NewsUpdatePost topic.
Another task could then rank these and output them to a key-value store so that users see all the most relevant posts
In production at LinkedIn
Incubator
Lots of documentation
Looking to build a new community
Newbie JIRAs