Tapad's data pipeline is an elastic combination of technologies (Kafka, Hadoop, Avro, Scalding) that forms a reliable system for analytics, realtime and batch graph-building, and logging. In this talk, I will speak about the creation and evolution of the pipeline, and a concrete example – a day in the life of an event tracking pixel. We'll also talk about common challenges that we've overcome such as integrating different pieces of the system, schema evolution, queuing, and data retention policies.
9. Sanity
Billions of events, terabytes of logs per day
Don’t have NSA’s budget
Clear data retention policy
Store aggregations
10. Decouple Components
Bidder only bids; graph-building process only builds the graph
Data stream can split and merge
11. Data accessible at multiple stages
Logs on edge of system
Local spool of data
Kafka broker
Consumer local spool
HDFS
12. Evolution of the Data Pipeline
Dark Ages: Monolithic, synchronous process
Renaissance: Queues, asynchronous work in same process
Age of Exploration: Inter-process comm, ad hoc batching
Age of Enlightenment: Standardize on Kafka and Avro
13. Dark Ages
Monolithic, synchronous process
It was fast enough, and we had to start somewhere.
19. Batching
Batching is great, will really help throughput
Batching != slow
20. Queues
Queues are amazing, until they explode and destroy the Rube Goldberg machine.
“I’ll just increase the buffer size.”
- spoken one day before someone ended up on double PagerDuty rotation
21. Care and feeding of your queue
Monitor
Back-pressure
Buffering
Spooling
Degraded mode
22. Serialization - Protocol Buffers
Tagged fields
Sort of self-describing
required, optional, repeated fields in schema
“Map” type:
message StringPair {
  required string key = 1;
  optional string value = 2;
}
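A map is then emulated as a repeated field of these pairs, e.g. (a sketch, not from the deck):
message TrackingEvent {
  repeated StringPair attributes = 1;
}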
23. Serialization - Avro
Optional field: union { null, long } user_timestamp = null;
Splittable (Hadoop world)
Schema evolution and storage
24. pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs
Day in the life of a pixel
Browser loads pixel from pixel server
Pixel server immediately responds with 200 and a transparent gif, then serializes requests into a batch file
Batch file ships every few seconds or when the file reaches 2K
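The ship-on-size-or-time rule could look roughly like this (a minimal sketch of the idea; all names are hypothetical, not Tapad's code):
import java.io.ByteArrayOutputStream
import java.util.concurrent.{Executors, TimeUnit}

class Spool(maxBytes: Int = 2048, flushSeconds: Int = 5) {
  private val buffer = new ByteArrayOutputStream()
  // time-based trigger: flush every few seconds even if the batch is small
  Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
    new Runnable { def run() = flush() }, flushSeconds, flushSeconds, TimeUnit.SECONDS)

  def append(serializedRequest: Array[Byte]): Unit = synchronized {
    buffer.write(serializedRequest)
    if (buffer.size >= maxBytes) flush() // size-based trigger: ship when the batch reaches 2K
  }

  def flush(): Unit = synchronized {
    if (buffer.size > 0) { ship(buffer.toByteArray); buffer.reset() }
  }

  private def ship(batch: Array[Byte]): Unit = { /* send the batch file to the ingress server */ }
}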
25. pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs
Day in the life of a pixel
Pixel ingress server receives the 2-kilobyte files containing serialized web requests.
Deserialize, process some requests immediately (update a database), then convert into Avro records with a schema-hash header, and publish to various Kafka topics
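The schema-hash header might be framed along these lines (a sketch; the deck doesn't specify the hash, so this assumes Avro's 64-bit schema fingerprint):
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer
import org.apache.avro.{Schema, SchemaNormalization}
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

def frame(record: GenericRecord, schema: Schema): Array[Byte] = {
  val out = new ByteArrayOutputStream()
  // 8-byte fingerprint header lets consumers look up the full schema (e.g. in ZooKeeper)
  out.write(ByteBuffer.allocate(8).putLong(SchemaNormalization.parsingFingerprint64(schema)).array())
  val encoder = EncoderFactory.get().binaryEncoder(out, null)
  new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
  encoder.flush()
  out.toByteArray
}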
26. pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs
Day in the life of a pixel
Producer client figures out where to publish via the broker it connects to
Kafka topics are partitioned into multiple chunks; each partition has a master and a slave on different servers to survive an outage.
Configurable retention based on time
Can add topics dynamically
27. pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs
Day in the life of a pixel
Consumer processes are organized into groups
Many consumer groups can read from same Kafka topic
Plugins:
trait Plugin[A] {
  def onStartup(): Unit
  def onSuccess(a: A): Unit
  def onFailure(a: A): Unit
  def onShutdown(): Unit
}
GraphitePlugin, BatchingLogfilePlaybackPlugin, TimestampDrivenClockPlugin,
BatchingTimestampDrivenClockPlugin, …
28. pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs
Day in the life of a pixel
import scala.collection.mutable.ArrayBuffer

trait Plugins[A] {
  private val _plugins = ArrayBuffer.empty[Plugin[A]]
  def plugins: Seq[Plugin[A]] = _plugins
  def registerPlugin(plugin: Plugin[A]) = _plugins += plugin
}
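Registering a plugin with a given producer or consumer client then looks something like this (a sketch; PixelEvent is a hypothetical event type and the GraphitePlugin constructor is assumed):
class PixelEventConsumer extends Plugins[PixelEvent] {
  registerPlugin(new GraphitePlugin) // assuming GraphitePlugin implements Plugin[PixelEvent]
}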
29. pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs
Day in the life of a pixel
object KafkaConsumer {
  sealed trait Result {
    def notify[A](plugins: Seq[Plugin[A]], a: A): Unit
  }
  case object Success extends Result {
    def notify[A](plugins: Seq[Plugin[A]], a: A) {
      plugins.foreach(_.onSuccess(a))
    }
  }
}
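Only the success case is shown on the slide; a matching failure case inside the same object would presumably look like this (a sketch):
case object Failure extends Result {
  def notify[A](plugins: Seq[Plugin[A]], a: A) {
    plugins.foreach(_.onFailure(a))
  }
}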
30. pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs
/** Decorate a Function1[A, B] with retry logic */
case class Retry[A, B](maxAttempts: Int, backoff: Long)(f: A => B) {
  def apply(a: A): Result[A, B] = {
    def execute(attempt: Int, errorLog: List[Throwable]): Result[A, B] = {
      val result = try {
        Success(this, a, f(a))
      } catch {
        case e: Exception => Failure(this, a, e :: errorLog)
      }
      result match {
        case Failure(_, _, errors) if errors.size < maxAttempts =>
          val _backoff = (math.pow(2, attempt) * backoff).toLong
          Thread.sleep(_backoff) // exponential back-off before the next invocation
          execute(attempt + 1, errors) // try again
        case other =>
          other // either a Success, or a Failure that exhausted its attempts
      }
    }
    execute(attempt = 0, errorLog = Nil)
  }
}
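A hypothetical usage sketch (lookupUser is assumed, and Success/Failure here are the parameterized result types used by Retry):
val fetchUser = Retry[Long, String](maxAttempts = 3, backoff = 100L)(id => lookupUser(id))
fetchUser(42L) match {
  case Success(_, _, name)   => println(name)
  case Failure(_, _, errors) => errors.foreach(_.printStackTrace())
}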
31. pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs
Day in the life of a pixel
Consumers log into “permanent storage” in HDFS.
File format is Avro, written in batches.
Data retention policy is essential.
32. pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs
Day in the life of a pixel
Hadoop 2 - YARN
Scalding to write map-reduce jobs easily
Rewrite Avro files as Parquet
Oozie to schedule regular jobs
33. pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs
YARN
34. pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs
Scalding
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  // Split a piece of text into individual words.
  def tokenize(text: String): Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
35. pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs
Parquet
Column-oriented storage for Hadoop
Nested data is okay
Projections
Predicates
38. pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs
Day in the life of a pixel
Near real-time consumers and batch hadoop jobs generate data cubes from incoming events and save those aggregations into Vertica for fast and easy querying with SQL.
41. Thank You
yes, we’re hiring! :)
@tobym
@TapadEng
Toby Matejovsky, Director of Engineering
toby@tapad.com
@tobym
Editor’s notes
Data pipelines can look a bit like a Rube Goldberg machine
HTTP requests indicating “user is interested in a widget”, “want to show an ad?”, “ad was served”, “user bought a widget”
At any given time, have roughly a billion devices and a quarter billion edges. Graph is constantly changing in realtime whenever a signal is processed, or a record expires. Accuracy is checked against an objective third party dataset.
Generating a terabyte of logs per day, can’t store it all. Don’t want to store it all either, more data takes longer to process
Realtime bidding infrastructure has a very tight SLA and is very sensitive to latency. It needs access to the graph database, and incoming signals may add or modify an edge depending on a big list of rules. Used to do this in-process; obvious problem to have the bidder do work that isn’t directly related to bidding. Solution: publish the signals to a queue (Kafka), let a consumer pull from that and build the graph in near-realtime. All one signal at a time, plus some contextual history for similar signals.
Batch Mode - Scalding job running on a one-petabyte, 50-node hadoop cluster. Looks at several weeks’ worth of signals and creates an entire “new” graph. More connections, same or better accuracy.
Data retention policy
For some data, fine to store aggregations instead of individual elements
Transparency, not just input-> black box -> output
Slow graph-building process won’t slow down bidder
Deploy new versions of some component in the pipeline without needing to interrupt another process
Easy to tap into data stream at any point
Can inspect the data at any one of these places, aids debugging
Log produced vs consumed at each stage to see if things are flowing properly
Dark ages - had to start somewhere, and it was fast enough in the beginning.
Pretty obvious that the synchronous stuff didn’t work once we started to scale, so just process things in a separate thread pool. Standard software development here; nothing fancy.
Edge servers serialize HTTP request using protocol buffers, write delimited records to a file, ship the file every N seconds or when the file hits a certain size, whichever comes first. Easy because it was the same code deployed on different machines, just needed to add the serialization/deserialization, ship/receive, and batch modes. Very simple, batch mode is just a loop that calls the original single event processor.
Apache Kafka is a distributed queuing system.
Fast (A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.)
Scalable (can expand capacity without downtime, queues are partitioned and replicated, not limited by single node capacity, distributed by design)
Durable (messages are written to disk on master and slave machines)
Avro - serialization format like protobuf. Supports maps and default values; protobuf doesn’t. Used for our HDFS storage as well; standardizing allows us to use the same code whether it’s running in a consumer reading from Kafka or in a hadoop job reading from HDFS.
Batching will really improve processing throughput, because you save the cost of repeated setup and teardown. Works at all scales, batching != slow: on the small end, think about how an optimizing compiler performs “loop unrolling” – perform a dozen operations on each iteration instead of one per iteration. Can batch inside of some function in your application, and inter-process.
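At the application level that can be as simple as processing in chunks (a sketch; Event, openConnection, and write are hypothetical names):
def processAll(events: Iterator[Event]): Unit =
  events.grouped(500).foreach { batch =>
    val db = openConnection() // pay the setup cost once per 500 events, not once per event
    try batch.foreach(event => write(db, event))
    finally db.close()
  }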
Queues are great because they allow for elasticity. However, this can be a double-edged sword because the elasticity may hide a problem until it becomes catastrophic. An unbounded queue WILL cause the system to fail one day. If the producer is faster than the consumer, it will put messages in the queue until you run out of memory.
Monitor – graphite metrics for produced vs consumed counts, alert if things are too far off
Back-pressure – Provide back-pressure via a bounded queue. A bounded java.util.concurrent.LinkedBlockingQueue is great for this; if it’s full, the inserting thread blocks until there is space. Similar with an ExecutorService backed by the same kind of queue; when a submit is rejected, either throw an exception or have the inserting thread run the task itself (ThreadPoolExecutor.CallerRunsPolicy).
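A minimal sketch of that bounded-queue back-pressure:
import java.util.concurrent.LinkedBlockingQueue

val queue = new LinkedBlockingQueue[Array[Byte]](10000) // bounded capacity
// producer side: put() blocks when the queue is full, throttling the producer
def produce(message: Array[Byte]): Unit = queue.put(message)
// consumer side: take() blocks when the queue is empty
def consume(): Array[Byte] = queue.take()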
“Increase the buffer size” – Actually this is okay; just take some time to think about what a good size is. The main issue with a big queue size is GC pressure.
Spooling – producer can spool messages locally and retry later. Avoid OOMing
Degraded mode – just drop some data. The bidder process does this with incoming bid requests by discarding from the front of the queue (those are the messages that have been in the queue the longest, so get rid of them first since they are already stale or at risk of becoming stale)
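Sketched (hypothetical types):
def offerOrShed[A](queue: java.util.concurrent.LinkedBlockingQueue[A], message: A): Unit =
  while (!queue.offer(message)) {
    queue.poll() // discard the oldest, stalest message to make room for the new one
  }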
Protocol buffers have tagged fields (just a number, so you can use whatever name you want, and change it later), then a type (int, string, etc), then the length of the field, then the field value. This is cool because each record can be decoded without having the same schema as the encoder. Each field describes its type, but not its name, so you need the generated classes to fully deserialize into something useful with the field names you expect.
Evolve the schema by adding a new field with a new tag number, or deleting an old field. Never reuse a tag number.
Easier to evolve the schema than with Avro because of this technique.
No optional type, because all fields are always present in the same order as the schema; so use a union with null for optional fields. Also there is a Map type.
Schema evolution is possible via resolution rules, but be careful; fields are matched by name, so you cannot rename things thoughtlessly. For example, give a default value to a new field so it’s possible to parse a record encoded without that field. Lots of overhead to send the schema with each request; don’t do it. So how does one deal with having multiple records with multiple versions of the schema? Store the schema hash, then store the actual schema (JSON) somewhere else; we use ZooKeeper. Also in HDFS, the header of a giant Avro file can contain the schema for the records contained within.
Naturally splittable, good for map-reduce jobs because a single file can be split up automatically among N mappers. Uses a split marker.
Test with unit tests - serialize with one schema, deserialize with the other, ensure there are no exceptions and you have expected values in each field.
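Such a test can stay entirely in-memory; a sketch with Avro’s generic API (schemas abbreviated, field names hypothetical):
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

val writerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"Pixel","fields":[{"name":"url","type":"string"}]}""")
val readerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"Pixel","fields":[
      {"name":"url","type":"string"},
      {"name":"user_timestamp","type":["null","long"],"default":null}]}""")

// encode with the old (writer) schema
val record = new GenericData.Record(writerSchema)
record.put("url", "http://example.com")
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](writerSchema).write(record, encoder)
encoder.flush()

// decode with the new (reader) schema; the added field takes its default value
val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
val reader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
assert(reader.read(null, decoder).get("user_timestamp") == null)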
Serialize with protocol buffers
Some things are supposed to be processed immediately, so do it. Others can wait long enough to do it the right way, so publish the request to the appropriate topic. Topic is just another name for a particular queue.
Configure number of partitions per topic in the broker config files.
Consumers can autodiscover brokers via zookeeper, producers autodiscover based on connecting to an existing broker
We have 24 hour retention policy, and brokers each have a terabyte of storage available. Once the data is older than the configured age, it’s gone. Don’t fall behind!
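In broker-config terms that policy is roughly (a sketch; property names as in Kafka’s server.properties, values assumed):
num.partitions=8
log.retention.hours=24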
Started using Kafka at v0.7.1. Built some tooling for ourselves that didn’t exist yet.
Consumers autodiscover brokers via zookeeper
Batching and discrete consumers
Plugins such as GraphitePlugin, BatchingLogfilePlaybackPlugin, TimestampDrivenClockPlugin, BatchingTimestampDrivenClockPlugin, …
TimestampDrivenClockPlugin is for a producer. It registers itself with Zookeeper, and saves the latest timestamp that it has processed. This allows other processes to coordinate by taking the minimum timestamp published by the group of producers.
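A sketch of that coordination with the plain ZooKeeper client (paths and names hypothetical):
import org.apache.zookeeper.{CreateMode, ZooDefs, ZooKeeper}
import scala.collection.JavaConverters._

// each producer publishes the latest timestamp it has processed
class ClockNode(zk: ZooKeeper, producerId: String) {
  private val path = s"/clock/$producerId" // assumes the /clock parent node already exists
  zk.create(path, Array.empty[Byte], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL)
  def update(timestamp: Long): Unit =
    zk.setData(path, timestamp.toString.getBytes, -1)
}

// other processes coordinate on the minimum timestamp across all producers
def lowWatermark(zk: ZooKeeper): Long =
  zk.getChildren("/clock", false).asScala
    .map(child => new String(zk.getData(s"/clock/$child", false, null)).toLong)
    .min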
This is how a plugin is registered with a given producer or consumer client.
Example of plugin callbacks being run after notification of a success
A consumer is basically a Function1[A, B]
Here’s some retry logic with exponential back-off. Eventually it will fail and stop processing.
Batch write so you have a smaller number of bigger files. Many small files are the Achilles’ heel of hadoop: mappers take too long to spin up.
Data retention policy is essential because storage consumed WILL expand to the limits of storage available. Make clear distinctions between data that lives for a week, a month, a year. Scratch space as well, use it but be aware that it could be wiped out if necessary.
YARN is like the OS of the hadoop cluster; it allocates resources like compute power to jobs which need it
Scalding is a Scala API which makes it easy to write map-reduce jobs
Oozie is a job scheduler and coordinator. It’s sort of clunky and uses lots of XML. Not in love with it, but it gets the job done and we haven’t committed to seriously exploring other options yet.
Photo credit Hortonworks (http://hortonworks.com/hadoop/yarn/)
Basically, HDFS is great and everything just reads from that. YARN allows any application to then run on the same hadoop cluster so it can easily get at the data in HDFS.
See example code. joinWithTiny is fantastically fast if you can get away with it, because everything is done in-memory in the mapper; no need for extra map-reduce steps for the join.
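The example code isn’t in this transcript; a joinWithTiny sketch in Scalding’s fields-based API (sources and field names hypothetical):
import com.twitter.scalding._

class EnrichJob(args: Args) extends Job(args) {
  // the tiny side must fit in each mapper's memory
  val devices = Tsv(args("devices"), ('devId, 'deviceType)).read
  Tsv(args("signals"), ('deviceId, 'signal))
    .read
    .joinWithTiny('deviceId -> 'devId, devices) // replicated map-side join; no extra map-reduce step
    .write(Tsv(args("output")))
}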
Parquet is a column-oriented storage format for Hadoop.
Push-down predicates and projections make for faster reads, sometimes giving HUGE speedups.
Predicate lets you check some field before reading data into your application
Projection lets you load only specified fields out of a record
Meta-format, so we still use Avro-generated classes
Example of a projection
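The projection example isn’t in this transcript; a reconstructed sketch using parquet-avro (pre-Apache package name; the record and field names are hypothetical):
import org.apache.avro.Schema
import org.apache.hadoop.mapreduce.Job
import parquet.avro.AvroParquetInputFormat

val job = Job.getInstance()
// read only userId out of each stored record; other columns are skipped on disk
val projection = new Schema.Parser().parse(
  """{"type":"record","name":"PixelEvent","fields":[{"name":"userId","type":"string"}]}""")
AvroParquetInputFormat.setRequestedProjection(job, projection)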
Oozie coordinates workflows, which are directed acyclic graphs of actions like “wait for this file, then run this job; if it errors go to this step (kill/cleanup), otherwise go to that step (export to database with sqoop)”. XML workflow, plus some properties files.
Hive to make data in HDFS available to non-programmers. SQL is easier than writing a map-reduce job
Oozie is a bit awkward, we know there are alternatives
Druid - realtime big data analytics database. We essentially have our own homegrown version of this; not as mature though
Impala is another SQL-on-Hadoop sort of thing