Data Pipeline at Tapad 
@tobym 
@TapadEng
Who am I? 
Toby Matejovsky 
First engineer hired at Tapad 3+ years 
ago 
Scala developer 
@tobym
What are we talking about?
Outline 
• What Tapad does 
• Why bother with a data pipeline? 
• Evolution of the pipeline 
• Day in the life of an analytics pixel 
• What’s next
What Tapad Does 
Cross-platform advertising and analytics 
Process billions of events per day 
Cross platform? 
Device Graph 
Node = device, edge = inferred connection 
Billion devices 
Quarter billion edges 
85+% accuracy 
Why a Data Pipeline? 
Graph building 
Sanity while processing big data 
Decouple components 
Data accessible at multiple stages
Graph Building 
Realtime mode, but don’t impact bidding latency 
Batch mode
Sanity 
Billions of events, terabytes of logs per day 
Don’t have NSA’s budget 
Clear data retention policy 
Store aggregations
Decouple Components 
Bidder only bids, graph-building 
process only builds graph 
Data stream can split and merge
Data accessible at multiple stages 
Logs on edge of system 
Local spool of data 
Kafka broker 
Consumer local spool 
HDFS
Evolution of the Data Pipeline 
Dark Ages: Monolithic process, synchronous process 
Renaissance: Queues, asynchronous work in same process 
Age of Exploration: Inter-process comm, ad hoc batching 
Age of Enlightenment: Standardize on Kafka and Avro
Dark Ages 
Monolithic process, synchronous process 
It was fast enough, and we had to start somewhere.
Renaissance 
Queues, asynchronous work in same process 
No, it wasn’t fast enough.
Age of Exploration 
Inter-process communication, ad hoc batching 
Servers at the edge batch up events, ship them to another 
service.
Age of Enlightenment 
Standardize on Kafka and Avro 
Properly engineered and supported, reliable
Age of Enlightenment 
Standardize on Kafka and Avro 
Properly engineered and supported, reliable
Tangent! 
Batching, queues, and serialization 
Batching 
Batching is great, will really help throughput 
Batching != slow 
Queues 
Queues are amazing, until they explode and destroy the Rube Goldberg 
machine. 
“I’ll just increase the buffer size.” 
- spoken one day before someone ended up on double PagerDuty rotation 
Care and feeding of your queue 
Monitor 
Back-pressure 
Buffering 
Spooling 
Degraded mode 
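The speaker notes expand on back-pressure with a bounded java.util.concurrent queue. A minimal sketch of that idea; the capacity, element type, and drop timeout are illustrative assumptions:

import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

// Bounded queue as back-pressure: when the consumer falls behind, put() blocks the
// producing thread instead of letting an unbounded queue grow until the JVM runs out of memory.
val queue = new LinkedBlockingQueue[Array[Byte]](10000) // capacity is illustrative

def produce(event: Array[Byte]): Unit =
  queue.put(event) // blocks while the queue is full

def produceOrDrop(event: Array[Byte]): Boolean =
  queue.offer(event, 50, TimeUnit.MILLISECONDS) // degraded mode: give up after 50 ms and drop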
Serialization - Protocol Buffers 
Tagged fields 
Sort of self-describing 
required, optional, repeated fields in schema 
“Map” type: 
message StringPair { 
required string key = 1; 
optional string value = 2; 
} 
Serialization - Avro 
Optional field: union { null, long } user_timestamp = null; 
Splittable (Hadoop world) 
Schema evolution and storage 
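The speaker notes suggest checking schema evolution with a unit test: serialize with one schema, deserialize with the other. A self-contained sketch of that round trip, with illustrative schemas rather than Tapad's:

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// Writer schema: the "old" record, without the optional timestamp.
val writerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"Pixel","fields":[{"name":"id","type":"string"}]}""")

// Reader schema: the "new" record, adding a defaulted optional field.
val readerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"Pixel","fields":[
       {"name":"id","type":"string"},
       {"name":"user_timestamp","type":["null","long"],"default":null}]}""")

val record = new GenericData.Record(writerSchema)
record.put("id", "abc123")

// Encode with the writer schema...
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](writerSchema).write(record, encoder)
encoder.flush()

// ...and decode with the reader schema: schema resolution fills in the default.
val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
val decoded = new GenericDatumReader[GenericRecord](writerSchema, readerSchema).read(null, decoder)
assert(decoded.get("user_timestamp") == null) // the new optional field resolved to its default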
pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs 
Day in the life of a pixel 
Browser loads pixel from pixel server 
Pixel server immediately responds with a 200 and a transparent GIF, 
then serializes the request into a batch file 
The batch file ships every few seconds or when it reaches 2 KB, whichever comes first
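A rough sketch of that ship-on-size-or-time behaviour. The 2 KB threshold comes from the slide; the class, the interval, and the ship function are assumptions standing in for the real pixel server code:

import java.io.ByteArrayOutputStream
import java.util.concurrent.{Executors, TimeUnit}

// Accumulate serialized requests; ship the batch when it reaches maxBytes or on a timer,
// whichever comes first. `ship` stands in for whatever posts the file to the pixel ingress.
class BatchSpooler(maxBytes: Int = 2048, flushEveryMs: Long = 5000)(ship: Array[Byte] => Unit) {
  private var buffer = new ByteArrayOutputStream()
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  scheduler.scheduleAtFixedRate(new Runnable { def run(): Unit = flush() },
    flushEveryMs, flushEveryMs, TimeUnit.MILLISECONDS)

  def append(serializedRequest: Array[Byte]): Unit = synchronized {
    buffer.write(serializedRequest)
    if (buffer.size() >= maxBytes) flush() // size threshold reached: ship immediately
  }

  def flush(): Unit = synchronized {
    if (buffer.size() > 0) {
      ship(buffer.toByteArray)
      buffer = new ByteArrayOutputStream()
    }
  }
}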
pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs 
Day in the life of a pixel 
Pixel ingress server receives the ~2 KB file containing serialized 
web requests. 
Deserialize, process some requests immediately (e.g. update a 
database), then convert into Avro records with a schema hash 
header and publish to various Kafka topics
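The slides don't show how the schema hash header is laid out. One plausible sketch using Avro's schema fingerprinting; the 8-byte-hash-then-payload layout is an assumption, not Tapad's actual format:

import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer
import org.apache.avro.{Schema, SchemaNormalization}
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

// An 8-byte schema fingerprint followed by the binary-encoded record, so a consumer
// can look the full schema up by hash (the speaker notes mention ZooKeeper as the schema store).
def encodeWithSchemaHash(record: GenericRecord, schema: Schema): Array[Byte] = {
  val out = new ByteArrayOutputStream()
  val fingerprint = SchemaNormalization.parsingFingerprint64(schema)
  out.write(ByteBuffer.allocate(8).putLong(fingerprint).array())
  val encoder = EncoderFactory.get().binaryEncoder(out, null)
  new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
  encoder.flush()
  out.toByteArray // these bytes are what gets published to the Kafka topic
}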
pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs 
Day in the life of a pixel 
The producer client figures out where to publish via the broker it 
connects to 
Kafka topics are partitioned; each partition has a master and a slave 
replica on different servers so the topic can survive an outage. 
Configurable retention based on time 
Can add topics dynamically
pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs 
Day in the life of a pixel 
Consumer processes are organized into groups 
Many consumer groups can read from same Kafka topic 
Plugins: 
trait Plugin[A] { 
def onStartup(): Unit 
def onSuccess(a: A): Unit 
def onFailure(a: A): Unit 
def onShutdown(): Unit 
} 
GraphitePlugin, BatchingLogfilePlaybackPlugin, TimestampDrivenClockPlugin, 
BatchingTimestampDrivenClockPlugin, …
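As an illustration of the Plugin trait above, a minimal plugin might just count outcomes. This is a hypothetical sketch, not Tapad's GraphitePlugin, which would report to Graphite rather than stdout:

import java.util.concurrent.atomic.AtomicLong

// Hypothetical Plugin implementation: count successes and failures for one consumer.
class CountingPlugin[A](name: String) extends Plugin[A] {
  private val successes = new AtomicLong()
  private val failures = new AtomicLong()
  def onStartup(): Unit = ()
  def onSuccess(a: A): Unit = successes.incrementAndGet()
  def onFailure(a: A): Unit = failures.incrementAndGet()
  def onShutdown(): Unit =
    println(s"$name: ${successes.get} succeeded, ${failures.get} failed")
}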
pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs 
Day in the life of a pixel 
import scala.collection.mutable.ArrayBuffer 

trait Plugins[A] { 
private val _plugins = ArrayBuffer.empty[Plugin[A]] 
def plugins: Seq[Plugin[A]] = _plugins 
def registerPlugin(plugin: Plugin[A]) = _plugins += plugin 
}
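Wiring it together might look roughly like this; the consumer class and event type are assumed names, not from the talk:

// Hypothetical wiring: a consumer that mixes in Plugins[PixelEvent] gets the callbacks.
val consumer = new KafkaPixelConsumer with Plugins[PixelEvent]
consumer.registerPlugin(new CountingPlugin[PixelEvent]("pixel-consumer"))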
pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs 
Day in the life of a pixel 
object KafkaConsumer { 
sealed trait Result { 
def notify[A](plugins: Seq[Plugin[A]], a: A): Unit 
} 
case object Success extends Result { 
def notify[A](plugins: Seq[Plugin[A]], a: A) { 
plugins.foreach(_.onSuccess(a)) 
} 
} 
}
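The slide shows only the Success case. A Failure counterpart inside KafkaConsumer would presumably fan out to onFailure in the same way; this is a guess at its shape, not the original code:

// Presumed Failure counterpart (not shown on the slide): notify the onFailure callbacks.
case object Failure extends Result {
  def notify[A](plugins: Seq[Plugin[A]], a: A) {
    plugins.foreach(_.onFailure(a))
  }
}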
pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs 
/** Decorate a Function1[A, B] with retry logic */ 
case class Retry[A, B](maxAttempts: Int, backoff: Long)(f: A => B) { 
  def apply(a: A): Result[A, B] = { 
    def execute(attempt: Int, errorLog: List[Throwable]): Result[A, B] = { 
      val result = try { 
        Success(this, a, f(a)) 
      } catch { 
        case e: Throwable => Failure(this, a, e :: errorLog) 
      } 
      result match { 
        case failure @ Failure(_, _, errorLog) if errorLog.size < maxAttempts => 
          val _backoff = (math.pow(2, attempt) * backoff).toLong 
          Thread.sleep(_backoff) // wait before the next invocation 
          execute(attempt + 1, errorLog) // try again 
        case failure @ Failure(_, _, _) => 
          failure // retries exhausted 
        case success => 
          success 
      } 
    } 
    execute(attempt = 0, errorLog = Nil) 
  } 
}
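Usage of the Retry decorator would look roughly like this; the wrapped function and payload are stand-ins, not code from the talk:

// Hypothetical usage: up to 5 attempts with exponential back-off starting at 100 ms.
def writeToHdfs(bytes: Array[Byte]): Unit = ??? // stand-in for the consumer's real side effect
val persistWithRetry = Retry(maxAttempts = 5, backoff = 100L)(writeToHdfs)
val result = persistWithRetry(Array[Byte](1, 2, 3)) // Success(...) or, once retries are exhausted, Failure(...)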
pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs 
Day in the life of a pixel 
Consumers write events into “permanent storage” in HDFS. 
File format is Avro, written in batches. 
Data retention policy is essential.
pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs 
Day in the life of a pixel 
Hadoop 2 - YARN 
Scalding to write map-reduce jobs easily 
Rewrite Avro files as Parquet 
Oozie to schedule regular jobs
pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs 
YARN
pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs 
Scalding 
class WordCountJob(args : Args) extends Job(args) { 
TextLine( args("input") ) 
.flatMap('line -> 'word) { line : String => tokenize(line) } 
.groupBy('word) { _.size } 
.write( Tsv( args("output") ) ) 
// Split a piece of text into individual words. 
def tokenize(text : String) : Array[String] = { 
// Lowercase each word and remove punctuation. 
text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+") 
} 
}
pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs 
Parquet 
Column-oriented storage for Hadoop 
Nested data is okay 
Projections 
Predicates
pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs 
Parquet 
val requests = ParquetAvroSource 
.project[Request](args("requests"), Projection[Request]("header.query_params", "partner_id")) 
.read 
.sample(args("sample-rate").toDouble) 
.mapTo('Request -> ('queryParams, 'partnerId)) { req: TapestryRequest => 
(req.getHeader.getQueryParams, req.getPartnerId) 
}
pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs 
Oozie 
<workflow-app name="combined_queries" xmlns="uri:oozie:workflow:0.3"> 
<start to="devices-location"/> 
<!--<start to="export2db"/>--> 
<action name="devices-location"> 
<shell xmlns="uri:oozie:shell-action:0.1"> 
<job-tracker>${jobTracker}</job-tracker> 
<name-node>${nameNode}</name-node> 
<exec>hadoop</exec> 
<argument>fs</argument> 
<argument>-cat</argument> 
<argument>${devicesConfig}</argument> 
<capture-output/> 
</shell> 
<ok to="networks-location"/> 
<error to="kill"/> 
</action>
pixel server - pixel ingress - kafka - consumer - hdfs - hadoop jobs 
Day in the life of a pixel 
Near real-time consumers and batch Hadoop jobs generate data 
cubes from incoming events and save those aggregations into 
Vertica for fast and easy querying with SQL.
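For flavour, a Scalding job that rolls events up into a small cube might look like the sketch below; the field names and sources are assumptions, not Tapad's schema, and the export into Vertica would be a separate step (e.g. via Sqoop):

import com.twitter.scalding._

// Hypothetical cube job: count events per (day, partner, campaign); the output TSV is
// what a separate export step would load into Vertica for SQL querying.
class EventCubeJob(args: Args) extends Job(args) {
  Tsv(args("events"), ('day, 'partnerId, 'campaignId))
    .groupBy('day, 'partnerId, 'campaignId) { _.size('events) }
    .write(Tsv(args("output")))
}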
Stack summary 
Scala, Jetty/Netty, Finagle 
Avro, Protocol Buffers, Parquet 
Kafka 
Zookeeper 
Hadoop - YARN and HDFS 
Vertica 
Scalding 
Oozie, Sqoop
What’s next? 
Hive 
Druid 
Impala 
Oozie alternative
Thank You yes, we’re hiring! :) 
@tobym 
@TapadEng 
Toby Matejovsky, Director of Engineering 
toby@tapad.com 
@tobym


Speaker notes

  1. Data pipelines can look a bit like a Rube Goldberg machine
  2. HTTP requests indicating “user is interested in a widget”, “want to show an ad?”, “ad was served”, “user bought a widget”
  3. At any given time, have roughly a billion devices and a quarter billion edges. Graph is constantly changing in realtime whenever a signal is processed, or a record expires. Accuracy is checked against an objective third party dataset.
  4. Generating a terabyte of logs per day, can’t store it all. Don’t want to store it all either, more data takes longer to process
  5. Realtime bidding infrastructure has a very tight SLA and is very sensitive to latency. It needs access to the graph database, and incoming signals may add or modify an edge depending on a big list of rules. Used to do this in-process; obvious problem to have the bidder do work that isn’t directly related to bidding. Solution: publish the signals to a queue (Kafka) and let a consumer pull from that and build the graph in near-realtime. All one signal at a time, plus some contextual history for similar signals. Batch mode - a Scalding job running on a one-petabyte, 50-node Hadoop cluster. Looks at several weeks’ worth of signals and creates an entire “new” graph. More connections, same or better accuracy.
  6. Data retention policy For some data, fine to store aggregations instead of individual elements
  7. Transparency, not just input-> black box -> output Slow graph-building process won’t slow down bidder Deploy new versions of some component in the pipeline without needing to interrupt another process Easy to tap into data stream at any point
  8. Can inspect the data at any one of these places, aids debugging Log produced vs consumed at each stage to see if things are flowing properly
  9. Dark ages - had to start somewhere, and it was fast enough
  10. Had to start somewhere, and it was fast enough in the beginning.
  11. Pretty obvious that the synchronous stuff didn’t work once we started to scale, so just process things in a separate thread pool. Standard software development here; nothing fancy.
  12. Edge servers serialize HTTP request using protocol buffers, write delimited records to a file, ship the file every N seconds or when the file hits a certain size, whichever comes first. Easy because it was the same code deployed on different machines, just needed to add the serialization/deserialization, ship/receive, and batch modes. Very simple, batch mode is just a loop that calls the original single event processor.
  13. Apache Kafka is a distributed queuing system. Fast (A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.) Scalable (can expand capacity without downtime, queues are partitioned and replicated, not limited by single node capacity, distributed by design) Durable (messages are written to disk on master and slave machines) Avro - serialization format like protobuf. Supports maps and default values; protobuf doesn’t. Used for our HDFS storage as well; standardizing allows us to use the same code whether it’s running in a consumer reading from Kafka or in a hadoop job reading from HDFS.
  14. Apache Kafka is a distributed queuing system. Fast (A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.) Scalable (can expand capacity without downtime, queues are partitioned and replicated, not limited by single node capacity, distributed by design) Durable (messages are written to disk on master and slave machines) Avro - serialization format like protobuf. Supports maps and default values; protobuf doesn’t. Used for our HDFS storage as well; standardizing allows us to use the same code whether it’s running in a consumer reading from Kafka or in a hadoop job reading from HDFS.
  15. Batching will really improve processing throughput, because you save the cost of repeated setup and teardown. Works at all scales, batching != slow: on the small end, think about how an optimizing compiler performs “loop unrolling” – perform a dozen operations on each iteration instead of one per iteration. Can batch inside of some function in your application, and inter-process.
  16. Queues are great because they allow for elasticity. However, this can be a double-edged sword because the elasticity may hide a problem until it becomes catastrophic. An unbounded queue WILL cause the system to fail one day. If the producer is faster than the consumer, it will put messages in the queue until you run out of memory.
  17. Monitor – Graphite metrics for produced vs consumed counts, alert if things are too far off. Back-pressure – Provide back-pressure via a bounded queue. A bounded java.util.concurrent.LinkedBlockingQueue is great for this; if it’s full the inserting thread blocks until there is space. Similar with an ExecutorService backed by the same: if submitting a task fails because the queue is full, either throw an exception or have the inserting thread run it itself. “Increase the buffer size” – Actually this is okay, just take some time to think about what a good size is. The main issue with a big queue size is GC pressure. Spooling – the producer can spool messages locally and retry later. Avoid OOMing. Degraded mode – just drop some data. The bidder process does this with incoming bid requests by discarding from the front of the queue (those are the messages that have been in the queue the longest, so get rid of them if they are already stale or at risk of becoming stale).
  18. Protocol buffers have tagged fields (just a number, so you can use whatever name you want, and change it later), then a type (int, string, etc.), then the length of the field, then the field value. This is cool because each record can be decoded without having the same schema as the encoder. Each field describes its type, but not the name, so you need the generated classes to fully deserialize into something useful with the field names you expect. Evolve the schema by adding a new field with a new tag number, or deleting an old field. Never reuse a tag number. Easier to evolve the schema than Avro because of this technique.
  19. No optional type, because all fields are always present in the same order as the schema; so use a union with null for optional. Also there is a Map type. Schema evolution is possible via resolution rules; need to be careful though, as fields are matched by name so you cannot rename stuff thoughtlessly. For example, give a default value to a new field so it’s possible to parse a record encoded without that field. Lots of overhead to send the schema with each request; don’t do it. So how does one deal with having multiple records with multiple versions of the schema? Store the schema hash, then store the actual schema (JSON) somewhere else; we use ZooKeeper. Also in HDFS, the header of a giant Avro file can contain the schema for the records contained within. Naturally splittable, good for map-reduce jobs because a single file can be split up automatically among N mappers. Uses a split marker. Test with unit tests - serialize with one schema, deserialize with the other, ensure there are no exceptions and you have expected values in each field.
  20. Serialize with protocol buffers
  21. Some things are supposed to be processed immediately, so do it. Others can wait long enough to do it the right way, so publish the request to the appropriate topic. Topic is just another name for a particular queue.
  22. Configure the number of partitions per topic in the broker config files. Consumers can autodiscover brokers via ZooKeeper; producers autodiscover based on connecting to an existing broker. We have a 24-hour retention policy, and brokers each have a terabyte of storage available. Once the data is older than the configured age, it’s gone. Don’t fall behind! Started using it at v0.7.1. Built some tooling for ourselves that didn’t exist yet.
  23. Consumers autodiscover brokers via zookeeper Batching and discrete consumers Plugins such as GraphitePlugin, BatchingLogfilePlaybackPlugin, TimestampDrivenClockPlugin, BatchingTimestampDrivenClockPlugin, … TimestampDrivenClockPlugin is for a producer. It registers itself with Zookeeper, and saves the latest timestamp that it has processed. This allows other processes to coordinate by taking the minimum timestamp published by the group of producers.
  24. This is how a plugin is registered with a given producer or consumer client.
  25. Example of plugin callbacks being run after notification of a success
  26. A consumer is basically a Function1[A, B] Here’s some retry logic with exponential back-off. Eventually it will fail and stop processing.
  27. Batch writes so you have a smaller number of bigger files. Many small files are the Achilles’ heel of Hadoop: mappers take too long to spin up. A data retention policy is essential because storage consumed WILL expand to the limits of storage available. Make clear distinctions between data that lives for a week, a month, a year. Scratch space as well; use it but be aware that it could be wiped out if necessary.
  28. YARN is like the OS of the Hadoop cluster; it allocates resources like compute power to jobs which need it. Scalding is a Scala API which makes it easy to write map-reduce jobs. Oozie is a job scheduler and coordinator. It’s sort of clunky and uses lots of XML. Not in love with it, but it gets the job done and we haven’t committed to seriously exploring other options yet.
  29. Photo credit Hortonworks (http://hortonworks.com/hadoop/yarn/) Basically, HDFS is great and everything just reads from that. YARN allows any application to then run on the same hadoop cluster so it can easily get at the data in HDFS.
  30. Scalding is a Scala API which makes it easy to write map-reduce jobs. See the example code. joinWithTiny is fantastically fast if you can get away with it because everything is done in-memory in the mapper; no need for extra map-reduce steps for the join.
  31. Parquet is a column-oriented storage format for Hadoop. Push-down predicates and projections make for faster reads, sometimes giving HUGE speedups. Predicate lets you check some field before reading data into your application Projection lets you load only specified fields out of a record Meta-format, so we still use Avro-generated classes
  32. Example of a projection
  33. Oozie coordinates workflows, which are directed acyclic graphs of actions like “wait for this file, then run this job, if it errors goto this step (kill/cleanup), otherwise go to that step (export to database with sqoop). XML workflow, plus some properties files.
  34. Hive to make data in HDFS available to non-programmers. SQL is easier than writing a map-reduce job Oozie is a bit awkward, we know there are alternatives Druid - realtime big data analytics database. We essentially have our own homegrown version of this; not as mature though Impala is another SQL-on-Hadoop sort of thing