SlideShare a Scribd company logo
1 of 33
Will Norman
Lessons Learned Building and Operating a
Serverless Data Pipeline
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
serverless-data-pipeline/
Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Introduction
● Will Norman - Director of Engineering @ Intent
○ FinTech and AdTech background
● Intent
○ Data Science company for commerce sites
○ Primary application is an ad network for travel sites
● MOD owns data
● 4 Engineers
● 1 Product Manager
MOD Squad
What we’ll be covering
● What is Serverless?
● Intent Data Platform
● Lessons Learned
● More about managed services than lack servers
● Not just FaaS
● Scale on demand / pay for only what you use
● Empowers developers to own their platform
What is Serverless?
● Active MQ
● Log Processors
○ Java applications
○ Kept state locally
○ Cron scheduled tasks to roll files to S3
○ Ran on dedicated EC2 instances
● S3
Intent Data Platform [Old World]
● Kinesis
● Lambda
● Kinesis Firehose
● SNS
● AWS Batch
● S3
Intent Data Platform [New World]
● Streaming Data Consumers
● Spark Jobs / Aggregations -> Redshift
● Snowflake Loader -> Snowflake
● Parqour -> Athena
○ EMR based jobs that convert AVRO -> Parquet
Data Consumers
● Fewer production issues
● Separation of concerns
● Horizontally scalable
● Removed a lot of undifferentiated heavy lifting
Worth the move?
Lessons Learned
1. Total Cost of Ownership
2. Think about data formats upfront
3. Design for Failure
4. Design for Scalability
5. Not NoOps just DiffOps
6. Build Components
7. CI / CD Strategies
8. Leverage the Community
● On demand costs
● Hidden Costs / Tag All The Things!
● Enterprise Support
● Value of being able to focus on core business problems
Total Cost of Ownership
● What does the ecosystem support?
● Schema vs Schemaless (eg AVRO vs JSON)
● Data validation & Data evolution
● Data at rest vs data in flight
● JSON / CSV / AVRO / Parquet?
Think about data formats up front
record DataWrapper {
string dataType;
long schemaFingerprint;
bytes data;
}
● Publish Schema in JSON format to S3
● Consumers lookup schemas, and calculate fingerprints
Schema Registry
● System Guarantees?
● Idempotency
● Over process (data lookbacks)
● Dead Letter Queues
Design for failure
● Decouple from non-scalable systems
● Don’t run lambdas in VPC if you can help it
● Partition data at rest
● Shard events based on GUID / random id if ordering isn’t
necessary
● Think about fan out patterns
Design for Scalability
● Application problem or service problem
● Platform Limits
● Logs
● Metrics
● Dashboards
● Alerts
Not NoOps, just DiffOps
Some things remain the same
● Help to reason about different parts of the system
● Make it easy to do the right thing
● Easier to extend
● Infrastructure as Code
Build Components
module "conversion_event_processor" {
source = "../modules/event_processor"
data_type = "conversion"
data_source = "ad_server"
processor_lambda_handler = "com.intentmedia.data.stream.ConversionLambda::handler"
environment = "${var.environment}"
firehose_lambda_handler = "com.intentmedia.data.stream.ConversionFirehose::handler"
processor_lambda_reserved_concurrent_executions = 3
firehose_lambda_reserved_concurrent_executions = 2
}
● Step backwards from being able to run stack locally
● Unit tests for business logic
● Integration Tests / End to End tests to ensure that
everything is working as expected
● Use different AWS accounts to segregate staging and
production
CI / CD
● Slack
○ Serverless Forum
○ og-aws
● Blogs
○ Symphonia https://www.symphonia.io/
○ Charity Majors https://charity.wtf/
○ Jeremy Daly https://www.jeremydaly.com/
● Twitter
● Meetup Events / Conferences
Leverage the Community
Questions?
Will Norman
will.norman@intent.com
We’re hiring!
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
serverless-data-pipeline/

More Related Content

More from C4Media

More from C4Media (20)

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL Database
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with Brooklin
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 

Building and Operating a Serverless Data Pipeline

  • 1. Will Norman Lessons Learned Building and Operating a Serverless Data Pipeline
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ serverless-data-pipeline/
  • 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4.
  • 5. Introduction ● Will Norman - Director of Engineering @ Intent ○ FinTech and AdTech background ● Intent ○ Data Science company for commerce sites ○ Primary application is an ad network for travel sites
  • 6. ● MOD owns data ● 4 Engineers ● 1 Product Manager MOD Squad
  • 7. What we’ll be covering ● What is Serverless? ● Intent Data Platform ● Lessons Learned
  • 8. ● More about managed services than lack servers ● Not just FaaS ● Scale on demand / pay for only what you use ● Empowers developers to own their platform What is Serverless?
  • 9. ● Active MQ ● Log Processors ○ Java applications ○ Kept state locally ○ Cron scheduled tasks to roll files to S3 ○ Ran on dedicated EC2 instances ● S3 Intent Data Platform [Old World]
  • 10.
  • 11. ● Kinesis ● Lambda ● Kinesis Firehose ● SNS ● AWS Batch ● S3 Intent Data Platform [New World]
  • 12.
  • 13.
  • 14. ● Streaming Data Consumers ● Spark Jobs / Aggregations -> Redshift ● Snowflake Loader -> Snowflake ● Parqour -> Athena ○ EMR based jobs that convert AVRO -> Parquet Data Consumers
  • 15. ● Fewer production issues ● Separation of concerns ● Horizontally scalable ● Removed a lot of undifferentiated heavy lifting Worth the move?
  • 16. Lessons Learned 1. Total Cost of Ownership 2. Think about data formats upfront 3. Design for Failure 4. Design for Scalability 5. Not NoOps just DiffOps 6. Build Components 7. CI / CD Strategies 8. Leverage the Community
  • 17. ● On demand costs ● Hidden Costs / Tag All The Things! ● Enterprise Support ● Value of being able to focus on core business problems Total Cost of Ownership
  • 18. ● What does the ecosystem support? ● Schema vs Schemaless (eg AVRO vs JSON) ● Data validation & Data evolution ● Data at rest vs data in flight ● JSON / CSV / AVRO / Parquet? Think about data formats up front
  • 19. record DataWrapper { string dataType; long schemaFingerprint; bytes data; } ● Publish Schema in JSON format to S3 ● Consumers lookup schemas, and calculate fingerprints Schema Registry
  • 20. ● System Guarantees? ● Idempotency ● Over process (data lookbacks) ● Dead Letter Queues Design for failure
  • 21.
  • 22.
  • 23. ● Decouple from non-scalable systems ● Don’t run lambdas in VPC if you can help it ● Partition data at rest ● Shard events based on GUID / random id if ordering isn’t necessary ● Think about fan out patterns Design for Scalability
  • 24.
  • 25. ● Application problem or service problem ● Platform Limits ● Logs ● Metrics ● Dashboards ● Alerts Not NoOps, just DiffOps
  • 26. Some things remain the same
  • 27.
  • 28. ● Help to reason about different parts of the system ● Make it easy to do the right thing ● Easier to extend ● Infrastructure as Code Build Components
  • 29. module "conversion_event_processor" { source = "../modules/event_processor" data_type = "conversion" data_source = "ad_server" processor_lambda_handler = "com.intentmedia.data.stream.ConversionLambda::handler" environment = "${var.environment}" firehose_lambda_handler = "com.intentmedia.data.stream.ConversionFirehose::handler" processor_lambda_reserved_concurrent_executions = 3 firehose_lambda_reserved_concurrent_executions = 2 }
  • 30. ● Step backwards from being able to run stack locally ● Unit tests for business logic ● Integration Tests / End to End tests to ensure that everything is working as expected ● Use different AWS accounts to segregate staging and production CI / CD
  • 31. ● Slack ○ Serverless Forum ○ og-aws ● Blogs ○ Symphonia https://www.symphonia.io/ ○ Charity Majors https://charity.wtf/ ○ Jeremy Daly https://www.jeremydaly.com/ ● Twitter ● Meetup Events / Conferences Leverage the Community
  • 33. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ serverless-data-pipeline/