Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.

Building and Operating a Serverless Data Pipeline

43 vues

Publié le

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2W2f96r.

Will Norman discusses the motivations of switching to a serverless infrastructure, and lessons learned while building and operating such a system at scale; focusing on operability, stability, scalability, and ease of development. Filmed at qconnewyork.com.

Will Norman is a Director of Engineering at Intent where he leads the Data Platform team. He has over 10 years of experience architecting, and building, high volume systems in the FinTech and AdTech industries.

Publié dans : Technologie
  • Soyez le premier à commenter

Building and Operating a Serverless Data Pipeline

  1. 1. Will Norman Lessons Learned Building and Operating a Serverless Data Pipeline
  2. 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ serverless-data-pipeline/
  3. 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  4. 4. Introduction ● Will Norman - Director of Engineering @ Intent ○ FinTech and AdTech background ● Intent ○ Data Science company for commerce sites ○ Primary application is an ad network for travel sites
  5. 5. ● MOD owns data ● 4 Engineers ● 1 Product Manager MOD Squad
  6. 6. What we’ll be covering ● What is Serverless? ● Intent Data Platform ● Lessons Learned
  7. 7. ● More about managed services than lack servers ● Not just FaaS ● Scale on demand / pay for only what you use ● Empowers developers to own their platform What is Serverless?
  8. 8. ● Active MQ ● Log Processors ○ Java applications ○ Kept state locally ○ Cron scheduled tasks to roll files to S3 ○ Ran on dedicated EC2 instances ● S3 Intent Data Platform [Old World]
  9. 9. ● Kinesis ● Lambda ● Kinesis Firehose ● SNS ● AWS Batch ● S3 Intent Data Platform [New World]
  10. 10. ● Streaming Data Consumers ● Spark Jobs / Aggregations -> Redshift ● Snowflake Loader -> Snowflake ● Parqour -> Athena ○ EMR based jobs that convert AVRO -> Parquet Data Consumers
  11. 11. ● Fewer production issues ● Separation of concerns ● Horizontally scalable ● Removed a lot of undifferentiated heavy lifting Worth the move?
  12. 12. Lessons Learned 1. Total Cost of Ownership 2. Think about data formats upfront 3. Design for Failure 4. Design for Scalability 5. Not NoOps just DiffOps 6. Build Components 7. CI / CD Strategies 8. Leverage the Community
  13. 13. ● On demand costs ● Hidden Costs / Tag All The Things! ● Enterprise Support ● Value of being able to focus on core business problems Total Cost of Ownership
  14. 14. ● What does the ecosystem support? ● Schema vs Schemaless (eg AVRO vs JSON) ● Data validation & Data evolution ● Data at rest vs data in flight ● JSON / CSV / AVRO / Parquet? Think about data formats up front
  15. 15. record DataWrapper { string dataType; long schemaFingerprint; bytes data; } ● Publish Schema in JSON format to S3 ● Consumers lookup schemas, and calculate fingerprints Schema Registry
  16. 16. ● System Guarantees? ● Idempotency ● Over process (data lookbacks) ● Dead Letter Queues Design for failure
  17. 17. ● Decouple from non-scalable systems ● Don’t run lambdas in VPC if you can help it ● Partition data at rest ● Shard events based on GUID / random id if ordering isn’t necessary ● Think about fan out patterns Design for Scalability
  18. 18. ● Application problem or service problem ● Platform Limits ● Logs ● Metrics ● Dashboards ● Alerts Not NoOps, just DiffOps
  19. 19. Some things remain the same
  20. 20. ● Help to reason about different parts of the system ● Make it easy to do the right thing ● Easier to extend ● Infrastructure as Code Build Components
  21. 21. module "conversion_event_processor" { source = "../modules/event_processor" data_type = "conversion" data_source = "ad_server" processor_lambda_handler = "com.intentmedia.data.stream.ConversionLambda::handler" environment = "${var.environment}" firehose_lambda_handler = "com.intentmedia.data.stream.ConversionFirehose::handler" processor_lambda_reserved_concurrent_executions = 3 firehose_lambda_reserved_concurrent_executions = 2 }
  22. 22. ● Step backwards from being able to run stack locally ● Unit tests for business logic ● Integration Tests / End to End tests to ensure that everything is working as expected ● Use different AWS accounts to segregate staging and production CI / CD
  23. 23. ● Slack ○ Serverless Forum ○ og-aws ● Blogs ○ Symphonia https://www.symphonia.io/ ○ Charity Majors https://charity.wtf/ ○ Jeremy Daly https://www.jeremydaly.com/ ● Twitter ● Meetup Events / Conferences Leverage the Community
  24. 24. Questions? Will Norman will.norman@intent.com We’re hiring!
  25. 25. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ serverless-data-pipeline/

×