
Optimise your data pipeline without rewriting it - Big data conference Vilnius

"It is not fast enough!" That is one of the more common responses a data engineer gets when putting a data pipeline into production. It is easy to dig down into the code and try to optimize it. My experience as a data engineer shows that it is often easier and more effective, both in time spent and in outcome, to take a more holistic view of the pipeline.
In this talk, we will look at a structured process for optimizing our batch pipelines. We will introduce steps that make the process data-driven instead of based on gut feeling. With examples from real-world cases where delivery time was reduced by an order of magnitude, we will look at the actions that were taken.
The intended audience is beginner to intermediate data engineers. After the talk, you will have a better understanding of how to optimize your pipeline and be able to explain the steps taken to a stakeholder. You will know:
* what metrics to look at
* how to visualize the metrics
* how to detect bottlenecks and other time thieves from the metrics
* what actions to take.
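
As a first step towards those points, here is a minimal sketch (not from the talk) of how per-task timing metrics can be visualized: given start and end timestamps per task, for example pulled from the scheduler's metadata, a horizontal bar chart already shows where time goes and where tasks sit idle. The task names and timestamps below are made-up examples.

    # Minimal sketch: plot per-task start/end times as a Gantt-style chart.
    # Task names and timestamps are made-up examples, not measured data.
    from datetime import datetime
    import matplotlib.pyplot as plt

    tasks = [  # (task name, start, end) from your scheduler's metadata
        ("DB to files on storage bucket", datetime(2021, 1, 1, 0, 30), datetime(2021, 1, 1, 9, 0)),
        ("Load files to BigQuery", datetime(2021, 1, 1, 9, 0), datetime(2021, 1, 1, 14, 30)),
    ]

    fig, ax = plt.subplots()
    for i, (name, start, end) in enumerate(tasks):
        # One horizontal bar per task, offset by its start hour.
        duration_h = (end - start).total_seconds() / 3600
        ax.barh(i, duration_h, left=start.hour + start.minute / 60)
    ax.set_yticks(range(len(tasks)))
    ax.set_yticklabels([name for name, _, _ in tasks])
    ax.set_xlabel("Hour of day")
    ax.set_xlim(0, 24)
    plt.tight_layout()
    plt.show()

This is the kind of picture the 00-21 hour timelines in the slides give.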

Published in: Data & Analytics


  1. Optimize your data pipeline without rewriting it (Magnus Runesson)
  2. “How did you do that?”
  3. Who am I?
  4. We are the rails and brains of open banking. Trusted by the industry leaders: Natwest, BNP Paribas Fortis, ABN AMRO, PayPal, Klarna, SEB. 260 employees across Europe, with local offices in Sweden, Finland, Denmark, United Kingdom, Germany, Netherlands, Spain, France and Poland. 2,500+ banks & FIs connected: we’re connected for access to all types of accounts, from banks, neo-banks, credit cards and more, and bring them together for you via a single, beautiful API. Industry authority: member of EU, Berlin Group & Open Banking advisory boards. ISO/IEC 27001 certified. PSD2 licensed. 2,000+ platform users.
  5. Agenda: Introduction, The problem, The process to success, Summary
  6. It is not fast enough!
  7. The process
  8. Visualize
  9. [Diagram: Database → "DB to files on storage bucket" → "Load to bigquery" → Google BigQuery; a minimal load sketch follows the transcript]
  10. [Timeline, hours 00-21: "DB to files on storage bucket" followed by "Load files to bigquery"]
  11. Requirements
  12. Break down
  13. [Timeline, hours 00-21: "DB to files on storage bucket" followed by "Load files to bigquery"]
  14. [Timeline, hours 00-21: the export and load broken down into per-table tasks, Table 1-4; per-table sketches follow the transcript]
  15. Parallelise
  16. [Timeline, hours 00-21: per-table export and load tasks, Table 1-4]
  17. [Timeline, hours 00-21: per-table export and load tasks, Table 1-4]
  18. Remove idle time
  19. [Timeline, hours 00-21: per-table export and load tasks, Table 1-4]
  20. [Timeline, hours 00-21: per-table export and load tasks, Table 1-4; a dependency-scheduling sketch follows the transcript]
  21. Find bottleneck
  22. [Timeline, hours 00-21: per-table export and load tasks, Table 1-4]
  23. [Diagram "Bottleneck example": Task 1-6 laid out over time]
  24. [Diagram "Bottleneck example - critical path": Task 1-6 over time with the critical path highlighted; a critical-path sketch follows the transcript]
  25. Zoom in
  26. Process recap
  27. Visualize → Understand requirements → Break down → Parallelise → Remove idle time → Find bottlenecks → Zoom in
  28. Remember: ● Use dependency-based scheduling ● Look at the metrics ● Use metrics to make your case ● Work on the bottleneck once you get into the code ● Stop when good enough
  29. Thank you! (Psst… come work with us!) Magnus.Runesson@tink.se @MRunesson
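
Slide 9 shows the pipeline shape used throughout the talk: export the source database to files on a storage bucket, then load those files into Google BigQuery. Below is a minimal sketch of the load step, assuming the export has already written CSV files to a Cloud Storage path; the bucket, dataset and table names are placeholders, not the talk's actual setup.

    # Minimal sketch of the "load files to BigQuery" step, assuming CSV exports
    # already sit in a GCS bucket. Bucket, dataset and table names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,      # skip the header row written by the export
        autodetect=True,          # let BigQuery infer the schema
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )

    load_job = client.load_table_from_uri(
        "gs://my-export-bucket/exports/table_1/*.csv",  # placeholder export path
        "my_project.my_dataset.table_1",                # placeholder table id
        job_config=job_config,
    )
    load_job.result()  # block until the load job finishes
    print(client.get_table("my_project.my_dataset.table_1").num_rows, "rows loaded")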
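
Slides 12 to 17 break the single export job down into one task per table and run those tasks in parallel. A minimal sketch of that idea using a thread pool; export_table is a placeholder for whatever your export tooling does per table, and the table names are examples.

    # Minimal sketch: break one monolithic export into per-table tasks and run
    # them in parallel. export_table() is a placeholder for the real export code.
    from concurrent.futures import ThreadPoolExecutor, as_completed

    TABLES = ["table_1", "table_2", "table_3", "table_4"]

    def export_table(table: str) -> str:
        """Export one table from the database to files on the storage bucket."""
        # ... run the real export here ...
        return f"gs://my-export-bucket/exports/{table}/"  # placeholder path

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(export_table, t): t for t in TABLES}
        for future in as_completed(futures):
            print(futures[future], "exported to", future.result())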
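
Slides 18 to 20 remove idle time, and slide 28 recommends dependency-based scheduling: each table's load should start as soon as that table's export finishes, instead of all loads waiting for all exports. A minimal sketch in Airflow style; the DAG id, schedule and callables are assumptions for illustration, not the talk's actual pipeline.

    # Minimal sketch of dependency-based scheduling: each table's load depends
    # only on that table's export, so no load waits for unrelated exports.
    # DAG id, schedule and the two callables are placeholders.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def export_table(table):
        ...  # export one table from the database to the storage bucket

    def load_table(table):
        ...  # load that table's files from the bucket into BigQuery

    with DAG(
        dag_id="daily_db_to_bigquery",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        for table in ["table_1", "table_2", "table_3", "table_4"]:
            export = PythonOperator(
                task_id=f"export_{table}",
                python_callable=export_table,
                op_kwargs={"table": table},
            )
            load = PythonOperator(
                task_id=f"load_{table}",
                python_callable=load_table,
                op_kwargs={"table": table},
            )
            export >> load  # this load starts as soon as its own export is done

With per-table dependencies in place, the scheduler removes the idle time for you; the export and load code itself does not have to be rewritten.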
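
Slides 21 to 24 locate the bottleneck: once tasks run in parallel, the task worth attacking is the slowest one on the critical path, not necessarily the slowest task overall. A minimal sketch that computes the critical path of a small task DAG from per-task durations; the tasks, durations and dependencies are made-up examples echoing the Task 1-6 diagram.

    # Minimal sketch: find the critical path (longest chain of dependent tasks)
    # in a small task DAG. Tasks, durations and edges are made-up examples.
    from graphlib import TopologicalSorter

    durations = {  # hours per task, e.g. taken from scheduler metrics
        "task_1": 2.0, "task_2": 1.0, "task_3": 4.0,
        "task_4": 0.5, "task_5": 3.0, "task_6": 1.5,
    }
    depends_on = {  # task -> set of upstream tasks
        "task_1": set(),
        "task_2": {"task_1"},
        "task_3": {"task_1"},
        "task_4": {"task_2"},
        "task_5": {"task_3"},
        "task_6": {"task_4", "task_5"},
    }

    finish = {}    # earliest possible finish time per task
    previous = {}  # upstream task on the longest path to each task
    for task in TopologicalSorter(depends_on).static_order():
        upstream = max(depends_on[task], key=lambda u: finish[u], default=None)
        previous[task] = upstream
        finish[task] = durations[task] + (finish[upstream] if upstream else 0.0)

    # Walk back from the task that finishes last to recover the critical path.
    node, path = max(finish, key=finish.get), []
    while node is not None:
        path.append(node)
        node = previous[node]
    print(" -> ".join(reversed(path)), "takes", finish[path[0]], "hours")

Shortening a task that is not on that path will not move the end-to-end delivery time, which is why the critical path is where "zoom in" should start.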
