Flink Forward San Francisco 2019: Scaling a real-time streaming warehouse with Apache Flink, Parquet and Kubernetes - Aditi Verma & Ramesh Shanmugam

Scaling a real-time streaming warehouse with Apache Flink, Parquet and Kubernetes
At Branch, we process more than 12 billion events per day, and store and aggregate terabytes of data daily. We use Apache Flink for processing, transforming and aggregating events, and Parquet as the data storage format. This talk covers our challenges with scaling our warehouse, namely:

How did we scale our Flink-Parquet warehouse to handle a 3x increase in traffic?
How do we ensure exactly-once, event-time-based, fault-tolerant processing of events?
In this talk, we also cover deploying and scaling our streaming warehouse. We give an overview of:

How we scaled our Parquet warehouse by tuning memory
Running on a Kubernetes cluster for resource management
How we migrated our streaming jobs with no disruption from Mesos to Kubernetes
Our challenges and learnings along the way

  1. Scaling Warehouse with Flink, Parquet & Kubernetes
     Aditi Verma & Ramesh Shanmugam
  2. Aditi Verma, Sr Software Engineer, @aditiverma89 <averma@branch.io>
     Ramesh Shanmugam, Sr Data Engineer, @rameshs01 <rshanmugam@branch.io>
  3. Agenda
     ● Background
     ● Moving data with Flink @ Branch
     ● Scale & Performance
     ● Flink on Kubernetes
     ● Auto Scaling & Failure Recovery
  4. 12B requests per day (+70% y/y)
     3B user sessions per day
     50 TB of data per day
     200K events per second
     60+ Flink pipelines
     5+ Kubernetes clusters
  5. Moving data with Flink @ Branch
     “Life is 10% what happens to you and 90% how you react to it.” ― Charles R. Swindoll
     Receive information. Process it. React to it. FAST!!
  6. Flink @ Branch
  7. State Backend
     - Relatively small state backend
     - File-system backed state
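     A file-system backed state backend keeps working state on the TaskManager heap and snapshots it
     to durable storage on each checkpoint, which is a good fit for the relatively small state above.
     A minimal sketch, assuming Flink's FsStateBackend with an HDFS path and a 60-second checkpoint
     interval (both placeholders, not values from the deck):

     import org.apache.flink.runtime.state.filesystem.FsStateBackend;
     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

     public class StateBackendSetup {
         public static void main(String[] args) throws Exception {
             StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

             // Keep operator state on the heap; write checkpoint snapshots to a durable file system.
             env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

             // Periodic checkpoints make the file-system backed state recoverable after a failure.
             env.enableCheckpointing(60_000); // 60 s interval is an assumption
         }
     }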
  8. Parquet
     - Higher compression
     - Read-heavy data set: ingested into Druid and Presto (3M+ queries/day)
     - Avro data format
     - Memory-intensive writes
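     The memory-intensive-writes point follows from how Parquet works: a writer buffers a whole row
     group in memory, column by column, before compressing and flushing it. A minimal sketch with the
     standard parquet-avro writer, assuming a Snappy codec and a 128 MB row group; the path and schema
     are placeholders:

     import org.apache.avro.Schema;
     import org.apache.avro.generic.GenericRecord;
     import org.apache.hadoop.fs.Path;
     import org.apache.parquet.avro.AvroParquetWriter;
     import org.apache.parquet.hadoop.ParquetWriter;
     import org.apache.parquet.hadoop.metadata.CompressionCodecName;

     public class ParquetWriterSketch {
         public static ParquetWriter<GenericRecord> open(Schema schema) throws java.io.IOException {
             return AvroParquetWriter.<GenericRecord>builder(new Path("hdfs:///warehouse/events/part-0.parquet"))
                     // The Avro schema drives the Parquet schema, so upstream Avro records map directly.
                     .withSchema(schema)
                     // Columnar layout plus a codec such as Snappy gives the higher compression noted above.
                     .withCompressionCodec(CompressionCodecName.SNAPPY)
                     // An entire row group is buffered before it is flushed; with many files open
                     // at once this is what makes the writes memory intensive.
                     .withRowGroupSize(128 * 1024 * 1024)
                     .build();
         }
     }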
  9. Writing Parquet with Flink
     Two approaches:
     1) Close the file with checkpointing
  10. Writing Parquet with Flink
      Two approaches:
      a) Close the file with checkpointing
      b) Bucketing file sink
         i) Configured with a custom event-time bucketer, Parquet writer and batch size
         ii) Files are rolled with a timeout of 10 min within a bucket
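      A minimal sketch of approach (a), assuming Avro GenericRecord events and the Flink 1.8-era
      StreamingFileSink, whose bulk (Parquet) part files can only be finalized when a checkpoint
      completes; the path, schema and hourly bucket pattern are placeholders. Approach (b) instead
      uses the older BucketingSink, configured with a custom event-time bucketer, a Parquet writer,
      a batch size and the 10-minute roll-out, which is not shown here.

      import org.apache.avro.Schema;
      import org.apache.avro.generic.GenericRecord;
      import org.apache.flink.core.fs.Path;
      import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
      import org.apache.flink.streaming.api.datastream.DataStream;
      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
      import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
      import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;

      public class ParquetSinkSketch {

          static StreamingFileSink<GenericRecord> parquetSink(Schema schema) {
              return StreamingFileSink
                      // Bulk-encoded output: a part file is closed only when a checkpoint
                      // completes, i.e. "close the file with checkpointing".
                      .forBulkFormat(new Path("hdfs:///warehouse/events"),
                                     ParquetAvroWriters.forGenericRecord(schema))
                      // Hourly buckets, comparable to the H=10 / H=11 layout shown later in the deck.
                      .withBucketAssigner(new DateTimeBucketAssigner<>("'H='HH"))
                      .build();
          }

          static void attach(StreamExecutionEnvironment env, DataStream<GenericRecord> events, Schema schema) {
              env.enableCheckpointing(60_000);      // part files roll when these checkpoints complete
              events.addSink(parquetSink(schema));
          }
      }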
  11. Performance and Scale
      - 100% traffic increase each year
      - Higher parallelism impacts application performance and state size
      - Kafka partitions < Flink parallelism requires a rebalance on the input stream
      - Task manager timeouts
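      When the topic has fewer Kafka partitions than the job's parallelism, only as many source
      subtasks as there are partitions receive data, so the stream has to be redistributed before the
      heavier operators. A minimal sketch, assuming the universal FlinkKafkaConsumer; topic, brokers,
      group id and the 32/64 parallelism values are placeholders:

      import java.util.Properties;
      import org.apache.flink.api.common.serialization.SimpleStringSchema;
      import org.apache.flink.streaming.api.datastream.DataStream;
      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
      import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

      public class RebalanceSketch {
          public static void main(String[] args) throws Exception {
              StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

              Properties props = new Properties();
              props.setProperty("bootstrap.servers", "kafka:9092");
              props.setProperty("group.id", "warehouse");

              DataStream<String> events = env
                      .addSource(new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props))
                      .setParallelism(32)   // match the Kafka partition count at the source
                      .rebalance();         // round-robin records out to the full downstream parallelism

              // Without the rebalance, only 32 of these 64 subtasks would ever see data.
              events.map(String::toUpperCase).setParallelism(64).print();

              env.execute("rebalance-sketch");
          }
      }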
  12. Analyzing memory usage
      ❖ Network buffers
      ❖ Memory segments
      ❖ User code
      ❖ Memory and GC stats
      ❖ JVM parameters
  13. Containerizing Flink - Mesos
      ● Longer start-up time on Mesos
      ● Moved to containerizing the Flink application on Kubernetes
      ● Kubernetes is resource-oriented and declarative
  14. Kubernetes Terms
  15. Flink on Kubernetes @ Branch
      ● Single job per cluster
      ● Docker images
        ○ Flink image - task manager + job manager
        ○ Job launcher - custom launcher + job jar
      ● Job launcher
        ○ Application jar
        ○ Uploads the jar
      ● Config map - Flink config (flink-conf.yaml)
        ○ jobmanager.rpc.address
  16. Auto Scaling
      ● When & how much to scale
        ○ Decided automatically by the job launcher
      ● Scale
        ○ Replica set
      ● Restart the Flink job with the new parallelism
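      The deck does not spell out the launcher's scaling policy, so the sketch below is a hypothetical
      sizing rule only: it derives a target parallelism from observed throughput and measured
      per-subtask capacity, after which the launcher would resize the replica set and resubmit the
      Flink job at that parallelism. Every name and threshold here is an assumption.

      public class ParallelismPlanner {

          /**
           * Pick a parallelism from the observed input rate and per-subtask capacity,
           * capped at the Kafka partition count so source subtasks are not left idle.
           */
          public static int plan(double recordsPerSecond,
                                 double recordsPerSecondPerSubtask,
                                 int kafkaPartitions,
                                 double headroom) {
              int needed = (int) Math.ceil(recordsPerSecond * headroom / recordsPerSecondPerSubtask);
              return Math.max(1, Math.min(needed, kafkaPartitions));
          }

          public static void main(String[] args) {
              // e.g. 200K events/s, ~5K events/s per subtask, 64 partitions, 20% headroom -> 48
              System.out.println(plan(200_000, 5_000, 64, 1.2));
          }
      }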
  17. Failure Recovery
  18. Job / Task Manager Goes Down?
  19. Job / Task Manager Goes Down?
  20. Savepoint Failure
      ● Reasons
        ○ Truncation
        ○ Schema mismatch
        ○ HDFS outage
  21. Savepoint Structure
      [Diagram: job "Foo" output buckets H=10/*.parquet, H=11/*.parquet, H=12/*.in-progress alongside its savepoint tree (Run-id 1 → Job-id 1 → CP-1, CP-2)]
      ● Layout: job/run-id/flink-job-id/cp-x
      ● Run id - incremental number
      ● Job id - Flink job name
  22. Savepoint failure recovery
      [Diagram: after a savepoint failure, a new run (Run-id 2) of the same job-id is started and begins its own checkpoints (CP-x); hourly buckets H=10/*.parquet and H=11/*.parquet are already complete, H=12/*.in-progress is still open]
  23. Auto recovery does not work?
      ● Continuous monitoring and proper alerts
      ● Start the job from the latest offset
      ● Have a different backfill route
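      "Start the job from the latest offset" maps to the Kafka consumer's start-position override,
      which only takes effect when the job is not restored from a savepoint or checkpoint, i.e.
      exactly the manual-restart case described here. A minimal sketch, assuming the universal
      FlinkKafkaConsumer; the topic and schema are placeholders:

      import java.util.Properties;
      import org.apache.flink.api.common.serialization.SimpleStringSchema;
      import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

      public class LatestOffsetRestart {
          public static FlinkKafkaConsumer<String> freshConsumer(Properties props) {
              FlinkKafkaConsumer<String> consumer =
                      new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props);
              // Skip the backlog and resume from the head of the topic; the gap is then
              // repaired through the separate backfill route.
              consumer.setStartFromLatest();
              return consumer;
          }
      }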
  24. Next Steps
      ● Parquet memory consumption (when too many buckets are open)
        ○ Window + RocksDB => Parquet
        ○ Two-stage process
          ● Row-oriented streaming
          ● Batch conversion to columnar
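      A minimal sketch of the proposed two-stage idea, assuming a RocksDB state backend, hourly
      event-time windows and a CSV-like line format; the socket source, the key and timestamp fields,
      and the final print() (standing in for the batch conversion to columnar Parquet) are all
      placeholders, not the deck's design.

      import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
      import org.apache.flink.streaming.api.TimeCharacteristic;
      import org.apache.flink.streaming.api.datastream.DataStream;
      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
      import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
      import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
      import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
      import org.apache.flink.streaming.api.windowing.time.Time;
      import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
      import org.apache.flink.util.Collector;

      public class TwoStageParquetSketch {
          public static void main(String[] args) throws Exception {
              StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
              env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

              // Stage 1: buffer rows per key and hour in RocksDB instead of in Parquet's
              // in-memory column buffers, so many open buckets no longer inflate the heap.
              env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

              DataStream<String> rows = env
                      .socketTextStream("localhost", 9999)            // placeholder row-oriented source
                      .assignTimestampsAndWatermarks(
                              new BoundedOutOfOrdernessTimestampExtractor<String>(Time.minutes(1)) {
                                  @Override
                                  public long extractTimestamp(String line) {
                                      return Long.parseLong(line.split(",")[1]); // placeholder: epoch millis
                                  }
                              });

              rows.keyBy(line -> line.split(",")[0])                  // placeholder key: first CSV field
                  .window(TumblingEventTimeWindows.of(Time.hours(1)))
                  .process(new ProcessWindowFunction<String, String, String, TimeWindow>() {
                      @Override
                      public void process(String key, Context ctx, Iterable<String> bufferedRows,
                                          Collector<String> out) {
                          // Stage 2: a full hour of rows is released at once, so a downstream
                          // batch step can convert them to columnar Parquet in one pass.
                          for (String row : bufferedRows) {
                              out.collect(row);
                          }
                      }
                  })
                  .print();                                           // stands in for the Parquet conversion

              env.execute("two-stage-parquet-sketch");
          }
      }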
  25. Q & A
