Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsar Summit NA 2021

Apache Hudi is an open data lake platform, designed around the streaming data model. At its core, Hudi provides transactions, upserts and deletes on data lake storage, while also enabling CDC capabilities. Hudi also provides a coherent set of table services, which can clean, compact, cluster and optimize storage layout for better query performance. Finally, Hudi's data services provide out-of-the-box support for streaming data from event systems into lake storage in near real-time.

In this talk, we will walk through an end-to-end use case for change data capture from a relational database, starting by capturing changes using the Pulsar CDC connector, and then demonstrating how you can use the Hudi DeltaStreamer tool to apply these changes to a table on the data lake. We will discuss various tips for operationalizing and monitoring such pipelines. We will conclude with some guidance on future integrations between the two projects, including a native Hudi/Pulsar connector and Hudi tiered storage.


  1. Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
  2. Speaker Bio
     PMC Chair/Creator of Hudi
     Sr. Staff Eng @ Uber (Data Infra/Platforms, Networking)
     Principal Eng @ Confluent (ksqlDB, Kafka/Streams)
     Staff Eng @ LinkedIn (Voldemort, DDS)
     Sr. Eng @ Oracle (CDC/GoldenGate/XStream)
  3. Agenda
     1) Background On CDC
     2) Make a Lake
     3) Hudi Deep Dive
     4) Onwards
  4. Background: CDC, Data Lakes - What, Why
  5. Change Data Capture
     Design Pattern for Data Integration
     - Not tied to any particular technology
     - Deliver low-latency
     System for tracking, fetching new data
     - Not concerned with how to use such data
     - Ideally, incremental update downstream
     - Minimizing number of bits read/written per change
     Change is the ONLY Constant
     - Even in Computer Science
     - Data is immutable = Myth (well, kinda)
  6. Examples of CDC
     Polling an external API for new events
     - Timestamps, status indicators, versions
     - Simple, works for small-scale data changes
     - E.g.: Polling the GitHub events API
     Emit events directly from the application
     - Data model to encode deltas
     - Scales for high-volume data changes
     - E.g.: Emitting sensor state changes to Pulsar
     Scanning a database's redo log
     - SCN and other watermarks to extract data/metadata changes
     - Operationally heavy, very high fidelity
     - E.g.: Using Debezium to obtain changelogs from MySQL
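The first pattern above is easy to sketch: poll an HTTP events API and keep a watermark so each poll only fetches what changed. A minimal Scala sketch, assuming a generic endpoint; the ETag-based conditional request mirrors how the GitHub events API supports polling, and everything else (names, response handling) is illustrative:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Naive CDC by polling: remember a watermark (here an ETag) and only fetch
// events newer than the last poll. A timestamp or version column works the same way.
object PollingCdc {
  private val client = HttpClient.newHttpClient()

  // Returns (new watermark, new payload if anything changed).
  def pollOnce(url: String, lastEtag: Option[String]): (Option[String], Option[String]) = {
    val builder = HttpRequest.newBuilder(URI.create(url)).GET()
    lastEtag.foreach(tag => builder.header("If-None-Match", tag))
    val resp = client.send(builder.build(), HttpResponse.BodyHandlers.ofString())

    if (resp.statusCode() == 304) (lastEtag, None) // nothing changed since the last poll
    else {
      val newTag = Option(resp.headers().firstValue("ETag").orElse(null))
      (newTag, Some(resp.body()))                  // hand the new events downstream
    }
  }
}
```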
  7. CDC vs ETL?
     CDC is merely Incremental Extraction
     - Not really competing concepts
     - ETL needs one-time full bootstrap
     - <>
     CDC changes T and L significantly
     - T on change streams, not just table state
     - L incrementally, not just bulk reloads
  8. CDC vs Stream Processing
     CDC enables Streaming ETL
     - Why bulk T & L anymore?
     - Process change streams
     - Mutable Sinks
     Reliable Stream Processing needs distributed logs
     - Rewind/Replay CDC logs
     - Absorb spikes/batch writes to sinks
  9. Ideal CDC Source
     Support reliable incremental consumption
     - <>
     Support rewinding/replay
     - <>
     Support ordering of changes
     - <>
  10. Ideal CDC Sink
      Mutable, Transactional
      - <>
      Quickly absorb changes
      - <>
      Bonus: Also act as CDC Source
      - <>
  11. Data Lakes
      Architectural Pattern for Analytical Data
      - Data Lake != Spark, Flink
      - Data Lake != Files on S3
      - <>
      Raw Data
      - <>
      Derived Data
      - <>
  12. CDC to Data Lakes
      [Diagram labels: Database, Events, Apps/Services, External Sources, Change Stream, DFS/Cloud Storage, Tables, Queries, Operational Data Infrastructure, Analytics Data Infrastructure]
  13. Make a Lake: Putting Pulsar and Hudi to work
  14. Data Flow Design
      <show diagram showing e2e data flow>
      - <..>
  15. Prerequisites
      Running MySQL Instance (RDS)
      - <..>
      Running Pulsar Cluster (??)
      - <..>
      Running Spark Cluster (e.g. EMR)
      - <..>
  16. Test Data
      Explain 'users' table
      - <..>
      Explain 'github_events' data emitted into Pulsar
      - <..>
  17. #1: Set Up CDC Connector
      <Show configurations>
      - <..>
      <Sample data out of Pulsar>
      - <..>
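The actual configuration is left to the live demo; a minimal sketch of what a Pulsar IO Debezium MySQL source config could look like, assuming the 'users' table lives in a database called appdb (hostname, credentials, server name, topic names and connector version below are illustrative placeholders, not values from the talk):

```yaml
# debezium-mysql-source.yaml -- illustrative values only
tenant: "public"
namespace: "default"
name: "mysql-users-cdc"
topicName: "mysql-users-cdc-topic"
archive: "connectors/pulsar-io-debezium-mysql-2.8.0.nar"
parallelism: 1
configs:
  database.hostname: "my-rds-host.amazonaws.com"
  database.port: "3306"
  database.user: "debezium"
  database.password: "dbz"
  database.server.id: "184054"
  database.server.name: "dbserver1"
  database.whitelist: "appdb"
  database.history.pulsar.topic: "mysql-users-history"
  database.history.pulsar.service.url: "pulsar://localhost:6650"
```

It could then be submitted with something like:

```sh
bin/pulsar-admin sources create --source-config-file debezium-mysql-source.yaml
```

Change events typically land on per-table topics named after the Debezium server/database/table (e.g. dbserver1.appdb.users), which is what the next step would read from.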
  18. #2: Kick Off Hudi DeltaStreamer
      <Show configurations, Command to submit>
      - <..>
      <Query data out of Hudi tables>
      - <..>
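The submission command is likewise left to the demo; a hedged sketch of a typical HoodieDeltaStreamer invocation (jar path/version, bucket, table name and ordering field are illustrative; reading directly from Pulsar would need a Pulsar-capable source such as the PulsarSource added in later Hudi releases, or a Kafka-compatible endpoint for JsonKafkaSource):

```sh
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  /path/to/hudi-utilities-bundle_2.12-0.8.0.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts \
  --target-base-path s3://my-bucket/lake/users \
  --target-table users \
  --props /path/to/source.properties \
  --continuous   # keep applying new changes instead of exiting after one round
```

The properties file passed via --props would carry the source topic/endpoint and schema-provider settings for whichever source class is used.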
  19. #3: Streaming ETL Using Hudi
      <Show how to CDC from Hudi itself>
      - <..>
      <Sample pipeline that does some enrichment of events>
      - <..>
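Slide 19's pipeline can be pictured with Hudi's incremental query as the source: pull only the rows committed after a checkpoint, enrich them, and upsert into a derived Hudi table. A spark-shell style Scala sketch; paths, column names, the checkpoint value and the join are illustrative, not the talk's code:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("users-enrichment").getOrCreate()

// Pull only records committed to the raw table after the last processed instant.
val lastCheckpoint = "20210601000000" // illustrative; persist the latest seen instant between runs
val changes = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", lastCheckpoint)
  .load("s3://my-bucket/lake/users")

// Illustrative enrichment: attach a country dimension to each changed user row.
val countries = spark.read.parquet("s3://my-bucket/dims/countries")
val enriched  = changes.join(countries, Seq("country_code"), "left")

// Upsert the enriched changes into a derived Hudi table, keyed by user_id.
enriched.write.format("hudi")
  .option("hoodie.table.name", "users_enriched")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "user_id")
  .option("hoodie.datasource.write.partitionpath.field", "country_code")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode("append")
  .save("s3://my-bucket/lake/users_enriched")
```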
  20. Hudi Deep Dive: Intro, Components, APIs, Design Choices
  21. Hudi Data Lake
      Original pioneer of the transactional data lake movement
      Embeddable, Serverless, Distributed
      Database abstraction layer over DFS
      - We invented this! Hadoop Upserts, Deletes & Incrementals
      Provides transactional updates/deletes
      First class support for record level CDC streams
  22. Stream Processing is Fast & Efficient
      Streaming Stack
      + Intelligent, Incremental
      + Fast, Efficient
      - Row oriented
      - Not scan optimized
      Batch Stack
      + Scans, Columnar formats
      + Scalable Compute
      - Naive, Inefficient
  23. What If: Streaming Model on Batch Data?
      The Incremental Stack
      + Intelligent, Incremental
      + Fast, Efficient
      + Scans, Columnar formats
      + Scalable Compute
      https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/ (2016)
  24. Hudi: Open Sourcing & Evolution
      2015: Published core ideas/principles for incremental processing (O'Reilly article)
      2016: Project created at Uber & powers all database/business critical feeds @ Uber
      2017: Project open sourced by Uber & work began on Merge-On-Read, cloud support
      2018: Picked up adopters, hardening, async compaction
      2019: Incubated into ASF, community growth, added more platform components
      2020: Top level Apache project, over 10x growth in community, downloads, adoption
      2021: SQL DMLs, Flink Continuous Queries, more indexing schemes, Metaserver, Caching
  25. Apache Hudi - Adoption
      Committers/Contributors: Uber, AWS, Alibaba, Tencent, Robinhood, Moveworks, Confluent, Snowflake, ByteDance, Zendesk, Yotpo and more
      https://hudi.apache.org/docs/powered_by.html
  26. The Hudi Stack
      Complete "data" lake platform
      Tightly integrated, self managing
      Write using Spark, Flink
      Query using Spark, Flink, Hive, Presto, Trino, Impala, AWS Athena/Redshift, Aliyun DLA, etc.
      Out-of-box tools/services for painless dataops
  27. Design of a Hudi Table
  28. File Layout
  29. File Groups & Slices
  30. Query Types
      Read Optimized Query at 10:10
      Snapshot Query at 10:10
      Incremental Query (10:08, 10:10)
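For reference, all three query types are driven by a single read option in Spark; a small spark-shell style sketch with an illustrative table path and instant times:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val basePath = "s3://my-bucket/lake/users"

// Snapshot query: latest merged view of the table (the default).
val snapshot = spark.read.format("hudi").load(basePath)

// Read-optimized query: only compacted base files; faster, possibly stale for MOR tables.
val readOptimized = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load(basePath)

// Incremental query: records that changed between two commit instants.
val incremental = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20210601100800")
  .option("hoodie.datasource.read.end.instanttime", "20210601101000")
  .load(basePath)
```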
  31. Our Design Goals
      Streaming/Incremental
      - Upsert/Delete Optimized
      - Key based operations
      Faster
      - Frequent Commits
      - Design around logs
      - Minimize overhead
  32. Delta Logs at File Level over Global
      Each file group is its own self-contained log
      - Constant metadata size, controlled by "retention" parameters
      - Leverage append() when available; lower metadata overhead
      Merges are local to each file group
      - UUID keys throw off any range pruning
  33. Record Indexes over Just File/Column Stats
      Index maps key to a file group
      - During upserts/deletes
      - Much like a streaming state store
      Workloads have different shapes
      - Late arriving updates; totally random
      - Trickle down to derived tables
      Many pluggable options
      - Bloom Filters + Key ranges
      - HBase, Join based
      - Global vs Local
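The index choice is a per-write config; a hedged Scala sketch of where those knobs sit (paths and field names are illustrative, the config keys are standard Hudi write options):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
// Illustrative batch of incoming change records (e.g. staged from the CDC topic).
val updates = spark.read.json("s3://my-bucket/staging/user_updates")

updates.write.format("hudi")
  .option("hoodie.table.name", "users")
  .option("hoodie.datasource.write.recordkey.field", "user_id")
  .option("hoodie.datasource.write.partitionpath.field", "country_code")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.index.type", "BLOOM")                  // or GLOBAL_BLOOM, SIMPLE, HBASE
  .option("hoodie.bloom.index.prune.by.ranges", "false") // range pruning helps little when keys are random UUIDs
  .mode("append")
  .save("s3://my-bucket/lake/users")
```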
  34. MVCC Concurrency Control over Only OCC
      Frequent commits => more frequent clustering/compaction => more contention
      Differentiate writers vs table services
      - Much like what databases do
      - Table services don't contend with writers
      - Async compaction/clustering
      Don't be so "Optimistic"
      - OCC b/w writers works, until it doesn't
      - Retries, split txns, wasted resources
      - MVCC/Log based between writers/table services
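In practice this translates into keeping compaction/clustering off the writer's critical path and reserving OCC for genuinely concurrent writers. An illustrative properties snippet (key names taken from Hudi's write/lock configs of that era, values are placeholders) that could, for example, be passed through the DeltaStreamer --props file:

```properties
# Keep table services out of the writer's path (log/MVCC based, the default):
hoodie.compact.inline=false
hoodie.clustering.inline=false

# Opt into OCC only when multiple independent writers are truly needed:
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
hoodie.write.lock.zookeeper.url=zk-host
hoodie.write.lock.zookeeper.port=2181
hoodie.write.lock.zookeeper.base_path=/hudi/locks
```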
  35. Record Level Merge API over Only Overwrites
      More generalized approach
      - Default: overwrite, w/ latest writer wins
      - Support business-specific resolution
      Log partial updates
      - Log just the changed columns
      - Drastic reduction in write amplification
      Log based reconciliation
      - Delete, undelete based on business logic
      - CRDT, Operational Transform-like delayed conflict resolution
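A hedged sketch of what "business-specific resolution" could look like against Hudi's record payload API; the class name, the updated_at field and the policy are made up for illustration, with OverwriteWithLatestAvroPayload as the default behavior being overridden:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericRecord, IndexedRecord}
import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
import org.apache.hudi.common.util.{Option => HOption}

// Hypothetical payload: instead of blindly taking the latest write, keep whichever
// version has the greater application-level `updated_at` value (field name made up).
class LatestUpdateWinsPayload(record: GenericRecord, orderingVal: Comparable[_])
    extends OverwriteWithLatestAvroPayload(record, orderingVal) {

  override def combineAndGetUpdateValue(currentValue: IndexedRecord,
                                        schema: Schema): HOption[IndexedRecord] = {
    val incoming = getInsertValue(schema)
    if (!incoming.isPresent) return HOption.empty() // incoming change was a delete; honor it
    val inRec  = incoming.get.asInstanceOf[GenericRecord]
    val curRec = currentValue.asInstanceOf[GenericRecord]
    // Assumes `updated_at` is lexicographically comparable (e.g. ISO-8601 strings).
    val keepIncoming = inRec.get("updated_at").toString >= curRec.get("updated_at").toString
    HOption.of(if (keepIncoming) inRec else currentValue)
  }
}
```

Such a class would be selected through the hoodie.datasource.write.payload.class writer config; again, a sketch rather than the talk's implementation.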
  36. Specialized Database over Generalized Format
      Approach it more like a shared-nothing database
      - Daemons aware of each other
      - E.g.: Compaction, Cleaning in RocksDB
      E.g.: Clustering & Compaction know each other
      - Reconcile metadata based on time order
      - Compactions avoid redundant scheduling
      Self Managing
      - Sorting, time-order preservation, file sizing
  37. Record level CDC over File/Snapshot Diffing
      Per record metadata
      - _hoodie_commit_time: Kafka style compacted change streams in commit order
      - _hoodie_commit_seqno: Consume large commits in chunks, a la Kafka offsets
      File group design => CDC friendly
      - Efficient retrieval of old, new values
      - Efficient retrieval of all values for a key
      Infinite Retention/Lookback coming later in 2021
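Those metadata columns are ordinary queryable columns, which is what makes chunked, offset-style consumption possible; a small spark-shell style example with an illustrative table path and columns:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Every Hudi table exposes _hoodie_commit_time and _hoodie_commit_seqno per record;
// ordering by them yields a replayable, Kafka-offset-style change stream.
spark.read.format("hudi")
  .load("s3://my-bucket/lake/users")
  .select("_hoodie_commit_time", "_hoodie_commit_seqno", "user_id", "email")
  .orderBy("_hoodie_commit_time", "_hoodie_commit_seqno")
  .show(20, false)
```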
  38. Onwards: Ideas, Ongoing Work, Future Plans
  39. Scalable, Multi Model Indexes
      Partitions are very coarse file-level indexes
      Finer grained indexes as new partitions to the metadata table
      - Bloom Filter, Bitmaps
      - Column ranges (RFC-27)
      - HFile/Hash indexes
      - Search?
      External indexes
      - DynamoDB, Spanner + other cloud stores
      - C*, Mongo and others
  40. Caching
      LRU Cache a la DB Buffer Pool
      Frequent Commits => Small objects/blocks
      - Today: aggressively run table services
      - Tomorrow: File Group/Hudi file model aware caching
      - Mutable data => FileSystem/Block level caches are not that effective
      Benefits
      - Great performance for CDC tables
      - Avoid open/close costs for small objects
  41. Timeline Metaserver
      Interesting fact: Hudi has a metaserver already
      - Runs on the Spark driver; serves FileSystem RPCs + queries on the timeline
      - Backed by RocksDB, updated incrementally on every timeline action
      - Very useful in streaming jobs
      - But, still standalone
      Data lakes need a new metaserver
      - Flat file metastores are cool? (really?)
      - Sometimes I miss HMS (sometimes..)
      - Let's learn from Cloud warehouses
  42. Beyond Just Lake Engines
  43. Pulsar Sink
      <Outline strawman design, Hudi facing work, Call for collab>
  44. Pulsar Tiered Storage
      <Research sharing current challenges, call for collaboration>
  45. Engage With Our Community
      User Docs: https://hudi.apache.org
      Technical Wiki: https://cwiki.apache.org/confluence/display/HUDI
      GitHub: https://github.com/apache/hudi/
      Twitter: https://twitter.com/apachehudi
      Mailing list(s): dev-subscribe@hudi.apache.org (send an empty email to subscribe), dev@hudi.apache.org (actual mailing list)
      Slack: https://join.slack.com/t/apache-hudi/signup
  46. Thanks! Questions?
  47. Hudi powers one of the largest transactional data lakes on the planet @ Uber
      Operated 150PB+ Data Lake platform for 4+ years
      Multi engine environment with Presto, Spark, Hive, Vertica & more
      Architected several data services for deletion/GDPR across 15K+ data users
      Mission critical to all of Uber w/ data monitoring/schemas/quality enforcement
      Hudi @ Uber: ~8000 Tables, 150+ PB, 3-30 Mins Fresh, ~1.5 PB/day, ~850 million vcore-secs, ~4 Engines

