Building highly efficient data lakes using Apache Hudi (Incubating)
Even with the exponential growth in data volumes, ingesting/storing/managing big data remains unstandardized & inefficient. Data lakes are a common architectural pattern to organize big data and democratize access across the organization. In this talk, we will discuss different aspects of building sound data lake architectures, pinpointing technical challenges and areas of inefficiency. We will then re-architect the data lake using Apache Hudi (Incubating), which provides streaming primitives right on top of big data. We will show how upserts & incremental change streams provided by Hudi help optimize data ingestion and ETL processing. Further, Apache Hudi manages file sizes and storage layout of the resulting data lake using purely open-source file formats, while also providing optimized query performance & file system listing. We will also provide hands-on tools and guides for trying this out on your own data lake.
Speaker: Vinoth Chandar (Uber)
Vinoth is a Technical Lead on the Uber Data Infrastructure team.
SF Big Analytics 20190612: Building highly efficient data lakes using Apache Hudi
1. Building highly efficient data lakes using Apache Hudi (Incubating)
Vinoth Chandar | Sr. Staff Engineer, Uber
Apache®, Apache Hudi, and the Hudi logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
4. OK.. Maybe not that simple ...
[Architecture diagram: external sources (databases, events, service mesh) feed an ingestion (extract-load) layer that lands data on DFS/cloud storage; the data lake organizes it into raw tables and derived tables with schemas and data audit; queries serve both real-time/OLTP and analytics/OLAP workloads.]
6. Requirement #1: Incremental Database Ingestion
High-value data
- User information in RDBMS
- Trip, transaction logs in NoSQL
Replicate CRUD operations
- Strict ordering guarantees
- Zero data loss
Bulk loads don’t scale
- Adds more load to database
- Wasteful re-writing of data
[Diagram: inserts, updates, and deletes on the MySQL users table (userID int, country string, last_mod long, ...) are replicated into a users table in the Data Lake.]
7. Requirement #2: De-Duping Log Events
High-scale time series data
- Several billions/day
- Few millions/sec
- Heavily aggregated
Cause of duplicates
- Client retries/failures/network errors
- At-least-once data pipes
Overcounting problems
- More impressions => more $
- Low fidelity data
[Diagram: impression events produced by clients are replicated, without duplicates, into an impressions table (event_id string, datestr string, time long, ...) in the Data Lake.]
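A minimal sketch of how Requirement #2 above can be met with Hudi upserts: keying the table on event_id means a retried or duplicated event collapses into a single record on write. The paths and field names are illustrative, and the com.uber.hoodie datasource options mirror the examples shown later on slides 24 and 28, not the exact pipeline used at Uber.

import com.uber.hoodie.DataSourceWriteOptions._
import com.uber.hoodie.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

// raw, possibly-duplicated impression events (path & schema are illustrative)
val impressions = spark.read.json("s3://tmp/raw/impressions/")

// upsert keyed on event_id: a retried event overwrites itself instead of double counting
impressions.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.impressions")
  .option(RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "datestr")
  .option(PRECOMBINE_FIELD_OPT_KEY, "time")   // keep the latest copy when duplicates collide
  .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
  .mode(SaveMode.Append)
  .save("/data/lake/impressions")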
8. Requirement #3: Transactional Writes
Atomic publish of data
- Ingestion can fail midway
- Rollback bad data
Consistency Guarantees
- No partial data exposed
- Repeatable queries
Snapshot Isolation
- Time-travel queries
- Concurrent writer/readers
Strong Durability
- No data loss
9. Requirement #4: Unique Key Constraints
Data model parity
- Enforce upstream primary keys
- 1-1 Mapping w/ source table
- Great data quality!
Transaction Processing
- e.g: Settling orders, fraud detection
- Lakes are well-suited for large scale processes
10. Requirement #5: Faster Derived Data
Multi-stage ETL DAGs
- Very common in batch analytics
- Large amount of data
Derived/ETL tables
- Keep fresh with new/changed raw data
- Star schema/warehousing
Scaling challenges
- Intelligent recomputations
- Window based joins
[Diagram: standardize_fare(row) transforms the raw table raw_trips (id string, datestr string, currency string, fare double) into the derived table std_trips (id string, datestr string, std_fare double, ...).]
11. Requirement #6: File Management
Small Files = Big Problem
- Slow queries
- Stress filesystem metadata
Big Files = Large Delays
- 2GB Parquet writing => ~5-10 mins
File Stitching?
- Band-aid for bullet wound
- Consistency?
- Standardization?
12. Requirement #7: Scalable DFS/Storage RPCs
Ingestion & queries both list DFS
- List folders/files, take action
- Single threaded vs parallel
Subtle gotchas/differences
- Cloud storage => no append()
- S3 => Eventual consistency
- S3 => rename() = copy()
- Large directory listings
- HDFS NameNode bottlenecks
13. Requirement #8: Incremental Copy to Data Marts
Data Marts
- Specialized, often MPP OLAP databases
- e.g. Redshift, Vertica
Online Serving
- Sync ML features to databases
- Throttling syncing rate
Need to sync Lake => Mart
- Full data refresh often very expensive
- Need for incremental egress
14. Requirement #9: Legal Requirements/Data Deletions
Strict rules on data retention
- Delete records
- Correct data
- Raw + Derived tables
Need efficient delete()
- “needle in haystack”
- Indexed on write (point-ish lookup)
- Still optimized for scans
- Propagate deleted records downstream
15. Requirement #10: Late Data Handling
Data often arrives late
- Minutes, Hours, even days
- E.g: credit card txn settlement
Not implicitly complete
- Can lead to large data quality issues
- Trigger recomputation of derived tables
Data arrival tracking
- First class, audit log
- Flexible, rewindable windowing
18. Apache Hudi (Incubating): Storage
● Snapshot isolation between writer & queries
● upsert() support with pluggable indexes
● Atomically publish data with rollback support
● Savepoints for data recovery
● Manages file sizes, layout using statistics
● Async compaction of new & old data
● Timeline metadata to track lineage
19. Apache Hudi (Incubating): Queries/Views of data
● Three logical views on single physical dataset
● Read Optimized View
○ Provides excellent query performance
○ Replaces plain Apache Parquet tables
● Incremental View
○ Change stream to feed downstream jobs/ETLs
● Near-Real time Table
○ Provides queries on real-time data
○ Combination of Apache Parquet & Apache Avro data
[Diagram: query cost vs. data latency trade-off between the READ OPTIMIZED and REALTIME views.]
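As a rough illustration of the two views above, the same physical dataset can be read either as the latest columnar snapshot or as an incremental change stream. This assumes the com.uber.hoodie Spark datasource option constants used later in the deck; exact names and path/glob requirements vary by Hudi version, so treat this as a sketch.

import com.uber.hoodie.DataSourceReadOptions._

// Read Optimized view: query the latest Parquet snapshot
// (RO reads typically take a glob over partition paths; the exact form depends on the version)
val roDF = spark.read.format("com.uber.hoodie")
  .option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_READ_OPTIMIZED_OPT_VAL)
  .load("s3://tables/raw_trips/*/*")

// Incremental view: only the rows that changed after a given commit instant
val incDF = spark.read.format("com.uber.hoodie")
  .option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL)
  .option(BEGIN_INSTANTTIME_OPT_KEY, "20190612000000")   // illustrative instant
  .load("s3://tables/raw_trips")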
20. Hudi: Upserts + Incremental Changes
Incrementalize batch jobs
[Diagram: incoming changes are applied to the Hudi dataset via Hudi upsert; outgoing changes are consumed by downstream jobs via Hudi incremental pull.]
upsert(RDD<Record>)
Updates records if they are already present, otherwise inserts them into their corresponding partitions.
RDD<Record> pullDelta(startTs, endTs)
Gets all the records that changed (updated or inserted) between the start and end time. The delta can span any number of partitions.
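A schematic of how the two primitives above compose into an incrementalized batch job. The names here are hypothetical and mirror the slide, not actual Hudi classes; the concrete Spark datasource equivalents appear on slides 24 and 28.

import org.apache.spark.rdd.RDD

// Hypothetical shapes mirroring the slide's primitives; not real Hudi APIs.
case class Record(key: String, partition: String, payload: String)

trait HudiTable {
  def upsert(records: RDD[Record]): Unit                       // update if present, else insert into partition
  def pullDelta(startTs: String, endTs: String): RDD[Record]   // all records changed between the two instants
}

// An incrementalized batch job: pull only the upstream delta, transform it, upsert it downstream.
def runIncrementally(raw: HudiTable, derived: HudiTable,
                     lastRunTs: String, nowTs: String,
                     transform: Record => Record): Unit = {
  derived.upsert(raw.pullDelta(lastRunTs, nowTs).map(transform))
}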
21. Apache Hudi @ Uber
Foundation for the vast Data Lake
- >1 Trillion records/day
- 10s of PB across the entire Data Lake
- 1000s of pipelines/tables
23. Data Lake built on Apache Hudi
[Architecture diagram: same layout as slide 4, with Hudi in the ingestion and ETL paths: external sources (databases, events, service mesh) land in raw tables on DFS/cloud storage via upsert()/insert(), derived tables are kept up to date via Incr Pull(), and queries serve both real-time/OLTP and analytics/OLAP workloads.]
24. #1: upsert() database changelogs

Step 1: Extract new changes to the users table in MySQL, as avro data files on DFS
(or) use a data integration tool of choice to feed db changelogs to Kafka/event queue

// Command to extract incrementals using sqoop
bin/sqoop import \
  -Dmapreduce.job.user.classpath.first=true \
  --connect jdbc:mysql://localhost/users \
  --username root \
  --password ******* \
  --table users \
  --as-avrodatafile \
  --target-dir s3:///tmp/sqoop/import-1/users

Step 2: Use your favorite datasource to read the extracted data and directly “upsert” the users table on DFS/Hive
(or) use the Hudi DeltaStreamer tool

// Spark Datasource
import com.uber.hoodie.DataSourceWriteOptions._
import com.uber.hoodie.config.HoodieWriteConfig
import com.databricks.spark.avro._   // spark-avro package, for spark.read.avro
import org.apache.spark.sql.SaveMode

// Use the Spark datasource to read the extracted avro files
val inputDataset = spark.read.avro("s3://tmp/sqoop/import-1/users/*")

// save it as a Hudi dataset, upserting on the userID key
inputDataset.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  .option(RECORDKEY_FIELD_OPT_KEY, "userID")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "country")
  .option(PRECOMBINE_FIELD_OPT_KEY, "last_mod")
  .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
  .mode(SaveMode.Append)
  .save("/path/on/dfs")
26. #3: Timeline consistency
Atomic multi-row commits
- Mask partial failures using timeline
- Rollback/savepoint support
Timeline
- Special .hoodie folder
- Actions are instantaneous
MVCC based isolation
- Between queries/ingestion
- Between ingestion/compaction
Future
- Unlimited timeline lookback
27. #4: Keyed update/insert() operations
Ingested record tagging
- Merge updates
- Log inserts
- HoodieRecordPayload interface to support complex merges
Pluggable indexing
- Built-in : Bloom/Range based, HBase
- Scales with long term data growth
- Handles data skews
Future
- Support via SQL
28. #5: Incremental ETL/Data Pipelines

Bring Streaming APIs on the Data Lake
Incrementally pull
- Avoid recomputes!
- Orders of magnitude faster
Transform + upsert
- Avoid rewriting all data
Future
- Incr pull on logs
- Watermark APIs

// Spark Datasource
import com.uber.hoodie.DataSourceReadOptions._
import com.uber.hoodie.DataSourceWriteOptions._
import com.uber.hoodie.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

// Use the Spark datasource to read only the rows of raw_trips changed since the 8AM commit (incremental view)
val hoodieIncViewDF = spark.read.format("com.uber.hoodie")
  .option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL)
  .option(BEGIN_INSTANTTIME_OPT_KEY, commitInstantFor8AM)
  .load("s3://tables/raw_trips")

// standardize fares only for the changed rows
val stdDF = standardize_fare(hoodieIncViewDF)

// save the result as a Hudi dataset, upserting into the derived std_trips table
stdDF.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.std_trips")
  .option(RECORDKEY_FIELD_OPT_KEY, "id")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "datestr")
  .option(PRECOMBINE_FIELD_OPT_KEY, "time")
  .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
  .mode(SaveMode.Append)
  .save("/path/on/dfs")
29. #6: File Sizing & Fast Ingestion
Enforce file size on write
- Pay up cost to keep queries healthy
- Set hoodie.parquet.max.file.size & hoodie.parquet.small.file.limit
- See docs for full list
Near real-time log ingest
- Asynchronous compact & write columnar data
Future
- Support for split/collapse
- Auto tune compression ratio etc
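A minimal sketch of setting the two file-size knobs named above on a datasource write. The byte values are illustrative; see the Hudi configuration docs for defaults and the full list.

import com.uber.hoodie.DataSourceWriteOptions._
import com.uber.hoodie.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

// Illustrative values: target ~128MB files, and treat files under ~100MB as "small"
// so new inserts are packed into them instead of creating more small files.
inputDF.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  .option("hoodie.parquet.max.file.size", (128 * 1024 * 1024).toString)
  .option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString)
  .option(RECORDKEY_FIELD_OPT_KEY, "userID")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "country")
  .option(PRECOMBINE_FIELD_OPT_KEY, "last_mod")
  .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
  .mode(SaveMode.Append)
  .save("/path/on/dfs")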
30. #7: Optimized Timeline/FileSystem APIs
Embedded Timeline Server
- 0-listings from Spark executors
- Incremental file-system views on Spark driver
Consistency Guards
- Masks eventual consistency on S3
- No data file renames, in-place writing
- Storage aware “append” usage
- Graceful MVCC design to handle various
failures
Future
- Standalone timeline server
31. #8: Data Dispersal out of Lake
Incremental pull as sync mechanism
- Only copy updated ML features
- Only copy affected data ranges
Decoupled from ETL writing
- Shock absorber between Lake & Mart
- Enables throttling, retrying, rewinding
Future
- Support Lake => Mart in DeltaStreamer tool
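One way to picture the lake => mart sync described above, as a hedged sketch: read only the rows changed since the last sync via the incremental view, then push them to the mart with Spark's plain JDBC writer. The JDBC endpoint, staging table, and lastSyncInstant bookkeeping are placeholders, not the tooling referenced on the slide.

import com.uber.hoodie.DataSourceReadOptions._
import org.apache.spark.sql.SaveMode

// Rows changed since the last successful sync (lastSyncInstant is tracked by the sync job)
val delta = spark.read.format("com.uber.hoodie")
  .option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL)
  .option(BEGIN_INSTANTTIME_OPT_KEY, lastSyncInstant)
  .load("s3://tables/ml_features")

// Push only the delta into the mart; a full refresh is avoided entirely
delta.write
  .format("jdbc")
  .option("url", "jdbc:redshift://mart-host:5439/analytics")   // placeholder endpoint
  .option("dbtable", "ml_features_staging")
  .option("user", user).option("password", password)           // placeholder credentials
  .mode(SaveMode.Append)
  .save()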
32. #9: Efficient/Fast Deletes
Soft deletes
- upsert(k, null)
- Propagates seamlessly via incr-pull
Hard deletes
- Using EmptyHoodieRecordPayload
Indexing
- 7-10x faster than using regular joins
Future
- Standardized tooling
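A hedged sketch of a hard delete using the EmptyHoodieRecordPayload mentioned above: upsert just the keys to be deleted with the payload class overridden, so the matching records are removed on the next write. The payload-class config key and class package are assumptions that may differ across Hudi versions; verify against your release.

import com.uber.hoodie.DataSourceWriteOptions._
import com.uber.hoodie.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

// DataFrame containing only the record keys (and partition fields) to be deleted
val toDelete = spark.read.json("s3://tmp/deletion-requests/2019-06-12/")

toDelete.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  .option(RECORDKEY_FIELD_OPT_KEY, "userID")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "country")
  .option(PRECOMBINE_FIELD_OPT_KEY, "last_mod")
  // assumption: payload-class config key; emitting an empty payload deletes the record
  .option("hoodie.datasource.write.payload.class", "com.uber.hoodie.EmptyHoodieRecordPayload")
  .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
  .mode(SaveMode.Append)
  .save("/path/on/dfs")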
33. #10: Safe Reprocessing
Identify late data
- Timeline tracks all write activity
- E.g: obtain bounds on lateness
Adjust incremental pull windows
- Still much more efficient than bulk recomputation
Future
- Support parrival(data, window) APIs in TimelineServer
- Apache Beam support for composing safe, incremental pipelines
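To illustrate "adjust incremental pull windows": once the timeline tells you how late the data was, the recompute can be bounded with begin/end instant times instead of reprocessing the whole table. BEGIN_INSTANTTIME_OPT_KEY appears on slide 28; the END_INSTANTTIME_OPT_KEY counterpart and the instants shown here are assumptions to check against your Hudi version.

import com.uber.hoodie.DataSourceReadOptions._

// Re-pull just the window that late-arriving records fall into (instants are illustrative)
val lateWindowDF = spark.read.format("com.uber.hoodie")
  .option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL)
  .option(BEGIN_INSTANTTIME_OPT_KEY, "20190610080000")
  .option(END_INSTANTTIME_OPT_KEY, "20190610120000")   // assumption: end-instant option
  .load("s3://tables/raw_trips")

// Recompute only the affected derived rows and upsert them, exactly as on slide 28
val reprocessed = standardize_fare(lateWindowDF)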
35. Current Status
Where we are at
● Committed to open, vendor neutral data lake standard
● 2+ yrs of OSS community support
● First Apache release imminent
● EMIS Health, Yields.io + more in production
● Many more companies trying it out
● Production tested on cloud
● hudi.apache.org/community.html
36. 2019 Roadmap
Key initiatives
Bootstrapping tables into Hudi
- With indexing benefits
- Convenient tooling
Standalone Timeline Server
- Eliminate fs listings for query planning/ingestion
- Track column-level statistics for queries
Smart storage layouts
- Increase file sizes for older data
- Re-clustering data for queries