Building highly efficient data lakes using Apache Hudi (Incubating)
Vinoth Chandar | Sr. Staff Engineer, Uber
Apache®, Apache Hudi, and the Hudi logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
Data Architectures
Lakes, Marts, Silos
Simple… Right?
[Diagram: Database, Events, Service Mesh, External Sources (Real-time/OLTP) → Extract-Transform-Load → Tables on DFS/Cloud Storage → Queries (Analytics/OLAP)]
OK... Maybe not that simple...
[Diagram: Database, Events, Service Mesh, External Sources (Real-time/OLTP) → Ingestion (Extract-Load) → Data Lake on DFS/Cloud Storage holding Raw Tables and Derived Tables, with Schemas and Data Audit → Queries (Analytics/OLAP)]
Data Lake Implementation: It's actually hard.
Requirement #1: Incremental Database Ingestion
High-value data
- User information in RDBMS
- Trip, transaction logs in NoSQL
Replicate CRUD operations
- Strict ordering guarantees
- Zero data loss
Bulk loads don't scale
- Adds more load to database
- Wasteful re-writing of data
[Diagram: MySQL users table (userID int, country string, last_mod long, ...) replicated into the Data Lake, carrying inserts, updates, and deletes]
Requirement #2: De-Duping Log Events
High-scale time series data
- Several billions/day
- Few millions/sec
- Heavily aggregated
Cause of duplicates
- Client retries/failures/network errors
- At-least-once data pipes
Overcounting problems
- More impressions => more $
- Low fidelity data
[Diagram: impression events (event_id string, datestr string, time long, ...) produced upstream and replicated into the Data Lake without duplicates]
Requirement #3: Transactional Writes
Atomic publish of data
- Ingestion can fail midway
- Rollback bad data
Consistency Guarantees
- No partial data exposed
- Repeatable queries
Snapshot Isolation
- Time-travel queries
- Concurrent writer/readers
Strong Durability
- No data loss
Requirement #4: Unique Key Constraints
Data model parity
- Enforce upstream primary keys
- 1-1 Mapping w/ source table
- Great data quality!
Transaction Processing
- e.g: Settling orders, fraud detection
- Lakes are well-suited for large-scale processes
Requirement #5: Faster Derived Data
Multi-stage ETL DAGs
- Very common in batch analytics
- Large amount of data
Derived/ETL tables
- Keep fresh with new/changed raw data
- Star schema/warehousing
Scaling challenges
- Intelligent recomputations
- Window based joins
[Diagram: raw_trips (id, datestr, currency, fare, ...) → standardize_fare(row) → std_trips (id, datestr, std_fare, ...)]
Requirement #6: File Management
Small Files = Big Problem
- Slow queries
- Stress filesystem metadata
Big Files = Large Delays
- 2GB Parquet write => ~5-10 mins
File Stitching?
- Band-aid for bullet wound
- Consistency?
- Standardization?
Requirement #7: Scalable DFS/Storage RPCs
Ingestion/Queries all list DFS
- List folders/files, take action
- Single threaded vs parallel
Subtle gotchas/differences
- Cloud storage => no append()
- S3 => Eventual consistency
- S3 => rename() = copy()
- Large directory listings
- HDFS NameNode bottlenecks
Requirement #8: Incremental Copy to Data marts
Data Marts
- Specialized, often MPP OLAP databases
- E.g., Redshift, Vertica
Online Serving
- Sync ML features to databases
- Throttling syncing rate
Need to sync Lake => Mart
- Full data refresh often very expensive
- Need for incremental egress
Requirement #9: Legal Requirements/Data Deletions
Strict rules on data retention
- Delete records
- Correct data
- Raw + Derived tables
Need efficient delete()
- “needle in haystack”
- Indexed on write (point-ish lookup)
- Still optimized for scans
- Propagate deleted records downstream
Requirement #10: Late Data Handling
Data often arrives late
- Minutes, Hours, even days
- E.g: credit card txn settlement
Not implicitly complete
- Can lead to large data quality issues
- Trigger recomputation of derived tables
Data arrival tracking
- First class, audit log
- Flexible, rewindable windowing
Apache Hudi
At a glance
Apache Hudi (Incubating)
Overview
● Snapshot isolation between writer & queries
● upsert() support with pluggable indexes
● Atomically publish data with rollback support
● Savepoints for data recovery
● Manages file sizes, layout using statistics
● Async compaction of new & old data
● Timeline metadata to track lineage
Apache Hudi (Incubating)
Storage
● Three logical views on single physical dataset
● Read Optimized View
○ Provides excellent query performance
○ Replaces plain Apache Parquet tables
● Incremental View
○ Change stream to feed downstream jobs/ETLs
● Near-Real time Table
○ Provides queries on real-time data
○ Combination of Apache Parquet & Apache Avro data
Apache Hudi (Incubating)
Queries/Views of data
[Chart: REALTIME vs READ OPTIMIZED views plotted along Cost and Latency axes]
Hudi: Upserts + Incremental Changes
Incrementalize batch jobs
[Diagram: Incoming Changes → Hudi upsert → Dataset → Hudi Incremental Pull → Outgoing Changes]
upsert(RDD<Record>)
Updates records if already present, or inserts them into their corresponding partitions.
RDD<Record> pullDelta(startTs, endTs)
Gets all the records that changed (updated or inserted) between the start and end time. The delta can span any number of partitions.
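
To make the pattern concrete, here is a minimal sketch of the pullDelta() side, expressed against the Spark datasource used later in this deck (the table path and the startTs instant are placeholders):

import com.uber.hoodie.DataSourceReadOptions._

// pull only the records that changed since a given commit instant,
// mirroring pullDelta(startTs, endTs)
val changes = spark.read.format("com.uber.hoodie")
  .option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL)
  .option(BEGIN_INSTANTTIME_OPT_KEY, startTs)
  .load("/path/on/dfs")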
Apache Hudi @ Uber
Foundation for the vast Data Lake
>1 Trillion records/day
10s of PB across the entire Data Lake
1000s of pipelines/tables
Apache Hudi Data Lake
Meeting the requirements
Data Lake built on Apache Hudi
[Diagram: Database, Events, Service Mesh, External Sources (Real-time/OLTP) → Ingestion (Extract-Load) via upsert()/insert() → Data Lake on DFS/Cloud Storage with Raw Tables feeding Derived Tables via Incr Pull() → Queries (Analytics/OLAP)]
#1: upsert() database changelogs
// Command to extract incrementals using sqoop
bin/sqoop import \
  -Dmapreduce.job.user.classpath.first=true \
  --connect jdbc:mysql://localhost/users \
  --username root \
  --password ******* \
  --table users \
  --as-avrodatafile \
  --target-dir s3:///tmp/sqoop/import-1/users
// Spark Datasource
import com.uber.hoodie.DataSourceWriteOptions._
import com.uber.hoodie.config.HoodieWriteConfig
import com.databricks.spark.avro._  // avro support via the spark-avro package
import org.apache.spark.sql.SaveMode

// Use Spark datasource to read the extracted avro files
val inputDataset = spark.read.avro("s3://tmp/sqoop/import-1/users/*")

// save it as a Hudi dataset
inputDataset.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  .option(RECORDKEY_FIELD_OPT_KEY, "userID")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "country")
  .option(PRECOMBINE_FIELD_OPT_KEY, "last_mod")
  .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
  .mode(SaveMode.Append)
  .save("/path/on/dfs")
Step 1: Extract new changes to the users table in MySQL, as avro data files on DFS
(or)
Use data integration tool of choice to feed db changelogs to Kafka/event queue
Step 2: Use your fav datasource to read extracted data and directly "upsert" the users table on DFS/Hive
(or)
Use the Hudi DeltaStreamer tool
#2: Filter out duplicate events
// Deltastreamer command to ingest kafka events, dedupe, ingest
spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
  /path/to/hoodie-utilities-bundle-*.jar \
  --props s3://path/to/kafka-source.properties \
  --schemaprovider-class com.uber.hoodie.utilities.schema.SchemaRegistryProvider \
  --source-class com.uber.hoodie.utilities.sources.AvroKafkaSource \
  --source-ordering-field time \
  --target-base-path s3:///hoodie-deltastreamer/impressions \
  --target-table uber.impressions \
  --op BULK_INSERT \
  --filter-dupes

// kafka-source.properties
include=base.properties
# Key fields, for kafka example
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.partitionpath.field=datestr
# schema provider configs
hoodie.deltastreamer.schemaprovider.registry.url=http://localhost:8081/subjects/impressions-value/versions/latest
# Kafka Source
hoodie.deltastreamer.source.kafka.topic=impressions
# Kafka props
metadata.broker.list=localhost:9092
auto.offset.reset=smallest
schema.registry.url=http://localhost:8081
#3: Timeline consistency
Atomic multi-row commits
- Mask partial failures using timeline
- Rollback/savepoint support
Timeline
- Special .hoodie folder
- Actions are instantaneous
MVCC based isolation
- Between queries/ingestion
- Between ingestion/compaction
Future
- Unlimited timeline lookback
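
As a sketch, the timeline can also be inspected programmatically; this assumes the HoodieDataSourceHelpers utility shipped with this generation of Hudi (method names are illustrative, verify against your version):

import org.apache.hadoop.fs.FileSystem
import com.uber.hoodie.HoodieDataSourceHelpers

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// did any commits complete on the timeline after this instant?
val hasNew = HoodieDataSourceHelpers.hasNewCommits(fs, "/path/on/dfs", "20190612000000")
// latest successfully committed instant, i.e. the newest consistent snapshot
val latest = HoodieDataSourceHelpers.latestCommit(fs, "/path/on/dfs")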
#4: Keyed update/insert() operations
Ingested record tagging
- Merge updates
- Log inserts
- HoodieRecordPayload interface to support complex merges
Pluggable indexing
- Built-in: Bloom/Range based, HBase
- Scales with long term data growth
- Handles data skews
Future
- Support via SQL
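
A hedged sketch of wiring in a custom merge: point the write at your own HoodieRecordPayload implementation through the payload-class option (the class name below is hypothetical, and the option key is an assumption for this Hudi generation):

import com.uber.hoodie.DataSourceWriteOptions._
import com.uber.hoodie.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

inputDataset.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  .option(RECORDKEY_FIELD_OPT_KEY, "userID")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "country")
  .option(PRECOMBINE_FIELD_OPT_KEY, "last_mod")
  // hypothetical class implementing HoodieRecordPayload with domain-specific merge logic
  .option("hoodie.datasource.write.payload.class", "com.example.CustomMergePayload")
  .mode(SaveMode.Append)
  .save("/path/on/dfs")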
#5: Incremental ETL/Data Pipelines
// Spark Datasource
import com.uber.hoodie.DataSourceWriteOptions._
import com.uber.hoodie.DataSourceReadOptions._
import com.uber.hoodie.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

// Use Spark datasource to incrementally read raw_trips since the 8AM commit
val hoodieIncViewDF = spark.read.format("com.uber.hoodie")
  .option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL)
  .option(BEGIN_INSTANTTIME_OPT_KEY, commitInstantFor8AM)
  .load("s3://tables/raw_trips")

val stdDF = standardize_fare(hoodieIncViewDF)

// save the standardized rows as a Hudi dataset
stdDF.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.std_trips")
  .option(RECORDKEY_FIELD_OPT_KEY, "id")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "datestr")
  .option(PRECOMBINE_FIELD_OPT_KEY, "time")
  .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
  .mode(SaveMode.Append)
  .save("/path/on/dfs")
Bring Streaming APIs on Data Lake
Incrementally pull
- Avoid recomputes!
- Orders of magnitude faster
Transform + upsert
- Avoid rewriting all data
Future
- Incr pull on logs
- Watermark APIs
#6: File Sizing & Fast Ingestion
Enforce file size on write
- Pay the cost upfront to keep queries healthy
- Set hoodie.parquet.max.file.size & hoodie.parquet.small.file.limit (see sketch below)
- See docs for full list
Near real-time log ingest
- Asynchronously compact & write columnar data
Future
- Support for split/collapse
- Auto tune compression ratio etc
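
The sketch referenced above: enforcing file sizes is just two write-side options. The values are illustrative, not recommendations (~1GB max base file, pad files under ~100MB on later writes):

import com.uber.hoodie.DataSourceWriteOptions._
import com.uber.hoodie.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

inputDataset.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  .option(RECORDKEY_FIELD_OPT_KEY, "userID")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "country")
  .option(PRECOMBINE_FIELD_OPT_KEY, "last_mod")
  // illustrative sizing targets: 1GB max, 100MB small-file threshold
  .option("hoodie.parquet.max.file.size", (1024L * 1024 * 1024).toString)
  .option("hoodie.parquet.small.file.limit", (100L * 1024 * 1024).toString)
  .mode(SaveMode.Append)
  .save("/path/on/dfs")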
#7: Optimized Timeline/FileSystem APIs
Embedded Timeline Server
- 0-listings from Spark executors
- Incremental file-system views on Spark driver
Consistency Guards
- Masks eventual consistency on S3
- No data file renames, in-place writing
- Storage aware “append” usage
- Graceful MVCC design to handle various failures
Future
- Standalone timeline server
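
Both features are driven by write configs; a minimal sketch follows, where the key names are assumptions based on this generation of Hudi and the S3 path is a placeholder:

import com.uber.hoodie.DataSourceWriteOptions._
import com.uber.hoodie.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

inputDataset.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  .option(RECORDKEY_FIELD_OPT_KEY, "userID")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "country")
  .option(PRECOMBINE_FIELD_OPT_KEY, "last_mod")
  // serve file listings/views from an embedded server on the Spark driver
  .option("hoodie.embed.timeline.server", "true")
  // wait out S3 eventual consistency before finalizing written files
  .option("hoodie.consistency.check.enabled", "true")
  .mode(SaveMode.Append)
  .save("s3://bucket/tables/users")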
#8: Data Dispersal out of Lake
Incremental pull as sync mechanism
- Only copy updated ML features
- Only copy affected data ranges
Decoupled from ETL writing
- Shock absorber between Lake & Mart
- Enables throttling, retrying, rewinding
Future
- Support Lake => Mart in DeltaStreamer tool
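
A minimal sketch of the dispersal pattern: incrementally pull only the changed rows, then egress via Spark's JDBC writer (the connection details, mart table, and lastSyncedInstant are placeholders):

import com.uber.hoodie.DataSourceReadOptions._
import org.apache.spark.sql.SaveMode

val changed = spark.read.format("com.uber.hoodie")
  .option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL)
  .option(BEGIN_INSTANTTIME_OPT_KEY, lastSyncedInstant)
  .load("s3://tables/ml_features")

// copy just the delta into the mart; throttling/retry logic would wrap this step
changed.write.format("jdbc")
  .option("url", "jdbc:redshift://mart-host:5439/analytics")
  .option("dbtable", "ml_features")
  .mode(SaveMode.Append)
  .save()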
#9: Efficient/Fast Deletes
Soft deletes
- upsert(k, null)
- Propagates seamlessly via incr-pull
Hard deletes
- Using EmptyHoodieRecordPayload
Indexing
- 7-10x faster than using regular joins
Future
- Standardized tooling
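
As a sketch of the hard-delete path above: write only the keys to purge, with the payload class switched to EmptyHoodieRecordPayload so the merged records vanish (the option key and class package are assumptions for this Hudi generation):

import com.uber.hoodie.DataSourceWriteOptions._
import com.uber.hoodie.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

// toDeleteDF holds only the key/partition/precombine columns of records to purge
toDeleteDF.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
  .option(RECORDKEY_FIELD_OPT_KEY, "userID")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "country")
  .option(PRECOMBINE_FIELD_OPT_KEY, "last_mod")
  .option("hoodie.datasource.write.payload.class", "com.uber.hoodie.EmptyHoodieRecordPayload")
  .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
  .mode(SaveMode.Append)
  .save("/path/on/dfs")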
#10: Safe Reprocessing
Identify late data
- Timeline tracks all write activity
- E.g: obtain bounds on lateness
Adjust incremental pull windows
- Still much more efficient than bulk recomputation
Future
- Support parrival(data, window) APIs in TimelineServer
- Apache Beam support for composing safe, incremental pipelines
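
A sketch of the reprocessing loop: once lateness bounds identify an affected window, re-pull just that bounded range of the raw table and upsert the recomputed rows downstream, as in #5. The END_INSTANTTIME option key is an assumption, and the window instants are placeholders:

import com.uber.hoodie.DataSourceReadOptions._

// bounded incremental pull over [windowStartInstant, windowEndInstant]
val windowDF = spark.read.format("com.uber.hoodie")
  .option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL)
  .option(BEGIN_INSTANTTIME_OPT_KEY, windowStartInstant)
  .option(END_INSTANTTIME_OPT_KEY, windowEndInstant)
  .load("s3://tables/raw_trips")
// recompute + upsert only the affected records downstream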
Open Source
Roadmap, community, and the future
Current Status
Where we are at
● Committed to open, vendor neutral data lake standard
● 2+ yrs of OSS community support
● First Apache release imminent
● EMIS Health, Yields.io + more in production
● Bunch of companies trying out
● Production tested on cloud
● hudi.apache.org/community.html
2019 Roadmap
Key initiatives
Bootstrapping tables into Hudi
- With indexing benefits
- Convenient tooling
Standalone Timeline Server
- Eliminate fs listings for query planning/ingestion
- Track column-level statistics for queries
Smart storage layouts
- Increase file sizes for older data
- Re-clustering data for queries
Thank you
dev@hudi.apache.org
@apachehudi
https://hudi.apache.org
?
Proprietary and confidential © 2019 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or
utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or
retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to
whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable
law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information
of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to
any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber.