Reshape Data Lake (As of 2020.07)
Eric Sun @ LinkedIn
https://www.linkedin.com/in/ericsun
SF Big Analytics
Similar Presentation/Blog(s)
https://databricks.com/session_na20/a-thorough-comparison-of-delta-lake-iceberg-and-hudi
https://databricks.com/session_eu19/end-to-end-spark-tensorflow-pytorch-pipelines-with-databricks-delta
https://bit.ly/comparison-of-delta-iceberg-hudi-by-domisj
https://bit.ly/acid-iceberg-delta-comparison-by-wssbck
Disclaimer
The views expressed in this presentation are those of the author and do not reflect any policy or
position of the author's employers. The audience may verify the anecdotes mentioned below.
Vocabulary & Jargon
● T+1: event/transaction time plus 1 day - typical daily-batch
T+0: realtime process which can deliver insight with minimal delay
T+0.000694: minutely-batch; T+0.041666: hourly-batch
● Delta Engine: Spark compiled in LLVM (similar to Dremio Gandiva)
● Skipping Index: Min/Max, Bloom Filter, and ValueList w/ Z-Ordering
● DML: Insert + Delete + Update + Upsert/Merge
● Time Travel: isolate & preserve multiple snapshot versions
● SCD-2: type 2 of multi-versioned data model to provide time travel (see the sketch after this list)
● Object/Cloud Storage: S3/IA/Glacier, ABS/Cool/Archive, GCS/NL/CL
● Streaming & Batch Unification: union historical bounded data with
continuous stream; interactively query both anytime
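To make the SCD-2 and Time Travel jargon above concrete, here is a minimal PySpark sketch, assuming an active SparkSession named spark; the table and column names (member_dim, effective_start_date, effective_end_date) are hypothetical, not from this deck:

# Hedged sketch: reconstruct a dimension as of a past date from an SCD-2 table.
# Every table/column name here is illustrative, not from the talk.
as_of = spark.sql("""
    SELECT member_id, country, is_premium
    FROM member_dim
    WHERE effective_start_date <= DATE '2020-07-01'
      AND effective_end_date   >  DATE '2020-07-01'
""")
as_of.show()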
Data Warehouse vs. Data Lake v1 vs. Data Lake v2

Data Warehouse:
● Relational DB based MPP
● ETL done by IT team; ELT inside MPP
● Star schema
● OLAP and BI focused
● SQL is the main DSL
● ODBC + JDBC as ⇿ interface
● <Expensive to scale …>
● Limited UD*F to run R and Data Mining inside database

Data Lake v1:
● HDFS + NoSQL
● ETL done by Java folks
● Nested schema or no schema
● Hive used by non-engineers
● Export data back to RDBMS for OLAP/BI
● M/R API & DSL dominated
● Scalable ML became possible
● <Hard to operate …>
● UD*F & SerDe made easier

Data Lake v2:
● Cloud + HTAP/MPP + NoSQL
● ETL done by data people in Spark and Presto
● Data model and schema matter again
● Streaming + Batch ⇨ unified
● More expressed in SQL + Python
● ML as a critical use case
● <Too confused to migrate …>
● Non-JVM engines emerge
Share So Much In Common
Despite all the marketing buzzwords and manipulation, 'data lakehouse', 'data lake', and 'data warehouse' all exist to solve the same data integration and insight generation problems.
The implementation will continue to evolve as new hardware and software become viable and practical.
● ACID
● Mutable (Delete, Update, Compact)
● Schema (DDL and Evolution)
● Metadata (Rich, Performant)
● Open (Format, API, Tooling, Adoption)
● Fast (Optimized for Various Patterns)
● Extensible (User-defined ***, Federation)
● Intuitive (Data-centric Operation/Language)
● Productive (Achieve more with less)
● Practical (Join, Aggregate, Cache, View)
Solution Architecture Template
[Architecture diagram: data sources flow through CDC and Ingestion (T+0 or T+0.000694 for streaming; T+0.0416 or T+1 for batch) into a layered stack of Storage, Data Format and SerDe, Metadata Catalog and Table API, and a Unified Data Interface, which serves Ads, BI/OLAP, Machine Learning, Deep Learning, Observability, Recommendation, and A/B Test use cases.]
Data Analytics in Cloud Storage
● Object Store Is Not a File System
○ There are no hierarchy semantics for rename or inheritance (see the sketch after this list)
○ Object is not appendable (in general)
○ Metadata is limited to a few KB
● REST is easy to program but RPC is much faster
○ Job/query planning step needs a lot of small scans (it is chatty)
○ 4MB cache block size may be inefficient for metadata operations
● Hadoop stack is tightly-coupled with HDFS notions
○ Hive and Spark (originally) were not optimized for object stores
○ Running HDFS as a cache/intermediate layer on a VM fleet can be
useful yet suboptimal (and operationally heavy)
○ Data locality still matters for SLA-sensitive batch jobs
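As a minimal sketch of why the missing rename semantics hurt (standard boto3 calls; the bucket and keys are hypothetical): an object-store "rename" must be emulated as a copy plus a delete per object, and nothing makes the pair atomic:

import boto3

s3 = boto3.client("s3")
# Emulated "rename": copy to the new key, then delete the old key.
# For a large directory this is one round trip per object, with no atomicity.
s3.copy_object(
    Bucket="my-bucket",
    CopySource={"Bucket": "my-bucket", "Key": "tmp/part-00000.parquet"},
    Key="final/part-00000.parquet",
)
s3.delete_object(Bucket="my-bucket", Key="tmp/part-00000.parquet")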
Big Data becomes too big, even Metadata
● Computation costs keep rising for big data
○ Partitioning the files by date is not enough
○ Hot and warm data sizes are still very big (how to save $$$)
○ Analytics often scan big data files but discard 90% of records and 80% of
fields; the CPU, memory, network and I/O cost is billed for 100%
○ Columnar formats have skipping indexes and projection pushdown, but
how can engines fetch them swiftly? (see the sketch after this list)
● Hive Metastore only manages directories (HIVE-9452 abandoned)
○ Commits can happen at file or file group level (instead of directory)
○ High-performance engines need better file layout and rich metadata at
field level for each segment/chunk in a file
○ Process metadata via Java ORM?
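A small PySpark sketch of the skipping/pushdown point above (hypothetical path and columns; standard DataFrame API): select only the needed fields, filter early, and check the plan to see what was actually pushed into the scan:

# Hedged sketch: read 2 of many fields and let the engine prune row groups
# against Parquet min/max statistics instead of scanning 100% of the data.
df = spark.read.parquet("/data/events")
slim = df.select("user_id", "ts").where("country = 'US'")
slim.explain()  # look for PushedFilters / ReadSchema in the scan node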
Immutable or Mutable
● Big data is all about immutable schemaless data
○ To get useful insights and features out of the raw data, we still have to
dedupe, transform, conform, merge, aggregate, and backfill
○ Schema evolution happens frequently when merges & backfills occur
● Storage is infinite and compute is cheap
○ Why not rewrite the entire data file or directory all the time?
○ If it is slow, increase the number of partitions and executors
● Streaming and Batch Unification requires decent incremental logic (see the sketch after this list)
○ Store granularly with ACID isolation and clear watermarks
○ Process incrementally without partial reads or duplicates
○ Evolve reliably with enough flexibility
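A minimal Structured Streaming sketch of "process incrementally without partial reads or duplicates" (hypothetical schema and path; standard PySpark API): a watermark bounds the dedupe state so late duplicates are dropped deterministically:

from pyspark.sql.types import StructField, StructType, StringType, TimestampType

schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
])
events = spark.readStream.schema(schema).parquet("/raw/events")
# Keep 1 hour of dedupe state; events older than the watermark are final.
deduped = (events
           .withWatermark("event_time", "1 hour")
           .dropDuplicates(["event_id", "event_time"]))
# (attach a writeStream sink to actually run this sketch)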
Are All Open Standards Equal?
● Hive 3.x
○ DML (based on ORC + Bucketing + on-the-fly Merge + Compactor)
○ Streaming Ingestion API, LLAP (daemon, caching, faster execution)
● Iceberg
○ Flexible Field Schema and Partition Layout Evolution (S3-first)
○ Hidden Partition (expression-based) and Bucket Transformation
● Delta Lake
○ Everything done by Spark + Parquet, DML (Copy-On-Write) + SCD-2
○ Fully supported in SparkSQL, PySpark and Delta Engine
● Hudi
○ Optimized UPSERT with indexing (record key, file id, partition path)
○ Merge-on-Read (low-latency write) or Copy-on-Write (HDFS-first)
Why Is Iceberg So Cool?
● Netflix is the most advanced AWS flagship partner
○ S3 is very scalable but a little over-simplified
○ Iceberg solves the critical cloud storage problems:
■ Avoid rename
■ Avoid directory hierarchy and naming convention
■ Aggregate (index) metadata into a compacted (manifest) file
● Netflix has migrated to Flink for stream processing
○ Fast ETL/analytics are needed to respond to its non-stop VOD
○ With one of the biggest Cassandra clusters (less mutability headache)
○ No urgent need for DML yet
● Netflix uses multiple data platforms/engines, and migrates faster than ...
○ Supports other file formats, engines, schemas, and bucketing by nature
Why Is Delta Lake So Handy?
● If you love to use Spark for ETL (Streaming & Batch), Delta
Lake just makes it so much more powerful
○ The API and SQL syntax are so easy to use (especially for data folks)
○ Wide range of patterns provided by paid customers and OSS community
○ (feel locked-in?) it is well-tested, less buggy, and more usable across 3 clouds
● Databricks has full control and moves very fast
○ v0.2 (cloud storage support: June 2019)
○ v0.3 (DML: Aug 2019), v0.4 (SQL syntax, Python API: Sep 2019)
○ v0.5 (DML & compaction performance, Presto integration: Dec 2019)
○ v0.6 (Schema evolution during merge, read by path: Apr 2020)
○ v0.7 (DDL for Hive Metastore, retention control, ADLSv2: Jun 2020)
Why Is Hudi Faster?
● Uber is a true fast-data company
○ Their marketplace and supply-demand-matching business model
seriously depend on near-real-time analytics:
■ Directly upsert the MySQL binlog into a Hudi table
■ Frequent bulk dumps of Cassandra are obviously infeasible
■ record_key is indexed (file names + bloom filters) to speed up
■ Batch favors Copy-on-Write but Streaming likes Merge-on-Read
■ Snapshot query is faster, while Incremental query has low latency
● Uber is also committed to Flink
● Uber mainly builds its own data centers and HDFS clusters
○ So Hudi is mainly optimized for on-prem HDFS with Hive conventions
○ GCP and AWS support was added later
Code Snippets - Delta
# Python: stream from a Delta table
spark.readStream.format("delta").load("/path/to/delta/events")

deltaTable = DeltaTable.forPath(spark, "/path/to/delta-table")
# Upsert (merge) new data
newData = spark.range(0, 20)
deltaTable.alias("oldData") \
  .merge(
    newData.alias("newData"),
    "oldData.id = newData.id") \
  .whenMatchedUpdate(set = { "id": col("newData.id") }) \
  .whenNotMatchedInsert(values = { "id": col("newData.id") }) \
  .execute()

// Scala: time travel to a specific table version by path
val df = spark.read.format("delta").load("/path/to/my/table@v5238")

// ---- Spark SQL ----
SELECT * FROM events -- query table in the metastore
SELECT * FROM delta.`/delta/events` -- query table by path
SELECT count(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
SELECT count(*) FROM my_table TIMESTAMP AS OF "2020-07-28 09:30:00.000"
SELECT count(*) FROM my_table VERSION AS OF 5238
UPDATE delta.`/data/events/` SET eventType = 'click' WHERE eventType = 'clck'
Code Snippets - Hudi
// assumes the Hudi quickstart imports, e.g. import org.apache.hudi.DataSourceReadOptions._
val tripsSnapshotDF = spark.read.format("hudi").load(basePath + "/*/*/*/*")
// load(basePath) uses the "/partitionKey=partitionValue" folder structure for Spark auto partition discovery
// since the partition (region/country/city) is 3 levels nested from basePath, use 4 levels "/*/*/*/*" here
tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0").show()
spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()
// -------------------
val beginTime = "000" // represents all commits > this time
val endTime = commits(commits.length - 2) // point in time to query
// incrementally query data
val tripsPointInTimeDF = spark.read.format("hudi").
  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
  option(END_INSTANTTIME_OPT_KEY, endTime).
  load(basePath)
tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_point_in_time where fare > 20.0").show()
Code Snippets - Iceberg
-- hidden partitioning: bucket and date transforms are declared once in the DDL
CREATE TABLE prod.db.sample_table (
  id bigint,
  data string,
  category string,
  ts timestamp)
USING iceberg
PARTITIONED BY (bucket(16, id), days(ts), category)

-- inspect the table's data files via its metadata table
SELECT * FROM prod.db.sample_table.files

-- dedupe one day of logs by rewriting the matching partition in place
INSERT OVERWRITE prod.my_app.logs
SELECT uuid, first(level), first(ts), first(message)
FROM prod.my_app.logs
WHERE cast(ts as date) = '2020-07-01'
GROUP BY uuid
spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table")
// time travel to October 26, 1986 at 01:21:00
spark.read.option("as-of-timestamp", "499162860000").table("prod.db.sample_table")
// time travel to snapshot with ID 10963874102873L
spark.read.option("snapshot-id", 10963874102873L).table("prod.db.sample_table")
Time Travel
● Time Travel is focused on keeping both Batch and Streaming
jobs isolated from concurrent reads & writes
● The typical range for Time Travel is 7 ~ 30 days
● Machine Learning (Feature reGeneration) often needs to
travel 3~24 months back
○ Need to reduce the precision/granularity of commits kept
in Data Lake (compact the logs to daily or monthly level)
■ Monthly baseline/snapshot + daily delta/changes (see the sketch after this list)
○ Consider a more advanced SCD-2 data model for ML
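A hedged PySpark sketch of the "monthly baseline + daily delta" idea (the paths and the member_id key are hypothetical): overlay a day's changes on the nearest baseline to reconstruct features as of an old date without retaining every commit:

# /snapshots/2020-06-30 is a monthly baseline; /deltas/2020-07-01 is that
# day's change set with the same schema and key.
base = spark.read.parquet("/snapshots/2020-06-30")
delta = spark.read.parquet("/deltas/2020-07-01")
# Last-writer-wins: keep baseline rows untouched by the delta, then add the changes.
as_of = (base.join(delta, "member_id", "left_anti")
             .unionByName(delta))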
What Else Should be Part of Data Lake?
● Catalog (next-generation metastore alternatives)
○ Daemon service: scalable, easy to update and query
○ Federation across data centers (across cloud and on-premises)
● Better file format and in-memory columnar format
○ Less SerDe overhead, zero-copy, directly vectorized operation on
compressed data (Artus-like). Tungsten v2 (Arrow-like)
● Performance and Data Management (for OLAP and AI)
○ New compute engines (non-JVM based) with smart caching and
pre-aggregation & materialized view
○ Mechanism to enable Time Travel over a more flexible and wider range
○ Rich DSL with code generation and pushdown capability for faster AI
training and inference
How to Choose?
What are the pain points? Each Data Lake framework has its own
emphasis, so find the one that aligns with your pain points.
● Motivations
Smoother integration with existing development
language and compute engine?
Contribute to the framework to solve new problems?
Want more control of the infrastructure? Is the
framework's open source governance friendly?
● Restrictions
...
⧫ Delta Lake + Spark + Delta Engine +
Python support will effectively help
Databricks pull ahead in the race.
⧫ The Flink community is all in for Iceberg.
⧫ GCP BigQuery, EMR, and Azure Synapse
(will) support reading from all table
formats, so you can lift-and-shift to ...
What’s next?
Data Lake can do more
Can be faster
Can be easier
Additional Readings
● Gartner Research
○ Are You Shifting Your Problems to the Cloud or Solving Them?
○ Demystifying Cloud Data Warehouse Characteristics
● Google
○ Procella + Artus (https://www.youtube.com/watch?v=QwXj7o4dLpw)
○ BigQuery + Capacitor (https://bit.ly/bigquery-capacitor)
● Uber
○ Incremental Processing on Hadoop (https://bit.ly/uber-incremental)
● Alibaba
○ AnalyticDB (https://www.vldb.org/pvldb/vol12/p2059-zhan.pdf)
○ Iceberg Sink for Flink (https://bit.ly/flink-iceberg-sink)
○ Use Iceberg in Flink 中文 (https://developer.aliyun.com/article/755329)
Data Lake implementations are still
evolving; don't hold your breath for
a single best choice. Roll up your
sleeves and build practical solutions
with 2 or 3 options combined.
Computation engine gravity/bias
will directly reshape the waterscape.
Thanks!
Presentation URL:
https://bit.ly/SFBA0728
Blog:
http://bit.ly/iceberg-delta-hudi-hive

Editor's notes

  1. The views expressed in this presentation are those of the author and do not reflect any policy or position of the employers of the author.
  2. IA = Infrequent Access; NL = Near Line; CL = Cold Line; https://flink.apache.org/news/2019/02/13/unified-batch-streaming-blink.html
  3. During the v1 era there were several attempts at non-JVM engines, but none of them really thrived. GPU, C++ and LLVM are really changing the game for Deep Learning and OLAP. HDFS is reaching its peak and starting to fade away.
  4. if all you have is a hammer, everything looks like a nail
  5. The Druid/Pinot (near real time analytics) block can be merged into the Data Lake with T+0 ingestion and processing capability. It can also be replaced by HTAP (such as TiDB) as a super ODS.
  6. AWS EFS is really a NFS/NAS solution, so it can't even replace HDFS on S3. Use EmrFileSystem instead. And s3a:// has known limitations: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/bk_cloud-data-access/content/s3-limitations.html Azure Data Lake Storage Gen2 is almost capable of replacing HDFS. abfs:// Google Colossus is years ahead of OSS, a true distributed file system. HIVE-14269, HIVE-14270, HIVE-20517, HADOOP-15364, HADOOP-15281 Hive ACID is not allowed if S3 is the storage layer (Hudi or others can be used as SerDe)
  7. Snowflake uses FoundationDB to organize a lot of metadata to speed up its Query Processing. https://www.snowflake.com/blog/how-foundationdb-powers-snowflake-metadata-forward/ S3 Select was launched Apr 2018 to provide some pushdown (Sep 2018 for Parquet) (Nov 2018, output committer to avoid rename)
  8. Record-grained mutation is expensive, but how about the mini-batch level? GDPR, CCPA, IDPC and … affect offline big data as well.
  9. Iceberg is mainly optimized for Parquet, but its spec and API are open to support ORC and Avro too. The Bucket Transformation is designed to work across Hive, Spark, Presto and Flink.
  10. Clearly distinguish and handle processing_time (a.k.a. arrival_time) vs. event_time (a.k.a. payload_time or transaction_time) In short, Hudi can efficiently update/reconcile the late-arrival records to the proper partition. https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/
  11. https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html https://docs.delta.io/0.7.0/delta-batch.html
  12. Very typical Hive style. Fine-grain control.
  13. Cool stuff
  14. Similar to Aster Data Systems https://en.wikipedia.org/wiki/Aster_Data_Systems and https://github.com/sql-machine-learning/sqlflow
  15. Similar to Aster Data Systems https://en.wikipedia.org/wiki/Aster_Data_Systems and https://github.com/sql-machine-learning/sqlflow
  16. Anecdote: Huawei was donating CarbonData into open source Spark a few years ago, but maybe Delta had already been the way to go; CarbonData never made it in as a file format bundled with Spark. CarbonData is a more comprehensive columnar format that supports rich indexing and even DML operations at the SerDe level. The latest FusionInsights MRS 8.0 is realizing the mutable Data Lake with streaming & batch combined on top of CarbonData. It will not be surprising if some of the Iceberg contributors & adopters have similar worries about Delta Lake.
  17. Huawei CarbonData anecdote:
  18. https://www.qlik.com/us/-/media/files/resource-library/global-us/register/ebooks/eb-cloud-data-warehouse-comparison-ebook-en.pdf https://www.gartner.com/doc/reprints?id=1-1ZA6E2JU&ct=200619&st=sb (Cloud Data Warehouse: Are You Shifting Your Problems to the Cloud or Solving Them?)
  19. We need to speculate about where Databricks is forging ahead next (Data Lake + ETL + ML + OLAP + DL + SaaS/Serverless + Data Management + …). What shall we learn from Snowflake's architecture and success? (Data Lake should be fast and intuitive to use; Metadata is so important for optimizing query performance.) Anecdote: Snowflake's IPO market cap is about 10x bigger than Cloudera's; that should say something about how useful it is.