FROM FLAT FILES TO DECONSTRUCTED DATABASE
The evolution and future of the Big Data ecosystem.
Julien Le Dem
@J_
julien.ledem@wework.com
April 2018
Julien Le Dem
@J_
Principal Data Engineer
• Author of Parquet
• Apache member
• Apache PMCs: Arrow, Kudu, Heron, Incubator, Pig, Parquet, Tez
• Used Hadoop first at Yahoo in 2007
• Formerly Twitter Data platform and Dremio
Agenda
❖ At the beginning there was Hadoop (2005)
❖ Actually, SQL was invented in the 70s
“MapReduce: A major step backwards”
❖ The deconstructed database
❖ What next?
At the beginning there was Hadoop
Hadoop
Based on Google’s GFS and MapReduce papers
Storage: a distributed file system
Execution: MapReduce
Great at looking for a needle in a haystack
… with snowplows
Original Hadoop abstractions
Storage: File System
Just flat files
● Any binary
● No schema
● No standard
● The job must know how to split the data
Execution: Map/Reduce
[Diagram: map tasks (M) read locally from the file system; a shuffle redistributes intermediate data; reduce tasks (R) write locally.]
Simple
● Flexible/Composable
● Logic and optimizations tightly coupled
● Ties execution with persistence
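To make the original abstractions concrete, here is a minimal sketch (in Python, simulating the map, shuffle, and reduce phases locally; the word-count job and sample lines are invented for illustration). Note how all knowledge about splitting and interpreting the flat file lives in the job itself:

```python
from itertools import groupby

# A local simulation of Map/Reduce over flat files: the data has no schema,
# so the job itself must know how to split and interpret the lines.

def mapper(line):
    # Map: emit (key, value) pairs from a raw input line.
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # Reduce: aggregate all values seen for one key.
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog"]   # stand-in for file blocks
pairs = [kv for line in lines for kv in mapper(line)]
pairs.sort(key=lambda kv: kv[0])                  # the shuffle: group by key
result = [reducer(k, (v for _, v in vs))
          for k, vs in groupby(pairs, key=lambda kv: kv[0])]
print(result)  # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```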
“MapReduce: A major step backwards”
(2008)
SQL
Databases have been around for a long time
• Originally SEQUEL developed in the early 70s at IBM
• 1986: first SQL standard
• Updated in 1989, 1992, 1996, 1999, 2003, 2006, 2008, 2011, 2016
Relational model
• First described in 1969 by Edgar F. Codd
Global standard
• Universal data access language
• Widely understood
Underlying principles of relational databases
Standard: SQL is understood by many
Separation of logic and optimization: a high level language focusing on logic (SQL); an optimizer; indexing
Separation of schema and application: schemas; views; evolution
Integrity: transactions; integrity constraints; referential integrity
Relational Database abstractions
Storage: Tables
An abstracted notion of data:
● Defines a schema
● Format/layout decoupled from queries
● Has statistics/indexing
● Can evolve over time
Execution: SQL
● Decouples the logic of the query from:
○ Optimizations
○ Data representation
○ Indexing
Example: SELECT a, AVG(b) FROM FOO GROUP BY a
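As a minimal illustration of that decoupling (a sketch using Python’s built-in sqlite3; the FOO table and its rows are invented for the example), the query states only the logic, while the engine owns representation, indexing, and optimization:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE FOO (a TEXT, b REAL)")   # schema declared up front
con.executemany("INSERT INTO FOO VALUES (?, ?)",
                [("x", 1.0), ("x", 3.0), ("y", 2.0)])

# The query describes *what* to compute; the engine decides *how*.
for row in con.execute("SELECT a, AVG(b) FROM FOO GROUP BY a"):
    print(row)  # ('x', 2.0) then ('y', 2.0)
```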
Query evaluation
Syntax → Semantics → Optimization → Execution
A well integrated system
SELECT f.c, AVG(b.d) FROM FOO f JOIN BAR b ON f.a = b.b WHERE f.d = x GROUP BY f.c
[Diagram: the user’s query flows through syntax, semantics, optimization, and execution; the optimized plan (Scan FOO, Scan BAR → JOIN → FILTER → GROUP BY → Select) runs against storage, which supplies table metadata (schema, stats, layout, …), columnar data, and push downs.]
So why? Why Hadoop?
Why Snowplows?
The relational model was constrained
We need the right constraints and the right abstractions.
Traditional SQL implementations:
• Flat schema
• Inflexible schema evolution
• History rewrite required
• No lower level abstractions
• Not scalable
Constraints are good: they allow optimizations
• Statistics
• Pick the best join algorithm
• Change the data layout
• Reusable optimization logic
It’s just code
Hadoop is flexible and scalable
• Room to scale algorithms that are not part of the standard
• Machine learning
• Your imagination is the limit
Open source
• You can improve it
• You can expand it
• You can reinvent it
No data shape constraint
• Nested data structures
• Unstructured text with semantic annotations
• Graphs
• Non-uniform schemas
You can actually implement SQL with this
SELECT f.c, AVG(b.d) FROM FOO f JOIN BAR b ON f.a = b.b WHERE f.d = x GROUP BY f.c
[Diagram: a parser turns the query into a plan (Scan FOO, Scan BAR → JOIN → FILTER → GROUP BY → Select), which is executed as a series of jobs over the FOO and BAR files.]
And they did…
(open-sourced 2009)
10 years later
The deconstructed database
Author: gamene https://www.flickr.com/photos/gamene/4007091102
The deconstructed database
[Diagram: the components of a database, pulled apart: query model, machine learning, storage, batch execution, data exchange, stream processing.]
We can mix and match individual components (*not exhaustive!)
[Diagram: specialized open source components for stream processing, storage, execution, SQL, stream persistence, streams, resource management, table abstraction (Iceberg), and machine learning.]
We can mix and match individual components
Storage: row oriented or columnar; immutable or mutable; stream storage vs analytics optimized
Query model: SQL, functional, …
Machine learning: training models
Data exchange: row oriented or columnar
Batch execution: optimized for high throughput and historical analysis
Streaming execution: optimized for high throughput and low latency processing
Emergence of standard components
Columnar storage: Apache Parquet as the columnar representation at rest.
SQL parsing and optimization: Apache Calcite as a versatile query optimizer framework.
Schema model: Apache Avro as a pipeline friendly schema for the analytics world.
Columnar exchange: Apache Arrow as the next generation in-memory representation and no-overhead data exchange.
Table abstraction: Netflix’s Iceberg has great potential to provide snapshot isolation and layout abstraction on top of distributed file systems.
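As one example of why Avro is pipeline friendly, its schemas are plain JSON that travels with the data. A minimal sketch using the fastavro library (the PageView record, field names, and file name are invented for illustration):

```python
from fastavro import parse_schema, writer, reader

# Invented example schema: schemas are data, and can evolve over time.
schema = parse_schema({
    "type": "record", "name": "PageView", "namespace": "example",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "url", "type": "string"},
        {"name": "duration_ms", "type": ["null", "long"], "default": None},
    ],
})

with open("pageviews.avro", "wb") as out:
    writer(out, schema, [{"user_id": 1, "url": "/home", "duration_ms": 42}])

with open("pageviews.avro", "rb") as f:
    for record in reader(f):   # the schema is embedded in the file itself
        print(record)
```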
The deconstructed database’s optimizer: Calcite
[Diagram: Calcite handles syntax, semantics, and optimization for SELECT f.c, AVG(b.d) FROM FOO f JOIN BAR b ON f.a = b.b WHERE f.d = x GROUP BY f.c; schema plugins describe storage, optimizer rules rewrite the plan, and the optimized plan (Scan FOO, Scan BAR → JOIN → FILTER → GROUP BY → Select) is handed to the execution engine.]
Apache Calcite is used in:
Streaming SQL
• Apache Apex
• Apache Flink
• Apache SamzaSQL
• Apache StormSQL
…
Batch SQL
• Apache Hive
• Apache Drill
• Apache Phoenix
• Apache Kylin
…
The deconstructed database’s storage
[Diagram: storage systems arranged along two axes, columnar vs row oriented and immutable vs mutable, ranging from analytics-optimized to serving-optimized.]
Query integration
To be performant, a query engine requires deep integration with the storage layer: implementing push downs, and a vectorized reader that produces data in an efficient representation (for example Apache Arrow).
Storage: Push downs
PROJECTION: read only what you need
Read only the columns that are needed:
• Columnar storage makes this efficient.
PREDICATE: filter
Evaluate filters during the scan to:
• Leverage storage properties (min/max stats, partitioning, sort, etc.)
• Avoid decoding skipped data
• Reduce IO
AGGREGATION: avoid materializing intermediary data
To reduce IO, aggregation can also be implemented during the scan to:
• Minimize materialization of intermediary data
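A sketch of what these push downs look like from the client side, using the pyarrow.dataset API (the warehouse/foo path and the column names a, b, c are assumptions for illustration, not from the slides):

```python
import pyarrow.dataset as ds

dataset = ds.dataset("warehouse/foo", format="parquet")  # hypothetical path
table = dataset.to_table(
    columns=["a", "b"],          # projection push down: only read columns a, b
    filter=ds.field("c") == 5,   # predicate push down: row groups whose min/max
)                                # statistics exclude c = 5 are skipped
```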
The deconstructed database’s interchange: Apache Arrow
SELECT SUM(a) FROM t WHERE c = 5 GROUP BY b
[Diagram: scanners read Parquet files with projection and predicate push downs (read only a and b); partial aggregations feed a shuffle of Arrow batches; final aggregations produce the result.]
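A sketch of the aggregation step on in-memory Arrow data (the values are invented, and pyarrow’s built-in group_by stands in for a distributed engine’s partial/final aggregation):

```python
import pyarrow as pa

# Stand-ins for shuffled Arrow batches carrying columns a and b.
batch = pa.RecordBatch.from_pydict({"a": [1, 2, 3, 4], "b": ["x", "x", "y", "y"]})
table = pa.Table.from_batches([batch])

# SUM(a) GROUP BY b, computed directly on the columnar representation.
result = table.group_by("b").aggregate([("a", "sum")])
print(result.to_pydict())  # e.g. {'a_sum': [3, 7], 'b': ['x', 'y']}
```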
Storage: Stream persistence
[Logos: open source projects, labeled “incumbent” and “interesting”.]
Features
• State of the consumer: how do we recover from failure?
• Snapshots
• Decoupling reads from writes
• Parallel reads
• Replication
• Data isolation
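A minimal consumer sketch with the kafka-python client (topic name, group id, and servers are placeholders) showing how committed consumer state supports recovery from failure:

```python
from kafka import KafkaConsumer

def process(payload: bytes) -> None:
    print(payload)  # stand-in for real application logic

# The group_id ties committed offsets to this consumer group: a restarted
# consumer resumes from the last committed position after a failure.
consumer = KafkaConsumer(
    "events",                             # placeholder topic
    group_id="analytics",                 # placeholder consumer group
    bootstrap_servers=["localhost:9092"],
    enable_auto_commit=False,             # commit explicitly, after processing
)
for message in consumer:
    process(message.value)
    consumer.commit()  # checkpoint consumer position in the broker
```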
Big Data infrastructure blueprint
[Diagram: Data infrastructure blueprint. Persistence: stream persistence and data storage (S3, GCS, HDFS) holding Parquet/Avro. Processing: stream processing feeding real time dashboards and real time publishing; batch processing feeding periodic dashboards and batch publishing; interactive analysis for analysts. All access goes through a Data API. Metadata: schema registry, datasets metadata, scheduling/job deps. Monitoring/Observability spans the stack. Users: analysts via UI, engineers via the Data API.]
The Future
Still improving
Better interoperability
• More efficient interoperability: continued Apache Arrow adoption
A better data abstraction
• A better metadata repository
• A better table abstraction: Netflix/Iceberg
• Common push down implementations (filter, projection, aggregation)
Better data governance
• Global data lineage
• Access control
• Protect privacy
• Record user consent
Some predictions
A common access layer
A distributed access service centralizes:
• Schema evolution
• Access control/anonymization
• Efficient push downs
• Efficient interoperability
[Diagram: SQL engines (Impala, Drill, Presto, Hive, …), batch engines (Spark, MR, Beam), and ML frameworks (TensorFlow, …) all access data through a table abstraction layer (schema, push downs, access control, anonymization) on top of file systems (HDFS, S3, GCS) and other storage (HBase, Cassandra, Kudu, …).]
A multi-tiered streaming batch storage system
Batch-stream storage integration: a convergence of Kafka and Kudu
• Schema evolution
• Access control/anonymization
• Efficient push downs
• Efficient interoperability
[Diagram: a time based ingestion API writes to (1) an in-memory row oriented store, which flows into (2) an in-memory column oriented store, then into (3) an on-disk column oriented store. A stream consumer API (projection, predicate, and aggregation push downs) mainly reads from the in-memory tiers; a batch consumer API (same push downs) mainly reads from the on-disk tier.]
THANK YOU!
julien.ledem@wework.com
Julien Le Dem @J_
April 2018
We’re growing. We’re hiring!
Contact: julien.ledem@wework.com
• WeWork doubled in size last year and the year before
• We expect to double in size this year as well
• Broadened scope: WeWork Labs, Powered by We, Flatiron School, Meetup, WeLive, …
WeWork in numbers
• 242 locations in 72 cities internationally
• Over 210,000 members worldwide
• More than 20,000 member companies
Technology job locations
• San Francisco: platform teams
• New York
• Tel Aviv
• Montevideo
• Singapore
Contact: julien.ledem@wework.com
Questions?
Julien Le Dem @J_ julien.ledem@wework.com
April 2018