2. Me?
Alexis Gendronneau
OVH, worldwide cloud provider
Data convergence Tech Lead
• Design customer Data Solutions
@bru_gere
https://www.linkedin.com/in/alexis-gendronneau-36066174/
3. Apache Dremio
Open source under the Apache License since July 2017
Founded by:
Jacques Nadeau (Apache Drill, MapR)
Tomer Shiran (MapR, Microsoft, IBM)
Team (partial):
Ajay Singh (Hortonworks)
Collin Weitzman (Mesosphere, MapR, Oracle)
Kelly Stirman (MongoDB)
Slogan:
“The missing link in modern data”
4. How to use data quickly and easily?
SQL
@vincentterrasi
5. Data is a massive engineering project today
Data Staging
• Custom ETL
• Fragile transforms
• Slow moving
SQL
6. Data is a massive engineering project today
Data Staging
Data Warehouse
• High overhead
• DBA experts
SQL
7. Data is a massive engineering project today
Data Staging
Data Warehouse
Cubes, BI Extracts &
Aggregation Tables
• Data sprawl
• Governance issues
• Slow to update
SQL
8. A New Tier In Data Analytics: Data Fabric
SQL
Data Virtualization
RDBMS, MongoDB, Elasticsearch, Hadoop, NAS,
Excel, JSON
Data Acceleration
OLAP and AdHoc queries at interactive speed,
without cubes or BI-extracts
Data Curation
Wrangle, prepare, enrich any source without
making copies of your data.
Data Catalog
Interactive Data Discovery, Enterprise and
Personal Data Assets
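The "without making copies" idea can be illustrated with a plain SQL view. The sketch below is not Dremio: it uses Python's built-in sqlite3 as a stand-in source, and the table, view and column names (orders, paid_orders, customer_name) are made up for the example. It only shows that a curated dataset can be a stored query over the source rather than a second copy of the data.

```python
import sqlite3

# In-memory SQLite stands in for a physical source; this is not Dremio itself.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "acme", 120.0, "paid"),
        (2, "acme", 80.0, "cancelled"),
        (3, "globex", 50.0, "paid"),
    ],
)

# The curated dataset is only a stored query: a renamed column and a filter.
# No rows are copied out of the orders table.
conn.execute(
    "CREATE VIEW paid_orders AS "
    "SELECT id, customer AS customer_name, amount "
    "FROM orders WHERE status = 'paid'"
)

# A row added to the source afterwards is visible through the view at once.
conn.execute("INSERT INTO orders VALUES (4, 'globex', 30.0, 'paid')")

totals = conn.execute(
    "SELECT customer_name, SUM(amount) FROM paid_orders "
    "GROUP BY customer_name ORDER BY customer_name"
).fetchall()
print(totals)
```

Because the view holds no data of its own, there is nothing to keep in sync with the source.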
9. A production ready architecture
Native Push-Downs
Optimized query semantics for each data source:
relational, NoSQL, HDFS and more.
Universal Relational Algebra
Query Planner automatically substitutes plans to make
optimal use of cache fragments.
Scalable
From 1 to 1000+ nodes, run on dedicated infrastructure
or in your Hadoop cluster, via YARN.
Dremio Reflections™
Optimized physical data structures for row and
aggregation operations.
Dremio optimizer
Accelerator cache (local disks, HDFS, S3, …)
Query plan
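As a rough illustration of plan substitution, the toy planner below answers a GROUP BY from a pre-aggregated table whenever that table covers the requested dimensions, and falls back to scanning the source otherwise. This is a sketch in plain Python, not Dremio's planner: the sample data and the names SALES, REFLECTION and sum_by are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical source rows: (region, product, year, amount).
SALES = [
    ("eu", "vps", 2017, 10.0),
    ("eu", "vps", 2018, 5.0),
    ("eu", "storage", 2018, 7.0),
    ("us", "vps", 2018, 3.0),
]
COLUMN_INDEX = {"region": 0, "product": 1, "year": 2}
AMOUNT_INDEX = 3


def build_aggregation(rows, dims):
    """Compute SUM(amount) grouped by the given dimension names."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[COLUMN_INDEX[d]] for d in dims)
        totals[key] += row[AMOUNT_INDEX]
    return dict(totals)


# One materialized aggregate, keyed by (region, product).
REFLECTION_DIMS = ("region", "product")
REFLECTION = build_aggregation(SALES, REFLECTION_DIMS)


def sum_by(dims):
    """Answer SUM(amount) GROUP BY dims, preferring the reflection."""
    if set(dims) <= set(REFLECTION_DIMS):
        # Roll up the smaller pre-aggregated table instead of the source.
        positions = [REFLECTION_DIMS.index(d) for d in dims]
        totals = defaultdict(float)
        for key, amount in REFLECTION.items():
            totals[tuple(key[p] for p in positions)] += amount
        return dict(totals), "reflection"
    # The reflection does not contain these dimensions: scan the source.
    return build_aggregation(SALES, dims), "source"


by_region, region_plan = sum_by(("region",))
by_year, year_plan = sum_by(("year",))
print(region_plan, by_region)
print(year_plan, by_year)
```

The query by region is served from the aggregate, while the query by year has to go back to the source because the aggregate does not keep that column.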
10. Relying on standard open source projects
Apache Drill (forked)
Distributed data exploration service
Apache Calcite
SQL parser & optimizer
Apache Arrow
In-memory columnar data processing library
Apache Parquet
Columnar data storage format
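To make "columnar" concrete, here is a small plain-Python sketch of the same records stored row by row and column by column. It does not use the Arrow or Parquet libraries, and the sample fields (host, cpu, mem) are invented for the example. It only shows why an aggregate over one field is cheaper when each column is stored contiguously.

```python
# Row layout: one dict per record, all fields of a record kept together.
rows = [
    {"host": "a", "cpu": 0.5, "mem": 512},
    {"host": "b", "cpu": 0.25, "mem": 1024},
    {"host": "c", "cpu": 0.75, "mem": 256},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {name: [] for name in records[0]}
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return columns


# Column layout: all values of a field kept together.
columns = to_columns(rows)

# The aggregate reads a single list and never touches host or mem.
avg_cpu = sum(columns["cpu"]) / len(columns["cpu"])
print(columns["mem"], avg_cpu)
```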
11. Dremio approach
Reflection design UI
Source storage layer
Cache persistence
Refresh system
Change detection
Relational pattern
End-user queries
Query planner
Data processing
12. Dremio security architecture
Data Source Access Control
Impersonation | Trusted Context* | Passthr*
Virtual Dataset Access Control
LDAP
Kerberos*
ODBC | JDBC | REST
SSL / TLS*
SQL
• Keep data where it is, even with your usual tools
13. Dremio powers analyst collaboration
Discover
● Self-service access to all sources
● First-class SQL support
● Extends your LDAP and Kerberos
Curate
● Rename columns, filter results
● Extract and transform values
● Join with other datasets
Accelerate
● Make queries 1000x faster
● Works with any data source
● Automatically adapts to you
Share
● Collaborate with your team
● Extends your permissions
● Google Docs for your data
17. Dataset Creation
You need a data source
• Elasticsearch
• MongoDB
• HDFS
• RDBMS (PostgreSQL, MySQL, MariaDB)
• Files (CSV, JSON, …)
Or a dataset
• Use search to find the right one
18. Data curation/preparation
On a dataset you can apply several changes
• Modify a column (split, delete, …)
• Modify rows (filter by column value, ...)
• Join with other datasets (the source type does
not matter)
If needed, revert to a previous step
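One way to picture curation with revert is a dataset that never changes its source rows and simply records each change as a step. The sketch below is plain Python under that assumption. The class CuratedDataset, its methods and the sample data are hypothetical and are not part of any Dremio API.

```python
class CuratedDataset:
    """Keeps the source rows untouched and records each change as a step."""

    def __init__(self, rows):
        self._source = rows
        self._steps = []  # list of (label, function) pairs

    def apply(self, label, func):
        """Record one more transformation step."""
        self._steps.append((label, func))
        return self

    def revert(self, count=1):
        """Drop the last `count` steps."""
        if count > 0:
            del self._steps[-count:]
        return self

    def history(self):
        return [label for label, _ in self._steps]

    def rows(self):
        """Replay every recorded step over a copy of the source rows."""
        result = [dict(row) for row in self._source]
        for _, func in self._steps:
            result = func(result)
        return result


users = [
    {"name": "Ada Lovelace", "team_id": 1},
    {"name": "Alan Turing", "team_id": 2},
]
teams = {1: "analytics", 2: "platform"}  # a second, unrelated "source"


def split_name(rows):
    """Column change: split name into first_name and last_name."""
    out = []
    for row in rows:
        first, last = row["name"].split(" ", 1)
        new_row = {k: v for k, v in row.items() if k != "name"}
        new_row.update(first_name=first, last_name=last)
        out.append(new_row)
    return out


def only_team_1(rows):
    """Row change: filter by a column value."""
    return [row for row in rows if row["team_id"] == 1]


def join_teams(rows):
    """Join with another dataset on team_id."""
    return [{**row, "team": teams[row["team_id"]]} for row in rows]


ds = CuratedDataset(users)
ds.apply("split name", split_name)
ds.apply("filter team 1", only_team_1)
ds.apply("join teams", join_teams)
curated = ds.rows()

ds.revert()  # undo the join only
after_revert = ds.rows()
print(curated, after_revert, ds.history())
```

Since every result is recomputed from the source plus the recorded steps, reverting is just a matter of forgetting the last step.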
19. Data queries enhancement
Define reflections on data to make queries faster
• Raw reflection for slow or heavily loaded backends
• Aggregation reflection for computed data
/!\ Be sure you know what you are doing with reflections
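One reason for the warning is staleness: a reflection is a materialized copy, so it can lag behind its source until it is refreshed. The toy class below sketches that behaviour in plain Python. RawReflection, refresh and query are illustrative names, not Dremio's API, and the refresh here is manual where a real system would schedule it.

```python
class RawReflection:
    """A materialized copy of a source, refreshed only on demand."""

    def __init__(self, read_source):
        self._read_source = read_source
        self._cache = []
        self.refresh()

    def refresh(self):
        """Re-read the source and replace the cached copy."""
        self._cache = list(self._read_source())

    def query(self, predicate):
        """Answer from the cache without touching the source."""
        return [row for row in self._cache if predicate(row)]


def is_open(row):
    return row["status"] == "open"


source = [{"id": 1, "status": "open"}, {"id": 2, "status": "closed"}]
reflection = RawReflection(lambda: source)

# The source changes after the reflection was built.
source.append({"id": 3, "status": "open"})

stale = reflection.query(is_open)   # still the old snapshot
reflection.refresh()
fresh = reflection.query(is_open)   # now includes the new row
print([r["id"] for r in stale], [r["id"] for r in fresh])
```

Queries served from the cache spare the backend, at the cost of results that are only as recent as the last refresh.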
20. Management
How much it is used
Where it comes from
How it is built (Enterprise)
Manage reflection creation requests (Enterprise)
Resource creation
21. Apache Dremio
What's next?
Open API for queries (data serving)
New data source integrations
Your requests! (community)