Hadoop Summit, June 2013
SQL on Hadoop
Defining the New Generation of
Analytic Databases
Speaker Bio: Carl Steinbach
Currently:
Engineer @ Citus Data
PMC Chair, Committer -- Apache Hive Project
Formerly:
Oracle, NetApp, Informatica, Cloudera
Twitter: @cwsteinbach
LinkedIn: carlsteinbach
This talk is about:
A New Type of
Distributed
Analytic Database
What Is an Analytic Database?
OLAP: Online Analytical Processing
Consolidation (Roll-up)
Drill-down
Slicing and Dicing
No Transactions
Large Sequential Scans
I/O Bound
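The three OLAP operations named above can be illustrated with a small sketch. The `sales` table, its columns, and its rows are hypothetical and are not taken from the talk; SQLite stands in for an analytic database only because it ships with Python.

```python
import sqlite3

# Hypothetical sales facts: (region, city, year, amount).
rows = [
    ("West", "Seattle", 2012, 100.0),
    ("West", "Portland", 2012, 50.0),
    ("West", "Seattle", 2013, 120.0),
    ("East", "Boston", 2012, 80.0),
    ("East", "Boston", 2013, 90.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, city TEXT, year INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# Roll-up (consolidation): aggregate cities up to the region level.
rollup = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Drill-down: break one region back out into its cities.
drilldown = conn.execute(
    "SELECT city, SUM(amount) FROM sales WHERE region = 'West' "
    "GROUP BY city ORDER BY city"
).fetchall()

# Slice: fix one dimension (year = 2013) and look at what remains.
slice_2013 = conn.execute(
    "SELECT region, city, amount FROM sales WHERE year = 2013 ORDER BY region, city"
).fetchall()

print(rollup)
print(drilldown)
print(slice_2013)
```

Every query here reads the whole table and writes nothing, which is the access pattern the slide summarizes as large sequential scans with no transactions.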
Motivation:
The Problem with Enterprise Storage
[Diagram: a Storage Tier (NAS/SAN) connected to a Server/Worker Tier of twelve servers through a "Really Big Pipe".]
Google File System (’03)
A Possible Solution?
Design Priorities
• Commodity Hardware
• Fault Tolerance
• Big Files / Big Blocks
• Big Sequential Reads/Writes
Design Tradeoffs
• No random writes (write once/read many)
• Slow random reads
• Not POSIX compliant
So GFS Solved the Problem?
- Yes, but not because of anything described in the original paper
- Client/server approach won’t scale
- Full scope of GFS revealed one year later with publication of the MapReduce (’04) paper
GFS + MapReduce Key Idea: Eliminate I/O
Bottleneck by Colocating Compute and Storage
Resources on the Same Node
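A back-of-the-envelope sketch of why colocation removes the bottleneck. All figures below (data size, pipe bandwidth, per-node disk bandwidth) are illustrative assumptions and do not come from the talk, and the model deliberately ignores replication, skew, and network overhead.

```python
def scan_seconds_shared(data_gb, pipe_gb_per_s):
    """All servers read through one shared link to the storage tier."""
    return data_gb / pipe_gb_per_s


def scan_seconds_colocated(data_gb, nodes, disk_gb_per_s):
    """Each node scans only its local share of the data, in parallel."""
    return (data_gb / nodes) / disk_gb_per_s


# Illustrative numbers only (assumptions, not measurements).
DATA_GB = 10_000
PIPE_GB_PER_S = 10.0
DISK_GB_PER_S = 0.5

for nodes in (10, 100, 1000):
    shared = scan_seconds_shared(DATA_GB, PIPE_GB_PER_S)
    local = scan_seconds_colocated(DATA_GB, nodes, DISK_GB_PER_S)
    print(f"{nodes:>5} nodes: shared pipe {shared:.0f}s, colocated {local:.0f}s")
```

In this simplified model the shared-pipe scan time stays constant no matter how many servers are added, while the colocated scan time falls in proportion to the node count.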
What’s Good About Hadoop?
Commodity Storage
Scale-out
Fault Tolerance
Flexibility
MapReduce
Multi-structured Data
What’s Bad About Hadoop?
MapReduce!
No Schemas!
Missing Features
Optimizer, Indexes, Views
Incompatibility with Existing Tools
BI, ETL, IDEs
Apache Hive Solved Many of These Problems
SQL to MapReduce
Compiler + Execution Engine
Pluggable Storage Layer
(SerDes)
Schema-on-Read
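A minimal sketch of the schema-on-read idea: the data is stored as raw text, and a schema is applied only at the moment it is read. This is not Hive's SerDe API; the `SCHEMA` list, the `deserialize` and `read_table` helpers, and the sample rows are all hypothetical.

```python
import io

# Raw file contents, written once with no schema attached.
RAW = "7839,KING,president,5000\n7698,BLAKE,manager,2850\n"

# Hypothetical schema: column name plus a converter, applied only when reading.
SCHEMA = [("empno", int), ("ename", str), ("job", str), ("sal", float)]


def deserialize(line, schema, delimiter=","):
    """Turn one raw line into a typed row using the supplied schema."""
    fields = line.rstrip("\n").split(delimiter)
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}


def read_table(stream, schema):
    """Lazily yield typed rows; the stored bytes are never rewritten."""
    for line in stream:
        if line.strip():
            yield deserialize(line, schema)


rows = list(read_table(io.StringIO(RAW), SCHEMA))
print(rows[1]["job"], rows[1]["sal"])
```

Because the schema lives outside the file, the same bytes can later be read with a different schema or a different deserializer without reloading the data.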
But Other Problems Remained
Many Missing Features:
• ANSI SQL
• Cost Based Optimizer
• UDFs
• Data Types
• Security
• …
Biggest Problem:
• MapReduce Latency Overhead
Work in Progress: Hive Improvements
Stinger Initiative:
• Columnar Query Engine
• ORCFile File Format
• Replace MR with Tez (Apache Incubator)
One Solution:
MPP Database + Hadoop Connector
[Diagram: an MPP Database Cluster (one MPP Master Node running the Global Query Executor, plus four MPP Worker Nodes each running a Local Query Executor) next to a separate Hadoop Cluster of four HDFS datanodes.]
[Diagram, next build: the Local Query Executors on the MPP worker nodes pull data from the HDFS datanodes in the separate Hadoop cluster.]
[Diagram, next build: pulling data from the HDFS datanodes into the MPP worker nodes creates an I/O bottleneck between the two clusters.]
A Better Solution:
New Architecture for SQL on Hadoop
[Diagram: the four MPP worker nodes, each with a Local Query Executor, now sit together with the HDFS datanodes. The Global Query Executor on the MPP Master Node pushes work to the data, maintaining data locality.]
New Architecture for SQL on Hadoop
Data Locality
• Block-Aware Query Planner Pushes Work to Data
Real-Time Query Performance
• Replace MapReduce
Schema-on-Read
• Pluggable Storage Format Handlers
Tight Integration with SQL Ecosystem Tools
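A block-aware planner of the kind listed above can be sketched as follows. This is an illustration under stated assumptions, not the planner of any system named in the talk: `assign_blocks`, the block ids, and the node names are hypothetical, and the replica map merely imitates the kind of block-location information HDFS exposes.

```python
from collections import defaultdict


def assign_blocks(block_locations):
    """Assign each block to one node that already stores a replica of it.

    block_locations maps block id -> list of nodes holding a replica.
    Among the replica holders, the node with the fewest tasks so far wins;
    ties are broken by node name so the plan is deterministic.
    """
    load = defaultdict(int)
    plan = {}
    for block_id in sorted(block_locations):
        replicas = block_locations[block_id]
        node = min(replicas, key=lambda n: (load[n], n))
        plan[block_id] = node
        load[node] += 1
    return plan


# Hypothetical replica map.
locations = {
    "blk_1": ["node-a", "node-b"],
    "blk_2": ["node-a", "node-c"],
    "blk_3": ["node-b", "node-c"],
    "blk_4": ["node-a", "node-b"],
}

plan = assign_blocks(locations)
print(plan)
```

Every task in the resulting plan runs on a node that holds the block locally, so no block has to cross the network before it is scanned.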
Examples of the New Architecture
Google Dremel
• Interactive ad hoc query system for read-only nested data. Powers BigQuery.
Apache Drill
• Open source system inspired by Dremel. Implemented in Java. Work in progress.
Cloudera Impala
• Heavily influenced by MonetDB/X100. Runtime codegen. CPU cache aware. Implemented in C++.
Citus Data
• Built on PostgreSQL. Powerful cost-based optimizer for disk I/O. Handles failures.
The New Architecture in Detail:
CitusDB
[Diagram: PostgreSQL tools and ODBC/JDBC clients connect to the CitusDB Master Node, which holds Metadata, a Distributed Query Planner, and a Distributed Query Executor. Each of three HDFS datanodes also runs a Local Query Planner, a Local Query Executor, and Foreign Data Wrappers. Hadoop metadata lives in the HDFS NameNode.]
CitusDB: Metadata Synchronization
[Diagram: the same CitusDB cluster (master node, three datanodes with local planners, executors, and Foreign Data Wrappers, and the HDFS NameNode), annotated with the labels below.]
Metadata Sync
CREATE FOREIGN TABLE emp_{block_id} …
CREATE TABLE emp
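The naming pattern on this slide, one foreign table per HDFS block, can be sketched as a small metadata-building step. This is a guess at the shape of the bookkeeping, not CitusDB's implementation: `sync_metadata`, the block ids, and the host names are hypothetical, and the body of the `CREATE FOREIGN TABLE` statement is left out because the slide elides it.

```python
def sync_metadata(table, block_locations):
    """Build master-side metadata entries for one logical table.

    Returns a list of (foreign_table_name, block_id, hosts) tuples, one per
    block, following the emp_{block_id} naming pattern shown on the slide.
    """
    entries = []
    for block_id in sorted(block_locations):
        name = f"{table}_{block_id}"
        entries.append((name, block_id, block_locations[block_id]))
    return entries


# Hypothetical block ids and hosts; a real system would ask the NameNode.
blocks = {1001: ["node-a", "node-b"], 1002: ["node-b", "node-c"]}

for name, block_id, hosts in sync_metadata("emp", blocks):
    print(name, "->", hosts)
```

With entries like these, the master can translate a query against the logical table `emp` into one query per block table and send each to a host that stores that block.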
CitusDB: Query Execution
[Diagram: the same CitusDB cluster. A client submits the query below to the CitusDB Master Node.]
SELECT AVG(sal)
FROM emp
WHERE job = 'manager';
CitusDB: Query Execution
[Diagram: the same CitusDB cluster. The master node sends one local query per block to the worker nodes.]
Local Queries
SELECT SUM(sal), COUNT(sal)
FROM emp_{block_id}
WHERE job = 'manager';
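The rewrite from AVG into per-block SUM and COUNT can be checked with a small sketch. SQLite stands in for the per-node PostgreSQL instances, and the block ids and salary rows are made up for illustration.

```python
import sqlite3

# Hypothetical per-block data: block id -> list of (job, sal) rows.
blocks = {
    1: [("manager", 2975.0), ("clerk", 800.0)],
    2: [("manager", 2850.0), ("manager", 2450.0)],
    3: [("analyst", 3000.0)],
}

conn = sqlite3.connect(":memory:")
for block_id, rows in blocks.items():
    conn.execute(f"CREATE TABLE emp_{block_id} (job TEXT, sal REAL)")
    conn.executemany(f"INSERT INTO emp_{block_id} VALUES (?, ?)", rows)

# Local step: one SUM/COUNT query per block table.
partials = []
for block_id in blocks:
    total, count = conn.execute(
        f"SELECT SUM(sal), COUNT(sal) FROM emp_{block_id} WHERE job = 'manager'"
    ).fetchone()
    partials.append((total or 0.0, count))  # SUM over zero rows is NULL

# Global step: AVG = sum of local sums / sum of local counts.
total_sum = sum(s for s, _ in partials)
total_count = sum(c for _, c in partials)
distributed_avg = total_sum / total_count

# Reference answer: AVG computed over all blocks at once.
union = " UNION ALL ".join(f"SELECT job, sal FROM emp_{b}" for b in blocks)
direct_avg = conn.execute(
    f"SELECT AVG(sal) FROM ({union}) WHERE job = 'manager'"
).fetchone()[0]

print(distributed_avg, direct_avg)
```

AVG cannot be merged directly from per-block averages, because blocks contribute different numbers of matching rows. SUM and COUNT can each be added across blocks, which is why the local queries return both.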
CitusDB: Query Execution
[Diagram: the same CitusDB cluster. The worker nodes return their local results to the master node.]
Local Results
{842176.53, 8}
{1234283.00, 12}
{0.00, 0}
{125500.00, 1}
{523100.00, 3}
{785300.32, 5}
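The merge step on the master can be sketched by combining the (sum, count) pairs listed above. `combine_avg` is a hypothetical helper name; only the six pairs come from the slide.

```python
# Partial (sum, count) pairs copied from the Local Results above.
local_results = [
    (842176.53, 8),
    (1234283.00, 12),
    (0.00, 0),
    (125500.00, 1),
    (523100.00, 3),
    (785300.32, 5),
]


def combine_avg(partials):
    """Merge per-block (sum, count) pairs into one global average."""
    total_sum = sum(s for s, _ in partials)
    total_count = sum(c for _, c in partials)
    if total_count == 0:
        return None  # no matching rows anywhere, mirroring SQL AVG's NULL
    return total_sum / total_count


print(round(combine_avg(local_results), 2))
```

A block with no matching rows, such as the {0.00, 0} entry, adds nothing to either total and so does not distort the result.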
CitusDB: Query Execution
[Diagram: the same CitusDB cluster. The master node merges the local results into the final answer returned to the client.]
{121046.89}
Why We Chose PostgreSQL
- Powerful Cost-Based Optimizer
- Designed to minimize disk I/O
- Extensible, Rich Type System
- Pluggable Storage Format Handlers
- Lots of Extensions:
- Geospatial, Full Text Search, JSON, etc…
- Enterprise Features:
- ODBC/JDBC
- Security
- Internationalization
Defining the New Generation of
Distributed Analytic Databases
SQL → Ease of Use, Increased Productivity
Real-time responsiveness → Faster
Data Locality → Proven Scalability
Schema-on-Read → Flexibility, Lower Cost
Where Are We At?
CitusDB SQL on Hadoop is in Open Beta
Download our Binary Packages
Or Use Our EC2 AMI
http://citusdata.com/docs/sql-on-hadoop
We’re Hiring!
http://citusdata.com/job
Questions?

Contenu connexe

Tendances

Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopPrasanna Rajaperumal
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...DataWorks Summit/Hadoop Summit
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introductionXuan-Chao Huang
 
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesHadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesMithun Radhakrishnan
 
Nov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big DataNov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big DataYahoo Developer Network
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Adam Kawa
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieYahoo Developer Network
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark Hortonworks
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldLester Martin
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014hadooparchbook
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
 
Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014hadooparchbook
 
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherJanBask Training
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detectionhadooparchbook
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Vinoth Chandar
 
Improving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceImproving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceDataWorks Summit
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Sudhir Mallem
 

Tendances (20)

Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoop
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction
 
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesHadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
 
Nov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big DataNov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big Data
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache Oozie
 
Hadoop Platform at Yahoo
Hadoop Platform at YahooHadoop Platform at Yahoo
Hadoop Platform at Yahoo
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014
 
Big Data Platform Industrialization
Big Data Platform Industrialization Big Data Platform Industrialization
Big Data Platform Industrialization
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
 
Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014
 
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for Fresher
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
Improving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceImproving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of Service
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
 

Similaire à SQL on Hadoop: Defining the New Generation of Analytics Databases

Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideDanairat Thanabodithammachari
 
Hadoop training by keylabs
Hadoop training by keylabsHadoop training by keylabs
Hadoop training by keylabsSiva Sankar
 
SQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQSQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQpivotalny
 
Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)Ferran Galí Reniu
 
It takes two to tango! : Is SQL-on-Hadoop the next big step?
It takes two to tango! : Is SQL-on-Hadoop the next big step?It takes two to tango! : Is SQL-on-Hadoop the next big step?
It takes two to tango! : Is SQL-on-Hadoop the next big step?Srihari Srinivasan
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data trainingagiamas
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop DeveloperEdureka!
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 
Technical Overview on Cloudera Impala
Technical Overview on Cloudera ImpalaTechnical Overview on Cloudera Impala
Technical Overview on Cloudera ImpalaPraneeth Krishna
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
Basic of Big Data
Basic of Big Data Basic of Big Data
Basic of Big Data Amar kumar
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesOReillyStrata
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khanKamranKhan587
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceHortonworks
 

Similaire à SQL on Hadoop: Defining the New Generation of Analytics Databases (20)

Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Hadoop training by keylabs
Hadoop training by keylabsHadoop training by keylabs
Hadoop training by keylabs
 
SQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQSQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQ
 
Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)
 
It takes two to tango! : Is SQL-on-Hadoop the next big step?
It takes two to tango! : Is SQL-on-Hadoop the next big step?It takes two to tango! : Is SQL-on-Hadoop the next big step?
It takes two to tango! : Is SQL-on-Hadoop the next big step?
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data training
 
Handling not so big data
Handling not so big dataHandling not so big data
Handling not so big data
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop Developer
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Technical Overview on Cloudera Impala
Technical Overview on Cloudera ImpalaTechnical Overview on Cloudera Impala
Technical Overview on Cloudera Impala
 
Apache drill
Apache drillApache drill
Apache drill
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Impala for PhillyDB Meetup
Impala for PhillyDB MeetupImpala for PhillyDB Meetup
Impala for PhillyDB Meetup
 
Basic of Big Data
Basic of Big Data Basic of Big Data
Basic of Big Data
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khan
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers Conference
 

Plus de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Plus de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine



SQL on Hadoop: Defining the New Generation of Analytics Databases

  • 1. Hadoop Summit, June 2013. SQL on Hadoop: Defining the New Generation of Analytic Databases
  • 2. Speaker Bio: Carl Steinbach. Currently: Engineer @ Citus Data; PMC Chair and Committer, Apache Hive Project. Formerly: Oracle, NetApp, Informatica, Cloudera. Twitter: @cwsteinbach. LinkedIn: carlsteinbach
  • 3. This talk is about: A New Type of Distributed Analytic Database
  • 4. What Is an Analytic Database? OLAP (Online Analytical Processing): Consolidation (Roll-up); Drill-down; Slicing and Dicing. No Transactions. Large Sequential Scans. I/O Bound.
  • 5. Motivation: The Problem with Enterprise Storage. [Diagram: a Storage Tier (NAS/SAN) and a Server/Worker Tier of twelve servers, joined by a "Really Big Pipe".]
  • 6. Google File System (’03): A Possible Solution? Design Priorities: Commodity Hardware; Fault Tolerance; Big Files / Big Blocks; Big Sequential Reads/Writes. Design Tradeoffs: No random writes (write once/read many); Slow random reads; Not POSIX compliant.
  • 7. So GFS Solved the Problem? Yes, but not because of anything described in the original paper. The client/server approach won’t scale. The full scope of GFS was revealed one year later with the publication of the MapReduce (’04) paper. GFS + MapReduce Key Idea: Eliminate the I/O Bottleneck by Colocating Compute and Storage Resources on the Same Node.
  • 8. What’s Good About Hadoop? Commodity Storage; Scale-out; Fault Tolerance; Flexibility; MapReduce; Multi-structured Data.
  • 9. What’s Bad About Hadoop? MapReduce! No Schemas! Missing Features: Optimizer, Indexes, Views. Incompatibility with Existing Tools: BI, ETL, IDEs.
  • 10. Apache Hive Solved Many of These Problems. SQL-to-MapReduce Compiler + Execution Engine; Pluggable Storage Layer (SerDes); Schema-on-Read.
  • 11. But Other Problems Remained. Many Missing Features: ANSI SQL; Cost-Based Optimizer; UDFs; Data Types; Security; … Biggest Problem: MapReduce Latency Overhead.
  • 12. Work in Progress: Hive Improvements. Stinger Initiative: Columnar Query Engine; ORCFile File Format; Replace MR with Tez (Apache Incubator).
  • 13. One Solution: MPP Database + Hadoop Connector. [Diagram: an MPP Database Cluster (an MPP Master Node with a Global Query Executor, and four MPP Worker Nodes, each with a Local Query Executor) and a separate Hadoop Cluster of four HDFS datanodes.]
  • 14. One Solution: MPP Database + Hadoop Connector. [Same diagram, annotated "Pull Data".]
  • 15. One Solution: MPP Database + Hadoop Connector. [Same diagram, annotated "Pull Data" and "IO Bottleneck".]
  • 16. A Better Solution: New Architecture for SQL on Hadoop. [Diagram: the same components (MPP Master Node with Global Query Executor; four MPP Worker Nodes with Local Query Executors; four HDFS datanodes), annotated "Maintain Data Locality" and "Push Work To Data".]
  • 17. New Architecture for SQL on Hadoop. Data Locality: Block-Aware Query Planner Pushes Work to Data. Real-Time Query Performance: Replace MapReduce. Schema-on-Read: Pluggable Storage Format Handlers. Tight Integration with SQL Ecosystem Tools.
  • 18. Examples of the New Architecture. Google Dremel: Interactive ad hoc query system for read-only nested data. Powers BigQuery. Apache Drill: Open source version of Dremel. Implemented in Java. Work in progress. Cloudera Impala: Heavily influenced by MonetDB/X100. Runtime codegen. CPU cache aware. Implemented in C++. Citus Data: Built on PostgreSQL. Powerful cost-based optimizer for disk I/O. Handles failures.
  • 19. The New Architecture in Detail: CitusDB. [Diagram: CitusDB Master Node (Metadata, Distributed Query Planner, Distributed Query Executor); three HDFS datanodes, each with a Local Query Planner, a Local Query Executor, and Foreign Data Wrappers; HDFS NameNode (Hadoop Metadata); PostgreSQL Tools and ODBC/JDBC Clients.]
  • 20. CitusDB: Metadata Synchronization. [Same diagram, annotated "Metadata Sync".] Statements shown: CREATE TABLE emp; CREATE FOREIGN TABLE emp_{block_id} …
  • 21. CitusDB: Query Execution. [Same diagram.] Query shown: SELECT AVG(sal) FROM emp WHERE job = 'manager';
  • 22. CitusDB: Query Execution. [Same diagram, annotated "Local Queries".] SELECT SUM(sal), COUNT(sal) FROM emp_{block_id} WHERE job = 'manager';
  • 23. CitusDB: Query Execution. [Same diagram, annotated "Local Results".] {842176.53, 8} {1234283.00, 12} {0.00, 0} {125500.00, 1} {523100.00, 3} {785300.32, 5}
  • 24. CitusDB: Query Execution. [Same diagram.] Result shown: {121046.58}
  • 25. Why We Chose PostgreSQL. Powerful Cost-Based Optimizer. Designed to minimize disk I/O. Extensible, Rich Type System. Pluggable Storage Format Handlers. Lots of Extensions: Geospatial, Full Text Search, JSON, etc. Enterprise Features: ODBC/JDBC, Security, Internationalization.
  • 26. Defining the New Generation of Distributed Analytic Databases. SQL: Ease of Use, Increased Productivity. Real-time responsiveness: Faster. Data Locality: Proven Scalability. Schema-on-Read: Flexibility, Lower Cost.
  • 27. Where Are We At? CitusDB SQL on Hadoop is in Open Beta. Download our Binary Packages or Use Our EC2 AMI: http://citusdata.com/docs/sql-on-hadoop
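
The query-execution walk-through on slides 21 to 24 rewrites a global AVG(sal) into per-block SUM(sal) and COUNT(sal) queries, then merges the partial results on the master node. A minimal Python sketch of that merge step follows. It is only an illustration of the arithmetic, not CitusDB's actual code: the names `LOCAL_RESULTS` and `combine_avg` are hypothetical, and the input pairs are copied from slide 23.

```python
import math
from typing import Iterable, Optional, Tuple

# Per-block partial results as (SUM(sal), COUNT(sal)), copied from slide 23.
LOCAL_RESULTS = [
    (842176.53, 8),
    (1234283.00, 12),
    (0.00, 0),
    (125500.00, 1),
    (523100.00, 3),
    (785300.32, 5),
]


def combine_avg(partials: Iterable[Tuple[float, int]]) -> Optional[float]:
    """Merge per-block (sum, count) pairs into one global average.

    Averages cannot be averaged directly, because blocks hold different
    numbers of matching rows. Sums and counts can be added, so the
    global average is total_sum / total_count.
    """
    pairs = list(partials)
    total = math.fsum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    if count == 0:
        # No matching rows anywhere: mirror SQL, where AVG yields NULL.
        return None
    return total / count


if __name__ == "__main__":
    print(round(combine_avg(LOCAL_RESULTS), 2))
```

Combining the six pairs this way gives 29 matching rows and an average of about 121046.89, close to but not identical to the 121046.58 printed on slide 24, so the slide figures are best read as illustrative.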

Editor's Notes

  1. Databases are tools that let you ask questions about data. The architecture of a database depends heavily on the design of the system that stores the data. Hadoop, and HDFS in particular, represent a radical change to the underlying storage infrastructure. In order to capitalize on these changes we need to redesign the database from the ground up. That’s the goal of these new systems.
  2. Make sure we’re on the same page. Next: Enterprise Storage Model
  3. Availability: fault tolerance through RAID. Accessibility: shared files, POSIX file API. Problems: cost, scalability. Outro: Folks at Google were aware of these problems when they were building their search engine. Fibre channel,
  4. Distributed Block Store. ACM interview, Sean Quinlan and Kirk McKusick: http://queue.acm.org/detail.cfm?id=1594206
  5. Did this solve the problem? Commodity: yes. Fault tolerance: yes. Scalability: no. MR is the missing piece. Outro: 2005: Mike Cafarella, Doug Cutting, Nutch. Doug Cutting and Mike Cafarella launched the Hadoop project a year later. HDFS + MapReduce