EMC Corporation All rights reserved
SQL ON HADOOP
AGENDA
• Introduction
• Hive
• HAWQ
• Impala
• SparkSQL
• HBase + Phoenix
• Drill
• Networking & Pizza
INTRODUCTION
A SURVEY
• How many developers?
• How many BI/SQL developers?
• How many business analysts/sales?
• How many have used Hadoop?
• How many have used SQL on Hadoop?
WHAT IS HADOOP
• Hadoop is an open source framework for large-scale data storage and processing.
ABOUT THE HOSTS
• Application Workgroup in EMC
– Focused on
• Big data development/infrastructure
• Application modernization
• DevOps
ABOUT THE HOSTS
APPLICATION WORKGROUP IN EMC
• Fahim Kundi
– 10+ years of experience in EDW and big data
• Haden Pareira
– Data engineer with 5+ years of Hadoop experience
• Muhammad Ali
– Data engineer with 2+ years of Hadoop experience
WHAT IS HADOOP
• HDFS is a file system – it’s all files
• MapReduce requires strong programming skills
• It’s so difficult
SQL ON HADOOP
• SQL is well known in the analytics community
• Faster and easier data insights
• Allows SQL/BI developers to retain their expertise and create value out of big data
SQL ON HADOOP
• Cloudera – Impala
• Hortonworks – Hive/Tez
• Pivotal – HAWQ … now HDB
• MapR – Drill
• IBM – Big SQL
HIVE
Hive and HAWQ
By Fahim Kundi
CONTENTS
• Hive Introduction
• How Hive Works
• Apache Tez
• Hive with Tez vs MapReduce
• ORC and Parquet Format
• HAWQ Introduction
• Query Optimizer
• PXF
HIVE INTRODUCTION (1)
• Apache Hive is a data warehouse system with a high-level query language, built on top of Hadoop.
• It was initially developed by Facebook and made open source in 2008.
• SQL-like query language called HiveQL (HQL).
• Partitioning and bucketing for faster query processing.
• Integration with visualization tools such as Tableau.
HIVE INTRODUCTION (2)
• Hive supports the common primitive data types, such as INT, BINARY, BOOLEAN, CHAR, DECIMAL, FLOAT, STRING and TIMESTAMP.
• In addition, analysts can combine primitive data types to form complex data types, such as structs, maps and arrays.
HOW HIVE WORKS (1)
• The tables in Hive are similar to tables in a relational database.
• Databases consist of tables, which are made up of partitions.
• Data can be accessed via a simple query language, and Hive supports overwriting or appending data.
• Internally, Hive queries are converted into MapReduce or Tez jobs.
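To make the query-to-job translation concrete, here is a minimal sketch in plain Python (not Hadoop code) of how a query such as `SELECT dept, COUNT(*) FROM employees GROUP BY dept` could decompose into map, shuffle and reduce steps. The table, its columns and every function name are hypothetical, and the plan Hive actually generates is more involved.

```python
from collections import defaultdict

# Hypothetical rows of an "employees" table: (name, dept)
ROWS = [("ann", "sales"), ("bob", "eng"), ("cat", "eng"),
        ("dan", "sales"), ("eve", "eng")]

def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _name, dept in rows:
        yield dept, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate each group's values; COUNT(*) becomes a sum of ones."""
    return {key: sum(values) for key, values in groups.items()}

def count_by_dept(rows):
    """Run the three phases in sequence."""
    return reduce_phase(shuffle(map_phase(rows)))
```

Calling `count_by_dept(ROWS)` returns a dictionary of per-department counts, the same result the SQL query would produce.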
HOW HIVE WORKS (2)
• Within a particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory.
• Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory.
• Data within partitions can be further broken down into buckets.
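The table, partition and bucket hierarchy can be sketched as a path-building function. This is an illustration only: the warehouse root, the `.db` directory suffix, the bucket file name and the use of CRC32 as the hash are all assumptions made for the example, not Hive's actual naming or hashing.

```python
import zlib

def bucket_for(value: str, num_buckets: int) -> int:
    """Pick a bucket by hashing the bucketing column (CRC32 here, not Hive's hash)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets

def storage_path(warehouse, database, table, partitions, bucket_value, num_buckets):
    """Build an illustrative path: table directory / partition sub-directories / bucket file."""
    # Each partition column contributes one "col=value" sub-directory.
    parts = "/".join(f"{col}={val}" for col, val in partitions)
    bucket = bucket_for(bucket_value, num_buckets)
    # Hypothetical bucket file name, chosen only for readability.
    return f"{warehouse}/{database}.db/{table}/{parts}/bucket_{bucket:05d}"
```

For example, `storage_path("/warehouse", "sales", "orders", [("year", "2016"), ("month", "05")], "customer-42", 8)` yields a path under `/warehouse/sales.db/orders/year=2016/month=05/`. A query filtering on `year` and `month` only needs to read that sub-directory, which is why partitioning speeds up queries.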
APACHE TEZ (1)
• Apache Tez is a distributed execution framework targeted at data-processing applications on Hadoop.
• Tez was developed by Hortonworks and is built on top of YARN (the resource management framework for Hadoop).
• Tez generalizes MapReduce into a more powerful framework by expressing each user job as a dataflow graph. (Example)
APACHE TEZ (2)
• The Tez API has the following components –
– DAG (Directed Acyclic Graph) – defines the overall job.
One DAG object corresponds to one job
– Vertex – defines the user logic along with the resources
and the environment needed to execute the user logic.
One Vertex corresponds to one step in the job
– Edge – defines the connection between producer and
consumer vertices.
• Tez is not meant directly for end-users – in fact it
enables developers to build end-user applications with
much better performance and flexibility.
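The DAG, vertex and edge concepts can be illustrated without the Tez API itself. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to model a hypothetical job; the vertex names are invented for the example and nothing here calls Tez.

```python
from graphlib import TopologicalSorter

# Hypothetical job: two scans feed a join, whose output is aggregated.
VERTICES = {"scan_orders", "scan_customers", "join", "aggregate"}
# Each edge connects a producer vertex to a consumer vertex.
EDGES = [("scan_orders", "join"), ("scan_customers", "join"), ("join", "aggregate")]

def execution_order(vertices, edges):
    """Return one valid order in which the vertices could run."""
    sorter = TopologicalSorter({v: set() for v in vertices})
    for producer, consumer in edges:
        sorter.add(consumer, producer)  # the consumer depends on the producer
    return list(sorter.static_order())
```

Any order returned by `execution_order(VERTICES, EDGES)` runs both scans before the join and the join before the aggregation. Unlike a chain of MapReduce jobs, the whole graph is one job, so intermediate results need not be written back to HDFS between steps.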
EXAMPLE OF HIVE WITH TEZ VS MAPREDUCE
ORC FILE
• ORC (Optimized Row Columnar) is a columnar file format designed for Hadoop workloads.
• ORC files were developed to massively speed up Apache Hive and improve the storage efficiency of data stored in Apache Hadoop. The format is optimized for large streaming reads.
• ORC features:
– Columnar format for complex data types
– Built into Hive from 0.11
– Support for Pig and MapReduce via HCatalog
– Two levels of compression
• Lightweight, type-specific
• General
– Built-in indexes
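The two levels of compression can be sketched as a lightweight, type-aware encoding followed by a general-purpose codec. The example below uses run-length encoding for the first level and zlib for the second. It is an illustration of the idea only, not ORC's actual encoders or on-disk layout.

```python
import zlib

def run_length_encode(values):
    """Lightweight encoding: collapse runs of equal values into [value, count]."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def run_length_decode(runs):
    """Expand [value, count] pairs back into the original column."""
    return [v for v, n in runs for _ in range(n)]

def compress_column(values):
    """Apply RLE first, then a general codec (zlib) to the encoded bytes."""
    encoded = repr(run_length_encode(values)).encode("utf-8")
    return zlib.compress(encoded)
```

Because a columnar layout stores values of one column together, runs of repeated values (for example a country code) are common, which is what makes the lightweight first pass effective.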
ORC FILE LAYOUT
PARQUET
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
• Parquet features:
– Columnar file format
– Supports nested data structures
– Accessible by Hive, Spark, Pig, Drill and MapReduce
– Read/write in HDFS or the local file system
PARQUET FILE LAYOUT
ORC VS PARQUET
• Major considerations for choosing ORC over Parquet
– Many of the performance improvements provided by the Stinger initiative depend on features of the ORC format, including a block-level index for each column. This leads to potentially more efficient I/O, allowing Hive to skip reading entire blocks of data if it determines that the predicate values are not present there.
– The Cost Based Optimizer can also consider the column-level metadata present in ORC files in order to generate the most efficient graph.
– ACID transactions are only possible when using ORC as the file format.
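The block-skipping idea behind a per-column index can be sketched with min/max statistics. This is a simplified model under stated assumptions: the `Block` class, the block size and the equality-only predicate are hypothetical, and ORC's real index structures are richer.

```python
from dataclasses import dataclass

@dataclass
class Block:
    """A block of one column plus the min/max statistics kept in its index."""
    values: list
    min_value: int
    max_value: int

def make_blocks(values, block_size):
    """Split a column into fixed-size blocks and record min/max for each."""
    blocks = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks

def scan_equals(blocks, target):
    """Return the matching values and how many blocks were actually read."""
    matches, blocks_read = [], 0
    for block in blocks:
        if target < block.min_value or target > block.max_value:
            continue  # the predicate cannot match here: skip the block
        blocks_read += 1
        matches.extend(v for v in block.values if v == target)
    return matches, blocks_read
```

With a sorted column of 100 values in blocks of 10, looking up a single value reads one block and skips the other nine. The benefit shrinks when values are scattered, since more blocks then have overlapping ranges.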
FILE SIZE COMPARISON
HAWQ INTRODUCTION
• HAWQ is an MPP (massively parallel processing) SQL query engine that uses HDFS as its storage layer.
• HAWQ uses a query planner evolved from the Greenplum Database to handle query processing and does not rely on MapReduce under the hood.
• HAWQ reads data from and writes data to HDFS natively.
• It also has an extension framework (PXF) that lets it interact with data held in other services and formats (HBase, Hive, Avro, etc.) that also reside in HDFS.
HAWQ FEATURES
• HAWQ provides all major features found in Greenplum
database
– SQL Completeness: 2003 Extensions
– JDBC Compliant
– Robust Query Optimizer
– Row or Column-Oriented Table Storage
– Parallel Loading and Unloading
– Distributions
– Multi-level Partitioning
– High speed data redistribution
– Views
– External Tables
– Compression
– Resource Management
– Security
– Authentication
– Management and Monitoring
HAWQ ARCHITECTURE
[Architecture diagram. Labels: HAWQ Master (Parser, Query Optimizer, PXF, Local Storage); HAWQ Standby Master; Interconnect; Segment Hosts, each with Segments (Query Executor, PXF), Local Temp Storage and an HDFS DataNode; HDFS NameNode; HDFS Secondary NameNode]
HAWQ PARALLEL QUERY OPTIMIZER
[Figure: example parallel query plan built from the operators Gather Motion, Sort, HashAggregate, HashJoin, Hash, Redistribute Motion and Broadcast Motion over sequential scans of lineitem, orders, customer and nation]
• Turns a SQL query into an execution plan
• Cost-based optimizer
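A toy example of a cost-based decision is choosing how to bring rows together for a join: broadcast the smaller table to every segment, or redistribute both tables by the join key. The cost formulas below are deliberately simplified assumptions (rows moved over the interconnect) and the function is hypothetical. The real HAWQ optimizer weighs many more factors.

```python
def choose_motion(small_rows, large_rows, num_segments):
    """Pick the cheaper way to co-locate rows for a join (toy cost model)."""
    # Broadcasting copies the small table to every segment.
    broadcast_cost = small_rows * num_segments
    # Redistributing rehashes both tables once across the cluster.
    redistribute_cost = small_rows + large_rows
    if broadcast_cost <= redistribute_cost:
        return "Broadcast Motion", broadcast_cost
    return "Redistribute Motion", redistribute_cost
```

Under this model a 25-row lookup table joined to 1.5 million rows on 16 segments is broadcast, while a 150,000-row table is redistributed instead. This depends on the optimizer having reasonable row-count estimates for each table.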
PIVOTAL EXTENSION FRAMEWORK (PXF)
• PXF is a fast, extensible framework that connects HAWQ to an HDFS data store of choice that exposes a parallel API
– An advanced version of external tables
– Enables combining HAWQ data and Hadoop data in a single query
– Supports connectors for HDFS, HBase and Hive
– Provides an extensible framework API to enable custom connector development for any data source
[Diagram: Xtension Framework with connectors to HDFS, HBase and Hive]
Muhammad Ali
Image courtesy of Cloudera
IMPALA
OVERVIEW
• Interactive queries on top of Hadoop
• ANSI-92 SQL standard
• Native MPP query engine
• Written in C++
IMPALA
OVERVIEW
• Native to Hadoop
– Blends with the ecosystem
– Security
– Hive Metastore / HCatalog
– Queries existing HDFS data
• Not as fault-tolerant as MapReduce
– (or Hive or SparkSQL or …)
– If a single node fails during a query, the whole query fails
– But if it’s 20x faster, you can rerun and still finish faster ;)
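The rerun argument for Impala can be checked with simple arithmetic. The sketch below assumes that attempts fail independently with a fixed probability and that a failed attempt costs a full run. Both are simplifying assumptions, and the 20x figure is the slide's illustrative number rather than a benchmark.

```python
def expected_runtime(baseline_seconds, speedup, failure_probability):
    """Expected total time when every failed attempt is rerun from scratch.

    Assumes independent failures and that a failed attempt costs a full run,
    so the expected number of attempts is 1 / (1 - p).
    """
    if not 0 <= failure_probability < 1:
        raise ValueError("failure_probability must be in [0, 1)")
    single_run = baseline_seconds / speedup
    return single_run / (1 - failure_probability)
```

For a query that takes 600 seconds on the slower engine, `expected_runtime(600, 20, 0.5)` gives 60 seconds: even if half of all attempts fail, the faster engine still finishes ten times sooner on average. Under this model the break-even point for a 20x speedup is a failure probability of 0.95.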
IMPALA ARCHITECTURE
Image courtesy of Cloudera
IMPALA
WHERE IT SHINES
• Query execution times (small to medium size)
• Parquet format
– Compression
• High concurrency – kills the competitors
• Partitioning
• Query optimizer (compute statistics!)
IMPALA DEMO
WHAT THE HELL IS KUDU!
• Distributed columnar storage manager
• Performance of Parquet
– Great for analytical queries
• Mutability of HBase
– Supports UPDATE/DELETE, unlike Parquet
• One common storage to rule them all!
– (not exactly!)
WHERE DO YOU POSITION KUDU?
KUDU USE CASES
• IoT use cases
– High-velocity data
– The same data is read by analytical queries in near real time
• Predictive modeling
– Large datasets updated frequently
– Retraining models
• Time-series applications
– Kudu offers compound keys and hash-based partitioning
– Avoids hot-spotting
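Why hash-based partitioning on a compound key avoids hot-spotting in time-series workloads can be shown with a small simulation. The functions, the CRC32 hash and the sensor data below are hypothetical stand-ins. Kudu's own partitioning is configured in the table schema and uses its own hash function.

```python
import zlib
from collections import Counter

def range_partition(timestamp, window_seconds):
    """Range partitioning on time: each partition holds one time window."""
    return timestamp // window_seconds

def hash_partition(host, timestamp, num_partitions):
    """Hash partitioning on the compound key (host, timestamp)."""
    key = f"{host}|{timestamp}".encode("utf-8")
    return zlib.crc32(key) % num_partitions

def write_distribution(events, assign):
    """Count how many writes each partition receives."""
    return Counter(assign(host, ts) for host, ts in events)

# Hypothetical burst of writes: 1000 sensors reporting within a few seconds.
EVENTS = [(f"sensor-{i}", 1_000_000 + i % 5) for i in range(1000)]

by_range = write_distribution(EVENTS, lambda h, ts: range_partition(ts, 3600))
by_hash = write_distribution(EVENTS, lambda h, ts: hash_partition(h, ts, 4))
```

With time-only range partitioning every write in the burst lands in the partition for the current hour, so one server absorbs all of the load. Hashing the compound key spreads the same writes across several partitions.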
IMPALA DEMO
SPARK
2 MIN INTRO TO SPARK
• General-purpose distributed computing system
– Multiple language support (Java, Scala, Python, and R)
– Fault tolerance, data distribution, in-memory caching, etc.
• RDD
– Resilient distributed datasets
• Operations
– Transformations (define new RDDs)
– Actions (return a value)
• No nonsense
– Up to 100x faster than MapReduce for in-memory workloads
– Disk is used only when it can’t be avoided
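The split between transformations and actions can be illustrated with a toy class in plain Python. `LazyDataset` is a hypothetical stand-in written for this example, not the Spark API: it only mimics the lazy behaviour, with no distribution or fault tolerance.

```python
class LazyDataset:
    """A toy stand-in for an RDD: transformations are recorded, actions execute."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def _iterate(self):
        """Chain the recorded steps over the source as lazy iterators."""
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: run the recorded steps and return a value.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())
```

Building `LazyDataset(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)` performs no computation. Only calling `collect()` or `count()` on it runs the pipeline, which mirrors how Spark defers work until an action is invoked.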
2 MIN INTRO TO SPARK
Image Courtesy: Sachin Parmar
http://www.slideshare.net/sachinparmarss/deep-dive-spark-data-frames-sql-and-catalyst-optimizer?
SPARKSQL
SPARKSQL
• Structured Data Processing
– Commonly known to us as tables
• Integrated into Spark programming model
• Unified Data Access
• Scalability
• Support for HiveQL
• Cache it!
SPARKSQL
• Two APIs
– DataFrames
• Data organized into named columns
• Similar to tables
• Can be constructed from structured data files, Hive, external databases
– Datasets
• Experimental interface
• Strong typing combined with the SQL execution engine
• Can be constructed from regular JVM objects
SPARKSQL ARCHITECTURE
DEMO
SPARKSQL ON HADOOP
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 

Dernier (20)

Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 

SQL On Hadoop

  • 1. EMC Corporation All rights reserved SQL ON HADOOP
  • 2. AGENDA • Introduction • Hive • HAWQ • Impala • SparkSQL • HBase + Phoenix • Drill • Networking & Pizza
  • 3. INTRODUCTION: A SURVEY • How many developers?
  • 4. INTRODUCTION: A SURVEY • How many BI/SQL developers?
  • 5. INTRODUCTION: A SURVEY • How many business analysts/sales?
  • 6. INTRODUCTION: A SURVEY • How many have used Hadoop?
  • 7. INTRODUCTION: A SURVEY • How many have used SQL on Hadoop?
  • 8. WHAT IS HADOOP • Hadoop is an open source framework for large-scale data storage and processing.
  • 9. ABOUT THE HOSTS • Application Workgroup in EMC – Focused on • Big data development/infrastructure • Application modernization • DevOps
  • 10. ABOUT THE HOSTS: APPLICATION WORKGROUP IN EMC • Fahim Kundi – 10+ years of experience in EDW and big data • Haden Pareira – Data engineer with 5+ years of Hadoop experience • Muhammad Ali – Data engineer with 2+ years of Hadoop experience
  • 11. WHAT IS HADOOP
  • 12. WHAT IS HADOOP • HDFS is a file system – it’s all files • MapReduce requires strong programming skills • It’s so difficult
  • 13. SQL ON HADOOP • SQL is well known in the analytics community • Faster and easier data insights • Allows SQL/BI developers to retain their expertise and create value out of big data
  • 14. SQL ON HADOOP • Cloudera – Impala • Hortonworks – Hive/Tez • Pivotal – HAWQ … now HDB • MapR – Drill • IBM – Big SQL
  • 15. HIVE
  • 16. Hive and HAWQ, by Fahim Kundi
  • 17. CONTENTS • Hive Introduction • How Hive Works • Apache Tez • Hive with Tez vs. MapReduce • ORC and Parquet Formats • HAWQ Introduction • Query Optimizer • PXF
  • 18. HIVE INTRODUCTION (1) • Apache Hive provides a high-level query language and data warehouse features built on top of Hadoop. • It was initially developed at Facebook and made open source in 2008. • SQL-like query language called HiveQL (HQL). • Partitioning and bucketing for faster query processing. • Integration with visualization tools such as Tableau.
  • 19. HIVE INTRODUCTION (2) • Hive supports all the common primitive data types, such as INT, BINARY, BOOLEAN, CHAR, DECIMAL, FLOAT, STRING, and TIMESTAMP. • In addition, analysts can combine primitive data types to form complex data types, such as structs, maps, and arrays.
  • 20. HOW HIVE WORKS (1) • The tables in Hive are similar to tables in a relational database. • Databases are made up of tables, which are made up of partitions. • Data can be accessed via a simple query language, and Hive supports overwriting or appending data. • Hive queries are converted internally to MapReduce or Tez jobs.
  • 21. HOW HIVE WORKS (2) • Within a particular database, data in the tables is serialized, and each table has a corresponding Hadoop Distributed File System (HDFS) directory. • Each table can be subdivided into partitions that determine how data is distributed within subdirectories of the table directory. • Data within partitions can be further broken down into buckets.
  • 22. APACHE TEZ (1) • Apache Tez is a distributed execution framework targeted at data-processing applications on Hadoop. • Tez was developed by Hortonworks and is built on top of YARN (the resource management framework for Hadoop). • Tez generalizes MapReduce into a more powerful framework: it builds a dataflow graph for the job submitted by the user. (Example)
  • 23. APACHE TEZ (2) • The Tez API has the following components: – DAG (Directed Acyclic Graph) – defines the overall job. One DAG object corresponds to one job. – Vertex – defines the user logic along with the resources and the environment needed to execute it. One Vertex corresponds to one step in the job. – Edge – defines the connection between producer and consumer vertices. • Tez is not meant directly for end users; rather, it enables developers to build end-user applications with much better performance and flexibility.
  • 24. EXAMPLE OF HIVE WITH TEZ VS MAPREDUCE
  • 25. ORC FILE • ORC (Optimized Row Columnar) is a columnar file format designed for Hadoop workloads. • ORC files were developed to massively speed up Apache Hive and improve the storage efficiency of data stored in Apache Hadoop. The format is optimized for large streaming reads. • ORC features: – Columnar format for complex data types – Built into Hive since 0.11 – Support for Pig and MapReduce via HCatalog – Two levels of compression • Lightweight, type-specific • General-purpose – Built-in indexes
  • 26. ORC FILE LAYOUT
  • 27. PARQUET • Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. • Parquet features: – Columnar file format – Supports nested data structures – Accessible by Hive, Spark, Pig, Drill, MR – Read/write in HDFS or the local file system
  • 28. PARQUET FILE LAYOUT
  • 29. ORC VS PARQUET • Major considerations for choosing ORC over Parquet: – Many of the performance improvements provided in the Stinger initiative depend on features of the ORC format, including a block-level index for each column. This leads to potentially more efficient I/O, allowing Hive to skip reading entire blocks of data if it determines that predicate values are not present there. – The Cost Based Optimizer can also use column-level metadata present in ORC files to generate the most efficient graph. – In Hive, ACID transactions are only possible when using ORC as the file format.
  • 30. FILE SIZE COMPARISON
  • 31. HAWQ INTRODUCTION • HAWQ is an MPP (massively parallel processing) SQL query engine that uses HDFS as its storage layer. • HAWQ's query processing evolved from the Greenplum Database query planner and does not rely on MapReduce under the hood. • HAWQ reads data from and writes data to HDFS natively. • It also has an extension framework (PXF) that lets it interact with data held in other services and formats (HBase, Hive, Avro, etc.) that also reside in HDFS.
  • 32. HAWQ FEATURES • HAWQ provides all major features found in Greenplum Database: – SQL completeness: 2003 extensions – JDBC compliant – Robust query optimizer – Row- or column-oriented table storage – Parallel loading and unloading – Distributions – Multi-level partitioning – High-speed data redistribution – Views – External tables – Compression – Resource management – Security – Authentication – Management and monitoring
  • 33. HAWQ ARCHITECTURE (diagram; labels: HAWQ Master with Parser, Query Optimizer, PXF, and Local Storage; HAWQ Standby Master; Interconnect; Segment Hosts, each with Query Executor, PXF, Segments, Local Temp Storage, and an HDFS DataNode; HDFS NameNode; HDFS Secondary NameNode)
  • 34. HAWQ PARALLEL QUERY OPTIMIZER • Turns a SQL query into an execution plan • Cost-based optimizer (diagram: example plan tree with the nodes Gather Motion, Sort, HashAggregate, HashJoin, Redistribute Motion, Broadcast Motion, Hash, and Seq Scan on lineitem, orders, customer, and nation)
  • 35. PIVOTAL EXTENSION FRAMEWORK (PXF) • PXF is a fast, extensible framework that connects HAWQ to an HDFS data store of choice and exposes a parallel API – An advanced version of external tables – Enables combining HAWQ data and Hadoop data in a single query – Supports connectors for HDFS, HBase, and Hive – Provides an extensible framework API to enable custom connector development for any data source (diagram: Extension Framework connecting to HDFS, HBase, and Hive)
  • 36. Muhammad Ali (image courtesy of Cloudera)
  • 37. IMPALA OVERVIEW • Interactive queries on top of Hadoop • ANSI-92 SQL standard • Native MPP query engine • Written in C++
  • 38. IMPALA OVERVIEW • Native to Hadoop – Blends with the ecosystem – Security – Hive MetaStore / HCatalog – Queries existing HDFS data • Not as fault-tolerant as MapReduce – (or Hive or SparkSQL or …) – If a single node fails during a query, the whole query fails – But if it’s 20x faster, you can rerun and still finish faster ;)
  • 39. IMPALA ARCHITECTURE (image courtesy of Cloudera)
  • 40. IMPALA: WHERE IT SHINES • Query execution times (small to medium size) • Parquet format – Compression • High concurrency – kills the competitors • Partitioning • Query optimizer (Compute Statistics!)
  • 41. IMPALA DEMO
  • 42. WHAT THE HELL IS KUDU! • Distributed columnar storage manager • Performance of Parquet – Great for analytical queries • Mutability of HBase – Supports UPDATE/DELETE, unlike Parquet • One common storage to rule them all! – (not exactly!)
  • 43. WHERE DO YOU POSITION KUDU?
  • 44. KUDU USE CASES • IoT use cases – High-velocity data – Same data read for analytical queries in near real time • Predictive modeling – Large datasets updated frequently – Retraining models • Time-series applications – Kudu offers compound keys and hash-based partitioning – Avoids hot-spotting
  • 45. IMPALA DEMO
  • 46. SPARK
  • 47. 2 MIN INTRO TO SPARK • General-purpose distributed computing system – Multiple language support (Java, Scala, Python, and R) – Fault tolerance, data distribution, in-memory caching, etc. • RDD – Resilient Distributed Datasets • Operations – Transformations (define new RDDs) – Actions (return a value) • No nonsense – Up to 100x faster than MapReduce – Disk used only when it can’t be avoided
  • 48. 2 MIN INTRO TO SPARK (image courtesy of Sachin Parmar, http://www.slideshare.net/sachinparmarss/deep-dive-spark-data-frames-sql-and-catalyst-optimizer?)
  • 49. SPARKSQL
  • 50. SPARKSQL • Structured data processing – Commonly known to us as tables • Integrated into the Spark programming model • Unified data access • Scalability • Support for HiveQL • Cache it!
  • 51. SPARKSQL • Two APIs – DataFrames • Data organized into named columns • Similar to tables • Can be constructed from structured data files, Hive, external DBs – DataSets • Experimental interface • Strongly typed, with the SQL execution engine • Can be constructed from regular JVM objects
  • 52. SPARKSQL ARCHITECTURE
  • 53. DEMO: SPARKSQL ON HADOOP
  • 54. EMC Corporation. All rights reserved.
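
The table layout described in slides 20 and 21 (one HDFS directory per table, one subdirectory per partition, buckets inside each partition) can be illustrated with a small Python sketch. Everything here is an assumption for illustration: the function names, the `/warehouse` root, the `bucket_00000` file naming, and the CRC32 hash are hypothetical and are not Hive's actual naming scheme or hash function.

```python
import zlib
from typing import List, Tuple


def bucket_for(value: str, num_buckets: int) -> int:
    """Pick a bucket by hashing the bucketing column's value.

    CRC32 is used only because it is deterministic across runs;
    Hive uses its own hash function (assumption: any stable hash
    illustrates the idea equally well).
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def storage_path(
    warehouse: str,
    database: str,
    table: str,
    partitions: List[Tuple[str, str]],
    bucket_key: str,
    num_buckets: int,
) -> str:
    """Build an illustrative path: table dir / partition dirs / bucket file."""
    # Each partition column becomes one "column=value" subdirectory level.
    partition_dirs = "/".join(f"{col}={val}" for col, val in partitions)
    bucket = bucket_for(bucket_key, num_buckets)
    return f"{warehouse}/{database}.db/{table}/{partition_dirs}/bucket_{bucket:05d}"


if __name__ == "__main__":
    path = storage_path(
        "/warehouse", "sales", "orders",
        [("year", "2016"), ("month", "05")],
        bucket_key="customer-42",
        num_buckets=4,
    )
    print(path)
```

A query that filters on `year` and `month` only needs to read the matching subdirectory, which is why partitioning speeds up query processing; bucketing then narrows a lookup on the bucketing column to a single file within that partition.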

Editor's notes

  1. Hadoop has traditionally been a batch-processing platform for large amounts of data. However, there are many use cases that need near-real-time query processing. There are also several workloads, such as machine learning, which do not fit well into the MapReduce paradigm. Tez helps Hadoop address these use cases.
  2. Compared with the RCFile format, for example, the ORC file format has many advantages, such as: a single file as the output of each task, which reduces the NameNode's load; Hive type support, including datetime, decimal, and the complex types (struct, list, map, and union); light-weight indexes stored within the file that skip row groups that don't pass predicate filtering; block-mode compression based on data type; run-length encoding for integer columns and dictionary encoding for string columns; and concurrent reads of the same file using separate RecordReaders. Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query.
  3. Advantages of columnar storage: it limits I/O by loading only the columns that are needed, and it saves space because a columnar layout compresses better.
  4. Converts SQL into a physical execution plan. Cost-based optimization looks for the most efficient plan. The physical plan contains scans, joins, sorts, aggregations, etc. Global planning avoids sub-optimal ‘SQL pushing’ to segments. Directly inserts motion nodes for inter-segment communication and for efficient non-local join processing.
  5. Thank you
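
Two of the ORC ideas mentioned in note 2, light-weight indexes that let a reader skip row groups and run-length encoding of integer columns, can be sketched in plain Python. This is a toy model under stated assumptions, not the ORC file format: `RowGroup`, `build_row_groups`, `scan_equals`, and `run_length_encode` are hypothetical names, and the "index" is simply a min/max pair kept per group.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """A chunk of one integer column plus its min/max 'index' entry."""
    min_value: int
    max_value: int
    values: List[int]


def build_row_groups(column: List[int], group_size: int) -> List[RowGroup]:
    """Split a column into fixed-size groups and record min/max for each."""
    groups = []
    for start in range(0, len(column), group_size):
        chunk = column[start:start + group_size]
        groups.append(RowGroup(min(chunk), max(chunk), chunk))
    return groups


def scan_equals(groups: List[RowGroup], target: int) -> Tuple[List[int], int]:
    """Return matching values and how many groups actually had to be read."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        # The min/max entry proves the target cannot be in this group: skip it.
        if target < group.min_value or target > group.max_value:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v == target)
    return matches, groups_read


def run_length_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


if __name__ == "__main__":
    column = [1, 1, 1, 2, 2, 3, 7, 7, 8, 9, 9, 9]
    groups = build_row_groups(column, group_size=4)
    print(scan_equals(groups, 7))
    print(run_length_encode(column))
```

With the sample column, a lookup for the value 7 reads only one of the three row groups, which is the effect the block-level index on slide 29 is after; the run-length pairs show why sorted or repetitive integer columns compress well.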