Impala:
Modern, Open-Source SQL Engine
For Hadoop
Shravan (Sean) Pabba
@skpabba
Agenda
• Why Hadoop?
• Data Processing in Hadoop
• User’s view of Impala
• Impala Use Cases
• Impala Architecture
• Performance highlights
In the beginning… was the database.
For a while, the database was all we needed.
Data is not what it used to be
(Chart: data growth from 1980 to today; structured data is now roughly 20% of the total, unstructured data roughly 80%.)
Hadoop was Invented to Solve:
• Large volumes of data
• Data that is only valuable in bulk
• High ingestion rates
• Data that requires more processing
• Differently structured data
• Evolving data
• High license costs
What is Apache Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is:
• Distributed
• Fault tolerant
• Scalable
Has the flexibility to store and mine any type of data
• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema
Excels at processing complex data
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks
Scales economically
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in
Core Hadoop system components:
• Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
• MapReduce: distributed computing framework
Processing Data in Hadoop
MapReduce
• Versatile
• Flexible
• Scalable
• High latency
• Batch oriented
• Java
• Challenging paradigm
Hive & Pig
• Hive – Turn SQL into MapReduce
• Pig – Turn execution plans into MapReduce
• Makes MapReduce easier
• But not any faster
Towards a Better MapReduce
• Spark – next-generation MapReduce, with in-memory caching, lazy evaluation, and fast recovery from node failures
• Tez – next-generation MapReduce, with reduced overhead and more flexibility; currently alpha
And now to something completely different!
What is Impala?
Impala Overview
Interactive SQL for Hadoop
• Responses in seconds
• Nearly ANSI-92 standard SQL with HiveQL
Native MPP Query Engine
• Purpose-built for low-latency queries
• Separate runtime from MapReduce
• Designed as part of the Hadoop ecosystem
Open Source
• Apache-licensed
Impala Overview
Runs directly within Hadoop
• Reads widely used Hadoop file formats
• Talks to widely used Hadoop storage managers
• Runs on the same nodes that run Hadoop processes
High performance
• C++ instead of Java
• Runtime code generation
• Completely new execution engine – no MapReduce
Impala is Production Ready
• Beta version released October 2012
• General availability (v1.0) released April 2013
• Latest release (v1.3.0) released April 2014
User View of Impala: Overview
• Distributed service in cluster:
one Impala daemon on each node with data
• Highly available: no single point of failure
• Submit query to any daemon:
• ODBC/JDBC
• Impala CLI
• Hue
• Query is distributed to all nodes with relevant data
• Impala uses Hive’s metadata
User View of Impala: File Formats
• There is no ‘Impala format’.
• Impala supports:
• Uncompressed or LZO-compressed text files
• Sequence files and RCFile with Snappy/gzip compression
• Avro data files
• Parquet columnar format (more on that later)
• HBase
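To make the format choices above concrete, here is a minimal sketch of how tables in two of these formats might be declared; the table and column names are made up for illustration, and exact syntax can vary slightly by Impala version.

-- Hypothetical table stored as plain text (LZO compression is handled at the file level):
CREATE TABLE logs_text (event_time STRING, message STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Hypothetical table stored in the Parquet columnar format:
CREATE TABLE logs_parquet (event_time STRING, message STRING)
STORED AS PARQUET;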
User View of Impala: SQL Support
• Most of SQL-92
• INSERT INTO … SELECT …
• Only equi-joins; no non-equi joins, no cross products
• ORDER BY requires LIMIT (for now)
• DDL support
• SQL-style authorization via Apache Sentry
• UDFs and UDAFs are supported
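A short, hedged sketch of the statement shapes listed above; the tables (sales, regions) and their columns are hypothetical and only meant to show the supported forms.

-- INSERT INTO ... SELECT ...:
INSERT INTO sales_by_region SELECT region_id, SUM(amount) FROM sales GROUP BY region_id;

-- Equi-joins only (no non-equi joins, no cross products):
SELECT s.id, r.name FROM sales s JOIN regions r ON (s.region_id = r.id);

-- ORDER BY currently requires LIMIT:
SELECT region_id, SUM(amount) FROM sales GROUP BY region_id ORDER BY 2 DESC LIMIT 10;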
Comparing Alternatives
Not All SQL On Hadoop Is Created Equal
• Batch MapReduce: make MapReduce faster. Slow, still batch.
• Remote Query: pull data from HDFS over the network to the DW compute layer. Slow, expensive.
• Siloed DBMS: load data into a proprietary database file. Rigid, siloed data, slow ETL.
• Impala: native MPP query engine that's integrated into Hadoop. Fast, flexible, cost-effective.
More Detail On Alternative Approaches
• Batch MapReduce (SQL engines layered on Hadoop's shared storage, integration, resource management, and metadata, with batch processing, interactive SQL, and machine learning workloads on HDFS/HBase):
• Batch-oriented
• High latency
• Remote Query (Hadoop provides the HDFS storage while a separate DBMS provides the compute layer):
• Network bottleneck
• 2x the hardware
• Duplicate metadata, security, SQL, etc.
• Siloed DBMS (a proprietary DBMS with its own metadata sits alongside Hadoop's standard, shared storage (HDFS) and engines such as MapReduce, Hive, Pig, and Impala):
• RDBMS rigidity
• Query subset of data
• Duplicate storage, metadata, security, SQL, etc.
Impala Vs Dremel
• Impala
• Open source
• Multi-table joins
• Many standard file formats supported (Text, Avro, RCFile, SequenceFile, Parquet)
• No nested structures (on roadmap)
• Dremel
• Google only
• Single-table queries
• Columnar format only
• Supports nested structures
Use Cases
Impala Use Cases
Cost-effective, ad hoc query environment that offloads the data warehouse for:
• Interactive BI/analytics on more data
• Asking new questions – exploration, ML
• Data processing with tight SLAs
• Query-able archive w/full fidelity
Global Financial Services Company
Saved 90% on incremental EDW spend & improved performance by 5x
• Offload data warehouse for query-able archive
• Store decades of data cost-effectively
• Process & analyze on the same system
• Improved capabilities through interactive query on more data
Digital Media Company
20x performance improvement for exploration & data discovery
• Easily identify new data sets for modeling
• Interact with raw data directly to test hypotheses
• Avoid expensive DW schema changes
• Accelerate ‘time to answer’
Impala Architecture
Impala Architecture
• Impala daemon (impalad) – N instances
• Query execution
• State store daemon (statestored) – 1 instance
• Provides name service and metadata distribution
• Catalog daemon (catalogd) – 1 instance
• Relays metadata changes to all impalads
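When table data or definitions change outside of Impala (for example, through Hive or by loading files directly into HDFS), the statements below are the usual way to have the impalads pick up the change; the table name is illustrative and exact behavior varies by release.

-- Reload the file and block metadata for one table after new data files were added outside Impala:
REFRESH sales;

-- Mark metadata as stale more broadly, e.g. after a new table was created through Hive:
INVALIDATE METADATA;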
Impala Query Execution
(Diagram: a SQL app connects over ODBC/JDBC to any impalad; each impalad runs a Query Planner, Query Coordinator, and Query Executor co-located with an HDFS DataNode and HBase, and the cluster also uses the Hive Metastore, HDFS NameNode, and Statestore.)
1) Request arrives via ODBC/JDBC/HUE/Shell
Impala Query Execution
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
Impala Query Execution
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
Query Planner
2-phase planning:
• Left-deep tree
• Partition plan to maximize data locality
Join order:
• Before 1.2.3: order of tables in the query
• 1.2.3 and above: cost-based if statistics exist (see the sketch below)
Plan operators:
• Scan, HashJoin, HashAggregation, Union, TopN, Exchange
• All operators are fully distributed
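As a sketch of the statistics-driven join ordering mentioned above (available in 1.2.3 and later), statistics are gathered with COMPUTE STATS and the resulting distributed plan can be inspected with EXPLAIN; the tables are the ones used in the example that follows.

-- Gather table and column statistics so the planner can cost join orders:
COMPUTE STATS HdfsTbl;
COMPUTE STATS HbaseTbl;

-- Show the distributed plan (scans, joins, aggregations, exchanges) without running the query:
EXPLAIN
SELECT state, SUM(revenue)
FROM HdfsTbl h JOIN HbaseTbl b ON (h.id = b.id)
GROUP BY state;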
Query Execution Example
Simple Example
SELECT state, SUM(revenue)
FROM HdfsTbl h
JOIN HbaseTbl b ON (h.id = b.id)
GROUP BY state
ORDER BY 2 DESC LIMIT 10
How does a database execute a query?
• Left-deep tree
• Data flows from bottom to top
(Execution tree: an Hdfs Scan and an Hbase Scan feed a Hash Join, then an Agg node, then a TopN node at the top.)
Wait – Why is this a left-deep tree?
(Plan diagram: a chain of HashJoins over scans of tables t0–t3; each HashJoin's left input is the result of the join below it and its right input is a single table scan, with an Agg node on top, so the joins line up down the left side of the tree.)
How does a database execute a query?
• The Hash Join node fills the hash table with the RHS table data
• So, the RHS table (Hbase scan) is scanned first
How does a database execute a query?
• Start scanning the LHS (Hdfs) table
• For each row from the LHS, probe the hash table for matching rows
(In this step, the probe finds a matching row.)
How does a database execute a query?
• Matched rows are bubbled up the execution tree
How does a database execute a query?
• Continue scanning the LHS (Hdfs) table
• For each row from the LHS, probe the hash table for matching rows
• Unmatched rows are discarded
(In this step, the probe finds no matching row.)
How does a database execute a query?
• All rows have been returned from the hash join node; the Agg node can start returning rows
• Rows are bubbled up the execution tree
How does a database execute a query?
• Rows from the aggregation node bubble up to the TopN node
How does a database execute a query?
• Rows from the aggregation node bubble up to the TopN node
• When all rows have been returned by the Agg node, the TopN node can start returning rows to the end user
Key takeaways
• Data flows from bottom to top in the execution tree and finally goes to the end user
• Larger tables go on the left
• Collect statistics
• Filter early (see the sketch below)
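For "filter early", one common pattern is to put a selective predicate, ideally on a partition key, in the WHERE clause so that only the relevant data is scanned before the join and aggregation. A minimal sketch, assuming a hypothetical table partitioned by year and month:

-- Hypothetical partitioned table:
-- CREATE TABLE sales (id BIGINT, state STRING, revenue DOUBLE)
--   PARTITIONED BY (year INT, month INT) STORED AS PARQUET;

SELECT state, SUM(revenue)
FROM sales
WHERE year = 2014 AND month = 4   -- partition predicate: only matching partitions are scanned
GROUP BY state;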
Simpler Example
SELECT state, SUM(revenue)
FROM HdfsTbl h
JOIN HbaseTbl b ON (h.id = b.id)
GROUP BY state
How does an MPP database execute a query?
(Distributed plan: Tbl b is scanned and broadcast to every node; each node scans its local portion of Tbl a, performs the Hash Join, and runs a local Agg; an Exchange then re-distributes the rows by "state" to a final Agg.)
How does an MPP database execute a query? (continued)
• Tbl B is scanned in parallel and broadcast to all nodes
• Each node reads its local blocks of Tbl A and performs the join (A join B)
• Each node runs a local aggregation
• The local results are re-distributed by "state" for the final aggregation
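Impala normally chooses between a broadcast join (as in this example) and a partitioned join on its own, but the choice can be forced with join hints for illustration; the square-bracket hint syntax below reflects the 1.x documentation and should be treated as a sketch.

-- Broadcast join: the right-hand table is sent in full to every node (good when it is small):
SELECT state, SUM(revenue)
FROM HdfsTbl h JOIN [BROADCAST] HbaseTbl b ON (h.id = b.id)
GROUP BY state;

-- Partitioned ("shuffle") join: both inputs are hash-redistributed on the join key instead:
SELECT state, SUM(revenue)
FROM HdfsTbl h JOIN [SHUFFLE] HbaseTbl b ON (h.id = b.id)
GROUP BY state;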
Performance
Impala Performance Results
• Impala’s Performance:
• Comparable commercial MPP DBMS speed
• Natively on Hadoop
• Three Result Sets:
• Impala vs Hive 0.12 (Impala 6-70x faster)
• Impala vs “DBMS-Y” (Impala average of 2x faster)
• Impala scalability (Impala achieves linear scale)
• Background
• 20 pre-selected, diverse TPC-DS queries (modified to remove unsupported language)
• Sufficient data scale for realistic comparison (3 TB, 15 TB, and 30 TB)
• Realistic nodes (e.g. 8-core CPU, 96 GB RAM, 12 x 2 TB disks)
• Methodical testing (multiple runs, reviewed fairness for competition, etc.)
• Details: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/
• Tests: https://github.com/cloudera/impala-tpcds-kit
Impala vs Hive 0.12 (Lower bars are better)
Impala vs “DBMS-Y” (Lower bars are better)
Impala Scalability: 2x the Hardware
(Expectation: Cut Response Times in Half)
Impala Scalability: 2x the Hardware and 2x Users/Data
(Expectation: Constant Response Times)
2x the Users, 2x the Hardware
2x the Data, 2x the Hardware
Demo
Roadmap
Roadmap
• SQL 2003-compliant analytic window functions (see the sketch below)
• Additional authentication mechanisms
• UDTFs (user-defined table functions)
• Intra-node parallelized aggregations and joins
• Nested data
• Enhanced, production-ready, YARN-integrated resource manager
• Parquet enhancements – continued performance gains, including index pages
• Additional data types – including Date and Decimal types
• ORDER BY without LIMIT clauses
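As a sketch of the first roadmap item, an analytic window function query would look roughly like the standard SQL:2003 form below once supported (it is not available in the release described here); the table and columns are hypothetical.

SELECT state, city, revenue,
       SUM(revenue) OVER (PARTITION BY state ORDER BY revenue DESC) AS running_state_revenue
FROM sales
LIMIT 100;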
Speaker Notes
  1. Open source system implemented (mostly) in Java. Provides reliable and scalable storage and processing of extremely large volumes of data (TB to PB) on commodity hardware. Two primary components: HDFS and MapReduce. Based on software originally developed at Google. An important aspect of Hadoop in the context of this talk is that it allows for inexpensive storage (<10% of other solutions) of any type of data, and places no constraints on how that data is processed. Allows companies to begin storing data that was previously thrown away. Hadoop has value as a stand-alone analytics system, but many if not most organizations will be looking to integrate it into existing data infrastructure.
  2. Take away – MapReduce is going away in favor of multi-framework Hadoop. Most of the replacements are improved MR. Impala is different.
  3. Interactive SQL for Hadoop: responses in seconds vs. minutes or hours; 4-65x faster than Hive, up to 100x seen; nearly ANSI-92 standard SQL with HiveQL (CREATE, ALTER, SELECT, INSERT, JOIN, subqueries, etc.); ODBC/JDBC drivers; compatible SQL interface for existing Hadoop/CDH applications. Native MPP query engine: purpose-built for low latency queries – another application being brought to Hadoop; separate runtime from MapReduce, which is designed for batch processing; tightly integrated with the Hadoop ecosystem – major design imperative and differentiator for Cloudera: single system (no integration), native open file formats that are compatible across the ecosystem (no copying), single metadata model (no synchronization), single set of hardware and system resources (better performance, lower cost), integrated end-to-end security (no vulnerabilities). Open source: keeps with our strategy of an open platform – i.e. if it stores or processes data, it's open source; Apache-licensed; code available on Github.
  4. (Faster) MapReduce: use MapReduce as the engine to execute SQL queries; batch orientation provides a poor BI/analytics experience; "faster MapReduce" is a band-aid, not a solution – that's why Google abandoned this approach; example: Hortonworks' Stinger. Remote Query: network bottleneck – must migrate data over the network to the compute layer of the DBMS, which impacts performance and eliminates the "data locality" value proposition of Hadoop in the first place; 2x the hardware – once again, storage is separate from compute, so you use one set of nodes for storage and another for query; duplicate metadata – must maintain and synchronize two different metadata models; duplicate security – Hadoop and DBMS security are distinct and must be managed separately, and often this approach means you must turn off certain portions of Hadoop's security; duplicate SQL – the SQL syntax of the DBMS vs. HiveQL; example: Teradata SQL-H. Siloed DBMS: basically a traditional RDBMS that uses HDFS as a data store; RDBMS rigidity – to achieve the performance claims you must load (or ETL) data into a proprietary file format with a predefined schema, and this format is not compatible with the rest of the Hadoop system; query subset of data – only what has been migrated into the proprietary file format can be queried against (or it must be loaded into memory during runtime, which drastically degrades performance); duplicate storage – data exists in standard file formats as well as the proprietary file format; duplicate metadata – must maintain and synchronize two different metadata models; duplicate security – as above; duplicate SQL – SQL syntax of the DBMS vs. HiveQL.
  5. Interactive BI/analytics on more data: raw, full fidelity data – nothing lost through aggregation or ETL/LT; new sources & types – structured/unstructured; historical data. Asking new questions: exploration and data discovery for analytics and machine learning – need to find a data set for a model, which requires lots of simple queries to summarize, count, and validate; hypothesis testing – avoid having to subset and fit the data to a warehouse just to ask a single question. Data processing with tight SLAs: cost-effective platform; minimize data movement; reduce strain on the data warehouse. Query-able storage: replace the production data warehouse for DR/active archive; store decades of data cost-effectively (for better modeling or data retention mandates) without sacrificing the capability to analyze.
  6. Now, we've finished scanning the RHS table and have finished building the hash table. We can now start scanning the LHS table to do the join.
  7. If there’s a match, the joined row will bubble up the execution tree to the aggregation node.
  8. This row doesn’t match. So, it won’t bubble up.
  9. Now that all the rows have been returned from the hash join node, the aggregation node can start returning rows.
  10. B is scanned in parallel and broadcast to all impalads. Each impalad reads its local data blocks for A and does the join; this is a broadcast join. After the join is done, we do the aggregation. But before we can produce the final result, we need to redistribute the result of the "local agg" according to the GROUP BY expression "state" and do the final aggregate.
  11. We added a redundant condition in the WHERE clause to the query that doesn't change the query semantics or results returned. This is transparently mentioned in both our public blog post and published queries as the "explicit partition filter/predicate." Like window functions, this is done as a workaround for a feature limitation in both Impala and Hive to match what a user should to do optimize for these systems. Please also note that this change was done for all compared systems (Impala, Hive, and "DBMS-Y") to ensure an apples-to-apples comparison.
  12. SQL 2003-compliant analytic window functions (aggregation OVER PARTITION) – to provide more advanced SQL analytic capabilities