Technical Overview on Cloudera Impala

This document provides an overview of Cloudera Impala, which is an interactive SQL query engine for processing large datasets stored in HDFS. It discusses how Impala addresses the challenges of running analytical queries over petabytes of data in near real-time without much effort. The architecture of Impala is explained, which involves Impala daemons, a state store, and distributed query planning and execution. Key features and benefits of Impala like its scalability, low latency, and ease of use are also highlighted.

Technical Overview of Cloudera
Impala & Demo
Praneeth Krishna Bellamkonda

Big Questions
- How can we run analytical queries over petabytes of data in near real time?
  - Example: a seller wants to know which city in Texas bought the most from them.
- How can we achieve low-latency responses with minimal effort?
- Is there a cost-effective solution for running analytical queries?
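The seller example above comes down to a single SQL aggregation. The sketch below shows the shape of such a query. It is only an illustration: it uses Python's built-in SQLite as a stand-in engine (not Impala), and the `orders` table, its columns, and the rows are hypothetical.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only (SQLite stand-in, not Impala).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (seller_id INTEGER, city TEXT, state TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "Austin", "TX", 120.0),
        (1, "Dallas", "TX", 300.0),
        (1, "Austin", "TX", 250.0),
        (1, "Denver", "CO", 900.0),
        (2, "Houston", "TX", 999.0),
    ],
)

# Sum purchases per Texas city for one seller and keep the largest total.
TOP_CITY_SQL = """
SELECT city, SUM(amount) AS total
FROM orders
WHERE seller_id = ? AND state = 'TX'
GROUP BY city
ORDER BY total DESC
LIMIT 1
"""


def top_texas_city(connection, seller_id):
    """Return (city, total) for the seller's best-selling Texas city."""
    return connection.execute(TOP_CITY_SQL, (seller_id,)).fetchone()


top_city = top_texas_city(conn, 1)
print(top_city)
```

The hard part the deck is concerned with is not the SQL itself but answering it interactively when the table holds petabytes.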

Question
- If I have 10 TB of data in HDFS, what options do I have for processing it?
  - MapReduce
  - Hive
  - Pig
Any major performance gain?

Impala – Architecture
- Impala daemon (impalad)
  - runs on every node
  - handles client requests
  - handles query planning and execution
- Statestore daemon
  - provides name service
  - distributes metadata
  - used for finding data

Impala – Architecture
[Diagram: a SQL app connects over ODBC; the Hive Metastore, HDFS NameNode, and Statestore appear as shared services; each of three nodes runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DataNode and HBase.]
Each impalad continually talks to the statestore to update its state and to receive metadata to use for query planning.
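The exchange between the daemons and the statestore can be pictured with a toy model. The sketch below is not Impala's actual protocol: the `ToyStatestore` and `ToyDaemon` classes and their methods are hypothetical names, and the only behavior modeled is that daemons register and the statestore pushes the current membership list to every registered daemon.

```python
class ToyStatestore:
    """Toy stand-in: tracks live daemons and pushes the membership list to them."""

    def __init__(self):
        self.subscribers = {}

    def register(self, daemon):
        self.subscribers[daemon.name] = daemon
        self._broadcast()

    def unregister(self, name):
        self.subscribers.pop(name, None)
        self._broadcast()

    def _broadcast(self):
        # Every remaining subscriber receives the full, sorted membership list.
        members = sorted(self.subscribers)
        for daemon in self.subscribers.values():
            daemon.on_membership_update(members)


class ToyDaemon:
    """Toy stand-in for an impalad that remembers which peers are alive."""

    def __init__(self, name):
        self.name = name
        self.known_peers = []

    def on_membership_update(self, members):
        self.known_peers = [m for m in members if m != self.name]


statestore = ToyStatestore()
daemons = [ToyDaemon(f"node{i}") for i in range(1, 4)]
for d in daemons:
    statestore.register(d)

# A node leaves; the remaining daemons learn about it from the statestore.
statestore.unregister("node3")
print(daemons[0].known_peers)
```

With an up-to-date peer list, any daemon can plan a query and pick the nodes that will execute it.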

Why Impala?
- Interactive SQL
  - In-memory distributed SQL query engine.
  - Built for low-latency (real-time) analytic queries.
- Highly scalable
  - Built on top of Hadoop.
  - Scales simply by adding nodes.
  - Direct access to data in HDFS/HBase (no MapReduce).
- Easy to use
  - Minimal data transformation effort required.
  - Reuses the Hive metastore.
  - Easy to integrate; supports JDBC clients.

Impala Query Execution
1) Request arrives via ODBC/JDBC/HUE/Shell
[Diagram: the SQL app sends a SQL request over ODBC into the cluster; each of three nodes runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DataNode and HBase, with the Hive Metastore, HDFS NameNode, and Statestore as shared services.]

Impala Query Execution
2) Planner turns the request into collections of plan fragments
3) Coordinator initiates execution on the impalad(s) local to the data
[Diagram: three nodes, each with a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DataNode and HBase; SQL app (ODBC), Hive Metastore, HDFS NameNode, and Statestore.]
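Steps 2 and 3 can be sketched as a small scheduling problem: each scan fragment should run on a node that already holds the data. The code below is a simplified illustration, not Impala's scheduler. The block names, node names, and the greedy least-loaded rule are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical block placement: block id -> nodes holding a replica.
BLOCK_LOCATIONS = {
    "blk_1": ["node1", "node2"],
    "blk_2": ["node2", "node3"],
    "blk_3": ["node3", "node1"],
    "blk_4": ["node1", "node2"],
}


def assign_scan_fragments(block_locations):
    """Greedily assign each block to the least-loaded node that holds a replica."""
    assignments = defaultdict(list)
    for block in sorted(block_locations):
        replicas = block_locations[block]
        # Ties on load are broken by node name so the result is deterministic.
        target = min(replicas, key=lambda node: (len(assignments[node]), node))
        assignments[target].append(block)
    return dict(assignments)


fragments = assign_scan_fragments(BLOCK_LOCATIONS)
for node, blocks in sorted(fragments.items()):
    print(node, blocks)
```

Because every block is read by a node that stores a replica, no scan has to pull its input across the network.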

Impala Query Execution
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to the client
[Diagram: three nodes, each with a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DataNode and HBase; SQL app (ODBC), Hive Metastore, HDFS NameNode, and Statestore.]
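Steps 4 and 5 can be illustrated with partial aggregation: each executor reduces its local rows and streams partial sums to the coordinator, which merges them into the final answer. This is a single-process toy under stated assumptions, not Impala's execution engine. The node names, rows, and function names are hypothetical, and a Python generator stands in for the network stream.

```python
from collections import defaultdict

# Hypothetical rows held by each node: (city, amount).
NODE_ROWS = {
    "node1": [("Austin", 120.0), ("Dallas", 300.0)],
    "node2": [("Austin", 250.0)],
    "node3": [("Houston", 80.0), ("Dallas", 50.0)],
}


def execute_fragment(rows):
    """Executor side: aggregate local rows, then yield partial sums one at a time."""
    partial = defaultdict(float)
    for city, amount in rows:
        partial[city] += amount
    yield from partial.items()


def coordinate(node_rows):
    """Coordinator side: merge partial sums as they arrive from each executor."""
    totals = defaultdict(float)
    for rows in node_rows.values():
        for city, partial_sum in execute_fragment(rows):
            totals[city] += partial_sum
    return dict(totals)


result = coordinate(NODE_ROWS)
print(result)
```

Only the small partial sums travel between nodes; the raw rows stay where they are stored.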

Which features from relational databases or Hive are not available in Impala?
- Querying streaming data.
- Deleting individual rows. You delete data in bulk by overwriting an entire table or partition, or by dropping a table.
- Indexing (not currently).
- Custom Hive serializer/deserializer classes (SerDes).
- Checkpointing within a query. That is, Impala does not save intermediate results to disk during long-running queries.
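The bulk-delete pattern mentioned above can be shown with an in-memory toy: to remove one row, the whole partition is rewritten without it. The table layout and helper names below are hypothetical; in Impala itself the rewrite would be expressed in SQL (for example with an INSERT OVERWRITE statement) rather than in Python.

```python
# Hypothetical partitioned table: partition key (state) -> list of rows.
table = {
    "TX": [{"order_id": 1, "city": "Austin"}, {"order_id": 2, "city": "Dallas"}],
    "CO": [{"order_id": 3, "city": "Denver"}],
}


def overwrite_partition(tbl, partition, keep):
    """Replace a whole partition with only the rows for which keep(row) is true."""
    tbl[partition] = [row for row in tbl[partition] if keep(row)]


def drop_partition(tbl, partition):
    """Remove an entire partition and all of its rows."""
    tbl.pop(partition, None)


# "Delete" order 2 by rewriting the TX partition without it.
overwrite_partition(table, "TX", lambda row: row["order_id"] != 2)
# Remove all Colorado data by dropping its partition.
drop_partition(table, "CO")
print(table)
```

The cost of removing a single row is therefore proportional to the size of its partition, which is one reason partitioning choices matter.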

Which features from relational databases or Hive are not available in Impala? (continued)
- Data is immutable; there is no updating.
- High memory usage.
- Response time is seconds, not microseconds.
- Non-scalar data types such as maps, arrays, and structs.
- XML and JSON functions.
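One consequence of the last two points is that nested records need to be flattened into scalar columns before they are loaded. The sketch below shows one possible way to do that ahead of time. It is an assumption about a preprocessing step, not something the deck prescribes, and the record, the `flatten` helper, and the underscore naming scheme are hypothetical.

```python
import json


def flatten(record, prefix=""):
    """Turn nested dicts into flat scalar columns; lists are kept as JSON text."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested struct: recurse and prefix the child column names.
            flat.update(flatten(value, prefix=f"{name}_"))
        elif isinstance(value, list):
            # Array: store as a string, since no array type is assumed to exist.
            flat[name] = json.dumps(value)
        else:
            flat[name] = value
    return flat


raw = '{"order_id": 7, "buyer": {"city": "Austin", "state": "TX"}, "items": ["a", "b"]}'
flat_row = flatten(json.loads(raw))
print(flat_row)
```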

References
- Cloudera Impala official documentation and slides: http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/impala.html
- Stack Overflow: http://stackoverflow.com/search?q=impala
- Quora: http://www.quora.com/Cloudera-Impala
- http://impala.io/index.html
- https://www.youtube.com/watch?v=G05CJbdMFaA

Technical Overview on Cloudera Impala

1. Technical Overview of Cloudera Impala & Demo (Praneeth Krishna Bellamkonda)
2. Scale at eBay
3. Big Questions
4. Question
5. Impala – Architecture
6. Impala – Architecture
7. Impala – Architecture
8. Why Impala?
9. Impala Query Execution
10. Impala Query Execution
11. Impala Query Execution
12. Which features from relational databases or Hive are not available in Impala?
13. Which features from relational databases or Hive are not available in Impala? (continued)
14. DEMO
15. References