Near-realtime
Big Data Analytics
using Impala

David Lauzon
Big Data Montreal #8
January 10th 2013
                       1 / 18
Plan

•   What is Impala?
•   Why did Google build Dremel?
•   Use cases for Impala
•   Use cases for Map-Reduce
•   Cloudera Customer Survey
•   Impala Features
•   Impala Performance Expectations and Benchmarks
•   Impala Components
•   Impala Architecture
•   Impala Development Roadmap
•   Where to learn more and get started


                                                     2 / 18
Disclaimer


• To keep the descriptions of Dremel and Impala
  as accurate as possible, most of the content in
  this presentation comes from the authors of the
  respective technologies. References are listed
  at the end of the presentation.
• I am not affiliated with or sponsored by Cloudera or
  Google.

                                               3 / 18
What is Impala?

  “An impala is an athletic, graceful
  African antelope, famous for its
  speed and its jumping agility”
  - Wikipedia




                                         4 / 18
Seriously, what is Impala?


• “Impala enables real-time, interactive,
  analytical queries of the data stored in
  HBase or HDFS” – Cloudera

• Inspired by the Google Dremel paper (2010)
  – BigQuery is Google’s hosted implementation of Dremel
     • It is proprietary, not free, and requires you to upload your
       data to Google servers



                                                                 5 / 18
Why did Google build Dremel?

• Problems with Data Warehouse Solutions for OLAP/BI:
   – Relational OLAP (ROLAP):
       • Indices must be built for every possible query (for performance
         reasons)
            → The indices could grow large enough to fill the whole RAM
   – Multi-dimensional OLAP (MOLAP):
       • Requires extensive time and money to design and build the data cubes
   – Ad-hoc queries (specific, non-optimised queries):
       • Needed when you don’t know in advance what you’ll need, or when
         you work in iterations, i.e. quite often!


• Solution:
   – Increase full-scan speed without requiring indexing or
     pre-aggregated values


                                                                           6 / 18
Come on, give me some use cases!

• Finding particular records with specified
  conditions.
   – “Find all the locations where account “ABC” was
     accessed from”.
• Quick aggregation of statistics with dynamically-
  changing conditions:
   – “Can you give me yesterday’s number of impressions
     for Google AdWords display ads – but only in the
     Tokyo region?”
• Trial-and-error data analysis:
   – “And between 11 am and 1 pm?”
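
The trial-and-error pattern above can be sketched in a few lines. This is only an illustration: Python's built-in sqlite3 stands in for an Impala connection, and the `impressions` table, its columns, and its rows are hypothetical. The point is that the follow-up question reuses the first query with one more condition, with no index or cube built in between.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only (not from the talk).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE impressions "
    "(ad_network TEXT, region TEXT, day TEXT, hour INTEGER, n INTEGER)"
)
rows = [
    ("adwords_display", "Tokyo", "2013-01-09", 10, 120),
    ("adwords_display", "Tokyo", "2013-01-09", 11, 200),
    ("adwords_display", "Tokyo", "2013-01-09", 12, 180),
    ("adwords_display", "Osaka", "2013-01-09", 11, 90),
    ("adwords_display", "Tokyo", "2013-01-08", 11, 75),
]
conn.executemany("INSERT INTO impressions VALUES (?, ?, ?, ?, ?)", rows)


def total_impressions(where, params):
    """Run one ad-hoc aggregation as a plain scan with a filter."""
    sql = "SELECT COALESCE(SUM(n), 0) FROM impressions WHERE " + where
    return conn.execute(sql, params).fetchone()[0]


base = "ad_network = ? AND region = ? AND day = ?"
args = ("adwords_display", "Tokyo", "2013-01-09")

# "Yesterday's impressions for display ads, but only in the Tokyo region?"
yesterday_tokyo = total_impressions(base, args)

# "And between 11 am and 1 pm?" -> same query, one extra condition.
lunchtime_tokyo = total_impressions(
    base + " AND hour BETWEEN ? AND ?", args + (11, 12)
)

print(yesterday_tokyo, lunchtime_tokyo)
```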


                                                          7 / 18
Use cases for which you should stick
with Map-Reduce based applications

• Very long-running, batch-oriented tasks
  such as ETL:
  – e.g. exporting large amounts of data after processing
• Complex event processing:
  – e.g. stream-processing
• “Complex data mining on Big Data which requires
  multiple iterations and paths of data processing
  with programmed algorithms” - Google


                                                           8 / 18
Integration with Hadoop


• Cloudera Customer Survey (Aug. 2012)
  – 80% need faster queries on Hadoop data
  – 65% query Hadoop using Hive
  – 70% move data from Hadoop to RDBMS for
    interactive SQL
  – 60% see value today in consolidating to a single
    platform




                                                       9 / 18
Impala Features


• Shared with Hive:
   –   Hive MetaStore
   –   Hive SQL (most common SQL-92 features)
   –   ODBC Driver
   –   User Interface (Hue Beeswax)
• Specific to Impala:
   –   No MapReduce; data is transferred in memory instead
   –   Host and disk awareness (data locality)
   –   Table data caching in RAM
   –   No virtual columns or locking

                                                 10 / 18
Impala Performance Expectations


• Performance improvements over Hive
  – 3 - 4X for purely I/O bound queries
  – 7 - 45X for queries with at least one join
  – 20 - 90X when data available in the cache




                                                 11 / 18
External Benchmarks


• Searching log files at 37signals
  (creators of the Ruby on Rails web framework)

Workload                                                        Impala      Hive        MySQL
                                                                query time  query time  query time
5.2 GB HAproxy log – top IPs by request count                     3.1s       65.4s      146.0s
5.2 GB HAproxy log – top IPs by total request time                3.3s       65.2s      164.0s
800 MB parsed Rails log – slowest accounts                        1.0s       33.2s       48.1s
800 MB parsed Rails log – highest database time paths             1.1s       33.7s       49.6s
8 GB pageview table – daily pageviews and unique visitors        22.4s       92.2s      180.0s

http://37signals.com/svn/posts/3315-how-i-came-to-love-big-data-or-at-least-acknowledge-its-existence
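
To compare these numbers with the expectations on the previous slide, the speedup factors can be worked out directly from the table. The short script below only divides the query times quoted above; the shortened workload labels are mine.

```python
# Query times in seconds (Impala, Hive, MySQL), copied from the table above.
benchmarks = {
    "HAproxy log: top IPs by request count": (3.1, 65.4, 146.0),
    "HAproxy log: top IPs by total request time": (3.3, 65.2, 164.0),
    "Rails log: slowest accounts": (1.0, 33.2, 48.1),
    "Rails log: highest database time paths": (1.1, 33.7, 49.6),
    "Pageview table: daily pageviews and unique visitors": (22.4, 92.2, 180.0),
}


def speedups(times):
    """Return (speedup vs Hive, speedup vs MySQL), rounded to one decimal."""
    impala, hive, mysql = times
    return round(hive / impala, 1), round(mysql / impala, 1)


for name, times in benchmarks.items():
    vs_hive, vs_mysql = speedups(times)
    print(f"{name}: {vs_hive}x vs Hive, {vs_mysql}x vs MySQL")
```

On these five workloads Impala comes out roughly 4x to 33x faster than Hive, with the smallest gain on the largest (8 GB) table.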



                                                                                                        12 / 18
Impala Components


• Impala State Store : 1 per cluster
   – Coordinates information (location and status) about
     all the running impalad instances
• Impala Daemon : 1 per DataNode
   – Coordinates and executes queries
   – Distributes query fragments to the other Impala Daemons
• Impala Shell : 1 per node
   – Provides Command Line Interface allowing
     interactions with Impala
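
How the daemons cooperate on one query can be sketched as a toy simulation. This is not Impala code: the node names, the sample data and the functions `run_fragment` and `coordinate` are hypothetical, and threads stand in for daemons on separate machines. One daemon acts as coordinator, each daemon runs its fragment on node-local data, and the coordinator merges the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: the client IPs found in the log blocks stored on each node.
LOCAL_BLOCKS = {
    "node1": ["10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "node2": ["10.0.0.2", "10.0.0.2", "10.0.0.2"],
    "node3": ["10.0.0.3", "10.0.0.1"],
}


def run_fragment(node):
    """Query fragment: count requests per IP using only node-local data."""
    return Counter(LOCAL_BLOCKS[node])


def coordinate(nodes):
    """Coordinator: fan the fragments out, then merge the partial counts.

    `nodes` stands in for the list of live daemons that the State Store
    keeps track of.
    """
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(run_fragment, nodes))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# "Top IPs by request count", as in the benchmark workloads.
top_ips = coordinate(list(LOCAL_BLOCKS)).most_common(2)
print(top_ips)
```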

                                                           13 / 18
Impala Architecture




                      14 / 18
Roadmap : 0.3 Beta Version

• Operating System:
    – Only RHEL/CentOS 6.2 is supported
• File formats:
    – Text files, SequenceFiles, HBase table
• Compression:
    – Snappy, Gzip, BZip
•   No UDFs or user extensibility *
•   Largest table in joins must be specified first *
•   Right-side of join must fit in RAM *
•   No support for complex nested structures * :
    – e.g. maps, structs and arrays.
               * Top asks for the versions after 1.0 GA
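
The two join restrictions above are what you would expect from a simple in-memory hash join. The sketch below shows that technique in general terms; that the 0.3 beta works exactly this way is an assumption, and the tables and the `hash_join` helper are hypothetical. The right side is loaded whole into a hash table (so it must fit in RAM), while the left side is streamed (so the largest table belongs there).

```python
def hash_join(left_rows, right_rows, key):
    """Inner-join two iterables of dicts on `key`.

    Build phase: the right side is loaded entirely into an in-memory
    hash table. Probe phase: the left side is streamed row by row.
    """
    build = {}
    for row in right_rows:
        build.setdefault(row[key], []).append(row)

    for left in left_rows:
        for right in build.get(left[key], []):
            yield {**left, **right}


# Hypothetical tables: large fact table on the left, small lookup table
# on the right.
pageviews = [
    {"account_id": 1, "path": "/home"},
    {"account_id": 2, "path": "/docs"},
    {"account_id": 1, "path": "/billing"},
    {"account_id": 3, "path": "/home"},
]
accounts = [
    {"account_id": 1, "name": "ABC"},
    {"account_id": 2, "name": "XYZ"},
]

joined = list(hash_join(pageviews, accounts, "account_id"))
for row in joined:
    print(row)
```

Memory use grows with the right-hand table only, which is why swapping the two sides matters even though the result is the same.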
                                                       15 / 18
Roadmap : 1.0 General Availability
(Q1 2013)

• File Formats:
   – RCFile, Avro, LZO
   – Trevni : new columnar file-format by Doug Cutting
• More OS Support:
   – Same as those supported by CDH4
• Performance:
   – Faster, bigger, and more memory efficient joins and
     aggregations
   – Straggler handling :
       • more work to faster machines, and less to slower machines
• DDL : enables users to create tables from Impala
• JDBC Driver (shared with Hive)
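
The slide does not say how straggler handling will be implemented. One common way to give more work to faster machines is to let each worker pull the next task as soon as it is free, instead of splitting the work evenly up front. The simulation below illustrates that idea only; the worker names, task durations and the `assign_dynamically` function are hypothetical.

```python
import heapq


def assign_dynamically(num_tasks, task_duration):
    """Simulate workers pulling equal-sized tasks from a shared queue.

    `task_duration` maps worker name -> time units needed per task.
    Returns how many tasks each worker ended up processing.
    """
    done = {name: 0 for name in task_duration}
    # Heap of (time at which the worker becomes free, worker name).
    free_at = [(0, name) for name in sorted(task_duration)]
    heapq.heapify(free_at)
    for _ in range(num_tasks):
        now, name = heapq.heappop(free_at)  # next worker to become free
        done[name] += 1
        heapq.heappush(free_at, (now + task_duration[name], name))
    return done


# The slow machine needs three times as long per task as the fast one.
counts = assign_dynamically(12, {"fast": 1, "slow": 3})
print(counts)
```

With a static even split, the slow machine would hold up the whole query; here it simply ends up with a smaller share of the tasks.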

                                                                     16 / 18
Where to learn more and get started

• Impala Documentation
   https://ccp.cloudera.com/display/IMPALA10BETADOC/Cloudera+Impala+1.0+Beta+Documentation
• Cloudera’s Impala Demo VM
   https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Impala+Demo+VM
• Cloudera Blog
   http://blog.cloudera.com/blog/category/impala/
• Impala-user Google Group
   https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!forum/impala-user
• (Unofficial) presentation at Apache Asia Road Show
   http://sizeofvoid.net/wp-content/uploads/ImpalaIntroduction.pdf
• Official announcement of Impala at Strata Conference NY 2012
   http://www.slideshare.net/cloudera/2012-1025-hadoop-world-impala-16x9
• Dremel: Interactive Analysis of Web-Scale Datasets
   http://research.google.com/pubs/pub36632.html
• BigQuery Technical White Paper
   https://cloud.google.com/files/BigQueryTechnicalWP.pdf


                                                                                             17 / 18
Conclusion


• Use Impala when you need to quickly
  find or compute a small result from a large
  data source
• Impala does not replace batch-oriented jobs
• The Impala beta and its documentation are quite good
  for a beta
  – If you can’t wait for Impala v1.0, try BigQuery



                                                      18 / 18

Editor's Notes

  1. Database containing the test analysis data for patient specimens, together with the results. Running analytical queries against the production database is very slow and can interfere with normal operation with...
  2. Marcel Kornacker is the architect of Impala. Prior to joining Cloudera, he was the lead developer for the query engine of Google’s F1 project
  3. You may well have an OLAP cube, but not for this specific use case…
  4. Impala uses SSE4.2 for checksumming (2X faster than without SSE4.2), e.g. Intel Nehalem+, AMD Bulldozer+
  5. 37signals: web application company (where Ruby on Rails originated)
  6. Recall the objective of the talk: Key points: Meaningful, easy-to-remember message: