BDM8 - Near-realtime Big Data Analytics using Impala

Near-realtime
Big Data Analytics
using Impala

David Lauzon
Big Data Montreal #8
January 10th 2013
1 / 18

Plan

• What is Impala?
• Why did Google build Dremel?
• Use cases for Impala
• Use cases for Map-Reduce
• Cloudera Customer Survey
• Impala Features
• Impala Performance Expectations and Benchmarks
• Impala Components
• Impala Architecture
• Impala Development Roadmap
• Where to learn more and get started

2 / 18

Disclaimer

• To keep the descriptions of Dremel and Impala as
accurate as possible, most of the content in this
presentation comes from the authors of the respective
technologies. References are listed at the end
of the presentation.
• I am not affiliated with or sponsored by Cloudera or
Google.

3 / 18

What is Impala?

“An Impala is an athletic, graceful,
African antelope, famous for its
speed and its agility in jumping”
- Wikipedia

4 / 18

Seriously, what is Impala?

• “Impala enables real-time, interactive,
analytical queries of the data stored in
HBase or HDFS” – Cloudera

• Inspired by the Google Dremel paper (2010)
– BigQuery is Google’s hosted service built on Dremel
• It’s proprietary, not free, and requires uploading your
data to Google servers

5 / 18

Why did Google build Dremel?

• Problems with Data Warehouse Solutions for OLAP/BI:
– Relational OLAP (ROLAP):
• Need to build indices for every possible query (for performance
reasons)
→ Index size could take up the whole RAM
– Multi-dimensional OLAP (MOLAP):
• Requires extensive time and money to design and build the data cubes
– Ad-hoc queries (specific, non-optimised queries):
• Needed when you don’t know in advance what you’ll need, or when you
work in iterations, which is quite often!

• Solution:
– Increase full-scan speed without requiring indexing or
pre-aggregated values

6 / 18

Come on, give me some use cases!

• Finding particular records with specified
conditions.
– “Find all the locations from which account ‘ABC’ was
accessed”.
• Quick aggregation of statistics with dynamically-
changing conditions:
– “Can you give me yesterday’s number of impressions
for Google AdWords display ads – but only in the
Tokyo region?”
• Trial-and-error data analysis:
– “And between 11am and 1pm?”
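
The trial-and-error pattern above can be sketched in a few lines of Python. This is only an illustration: the standard-library sqlite3 module stands in for Impala, and the table name, column names and sample rows are all hypothetical. No index is created, so each question is answered by a scan with a different filter.

```python
import sqlite3

# In-memory database standing in for an Impala table (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE impressions "
    "(day TEXT, hour INTEGER, region TEXT, ad_type TEXT, n INTEGER)"
)
# Hypothetical sample rows: (day, hour of day, region, ad type, impression count).
rows = [
    ("2013-01-09", 10, "Tokyo", "display", 120),
    ("2013-01-09", 11, "Tokyo", "display", 200),
    ("2013-01-09", 12, "Tokyo", "display", 150),
    ("2013-01-09", 12, "Osaka", "display", 90),
    ("2013-01-09", 14, "Tokyo", "search", 300),
]
conn.executemany("INSERT INTO impressions VALUES (?, ?, ?, ?, ?)", rows)


def count_impressions(day, region, first_hour=0, last_hour=23):
    """Sum display-ad impressions for one day and region, optionally by hour range."""
    sql = (
        "SELECT COALESCE(SUM(n), 0) FROM impressions "
        "WHERE day = ? AND region = ? AND ad_type = 'display' "
        "AND hour BETWEEN ? AND ?"
    )
    return conn.execute(sql, (day, region, first_hour, last_hour)).fetchone()[0]


# First question: yesterday's display impressions in the Tokyo region.
print(count_impressions("2013-01-09", "Tokyo"))
# Follow-up question: same, but only from 11am to 1pm (hours 11 and 12).
print(count_impressions("2013-01-09", "Tokyo", 11, 12))
```

Each follow-up question only changes the filter values, which is the kind of iteration an interactive engine is meant to make cheap.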

7 / 18

Use cases for which you should stick
with Map-Reduce based applications

• Very long-running, batch-oriented tasks
such as ETL:
– e.g. exporting large amounts of data after processing
• Complex event processing:
– e.g. stream-processing
• “Complex data mining on Big Data which requires
multiple iterations and paths of data processing
with programmed algorithms” - Google

8 / 18

Integration with Hadoop

• Cloudera Customer Survey (Aug. 2012)
– 80% need faster queries on Hadoop data
– 65% query Hadoop using Hive
– 70% move data from Hadoop to RDBMS for
interactive SQL
– 60% see value today in consolidating to a single
platform

9 / 18

Impala Features

• Shared with Hive:
– Hive MetaStore
– Hive SQL (most common SQL-92 features)
– ODBC Driver
– User Interface (Hue Beeswax)
• Specific to Impala:
– No MapReduce; data is transferred in memory instead
– Host and Disk Awareness (data locality)
– Table data caching in RAM
– No virtual columns or locking

10 / 18

Impala Performance Expectations

• Performance improvements over Hive
– 3 - 4X for purely I/O bound queries
– 7 - 45X for queries with at least one join
– 20 - 90X when data available in the cache

11 / 18

External Benchmarks

• Searching log files at 37signals
(creators of the Ruby on Rails web framework)

Query time by engine:

Workload                                                    Impala   Hive     MySQL
5.2 GB HAProxy log – top IPs by request count               3.1s     65.4s    146.0s
5.2 GB HAProxy log – top IPs by total request time          3.3s     65.2s    164.0s
800 MB parsed Rails log – slowest accounts                  1.0s     33.2s    48.1s
800 MB parsed Rails log – highest database time paths       1.1s     33.7s    49.6s
8 GB pageview table – daily pageviews and unique visitors   22.4s    92.2s    180.0s

http://37signals.com/svn/posts/3315-how-i-came-to-love-big-data-or-at-least-acknowledge-its-existence
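
To put the 37signals figures next to the expected speedups from the previous slide, the ratios can be computed directly. The times below are copied from the table above; the short workload labels and the `speedups` helper are made up for this sketch.

```python
# Query times in seconds (Impala, Hive, MySQL), copied from the 37signals table.
benchmarks = {
    "HAProxy log, top IPs by request count": (3.1, 65.4, 146.0),
    "HAProxy log, top IPs by total request time": (3.3, 65.2, 164.0),
    "Rails log, slowest accounts": (1.0, 33.2, 48.1),
    "Rails log, highest database time paths": (1.1, 33.7, 49.6),
    "Pageview table, daily pageviews and unique visitors": (22.4, 92.2, 180.0),
}


def speedups(times):
    """Return how many times faster Impala was than Hive and than MySQL."""
    impala, hive, mysql = times
    return round(hive / impala, 1), round(mysql / impala, 1)


for name, times in benchmarks.items():
    vs_hive, vs_mysql = speedups(times)
    print(f"{name}: {vs_hive}x vs Hive, {vs_mysql}x vs MySQL")
```

On these numbers Impala comes out roughly 4x to 33x faster than Hive, with the smallest gain on the largest table.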

12 / 18

Impala Components

• Impala State Store : 1 per cluster
– Coordinates information (location and status) about
all the running impalad instances
• Impala Daemon : 1 per DataNode
– Coordinates and executes queries
– Distributes query fragments to other Impala Daemons
• Impala Shell : 1 per node
– Provides Command Line Interface allowing
interactions with Impala
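
The division of labour between these components can be pictured as a scatter-gather pattern. The sketch below is a toy model in Python, not Impala's actual code: the node names, the sample data and the `run_fragment` / `coordinate` functions are all hypothetical, and the list of live nodes plays the role of the information the State Store provides.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical local data: the client IPs stored on each DataNode.
node_blocks = {
    "node1": ["10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "node2": ["10.0.0.2", "10.0.0.1"],
    "node3": ["10.0.0.3", "10.0.0.1"],
}


def run_fragment(node):
    """Query fragment run by one daemon: count IPs in that node's local data only."""
    return Counter(node_blocks[node])


def coordinate(live_nodes):
    """Coordinating daemon: send the fragment to every live node, then merge results."""
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(run_fragment, live_nodes))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # Counter.update adds counts together
    return total.most_common()


# The list of live nodes stands in for what the State Store knows about the cluster.
print(coordinate(["node1", "node2", "node3"]))
```

Each fragment only reads data local to its node, and only the small partial counts travel to the coordinator.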

13 / 18

Impala Architecture

14 / 18

Roadmap : 0.3 Beta Version

• Operating System:
– Only RHEL/CentOS 6.2 is supported
• File formats:
– Text files, SequenceFiles, HBase table
• Compression:
– Snappy, Gzip, BZip
• No UDFs or user extensibility *
• Largest table in joins must be specified first *
• Right-side of join must fit in RAM *
• No support for complex nested structures * :
– e.g. maps, structs and arrays.
* Top asks for versions after 1.0 G.A.
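
The two join restrictions above are what one would expect from an in-memory hash join, where the right-hand table is loaded into a hash table and the left-hand table is streamed past it. The slides do not name the join algorithm, so the sketch below is an assumption for illustration; the function, tables and column names are hypothetical.

```python
def hash_join(left_rows, right_rows, key):
    """Inner join that holds the whole right side in memory and streams the left."""
    # Build phase: the entire right side goes into a dict, so it must fit in RAM.
    table = {}
    for right in right_rows:
        table.setdefault(right[key], []).append(right)
    # Probe phase: the (larger) left side is read once, row by row.
    for left in left_rows:
        for right in table.get(left[key], []):
            yield {**left, **right}


# Hypothetical data: a large fact table on the left, a small lookup table on the right.
visits = [
    {"account_id": 1, "path": "/a"},
    {"account_id": 2, "path": "/b"},
    {"account_id": 1, "path": "/c"},
    {"account_id": 3, "path": "/d"},
]
accounts = [
    {"account_id": 1, "name": "ABC"},
    {"account_id": 2, "name": "XYZ"},
]

for row in hash_join(visits, accounts, "account_id"):
    print(row)
```

Putting the largest table first keeps it on the streaming side, so only the smaller table has to fit in memory.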
15 / 18

Roadmap : 1.0 General Availability
(Q1 2013)

• File Formats:
– RCFile, Avro, LZO
– Trevni : new columnar file-format by Doug Cutting
• More OS Support:
– Same as those supported by CDH4
• Performance:
– Faster, bigger, and more memory efficient joins and
aggregations
– Straggler handling :
• more work to faster machines, and less to slower machines
• DDL : enables users to create tables from Impala
• JDBC Driver (shared with Hive)
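
The straggler-handling item above can be illustrated with a small simulation. The slides do not describe how Impala will implement it, so this is only one possible approach, stated as an assumption: each piece of work goes to whichever worker becomes free first. The `assign` function, worker names and speeds are hypothetical.

```python
import heapq


def assign(task_costs, speeds):
    """Give each task to the worker that is free earliest; return tasks per worker."""
    # Heap of (time at which the worker becomes free, worker name).
    free_at = [(0.0, worker) for worker in sorted(speeds)]
    heapq.heapify(free_at)
    done = {worker: 0 for worker in speeds}
    for cost in task_costs:
        now, worker = heapq.heappop(free_at)
        done[worker] += 1
        # A faster worker finishes the same task sooner and returns to the heap earlier.
        heapq.heappush(free_at, (now + cost / speeds[worker], worker))
    return done


# Nine equal tasks, one worker twice as fast as the other.
print(assign([1.0] * 9, {"fast": 2.0, "slow": 1.0}))
```

In this run the fast worker ends up with twice as many tasks as the slow one, without any task being pre-assigned.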

16 / 18

Where to learn more and get started

• Impala Documentation
https://ccp.cloudera.com/display/IMPALA10BETADOC/Cloudera+Impala+1.0+Beta+Documentation
• Cloudera’s Impala Demo VM
https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Impala+Demo+VM
• Cloudera Blog
http://blog.cloudera.com/blog/category/impala/
• Impala-user Google Group
https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!forum/impala-user
• (Unofficial) presentation at Apache Asia Road Show
http://sizeofvoid.net/wp-content/uploads/ImpalaIntroduction.pdf
• Official announcement of Impala at Strata Conference NY 2012
http://www.slideshare.net/cloudera/2012-1025-hadoop-world-impala-16x9
• Dremel: Interactive Analysis of Web-Scale Datasets
http://research.google.com/pubs/pub36632.html
• BigQuery Technical White Paper
https://cloud.google.com/files/BigQueryTechnicalWP.pdf

17 / 18

Conclusion

• Use Impala when you need to quickly
find or compute a small result from a large
data source
• Impala does not replace batch-oriented jobs
• The Impala beta and its documentation are quite good
for a beta
– If you can’t wait for Impala v1.0, try BigQuery

18 / 18
