Impala 2.0 - The Best Analytic Database for Hadoop
1. Impala 2.0
The Leading Analytic Database for Hadoop
Justin Erickson | Director, Product Management
2. Notification
© 2014 Cloudera, Inc. All rights reserved.
• The information in this document is proprietary to Cloudera. No part of this document may be
reproduced, copied or transmitted in any form for any purpose without the express prior written
permission of Cloudera.
• This document is a preliminary version and not subject to your license agreement or any other
agreement with Cloudera. This document contains only intended strategies, developments and
functionalities of Cloudera products and is not intended to be binding upon Cloudera to any
particular course of business, product strategy and/or development. Please note that this
document is subject to change and may be changed by Cloudera at any time without notice.
• Cloudera assumes no responsibility for errors or omissions in this document. Cloudera does not
warrant the accuracy or completeness of the information, text, graphics, links or other items
contained within this material. This document is provided without a warranty of any kind, either
express or implied, including but not limited to the implied warranties of merchantability, fitness
for a particular purpose or non-infringement.
• Cloudera shall have no liability for damages of any kind including without limitation direct,
special, indirect or consequential damages that may result from the use of these materials. The
limitation shall not apply in cases of gross negligence.
3. Agenda
• Impala Overview
• Milestones and 2.0 Features
• SQL-on-Hadoop Performance Update
• What’s Next
4. The Right SQL Engine for the Use Case
[Figure: SQL engines matched to use cases: Batch Processing; BI and SQL Analytics; Spark Developers]
5. Analytic Database for Hadoop Requires
• Multi-User Interactive Performance: Interaction at the speed of thought
• Compatibility: Familiar BI tools/SQL interfaces
• Usability: Accessible to broad range of applications
• Flexibility: Use SQL along with other Hadoop frameworks across all data
• Native in Hadoop: Unified resource management, metadata, security, and management across frameworks
6. Impala’s Benefits
• Multi-User Interactive Performance ✔
  • 10x vs alternatives with latest benchmarks
  • Performance advantage increases with multi-user
• Compatibility ✔
  • Provides both ANSI SQL and vendor-specific extensions
  • Compatibility with the leading BI partners
• Usability ✔
  • Cost-based optimization allows for more users and tools to run a broader range of queries
• Flexibility ✔
  • Supports the common native Hadoop file formats
  • Parquet provides best-of-breed columnar performance across Hadoop frameworks
• Native in Hadoop ✔
  • Unified with Hadoop’s resource management, metadata, security, and management
7. Most Common Scenarios
Single Platform for Data Processing and Analytics
• Interactive BI/analytics on “big data”
• Data discovery
• Exploratory analytics
• Queryable operational data store
[Figure: platform stack. Engines: Batch Processing (MAPREDUCE, HIVE & PIG, …), Interactive SQL (IMPALA), Interactive Search (Solr), Interactive Analytics (SAS, R, …); Resource Management; Storage: HDFS (TEXT, RCFILE, PARQUET, AVRO…), HBase (RECORDS); Integration; Metadata; Management | Support]
8. Most Common Use Cases
Operational Dashboards
Example: Healthcare Insurance Company
Goal:
• Visualizations of current hospital spending and comparison to peers and historical data
• Integrate thousands of client hospital purchasing systems
Key benefits of Impala:
• Simplification via unification
• Saved license costs over a traditional DBMS
• Enabled finer-grained details in source data vs. planned summarized extracts
• 3 nodes of Impala outperformed a rack of the traditional RDBMS on their workload
Data Discovery
Example: Major Financial Institution
Goal:
• Fraud group looking at internal / external fraud
• Captured internal systems and external application/website logs
Key benefits of Impala:
• Flexibility to have more data readily available without upfront modeling
• Ability to use existing BI visualization tools
• Better TCO
9. Previous Key Milestones and Features
• Impala 1.0
  • ~SQL-92 (minus correlated subqueries)
  • Native Hadoop file formats (Parquet, Avro, text, Sequence, …)
  • Enterprise-readiness (authentication, ODBC/JDBC drivers, etc.)
  • Service-level resource isolation with other Hadoop frameworks
• Impala 1.1
  • Fine-grained, role-based authorization via Apache Sentry
  • Auditing (Impala 1.1.1 and CM 4.7+)
• Impala 1.2
  • Custom language extensibility (UDFs, UDAFs)
  • Cost-based join-order optimization
  • On-par performance compared to traditional MPP query engines while maintaining native Hadoop data flexibility
• Impala 1.3 / CDH 5.0 (a version for CDH 4.x is also available)
  • Resource management
• Impala 1.4 / CDH 5.1 (a version for CDH 4.x is also available)
  • More SQL compatibility (DECIMAL, vendor-specific extensions, ORDER BY without LIMIT, etc.)
  • HDFS caching
  • Faster performance (selective queries and compute stats in particular)
10. Impala 2.0 Key Updates
• Same great multi-user interactive performance
• Removed limits on SQL compatibility
  • SQL:2003 analytic/window functions
  • Subqueries in WHERE clause, EXISTS, and IN
  • Additional data types (CHAR and VARCHAR)
  • GRANT/REVOKE functions via Sentry
  • Additional vendor-specific SQL extensions
• Removed limits on query size
  • Disk-based query processing
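To make the SQL additions on this slide concrete, here is a minimal sketch of the three query shapes it names: a subquery with IN, a correlated subquery with EXISTS, and an analytic (window) function. It is an illustration only. It runs against SQLite through Python's standard `sqlite3` module as a stand-in engine, not against Impala, and the `customers` and `orders` tables are hypothetical. The window query is skipped on SQLite builds older than 3.25, which lack window functions.

```python
import sqlite3

# SQLite is used only as a convenient stand-in SQL engine; this is NOT Impala.
# The tables and rows below are hypothetical illustration data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, region VARCHAR(10));
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
    INSERT INTO orders VALUES (1, 50), (1, 70), (2, 20);
""")

# Subquery in the WHERE clause with IN: customers with an order above 40.
in_rows = conn.execute("""
    SELECT id FROM customers
    WHERE id IN (SELECT customer_id FROM orders WHERE amount > 40)
    ORDER BY id
""").fetchall()

# Correlated subquery with EXISTS: customers with at least one order.
exists_rows = conn.execute("""
    SELECT c.id FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY c.id
""").fetchall()

# Analytic/window function: rank each customer's orders by amount.
window_rows = []
if sqlite3.sqlite_version_info >= (3, 25, 0):
    window_rows = conn.execute("""
        SELECT customer_id, amount,
               RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS rnk
        FROM orders
        ORDER BY customer_id, rnk
    """).fetchall()
```

The same statements use standard SQL syntax, but dialect details (data types, built-in functions) differ between engines, so check them against the Impala documentation before relying on them.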
11. September SQL-on-Hadoop Benchmark: Impala, Presto, Stinger, Spark SQL
• Benchmarks on:
  • Impala (1.4.0)
  • Presto (0.74)
  • Stinger final phase (phase 3), aka Hive 0.13.0
  • Spark SQL (1.1)
• As always, our public benchmarks are:
  • Based on industry standards (TPC)
  • Repeatable (https://github.com/cloudera/impala-tpcds-kit)
  • Methodical testing with multiple runs on same hardware
  • Help competing software put its best foot forward
    • SQL-92 join style for engines without CBO
    • JVM tuning for Presto
    • Run on optimal file formats for each
• Full details on our blog: http://blog.cloudera.com/blog/2014/09/new-benchmarks-for-sql-on-hadoop-impala-1-4-widens-the-performance-gap/
12. Impala’s Multi-User Over 10x Faster: Gap widening compared to May’s update
13. Faster = More Work in Less Time: Impala enables over 8.7x throughput
14. Performance Takeaways
• Impala’s advantage expands from 5x single-user to >10x with just 10 users
• Performance gap has widened since May
  • Impala’s single-user lead over Presto went from 5x to 7.5x
  • Impala’s single-user lead over Hive/Tez went from 5x to 9x
• Mid-term trends will further favor Impala’s design approach
  • More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
  • CPU efficiency will increase in importance
    • Native code enables easy optimizations for CPU instruction sets (e.g. floating point operations, math operations, encrypt/decrypt)
    • The Intel joint roadmap helps support these opportunities
15. IBM Research Validation
• New VLDB academic paper comparing Impala and Hive (on both MR and Tez) for SQL-on-Hadoop
  • http://www.vldb.org/pvldb/vol7/p1295-floratou.pdf
• Impala is significantly more efficient than Hive/Tez or Hive/MR
  • “Impala’s database-like architecture provides significant performance gains, compared to Hive’s MapReduce or Tez based runtime”
  • Correctly attributes Impala’s lead to its CPU efficiency, IO manager, and overall architecture that resembles a shared-nothing parallel database
• Parquet more efficient than ORC
  • “The Parquet format skips data more efficiently than ORC which tends to pre-fetch unnecessary data especially when a table contains a large number of columns”
• Note: Paper is single-user only. Multi-user would make the gap even wider
  • Our published results show the ~5x single-user Impala lead goes to ~10x with just 10 users; see our blog: http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the-apache-hadoop-ecosystem-why-impala-continues-to-lead/
  • Same CPU efficiency, IO manager, and overall architectural reasons
• Additional Notes:
  • Impala 2.0 will have disk-based joins and aggregations
  • Impala 1.4 is significantly faster on selective joins than Impala 1.2.2 used in the paper
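The general idea behind a disk-based aggregation can be sketched as follows. This is a simplified illustration under stated assumptions, not Impala's implementation: the function name `spilling_sum`, the `max_groups` memory budget, and the pickle-based spill files are all hypothetical. Groups are aggregated in memory until the budget is reached; rows for any further keys are hash-partitioned into temporary files and aggregated one partition at a time afterwards.

```python
import os
import pickle
import tempfile


def spilling_sum(rows, max_groups, num_partitions=4):
    """Sum values per key, spilling rows to disk once max_groups keys are in memory.

    Hypothetical sketch of a spill-to-disk aggregation; not Impala's algorithm.
    """
    totals = {}  # groups that fit within the in-memory budget
    with tempfile.TemporaryDirectory() as tmp:
        spill_files = None
        for key, value in rows:
            if key in totals:
                totals[key] += value
            elif len(totals) < max_groups:
                totals[key] = value
            else:
                # Budget exhausted: route the row to a partition file chosen by
                # key hash, so all rows for one key land in the same file.
                if spill_files is None:
                    spill_files = [
                        open(os.path.join(tmp, f"part{i}.bin"), "w+b")
                        for i in range(num_partitions)
                    ]
                pickle.dump((key, value), spill_files[hash(key) % num_partitions])

        result = dict(totals)
        for f in spill_files or []:
            # Aggregate each spilled partition independently. A real engine
            # would repartition a partition that is itself too large, and
            # would stream results out instead of collecting them in one dict.
            f.seek(0)
            partial = {}
            while True:
                try:
                    key, value = pickle.load(f)
                except EOFError:
                    break
                partial[key] = partial.get(key, 0) + value
            f.close()
            result.update(partial)
        return result
```

For example, calling `spilling_sum([("a", 1), ("b", 2), ("c", 3), ("a", 4)], max_groups=2)` keeps the groups for `a` and `b` in memory and routes the row for `c` through a spill file. Spilled keys never overlap with in-memory keys, because no new key is admitted to memory once the budget is full.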
16. Impala’s Analytic Database Leadership
1 More than 1 MM downloads since GA
2 Majority adoption across Cloudera EDH customers
3 Certification across key application partners: [partner logos]
4 De facto standard with multi-vendor support: [vendor logos] and others
5 Full Apache Open Source License
17. What’s Next?
• Usability
  • Nested data structures for greater data flexibility and expressiveness than traditional RDBMS systems
  • Ability to run on data natively stored in Amazon S3
  • Advanced security with lineage tracking and query redaction from logs
  • Built-in abilities for data maintenance and updates
• Compatibility
  • Continued additions of commonly used vendor-specific built-ins
  • Continued joint-development with BI partners
  • More advanced SQL:2010 set features
• Multi-User Performance
  • Focus on even better multi-user concurrency
  • Continued performance increases and leadership
18. It’s Not Just About SQL-on-Hadoop
The Platform for Big Data
• Single platform for processing & analytics
• Scales to thousands of servers
• No upfront schema
• 10% of the cost per TB
• Open source platform
[Figure: same platform stack as on slide 7 (engines, resource management, storage, integration, metadata, management and support)]
19. Try Impala Out!
• 100% Apache-licensed open source
• Downloads on http://impala.io/:
  • Live online
  • VM
  • Installation
• Questions/comments?
  • Community: http://impala.io/community
  • Email: impala-user@cloudera.org
Editor's notes
Our goal is to provide the best tools for a particular job
* Hive is the best for batch, and of course we want to make that experience better.
* Impala is purpose-built for interactive BI on Hadoop: latency, concurrency, vendor ecosystem, and partner certification.
* Spark SQL is built for supporting an advanced analyst’s direct interactions with data, where you’re mixing Spark and SQL.
Multi-user performance: enables BI users and analysts to interact with Hadoop data at the speed of thought
Compatibility: provides familiar BI tools/applications and SQL interfaces
Usability: accessible to the broad range of business users, analysts, and partner applications
Flexibility: gives users access to more data and the ability to use SQL along with the rest of the Hadoop frameworks across all their data
Native in Hadoop: easier, integrated administration with unified resource management, metadata, security, and management across frameworks