SQL on Hadoop
- 16. 16© Cloudera, Inc. All rights reserved.
Why Spark SQL
• Ease of embedding SQL into Java, Scala, or Python applications
• Easy language for common operations (e.g., aggregations, filters, samples)
• Seamlessly mix SQL and Spark code within a single application
• Improved performance with automatic optimizations (Intelligent Query Engine)
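The idea of mixing SQL with ordinary application code can be sketched in a few lines. This is only an illustration of the pattern: Python's built-in sqlite3 module stands in for the SQL engine, it is not Spark SQL, and the `tweets` table with its `retweets` column is made up for the example.

```python
import sqlite3

# ASSUMPTION: sqlite3 is a stand-in SQL engine, NOT Spark SQL.
# The table and its sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (user_name TEXT, retweets INTEGER)")
rows = [("newsycbot", 3), ("RideImpala", 12), ("fastly", 7), ("newsycbot", 5)]
conn.executemany("INSERT INTO tweets VALUES (?, ?)", rows)

# Declarative part: the filter and the aggregation are expressed in SQL.
totals = conn.execute(
    "SELECT user_name, SUM(retweets) FROM tweets "
    "WHERE retweets >= 5 GROUP BY user_name ORDER BY user_name"
).fetchall()

# Procedural part: the query result is plain Python data, so the
# application continues with regular code.
top_user = max(totals, key=lambda row: row[1])[0]
print(totals, top_user)
```

In a Spark application the same split would apply: the query text goes to the SQL engine and the surrounding logic stays in the host language.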
- 18.
What’s Impala?
• Interactive SQL
• Typically 5-70x faster than the latest Hive
• Responses in seconds instead of minutes (sometimes sub-second)
• ANSI-92 standard SQL queries with HiveQL
• Compatible SQL interface for existing Hadoop/CDH applications
• Based on industry standard SQL
• Natively on Hadoop/HBase storage and metadata
• Flexibility, scale, and cost advantages of Hadoop
• No duplication/synchronization of data and metadata
• Local processing to avoid network bottlenecks
• Separate runtime from batch processing
• Hive, Pig, and MapReduce are designed for, and great at, batch processing
• Impala is purpose-built for low-latency SQL queries on Hadoop
- 29.
Parquet Overview
• State-of-the-art, open source columnar file format that’s available for (most) Hadoop processing frameworks:
• Impala, Hive, Pig, MapReduce, Spark, Cascading, Crunch, Drill, Tajo, …
• Offers both high compression and high scan efficiency
• Co-developed by Twitter and Cloudera
• Contributions from Criteo, Stripe, Berkeley AMPlab, LinkedIn
• Top-Level Apache Project
- 30.
Columnar storage
Tweet_id: {25059873, 22309487, 23059861, 23010982}
User_name: {newsycbot, RideImpala, fastly, llvmorg}
Created_at: {1442865158, 1442828307, 1442865156, 1442865155}
text: {Visual exp…, Introducing .., Missing July…, LLVM 3.7….}
- 31.
Columnar storage
Tweet_id (1GB): {25059873, 22309487, 23059861, 23010982}
User_name (2GB): {newsycbot, RideImpala, fastly, llvmorg}
Created_at (1GB): {1442865158, 1442828307, 1442865156, 1442865155}
text (200GB): {Visual exp…, Introducing .., Missing July…, LLVM 3.7….}
SELECT COUNT(*) FROM tweets WHERE user_name = 'newsycbot';
Only 1 column (User_name) is read
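The effect of the columnar layout on this query can be sketched with a toy in-memory model. This is not how Parquet stores data on disk; the function names and the "values read" counter are illustrative assumptions, and the four rows are the sample values from the slide.

```python
# Toy model: the four sample rows from the slide (text values stay elided).
COLUMNS = ("tweet_id", "user_name", "created_at", "text")
rows = [
    (25059873, "newsycbot", 1442865158, "Visual exp…"),
    (22309487, "RideImpala", 1442828307, "Introducing .."),
    (23059861, "fastly", 1442865156, "Missing July…"),
    (23010982, "llvmorg", 1442865155, "LLVM 3.7…."),
]

# Columnar layout: one contiguous list per column instead of one tuple per row.
columns = {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def count_where_row_store(rows, user_name):
    """Row layout (hypothetical helper): every field of every row is read."""
    matches = 0
    values_read = 0
    for row in rows:
        values_read += len(row)  # the whole row is fetched to test one field
        if row[1] == user_name:
            matches += 1
    return matches, values_read


def count_where_column_store(columns, user_name):
    """Columnar layout (hypothetical helper): only user_name is scanned."""
    scanned = columns["user_name"]
    matches = sum(1 for value in scanned if value == user_name)
    return matches, len(scanned)


row_result = count_where_row_store(rows, "newsycbot")
col_result = count_where_column_store(columns, "newsycbot")
print(row_result, col_result)
```

Both functions return the same count, but the columnar version touches a quarter of the values. If the sizes on the slide are read in column order, the same query would scan roughly the 2GB user_name column rather than the full 204GB.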
- 33.
Impala Performance
• Impala’s latest milestone:
• Speed comparable to a commercial MPP DBMS
• Natively on Hadoop
• Three result sets:
• Impala vs Hive (Impala 6-70x faster)
• Impala vs “DBMS-Y” (Impala average of 2x faster)
• Impala scalability (Impala achieves linear scale)
• Background:
• 20 pre-selected, diverse TPC-DS queries (modified to remove unsupported language)
• Sufficient data scale for realistic comparison (3 TB, 15 TB, and 30 TB)
• Methodical testing (multiple runs, reviewed fairness for competition, etc.)
• Details: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/
- 38.
Impala Roadmap
2H 2015
• SQL Support & Usability: nested structures; Kudu updates (beta)
• Management & Security: record reader service (beta); finer-grained security (Sentry)
• Integration: Isilon support; Python interface (Ibis)
• Performance & Scale: improved predictability under concurrency
1H 2016
• Performance & Scale: continued scalability and concurrency; initial perf/scale improvements
• Management & Security: improved admission control; resource utilization and showback
• SQL Support & Usability: dynamic partitioning; improved timestamp compatibility
2016
• Performance & Scale: >20x performance; multi-threaded joins/aggregations; continued scale work
• Management & Security: improved YARN integration; automated metadata
• Integration: S3 support
• SQL Support & Usability: nested types with Avro; Date type; added SQL extensions
- 46.
Performance Benchmark Takeaways
• Impala unlocks BI usage directly on Hadoop
• Meets BI low-latency and multi-user requirements
• Advantage widens when going from a single user to just 10 concurrent users
• Hive is designed (and still great) for batch processing
• Most Impala customers use Hive for data preparation
• Hive is the most commonly used ETL framework
• Spark SQL enables easier Spark application development
• Enables mixed procedural Spark (Java/Scala) and SQL job development
• Mid-term trends will further favor Impala’s design approach for latency and concurrency
• More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
• CPU efficiency will increase in importance
• Native code enables easy optimizations for CPU instruction sets
• Intel joint roadmap supports these opportunities