BigBench is the brand new standard (TPCx-BB) for benchmarking and testing Big Data systems. The BigBench specification describes several application use cases combining the need for SQL queries, Map/Reduce, user code (UDF), Machine Learning, and even streaming. From the available implementation, we can test the different framework combinations such as Hadoop+Hive (with Mahout) and Spark (SparkSQL+MLlib) in their different versions and configurations, helping us to spot problems and possible optimizations of our data stacks.
This talk first introduces BigBench and the problems it can solve. It then presents benchmark results for Hive and Spark, each in versions 1 and 2, under distinct configurations including Tez, Mahout, and MLlib. Experiments are run on Cloud and On-Prem clusters with different numbers of nodes and data scales, taking into account interactive and batch usage. Results are further classified by use case, showing where each platform shines (or doesn't), and why, based on performance metrics and logfile analysis. The talk concludes with the main findings, the scalability, and the limits of each framework.
Originally presented at: https://dataworkssummit.com/munich-2017/sessions/using-bigbench-to-compare-hive-and-spark-versions-and-features/
Using BigBench to compare Hive and Spark (Long version)
1. Using BigBench to compare Hive and Spark
Nicolas Poggi, Alejandro Montero
April 2017
2. Outline
1. Intro to BSC and ALOJA
2. BigBench
3. Sequential tests
1. Data scales
4. Concurrency tests
5. Summary
3. Barcelona Supercomputing Center (BSC)
• Spanish national supercomputing center with a 22-year history in:
• Computer architecture, networking, and distributed systems research
• Based at BarcelonaTech University (UPC)
• Large ongoing life science computational projects
• Prominent body of research activity around Hadoop
• 2008-2013: SLA Adaptive Scheduler, Accelerators, Locality Awareness, Performance Management (7 publications)
• 2013-Present: Cost-efficient upcoming Big Data architectures (ALOJA) (8+ publications)
4. ALOJA: towards cost-effective Big Data
• Research project for automating characterization and optimization of Big Data deployments
• Open source Benchmarking-to-Insights platform and tools
• Largest Big Data public repository (70,000+ jobs)
• Community collaboration with industry and academia
http://aloja.bsc.es
[Diagram: Big Data Benchmarking, Online Repository, Web / ML Analytics]
6. The need for a new benchmark standard
• A benchmark captures the solution to a problem and guides decision making
• Database-related benchmark standards
• Transactional (OLTP): TPC-C and TPC-E
• Decision Support (DSS/OLAP): TPC-H and TPC-DS
• And for Big Data analytics properties?
• 3 Vs, ML, M/R
• Benchmark uses:
• System tuning and debugging
• Spread and broad Big Data ecosystem
• Set common rules
• Vendor comparison
• Transparency across the industry
7. What is BigBench (TPCx-BB [1])?
• End-to-end application level benchmark
• Result of many years of collaboration between industry and academia
• Covers most Big Data Analytical properties (3Vs)
• Based on a retailer company (extension of TPC-DS)
[1]: http://www.tpc.org/tpc_documents_current_versions/pdf/tpcx-bb_v1.2.0.pdf
BigBench history
• 2012: Launched at WBDB
• 2013: Published at SIGMOD
• 2014: First implementation
• 2016 (Feb): Standardized by TPC
• 2016 (Nov): TPCx-BB Version 1.2
8. BigBench use cases and process overview
• 30 business use cases covering:
• Merchandising,
• Pricing Optimization
• Product Return
• Customers...
• Implementation resulted in:
• 14 Declarative queries (SQL)
• 7 Queries with Natural Language Processing
• 4 Queries with data preprocessing with MapReduce jobs
• 5 Queries with Machine Learning post-processing
Process overview:
• Data generation
• Data loading
• Power test
• Throughput test 1
• Data refresh
• Throughput test 2
• Result: BB queries / hour
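The query breakdown and the phase sequence on this slide can be sketched in a few lines of Python. This is only an illustrative sketch: the category counts and phase names come from the slide, while the function names and the `naive_queries_per_hour` calculation are assumptions made for illustration and are not the official TPCx-BB metric formula, which is defined in the specification [1].

```python
# Query categories and counts as listed on the slide.
QUERY_CATEGORIES = {
    "Declarative (SQL)": 14,
    "Natural Language Processing": 7,
    "MapReduce preprocessing": 4,
    "Machine Learning post-processing": 5,
}

# Benchmark phases in the order shown in the process overview.
PHASES = [
    "Data generation",
    "Data loading",
    "Power test",
    "Throughput test 1",
    "Data refresh",
    "Throughput test 2",
]


def total_queries(categories):
    """Sum the per-category query counts."""
    return sum(categories.values())


def naive_queries_per_hour(num_queries, elapsed_seconds):
    """Hypothetical simplification: completed queries per elapsed hour.

    This is NOT the official TPCx-BB metric; it only illustrates the
    idea of a "queries / hour" style result.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_queries * 3600.0 / elapsed_seconds


if __name__ == "__main__":
    n = total_queries(QUERY_CATEGORIES)
    print("Total queries:", n)
    for step, phase in enumerate(PHASES, start=1):
        print(step, phase)
    # The elapsed time below is a made-up value, not a measured result.
    print("Naive queries/hour:", naive_queries_per_hour(n, 7200))
```

The four categories add up to the 30 use cases named on the slide, which is a quick sanity check on the breakdown.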
32. Conclusions (I)
• Hive on Tez improves SQL performance over Hive on MapReduce.
• It is also faster than Hive on Spark 1.
• Hive on Spark 2 is slightly faster.
• The Spark implementation is based on Hive…
• Could be faster on a native Spark implementation
• Spark MLlib has improved performance over Mahout.
• Best production combination on HDI: Apache Tez for SQL + Spark MLlib for Machine Learning.
33. Conclusions (II)
• In concurrency scenarios Spark gave better results than Hive on Tez
• Spark shows constant container execution and leverages in-memory caching
• Container execution follows an almost linear progression.
• Hive on Tez shows important spikes in container execution.
• Spark uses more containers than Hive on Tez when scaling up.
• Queueing times don’t seem to be an issue in either engine.
• BigBench is new, but it can already help us guide our decision making
• It needs more Big Data engines to be added
• We hope more vendors and providers start releasing audited results
34. Resources and references
BigBench and ALOJA
• Original BigBench Implementation repository
• https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench
• ALOJA benchmarking platform
• https://github.com/Aloja/aloja
• http://aloja.bsc.es/publications
• ALOJA fork of BigBench (adds support for HDI and fixes Spark)
• https://github.com/Aloja/Big-Data-Benchmark-for-Big-Bench
• TPCx-BB reference:
• http://www.tpc.org/tpc_documents_current_versions/pdf/tpcx-bb_v1.2.0.pdf
• Evaluating Hive and Spark SQL with BigBench – T. Ivanov et al.
• http://arxiv.org/ftp/arxiv/papers/1512/1512.08417.pdf
• BigBench SIGMOD 2013 paper
• http://dl.acm.org/citation.cfm?id=2463712&CFID=878332641&CFTOKEN=16933633
• The State of SQL-on-Hadoop in the Cloud – N. Poggi et al.
• https://doi.org/10.1109/BigData.2016.7840751
Big Data Benchmarking
• Big Data Benchmarking Community (BDBC) mailing list
• (~200 members from ~80 organizations)
• http://clds.sdsc.edu/bdbc/community
• Workshop Big Data Benchmarking (WBDB)
• http://clds.sdsc.edu/bdbc/workshops
• SPEC Research Big Data working group
• http://research.spec.org/working-groups/big-data-working-group.html
• Benchmarking slides and video:
• Benchmarking Hadoop:
• https://www.slideshare.net/ni_po/benchmarking-hadoop
• Michael Frank on Big Data benchmarking
• http://www.tele-task.de/archive/podcast/20430/
• Tilmann Rabl Big Data Benchmarking Tutorial
• http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl