My presentation slides from Hadoop Summit, San Jose, June 28, 2016. See live video at http://www.makedatauseful.com/vid-solving-performance-problems-hadoop/ and follow along for context.
Moving analytic workloads into production: specific technical challenges and best practices for engineering SQL in Hadoop solutions, highlighting the next-generation engineering approaches behind the secret sauce we have implemented in the Actian VectorH database.
4. Actian at a Glance
10,000+ Customers
400+ Employees
HQ Palo Alto
8 Countries; 7 US Cities
3 Businesses: Data Management, Data Integration, Big Data Analytics
Banking, Insurance, Telecom and Media
6. Accidental Hadoop Tourist – Brief History
Diagram labels: Data; Business; Data Capture; Data Management & Integration; Analytics; Query & Analyze; Solutions; Problem Solved
7. Accidental Hadoop Tourist – Brief History
Diagram labels: Data; Business; Data Capture; Data Management & Integration; Analytics; Solutions; ??????
8. Accidental Hadoop Tourist – Brief History
Diagram labels: Data; Business; Data Capture; Data Management & Integration; Analytics (???); Solutions (???)
9. Modern, best-in-class analytic database technology provides:
• Measurable business impact: monetize Big Data to grow revenue, reduce cost, mitigate risk, enable new business
• The ability to make data-driven business decisions using a massively scalable platform
• Decisive reduction in the cost of high-performance analytics at scale
• Performance that can meet all SLAs
• Full leverage of existing SQL skills while deploying a modern analytic infrastructure
Grow Revenue; Reduce Cost; Mitigate Risk; Create New Business
Business Solution Architecture Challenges
10. Wide Ranges of Use Cases
• Financial Services: Advanced Credit Risk Analytics across billions of data points
• Internet Scale Application: Predictive Analytics across hundreds of millions of customers
• Media: Data Science and Discovery across trillions of IoT events
• Dept of Defense: Cyber-Security, network intrusion models every second
• Credit Card Processing: Fraud detection every millisecond
12. 3 Essential Big Data Concepts
0. Take nothing for granted
1. Partitioning vs Data skew
2. Data types matter
3. Maximize memory / minimize bottlenecks
4. Take nothing for granted
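Concept 1 (partitioning vs data skew) can be illustrated with a minimal Python sketch. It hash-partitions two hypothetical key columns and compares the largest partition with the average one. The function names, the use of crc32 as the hash, and the sample keys are assumptions made for illustration; nothing here is taken from VectorH itself.

```python
import zlib
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many rows land in each partition under hash partitioning."""
    sizes = Counter()
    for key in keys:
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        sizes[zlib.crc32(str(key).encode()) % num_partitions] += 1
    return [sizes.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 means perfectly even)."""
    return max(sizes) / (sum(sizes) / len(sizes))


# Hypothetical high-cardinality key: 100,000 distinct customer ids.
even = partition_sizes(range(100_000), 10)

# Hypothetical low-cardinality key: 90% of the rows share one country code.
skewed = partition_sizes(["US"] * 90_000 + ["CA"] * 5_000 + ["MX"] * 5_000, 10)

print("customer id skew :", round(skew_ratio(even), 2))
print("country code skew:", round(skew_ratio(skewed), 2))
```

With the skewed key, one partition holds at least 90% of the rows, so one worker does most of the work while the others sit idle. Picking a partition key with many evenly distributed values avoids that.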
17. Customer 360: Understanding Experience, Driving Revenue
Telecom Challenge
Vast and growing repository of proprietary click data, customer records, service call records, smartphone and device data, GPS location, web server, telephone, and network usage data.
Queries took minutes or hours, and sometimes never returned at all.
Critical business analysis on a consolidated customer 360 data lake was grinding to a halt.
Deeper market insights, visualization, data management, and operational optimization were all at risk.
18. Customer 360: Initial Architecture
Development System
• 300+ node cluster
• Hive access
• SQL-based BI / Data Science
• Data pre-processed because query performance was unacceptable
• Snapshot views taking days to return
19. Customer 360: Technical Improvements
Production Prototype
• 30-node cluster (10% of the size of the Hive cluster)
• Actian Vector on Hadoop solution
• SQL-based BI / Data Science
• No materialized view building required
• Joins on demand faster than aggregate tables in Hive
• Reduced storage requirements
• 91TB: two years of data, 1100 columns when joined
20. Customer 360: Understanding Experience, Driving Revenue
Results
Customer 360 across prior data silos
Leveraged for customer retention strategies
Predict and take proactive, tailored responses
Enables next-gen data-driven troubleshooting, impact analysis and root cause analysis
Impact
• Accelerated operations intelligence
• Improved customer experience
• Reduced customer churn
21. Financial Risk: Upgrading Legacy to Meet SLA
Challenge
Legacy single-purpose risk application took 3 hours to generate the end-of-day risk report and failed to meet changing SLAs for reporting risk.
In deciding to replace the risk application, the bank opted to build a multi-purpose risk application addressing multiple business requirements.
22. Financial Risk: Upgrading Legacy to Meet SLA
Legacy System
• Single-server architecture, MS SSAS, Oracle, ~30 applications
• Pre-processing of desired measures was exploding data volumes
• Cube and Analysis engines maxed out as they exceeded the 1.5TB range
• Unable to scale to the desired range of > 200GB/day of new data
• Impala attempt failed
• Highly invested in apps built on Analysis Services
23. Financial Risk: Upgrading Legacy to Meet SLA
New Possibilities
• Clustered solution: Hadoop, 5 and 10 nodes
• No pre-processing of cubes; SSAS partly kept
• Tested solutions from 1TB to 20TB at a time
• Produced interactive queries across large datasets
• Focused query results in 2s or less
• Processing all data in the database: 6s – 80s
• 2x nodes ~ 200% speed improvement
24. Financial Risk: Upgrading Legacy to Meet SLA
Results
Increased data analyzed by 100X
2–200B rows / 1–20TB
Report run in 28 seconds vs. 3 hours
Use of application for:
• Intra-day reporting (surveillance)
• End-of-day reporting (compliance)
• Overnight float investment options
• Annual CCAR Analysis
Chart legend: Actual; Goal
29. Technical Benchmarks: VectorH - SQL on Hadoop
TPC-H SF1000 *
VectorH vs other platforms, faster by how much?
Tuned platforms
Identical hardware **
* Not an official TPC result
** 10 nodes, each 2 x Intel 3.0GHz E5-2690v2 CPUs, 256GB RAM, 24x600GB HDD, 10Gb Ethernet, Hadoop 2.6.0
30. Actian VectorH Delivers More Efficient File Format
Better compression & functionality
Vector advantages:
• skip blocks via MinMax indexes
• sophisticated query processing
• efficient block format, esp. 64-bit int
31. Summary
Conscientious data handling & next-gen engineering take SQL in Hadoop to new levels.
All Hadoop users can move from development into production while delivering compelling business results.
32. Delivering the Results With Better Engineering
VectorH v5 – Spark integration, external table support, and more
1: We use vectorized processing to exploit modern CPU architecture. We execute one operation at a time on a vector of data, which allows for tight inner code loops without branching. This lets us use SIMD instructions and, because there is no branching, keeps the CPU pipelines from stalling on mispredicted branches.
A vector is typically 1024 rows of a single column, so it’s a manageable amount of data while the overhead per row is still negligible.
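The vector-at-a-time idea in note 1 can be sketched in a few lines of Python. This only shows the control flow: one simple primitive applied to a whole vector before the next primitive runs, with a 0/1 mask instead of a per-row branch. Python itself will not emit SIMD instructions, and the primitive names (`mul_vector`, `select_greater`, `sum_masked`) and the sample query are hypothetical, not VectorH internals.

```python
from array import array

VECTOR_SIZE = 1024  # rows per vector, the typical size quoted in the notes


def vectors(column, size=VECTOR_SIZE):
    """Yield successive fixed-size slices (vectors) of a single column."""
    for start in range(0, len(column), size):
        yield column[start:start + size]


def mul_vector(a, b):
    """Primitive 1: element-wise multiply of two vectors in one tight loop."""
    return array("d", (x * y for x, y in zip(a, b)))


def select_greater(vec, threshold):
    """Primitive 2: produce a 0/1 mask for vec > threshold."""
    return array("b", (x > threshold for x in vec))


def sum_masked(vec, mask):
    """Primitive 3: sum the values whose mask entry is 1, without an if per row."""
    return sum(x * m for x, m in zip(vec, mask))


def revenue_over(price, qty, threshold):
    """SUM(price * qty) WHERE price * qty > threshold, one vector at a time."""
    total = 0.0
    for p_vec, q_vec in zip(vectors(price), vectors(qty)):
        product = mul_vector(p_vec, q_vec)
        mask = select_greater(product, threshold)
        total += sum_masked(product, mask)
    return total


price = array("d", [1.0, 2.0, 3.0] * 1000)
qty = array("d", [2.0] * 3000)
print(revenue_over(price, qty, 3.0))
```

Each primitive touches at most 1024 values of one or two columns, which is what keeps the working set small enough to stay in cache in a compiled engine.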
2: A vector will fit in the CPU cache together with the code for a particular operation, so all execution is in-cache.
3: To feed this engine with enough data, we’re also applying the vectorized paradigm to the storage subsystem. First of all, we’re using a column store, so only relevant columns are read from disk. Data is stored in blocks of typically 512MB and a single block contains only data from a single column (there are exceptions). Blocks of different columns can be interleaved per block, but typically more than one block of the same column is grouped.
To keep the stable storage fast and defragmented, we use in-memory overlays to store updates to the data. These overlays are automatically flushed to stable storage when needed.
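A minimal sketch of the in-memory overlay idea, assuming a plain dictionary keyed by row position and a simple size threshold for flushing. The source does not describe the real overlay structure or flush policy, so the class name, the `flush_threshold` parameter, and the flush rule are all hypothetical.

```python
class ColumnWithOverlay:
    """A column whose stable data is only rewritten when the overlay is flushed."""

    def __init__(self, base, flush_threshold=4):
        self.base = list(base)   # stands in for the column data on stable storage
        self.overlay = {}        # row position -> updated value, kept in memory
        self.flush_threshold = flush_threshold
        self.flush_count = 0

    def update(self, position, value):
        """Record an update in memory; stable storage is not touched yet."""
        if not 0 <= position < len(self.base):
            raise IndexError(position)
        self.overlay[position] = value
        if len(self.overlay) >= self.flush_threshold:
            self.flush()

    def scan(self):
        """Read the column, letting overlay values win over stable values."""
        for position, value in enumerate(self.base):
            yield self.overlay.get(position, value)

    def flush(self):
        """Merge the overlay into stable storage in one sequential rewrite."""
        self.base = list(self.scan())
        self.overlay.clear()
        self.flush_count += 1


col = ColumnWithOverlay([10, 20, 30, 40, 50], flush_threshold=3)
col.update(1, 21)
print(list(col.scan()), col.base)
```

Readers always see the latest values, while the stable copy is rewritten in bulk rather than patched in place on every update.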
4: The blocks are stored compressed on-disk. We’ve got a number of lightweight compression algorithms and the most efficient one is chosen per block, depending on the data characteristics. The decompression takes place per vector and can be done in the CPU cache, which neatly ties in with our in-cache execution.
We have a buffer manager that predicts what blocks are needed when and makes sure no blocks that will be used in the near future are evicted from the buffer cache.
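The per-block choice of compression scheme in note 4 can be sketched as follows. The notes do not name the actual algorithms, so run-length and delta encoding are used here only as familiar examples of lightweight schemes, and the byte-cost model (`int_bytes`, `estimate_costs`) is a crude assumption made for illustration.

```python
def rle_encode(values):
    """Run-length encode: a list of [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def rle_decode(runs):
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out


def delta_encode(values):
    """Keep the first value, then store differences between neighbours."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def int_bytes(n):
    """Bytes needed for a signed integer of this magnitude (crude cost model)."""
    return (abs(n).bit_length() + 8) // 8


def estimate_costs(block):
    """Estimated encoded size in bytes of a non-empty block under each scheme."""
    runs = rle_encode(block)
    deltas = delta_encode(block)
    widest = max(int_bytes(v) for v in block)
    return {
        "plain": len(block) * widest,
        "rle": len(runs) * (widest + max(int_bytes(n) for _, n in runs)),
        "delta": int_bytes(deltas[0])
        + (len(deltas) - 1) * max((int_bytes(d) for d in deltas[1:]), default=0),
    }


def choose_scheme(block):
    """Pick the cheapest scheme for this block; other blocks may choose differently."""
    costs = estimate_costs(block)
    return min(costs, key=costs.get)


constant_block = [7] * 1000
sequence_block = list(range(1_000_000, 1_001_000))
print(choose_scheme(constant_block), choose_scheme(sequence_block))
```

Because the decision is made per block, a column with long runs in one region and steadily increasing values in another gets a different encoding in each region.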
5: We have min-max indexes on the disk blocks, so when data is not completely random we can narrow down the ranges of blocks we need to read from disk, per column.
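Block skipping with min-max indexes, as described in note 5, can be sketched like this. The `Block` class, the block size of 100 rows, and the sample columns are assumptions chosen to keep the example small; real blocks are far larger.

```python
from dataclasses import dataclass


@dataclass
class Block:
    values: list
    lo: int  # minimum value stored in the block
    hi: int  # maximum value stored in the block


def build_blocks(column, block_rows):
    """Split a column into blocks and record each block's min and max."""
    blocks = []
    for start in range(0, len(column), block_rows):
        chunk = column[start:start + block_rows]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_range(blocks, low, high):
    """Return (matching values, number of blocks read) for low <= value <= high."""
    matches, blocks_read = [], 0
    for block in blocks:
        if block.hi < low or block.lo > high:
            continue  # the min-max index proves no row in this block can match
        blocks_read += 1
        matches.extend(v for v in block.values if low <= v <= high)
    return matches, blocks_read


# Ordered data: each block covers a narrow value range, so most blocks are skipped.
ordered = build_blocks(list(range(1000)), 100)

# Scattered data: the same values in a scrambled order, so every block spans the range.
scattered = build_blocks([(i * 37) % 1000 for i in range(1000)], 100)

print(scan_range(ordered, 250, 349)[1], "blocks read on ordered data")
print(scan_range(scattered, 250, 349)[1], "blocks read on scattered data")
```

Both scans return the same rows, but the ordered layout reads far fewer blocks, which is why the benefit depends on the data not being completely random.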
6: Multi-core parallelism: the well-tuned query optimizer takes query sequencing, data partitioning, and HDFS block locality into account to spread work across the available threads. Results are produced in parallel, and the workload is balanced across system resources to improve throughput and response time.
All in all, the execution engine is able to do about 1.5GB/s per core, and high-end I/O subsystems are able to keep up with this.