Solving Performance Problems on Hadoop

Moving analytic workloads into production
Tyler Mitchell
Sr. Software Engineer
Actian Center of Excellence

Topics
How we got (stuck) here
Performance best practices
Sample business cases
Benchmarking results

Actian’s Lineage
• Ingres – 1970s
• Pervasive – 1982
• Versant – 1988
• Vectorwise – 2003
• ParAccel – 2006

Actian at a Glance
• HQ: Palo Alto
• Offices: 8 countries; 7 US cities
• Employees: 400+
• Customers: 10,000+
• Banking, Insurance, Telecom and Media
• 3 businesses: Data Management, Data Integration, Big Data Analytics

Accidental Hadoop Tourist – Brief History
[Diagram, slide 1 of 3: Data, Business, Data Capture, Data Management & Integration, Analytics (Query & Analyze), Solutions: Problem Solved]
[Diagram, slide 2 of 3: Data, Business, Data Capture, Data Management & Integration, Analytics, Solutions: ??????]
[Diagram, slide 3 of 3: Data, Business, Data Capture, Data Management & Integration, Analytics: ???, Solutions: ???]

Modern, best-in-class analytic database technology provides:
• Measurable business impact: monetize Big Data to grow revenue, reduce cost, mitigate risk, enable new business
• The ability to make data-driven business decisions using a massively scalable platform
• Decisive reduction in the cost of high-performance analytics at scale
• Performance that can meet all SLAs
• Full leverage of existing SQL skills while deploying a modern analytic infrastructure
Grow Revenue | Reduce Cost | Mitigate Risk | Create New Business
Business Solution Architecture Challenges

Wide Ranges of Use Cases
• Financial Services: advanced credit risk analytics across billions of data points
• Internet Scale Application: predictive analytics across hundreds of millions of customers
• Media: data science and discovery across trillions of IoT events
• Dept of Defense: cyber-security, network intrusion models every second
• Credit Card Processing: fraud detection every millisecond

3 Essential Big Data Concepts
0. Take nothing for granted
1. Partitioning vs Data skew
2. Data types matter
3. Maximize memory / minimize bottlenecks
4. Take nothing for granted
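
To make the "partitioning vs data skew" point concrete, here is a minimal sketch (not Actian code) that hash-partitions two hypothetical key columns and compares how evenly the rows spread. The column names, row counts, partition count and the use of `zlib.crc32` as the hash are all assumptions chosen for illustration.

```python
import zlib

def partition_counts(keys, num_partitions):
    """Count how many rows land in each partition under hash partitioning."""
    counts = [0] * num_partitions
    for key in keys:
        # crc32 is used only because it is deterministic across runs.
        counts[zlib.crc32(str(key).encode("utf-8")) % num_partitions] += 1
    return counts

def skew_ratio(counts):
    """Largest partition divided by the mean partition size (1.0 = perfectly even)."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean else 0.0

# Hypothetical data: a high-cardinality key versus a key dominated by one value.
customer_ids = range(100_000)
country_codes = ["US"] * 90_000 + ["CA"] * 5_000 + ["MX"] * 5_000

even = partition_counts(customer_ids, 10)
skewed = partition_counts(country_codes, 10)
print("customer_id skew ratio:", round(skew_ratio(even), 2))
print("country_code skew ratio:", round(skew_ratio(skewed), 2))
```

With the low-cardinality key, at least 90% of the rows hash to a single partition, so one node does most of the work while the others sit idle. Partitioning on a high-cardinality key keeps the ratio close to 1.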

6 Game Changing Database Innovations
1. Use the CPU! – Vector Processing
2. Minimize bottlenecks – Exploiting Chip Cache
3. Got columnar?
4. Smarter compression
5. Smarter indexing
6. Multi-core matters
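
The "Got columnar?" and "Smarter compression" items can be illustrated with a small sketch: pivot rows into columns, then run-length encode a column whose values repeat. This is a generic textbook technique, not a description of the actual Vector storage format; the table contents and function names are hypothetical, and the pivot assumes every row has the same keys.

```python
from itertools import groupby

def rows_to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]

# Hypothetical rows, already sorted by region.
rows = [
    {"region": "east", "plan": "basic", "minutes": 120},
    {"region": "east", "plan": "basic", "minutes": 45},
    {"region": "east", "plan": "pro", "minutes": 300},
    {"region": "west", "plan": "basic", "minutes": 10},
    {"region": "west", "plan": "pro", "minutes": 210},
    {"region": "west", "plan": "pro", "minutes": 95},
]

columns = rows_to_columns(rows)
encoded_region = rle_encode(columns["region"])
total_minutes = sum(columns["minutes"])  # touches only one column
print(encoded_region, total_minutes)
```

An aggregate such as the sum of `minutes` reads a single column instead of every field of every row, and a sorted, repetitive column such as `region` shrinks to a handful of pairs.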

Big Data Business Use Cases

Customer 360: Understanding Experience, Driving Revenue
Telecom Challenge
• Vast and growing repository of proprietary click data, customer records, service call records, smartphone and device data, GPS location, web server, telephone and network usage.
• Queries took minutes or hours, and sometimes never returned at all.
• Critical business analysis on a consolidated customer 360 data lake was grinding to a halt.
• The ability to gain deeper market insights and visualization, and to achieve the desired data management and operational optimization, was at risk.

Customer 360: Initial Architecture
Development System
• 300+ node cluster
• HIVE access
• SQL based BI / Data Science
• Data pre-processed because query performance was unacceptable
• Snapshot views took days to return

Customer 360: Technical Improvements
Production Prototype
• 30-node cluster (10% of the size of the Hive cluster)
• Actian Vector on Hadoop solution
• SQL based BI / Data Science
• No materialized view building required
• Join on demand faster than aggregate tables in Hive
• Reduced storage requirements
• 91 TB – two years of data, 1,100 columns when joined

Customer 360: Understanding Experience, Driving Revenue
Results
• Customer 360 across prior data silos
• Leveraged for customer retention strategies
• Predict and take proactive, tailored responses
• Enables next-gen data-driven troubleshooting, impact analysis and root cause analysis
Impact
• Accelerated operations intelligence
• Improved customer experience
• Reduced customer churn

Financial Risk: Upgrading Legacy to Meet SLA
Challenge
A legacy single-purpose risk application took 3 hours to generate the end-of-day risk report and failed to meet changing SLAs for reporting risk.
In deciding to replace the risk application, the bank opted to build a multi-purpose risk application that addresses multiple business requirements.

Legacy System
• Single server architecture, MS SSAS, Oracle - ~30 applications
• Pre-processing of desired measures was exploding data volumes
• Cube and analysis engines maxed out once they exceeded the 1.5 TB range
• Unable to scale to the desired range of more than 200 GB/day of new data
• Impala attempt failed
• Highly invested in apps built on Analysis service

New Possibilities
• Clustered solution – 5- and 10-node Hadoop clusters
• No pre-processing of cubes; SSAS partly kept
• Tested solutions from 1 TB up to 20 TB at a time
• Produced interactive queries across large datasets
• Focused query results in 2s or less
• Processing all data in the database 6s – 80s
• 2x nodes ~ 200% speed improvement

Results
Increased data analyzed by 100X
2–200B rows / 1-20TB
Report run in 28 seconds vs. 3 hours
Use of application for:
• Intra-day reporting (surveillance)
• End of day reporting (compliance)
• Overnight float investment options
• Annual CCAR Analysis

Delivering the Results With Better Engineering

Technical Benchmarks – Single Machine

Technical Benchmarks: VectorH - SQL on Hadoop
TPC-H SF1000 *
VectorH vs other platforms, faster by how much?
Tuned platforms
Identical hardware **
* Not an official TPC result
** 10 nodes, each 2 x Intel 3.0GHz E5-2690v2 CPUs, 256GB RAM, 24x600GB HDD, 10Gb Ethernet, Hadoop 2.6.0

Actian VectorH Delivers More Efficient File Format
Better compression & functionality
Vector advantages:
• skip blocks via MinMax indexes
• sophisticated query processing
• efficient block format, esp. 64-bit int
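
The "skip blocks via MinMax indexes" advantage can be sketched as follows. This is a simplified illustration of the general idea, not VectorH's implementation: the block size, the `timestamps` column and both function names are assumptions.

```python
def build_minmax_index(values, block_size):
    """Split values into fixed-size blocks and record (min, max) per block."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    index = [(min(block), max(block)) for block in blocks]
    return blocks, index

def scan_range(blocks, index, low, high):
    """Return values in [low, high] and the number of blocks actually read."""
    matches, blocks_read = [], 0
    for block, (block_min, block_max) in zip(blocks, index):
        if block_max < low or block_min > high:
            continue  # the block cannot contain a match, so it is skipped unread
        blocks_read += 1
        matches.extend(v for v in block if low <= v <= high)
    return matches, blocks_read

# Hypothetical sorted column, such as an event timestamp.
timestamps = list(range(1_000))
blocks, index = build_minmax_index(timestamps, block_size=100)
matches, blocks_read = scan_range(blocks, index, 250, 349)
print(len(matches), "matches from", blocks_read, "of", len(blocks), "blocks")
```

Here the range predicate touches 2 of the 10 blocks. The benefit depends on the data being at least roughly ordered on the filtered column; on randomly ordered data most blocks span the full value range and few can be skipped.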

Summary
Conscientious data handling and next-gen engineering take SQL on Hadoop to new levels.
All Hadoop users can move from development into production while delivering compelling business results.

Delivering the Results With Better Engineering
VectorH v5 – Spark integration, external table support, and more

Thank you!
tyler.mitchell@actian.com - @1tylermitchell
Blogs at Actian.com - MakeDataUseful.com
Visit us in booth 503
