Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP Databases

© 2019 Walmart Inc. All Rights Reserved
Any reference in this presentation to any specific commercial product, process, or service, or the use of any trade, firm, or corporation name is for information and convenience only and is not an endorsement, favor, or recommendation by Walmart Inc.
Naveen Peddamail
Sr. Manager, Global Data
Abhishek Gupta
Data Engineer, Global Data

Agenda
• Introduction to Walmart
• Data Lake Initiative – Building a Single Source of Truth
• Challenges Around Low Latency Querying on Hadoop – Hive LLAP as a Solution
• Performance & Cost Effectiveness of Hive LLAP vs. MPP Databases
• Conclusion & Next Steps
• Q & A

About Walmart
• Largest retailer in the world and Fortune 1 company
• Serves over 275M customers weekly
• Employs over 2.2M associates worldwide
• 11,300 stores under 58 banners in 27 countries
• eCommerce websites in 10 countries & brands include:
  • Walmart.com
  • Jet.com
  • Hayneedle.com (home furnishings)
  • Shoes.com (footwear)
  • Moosejaw (outdoor apparel and gear)
  • ModCloth (women’s apparel)
  • Bonobos (men’s apparel)
To find out more, visit us at https://corporate.walmart.com
About Walmart Labs
• Employs over 4,000 associates worldwide
• Development centers in the US, India, and Ireland
• Open source projects include:
  • Hapi (server framework for Node.js)
  • OneOps (cloud management platform)
  • Electrode (universal React/Node.js platform)
  • TestArmada (suite of testing tools)
• Includes Global Data and Analytics Platform team
To find out more, visit us at:
https://www.walmartlabs.com
https://www.facebook.com/WalmartLabs
https://twitter.com/WalmartLabs
https://github.com/walmartlabs

Data Lake Initiative – Building a Single Source of Truth
Data Landscape at Walmart
• Transactional systems from various domains generate huge volumes of data every second
  • Sales & Orders
  • Merchandising
  • Logistics & Supply Chain
  • Real Estate
  • HR Systems
  • Compliance
• Analytical & reporting databases are spread across various platforms and teams
• Challenges in correctly identifying the source of truth
• Data quality, governance, metadata management & lineage were difficult to manage
• Need to build a single source of truth – Data Lake

Criteria for the Data Lake
Governed, Secured & Certified Data
Single Source of Truth
Lower Total Cost of Ownership
Robust and Fast Data Access & Reporting

Data Lake @ Walmart
01 Central Analytical Data Source
• Central true source of analytical data across Walmart
02 Data Service Layer
• Common services for metadata
• ETL pipeline
• Data quality framework
03 Governed and Secure
• Roles to manage access control
• Encryption for sensitive data elements
• Providing end-to-end lineage
04 Self Service Platform
• Enable ad-hoc analysis
• Improve speed to market for analysis
• Providing a self-service storage and compute platform

Are the Business Users Happy Now?
Governed, Secured & Certified Data
Single Source of Truth
Lower Total Cost of Ownership
Robust and Fast Data Access & Reporting

Low Latency Querying on Hadoop; Hive LLAP as a Solution
Challenges
• Ad hoc query performance on Hadoop/Hive was unsatisfactory
• Users benchmarked it against Massively Parallel Processing enterprise data warehouses (MPP EDWs)
• Migrating some teams off enterprise data warehouses was not possible until better query response times could be guaranteed
• Queries migrated from other data warehouses were not optimal for querying on Hive
Potential Solutions
• Tune queries for optimal Hive performance
• Recommend Tez as the default execution engine
• Hive LLAP as a performance booster

Hive LLAP: Low Latency Analytical Processing
(Also known as Live Long and Process)
• JIT optimization & in-memory caching
• Data sharing, asynchronous IO
• Leverages long-lived daemons
• Bridges inefficiencies of execution engines

Hive LLAP Architecture
Source: https://hortonworks.com/blog/top-5-performance-boosters-with-apache-hive-llap/

Hive LLAP – Reviewing TPC-DS Benchmarks on HDP 2.6
Source: https://hortonworks.com/blog/3x-faster-interactive-query-hive-llap/
Similarities to Walmart’s Data Model
• 10 TB scale & the data model for the underlying tables were similar to our use case
• Hive LLAP benchmarks looked promising for TPC-DS data
Differences from Walmart’s Data Model
• Wider tables
• Complex dimension tables

Hive LLAP – POC
POC GOALS
• Benchmark Hive LLAP query performance on 3NF tables involving joins
• Compare Hive LLAP query performance vs. MPP-EDWs on the same set of queries
DATA MODEL

Hive LLAP – Environment Setup
• Hadoop Distribution – HDP 2.6.3
• YARN Scheduler – Capacity Scheduler with pre-emption enabled
• Number of LLAP Nodes – two configurations: 10 nodes & 15 nodes
• Hardware – 256 GB RAM, 32 cores, and 14 × 6 TB disks. Incremental spend: ~$150K
• Overall Hadoop Cluster Nodes – 90 nodes

Hive LLAP – Environment Setup
YARN Config
  Nodemanager Max Container Size (MB)     230400
  Number of LLAP nodes                    10 & 15 (two variations)
LLAP Configs
  hive.llap.execution.mode                all
  hive.llap.io.memory.mode                cache
  hive.llap.io.enabled                    TRUE
  Slider Memory                           2048
  tez.am.resource.memory.mb               2048
  LLAP Daemon Container Max Headroom      8192
  Number of concurrent queries            10
  Memory per Daemon                       226304
  Number of executors per LLAP Daemon     44
  hive.llap.io.threadpool.size            44
  LLAP Daemon Heap Size (MB)              171213
  In-Memory Cache per Daemon (MB)         46899
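The memory figures in the configuration above appear to be internally consistent: heap, in-memory cache, and headroom sum to the memory per daemon. The following is a minimal Python sketch of that arithmetic. The additive breakdown is an assumption inferred from the numbers on the slide (the deck does not state it), and the constant and function names are illustrative only.

```python
# Values copied from the LLAP configuration slide. Units are assumed to be MB
# throughout, although the slide labels only some rows that way.
NODEMANAGER_MAX_CONTAINER_MB = 230400
DAEMON_MEMORY_MB = 226304
DAEMON_HEAP_MB = 171213
DAEMON_CACHE_MB = 46899
DAEMON_HEADROOM_MB = 8192
EXECUTORS_PER_DAEMON = 44


def daemon_memory_breakdown():
    """Check how the per-daemon memory settings relate to each other.

    Assumption (not stated in the deck): daemon memory is the sum of
    heap, in-memory cache, and container headroom.
    """
    components_total = DAEMON_HEAP_MB + DAEMON_CACHE_MB + DAEMON_HEADROOM_MB
    return {
        "components_total_mb": components_total,
        "matches_daemon_memory": components_total == DAEMON_MEMORY_MB,
        # Memory left in the largest YARN container once the daemon is placed.
        "left_in_container_mb": NODEMANAGER_MAX_CONTAINER_MB - DAEMON_MEMORY_MB,
        # Rough heap share per executor, using integer division.
        "heap_per_executor_mb": DAEMON_HEAP_MB // EXECUTORS_PER_DAEMON,
    }


if __name__ == "__main__":
    for key, value in daemon_memory_breakdown().items():
        print(f"{key}: {value}")
```

Under these assumptions the remaining container memory equals the Slider memory plus the Tez AM memory (2048 + 2048), which may be why the daemon size was chosen, though the deck does not say so.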

Hive LLAP – Query Patterns & Stats
Query Characteristics
• Queries fall mainly into reporting & ad-hoc workloads with a focus on business applications
• Aggregations of key metrics across various location, item & timeframe dimensions
• Scans involving large tables & joins on multiple tables
• Sorting across various dimensions & facts
• 48 queries over 4 time frames
Table Stats
• Fact table (1 year of data): ~70 billion rows, 12 TB
• Dimensions (1 key table): ~25 million rows, 110 GB
Sample Query
SELECT l.column1, l.column2, i.column3, i.column4,
       d.column5, sum(s.column6), sum(s.column7),
       avg(s.column8), avg(s.column9)
….
….
….
FROM sales AS s
JOIN item_dim AS i ON s.item_id = i.item_id
JOIN location_dim AS l ON s.location_id = l.location_id
JOIN date_dim AS d ON s.visit_dt = d.cal_dt
WHERE s.column10 BETWEEN <val1> AND <val2>
AND l.column11 = <val3>
…
…
GROUP BY
l.column1, l.column2, i.column3, i.column4, d.column5
ORDER BY
l.column1, l.column2, i.column3, i.column4, d.column5;

Hive LLAP – Results
[Chart: Hive LLAP Performance Benchmark. Y-axis: execution time (seconds), 0 to 450. Series: 1 Week, 4 Weeks, 12 Weeks, 52 Weeks.]
75% of the queries ran in < 100 secs

Hive LLAP – Results
30%–50% performance improvement going from the 10-node to the 15-node configuration
[Chart: Hive LLAP Query Performance for 10 vs. 15 Nodes - Linear Scalability. X-axis: queries. Y-axis ticks: 0 to 600. Series: LLAP - 15 Nodes, LLAP - 10 Nodes. Time frames: 1 Week, 4 Weeks, 12 Weeks, 52 Weeks.]
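As a point of reference for the linear-scalability claim above, the following hedged Python sketch computes what perfectly linear scaling from 10 to 15 nodes would predict. It is a back-of-the-envelope model rather than the authors' analysis, and the function name is hypothetical. The deck does not say whether "performance improvement" means shorter runtime or higher throughput, so both are shown.

```python
def ideal_linear_scaling(old_nodes, new_nodes):
    """Return (speedup, runtime_reduction) under perfectly linear scaling.

    speedup:           how many times faster the workload runs
    runtime_reduction: fraction by which the runtime shrinks
    """
    if old_nodes <= 0 or new_nodes <= 0:
        raise ValueError("node counts must be positive")
    speedup = new_nodes / old_nodes
    runtime_reduction = 1 - old_nodes / new_nodes
    return speedup, runtime_reduction


if __name__ == "__main__":
    speedup, reduction = ideal_linear_scaling(10, 15)
    print(f"ideal speedup: {speedup:.2f}x")
    print(f"ideal runtime reduction: {reduction:.1%}")
```

Under this model, adding 50% more nodes yields at most a 1.5x speedup, which corresponds to roughly a one-third reduction in runtime; both figures sit within or at the edge of the reported 30%–50% range.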

Comparing Query Performance of Hive LLAP vs. MPP-EDWs
• For our comparative analysis, we used two MPP-EDW clusters
• Queries in the MPP-EDW clusters were optimized for best performance
Cluster           Memory    VCores
Hadoop Cluster    ~4 TB     480
MPP EDW A         ~4 TB     512
MPP EDW B         ~16 TB    840
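The cluster figures above can be cross-checked against the environment setup with a short Python sketch. It assumes that the stated hardware (256 GB RAM, 32 cores) is per node and that the Hadoop figures refer to the 15-node LLAP configuration; neither is stated explicitly in the deck, and all names below are illustrative.

```python
# Assumed per-node hardware for the 15-node LLAP configuration.
LLAP_NODES = 15
RAM_PER_NODE_GB = 256
CORES_PER_NODE = 32

# Approximate figures as listed on the comparison slide.
CLUSTERS = {
    "Hadoop Cluster": {"memory_tb": 4, "vcores": 480},
    "MPP EDW A": {"memory_tb": 4, "vcores": 512},
    "MPP EDW B": {"memory_tb": 16, "vcores": 840},
}


def llap_footprint(nodes=LLAP_NODES, ram_gb=RAM_PER_NODE_GB, cores=CORES_PER_NODE):
    """Total memory (TB, 1 TB = 1024 GB) and cores across the LLAP nodes."""
    return nodes * ram_gb / 1024, nodes * cores


def ratio_to_hadoop(name):
    """Memory and vcore ratios of a cluster relative to the Hadoop cluster."""
    base = CLUSTERS["Hadoop Cluster"]
    other = CLUSTERS[name]
    return other["memory_tb"] / base["memory_tb"], other["vcores"] / base["vcores"]


if __name__ == "__main__":
    print("LLAP footprint (TB, cores):", llap_footprint())
    for name in ("MPP EDW A", "MPP EDW B"):
        print(name, "vs. Hadoop (memory, vcores):", ratio_to_hadoop(name))
```

With these assumptions, 15 nodes give 3.75 TB and 480 cores, which lines up with the "~4 TB, 480 VCores" listed for the Hadoop cluster. MPP EDW B has 4x the memory but only 1.75x the vcores.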

Comparing Query Performance of Hive LLAP vs. MPP-EDWs
• LLAP performed better than the MPP EDW-A system, which had similar infrastructure
• LLAP performance was comparable to MPP EDW-B, which had 4x the infrastructure
[Chart: Hive LLAP vs. MPP-A vs. MPP-B. Y-axis ticks: 0 to 800. Series: LLAP (secs), MPP - Enterprise Data Warehouse A (secs), MPP - Enterprise Data Warehouse B (secs). Time frames: 1 Week, 4 Weeks, 13 Weeks, 52 Weeks.]

Hive LLAP: Conclusion & Next Steps
• Promising product for low latency SQL access on top of Hadoop
• Significant cost savings vs. traditional MPP databases
• Not a one size fits all solution
Next Steps:
• Evaluate Hive LLAP on HDP 3.x (Better Enterprise Support)
• Resource Plans & Workload Manager
• SSD Caching
• HS2I: HiveServer2 Interactive – High Availability

Thank You !
Abhishek Gupta
Data Engineer, Walmart
Abhishek.gupta2@Walmart.com
https://www.linkedin.com/in/gupta-abhishek/
Naveen Peddamail
Sr. Manager, Walmart
Naveen.Peddamail@walmart.com
https://www.linkedin.com/in/naveenpeddamail/

Questions?
