The performance of modern Big Data frameworks, e.g., Spark, depends greatly on high-speed storage and shuffling, which impose a significant memory burden on production data centers. In many production situations, persistence- and shuffling-intensive applications can suffer a major performance loss due to a lack of memory. Thus, the common practice is to over-allocate the memory assigned to the data workers for production applications, which in turn reduces overall resource utilization. One efficient way to address the dilemma between performance and cost efficiency of Big Data applications is through data center computing resource disaggregation. This paper proposes and implements a system that incorporates the Apache Spark Big Data framework with a novel in-memory distributed file system to achieve memory disaggregation for data persistence and shuffling. We address the challenge of optimizing performance at low cost by co-designing the proposed in-memory distributed file system with large-volume DIMM-based persistent memory (PMEM) and RDMA technology. The disaggregation design allows each part of the system to be scaled independently, which is particularly suitable for cloud deployments. The proposed system is evaluated in a production-level cluster using real enterprise-level Spark production applications. The results of an empirical evaluation show that the system can achieve up to a 3.5-fold performance improvement for shuffle-intensive applications with the same amount of memory, compared to the default Spark setup. Moreover, by leveraging PMEM, we demonstrate that our system can effectively increase the memory capacity of the computing cluster by 66.5% at an affordable cost, with reasonable execution time overhead relative to using local DRAM only.
2. Agenda
Motivation
• Production Environment — tens of thousands of servers
• Uneven computing resource utilization in production data center
• Independent scaling of compute and storage
Extending Spark with external storage
• Current: JD’s remote shuffle service (RSS)
• The next generation: disaggregated persistent memory
Performance evaluation
Conclusion
3. Computing Resource Utilization of a Production Cluster
[Charts: allocated vs. total VCores (CPU cores) and allocated vs. total memory (TB) over time, sampled from 8:30 to 12:15]
• 3,700 servers from 8:30 AM to 12:30 PM
• Memory utilization remains high the entire time
4. Uneven Computing Resource Utilization
[Chart: memory overhead (%) per day over a 31-day period]
Memory overhead of a production computing cluster with 3,700 servers
- MemoryOverhead = MemoryUtilization(%) – CPUUtilization(%)
• Uneven computing resource utilization between CPU and memory
• Thousands of servers – replacing hardware is impractical
• Spark-application-centric data center – more memory is needed
5. High Memory Demand from Spark Tasks
[Chart: execution time (s) vs. memory utilization (GB) for Spark tasks]
• Spark performance depends heavily on the amount of available memory
• Machine learning and iterative jobs
• Cache and execution memory
6. Too Much Shuffle Data to Store
• Shuffle-heavy applications are very common in production
• Local HDD
− Slow under high I/O pressure
− Intolerable random I/O when shuffle data is heavy
− Prone to failure in production, with no replicas
8. Current Solution
• Related discussion
– SPARK-25299 (Spark JIRA)
• Separate storage from compute
– Shuffle to external storage
• The JD remote shuffle service (RSS)
− More effective and more stable for shuffle-heavy applications
10. Writer & RSS Implementation
Shuffle Writer:
• Add a metadata header and use protobuf to encode outgoing blocks (framing sketched below)
• Group partitions for effective merging in RSS
• Ack to guarantee blocks are saved in HDFS
• The data path can introduce dirty data — deduplicated on the reducer side
RSS:
• Merge blocks of the same partition group in memory
• Trade-off between merge buffer size and lingering time
• Flush the merged buffer synchronously
• Many other details about controlling the data flow
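The slides do not spell out the block format; the sketch below illustrates one plausible framing in which each shuffle block carries a fixed metadata header that the reducer later uses for deduplication. The names here (BlockHeader, BlockFraming) are hypothetical, and a real writer would emit the protobuf-encoded messages mentioned above rather than this fixed binary layout.

```scala
import java.nio.ByteBuffer

// Hypothetical metadata attached to every shuffle block so the reducer
// can identify its origin and detect duplicates.
case class BlockHeader(shuffleId: Int, mapId: Int, attemptId: Int,
                       partitionId: Int, length: Int)

object BlockFraming {
  val HeaderSize: Int = 5 * 4 // five Int fields

  // Prepend the header to the payload before sending to RSS.
  def frame(h: BlockHeader, payload: Array[Byte]): Array[Byte] = {
    val buf = ByteBuffer.allocate(HeaderSize + payload.length)
    buf.putInt(h.shuffleId).putInt(h.mapId).putInt(h.attemptId)
      .putInt(h.partitionId).putInt(payload.length)
    buf.put(payload)
    buf.array()
  }

  // Recover the header on the receiving side.
  def parseHeader(bytes: Array[Byte]): BlockHeader = {
    val buf = ByteBuffer.wrap(bytes)
    BlockHeader(buf.getInt(), buf.getInt(), buf.getInt(),
                buf.getInt(), buf.getInt())
  }
}
```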
11. Reducer Side Implementation
• Reduce task fetches the related HDFS file(s)
- Files: if one RSS crashes, another RSS creates a new file for the partition group
• Extract the key info from the driver's mapStatuses
- Used for deduplication
• Skip partitions that are not relevant
• Deduplicate (sketched below)
- using metadata in blocks and mapStatus
• Throttle the input streams …
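As a rough illustration of the deduplication step, the sketch below reuses the hypothetical BlockHeader from the writer sketch: the driver's mapStatuses reveal which attempt of each map task committed, so the reducer keeps exactly one block per map task and drops everything else as dirty data. The actual RSS reducer logic is more involved than this.

```scala
import scala.collection.mutable

object Deduplicator {
  // committedAttempt: mapId -> the attemptId that actually committed,
  // extracted from the driver's mapStatuses.
  def dedup(blocks: Iterator[(BlockHeader, Array[Byte])],
            committedAttempt: Map[Int, Int]): Iterator[Array[Byte]] = {
    val seenMapIds = mutable.Set.empty[Int]
    blocks.flatMap { case (header, payload) =>
      val committed =
        committedAttempt.get(header.mapId).contains(header.attemptId)
      // Keep a block only if it belongs to the committed attempt and
      // no block from this map task has been consumed yet.
      if (committed && seenMapIds.add(header.mapId)) Some(payload)
      else None // dirty data: failed attempt or duplicate delivery
    }
  }
}
```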
14. To Improve Performance …
• Storage
– HDFS might not be performant enough?
• Network
– Netty might introduce a bottleneck?
• We need better tools to help us explore
– Different storage backends
– Different network transports
15. The Splash Shuffle Manager
• A flexible shuffle manager
• Supports user-defined storage backends and network transports for shuffle
• Works with vanilla Spark (configuration sketched below)
• JD-RSS uses an HDFS-based backend
• Open source
• https://github.com/MemVerge/splash
• Sends shuffle state to external storage:
• Shuffle index
• Data partition
• Spill
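To try Splash on an unmodified Spark job, the configuration is roughly as below. The shuffle-manager class name follows the Splash project; the storage-factory value is the shared-filesystem plugin named in its README. Both are assumptions to verify against the Splash version actually deployed.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: route shuffle through Splash instead of the default
// SortShuffleManager. Check class names against the deployed release.
val spark = SparkSession.builder()
  .appName("splash-demo")
  .config("spark.shuffle.manager",
          "org.apache.spark.shuffle.SplashShuffleManager")
  .config("spark.shuffle.splash.storageFactory",
          "com.memverge.splash.shared.SharedFSFactory")
  .getOrCreate()
```

Swapping the storage factory is how JD-RSS plugs in its HDFS-based backend, and how another backend (e.g., a PMEM pool) could be substituted without touching application code.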
16. The Next-Generation Architecture
• JD remote shuffle service (RSS)
– Only supports shuffle workload
– Uses external storage that is relatively slow (HDFS)
• A more general framework that separates compute and storage
– Shuffle
– RDD caching
– Goal: faster, more reliable, and more elastic
• Our solution: memory extension via an external memory pool
17. Extending Spark Memory via Remote Memory
[Diagram: the Spark memory management model, showing RDD caching and shuffle backed by the external memory pool. Image source: https://0x0fff.com/spark-memory-management/]
External Memory Pool
• High capacity
• High performance
• High endurance
• Affordable
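For context on the memory model referenced above, here is a minimal sketch of the standard unified-memory settings that divide the executor heap between execution memory and storage (RDD cache); the remote memory pool extends what these regions would otherwise have to spill or evict. The values shown are Spark's defaults, for illustration only.

```scala
import org.apache.spark.sql.SparkSession

// Spark's unified memory model: (heap - reserved) * spark.memory.fraction
// is shared by execution and storage; spark.memory.storageFraction is the
// slice of that region protected from eviction by execution.
val spark = SparkSession.builder()
  .appName("memory-model-demo")
  .config("spark.memory.fraction", "0.6")        // default
  .config("spark.memory.storageFraction", "0.5") // default
  .getOrCreate()
```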
18. Intel Optane DC Persistent Memory: An Excellent Candidate
• Jointly developed by Intel and Micron
• High density (up to 6TB per 2-socket server)
• High endurance
• Low latency: ns level
• Byte-addressable, can be used as main memory
• Non-volatile, can be used as primary storage
• Half the price of DRAM
19. Spark with Disaggregated Persistent Memory
[Architecture diagram: N Spark compute nodes, each with DDR4 and Spark executors, send persistence, shuffle, and spill data over RDMA through a TOR switch and fast network to an external extended-memory node equipped with DDR4 and PMEM.]
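The slides leave the executor-to-pool data path implicit. One simple way to approximate it with unmodified Spark is to point the executors' scratch space at a directory backed by the remote PMEM node; the mount path below is purely hypothetical, standing in for however the PMEM pool is exposed (e.g., a filesystem exported over the RDMA fabric).

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical approximation: shuffle and spill files land on a remote
// PMEM-backed mount instead of the local disk. /mnt/pmem-pool is a
// placeholder path.
val spark = SparkSession.builder()
  .appName("remote-pmem-demo")
  .config("spark.local.dir", "/mnt/pmem-pool/spark-scratch")
  .getOrCreate()
```

Note that on YARN the NodeManager's local directories override spark.local.dir, so in that deployment the mount would be configured on the YARN side instead.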
20. Summary
Spark with disaggregated persistent memory
• A dedicated remote PMEM pool
• Easier to manage
• Affordable
• Minimal changes
• No changes to existing computing nodes
• No changes to user applications
• Minimal changes at the Spark level
• Highly elastic
• Computing nodes become stateless
• Highly performant
21. Workloads
TeraSort
• A synthetic shuffle-intensive benchmark, well suited to evaluating I/O performance
Core service of a data warehouse application
• Spark SQL
• A core data warehouse task at JD.com, supporting business decisions
• I/O-intensive, based on select, insert, and full outer join operations
Anti-Price-Crawling Service
• A core analytic task at JD.com that defends against coordinated price crawling
• Both I/O-intensive and compute-intensive
24. Performance – TeraSort
Low executor memory scenario:
• Execution time ⇓300-400s (31%)
Medium executor memory scenario:
• Execution time ⇓300-500s (38%)
Large executor memory scenario:
• Execution time ⇓450s (35%)
[Chart: execution time (s) vs. memory utilization (GB) for Spark, Optimal Spark, and Spark with remote PMEM, under Low/Medium/Large executor memory scenarios]
25. Performance – Data Warehouse
Low executor memory scenario:
• Execution time ⇓700-800s (58%)
Medium executor memory scenario:
• Execution time ⇓550-650s (54%)
Large executor memory scenario:
• Execution time ⇓350-400s (42%)
[Chart: execution time (s) vs. memory utilization (GB) for Spark, Optimal Spark, and Spark with remote PMEM, under Low/Medium/Large executor memory scenarios]
26. Performance – Anti-Price-Crawling
Low executor memory scenario:
• Execution time ⇓500s (17%)
Medium executor memory scenario:
• Execution time ⇓1200s (40%)
Large executor memory scenario:
• Execution time ⇑0-60s (4%)
• Baseline → caches data in local process memory
• Spark w/ remote PMEM → caches data in remote PMEM
[Chart: execution time (s) vs. memory utilization (GB) for Spark, Optimal Spark, and Spark with remote PMEM, under Low/Medium/Large executor memory scenarios]
27. Conclusion
• The separation of compute and storage
– JD-RSS
• Shuffle to external storage pool based on HDFS
– Disaggregated persistent memory pool
• Storage memory extension for shuffle and RDD caching
– Unified under MemVerge Splash Shuffle Manager
• Better elasticity and reliability
• High capacity, high performance, affordable cost
• Persistent memory will bring fundamental changes to Big Data infrastructure