Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spark and Scala
Talk given by Reynold Xin at Scala Days SF 2015
In this talk, Reynold covers the techniques used to achieve high-performance sorting with Spark and Scala, including sun.misc.Unsafe, exploiting cache locality, and resource pipelining.
2. Who am I?
Reynold Xin (@rxin)
Co-founder, Databricks
Apache Spark committer & maintainer on core
API, shuffle, network, block manager, SQL, GraphX
On leave from PhD @ UC Berkeley AMPLab
3. Spark is In-Memory and Fast
[Chart: Time per Iteration (s), Hadoop MR vs Spark. Logistic Regression: 110s (Hadoop MR) vs 0.96s (Spark); K-Means Clustering: 155s (Hadoop MR) vs 4.1s (Spark)]
5. Common Misconception About Spark
“Spark is in-memory. It doesn’t work with BIG
DATA.”
“It is too expensive to buy a cluster with enough
memory to fit our data.”
7. Spark Project Goals
Works well with KBs, MBs, … or PBs of data
Works well with different storage media (RAM,
HDDs, SSDs)
Works well from low-latency streaming jobs to
long batch jobs
8. Sort Benchmark
Originally sponsored by Jim Gray in 1987 to measure
advances in software and hardware
Participants often used purpose-built hardware/
software to compete
– Large companies: IBM, Microsoft, Yahoo, …
– Academia: UC Berkeley, UCSD, MIT, …
10. Sorting 100 TB and 1 PB
Participated in the most challenging category
– Daytona GraySort
– Sort 100TB in a fault-tolerant manner
– 1 trillion records generated by a third-party harness
(10-byte key + 90-byte payload)
Set a new world record (tied with UCSD)
– Saturated the throughput of 8 SSDs and a 10Gbps network
– First time a public cloud + open source entry won
Sorted 1PB on our own to push scalability
11. On-Disk Sort Record: Time to Sort 100TB
2013 Record (Hadoop): 72 minutes, 2100 machines
2014 Record (Spark): 23 minutes, 207 machines
Also sorted 1PB in 4 hours
Source: Daytona GraySort benchmark, sortbenchmark.org
12. Why Sorting?
Sorting stresses “shuffle”, which underpins
everything from SQL to machine learning (group
by, join, sort, ALS, etc)
Sorting is challenging because there is no
reduction in data.
Sort 100TB = 500TB disk I/O + 200TB network
13. What made this possible?
Power of the Cloud
Engineering Investment in Spark
– Sort-based shuffle (SPARK-2045)
– Netty native network transport (SPARK-2468)
– External shuffle service (SPARK-3796)
Clever Application Level Techniques
– Avoid Scala/JVM gotchas
– Cache-friendly & GC-friendly sorting
– Pipelining
15. Sort-based Shuffle
Old hash-based shuffle: requires R
(the number of reduce tasks) concurrent
streams with buffers, which limits R.
New sort-based shuffle: sort
records by partition first, then
write them out. One active stream
at a time.
Scaled up to 250,000 tasks in the PB sort
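The idea can be sketched in plain Scala (a toy single-node illustration, not Spark's actual implementation; `Record` and `shuffleWrite` are hypothetical names):

```scala
// Toy sketch of sort-based shuffle: instead of keeping R output streams open
// (one per reduce partition), sort the map output by partition id and write it
// as one sequential stream, remembering where each partition's block begins.
object SortShuffleSketch {
  final case class Record(partition: Int, payload: String)

  // Returns the records in write order plus the start index of each
  // partition's block, so each reducer can later fetch just its slice.
  def shuffleWrite(records: Seq[Record], numPartitions: Int): (Vector[Record], Array[Int]) = {
    val sorted = records.sortBy(_.partition).toVector // one sort, one stream
    val starts = new Array[Int](numPartitions)
    var p = 0
    var i = 0
    while (p < numPartitions) {
      starts(p) = i // this partition's block begins here
      while (i < sorted.length && sorted(i).partition == p) i += 1
      p += 1
    }
    (sorted, starts)
  }
}
```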
16. Network Transport
High performance event-driven architecture using
Netty (SEDA-like)
Explicitly manages pools of memory outside JVM
garbage collection
Zero-copy (avoids a user-space buffer via
FileChannel.transferTo)
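The zero-copy path can be sketched as follows (`ZeroCopy.serveFile` is a hypothetical helper for illustration; Spark's real transport lives in its Netty module):

```scala
import java.io.FileInputStream
import java.nio.channels.WritableByteChannel

object ZeroCopy {
  // transferTo hands the copy to the OS (sendfile on Linux): bytes move from
  // the page cache to the destination channel without being staged in a
  // user-space buffer, unlike a read()/write() loop.
  def serveFile(path: String, dest: WritableByteChannel): Long = {
    val in = new FileInputStream(path).getChannel
    try {
      val size = in.size()
      var pos = 0L
      while (pos < size) {
        pos += in.transferTo(pos, size - pos, dest) // zero-copy transfer
      }
      pos // total bytes transferred
    } finally in.close()
  }
}
```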
19. Warning
The guidelines here apply only to hyper
performance optimizations
– Majority of the code should not be perf sensitive
Measure before you optimize
– It is ridiculously hard to write good benchmarks
– Use jmh if you want to do that
https://github.com/databricks/scala-style-guide
21. Avoid Collection Library
Avoid Scala collection library
– Generally slower than java.util.*
Avoid Java collection library
– Not specialized (too many objects)
– Bad cache locality
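A minimal illustration of why this matters (timings omitted; the point is the memory layout, and the object/method names here are made up for the example):

```scala
object CollectionDemo {
  // A boxed list stores pointers to Integer objects scattered across the heap:
  // every access chases a pointer and unboxes.
  def sumBoxed(xs: java.util.ArrayList[Integer]): Long = {
    var s = 0L
    var i = 0
    while (i < xs.size()) { s += xs.get(i).intValue(); i += 1 } // unboxes each element
    s
  }

  // A specialized Array[Int] stores the primitives contiguously: sequential
  // reads, good cache locality, no boxing.
  def sumPacked(xs: Array[Int]): Long = {
    var s = 0L
    var i = 0
    while (i < xs.length) { s += xs(i); i += 1 }
    s
  }
}
```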
22. Prefer private[this]
A.scala:
class A {
private val field = 0
def accessor = field
}
scalac -print A.scala
class A extends Object {
private[this] val field: Int = _;
<stable> <accessor> private def field(): Int = A.this.field;
def accessor(): Int = A.this.field();
def <init>(): A = {
A.super.<init>();
A.this.field = 0;
()
}
};
The JIT might not be able to inline this accessor call
23. Prefer private[this]
B.scala:
class B {
private[this] val field = 0
def accessor = field
}
scalac -print B.scala
class B extends Object {
private[this] val field: Int = _;
def accessor(): Int = B.this.field;
def <init>(): B = {
B.super.<init>();
B.this.field = 0;
()
}
};
25. Garbage Collection
Hard to tune GC when memory utilization > 60%
Do away entirely with GC: off-heap allocation
26. GC Take 1: DirectByteBuffer
java.nio.ByteBuffer.allocateDirect(int capacity);
Limited size: 2GB only (capacity is an Int)
Hard to recycle
– only reclaimed by GC
– possible to control with reflection, but there is a
static synchronized block!
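A minimal sketch of the API and its limitations (`DirectBufferDemo` is an illustrative name):

```scala
import java.nio.ByteBuffer

object DirectBufferDemo {
  // allocateDirect takes an Int capacity, so one direct buffer is capped at
  // ~2GB. The off-heap memory behind it is reclaimed only when the buffer
  // object itself is garbage collected: there is no explicit free().
  def roundTrip(): Long = {
    val buf = ByteBuffer.allocateDirect(1 << 20) // 1 MB off the Java heap
    buf.putLong(0, 42L) // absolute put: write 8 bytes at offset 0
    buf.getLong(0)      // absolute get: read them back
  }
}
```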
27. GC Take 2: sun.misc.Unsafe
Available on most JVM implementations
(frequently exploited by trading platforms)
Provides C-like explicit memory allocation
class Unsafe {
public native long allocateMemory(long bytes);
public native void freeMemory(long address);
…
}
28. GC Take 2: sun.misc.Unsafe
class Unsafe {
…
public native void copyMemory(
long srcAddress, long destAddress,
long bytes);
public native long getLong(long address);
public native int getInt(long address);
...
}
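Putting the two slides together, a minimal sketch of allocating, using, and freeing off-heap memory (the reflection trick is the commonly used way to get the instance, since `Unsafe.getUnsafe()` rejects application callers; `OffHeapDemo` is an illustrative name):

```scala
import sun.misc.Unsafe

object OffHeapDemo {
  // The Unsafe instance is usually obtained through reflection on the
  // private static theUnsafe field.
  private val unsafe: Unsafe = {
    val f = classOf[Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    f.get(null).asInstanceOf[Unsafe]
  }

  def demo(): (Long, Int) = {
    val addr = unsafe.allocateMemory(16L) // like C malloc: raw, untracked by GC
    try {
      unsafe.putLong(addr, 42L)  // write 8 bytes at addr
      unsafe.putInt(addr + 8, 7) // write 4 bytes right after
      (unsafe.getLong(addr), unsafe.getInt(addr + 8))
    } finally {
      unsafe.freeMemory(addr)    // no GC here: memory must be freed explicitly
    }
  }
}
```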
29. Caveats when using Unsafe
Maintain a thread local pool of memory blocks to
avoid allocation overhead and fragmentation
(Netty uses jemalloc)
-ea (enable assertions) during debugging, -da (disable assertions) in benchmark mode
30. Cache Friendly Sorting
Naïve Scala/Java collection sort is extremely slow
– object dereferences (random memory access)
– polymorphic comparators (cannot be inlined)
Spark’s TimSort allows operating on custom data
memory layout
– Be careful with inlining again
31. Data Layout Take 1
Record format: 10B key + 90B payload
Layout: one consecutive memory block (size N * 100B), records stored back to back
Comparison: uses the 10B key to compare.
Swap: requires swapping the full 100B.
Too much memcpy!
32. Data Layout Take 2
Layout: Array[Array[Byte]], an array of pointers to 100-byte records
Swap: cheap (one 8B or 4B pointer)
Comparison: requires dereferencing 2 pointers (random access)
Bad cache locality!
33. Data Layout Take 3
Layout: one consecutive memory block (size N * 14B) of 14-byte entries: 10B key prefix + 4B record position
Sort comparison: uses the 10B prefix inline
Swap: only swaps 14B
Good cache locality & low memcpy overhead!
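The 14-byte layout can be sketched as follows (a simplified single-threaded illustration with an insertion sort standing in for Spark's TimSort; `PrefixSort` and its methods are made up for the example):

```scala
// Records live in one byte array, 14 bytes each: 10-byte key prefix followed
// by a 4-byte position. Comparison touches only the inline prefix; swap moves
// only 14 bytes. Everything stays in one contiguous block.
object PrefixSort {
  final val Rec = 14 // record width in bytes
  final val Key = 10 // key-prefix width in bytes

  // Compare the 10-byte prefixes of records i and j as unsigned bytes.
  def compare(buf: Array[Byte], i: Int, j: Int): Int = {
    var k = 0
    while (k < Key) {
      val a = buf(i * Rec + k) & 0xff
      val b = buf(j * Rec + k) & 0xff
      if (a != b) return a - b
      k += 1
    }
    0
  }

  // Swap the full 14-byte records i and j in place.
  def swap(buf: Array[Byte], i: Int, j: Int): Unit = {
    var k = 0
    while (k < Rec) {
      val tmp = buf(i * Rec + k)
      buf(i * Rec + k) = buf(j * Rec + k)
      buf(j * Rec + k) = tmp
      k += 1
    }
  }

  // Insertion sort over the packed records (stand-in for TimSort).
  def sort(buf: Array[Byte], n: Int): Unit = {
    var i = 1
    while (i < n) {
      var j = i
      while (j > 0 && compare(buf, j - 1, j) > 0) { swap(buf, j - 1, j); j -= 1 }
      i += 1
    }
  }
}
```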
35. What made this possible?
Power of the Cloud
Engineering Investment in Spark
– Sort-based shuffle (SPARK-2045)
– Netty native network transport (SPARK-2468)
– External shuffle service (SPARK-3796)
Clever Application Level Techniques
– Avoid Scala/JVM gotchas
– Cache-friendly & GC-friendly sorting
– Pipelining
38. What did we do exactly?
Improve single-thread sorting speed
– Better sort algorithm
– Better data layout for cache locality
Reduce garbage collection overhead
– Explicitly manage memory using Unsafe
Reduce resource contention via pipelining
Increase network transfer throughput
– Netty (zero-copy, jemalloc)
40. Extra Readings
Databricks Scala Guide (Performance Section)
https://github.com/databricks/scala-style-guide
Blog post about this benchmark entry
http://tinyurl.com/spark-sort