Apache Spark performance on SQL and DataFrame/DataSet workloads has made impressive progress, thanks to Catalyst and Tungsten, but there is still a significant gap towards what is achievable by best-of-breed query engines or hand-written low-level C code on modern server-class hardware. This session presents Flare, a new experimental back-end for Spark SQL that yields significant speed-ups by compiling Catalyst query plans to native code.
Flare’s low-level implementation takes full advantage of native execution, using techniques such as NUMA-aware scheduling and data layouts to leverage ‘mechanical sympathy’ and bring execution closer to the metal than current JVM-based techniques on big memory machines. Thus, with available memory increasingly in the TB range, Flare makes scale-up on server-class hardware an interesting alternative to scaling out across a cluster, especially in terms of data center costs. This session will describe the design of Flare, and will demonstrate experiments on standard SQL benchmarks that exhibit order of magnitude speedups over Spark 2.1.
5. ScalaMartin Odersky + the Scala team
Rompf, Iulian Dragos, Adriaan Moors, Gilles Dubochet, Philipp Haller, Lukas Rytz, Ingo Maier, Antonio Cunei, Donna Malayeri, Miguel Garcia, Hubert Plociniczak, Aleksandar Pro
st: Geoffrey Washburn, Stéphane Micheloud, Lex Spoon, Sean Mc Dirmid, Burak Emir, Nikolay Mihaylov, Philippe Altherr, Vincent Cremet, Michel Schinz, Erik Stenman, Matthias
external/visiting contributors: Paul Phillips, Miles Sabin, Stepan Koltsov and others
19. NUMA Optimization
5300
5400
5500
Q1
1
0
100
200
300
400
500
600
1 18 36 72
RunningTime(ms)
# Cores
12 12
23
12
24
46
3500
3600
3700
Q6
one socket
two sockets
four sockets
1
0
100
200
300
1 18 36 72
# Cores
14
22
29
23
44
58
Scaling-up Flare for SF100 with NUMA optimizations on different configurations: threads pinned to one, two and four sockets
• Q6 performs better when threads are dispatched on different sockets.
• Q1 is computation-bound, little effect
• On scaling-up Q1 and Q6 up to 72 cores (the capacity of the
machine), the maximum speedup is 46x and 58x.
Hardware: Single NUMA machine with 4 sockets, 12 Xeon E5-4657L cores per socket, and
256GB RAM per socket (1 TB total).
21. Example: k-Means Clustering
untilconverged(mu, tol) { mu =>
// calculate distances to current centroids
val c = (0::m) {i =>
val allDistances = mu mapRows { centroid =>
dist(x(i), centroid)
}
allDistances.minIndex
}
// move each cluster centroid to the
// mean of the points assigned to it
val newMu = (0::k,*) { i =>
val (weightedpoints, points) = sum(0,m) { j =>
if (c(i) == j) (x(i),1)
}
if (points == 0) Vector.zeros(n)
else weightedpoints / points
}
newMu
25. Relational + ML
/* TensorFlow inference as UDF */
val q = spark.sql("select ... from data
where class = findNearestCluster(...)
group by class")
flare(q).show