Data analytics applications are often 10x off peak hardware performance because they combine functions from many different libraries and frameworks to build increasingly complex workflows. Even when each individual function is optimized in isolation, the cost of data movement across these functions can cause order-of-magnitude slowdowns. For example, even though the TensorFlow machine-learning library uses highly tuned linear algebra functions for each of its operators, workflows that combine these operators can be 16x slower than hand-tuned code. Similarly, workflows that perform relational processing in Spark SQL or Pandas, numerical processing in NumPy, or a combination of these tasks spend most of their time in data movement across processing functions and could run between 2x and 30x faster if optimized end to end.
This talk offers an overview of Weld, an optimizing runtime for data-intensive applications that works across disjoint libraries and functions. Weld uses a common representation to capture the structure of diverse data-parallel workloads such as SQL, machine learning, and graph analytics and then optimizes across them using a cost-based optimizer that takes into account hardware characteristics. Weld can be integrated into a variety of widely used analytics frameworks, such as Spark SQL for relational processing, TensorFlow for machine learning, and Pandas and NumPy for general data science workloads. Integrating Weld with these frameworks requires no changes to user application code. Weld speeds up existing workloads in these frameworks by up to 16x and can also enable speed-ups of two orders of magnitude in applications that combine them.
1. Weld: A Common Runtime for Data Analytics
Shoumik Palkar, James Thomas, Deepak Narayanan, Anil Shanbhag*, Rahul Palamuttam, Holger Pirk*, Malte Schwarzkopf*, Saman Amarasinghe*, Sam Madden*, Matei Zaharia
Stanford InfoLab, *MIT CSAIL
3. Motivation
Modern data apps combine many disjoint processing libraries & functions
» SQL, statistics, machine learning, …
» E.g. PyData stack
+ Great results leveraging work of 1000s of authors
– No optimization across these functions
4. How Bad is This Problem?
The growing gap between memory and processing speeds makes the traditional way of combining functions worse
data = pandas.read_csv(filename)
filtered = data.dropna()
avg = numpy.mean(filtered)
[Diagram: each step (read_csv → dropna → mean) writes its full result to memory, and the next step reads it back in]
Up to 30x slowdowns in NumPy, Pandas, TensorFlow, etc. compared to an optimized C implementation
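To make the memory-traffic argument concrete, here is a minimal sketch (the file name and column name are hypothetical) contrasting the composed pipeline above, which materializes every intermediate result, with the single fused pass an end-to-end optimizer could produce:

```python
import numpy as np
import pandas as pd

# Composed version: every step writes a full intermediate result to
# memory, and the next step reads it back.
data = pd.read_csv("values.csv")      # pass 1: parse, materialize
filtered = data.dropna()              # pass 2: scan, filter, materialize
avg = np.mean(filtered["value"])      # pass 3: scan again to reduce

# Fused version: one pass, no materialized intermediates. Plain Python
# for clarity only; a runtime like Weld would emit compiled native code.
total, count = 0.0, 0
with open("values.csv") as f:
    next(f)                           # skip the header row
    for line in f:
        field = line.strip()
        if field:                     # dropna: skip missing values
            total += float(field)
            count += 1
avg_fused = total / count
```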
10. Integration Effort
Small up-front cost to enable Weld integration
» 500 LoC
Easy to port over each operator
» 30 LoC each
Incrementally deployable
» Weld-enabled ops work with native ops
11. Implementation
Around 12K lines of Rust
» Fast and safe! Perfect for embedding
Parallel CPU backend using LLVM
GPU support coming soon
13. Rest of This Talk
Runtime API: Enabling cross-library optimization
Weld IR: Enabling speedups and automatic parallelization
Grizzly: Using Weld to build a faster Pandas
14. Runtime API
Uses lazy evaluation to collect work across libraries
data = lib1.f1()
lib2.map(data,
    item => lib3.f2(item)
)
[Diagram: the user application calls Weld-enabled functions f1, map, and f2; each registers an IR fragment through the runtime API; the Weld runtime combines the fragments into a single IR program, compiles it to optimized machine code, and runs it against the data in the application]
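The following is a minimal sketch of the lazy-evaluation pattern the diagram describes; the names (LazyValue, compile_and_run) are illustrative, not Weld's actual API. Each Weld-enabled function returns a placeholder holding an IR fragment, and only when the application forces a result are the fragments stitched into one program, which the real runtime would then optimize and JIT-compile.

```python
def compile_and_run(program):
    # Stand-in for the real backend, which would optimize the combined
    # IR and compile it to machine code via LLVM.
    print("would compile and run:\n" + program)

class LazyValue:
    """Placeholder a Weld-enabled function returns instead of a result."""
    def __init__(self, ir_fragment, deps=()):
        self.ir_fragment = ir_fragment
        self.deps = list(deps)

    def evaluate(self):
        # Walk the dependency graph to combine all registered fragments
        # into a single IR program, then hand it to the backend.
        fragments = []
        def collect(node):
            for dep in node.deps:
                collect(dep)
            fragments.append(node.ir_fragment)
        collect(self)
        return compile_and_run("\n".join(fragments))

# The example from this slide: three functions from three libraries.
data = LazyValue("tmp0 = f1()")                            # lib1.f1()
mapped = LazyValue("tmp1 = map(tmp0, |x| f2(x))", [data])  # lib2.map / lib3.f2
mapped.evaluate()  # one combined program, compiled and run once
```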
15. Weld IR
Designed to meet three goals:
1. Generality: support diverse workloads and nested calls
2. Ability to express optimizations: e.g., loop fusion, vectorization, loop tiling
3. Explicit parallelism and targeting parallel hardware
16. Weld IR: Internals
Small IR with only two main constructs.
Parallel loops: iterate over a dataset
Builders: declarative objects for producing results
» E.g. append items to a list, compute a sum
» Can be implemented differently on different hardware
Captures relational algebra, functional APIs like Spark, linear algebra, and compositions thereof
17. Examples
Implement functional operators using builders
def map(data, f):
    builder = new vecbuilder[int]
    for x in data:
        merge(builder, f(x))
    result(builder)

def reduce(data, zero, func):
    builder = new merger[zero, func]
    for x in data:
        merge(builder, x)
    result(builder)
18. Example Optimization: Fusion
squares = map(data, x => x * x)
sum = reduce(data, 0, +)

After fusion:

bld1 = new vecbuilder[int]
bld2 = new merger[0, +]
for x in data:
    merge(bld1, x * x)
    merge(bld2, x)
Loops can be merged into one pass over data
19. [Image slide contrasting the two animals: pandas "typically lumber along at 1.2-1.8 MPH", while a grizzly is "active, agile, hard to out-run, top speed of 35 MPH"]
20. Grizzly
A subset of Pandas integrated with Weld
» Ported operators include unique, filter, and mask
» Easy to port more!
Same API as native Pandas, so zero changes to applications
Transparent single-core and multi-core speedups
21. Grizzly in action
import pandas as pd
import numpy as np

# Read dataframe from file
requests = pd.read_csv('filename.csv')
# Fix requests with extra digits
requests['Incident Zip'] = requests['Incident Zip'].str.slice(0, 5)
# Fix requests with 00000 zipcodes
zero_zips = requests['Incident Zip'] == '00000'
requests['Incident Zip'][zero_zips] = np.nan
# Display unique incident zips
print(requests['Incident Zip'].unique())
Adapted from http://pandas.pydata.org/pandas-docs/stable/tutorials.html (chapter 7)
22. Grizzly in action
import pandas as pd, grizzly as gr
import numpy as np

# Read dataframe from file
raw_requests = pd.read_csv('filename.csv')
requests = gr.DataFrame(raw_requests)
# Fix requests with extra digits
requests['Incident Zip'] = requests['Incident Zip'].str.slice(0, 5)
# Fix requests with 00000 zipcodes
zero_zips = requests['Incident Zip'] == '00000'
requests['Incident Zip'][zero_zips] = np.nan
# Display unique incident zips
print(requests['Incident Zip'].unique().evaluate())
Adapted from http://pandas.pydata.org/pandas-docs/stable/tutorials.html (chapter 7)
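The only changes from the native Pandas version on the previous slide are wrapping the parsed DataFrame in gr.DataFrame and adding the final .evaluate() call, which forces the lazily collected computation; every intermediate operation is unchanged.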
24. Conclusion
Changing the interface between libraries can speed up data analytics applications by 10-100x on modern hardware
Try out Weld for yourself: weld.stanford.edu
Editor's notes
This is great because anyone can leverage the work of 1000s of authors who have never met each other and are implementing the latest algorithms in each domain.
What we've lost with this modularity, however, is optimization across these functions. Developers have focused on making each individual function really fast, but almost no effort has gone into optimizing the end-to-end application.
Is end-to-end optimization in modern data applications something we really need?
The growing gap between memory and processing speeds is making the traditional way of combining functions worse.
In particular, the usual way to compose these libraries is to make function calls into them. To combine different libraries, you take the output of one function call and pass it as the input to another function call in a different library. This means that you keep reading results from memory to do some work and writing them back over and over again.
Consider this toy example which reads some CSV and does some computation with NumPy and Pandas. Each of these individual functions is highly optimized and written in C. So even though each individual function is really fast and carefully optimized, a lot of time goes into just reading and writing from memory as opposed to doing any processing.
We found up to 30x slowdowns in applications such as NumPy and Pandas and TensorFlow, compared to writing these applications as a single giant C program with one loop that processes the data and that a standard C compiler could optimize. The results were even worse when we combined libraries (for example, using NumPy and TensorFlow in the same application).
So, how do we solve this problem? Because each function is written as a “black box”, we can’t do any optimizations on existing programs. In addition, data scientists want to continue to mix and match different libraries and frameworks to build their applications, so building one giant “mega-system” which can do it all isn’t practical.
The approach we take to solving this is a common runtime which can capture the computation in all the domains people use in their applications in some common format, and then optimize that common format.
This common format is similar to how people use libraries like CUDA or OpenCL today to implement accelerated versions of their libraries.
The runtime can look at the entire workload and produce one “end-to-end optimized” version of the entire workload by running optimizations like vectorization or loop fusion.
We can also target different heterogeneous parallel hardware
Weld is our implementation of such a runtime. It has four main components
First is a runtime API which frameworks and libraries use to register computations with the runtime.
Second is an IR to capture the actual work each framework does. An IR is a small programming language.
Third is an optimizer which operates over the IR to fuse loops, vectorize code, and so forth
Last is a set of backends to generate code for different parallel hardware.
First is an IR to capture parallel work – this is what a single function expresses its computation in. We focus on data-parallel workloads because they are predominant in analytics applications - for example, all of Spark's operators are data parallel, and most operators in libraries like TensorFlow and NumPy also fundamentally apply data-parallel operations over collections of data.
Our prototype of Weld has integrations with some real frameworks: Pandas, TensorFlow, NumPy, and Spark.
Of these, the Pandas integration is the most complete, and it is open source today.
Weld integration gives us some pretty nice performance boosts
TPC-H: 400GB total, distributed across 20 nodes, 20GB per node
NumPy: Simple sum of 10 million integers in a loop
TensorFlow: MNIST again (zero or non-zero), 55000 images.
Spark 2.0
Cross library optimizations can give even crazier speedups. We combined Pandas and NumPy in a simple data cleaning workload.
Note: log scale
Pandas dataset size: 5GB
Workload: filter big cities, evaluate a simple linear model using the features present to compute a crime index, and then aggregate into a total crime index.
Spark SQL: 10GB TPC-H (SF=10), 60 million rows
Workload: read columns, apply a UDF that doubles one column, compute the sum of the pointwise product.
Weld’s first component is an API to collect work across libraries. The main goal of the API is composability - allowing library writers who have never met and who work in different domains to enable optimizations across their functions.
The API uses *lazy evaluation* to build up a large computation, and then compiles the computation into fast machine code and evaluates it using the data in the application.
In the example in this picture, a user is calling three Weld-enabled functions, which are stitched together using the runtime API. When the user needs to materialize the result of a computation, the runtime compiles the IR into optimized machine code and produces a result.
Weld’s runtime produces an IR we have developed, which has three goals.
(read goals off slides)
Apart from some basic operators to do arithmetic, branching, and so forth, Weld's IR has two parallel constructs: a loop which iterates over a dataset in parallel, and an object type called builders for producing results.
Builders are declarative types, so they just specify what kind of result they return when you "merge" values into them in parallel.
For example, we can perform a reduction by using a loop to pass over some data and merge each element into a builder which takes a commutative merge function (e.g., add if we want to compute a sum).
Builders don't specify a specific implementation strategy, so backends can choose how to implement them most efficiently for a given target platform. They also have some limitations that let them be implemented efficiently on parallel hardware. For example, the final result of a builder can only be read out once, after all writes have completed. This allows backends to keep arbitrary, distributed state until an actual result needs to be materialized.
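As a way to see these semantics, here is a minimal Python sketch of the merger builder contract described above (illustrative only; real builders are compiled constructs whose implementation each backend chooses): merges use a commutative function, so they can happen in any order or on any thread, and the result may be read out exactly once.

```python
import operator

class Merger:
    """Sketch of a 'merger' builder: combines merged values with a
    commutative function; the result may be read out exactly once."""
    def __init__(self, zero, func):
        self.value = zero
        self.func = func
        self.consumed = False

    def merge(self, x):
        # Commutativity lets a backend keep one partial Merger per
        # thread and combine the partials only at the end.
        assert not self.consumed, "builder already consumed"
        self.value = self.func(self.value, x)

    def result(self):
        # Read-once: backends may keep arbitrary distributed state
        # until a result actually has to be materialized.
        assert not self.consumed, "result() may be called only once"
        self.consumed = True
        return self.value

# reduce(data, 0, +) from slide 17, expressed against the sketch:
b = Merger(0, operator.add)
for x in [1, 2, 3, 4]:
    b.merge(x)
print(b.result())  # 10
```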
The IR can capture workloads from a number of different domains and, more importantly, we've found that it can also capture compositions of these libraries.
The IR enables optimizations of composed functions. We can express things like common subexpression elimination and vectorization, and because the loops and builders are parallel structures, every loop in Weld can automatically be parallelized as well.
In this slide we see a simple optimization which can improve the performance of two composed library calls. One library does a map, and another library does a reduce over the same data.
In today's world, each of these functions would be a black box, and you'd have to scan the data twice.
However, Weld's runtime API enables composing these into a single block of Weld code, and the IR can express fusing these two calls into a single pass over the data.