2. Motivation
● Big Data computations require lots of resources
○ CPU
○ RAM
● Sharing the results is difficult in most current setups
○ Precomputed datasets
○ Trained models
○ Insights
3. Solution
Created for the Seahorse 1.0 release
● Single Spark application as the backend
○ Other team members’ results are easily accessible in memory
○ No unnecessary duplication of data
● Multiple IPython Notebooks as clients
4. Challenges
● How to use the SparkContext and SQLContext of an application running on a cluster?
● How to execute Python code on the cluster?
5. Py4J
A library for Python-Java communication
● “Wraps” JVM-based objects
● Exposes their API in Python
● Internally, uses a custom TCP client/server protocol
● In the JVM: a Gateway Server
● On the Python side: a client called Java Gateway (see the sketch below)
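To make the pattern concrete, here is a minimal Py4J round trip, assuming some JVM process has already started a py4j GatewayServer on the default port; everything shown is standard Py4J API:

```python
from py4j.java_gateway import JavaGateway

# Connect to a GatewayServer already running inside the JVM
# (py4j listens on port 25333 by default).
gateway = JavaGateway()

# Call into the JVM: instantiate java.util.Random and use it
# as if it were a Python object.
random = gateway.jvm.java.util.Random()
print(random.nextInt(100))

# The JVM side can also register a single "entry point" object;
# its methods then become directly callable from Python.
entry_point = gateway.entry_point
```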
6. Using an Existing SparkContext
● The Spark application exposes its SparkContext and SQLContext
○ It’s actually quite easy, once you know what you’re doing
● The Notebook connects to the Spark application via Py4J on startup (see the sketch below)
○ The sc and sqlContext variables are added to the user’s environment
○ This setup is completely transparent to the user
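A hedged sketch of what the notebook-side startup could look like, assuming Spark 1.x (the Seahorse 1.0 era) and a hypothetical entry point exposing getJavaSparkContext() and getSQLContext(); the gateway and jsc constructor parameters are real PySpark options for wrapping existing JVM objects:

```python
from py4j.java_gateway import JavaGateway, GatewayParameters
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Connect to the gateway of the already-running Spark application;
# the port is an assumption, Seahorse would use its own configuration.
gateway = JavaGateway(gateway_parameters=GatewayParameters(port=25333))

# Hypothetical entry-point methods returning the JVM-side contexts.
jsc = gateway.entry_point.getJavaSparkContext()
jsql = gateway.entry_point.getSQLContext()

# Wrap the existing JVM objects instead of creating new ones.
sc = SparkContext(gateway=gateway, jsc=jsc)
sqlContext = SQLContext(sc, sqlContext=jsql)

# From here on, the notebook user works with sc and sqlContext
# exactly as in a regular PySpark session.
```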
7. Notebook Architecture Overview
● User’s code is executed by kernels: processes spawned by the Notebook Server
● Kernels execute the user’s code on the Notebook Server host
8. Requirements
● User’s code must be executed on the Spark driver
● No assumptions can be made about the driver being visible from the Notebook Server (one way to satisfy this is sketched below)
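One way to satisfy both requirements, sketched here purely as an illustration and not as the talk’s actual mechanism: the kernel opens the Py4J connection itself, so only the notebook host needs to reach the gateway, and it ships source code to the driver for evaluation. The executeCode entry-point method is assumed:

```python
# Hypothetical kernel-side helper: instead of exec()-ing the user's
# code locally, forward the source to the Spark driver over Py4J.
def run_on_driver(gateway, source):
    # executeCode() is an assumed entry-point method that would
    # evaluate the snippet inside the driver process and return
    # its output as a string.
    return gateway.entry_point.executeCode(source)

# Reusing the gateway from the earlier sketch.
print(run_on_driver(gateway, "df.count()"))
```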
10. The Interaction Between Users
● A storage object accessible via Py4J (an illustrative API is sketched below)
○ Each client connected to the Spark application can reuse any entity from the storage
■ DataFrames
■ Models
■ Even code snippets
○ Access control
■ Sharing with only selected colleagues
■ Private storage
○ Notifications: “Hey, look, Susan published a new result!”
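None of the storage API below appears in the slides; getStorage(), publish(), and get() are invented names meant only to show what an in-memory, Py4J-exposed registry with access control could look like from a notebook:

```python
# Hypothetical handle to the shared storage object, obtained
# through the Spark application's Py4J entry point.
storage = gateway.entry_point.getStorage()

# Publish a DataFrame under a name, visible only to selected colleagues.
storage.publish("Something Interesting", df._jdf, ["alex", "susan"])

# Another notebook connected to the same Spark application retrieves
# the JVM-side DataFrame and wraps it for PySpark use.
from pyspark.sql import DataFrame
shared = DataFrame(storage.get("Something Interesting"), sqlContext)
```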
11. Cooperative Data Exploration
● John defines a DataFrame: “Something Interesting”
● Alex explores it
● Susan bases her models on it
● John uses a model shared by Susan
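Using the same invented storage API as in the previous sketch, the whole exchange could read like this from each person’s notebook:

```python
# John's notebook: publish the DataFrame for the team.
storage.publish("Something Interesting", something_interesting._jdf)

# Alex's notebook: explore John's result without recomputing it.
from pyspark.sql import DataFrame
df = DataFrame(storage.get("Something Interesting"), sqlContext)
df.show()

# Susan's notebook: build a model on top of it, then share it back.
model = train_model(df)            # placeholder for her actual pipeline
storage.publish("Susan's model", model)

# John's notebook: reuse the model Susan published.
susans_model = storage.get("Susan's model")
```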