Big analytics meetup - Extended Jupyter Kernel Gateway
1. IBM SparkTechnology Center
Big Analytics Meetup
Building Enterprise/Cloud Analytics Platform
with Jupyter Notebooks and Apache Spark
Luciano Resende
IBM | Spark Technology Center
2. IBM SparkTechnology Center
About Me
Luciano Resende (lresende AT apache DOT org)
• Architect and community liaison at IBM – Spark Technology Center
• Have been contributing to open source at ASF for over 10 years
• Currently contributing to the Jupyter Notebook ecosystem, Apache Bahir, Apache Spark, and Apache Toree, among other projects related to the Apache Spark ecosystem
@lresende1975 | http://lresende.blogspot.com/ | https://www.linkedin.com/in/lresende | http://slideshare.net/luckbr1975 | lresende
4. Open Source Community Leadership
Spark Technology Center
• Founding Partner
• 188+ Project Committers
• 77+ Projects
• Key open source steering committee
• OSS Advisory Board
6. IBM SparkTechnology Center
IBM Spark Technology Center
Founded in 2015.
Location:
Physical: 505 Howard St., San Francisco CA
Web: http://spark.tc Twitter: @apachespark_tc
Mission:
Contribute intellectual and technical capital to the Apache Spark community.
Make the core technology enterprise- and cloud-ready.
Build data science skills to drive intelligence into business applications — http://bigdatauniversity.com
Key statistics:
About 40 developers, co-located with 25 IBM designers.
Major contributions to Apache Spark http://jiras.spark.tc
Apache SystemML is now a top-level Apache project!
Founding member of UC Berkeley AMPLab and RISE Lab
Member of R Consortium and Scala Center
7. IBM SparkTechnology Center
Focus on meaningful code contributions across all major Spark projects
863 code contributions (JIRAs) and counting; check out http://jiras.spark.tc
Over 422 commits in Spark 2.0, and continuing major contributions in 2.x
Contributions by the Spark Technology Center across almost all components of Spark: Spark Core, SparkR, SQL, MLlib, Streaming, PySpark, build and infrastructure, etc.
STC impact on community
46,385 Spark LOC
863 Spark JIRAs
457 SystemML JIRAs
67 Speakers at Events
Apache Spark Contributions
8. IBM SparkTechnology Center
Project Focus Areas
SQL
TPC-DS and Performance
Query Pushdown/Federation
Machine Learning
Spark MLlib
R4ML
Online Retraining
Apache Arrow
SystemML
Deep Learning
Consumability
Reference architectures
Spark Notebook stack
Spark Resource optimization
Spark Web UI
Apache Bahir
RedRock
Immersive Insights
10. IBM SparkTechnology Center
Jupyter Notebook Platform Architecture Overview
• The notebook UI runs in the browser
• The Notebook Server serves the notebooks
• Kernels interpret and execute cell contents
• They are responsible for code execution
• They abstract the different languages
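To make the UI/server/kernel split concrete, here is a minimal sketch of the kind of message a notebook front end sends when a cell is run. The field names follow the public Jupyter messaging protocol; the helper name `make_execute_request`, the default user name, and the `"channel"` key (used when such messages are framed over a WebSocket) are illustrative assumptions, not something shown on the slides.

```python
import json
import uuid
from datetime import datetime, timezone


def make_execute_request(code, session_id, username="alice"):
    """Build an execute_request message in the shape of the Jupyter messaging protocol.

    The helper name and the default username are hypothetical.
    """
    header = {
        "msg_id": uuid.uuid4().hex,          # unique id for this message
        "session": session_id,               # id of the client session
        "username": username,
        "date": datetime.now(timezone.utc).isoformat(),
        "msg_type": "execute_request",
        "version": "5.3",                    # protocol version, assumed here
    }
    content = {
        "code": code,                        # the cell contents to execute
        "silent": False,
        "store_history": True,
        "user_expressions": {},
        "allow_stdin": False,
        "stop_on_error": True,
    }
    return {
        "header": header,
        "parent_header": {},
        "metadata": {},
        "content": content,
        "channel": "shell",                  # assumption: WebSocket framing
    }


if __name__ == "__main__":
    msg = make_execute_request("print(1 + 1)", session_id=uuid.uuid4().hex)
    print(json.dumps(msg, indent=2))
```

The kernel, whatever its language, only sees messages like this one, which is what lets the same notebook UI drive Python, Scala, or R kernels.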
11. IBM SparkTechnology Center
Enterprise/Cloud Analytics Platform Characteristics
Large pool of shared computing resources
• Enterprise Cloud, Public Cloud or Hybrid
• Data in the cloud (Data Lakes/Object Store)
Distributed Consumers
• Notebooks running locally (on a user’s laptop) or as a service
Different Resource Utilization Patterns
• High number of idle resources
12. IBM SparkTechnology Center
Analytics Platform – Current state of the art
Open Source Jupyter based Notebook Platform
• Single User sharing the same distributed filesystem and privileges
• Jupyter kernels running as local processes
• Resources are limited by what is available on the one single node that runs all Kernels and associated Spark drivers.
• No security: users can see and control each other’s processes using Jupyter’s administration utilities.
13. IBM SparkTechnology Center
Analytics Platform Today – Shared Cluster
Allows Jupyter notebooks running outside of the cluster to run Jupyter kernels inside the cluster, sharing its resources.
• All Jupyter kernels run under a shared “service” user ID.
• Users can see and control each other’s kernels using Jupyter’s administration utilities.
• All kernels and their associated Spark drivers run on a
single (configurable) node of the cluster.
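As a rough illustration of how a notebook server outside the cluster reaches kernels inside it, the sketch below builds (but does not send) the HTTP and WebSocket requests a client would issue against a Jupyter Kernel Gateway. The `/api/kernelspecs`, `/api/kernels`, and `/api/kernels/<id>/channels` paths are the ones the Kernel Gateway exposes in its default mode; the host name and the function names are hypothetical.

```python
import json
from urllib.parse import quote, urlsplit, urlunsplit
from urllib.request import Request


def kernelspecs_request(gateway_url):
    """Request listing the kernel specs available on the gateway."""
    return Request(gateway_url.rstrip("/") + "/api/kernelspecs", method="GET")


def start_kernel_request(gateway_url, kernel_name):
    """Request asking the gateway to start a kernel of the given spec name."""
    body = json.dumps({"name": kernel_name}).encode("utf-8")
    return Request(
        gateway_url.rstrip("/") + "/api/kernels",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def channels_url(gateway_url, kernel_id):
    """WebSocket URL used to exchange kernel messages for one running kernel."""
    parts = urlsplit(gateway_url.rstrip("/"))
    scheme = "wss" if parts.scheme == "https" else "ws"
    path = parts.path + "/api/kernels/" + quote(kernel_id, safe="") + "/channels"
    return urlunsplit((scheme, parts.netloc, path, "", ""))


if __name__ == "__main__":
    # Hypothetical gateway address; nothing is sent over the network here.
    gateway = "https://gateway.example.com:8888"
    req = start_kernel_request(gateway, "python3")
    print(req.get_method(), req.full_url)
    print(channels_url(gateway, "some-kernel-id"))
```

Because every notebook server talks to the same gateway process, all kernels started this way run under whatever user ID the gateway itself runs as, which is the limitation this slide describes.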
[Diagram: Alice’s and Bob’s desktops each run multiple notebooks on a Jupyter Notebook Server (with NB2KG). Both connect through a security layer to a Jupyter Kernel Gateway (sandboxed by service user privileges) inside the Spark cluster. Each kernel hosts a Spark driver in yarn-client mode as the JNBG service user, and Spark executors run on the YARN workers as the JNBG service user.]
14. IBM SparkTechnology Center
Analytics Platform Today – Single User Cluster
Allows Jupyter notebooks running outside of the cluster to run Jupyter kernels in a cluster created specifically for that user.
• Expensive, as a cluster is created for every individual user
[Diagram: Bob’s and Alice’s desktops each run multiple notebooks on a Jupyter Notebook Server (with NB2KG), and each connects to a separate Spark cluster. Every cluster has its own Jupyter Kernel Gateway (sandboxed by service user privileges), a kernel hosting the Spark driver in yarn-client mode as the JNBG service user, and Spark executors on YARN workers.]
15. IBM SparkTechnology Center
Extended Jupyter Kernel Gateway
Notebook platform based on the Jupyter stack, aimed at Enterprise/Cloud requirements and use cases
16. IBM SparkTechnology Center
Extended Jupyter Kernel Gateway – Goals
Optimized Resource Allocation
• Run Spark in YARN Cluster Mode to better utilize cluster resources.
• Pluggable architecture for additional Resource Managers
Enhanced Security
• Enable TLS for all socket communications
• Any HTTP communication should be encrypted (SSL)
Multiuser support with user impersonation
• Enhance security and sandboxing by enabling user impersonation when running kernels.
• Individual HDFS home folder for each notebook user.
• Use the same user ID for notebook and batch jobs.
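One way to picture the YARN cluster mode and impersonation goals together is the command line a gateway might assemble when launching a kernel. The sketch below only builds the argument list; `--master yarn`, `--deploy-mode cluster`, and `--proxy-user` are standard `spark-submit` options, while the function name, the `launch_kernel.py` script, and its arguments are hypothetical and not taken from the slides.

```python
import shlex


def build_spark_submit(app, user, app_args=(), spark_submit="spark-submit"):
    """Assemble a spark-submit argv that runs the driver inside the cluster as `user`.

    `app` is a hypothetical kernel launcher script; `user` is the notebook user
    to impersonate through Spark's proxy-user mechanism.
    """
    if not user:
        raise ValueError("an impersonated user name is required")
    argv = [
        spark_submit,
        "--master", "yarn",
        "--deploy-mode", "cluster",   # driver runs on a YARN node, not on the gateway host
        "--proxy-user", user,         # job runs as the notebook user, not the service user
        app,
    ]
    argv.extend(app_args)
    return argv


if __name__ == "__main__":
    argv = build_spark_submit(
        "launch_kernel.py",                      # hypothetical launcher
        "alice",
        ["--connection-file", "/tmp/kernel.json"],  # hypothetical arguments
    )
    print(" ".join(shlex.quote(a) for a in argv))
```

In yarn-client mode the driver stays on the node that launched it, so every kernel competes for that one machine; in cluster mode YARN places each driver, which is what spreads the load. Proxy-user impersonation also requires the cluster to allow the service user to impersonate others, which is outside the scope of this sketch.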
17. IBM SparkTechnology Center
Extended Jupyter Kernel Gateway
Extending Jupyter Kernel Gateway
• Enable running kernels remotely in a cluster
• Pluggable kernel lifecycle management
• Enhanced security
• Multiuser leveraging user impersonation
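The "pluggable kernel lifecycle management" idea can be sketched as a small interface that each resource manager implements, plus a registry that picks an implementation by name. Everything below (class names, method names, the registry) is a hypothetical illustration rather than the project's actual API; the in-memory implementation stands in for a real YARN-backed one.

```python
import uuid
from abc import ABC, abstractmethod


class KernelLifecycleManager(ABC):
    """Hypothetical interface a resource-manager plugin would implement."""

    @abstractmethod
    def launch(self, kernel_name, user):
        """Start a kernel for `user` and return its id."""

    @abstractmethod
    def is_alive(self, kernel_id):
        """Return True while the kernel is running."""

    @abstractmethod
    def shutdown(self, kernel_id):
        """Stop the kernel and release its resources."""


class InMemoryManager(KernelLifecycleManager):
    """Stand-in implementation that only tracks kernels in a dict."""

    def __init__(self):
        self._kernels = {}

    def launch(self, kernel_name, user):
        kernel_id = uuid.uuid4().hex
        self._kernels[kernel_id] = (kernel_name, user)
        return kernel_id

    def is_alive(self, kernel_id):
        return kernel_id in self._kernels

    def shutdown(self, kernel_id):
        self._kernels.pop(kernel_id, None)


# Registry of available plugins; a YARN-backed manager would be added here.
MANAGERS = {"in-memory": InMemoryManager}


def create_manager(name):
    """Instantiate the lifecycle manager registered under `name`."""
    try:
        return MANAGERS[name]()
    except KeyError:
        raise ValueError(f"unknown lifecycle manager: {name!r}") from None


if __name__ == "__main__":
    manager = create_manager("in-memory")
    kid = manager.launch("python3", "alice")
    print(kid, manager.is_alive(kid))
    manager.shutdown(kid)
    print(manager.is_alive(kid))
```

Keeping launch, liveness checks, and shutdown behind one interface is what would let the gateway stay unchanged when another resource manager is plugged in.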
[Diagram: Extended Jupyter Kernel Gateway, Jupyter Kernel Gateway, Jupyter Notebook Server]
19. IBM SparkTechnology Center
Extended Jupyter Kernel Gateway
Stay tuned: we are going open source very soon!
If you are considering becoming an early adopter, please contact me at lresende AT us DOT ibm DOT com.
20. IBM SparkTechnology Center
Other Resources
IBM developerWorks Code: https://developer.ibm.com/code/
IBM developerWorks Journeys: https://developer.ibm.com/code/journey/