2. Motivation
● Big Data computations require lots of resources
○ CPU
○ RAM
● Sharing the results is difficult in most current setups
○ Precomputed datasets
○ Trained models
○ Insights
3. Solution
Created for the Seahorse 1.0 release
● Single Spark application as the backend
○ Other team members’ results are easily accessible in memory
○ No unnecessary duplication of data
● Multiple IPython Notebooks as clients
4. Challenges
● How to use the SparkContext and SQLContext of an application running on a cluster?
● How to execute Python code on the cluster?
5. Py4J
A library for Python-Java communication
● “Wraps” JVM-based objects
● Exposes their API in Python
● Internally, uses a custom TCP client/server protocol
● In the JVM: a Gateway Server
● On the Python side: a client called Java Gateway (see the sketch below)
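To make the pattern concrete, here is a minimal Py4J round trip, assuming some JVM process has already started a py4j GatewayServer on the default port; everything shown is standard Py4J API:

```python
from py4j.java_gateway import JavaGateway

# Connect to a GatewayServer already running inside the JVM
# (py4j listens on port 25333 by default).
gateway = JavaGateway()

# Call into the JVM: instantiate java.util.Random and use it
# as if it were a Python object.
random = gateway.jvm.java.util.Random()
print(random.nextInt(100))

# The JVM side can also register a single "entry point" object;
# its methods then become directly callable from Python.
entry_point = gateway.entry_point
```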
6. Using an Existing SparkContext
● The Spark application exposes its SparkContext and SQLContext
○ It’s actually quite easy, once you know what you’re doing
● The Notebook connects to the Spark application via Py4J on startup (see the sketch below)
○ The sc and sqlContext variables are added to the user’s environment
○ This setup is completely transparent to the user
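A hedged sketch of what the notebook-side startup could look like, assuming Spark 1.x (the Seahorse 1.0 era) and a hypothetical entry point exposing getJavaSparkContext() and getSQLContext(); the gateway and jsc constructor parameters are real PySpark options for wrapping existing JVM objects:

```python
from py4j.java_gateway import JavaGateway, GatewayParameters
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Connect to the gateway of the already-running Spark application;
# the port is an assumption, Seahorse would use its own configuration.
gateway = JavaGateway(gateway_parameters=GatewayParameters(port=25333))

# Hypothetical entry-point methods returning the JVM-side contexts.
jsc = gateway.entry_point.getJavaSparkContext()
jsql = gateway.entry_point.getSQLContext()

# Wrap the existing JVM objects instead of creating new ones.
sc = SparkContext(gateway=gateway, jsc=jsc)
sqlContext = SQLContext(sc, sqlContext=jsql)

# From here on, the notebook user works with sc and sqlContext
# exactly as in a regular PySpark session.
```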
7. Notebook Architecture Overview
● User’s code is executed by kernels: processes spawned by the Notebook Server
● Kernels execute the user’s code on the Notebook Server host
8. Requirements
● User’s code must be executed on the Spark driver
● No assumptions can be made about the driver being visible from the Notebook Server (one way to satisfy this is sketched below)
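One way to satisfy both requirements, sketched here purely as an illustration and not as the talk’s actual mechanism: the kernel opens the Py4J connection itself, so only the notebook host needs to reach the gateway, and it ships source code to the driver for evaluation. The executeCode entry-point method is assumed:

```python
# Hypothetical kernel-side helper: instead of exec()-ing the user's
# code locally, forward the source to the Spark driver over Py4J.
def run_on_driver(gateway, source):
    # executeCode() is an assumed entry-point method that would
    # evaluate the snippet inside the driver process and return
    # its output as a string.
    return gateway.entry_point.executeCode(source)

# Reusing the gateway from the earlier sketch.
print(run_on_driver(gateway, "df.count()"))
```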
10. The Interaction Between Users
● A storage object accessible via Py4J (an illustrative API is sketched below)
○ Each client connected to the Spark application can reuse any entity from the storage
■ DataFrames
■ Models
■ Even code snippets
○ Access control
■ Sharing with only selected colleagues
■ Private storage
○ Notifications: “Hey, look, Susan published a new result!”
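None of the storage API below appears in the slides; getStorage(), publish(), and get() are invented names meant only to show what an in-memory, Py4J-exposed registry with access control could look like from a notebook:

```python
# Hypothetical handle to the shared storage object, obtained
# through the Spark application's Py4J entry point.
storage = gateway.entry_point.getStorage()

# Publish a DataFrame under a name, visible only to selected colleagues.
storage.publish("Something Interesting", df._jdf, ["alex", "susan"])

# Another notebook connected to the same Spark application retrieves
# the JVM-side DataFrame and wraps it for PySpark use.
from pyspark.sql import DataFrame
shared = DataFrame(storage.get("Something Interesting"), sqlContext)
```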
11. Cooperative Data Exploration
● John defines a DataFrame: “Something Interesting”
● Alex explores it
● Susan bases her models on it
● John uses a model shared by Susan
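Using the same invented storage API as in the previous sketch, the whole exchange could read like this from each person’s notebook:

```python
# John's notebook: publish the DataFrame for the team.
storage.publish("Something Interesting", something_interesting._jdf)

# Alex's notebook: explore John's result without recomputing it.
from pyspark.sql import DataFrame
df = DataFrame(storage.get("Something Interesting"), sqlContext)
df.show()

# Susan's notebook: build a model on top of it, then share it back.
model = train_model(df)            # placeholder for her actual pipeline
storage.publish("Susan's model", model)

# John's notebook: reuse the model Susan published.
susans_model = storage.get("Susan's model")
```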