8. Spark DataFrame
> head(filter(df, df$waiting < 50))  # an example in R
##   eruptions waiting
## 1     1.750      47
## 2     1.750      47
## 3     1.867      48
Scalable data frame for Java, Python, R, and Scala
Similar APIs to single-node tools (Pandas, dplyr), i.e. easy to learn
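The same filter can be written almost identically with the Scala DataFrame API; a minimal sketch, assuming an existing SQLContext named sqlContext and a hypothetical "faithful.json" file with the same eruptions/waiting columns:

import org.apache.spark.sql.functions.col

// minimal sketch: the R filter above, expressed in Scala
// ("faithful.json" and the column names are assumptions)
val df = sqlContext.read.json("faithful.json")
df.filter(col("waiting") < 50).show(3)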
11. Spark DataFrame Execution
[Diagram: Python DF, Java/Scala DF, and R DF all feed into a common Logical Plan, which the Catalyst optimizer compiles into physical execution.]
The logical plan is the intermediate representation for computation; the language APIs are simple wrappers that create a logical plan.
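To inspect the plans Catalyst produces for a query, a DataFrame exposes explain(); a small sketch, reusing the df and columns from the earlier example:

import org.apache.spark.sql.functions.{avg, col}

// prints the parsed, analyzed, and optimized logical plans plus the physical plan
df.filter(col("waiting") < 50)
  .groupBy(col("eruptions"))
  .agg(avg("waiting"))
  .explain(true)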
12. Benefit of Logical Plan: Simpler Frontend
Python: ~2000 lines of code (built over a weekend)
R: ~1000 lines of code
i.e. much easier to add new language bindings (Julia, Clojure, …)
13. Performance
[Chart: runtime for an example aggregation workload with the RDD API, comparing Java/Scala and Python.]
14. Benefit of Logical Plan: Performance Parity Across Languages
[Chart: runtime (secs) for an example aggregation workload, comparing the RDD API (Java/Scala, Python) with the DataFrame API (Java/Scala, Python, R, SQL).]
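The parity comes from every frontend producing the same logical plan: a DataFrame aggregation is expressed declaratively and optimized by Catalyst, while an RDD aggregation runs opaque user lambdas in each language's runtime. A sketch of what such a workload might look like in Scala; the (key, value) schema and "pairs.parquet" path are assumptions, not the actual benchmark:

import org.apache.spark.sql.functions.avg

val df = sqlContext.read.parquet("pairs.parquet")   // hypothetical columns: key, value

// RDD version: the lambdas are black boxes to the engine
val rddResult = df.rdd
  .map(row => (row.getString(0), (row.getDouble(1), 1L)))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (s, c) => s / c }

// DataFrame version: the expression is captured in the logical plan, so the
// optimized physical plan is the same whether written in Scala, Python, R, or SQL
val dfResult = df.groupBy("key").agg(avg("value"))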
23. Dataset API in Spark 1.6
Typed interface over DataFrames / Tungsten
import org.apache.spark.sql.Dataset
import sqlContext.implicits._   // provides the implicit Encoder[Person]

case class Person(name: String, age: Int)

val dataframe = sqlContext.read.json("people.json")
val ds: Dataset[Person] = dataframe.as[Person]

ds.filter(p => p.name.startsWith("M"))
  .groupBy("name")
  .avg("age")
24. Dataset
An "Encoder" specifies type information so Spark can translate a Dataset into a DataFrame and generate optimized memory layouts.
Check out SPARK-9999.
[Diagram: Dataset[T] is mapped to and from a DataFrame via an encoder.]
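A minimal sketch of where the encoder comes in when building a Dataset in Spark 1.6; the implicit encoders come from sqlContext.implicits._, and the sample records are made up:

import org.apache.spark.sql.{Dataset, Encoders}
import sqlContext.implicits._   // implicit Encoders for case classes and primitives

case class Person(name: String, age: Int)

// an encoder can be derived explicitly ...
val personEncoder = Encoders.product[Person]

// ... or picked up implicitly when constructing a typed Dataset
val ds: Dataset[Person] =
  sqlContext.createDataset(Seq(Person("Maria", 34), Person("Mike", 29)))

// the encoder maps Person fields to columns, so the typed view and the
// untyped DataFrame view share the same optimized (Tungsten) layout
val df = ds.toDF()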
25. Streaming DataFrames
Easier-to-use APIs (batch, streaming, and interactive)
And optimizations:
- Tungsten backends
- native support for out-of-order data
- data sources and sinks
// illustrative sketch of the planned streaming DataFrame API (syntax not final)
val stream = read.kafka("...")
stream.window(5 mins, 10 secs)
  .agg(sum("sales"))
  .write.jdbc("mysql://...")