New directions for Apache Spark in 2015

This document discusses new directions for Apache Spark in 2015, including improved interfaces for data science, external data sources, and machine learning pipelines. It also summarizes Spark's growth in 2014 with over 500 contributors, 370,000 lines of code, and 500 production deployments. The author proposes that Spark will become a unified engine for all data sources, workloads, and environments.

New Directions for Spark in 2015
Matei Zaharia
February 20, 2015

What is Apache Spark?
Fast and general engine for big data processing, with libraries for SQL, streaming, and advanced analytics
Most active open source project in big data

About Databricks
Founded by the creators of Spark in 2013
Largest organization contributing to Spark
– 3/4 of the code in 2014
End-to-end hosted service, Databricks Cloud

2014: an Amazing Year for Spark
Total contributors: 150 => 500
Lines of code: 190K => 370K
500 active production deployments

Contributors per Month to Spark
[Chart: contributors per month, 2011 to 2015; y-axis from 0 to 100]
Most active project at Apache

On-Disk Sort Record: Time to sort 100TB
2013 Record: Hadoop, 2100 machines, 72 minutes
2014 Record: Spark, 207 machines, 23 minutes
Source: Daytona GraySort benchmark, sortbenchmark.org
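
To put the two sort records side by side, the arithmetic can be worked through as below. This is only a sketch using the four figures quoted on the slide (2100 machines and 72 minutes for Hadoop, 207 machines and 23 minutes for Spark); the "machine-minutes" measure and the helper name `machine_minutes` are illustrative assumptions, not part of the benchmark's own reporting.

```python
# Figures as quoted on the slide (Daytona GraySort, 100 TB on disk).
hadoop_2013 = {"machines": 2100, "minutes": 72}
spark_2014 = {"machines": 207, "minutes": 23}


def machine_minutes(run):
    """Total machine time consumed: cluster size times wall-clock minutes."""
    return run["machines"] * run["minutes"]


# Ratios of the 2013 run to the 2014 run.
speedup = hadoop_2013["minutes"] / spark_2014["minutes"]
machine_ratio = hadoop_2013["machines"] / spark_2014["machines"]
resource_ratio = machine_minutes(hadoop_2013) / machine_minutes(spark_2014)

print(f"Wall-clock speedup: {speedup:.1f}x")          # 3.1x
print(f"Machines used: {machine_ratio:.1f}x fewer")   # 10.1x
print(f"Machine-minutes: {machine_minutes(hadoop_2013)} vs "
      f"{machine_minutes(spark_2014)} ({resource_ratio:.1f}x less)")  # 151200 vs 4761 (31.8x)
```

Read this way, the 2014 run finished roughly three times faster on roughly a tenth of the machines.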

New Directions in 2015
Data Science: high-level interfaces similar to single-machine tools
Platform Interfaces: plug in data sources and algorithms

DataFrames
Similar API to data frames in R and Pandas
Automatically optimized via Spark SQL
Coming in Spark 1.3
    # Assumes an existing SQLContext named sqlContext (Spark 1.3 API)
    df = sqlContext.jsonFile("tweets.json")
    (df[df["user"] == "matei"]
        .groupBy("date")
        .sum("retweets"))
[Chart: running time (axis 0 to 10) for Python, Scala, and DataFrame]
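
For readers unfamiliar with the DataFrame operations on this slide, the following plain-Python sketch shows what the query computes: keep one user's rows, group them by date, and sum the retweets in each group. It does not use Spark at all; the sample records and the function name `retweets_per_date` are hypothetical stand-ins for the contents of tweets.json.

```python
from collections import defaultdict

# Hypothetical sample records standing in for the contents of tweets.json.
tweets = [
    {"user": "matei", "date": "2015-02-19", "retweets": 3},
    {"user": "matei", "date": "2015-02-19", "retweets": 5},
    {"user": "matei", "date": "2015-02-20", "retweets": 2},
    {"user": "other", "date": "2015-02-20", "retweets": 9},
]


def retweets_per_date(records, user):
    """Filter to one user, group by date, and sum retweets per group."""
    totals = defaultdict(int)
    for row in records:
        if row["user"] == user:                     # the filter step
            totals[row["date"]] += row["retweets"]  # group by date, then sum
    return dict(totals)


result = retweets_per_date(tweets, "matei")
print(result)  # {'2015-02-19': 8, '2015-02-20': 2}
```

The DataFrame version expresses the same steps declaratively, which is what lets Spark SQL optimize the query instead of running it row by row in Python.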

R Interface (SparkR)
Arrives in Spark 1.4 (June)
Exposes DataFrames, RDDs, and ML library in R
    df = jsonFile("tweets.json")
    summarize(
      group_by(
        df[df$user == "matei",],
        "date"),
      sum("retweets"))

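Machine Learning Pipelines
High-level API inspired by SciKit-Learn
Featurization, evaluation, model tuning
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import HashingTF, Tokenizer

    # Column names "text", "words", "features" are assumed; df also needs a "label" column
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    tf = HashingTF(inputCol="words", outputCol="features", numFeatures=1000)
    lr = LogisticRegression()
    pipe = Pipeline(stages=[tokenizer, tf, lr])
    model = pipe.fit(df)
[Diagram: DataFrame -> tokenizer -> TF -> LR -> model]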
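
The pipeline idea on this slide (chain featurization stages, then fit a final learner to get a model) can be illustrated without Spark. The sketch below is a toy in plain Python, not Spark's implementation: every class name (`ToyTokenizer`, `ToyHashingTF`, `ToyCentroidClassifier`, `ToyPipeline`, and so on) is hypothetical, and a simple summed-vector classifier is swapped in where the slide uses logistic regression, purely to keep the example short.

```python
import zlib


class ToyTokenizer:
    """Transformer: splits the "text" field into lowercase words."""
    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class ToyHashingTF:
    """Transformer: hashes words into a fixed-length term-frequency vector."""
    def __init__(self, num_features):
        self.num_features = num_features

    def transform(self, rows):
        out = []
        for r in rows:
            vec = [0] * self.num_features
            for w in r["words"]:
                # crc32 is deterministic across runs, unlike the built-in hash()
                vec[zlib.crc32(w.encode("utf-8")) % self.num_features] += 1
            out.append(dict(r, features=vec))
        return out


class ToyCentroidModel:
    """Fitted model: predicts the label whose summed vector scores highest."""
    def __init__(self, centroids):
        self.centroids = centroids

    def transform(self, rows):
        out = []
        for r in rows:
            scores = {label: sum(a * b for a, b in zip(r["features"], c))
                      for label, c in self.centroids.items()}
            out.append(dict(r, prediction=max(scores, key=scores.get)))
        return out


class ToyCentroidClassifier:
    """Estimator: fit() learns one summed feature vector per label."""
    def fit(self, rows):
        centroids = {}
        for r in rows:
            acc = centroids.setdefault(r["label"], [0] * len(r["features"]))
            for i, v in enumerate(r["features"]):
                acc[i] += v
        return ToyCentroidModel(centroids)


class ToyPipelineModel:
    """Result of fitting a pipeline: a chain of transformers only."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Chains stages; fit() replaces each estimator with its fitted model."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)  # feed this stage's output to the next
            fitted.append(model)
        return ToyPipelineModel(fitted)


train = [
    {"text": "spark is fast", "label": 1},
    {"text": "spark scales well", "label": 1},
    {"text": "slow batch job", "label": 0},
]
pipe = ToyPipeline([ToyTokenizer(), ToyHashingTF(num_features=1000),
                    ToyCentroidClassifier()])
model = pipe.fit(train)
predictions = [r["prediction"] for r in model.transform([{"text": "spark is fast"}])]
print(predictions)  # [1]
```

The point of the design is that the fitted model carries the whole chain, so new raw text goes through the same tokenizing and hashing steps that were used during training.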

External Data Sources
Platform API to plug smart data sources into Spark
Returns DataFrames usable in Spark apps or SQL
Pushes logic into sources
    SELECT * FROM mysql_users u JOIN hive_logs h
    WHERE u.lang = 'en'
[Diagram: Spark connected to external sources, including {JSON}; the filter is pushed down to the source as:]
    SELECT * FROM users WHERE lang = 'en'
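
"Pushes logic into sources" can be pictured with a small plain-Python sketch. It is not the Spark data sources API: the `ToySource` class, its `scan` method, and the `query` function are hypothetical names invented here, and only equality filters are modeled. The idea shown is that a source able to filter on its own side returns fewer rows, while the engine re-checks the filter so that sources which ignore it still give correct results.

```python
class ToySource:
    """Hypothetical data source that may apply equality filters itself."""
    def __init__(self, rows, supports_pushdown):
        self.rows = rows
        self.supports_pushdown = supports_pushdown
        self.rows_returned = 0  # how many rows crossed the source boundary

    def scan(self, filters):
        """Return rows, applying `filters` (column -> value) if supported."""
        if self.supports_pushdown:
            result = [r for r in self.rows
                      if all(r.get(c) == v for c, v in filters.items())]
        else:
            result = list(self.rows)
        self.rows_returned += len(result)
        return result


def query(source, filters):
    """Engine side: hand the filters to the source, then re-check them."""
    rows = source.scan(filters)
    return [r for r in rows if all(r.get(c) == v for c, v in filters.items())]


users = [
    {"name": "ana", "lang": "en"},
    {"name": "bo", "lang": "fr"},
    {"name": "cy", "lang": "en"},
    {"name": "di", "lang": "de"},
]

smart = ToySource(users, supports_pushdown=True)
plain = ToySource(users, supports_pushdown=False)

smart_result = query(smart, {"lang": "en"})
plain_result = query(plain, {"lang": "en"})

print(smart_result == plain_result)              # True: same answer either way
print(smart.rows_returned, plain.rows_returned)  # 2 4: pushdown moved fewer rows
```

Both sources give the same answer; the difference is how much data has to leave the source before the engine sees it.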

Goal: one engine for all data sources, workloads and environments

To Learn More
Two free massive online courses on Spark: databricks.com/moocs
Try Databricks Cloud: databricks.com
