SparkR: Enabling Interactive Data Science at Scale on Hadoop

5. Architecture (stack figure): SparkR sits on top of Spark (data processing), which runs on a cluster manager (Mesos / YARN) over the storage layer (HDFS / HBase / Cassandra).
8. Q: How can I use a loop to [...insert task here...]?
A: Don't. Use one of the apply functions.
From: http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/
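The advice on that slide can be illustrated with a tiny base-R comparison (not from the deck): the loop and the `lapply` call below compute the same squares, with the apply form being the idiomatic one.

```r
# Loop version (discouraged on the slide).
squares_loop <- numeric(5)
for (i in 1:5) squares_loop[i] <- i^2

# Idiomatic version: one of the apply functions.
squares_apply <- unlist(lapply(1:5, function(i) i^2))

identical(squares_loop, squares_apply)  # TRUE
```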
10. R + RDD = RRDD

Operations exposed on RRDDs include: lapply, lapplyPartition, groupByKey, reduceByKey, sampleRDD, collect, cache, …, plus broadcast, includePackage, textFile, and parallelize.
12. Example: Word Count

lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines,
                 function(line) {
                   strsplit(line, " ")[[1]]
                 })
wordCount <- lapply(words,
                    function(word) {
                      list(word, 1L)
                    })
13. Example: Word Count

lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines,
                 function(line) {
                   strsplit(line, " ")[[1]]
                 })
wordCount <- lapply(words,
                    function(word) {
                      list(word, 1L)
                    })
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)
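The same pipeline can be sketched in plain R on in-memory data to show what each stage produces; here `tapply` is a base-R stand-in for SparkR's distributed `reduceByKey`, and the sample lines are made up for illustration.

```r
# Plain-R sketch of the word-count pipeline above (no Spark needed).
lines <- c("to be or", "not to be")

# flatMap: split each line into words and flatten the result.
words <- unlist(lapply(lines, function(line) strsplit(line, " ")[[1]]))

# lapply: pair every word with a count of 1.
wordCount <- lapply(words, function(word) list(word, 1L))

# reduceByKey with "+": sum the counts per word (tapply stands in here).
counts <- tapply(
  vapply(wordCount, function(p) p[[2]], integer(1)),
  vapply(wordCount, function(p) p[[1]], character(1)),
  sum
)
counts[["to"]]  # 2
```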
31. On the Roadmap

- High-level DataFrame API
- Integrating Spark's MLlib from R
- Reading from SequenceFiles and HBase
36. Example: Logistic Regression

pointsRDD <- textFile(sc, "hdfs://myfile")
weights <- runif(n = D, min = -1, max = 1)

# Logistic gradient: label Y in column 1, features X in the rest
gradient <- function(partition) {
  X <- partition[, -1]; Y <- partition[, 1]
  t(X) %*% ((1 / (1 + exp(-Y * (X %*% weights))) - 1) * Y)
}
37. Example: Logistic Regression

pointsRDD <- textFile(sc, "hdfs://myfile")
weights <- runif(n = D, min = -1, max = 1)

# Logistic gradient: label Y in column 1, features X in the rest
gradient <- function(partition) {
  X <- partition[, -1]; Y <- partition[, 1]
  t(X) %*% ((1 / (1 + exp(-Y * (X %*% weights))) - 1) * Y)
}

# Iterate
weights <- weights - reduce(
  lapplyPartition(pointsRDD, gradient),
  "+")
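One gradient step from the slides can be sketched locally: `split`-style slicing and base R's `Reduce` stand in for SparkR's `lapplyPartition` and `reduce`, the synthetic data and the 0.1 step size are additions for illustration and are not on the slide.

```r
# Plain-R sketch of one gradient-descent step, run on an in-memory matrix
# instead of pointsRDD. Data layout follows the slide: label in column 1.
set.seed(42)
D <- 3                                   # number of features (assumed)
n <- 100
X_all <- matrix(rnorm(n * D), n, D)
Y_all <- ifelse(X_all[, 1] > 0, 1, -1)   # labels in {-1, +1}
points <- cbind(Y_all, X_all)

weights <- runif(n = D, min = -1, max = 1)

gradient <- function(partition) {
  X <- partition[, -1]; Y <- partition[, 1]
  t(X) %*% ((1 / (1 + exp(-Y * (X %*% weights))) - 1) * Y)
}

# Two "partitions", summed with Reduce: mimics lapplyPartition + reduce(..., "+").
parts <- list(points[1:50, ], points[51:100, ])
grad <- Reduce(`+`, lapply(parts, gradient))
weights <- weights - 0.1 * drop(grad)    # 0.1 step size added for stability
```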
38. How does it work? (architecture figure)

- The R shell talks to the Spark Context through rJava.
- Each Spark Executor launches an RScript process to run user code.
- Data: RDD[Array[Byte]]; Func: Array[Byte] (R functions are serialized to bytes).
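The "Func: Array[Byte]" box can be demonstrated with base R alone: `serialize()` turns an R closure into a raw byte vector (an `Array[Byte]` on the JVM side) that can be shipped to an executor and restored there with `unserialize()`. This is a minimal sketch of the mechanism, not SparkR's actual shipping code.

```r
# An R function becomes bytes, travels, and is restored intact.
f <- function(x) x + 1L
bytes <- serialize(f, connection = NULL)  # raw vector of bytes
g <- unserialize(bytes)                   # restored closure
g(41L)  # 42
```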