Introduction to SparkR
Shivaram Venkataraman, Hossein Falaki

useR tutorial part 1: Introduction to SparkR


Presentation given at useR 2016 at http://user2016.org/tutorials/11.html


  1. Introduction to SparkR. Shivaram Venkataraman, Hossein Falaki
  2. Big Data & R: DataFrames, Visualization, Libraries, Data+
  3. Big Data & R: Challenges. Data access: HDFS, Hive. Capacity: single machine memory. Parallelism: single thread.
  4. Apache Spark: engine for large-scale data processing. Fast, easy to use. Runs everywhere: EC2, clusters, laptop etc.
  5. Speed, Scalable, Flexible. Statistics, Visualization, DataFrames. SparkR.
  6. Big Data & R: Patterns. (1) Big Data, Small Learning; (2) Partition Aggregate; (3) Large Scale Machine Learning.
  7. Pattern 1: Big Data, Small Learning. Diagram labels: Data; Cleaning, Filtering, Aggregation; Collect; Subset; DataFrames, Visualization, Libraries.
  8. Pattern 1: Big Data, Small Learning. Diagram labels: Data; Cleaning, Filtering, Aggregation; Collect; Subset. Code on slide:
       songs <- read.df("songs.json", "json")
       newSongs <- filter(songs, songs$year > 2000)
       ggplot(collect(newSongs))
  9. Pattern 2: Partition Aggregate. Diagram labels: Data, Params, Best Model. Example: parameter tuning.
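  10. Pattern 2: Partition Aggregate. Diagram labels: Data, Params, Best Model. Code on slide (the penalty is passed as lambda, since the third positional argument of lm.ridge is subset):
       params <- c(1e-3, 1e-1, 1e2)
       data <- read.csv("t.csv")
       train <- function(prm) {
         lm.ridge(y ~ x + z, data, lambda = prm)
       }
       lapply(params, train)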
  11. Pattern 3: Large Scale Machine Learning. Diagram labels: Data, Featurize, Learning, Model.
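  12. Pattern 3: Large Scale Machine Learning. Diagram labels: Data, Featurize, Learning, Model. Code on slide (the model is fit on the data frame that was just read, training):
       training <- read.csv("t.csv")
       model <- glm(delay ~ Distance + Dest, family = "gaussian", data = training)
       summary(model)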
  13. Big Data & R: Big Data, Small Learning; Partition Aggregate; Large Scale Machine Learning. SparkR: unified approach.
  14. SparkR DataFrames. Number of data sources; column functions, SQL; support for R UDFs. Code on slide (the select operates on the people DataFrame that was just read):
       people <- read.df("people.json", "json")
       avgAge <- select(people, avg(people$age))
       head(avgAge)
  15. Large Scale Machine Learning. Integration with MLlib. Key features: R-like formulas, model statistics. Code on slide:
       model <- glm(a ~ b + c, data = df)
       summary(model)
  16. Partition Aggregate. spark.lapply: simple, parallel API. Examples: parameter tuning, model averaging. Can include existing R packages.
  17. SparkR Status. Open source, part of Apache Spark. More than 60 committers from UC Berkeley, Databricks, IBM, Intel, Alteryx etc. Contributions welcome!
  18. Tutorial Outline. Part 1: Data Exploration. ETL: data loading, schema. Exploration: filter, clean, aggregate etc. Visualization: integration with ggplot. Part 2: Advanced Analytics (after the break).
  19. Tutorial Setup. Each user gets a dedicated micro cluster. The cluster is terminated after 1 hour of inactivity. Multiple users can collaborate on a notebook. Notebooks can be exported and imported. Examples and tutorials in R, Python and Scala. Free online service for learning Apache Spark.
  20. Tutorial Setup. Databricks Notebooks: interactive workspace; Markdown + R, Python, Scala, SQL. Sign up at http://databricks.com/ce
  21. Tutorial Setup. Fill out our survey at tiny.cc/sparkr-user-survey
  22. SparkR. Big data processing from R. DataFrames for ETL, data exploration. Support for advanced analytics.
  23. Tutorial Next Steps. Sign up at http://databricks.com/ce. Part 1: tiny.cc/sparkr-tutorial-part1. Fill out our survey at tiny.cc/sparkr-user-survey
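
The partition-aggregate pattern from slides 9, 10 and 16 can be sketched in plain R as follows. This is a minimal illustration, not the presenters' code: the synthetic data frame stands in for the slide's t.csv (its columns y, x and z are assumed from the slide's formula), and choosing the best model by the GCV score that lm.ridge reports is one possible aggregation step that the slides do not specify.

```r
library(MASS)  # provides lm.ridge

# Synthetic stand-in for the slide's "t.csv" (assumption: columns y, x, z).
set.seed(42)
n <- 200
data <- data.frame(x = rnorm(n), z = rnorm(n))
data$y <- 2 * data$x - data$z + rnorm(n, sd = 0.5)

# Candidate ridge penalties, as on the slide.
params <- c(1e-3, 1e-1, 1e2)

# Partition step: fit one ridge model per penalty value.
train <- function(prm) {
  lm.ridge(y ~ x + z, data = data, lambda = prm)
}
models <- lapply(params, train)

# Aggregate step (assumed criterion): compare the generalized
# cross-validation score of each fit and keep the smallest.
gcv <- sapply(models, function(m) unname(m$GCV))
best <- params[which.min(gcv)]
print(best)
```

With an active SparkR session, the same shape should carry over by replacing lapply(params, train) with spark.lapply(params, train), which is the API slide 16 names. In that case each call runs on a worker, so the function would need to load MASS itself and the data would have to be small enough to ship with the closure.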
