SlideShare a Scribd company logo
1 of 31
Download to read offline
Ali Zaidi
Data Scientist @Microsoft
akzaidi
Extending R’s API with
Microsoft R Server and
Spark
@alikzaidi
Incredible R Speakers and Talks
• Felix Cheung -- SSR: Structured Streaming on R for Machine
Learning
– Tuesday, 11:00 AM – 11:30 AM
• Javier Luraschi -- Sparklyr: Recap, Updates and Use Cases
– Wednesday, 2:00 PM - 3:10 PM
• Hossein Falaki -- Apache SparkR Under the Hood: How to
Debug your SparkR Applications
– Wednesday, 4:20 PM – 4:50 PM
• Navdeep Gill -- From R Script to Production Using rsparkling
– Wednesday, 5:00 PM – 5:30 PM
Language Popularity
IEEE Spectrum Top Programming Languages
R’s popularity is growing rapidly
R Usage Growth
Rexer Data Miner Survey, 2007-2013
Rexer Data Miner Survey
What is
• A statistics programming language
• A data visualization tool
• Open source
• 2.5+M users
• Taught in most universities
• Thriving user groups worldwide
• 10,000+ free algorithms in CRAN
• Scalable to big data
• New and recent grad’s use it
Language
Platform
Community
Ecosystem
• Rich application & platform integration
R as an Interface
[Y]ou should understand R as a user interface,
or a language that’s capable of providing very
good user interfaces
– JJ Allaire, RStudio
Distributions of R
R APIs for Spark
PR#78
• SparkR
• ASF, Apache-
licensed
• Ships with Apache-
Spark since 1.4x
• SparkSQL and
SparkML support
through RPC
• UDF support
through gapply,
dapply,
spark.lapply
• On a workstation, that means:
– All available cores will be used for math operations and parallel
processes
– Hard drive capacity sets limit for data size, not RAM
• On a cluster:
– Parallel utilization of all available nodes
– Distributed file systems like HDFS greatly expand possible data sizes
MRS in Different Contexts
Code written on a workstation will run on a cluster by
tweaking a single function call:
# Use your local computer:
rxSetComputeContext( RxLocalParallel() )
# Switch to your cluster:
rxSetComputeContext( RxSpark(...) )
MRS in Different Contexts
Parallel External Memory Algorithms (PEMAs)
1. A chunk/subset of data is extracted from the main dataset
2. An intermediate result is calculated from that chunk of data
3. The intermediate results are combined into a final dataset
How MRS Works
How	Does	Remote	Compute	Context	Work?
Algorithm
Master
Predictive
Algorithm
Big
Data
Analyze
Blocks In
Parallel
Load Block
At A Time
Distribute Work,
Compile Results
“Pack and Ship”
Requests to
Remote
Environments
Results
Microsoft R Server functions
• A compute context defines where to process.
• E.g. remote context like Hadoop Map Reduce
• Microsoft R functions prefixed with rx
• Current set compute context determines processing
location
Copyright Microsoft Corporation. All rights reserved.
Microsoft R Client Microsoft R Server
Console
R IDE or
command-
line REMOTE
CONTEXT
Variable Selection
Stepwise Regression
Simulation
Simulation (e.g. Monte Carlo)
Parallel Random Number Generation
Cluster Analysis
K-Means
Classification
Decision Trees
Decision Forests
Gradient Boosted Decision Trees
Naïve Bayes
Combination
rxDataStep
rxExec, rxExecBy
PEMA-R API Custom Algorithms
Parallelized, Remote Execution
Algorithms
Data Step
Data import – Delimited, Fixed, SAS, SPSS,
OBDC
Variable creation & transformation
Recode variables
Factor variables
Missing value handling
Sort, Merge, Split
Aggregate by category (means, sums)
Descriptive Statistics
Min / Max, Mean, Median (approx.)
Quantiles (approx.)
Standard Deviation
Variance
Correlation
Covariance
Sum of Squares (cross product matrix for set variables)
Pairwise Cross tabs
Risk Ratio & Odds Ratio
Cross-Tabulation of Data (standard tables & long form)
Marginal Summaries of Cross Tabulations
Statistical Tests
Chi Square Test
Kendall Rank Correlation
Fisher’s Exact Test
Student’s t-Test
Sampling
Subsample (observations & variables)
Random Sampling
Stratified Sampling
Predictive Models
Sum of Squares (cross product matrix for set variables)
Quantiles (approx.)
Generalized Linear Models (GLM) exponential family
distributions: binomial, Gaussian, inverse Gaussian, Poisson,
Tweedie. Standard link functions: cauchy,, identity, log, logit,
probit. User defined distributions & link functions.
Covariance & Correlation Matrices
Logistic Regression, SDCA
Classification & Regression Trees, Random Forest,
Neural Networks, Convolutional Neural Networks
Ensemble Algorithms
Sharing a Spark Session
• In Microsoft R Server 9.1 (MRS), you can create a single Spark Session
from MRS that connects with sparklyr to that session
• Allows you to share data across MRS and sparklyr, and use functions
from either package in the same pipeline
• can also share the session with any other sparklyr extension
packages, like rsparkling
• MRS has readers for Hive, using the RxHiveData source reader, so
anything you persist to the Hive metastore from SparkSQL/sparklyr can
be consumed by MRS
Sparklyr’s Role in the Tidyverse
Spark Abstractions for R
RDD;
DataFrame;
Transformers;
a function that takes a
RDD/DataFrame in and pushes
another RDD/DataFrame out
Actions;
a function that takes a
RDD/DataFrame in, and puts
anything else out
Distributed immutable list;
Distributed data.frame:
tbl_spark, tbl_sql
Lazy Computations;
dplyr, sql,
sdf_transforms,
ml_transformer
Eager Queries;
dplyr::collect, head
The Power of Interfaces and
Abstractions
• Interfaces and abstractions make it easier to write reusable code
• For dplyr, Spark SQL method dispatch occurs at the tbl_sql class
• Laziness occurs at the tbl_spark class
• We can develop scripts with dplyr that manipulate
other tbl objects: tbl_df, tbl_mysql, etc.
• When we're ready to run our code in Spark, simply change the data
source
Scaling Analytics from Single
Machines to Clusters
Suppose we had a local, single machine running an dplyr method:
Changing the Source
• Not sparklyr
functions,
regular dplyr
methods for
the right tbl
Can You Explain That?
• If we try the explain method, we can see the code-gen by dplyr
Two-Tables Joins and Lazy
Operations
Distributed Computing with MRS
• MRS has two packages for distributed machine learning and
predictive modeling:
• RevoScaleR
• Primarily written in C++
• Spark extensions written in Scala
• MicrosoftML
• Primarily written in C#
• CLR bridge to MRS process – (BxlServer)
• Both packages are executed as parallel external memory
algorithms, and can consume a variety a data sources
• Hive tables
• Parquet
• Text
• ODBC, and many more
From sparklyr to MRS
Spark
DataFrame,
tbl_spark
Parquet,
spark_write
_parquet
Hive
sdf_register
MRS
rxDataStep(
outFile =
xdfd)
Training ML Models with MRS
• Same intuitive syntax we are used to from all CRAN-R modeling functions
Training Ensembles
• MRS provides a convenient function to parallelize training across your
cluster and combine the models into an ensemble
• Creates an ensemble by sampling over the training and features, and
aggregating over the predictions
Distributed Hyperparameter
Optimization
• Can conduct cross-validation and hyperparameter tuning using foreach or
rxExec
Pretrained Algorithms for Transfer
Learning
• MicrosoftML includes pre-trained ImageNet models that you can use to
directly featurize your images, and fine-tune it for your data
Streaming Extensions
• sparklyr::invoke
• sparkstreaming | structlystreams
• sparkstreaming and structlystreams interface to Spark
Streaming (RDD) and Structured Streams respectively
• sparkstreaming:
Structured Streams
• structlystreams:
GraphFrames Extensions
• GraphFrames
• kevinykuo/sparklygraphs: R interface for GraphFrames
Thanks!

More Related Content

What's hot

Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark Summit
 

What's hot (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learning
 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
 
Apache Spark and Online Analytics
Apache Spark and Online Analytics Apache Spark and Online Analytics
Apache Spark and Online Analytics
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
 
Spark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas Geerdink
 
Advanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPAdvanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLP
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
 
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
 
Producing Spark on YARN for ETL
Producing Spark on YARN for ETLProducing Spark on YARN for ETL
Producing Spark on YARN for ETL
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
 
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan ZhangExperiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
 

Similar to Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Zaidi

Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
Alex Zeltov
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
Databricks
 
Michal Marušan: Scalable R
Michal Marušan: Scalable RMichal Marušan: Scalable R
Michal Marušan: Scalable R
GapData Institute
 

Similar to Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Zaidi (20)

TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
 
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
 
Ml2
Ml2Ml2
Ml2
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using R
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Michal Marušan: Scalable R
Michal Marušan: Scalable RMichal Marušan: Scalable R
Michal Marušan: Scalable R
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
Spark Workshop
Spark WorkshopSpark Workshop
Spark Workshop
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
DataMass Summit - Machine Learning for Big Data in SQL Server
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL Server
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 

More from Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 

Recently uploaded (20)

Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 

Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Zaidi

  • 1. Ali Zaidi Data Scientist @Microsoft akzaidi Extending R’s API with Microsoft R Server and Spark @alikzaidi
  • 2. Incredible R Speakers and Talks • Felix Cheung -- SSR: Structured Streaming on R for Machine Learning – Tuesday, 11:00 AM – 11:30 AM • Javier Luraschi -- Sparklyr: Recap, Updates and Use Cases – Wednesday, 2:00 PM - 3:10 PM • Hossein Falaki -- Apache SparkR Under the Hood: How to Debug your SparkR Applications – Wednesday, 4:20 PM – 4:50 PM • Navdeep Gill -- From R Script to Production Using rsparkling – Wednesday, 5:00 PM – 5:30 PM
  • 3. Language Popularity IEEE Spectrum Top Programming Languages R’s popularity is growing rapidly R Usage Growth Rexer Data Miner Survey, 2007-2013 Rexer Data Miner Survey
  • 4. What is • A statistics programming language • A data visualization tool • Open source • 2.5+M users • Taught in most universities • Thriving user groups worldwide • 10,000+ free algorithms in CRAN • Scalable to big data • New and recent grad’s use it Language Platform Community Ecosystem • Rich application & platform integration
  • 5. R as an Interface [Y]ou should understand R as a user interface, or a language that’s capable of providing very good user interfaces – JJ Allaire, RStudio
  • 6.
  • 8. R APIs for Spark PR#78 • SparkR • ASF, Apache- licensed • Ships with Apache- Spark since 1.4x • SparkSQL and SparkML support through RPC • UDF support through gapply, dapply, spark.lapply
  • 9. • On a workstation, that means: – All available cores will be used for math operations and parallel processes – Hard drive capacity sets limit for data size, not RAM • On a cluster: – Parallel utilization of all available nodes – Distributed file systems like HDFS greatly expand possible data sizes MRS in Different Contexts
  • 10. Code written on a workstation will run on a cluster by tweaking a single function call: # Use your local computer: rxSetComputeContext( RxLocalParallel() ) # Switch to your cluster: rxSetComputeContext( RxSpark(...) ) MRS in Different Contexts
  • 11. Parallel External Memory Algorithms (PEMAs) 1. A chunk/subset of data is extracted from the main dataset 2. An intermediate result is calculated from that chunk of data 3. The intermediate results are combined into a final dataset How MRS Works
  • 12. How Does Remote Compute Context Work? Algorithm Master Predictive Algorithm Big Data Analyze Blocks In Parallel Load Block At A Time Distribute Work, Compile Results “Pack and Ship” Requests to Remote Environments Results Microsoft R Server functions • A compute context defines where to process. • E.g. remote context like Hadoop Map Reduce • Microsoft R functions prefixed with rx • Current set compute context determines processing location Copyright Microsoft Corporation. All rights reserved. Microsoft R Client Microsoft R Server Console R IDE or command- line REMOTE CONTEXT
  • 13. Variable Selection Stepwise Regression Simulation Simulation (e.g. Monte Carlo) Parallel Random Number Generation Cluster Analysis K-Means Classification Decision Trees Decision Forests Gradient Boosted Decision Trees Naïve Bayes Combination rxDataStep rxExec, rxExecBy PEMA-R API Custom Algorithms Parallelized, Remote Execution Algorithms Data Step Data import – Delimited, Fixed, SAS, SPSS, OBDC Variable creation & transformation Recode variables Factor variables Missing value handling Sort, Merge, Split Aggregate by category (means, sums) Descriptive Statistics Min / Max, Mean, Median (approx.) Quantiles (approx.) Standard Deviation Variance Correlation Covariance Sum of Squares (cross product matrix for set variables) Pairwise Cross tabs Risk Ratio & Odds Ratio Cross-Tabulation of Data (standard tables & long form) Marginal Summaries of Cross Tabulations Statistical Tests Chi Square Test Kendall Rank Correlation Fisher’s Exact Test Student’s t-Test Sampling Subsample (observations & variables) Random Sampling Stratified Sampling Predictive Models Sum of Squares (cross product matrix for set variables) Quantiles (approx.) Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchy,, identity, log, logit, probit. User defined distributions & link functions. Covariance & Correlation Matrices Logistic Regression, SDCA Classification & Regression Trees, Random Forest, Neural Networks, Convolutional Neural Networks Ensemble Algorithms
  • 14. Sharing a Spark Session • In Microsoft R Server 9.1 (MRS), you can create a single Spark Session from MRS that connects with sparklyr to that session • Allows you to share data across MRS and sparklyr, and use functions from either package in the same pipeline • can also share the session with any other sparklyr extension packages, like rsparkling • MRS has readers for Hive, using the RxHiveData source reader, so anything you persist to the Hive metastore from SparkSQL/sparklyr can be consumed by MRS
  • 15. Sparklyr’s Role in the Tidyverse
  • 16. Spark Abstractions for R RDD; DataFrame; Transformers; a function that takes a RDD/DataFrame in and pushes another RDD/DataFrame out Actions; a function that takes a RDD/DataFrame in, and puts anything else out Distributed immutable list; Distributed data.frame: tbl_spark, tbl_sql Lazy Computations; dplyr, sql, sdf_transforms, ml_transformer Eager Queries; dplyr::collect, head
  • 17. The Power of Interfaces and Abstractions • Interfaces and abstractions make it easier to write reusable code • For dplyr, Spark SQL method dispatch occurs at the tbl_sql class • Laziness occurs at the tbl_spark class • We can develop scripts with dplyr that manipulate other tbl objects: tbl_df, tbl_mysql, etc. • When we're ready to run our code in Spark, simply change the data source
  • 18. Scaling Analytics from Single Machines to Clusters Suppose we had a local, single machine running an dplyr method:
  • 19. Changing the Source • Not sparklyr functions, regular dplyr methods for the right tbl
  • 20. Can You Explain That? • If we try the explain method, we can see the code-gen by dplyr
  • 21. Two-Tables Joins and Lazy Operations
  • 22. Distributed Computing with MRS • MRS has two packages for distributed machine learning and predictive modeling: • RevoScaleR • Primarily written in C++ • Spark extensions written in Scala • MicrosoftML • Primarily written in C# • CLR bridge to MRS process – (BxlServer) • Both packages are executed as parallel external memory algorithms, and can consume a variety a data sources • Hive tables • Parquet • Text • ODBC, and many more
  • 23. From sparklyr to MRS Spark DataFrame, tbl_spark Parquet, spark_write _parquet Hive sdf_register MRS rxDataStep( outFile = xdfd)
  • 24. Training ML Models with MRS • Same intuitive syntax we are used to from all CRAN-R modeling functions
  • 25. Training Ensembles • MRS provides a convenient function to parallelize training across your cluster and combine the models into an ensemble • Creates an ensemble by sampling over the training and features, and aggregating over the predictions
  • 26. Distributed Hyperparameter Optimization • Can conduct cross-validation and hyperparameter tuning using foreach or rxExec
  • 27. Pretrained Algorithms for Transfer Learning • MicrosoftML includes pre-trained ImageNet models that you can use to directly featurize your images, and fine-tune it for your data
  • 28. Streaming Extensions • sparklyr::invoke • sparkstreaming | structlystreams • sparkstreaming and structlystreams interface to Spark Streaming (RDD) and Structured Streams respectively • sparkstreaming:
  • 30. GraphFrames Extensions • GraphFrames • kevinykuo/sparklygraphs: R interface for GraphFrames