We provide an update on developments at the intersection of R and the broader machine learning ecosystem. The packages covered here let R users bring current big data analytics and deep learning tools into their existing workflows, and they make collaboration within multidisciplinary data science teams easier. Topics covered include:
– MLflow: managing the ML lifecycle with improved dependency management and more deployment targets
– TensorFlow: TF 2.0 update and probabilistic (deep) machine learning with TensorFlow Probability
– Spark: latest improvements and extensions, including text processing at scale with SparkNLP
3. Intro to R
“R is a programming language and free software environment for statistical computing and graphics.”
3#UnifiedDataAnalytics #SparkAISummit
4. Modern R
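library(tidyverse)
library(nycflights13)  # provides the flights data set used below

# Count flights and average departure delay per day; keep days with more than 1000 flights
flights %>%
  group_by(month, day) %>%
  summarise(count = n(), avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  filter(count > 1000)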
The tidyverse is an opinionated collection of R packages designed for data
science. All packages share an underlying design philosophy, grammar, and
data structures.
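One way to read the shared grammar: each dplyr verb takes a data frame as its first argument and returns a data frame, which is what lets %>% chain the verbs together. The sketch below is illustrative only. It assumes dplyr is installed and uses the built-in mtcars data and a threshold of 10, neither of which comes from the talk.

```r
library(dplyr)

# Piped form: read top to bottom, one verb per line.
piped <- mtcars %>%
  group_by(cyl) %>%
  summarise(count = n(), avg_mpg = mean(mpg), .groups = "drop") %>%
  filter(count > 10)

# The same computation as nested calls: read inside out.
nested <- filter(
  summarise(group_by(mtcars, cyl),
            count = n(), avg_mpg = mean(mpg), .groups = "drop"),
  count > 10
)

# Both forms should produce the same table.
same_result <- isTRUE(all.equal(piped, nested))
print(same_result)
```

The pipe changes only how the calls are written, not what they compute, so either form can be used where it reads better.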
10. Spark with R - Timeline
Sep 2016: sparklyr 0.4. R interface for Apache Spark.
Jan 2017: sparklyr 0.5. Livy and dplyr improvements.
Jul 2017: sparklyr 0.6. Distributed R and external sources.
Jan 2018: sparklyr 0.7. Spark Pipelines and Machine Learning.
May 2018: sparklyr 0.8. Production pipelines and graphs.
Oct 2018: sparklyr 0.9. Streams and Kubernetes.
Mar 2019: sparklyr 1.0. Arrow, XGBoost, Broom and TFRecords.
Oct 2019
23. TensorFlow - New? - tfprobability
Combine probabilistic models and deep learning
on modern hardware
library(tfprobability)
# create a binomial distribution with n = 7 and p = 0.3
d <- tfd_binomial(total_count = 7, probs = 0.3)
# compute mean
d %>% tfd_mean()
# compute variance
d %>% tfd_variance()
# compute probability
d %>% tfd_prob(2.3)
github.com/rstudio/tfprobability
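As a sanity check on the binomial example, the same quantities can be worked out in base R without TensorFlow. This is a sketch for comparison, not part of tfprobability. The variable names are illustrative, and the probability is evaluated at 2 rather than the slide's 2.3, because base R's dbinom assigns zero mass (with a warning) to non-integer values.

```r
# Binomial parameters from the slide: n = 7 trials, success probability p = 0.3
n <- 7
p <- 0.3

# Closed-form moments of a binomial distribution
binom_mean <- n * p            # n * p
binom_var  <- n * p * (1 - p)  # n * p * (1 - p)

# Cross-check the closed forms against the probability mass function
support <- 0:n
pmf <- dbinom(support, size = n, prob = p)
mean_from_pmf <- sum(support * pmf)
var_from_pmf  <- sum((support - mean_from_pmf)^2 * pmf)

# Probability of exactly 2 successes (an integer point in the support)
prob_at_2 <- dbinom(2, size = n, prob = p)

print(c(mean = binom_mean, variance = binom_var, prob_at_2 = prob_at_2))
```

The mean and variance printed here are the values one would expect tfd_mean() and tfd_variance() to report for the same parameters, up to floating-point precision.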