SlideShare une entreprise Scribd logo
1  sur  24
Télécharger pour lire hors ligne
Introduction
Enter base
base:
statistical functions for large datasets
Jan Wijels  Edwin de Jonge:
jwijels@bnosac.be  edwindjonge@gmail.com
BNOSAC - Belgium Network of Open Source Analytical Consultants: www.bnosac.be
 Centraal Bureau voor Statistiek: www.cbs.nl
July 10 2013, useR! 2013
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Overview
Introduction
Who are we
Large data
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Who are we
Large data
About Jan)
Founder of www.bnosac.be and very recent father of Midas
(2013-06-19) and therefore not present at useR! 2013.
Providing consultancy services in open source analytical engineering
Poor man's BI:
Python/PostgreSQL/Pentaho/R/Hadoop/Sencha/ExtJS...
...
Expertise in predictive data mining, biostatistics, geostats, python
programming, GUI building, articial intelligence, process
automation, analytical web development
R implementations  R application maintenance
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Who are we
Large data
About Edwin
Working at Statistics Netherlands, that produces all dutch ocial
statistics. Co-author of several R packages:
editrules
tabplot
whisker
ffbase
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Who are we
Large data
Large data
Statistical data is becoming larger.
If nrow(data)  106 everything works ne in R
For larger data.frame's you run quickly into memory problems
# create an integer of length = 1 billion
integer(1e+09)
## Error: cannot allocate vector of size 3.7 Gb
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Who are we
Large data
If it ts in memory plain R is just ne!
If it doesn't t on your hard disk, you'll need some big data stu
(Hadoop, rmr, RHadoop)
For data sizes range of 106 - 109 several options provided by external
packages.
Popular one is ff1
1Daniel Adler, Christian Gläser, Oleg Nenadic, Jens Oehlschlägel, Walter Zucchini
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Who are we
Large data
Popularity based on RStudio download log les
0
100
200
300
jan feb mrt apr mei jun jul
#uniqueip
ff
bigmemory
filehash
ffbase
colbycol
mmap
pbdBASE
Rstudio logs 2013, downloads/week
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Who are we
Large data
ff package provides:
numerical, integer, boolean and factor variables of type ff stored on
disk (via memory mapping)
a sort of data.frame of type ffdf
ecient indexing, retrieval and sorting of ff vectors and ffdf.
ecient chunk-wise retrieval of data.
ff-based matrix storage.
vectors up-to length of 2 · 109.
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Who are we
Large data
What's the catch?
ff is nice, but:
Handling ff vectors often results in non-standard R code
It oers no statistical functions on ff and ffdf objects
No mean, max, min, sd, etc. on ff vectors
Requires that you process the data chunkwise.
Has no support for character vectors.
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Who are we
Large data
Typical chunkwise code for ff
library(ff)
x - ff(0, length = 1e+08)
# calculating the max value of x
m - -Inf
for (i in chunk(x)) {
m - max(x[i], m, na.rm = T)
}
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Who are we
Large data
Value vs reference
Note that out-of-memory objects have issues with value vs reference
semantics.
x - ff(0, length = 1e+07)
y - x
y[1] - 100
print(x[1])
## [1] 100
This is not what a normal R vector would do! Trade-of between copying
large object and side-eects.
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
ffbase2 attempts to
add basic statistical functions to 
make code as standard R as possible
make working with ff more pleasant.
connect ff with big* methods.
2de Jonge/Wijels/van der Laan
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
Basic operations
Most methods works via S3 dispatch (but not all...)
mean,min,max,range, sum, all, cumsum, cumprod, quantile,
tabulate.ff, table.ff,
cut, c, unique, duplicated, Math.Ops
x - 1:10
x_ff - ff(x)
mean(x)
## [1] 5.5
mean(x_ff)
## [1] 5.5
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
Idiosyncratic R
Use S3 dispatch (works only for generic methods...)
ffbase adds subset3, with, within and transform to ff
Makes code interchangeable with normal R code:
iris_ff - as.ffdf(iris)
iris_ff - transform( iris_ff
, Sepal.Ratio = Sepal.Width/Sepal.Length
)
str(iris_ff[,])
## 'data.frame': 150 obs. of 6 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2
## $ Species : Factor w/ 3 levels setosa,versicolor,
## $ Sepal.Ratio : num 0.686 0.612 0.681 0.674 0.72 ...
3This creates a copy of all selected data
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
Under the hood
ffbase rewrites the expression into chunked expression
transform( iris_ff
, Sepal.Ratio = Sepal.Width/Sepal.Length
)
into4
iris_ff2 - iris_ff
for (.i in chunk(iris_ff2)){
iris_ff2[.i,] - transform( iris_ff2[.i,]
, Sepal.Ratio=Sepal.Width/Sepal.Length
)
}
return(iris_ff2)
4greatly simplied
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
ltering: ffwhich
Often we need an index for a subselection, but even this may be to big
for memory.
ffwhich(dat, expression) returns a ff index vector
result can be used to index ffdf data.frame.
idx - ffwhich(iris_ff, Sepal.Width  2)
iris_ff[idx, ]
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
Importing data
Package ff: read.table.ffdf, read.csv.ffdf etcetera
Package ETLutils: read.dbi.ffdf, read.odbc.ffdf,
read.jdbc.ffdf SQL Databases (SQLite / PostgreSQL / Oracle /
MySQL / SQL Server (read.odbc.df) / Hive (read.jdbc.df) / ...)
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
ffbase
ffbase adds
laf_to_ffdf using LaF for importing large csv and fwf les
ffappend for appending vectors to an existing ff object
x - ffappend(x, 1:10)
ffdfappend for appending data.frame's to an existing ffdf
dat - ffdfappend(dat, iris)
# Note the pattern of assigning the result of the function ap
# ff object to itself
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
 storage
When data is in ff format processing is fast!
Time bottle neck can be loading the data into ff
Keeping data in ff format can save time
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
ff stores all ff vectors on disk, however lenames are not user-friendly.
basename(filename(iris_ff$Sepal.Length))
## [1] ffdf1ad05b442183.ff
Furthermore each ff vector stores the absolute path.
Makes moving data around more dicult
ff provides: ffsave and ffload, which archives and unarchives 
vectors and df data.frames into a zip le.
Note that this still can be time-consuming.
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
ffbase has:
save.ffdf and load.ffdf that store and load ffdf data.frames
into a directory with sensible R names.
save.ffdf(iris_ff)
basename(filename(iris_ff$Sepal.Length))
## [1] iris_ff$Sepal.Length.ff
pack.ffdf and unpack.ffdf that do the same but zip/unzip the
result.
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
Several methods for ff
ffbase allows statistical models directly on  data.5
Classication + Regression with bigglm.ffdf (biglm + base
packages)
Least angle regression with biglars.fit (biglars + base
packages)
Randomforest classication with bigrfc (bigrf + base packages)
Clustering with clustering features of package stream
5These work on large data but you can also sample from your df to build your
stat model and predict chunkwise.
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
Example of bigglm
Using biglm (with bigglm.df from base)
class(iris_ff)
## [1] ffdf
nrow(iris_ff) # that is 1e7!
## [1] 1000000
mymodel - bigglm(Sepal.Length ~ Petal.Length, data = iris_ff)
coef(mymodel)
## (Intercept) Petal.Length
## 4.3051 0.4093
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets
Introduction
Enter base
Goal
Basic statistical functions
Normal R code
Getting data into 
 storage
Using big statistical methods on  datasets
Thank you!
Questions?
Jan Wijels  Edwin de Jonge: jwijels@bnosac.be  edwindjonge@gmail.combase: statistical functions for large datasets

Contenu connexe

Tendances

Synchronize OpenLDAP with Active Directory with LSC project
Synchronize OpenLDAP with Active Directory with LSC projectSynchronize OpenLDAP with Active Directory with LSC project
Synchronize OpenLDAP with Active Directory with LSC project
Clément OUDOT
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
Sadayuki Furuhashi
 

Tendances (20)

How to find what is making your Oracle database slow
How to find what is making your Oracle database slowHow to find what is making your Oracle database slow
How to find what is making your Oracle database slow
 
Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...
 
Amazon Aurora: Deep Dive - SRV308 - Chicago AWS Summit
Amazon Aurora: Deep Dive - SRV308 - Chicago AWS SummitAmazon Aurora: Deep Dive - SRV308 - Chicago AWS Summit
Amazon Aurora: Deep Dive - SRV308 - Chicago AWS Summit
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
Learning Rust the Hard Way for a Production Kafka + ScyllaDB PipelineLearning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
 
The Power of SPL
The Power of SPLThe Power of SPL
The Power of SPL
 
Percona Live 2022 - The Evolution of a MySQL Database System
Percona Live 2022 - The Evolution of a MySQL Database SystemPercona Live 2022 - The Evolution of a MySQL Database System
Percona Live 2022 - The Evolution of a MySQL Database System
 
Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...
Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...
Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...
 
Synchronize OpenLDAP with Active Directory with LSC project
Synchronize OpenLDAP with Active Directory with LSC projectSynchronize OpenLDAP with Active Directory with LSC project
Synchronize OpenLDAP with Active Directory with LSC project
 
Storage and Alfresco
Storage and AlfrescoStorage and Alfresco
Storage and Alfresco
 
DumpsCafe Microsoft-AZ-104 Free Exam Dumps Demo.pdf
DumpsCafe Microsoft-AZ-104 Free Exam Dumps Demo.pdfDumpsCafe Microsoft-AZ-104 Free Exam Dumps Demo.pdf
DumpsCafe Microsoft-AZ-104 Free Exam Dumps Demo.pdf
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
 
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
 
Deep Dive: Scaling Up to Your First 10 Million Users
Deep Dive: Scaling Up to Your First 10 Million UsersDeep Dive: Scaling Up to Your First 10 Million Users
Deep Dive: Scaling Up to Your First 10 Million Users
 
How to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkHow to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They Work
 
Structured Streaming - The Internal -
Structured Streaming - The Internal -Structured Streaming - The Internal -
Structured Streaming - The Internal -
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
 
AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)
AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)
AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 

En vedette

Managing large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and conceptsManaging large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and concepts
Ajay Ohri
 
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Ryan Rosario
 
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Revolution Analytics
 
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Ryan Rosario
 

En vedette (15)

Managing large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and conceptsManaging large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and concepts
 
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
 
January 2016 Meetup: Speeding up (big) data manipulation with data.table package
January 2016 Meetup: Speeding up (big) data manipulation with data.table packageJanuary 2016 Meetup: Speeding up (big) data manipulation with data.table package
January 2016 Meetup: Speeding up (big) data manipulation with data.table package
 
Massive scale analytics with Stratosphere using R
Massive scale analytics with Stratosphere using RMassive scale analytics with Stratosphere using R
Massive scale analytics with Stratosphere using R
 
December 2015 Meetup - Shiny: Make Your R Code Interactive - Craig Wang
December 2015 Meetup - Shiny: Make Your R Code Interactive - Craig WangDecember 2015 Meetup - Shiny: Make Your R Code Interactive - Craig Wang
December 2015 Meetup - Shiny: Make Your R Code Interactive - Craig Wang
 
Zurich R user group presentation May 2016
Zurich R user group presentation May 2016Zurich R user group presentation May 2016
Zurich R user group presentation May 2016
 
Heatmaps best practices Strata Hadoop
Heatmaps best practices Strata HadoopHeatmaps best practices Strata Hadoop
Heatmaps best practices Strata Hadoop
 
Accessing R from Python using RPy2
Accessing R from Python using RPy2Accessing R from Python using RPy2
Accessing R from Python using RPy2
 
Parallel Computing with R
Parallel Computing with RParallel Computing with R
Parallel Computing with R
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
Chunked, dplyr for large text files
Chunked, dplyr for large text filesChunked, dplyr for large text files
Chunked, dplyr for large text files
 
Combining R With Java For Data Analysis (Devoxx UK 2015 Session)
Combining R With Java For Data Analysis (Devoxx UK 2015 Session)Combining R With Java For Data Analysis (Devoxx UK 2015 Session)
Combining R With Java For Data Analysis (Devoxx UK 2015 Session)
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with R
 
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
 
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...
 

Similaire à ffbase, statistical functions for large datasets

Eat whatever you can with PyBabe
Eat whatever you can with PyBabeEat whatever you can with PyBabe
Eat whatever you can with PyBabe
Dataiku
 
Question 1 1.  A data flow element in a DFD must be conntected.docx
Question 1 1.  A data flow element in a DFD must be conntected.docxQuestion 1 1.  A data flow element in a DFD must be conntected.docx
Question 1 1.  A data flow element in a DFD must be conntected.docx
IRESH3
 
Data scientist enablement dse 400 - week 1 roadmap
Data scientist enablement   dse 400 - week 1 roadmapData scientist enablement   dse 400 - week 1 roadmap
Data scientist enablement dse 400 - week 1 roadmap
Dr. Mohan K. Bavirisetty
 

Similaire à ffbase, statistical functions for large datasets (20)

Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptx
 
Abstract.DOCX
Abstract.DOCXAbstract.DOCX
Abstract.DOCX
 
Big data and you
Big data and you Big data and you
Big data and you
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R Notes
 
Bayesian reasoning
Bayesian reasoningBayesian reasoning
Bayesian reasoning
 
Tf itpptbo
Tf itpptboTf itpptbo
Tf itpptbo
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big Data and Fast Data combined – is it possible?
Big Data and Fast Data combined – is it possible?Big Data and Fast Data combined – is it possible?
Big Data and Fast Data combined – is it possible?
 
Big data PPT
Big data PPT Big data PPT
Big data PPT
 
Eat whatever you can with PyBabe
Eat whatever you can with PyBabeEat whatever you can with PyBabe
Eat whatever you can with PyBabe
 
Bigdataanalytics
BigdataanalyticsBigdataanalytics
Bigdataanalytics
 
Suneel manne
Suneel  manneSuneel  manne
Suneel manne
 
Suneel manne
Suneel  manneSuneel  manne
Suneel manne
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and Storing
 
Data science and Machine learning Booklet
Data science and Machine learning BookletData science and Machine learning Booklet
Data science and Machine learning Booklet
 
Question 1 1.  A data flow element in a DFD must be conntected.docx
Question 1 1.  A data flow element in a DFD must be conntected.docxQuestion 1 1.  A data flow element in a DFD must be conntected.docx
Question 1 1.  A data flow element in a DFD must be conntected.docx
 
Data scientist enablement dse 400 - week 1 roadmap
Data scientist enablement   dse 400 - week 1 roadmapData scientist enablement   dse 400 - week 1 roadmap
Data scientist enablement dse 400 - week 1 roadmap
 
Make your data great now
Make your data great nowMake your data great now
Make your data great now
 
Dc python meetup
Dc python meetupDc python meetup
Dc python meetup
 

Plus de Edwin de Jonge

StatMine (New Technologies and Techniques for Statistics)
StatMine (New Technologies and Techniques for Statistics)StatMine (New Technologies and Techniques for Statistics)
StatMine (New Technologies and Techniques for Statistics)
Edwin de Jonge
 

Plus de Edwin de Jonge (14)

sdcSpatial user!2019
sdcSpatial user!2019sdcSpatial user!2019
sdcSpatial user!2019
 
Validatetools, resolve and simplify contradictive or data validation rules
Validatetools, resolve and simplify contradictive or data validation rulesValidatetools, resolve and simplify contradictive or data validation rules
Validatetools, resolve and simplify contradictive or data validation rules
 
Data error! But where?
Data error! But where?Data error! But where?
Data error! But where?
 
Daff: diff, patch and merge for data.frame
Daff: diff, patch and merge for data.frameDaff: diff, patch and merge for data.frame
Daff: diff, patch and merge for data.frame
 
Uncertainty visualisation
Uncertainty visualisationUncertainty visualisation
Uncertainty visualisation
 
Docopt, beautiful command-line options for R, user2014
Docopt, beautiful command-line options for R,  user2014Docopt, beautiful command-line options for R,  user2014
Docopt, beautiful command-line options for R, user2014
 
Big data experiments
Big data experimentsBig data experiments
Big data experiments
 
StatMine
StatMineStatMine
StatMine
 
Big Data Visualization
Big Data VisualizationBig Data Visualization
Big Data Visualization
 
Tabplotd3, interactive inspection of large data
Tabplotd3, interactive inspection of large dataTabplotd3, interactive inspection of large data
Tabplotd3, interactive inspection of large data
 
Big data as a source for official statistics
Big data as a source for official statisticsBig data as a source for official statistics
Big data as a source for official statistics
 
Statmine, Visuele dataexploratie
Statmine, Visuele dataexploratieStatmine, Visuele dataexploratie
Statmine, Visuele dataexploratie
 
StatMine (New Technologies and Techniques for Statistics)
StatMine (New Technologies and Techniques for Statistics)StatMine (New Technologies and Techniques for Statistics)
StatMine (New Technologies and Techniques for Statistics)
 
StatMine, visual exploration of output data
StatMine, visual exploration of output dataStatMine, visual exploration of output data
StatMine, visual exploration of output data
 

Dernier

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Dernier (20)

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 

ffbase, statistical functions for large datasets

  • 1. Introduction Enter base base: statistical functions for large datasets Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.com BNOSAC - Belgium Network of Open Source Analytical Consultants: www.bnosac.be Centraal Bureau voor Statistiek: www.cbs.nl July 10 2013, useR! 2013 Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 2. Introduction Enter base Overview Introduction Who are we Large data Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 3. Introduction Enter base Who are we Large data About Jan) Founder of www.bnosac.be and very recent father of Midas (2013-06-19) and therefore not present at useR! 2013. Providing consultancy services in open source analytical engineering Poor man's BI: Python/PostgreSQL/Pentaho/R/Hadoop/Sencha/ExtJS... ... Expertise in predictive data mining, biostatistics, geostats, python programming, GUI building, articial intelligence, process automation, analytical web development R implementations R application maintenance Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 4. Introduction Enter base Who are we Large data About Edwin Working at Statistics Netherlands, that produces all dutch ocial statistics. Co-author of several R packages: editrules tabplot whisker ffbase Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 5. Introduction Enter base Who are we Large data Large data Statistical data is becoming larger. If nrow(data) 106 everything works ne in R For larger data.frame's you run quickly into memory problems # create an integer of length = 1 billion integer(1e+09) ## Error: cannot allocate vector of size 3.7 Gb Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 6. Introduction Enter base Who are we Large data If it ts in memory plain R is just ne! If it doesn't t on your hard disk, you'll need some big data stu (Hadoop, rmr, RHadoop) For data sizes range of 106 - 109 several options provided by external packages. Popular one is ff1 1Daniel Adler, Christian Gläser, Oleg Nenadic, Jens Oehlschlägel, Walter Zucchini Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 7. Introduction Enter base Who are we Large data Popularity based on RStudio download log les 0 100 200 300 jan feb mrt apr mei jun jul #uniqueip ff bigmemory filehash ffbase colbycol mmap pbdBASE Rstudio logs 2013, downloads/week Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 8. Introduction Enter base Who are we Large data ff package provides: numerical, integer, boolean and factor variables of type ff stored on disk (via memory mapping) a sort of data.frame of type ffdf ecient indexing, retrieval and sorting of ff vectors and ffdf. ecient chunk-wise retrieval of data. ff-based matrix storage. vectors up-to length of 2 · 109. Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 9. Introduction Enter base Who are we Large data What's the catch? ff is nice, but: Handling ff vectors often results in non-standard R code It oers no statistical functions on ff and ffdf objects No mean, max, min, sd, etc. on ff vectors Requires that you process the data chunkwise. Has no support for character vectors. Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 10. Introduction Enter base Who are we Large data Typical chunkwise code for ff library(ff) x - ff(0, length = 1e+08) # calculating the max value of x m - -Inf for (i in chunk(x)) { m - max(x[i], m, na.rm = T) } Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 11. Introduction Enter base Who are we Large data Value vs reference Note that out-of-memory objects have issues with value vs reference semantics. x - ff(0, length = 1e+07) y - x y[1] - 100 print(x[1]) ## [1] 100 This is not what a normal R vector would do! Trade-of between copying large object and side-eects. Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 12. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets ffbase2 attempts to add basic statistical functions to make code as standard R as possible make working with ff more pleasant. connect ff with big* methods. 2de Jonge/Wijels/van der Laan Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 13. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets Basic operations Most methods works via S3 dispatch (but not all...) mean,min,max,range, sum, all, cumsum, cumprod, quantile, tabulate.ff, table.ff, cut, c, unique, duplicated, Math.Ops x - 1:10 x_ff - ff(x) mean(x) ## [1] 5.5 mean(x_ff) ## [1] 5.5 Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 14. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets Idiosyncratic R Use S3 dispatch (works only for generic methods...) ffbase adds subset3, with, within and transform to ff Makes code interchangeable with normal R code: iris_ff - as.ffdf(iris) iris_ff - transform( iris_ff , Sepal.Ratio = Sepal.Width/Sepal.Length ) str(iris_ff[,]) ## 'data.frame': 150 obs. of 6 variables: ## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3. ## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 ## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 ## $ Species : Factor w/ 3 levels setosa,versicolor, ## $ Sepal.Ratio : num 0.686 0.612 0.681 0.674 0.72 ... 3This creates a copy of all selected data Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 15. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets Under the hood ffbase rewrites the expression into chunked expression transform( iris_ff , Sepal.Ratio = Sepal.Width/Sepal.Length ) into4 iris_ff2 - iris_ff for (.i in chunk(iris_ff2)){ iris_ff2[.i,] - transform( iris_ff2[.i,] , Sepal.Ratio=Sepal.Width/Sepal.Length ) } return(iris_ff2) 4greatly simplied Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 16. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets ltering: ffwhich Often we need an index for a subselection, but even this may be to big for memory. ffwhich(dat, expression) returns a ff index vector result can be used to index ffdf data.frame. idx - ffwhich(iris_ff, Sepal.Width 2) iris_ff[idx, ] Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 17. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets Importing data Package ff: read.table.ffdf, read.csv.ffdf etcetera Package ETLutils: read.dbi.ffdf, read.odbc.ffdf, read.jdbc.ffdf SQL Databases (SQLite / PostgreSQL / Oracle / MySQL / SQL Server (read.odbc.df) / Hive (read.jdbc.df) / ...) Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 18. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets ffbase ffbase adds laf_to_ffdf using LaF for importing large csv and fwf les ffappend for appending vectors to an existing ff object x - ffappend(x, 1:10) ffdfappend for appending data.frame's to an existing ffdf dat - ffdfappend(dat, iris) # Note the pattern of assigning the result of the function ap # ff object to itself Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 19. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets storage When data is in ff format processing is fast! Time bottle neck can be loading the data into ff Keeping data in ff format can save time Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 20. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets ff stores all ff vectors on disk, however lenames are not user-friendly. basename(filename(iris_ff$Sepal.Length)) ## [1] ffdf1ad05b442183.ff Furthermore each ff vector stores the absolute path. Makes moving data around more dicult ff provides: ffsave and ffload, which archives and unarchives vectors and df data.frames into a zip le. Note that this still can be time-consuming. Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 21. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets ffbase has: save.ffdf and load.ffdf that store and load ffdf data.frames into a directory with sensible R names. save.ffdf(iris_ff) basename(filename(iris_ff$Sepal.Length)) ## [1] iris_ff$Sepal.Length.ff pack.ffdf and unpack.ffdf that do the same but zip/unzip the result. Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 22. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets Several methods for ff ffbase allows statistical models directly on data.5 Classication + Regression with bigglm.ffdf (biglm + base packages) Least angle regression with biglars.fit (biglars + base packages) Randomforest classication with bigrfc (bigrf + base packages) Clustering with clustering features of package stream 5These work on large data but you can also sample from your df to build your stat model and predict chunkwise. Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 23. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets Example of bigglm Using biglm (with bigglm.df from base) class(iris_ff) ## [1] ffdf nrow(iris_ff) # that is 1e7! ## [1] 1000000 mymodel - bigglm(Sepal.Length ~ Petal.Length, data = iris_ff) coef(mymodel) ## (Intercept) Petal.Length ## 4.3051 0.4093 Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets
  • 24. Introduction Enter base Goal Basic statistical functions Normal R code Getting data into storage Using big statistical methods on datasets Thank you! Questions? Jan Wijels Edwin de Jonge: jwijels@bnosac.be edwindjonge@gmail.combase: statistical functions for large datasets