Get up to Speed
QUICK GUIDE TO DATA.TABLE IN R AND PENTAHO PDI
02 09 2015
SERBAN TANASA
What You Could Gain Tonight
• A 2-20x speed increase in your data loading and manipulation using data.table
• (If time allows) A free path of entry into Business Intelligence ETL (commercial-scale computing technologies for Extract/Transform/Load) using Pentaho Data Integration
• Free food?
Planned Outline
data.table
• Why use it? Benchmarks.
• How to use it? Primer on basic functions.
• Overcome R scaling limitations: Multithread, Cloud, Databases.
Pentaho Data Integration (PDI)
• (Optional time-constrained section) Very basic run-through of PDI ETL
Unstructured Time for Q&A and (potentially) hilarious live-coding
R Online Support and Business Use
[Figure: Posts per Software, in thousands, for R, SAS, SPSS, and Stata on SO, TalkStats, and Cross Validated]
[Figure: LinkedIn Groups Members, in thousands, for R, SAS, SPSS, and Stata]
Source: Stack Overflow, Talk Stats, and Cross Validated
Benchmarks
READ DATA
ORDER DATA
TRANSFORM DATA
Benchmarks: Hardware Setup
• Test Machine: AWS EC2 r3.8xlarge
• # R version 3.2.2 (2015-08-14) -- "Fire Safety"
• # Platform: x86_64-pc-linux-gnu (64-bit)
• An Amazon Web Services Elastic Compute Cloud (EC2) instance with these settings costs $2.8/hr on demand, ~$1/hr reserved, or as low as ~$0.3/hr on spot instances.
Benchmarks: Reading Data
[Figure: Seconds to Read File at 50Mb, 500Mb, and 5Gb for read.csv, read.csv(2), read.table, ff, sqldf, and fread]
[Figure: Read Performance Relative to fread(), in percent, at 50Mb, 500Mb, and 5Gb for read.csv, read.csv(2), read.table, ff, sqldf, and fread]
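A read comparison of this kind can be tried at small scale with the sketch below. It is only an illustration under assumptions: the file is a generated temporary CSV of 1e5 rows (far smaller than the 50Mb to 5Gb files on the slides), the column names are made up, and only read.csv and fread are compared.

```r
library(data.table)

set.seed(1)
n <- 1e5  # hypothetical size, much smaller than the files benchmarked above

# Build a small test file (column names are illustrative only)
src <- data.frame(id  = seq_len(n),
                  grp = sample(letters, n, TRUE),
                  val = round(runif(n), 4))
path <- tempfile(fileext = ".csv")
write.csv(src, path, row.names = FALSE)

# Time the base reader and fread on the same file
t_base  <- system.time(a <- read.csv(path, stringsAsFactors = FALSE))[["elapsed"]]
t_fread <- system.time(b <- fread(path))[["elapsed"]]
unlink(path)

cat("read.csv:", t_base, "s; fread:", t_fread, "s\n")
```

Timings depend on the machine and file, so no particular ratio should be expected from this toy file.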
Benchmarks: Order Data
[Figure: Sort Table Operations by Table Size (log scale), 1.00E+03 to 1.00E+09 rows, for Base, dplyr, and data.table]
[Figure: Sort 1 Billion Rows (milliseconds) for Base, dplyr, and data.table]
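A sort comparison can be sketched at small scale as follows. This is an assumption-laden illustration rather than the benchmark script from the talk: the table has 1e5 rows with made-up columns x and y, and only base R's order() and data.table's setorder() are compared.

```r
library(data.table)

set.seed(1)
n <- 1e5  # hypothetical size; the slides go up to 1e9 rows

DF <- data.frame(x = sample(n), y = runif(n))
DT <- as.data.table(DF)   # separate copy, so DF stays unsorted

# Base R: compute a permutation, then build a new sorted data.frame
t_base <- system.time(DF_sorted <- DF[order(DF$x), ])[["elapsed"]]

# data.table: reorder the rows of DT in place (by reference)
t_dt <- system.time(setorder(DT, x))[["elapsed"]]

cat("base order():", t_base, "s; setorder():", t_dt, "s\n")
```

setorder() changes DT itself, which is why the sketch sorts a copy made with as.data.table().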
Benchmarks: Transform Data (Setup)
• The input data is randomly ordered. No pre-sort. No indexes. No key.
• 5 simple queries are run: large groups and small groups on different columns of different types. Similar to what a data analyst might do in practice; i.e., various ad hoc aggregations as the data is explored and investigated.
• Each package is tested separately in its own fresh session.
• Each query is repeated once more, immediately. This is to isolate cache effects and confirm the first timing.
• The results are compared and checked allowing for numeric tolerance and column name differences.
• It is a tough test that happens to be realistic and very common.
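Benchmarks: Transform Data (Setup)
# N rows; K distinct values in the large-group columns
N=1e9; K=100
set.seed(1)
DF <- data.frame(stringsAsFactors=FALSE,
  id1 = sample(sprintf("id%03d",1:K), N, TRUE),        # large groups (char)
  id2 = sample(sprintf("id%03d",1:K), N, TRUE),        # large groups (char)
  id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE),   # small groups (char)
  id4 = sample(K, N, TRUE),                            # large groups (int)
  id5 = sample(K, N, TRUE),                            # large groups (int)
  id6 = sample(N/K, N, TRUE),                          # small groups (int)
  v1 = sample(5, N, TRUE),                             # int in range [1,5]
  v2 = sample(5, N, TRUE),                             # int in range [1,5]
  v3 = sample(round(runif(100,max=100),4), N, TRUE) )  # numeric, e.g. 23.5749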
id1 id2 id3 id4 id5 id6 v1 v2 v3
id027 id007 id0000000022 42 60 58 4 4 50.7016
id038 id068 id0000000012 15 56 71 4 4 5.5459
id058 id074 id0000000015 46 60 34 5 1 11.5124
id091 id012 id0000000031 81 40 12 1 1 18.8075
id021 id005 id0000000016 33 27 88 2 3 34.0231
id090 id014 id0000000053 87 74 6 2 3 27.2783
id095 id089 id0000000012 25 3 35 2 5 11.5124
id067 id084 id0000000048 83 85 47 5 1 63.7503
id063 id087 id0000000031 22 86 78 2 4 23.251
id007 id004 id0000000031 58 14 82 2 5 7.1864
id021 id011 id0000000030 37 39 69 5 1 49.0202
id018 id055 id0000000066 95 86 1 1 2 4.0548
id069 id011 id0000000039 11 8 71 5 2 45.0637
id039 id073 id0000000075 54 23 50 5 4 89.157
id077 id073 id0000000069 9 77 73 4 2 22.9517
id050 id079 id0000000027 29 34 17 3 4 23.251
id072 id062 id0000000041 67 98 53 4 1 73.6784
id100 id051 id0000000051 13 15 55 1 3 54.3411
id039 id046 id0000000090 100 77 79 1 2 7.1864
id078 id004 id0000000009 68 97 10 2 2 40.3839
Benchmarks: Test Commands
Each test is run twice: x.1 is the first run and x.2 is the immediate repeat of the same command.
Test 1.1, 1.2
  data.table: DT[, sum(v1), keyby=id1]
  dplyr:      DF %>% group_by(id1) %>% summarise(sum(v1))
Test 2.1, 2.2
  data.table: DT[, sum(v1), keyby="id1,id2"]
  dplyr:      DF %>% group_by(id1,id2) %>% summarise(sum(v1))
Test 3.1, 3.2
  data.table: DT[, list(sum(v1),mean(v3)), keyby=id3]
  dplyr:      DF %>% group_by(id3) %>% summarise(sum(v1),mean(v3))
Test 4.1, 4.2
  data.table: DT[, lapply(.SD, mean), keyby=id4, .SDcols=7:9]
  dplyr:      DF %>% group_by(id4) %>% summarise_each(funs(mean), 7:9)
Test 5.1, 5.2
  data.table: DT[, lapply(.SD, sum), keyby=id6, .SDcols=7:9]
  dplyr:      DF %>% group_by(id6) %>% summarise_each(funs(sum), 7:9)
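To try the test pattern without a 244 GB machine, a scaled-down sketch might look like the following. Assumptions: N is reduced to 1e5, only a subset of the columns is generated, only the data.table side is run, and .SDcols uses column names instead of the positions 7:9 because the reduced table has fewer columns.

```r
library(data.table)

N <- 1e5; K <- 100   # hypothetical small size; the slides use N = 1e9
set.seed(1)
DT <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),
  id4 = sample(K, N, TRUE),
  v1  = sample(5, N, TRUE),
  v2  = sample(5, N, TRUE),
  v3  = sample(round(runif(100, max = 100), 4), N, TRUE)
)

# Test 1: run once, then repeat immediately to confirm the first timing
t_first  <- system.time(r1 <- DT[, sum(v1), keyby = id1])[["elapsed"]]
t_second <- system.time(r2 <- DT[, sum(v1), keyby = id1])[["elapsed"]]

# Test 4: mean of the three value columns per id4 group
r4 <- DT[, lapply(.SD, mean), keyby = id4, .SDcols = c("v1", "v2", "v3")]

cat("test 1 first run:", t_first, "s; repeat:", t_second, "s\n")
```

keyby both groups and sorts the result by the grouping column, so r1 comes back keyed on id1.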
Benchmarks: Results
[Figure: Group by and Summarize (Average of 5 Operations), dplyr vs data.table, in millions]
[Figure: Group by and Summarize (Average of 5 Operations), dplyr vs data.table, log scale, in microseconds]
dplyr run time as a percentage of data.table run time, by query and run
Rows       1.00E+03  1.00E+04  1.00E+05  1.00E+06  1.00E+07  1.00E+08  1.00E+09
Size (GB)  <0.01     <0.01     0.03      0.075     0.516     4.939     49.15
1 (1st)    127%      133%      160%      238%      217%      186%      185%
1 (2nd)    125%      146%      215%      265%      217%      188%      188%
2 (1st)    150%      331%      508%      578%      399%      309%      294%
2 (2nd)    148%      328%      497%      581%      405%      304%      281%
3 (1st)    94%       116%      264%      316%      254%      276%      298%
3 (2nd)    95%       120%      264%      307%      256%      264%      299%
4 (1st)    226%      214%      193%      176%      188%      227%      227%
4 (2nd)    171%      172%      175%      188%      187%      224%      232%
5 (1st)    165%      166%      204%      239%      314%      586%      497%
5 (2nd)    161%      164%      203%      240%      314%      623%      498%
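The percentages in this table appear to be the dplyr timing divided by the data.table timing for the same query, using the raw figures in the appendix. A quick check in R, with two pairs of values copied from the appendix (1.00E+09 rows, first run of queries 1 and 5; the vector names are made up for this sketch):

```r
# Raw timings copied from the appendix table, 1.00E+09 rows, first run
dplyr_time <- c(q1 = 51998307, q5 = 600233695)
dt_time    <- c(q1 = 28076861, q5 = 120664934)

# dplyr run time as a percentage of data.table run time
ratio_pct <- round(100 * dplyr_time / dt_time)
print(ratio_pct)
```

The results match the 185% and 497% entries in the last column of rows 1 (1st) and 5 (1st).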
data.table
Primer
READ
CREATE
MANIPULATE
SPECIAL COMMANDS
Read
fread()
• Similar to read.table but faster and more convenient. All controls such as sep, colClasses and nrows are automatically detected. bit64::integer64 types are also detected and read directly without needing to read as character before converting.
• sep -- The separator between columns. Defaults to the first character in the set [,\t |;:] that exists on line autostart outside quoted regions, and separates the rows above autostart into a consistent number of fields, too.
• Other useful arguments: skip, drop, select, showProgress.
• Input can be a file name, a URL pointing to a file, or (advanced use) a shell command: fread("grep @WhiteHouse.gov filename")
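A minimal sketch of these fread() options, assuming a small semicolon-separated temporary file with made-up columns (id, score, note):

```r
library(data.table)

# Write a tiny semicolon-separated file (contents are illustrative only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;score;note",
             "a;1.5;x",
             "b;2.5;y",
             "c;3.5;z"), tmp)

# No sep given: fread detects the separator and the column types
dt_all <- fread(tmp)

# select keeps only the named columns; nrows limits the rows read
dt_sub <- fread(tmp, select = c("id", "score"), nrows = 2)

unlink(tmp)
print(dt_all)
print(dt_sub)
```

drop works like select in reverse: it names the columns to leave out.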
Create
• data.table() – much like data.frame
• setDT() – makes an existing data.frame a data.table without copying (this is important for large data)
• setkey() and setkeyv() – supercharged rownames, indices
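The three ways of creating and keying a table listed above can be sketched as follows; the table contents and names are made up for illustration.

```r
library(data.table)

# data.table(): built much like a data.frame
DT <- data.table(id = c("b", "a", "c"), v = 1:3)

# setDT(): converts an existing data.frame in place, without copying
DF <- data.frame(id = c("b", "a", "c"), v = 1:3, stringsAsFactors = FALSE)
setDT(DF)

# setkey(): sorts DT by id and marks id as the key
setkey(DT, id)

# With a key set, a value in i looks up rows by key
row_a <- DT["a"]

# setkeyv(): same as setkey(), but takes column names as a character vector
setkeyv(DF, "id")

print(DT)
print(row_a)
```

After setDT() the object DF is a data.table under the same name; no second copy of the data is made.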
Manipulate
• := : Assignment operator (without copy)
• .N : Counts
• data.table::melt(), data.table::dcast()
• merge() (data.table method) and DT_1[DT_2] joins
• DT[ i, j, by ]
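A short sketch of the operations listed above, using two made-up tables (sales and managers); all names and values are illustrative assumptions.

```r
library(data.table)

sales <- data.table(region = c("N", "S", "N", "S"),
                    q1 = c(10, 20, 30, 40),
                    q2 = c(1, 2, 3, 4))

# := adds a column by reference (no copy of the table)
sales[, total := q1 + q2]

# .N counts the rows in each group
counts <- sales[, .N, by = region]

# melt(): wide to long; dcast(): long back to wide, aggregating with sum
long <- melt(sales, id.vars = "region", measure.vars = c("q1", "q2"),
             variable.name = "quarter", value.name = "amount")
wide <- dcast(long, region ~ quarter, value.var = "amount",
              fun.aggregate = sum)

# Two ways to join: merge() and the DT_1[DT_2] form
managers <- data.table(region = c("N", "S"), manager = c("Ann", "Bob"))
joined_merge <- merge(sales, managers, by = "region")
joined_dt    <- sales[managers, on = "region"]

print(counts)
print(wide)
```

The on= argument names the join columns directly, so no key has to be set beforehand.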
DT[i, j, by] format
Source: https://campus.datacamp.com/courses/data-table-data-manipulation-r-tutorial
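The DT[i, j, by] form is commonly read as: take DT, subset rows using i, then calculate j, grouped by by. A minimal sketch with a made-up table:

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b", "b", "b"),
                 x = 1:5,
                 y = c(10, 20, 30, 40, 50))

# i: keep rows with x > 1
# j: compute the sum of y and the row count
# by: do this separately for each value of g
res <- DT[x > 1, .(sum_y = sum(y), n = .N), by = g]
print(res)
```

Here group "a" keeps one row (y = 20) and group "b" keeps three (30 + 40 + 50 = 120).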
Special Commands
• .( )
• .EACHI
• .SD and .SDcols
• c('x2', 'y2') := list(..., ...)
• `:=`(x2=..., y2=...) equivalent group assignment
• DT[, plot(x)] will actually produce a plot
• copy() – for when you do not want to update by reference
• DT[, "colname", with=FALSE]
• DT[…][…] -- Chaining
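Most of these special commands can be seen together in one short sketch. The table and the column names (g, x, y, x2, y2, x3, y3) are made up for illustration.

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b"), x = c(1, 2, 3), y = c(10, 20, 50))

# .SD is the subset of data for each group; .SDcols picks its columns
means <- DT[, lapply(.SD, mean), by = g, .SDcols = c("x", "y")]

# copy() gives an independent table, so := below leaves DT untouched
DT2 <- copy(DT)
DT2[, c("x2", "y2") := list(x * 2, y * 2)]   # multiple assignment, form 1
DT2[, `:=`(x3 = x * 3, y3 = y * 3)]          # equivalent functional form

# with = FALSE treats j as column names rather than an expression
only_x <- DT[, "x", with = FALSE]

# Chaining: the second [...] operates on the result of the first
chained <- DT[, .(total = sum(y)), by = g][order(-total)]

# by = .EACHI evaluates j once for each row of the table given in i
lookup <- data.table(g = c("a", "b"))
per_i <- DT[lookup, .(n = .N), by = .EACHI, on = "g"]

print(means)
print(chained)
print(per_i)
```

.( ) is simply shorthand for list( ) inside the square brackets.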
Overcome R scaling limitations
MULTITHREAD R
CLOUD
DATABASES
SPECIALIZE R USE
Multithread R: Revolution R Open (RRO) 3.2.1
https://mran.revolutionanalytics.com/download/
Enhancements include multi-core processing…
http://serbantanasa.com/2015/06/12/r-vs-revolution-r-open-3-2-0/
Cloud, Database, BI Tools
• AWS
  - On-off deployment of memory-optimized instances for one-off heavy processing
  - AWS with RStudio Server + Shiny Server (Linux only)
• R is increasingly integrated in BI tools and even databases
  - Pentaho EE has R integration (as do MicroStrategy, Microsoft SSRS, IBM, and even data discovery tools like Tableau, QlikView & Alteryx)
  - IBM dashDB has a built-in RStudio, MS SQL Server 2016 will have in-database R, Postgres has PL/R, etc.
Specialize Your Use of R
• R can do anything you can program (it is a Turing-complete programming language)
• R should NOT do everything.
• Push ETL to specialized software (like PDI)
• Push computation to DB (DBI and rstats-db packages) & Hadoop (RHadoop – basically large scale lapply)
https://github.com/rstats-db
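Pushing an aggregation to a database through DBI can be sketched as below. This is only an illustration under assumptions: it uses an in-memory SQLite database through the RSQLite package (not named on the slide), and the table name and columns are made up.

```r
library(DBI)

# Assumption: RSQLite is installed; an in-memory database stands in for a real server
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Load a small illustrative table into the database
dbWriteTable(con, "sales",
             data.frame(region = c("N", "S", "N"),
                        amount = c(10, 20, 30)))

# The GROUP BY runs inside the database; only the summary comes back to R
totals <- dbGetQuery(con,
  "SELECT region, SUM(amount) AS total
     FROM sales
    GROUP BY region
    ORDER BY region")

dbDisconnect(con)
print(totals)
```

With a real database server the same DBI calls apply; only the driver passed to dbConnect() and its connection arguments change.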
Pentaho PDI OVERVIEW OF CAPABILITIES
What PDI can do for you
• Data integration without writing a single line of code
• Heavily parallel streams (compare with single-core base R); can even push work to a whole slave computing cluster.
• Java, JavaScript, SQL, R Scripting (EE?)
• Slowly changing dimensions made easy
Data I/O Capabilities
Visual Data Munging
Complexity can escalate quickly…
Additional Resources: PDI
• Community Version: http://community.pentaho.com/projects/data-integration/
• Enterprise Edition: http://www.pentaho.com/product/data-integration
Additional Resources: data.table
• data.table wiki: https://github.com/Rdatatable/data.table/wiki
• data.table tutorial: https://campus.datacamp.com/courses/data-table-data-manipulation-r-tutorial/
Thank you for your time!
stanasa@sunstonescience.com
Appendix
Stack Overflow
Benchmark Test (timings in microseconds, as in the results chart)
Data Size (GB) <0.01 <0.01 0.03 0.075 0.516 4.939 49.15
Rows 1.00E+03 1.00E+04 1.00E+05 1.00E+06 1.00E+07 1.00E+08 1.00E+09
DF %>% group_by(id1) %>% summarise(sum(v1)) 6,729 7,074 10,737 49,100 468,973 5,076,499 51,998,307
DT[, sum(v1), keyby=id1] 5,300 5,305 6,708 20,656 216,540 2,730,619 28,076,861
DF %>% group_by(id1) %>% summarise(sum(v1)) 1,188 1,642 5,344 43,524 455,865 5,128,423 51,528,819
DT[, sum(v1), keyby=id1] 953 1,123 2,486 16,416 210,032 2,721,640 27,406,794
DF %>% group_by(id1,id2) %>% summarise(sum(v1)) 1,894 8,480 28,033 152,965 1,444,988 14,927,984 152,535,515
DT[, sum(v1), keyby="id1,id2"] 1,263 2,559 5,516 26,446 361,812 4,827,830 51,897,539
DF %>% group_by(id1,id2) %>% summarise(sum(v1)) 1,865 8,396 27,343 153,185 1,440,788 14,652,509 152,118,605
DT[, sum(v1), keyby="id1,id2"] 1,257 2,561 5,505 26,386 355,492 4,827,702 54,047,731
DF %>% group_by(id3) %>% summarise(sum(v1),mean(v3)) 1,130 1,697 10,323 129,391 1,805,955 45,748,652 693,700,832
DT[, list(sum(v1),mean(v3)), keyby=id3] 1,197 1,461 3,910 40,991 710,813 16,582,213 233,001,218
DF %>% group_by(id3) %>% summarise(sum(v1),mean(v3)) 1,100 1,701 10,299 125,585 1,824,419 44,038,141 627,247,199
DT[, list(sum(v1),mean(v3)), keyby=id3] 1,160 1,415 3,894 40,895 713,289 16,660,202 209,734,867
DF %>% group_by(id4) %>% summarise_each(funs(mean), 7:9) 2,463 2,900 6,965 51,318 591,011 8,421,787 86,594,397
DT[, lapply(.SD, mean), keyby=id4, .SDcols=7:9] 1,091 1,354 3,603 29,196 314,705 3,711,641 38,151,119
DF %>% group_by(id4) %>% summarise_each(funs(mean), 7:9) 1,811 2,276 6,328 49,939 579,184 8,315,717 86,585,169
DT[, lapply(.SD, mean), keyby=id4, .SDcols=7:9] 1,056 1,322 3,614 26,572 310,265 3,706,489 37,275,816
DF %>% group_by(id6) %>% summarise_each(funs(sum), 7:9) 1,729 2,234 8,268 94,044 1,661,863 41,141,396 600,233,695
DT[, lapply(.SD, sum), keyby=id6, .SDcols=7:9] 1,049 1,343 4,053 39,299 529,329 7,016,574 120,664,934
DF %>% group_by(id6) %>% summarise_each(funs(sum), 7:9) 1,696 2,207 8,189 93,323 1,660,917 41,126,264 625,999,220
DT[, lapply(.SD, sum), keyby=id6, .SDcols=7:9] 1,055 1,343 4,033 38,910 529,763 6,602,454 125,725,448

Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)

  • 1. Get up to Speed: Quick Guide to data.table in R and Pentaho PDI. 02 09 2015. Serban Tanasa
  • 2. What You Could Gain Tonight
      • A 2-20x speed increase in your data loading and manipulation using data.table
      • (If time allows) A free path of entry into Business Intelligence ETL (commercial-scale computing technologies for Extract/Transform/Load) using Pentaho Data Integration
      • Free food?
  • 3. Planned Outline
      • data.table
          • Why use it? Benchmarks.
          • How to use it? Primer on basic functions.
          • Overcome R scaling limitations: Multithread, Cloud, Databases.
      • Pentaho Data Integration (PDI)
          • (Optional, time-constrained section) Very basic run-through of PDI ETL
      • Unstructured time for Q&A and (potentially) hilarious live-coding
  • 4. R Online Support and Business Use. [Two bar charts for R, SAS, SPSS, and Stata, both in thousands: posts per software on Stack Overflow (SO), Talk Stats, and Cross Validated; and LinkedIn group members.] Source: Stack Overflow, Talk Stats, and Cross Validated
  • 6. Benchmarks: Hardware Setup
      • Test machine: AWS EC2 r3.8xlarge
      • # R version 3.2.2 (2015-08-14) -- “Fire Safety”
      • # Platform: x86_64-pc-linux-gnu (64-bit)
      • An Amazon Web Services Elastic Compute Cloud (EC2) instance with these settings costs $2.8/hr on demand, ~$1/hr reserved, or as low as ~$0.3/hr on spot instances.
  • 7. Benchmarks: Reading Data. [Bar chart: seconds to read a 50Mb, 500Mb, and 5Gb file with read.csv, read.csv(2), read.table, ff, sqldf, and fread; y-axis from 0 to 1400 seconds.]
  • 8. Benchmarks: Reading Data. [Bar chart: read performance relative to fread() for the same file sizes and readers; y-axis from 0% to 2500%.]
  • 9. Benchmarks: Order Data. [Chart, log scale: sort table operations by table size, from 1.00E+03 to 1.00E+09, for Base, dplyr, and data.table.]
  • 11. Benchmarks: Transform Data (Setup)
      • The input data is randomly ordered. No pre-sort. No indexes. No key.
      • 5 simple queries are run: large groups and small groups on different columns of different types. Similar to what a data analyst might do in practice; i.e., various ad hoc aggregations as the data is explored and investigated.
      • Each package is tested separately in its own fresh session.
      • Each query is repeated once more, immediately. This is to isolate cache effects and confirm the first timing.
      • The results are compared and checked allowing for numeric tolerance and column name differences.
      • It is a tough test that happens to be realistic and very common.
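  • 12. Benchmarks: Transform Data (Setup)
      N=1e9; K=100
      set.seed(1)
      DF <- data.frame(stringsAsFactors=FALSE,
        id1 = sample(sprintf("id%03d",1:K), N, TRUE),        # K distinct strings: large groups
        id2 = sample(sprintf("id%03d",1:K), N, TRUE),        # K distinct strings: large groups
        id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE),   # N/K distinct strings: small groups
        id4 = sample(K, N, TRUE),                            # K distinct integers: large groups
        id5 = sample(K, N, TRUE),                            # K distinct integers: large groups
        id6 = sample(N/K, N, TRUE),                          # N/K distinct integers: small groups
        v1 = sample(5, N, TRUE),                             # integers in 1..5
        v2 = sample(5, N, TRUE),                             # integers in 1..5
        v3 = sample(round(runif(100,max=100),4), N, TRUE)    # numeric, drawn from 100 random values
      )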
  • 13. Sample rows (columns as defined on slide 12)
      id1    id2    id3           id4  id5  id6  v1  v2  v3
      id027  id007  id0000000022   42   60   58   4   4  50.7016
      id038  id068  id0000000012   15   56   71   4   4  5.5459
      id058  id074  id0000000015   46   60   34   5   1  11.5124
      id091  id012  id0000000031   81   40   12   1   1  18.8075
      id021  id005  id0000000016   33   27   88   2   3  34.0231
      id090  id014  id0000000053   87   74    6   2   3  27.2783
      id095  id089  id0000000012   25    3   35   2   5  11.5124
      id067  id084  id0000000048   83   85   47   5   1  63.7503
      id063  id087  id0000000031   22   86   78   2   4  23.251
      id007  id004  id0000000031   58   14   82   2   5  7.1864
      id021  id011  id0000000030   37   39   69   5   1  49.0202
      id018  id055  id0000000066   95   86    1   1   2  4.0548
      id069  id011  id0000000039   11    8   71   5   2  45.0637
      id039  id073  id0000000075   54   23   50   5   4  89.157
      id077  id073  id0000000069    9   77   73   4   2  22.9517
      id050  id079  id0000000027   29   34   17   3   4  23.251
      id072  id062  id0000000041   67   98   53   4   1  73.6784
      id100  id051  id0000000051   13   15   55   1   3  54.3411
      id039  id046  id0000000090  100   77   79   1   2  7.1864
      id078  id004  id0000000009   68   97   10   2   2  40.3839
  • 14. Benchmarks: Test Commands
      Test  data.table                                        dplyr
      1.1   DT[, sum(v1), keyby=id1]                          DF %>% group_by(id1) %>% summarise(sum(v1))
      1.2   DT[, sum(v1), keyby=id1]                          DF %>% group_by(id1) %>% summarise(sum(v1))
      2.1   DT[, sum(v1), keyby="id1,id2"]                    DF %>% group_by(id1,id2) %>% summarise(sum(v1))
      2.2   DT[, sum(v1), keyby="id1,id2"]                    DF %>% group_by(id1,id2) %>% summarise(sum(v1))
      3.1   DT[, list(sum(v1),mean(v3)), keyby=id3]           DF %>% group_by(id3) %>% summarise(sum(v1),mean(v3))
      3.2   DT[, list(sum(v1),mean(v3)), keyby=id3]           DF %>% group_by(id3) %>% summarise(sum(v1),mean(v3))
      4.1   DT[, lapply(.SD, mean), keyby=id4, .SDcols=7:9]   DF %>% group_by(id4) %>% summarise_each(funs(mean), 7:9)
      4.2   DT[, lapply(.SD, mean), keyby=id4, .SDcols=7:9]   DF %>% group_by(id4) %>% summarise_each(funs(mean), 7:9)
      5.1   DT[, lapply(.SD, sum), keyby=id6, .SDcols=7:9]    DF %>% group_by(id6) %>% summarise_each(funs(sum), 7:9)
      5.2   DT[, lapply(.SD, sum), keyby=id6, .SDcols=7:9]    DF %>% group_by(id6) %>% summarise_each(funs(sum), 7:9)
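A scaled-down sketch of the benchmark, for trying the data.table side at home. The table size (N = 1e5 rather than 1e9) and the names `queries` and `timings` are assumptions made for illustration; the column definitions and the five queries are the ones from the setup and test-command slides. Timings from a run this small say little about the 1e9-row results.

```r
library(data.table)

# Assumption: N is cut from 1e9 to 1e5 so the example runs in seconds.
N <- 1e5; K <- 100
set.seed(1)
DT <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),
  id2 = sample(sprintf("id%03d", 1:K), N, TRUE),
  id3 = sample(sprintf("id%010d", 1:(N/K)), N, TRUE),
  id4 = sample(K, N, TRUE),
  id5 = sample(K, N, TRUE),
  id6 = sample(N/K, N, TRUE),
  v1  = sample(5, N, TRUE),
  v2  = sample(5, N, TRUE),
  v3  = sample(round(runif(100, max = 100), 4), N, TRUE)
)

# The five data.table queries, stored as unevaluated calls.
queries <- list(
  q1 = quote(DT[, sum(v1), keyby = id1]),
  q2 = quote(DT[, sum(v1), keyby = "id1,id2"]),
  q3 = quote(DT[, list(sum(v1), mean(v3)), keyby = id3]),
  q4 = quote(DT[, lapply(.SD, mean), keyby = id4, .SDcols = 7:9]),
  q5 = quote(DT[, lapply(.SD, sum), keyby = id6, .SDcols = 7:9])
)

# Run each query twice in a row and keep the elapsed seconds of both runs.
timings <- sapply(queries, function(q) {
  c(first  = system.time(eval(q))[["elapsed"]],
    second = system.time(eval(q))[["elapsed"]])
})
print(timings)
```

Each column of `timings` is one query; the two rows correspond to the ".1" and ".2" runs in the table above.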
  • 15. Benchmarks: Results. [Two charts comparing dplyr and data.table on group by and summarize (average of 5 operations), in microseconds: one on a linear scale (y-axis in millions, 0 to 350) and one on a log scale (100 to 1,000,000,000).]
  • 16. dplyr time as a percentage of data.table time, per test and run (values match the ratios of the timings on slide 39)
      Size (GB)  <0.01     <0.01     0.03      0.075     0.516     4.939     49.15
      Rows       1.00E+03  1.00E+04  1.00E+05  1.00E+06  1.00E+07  1.00E+08  1.00E+09
      1 (1st)    127%      133%      160%      238%      217%      186%      185%
      1 (2nd)    125%      146%      215%      265%      217%      188%      188%
      2 (1st)    150%      331%      508%      578%      399%      309%      294%
      2 (2nd)    148%      328%      497%      581%      405%      304%      281%
      3 (1st)    94%       116%      264%      316%      254%      276%      298%
      3 (2nd)    95%       120%      264%      307%      256%      264%      299%
      4 (1st)    226%      214%      193%      176%      188%      227%      227%
      4 (2nd)    171%      172%      175%      188%      187%      224%      232%
      5 (1st)    165%      166%      204%      239%      314%      586%      497%
      5 (2nd)    161%      164%      203%      240%      314%      623%      498%
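As a worked check, the "1 (1st)" row can be reproduced by dividing the dplyr timings by the data.table timings for the first `group_by(id1)` run on slide 39. The vector names below are made up for this sketch; the numbers are copied from that slide.

```r
# Timings for test 1, first run, at 1e3 ... 1e9 rows (copied from slide 39).
dplyr_t <- c(6729, 7074, 10737, 49100, 468973, 5076499, 51998307)
dt_t    <- c(5300, 5305, 6708, 20656, 216540, 2730619, 28076861)

# dplyr time as a percentage of data.table time, rounded to whole percent.
pct <- round(100 * dplyr_t / dt_t)
print(pct)  # 127 133 160 238 217 186 185
```

A value above 100% means dplyr took longer than data.table on that query and table size.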
  • 18. Read: fread()
      • Similar to read.table but faster and more convenient. All controls such as sep, colClasses and nrows are automatically detected. bit64::integer64 types are also detected and read directly, without needing to be read as character and then converted.
      • sep: the separator between columns. Defaults to the first character in the set [,\t |;:] that exists on line autostart outside quoted regions and that also separates the rows above autostart into a consistent number of fields.
      • Other arguments: skip, drop, select, showProgress.
      • Input can be a file name, a URL pointing to a file, or (advanced use) a shell command: fread("grep @WhiteHouse.gov filename")
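A minimal sketch of the auto-detection described above. The file contents, the temporary path, and the names `dt_all` and `dt_part` are made up for illustration; `select` and `nrows` are the fread() arguments named on the slide.

```r
library(data.table)

# Write a small semicolon-separated file to a temporary path (hypothetical data).
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;name;score",
             "1;alice;3.5",
             "2;bob;4.0",
             "3;carol;2.25"), tmp)

# No sep or colClasses supplied: fread() detects the ';' separator and the column types.
dt_all <- fread(tmp)
str(dt_all)

# select keeps only the named columns; nrows limits how many rows are read.
dt_part <- fread(tmp, select = c("id", "score"), nrows = 2)
print(dt_part)

unlink(tmp)
```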
  • 19. Create
      • data.table(): much like data.frame
      • setDT(): makes an existing data.frame a data.table without copying (this is important for large data)
      • setkey() and setkeyv(): supercharged rownames, indices
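The three ways of creating and keying a table might look like this in practice. The tables and the `key_cols` variable are invented for the sketch.

```r
library(data.table)

# data.table() is called much like data.frame().
DT <- data.table(id = c("b", "a", "c"), x = 1:3)

# setDT() converts an existing data.frame in place, without copying it.
DF <- data.frame(id = c("b", "a", "c"), y = c(10, 20, 30), stringsAsFactors = FALSE)
setDT(DF)
print(class(DF))

# setkey() sorts the table by the key column and marks it as the key.
setkey(DT, id)
print(key(DT))
print(DT["a"])   # key lookup, much like indexing by row name

# setkeyv() does the same but takes the column names as a character vector.
key_cols <- "id"
setkeyv(DF, key_cols)
print(DF)
```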
  • 20. Manipulate
      • := : assignment operator (without copy)
      • .N : number of rows (per group when used with by)
      • data.table::melt(), data.table::dcast()
      • merge() (the data.table method) and DT_1[DT_2] joins
      • DT[ i, j, by ]
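A short sketch that touches each item on this slide once. The `sales` and `regions` tables and all column names are hypothetical.

```r
library(data.table)

sales <- data.table(store = c("A", "A", "B", "B", "B"),
                    q1 = c(10, 20, 5, 15, 25),
                    q2 = c(11, 19, 6, 14, 30))

# := adds a column by reference (no copy of the table is made).
sales[, total := q1 + q2]

# .N counts rows; with by it counts rows per group.
counts <- sales[, .N, by = store]

# melt() goes wide to long, dcast() goes back from long to wide.
sales[, obs := seq_len(.N), by = store]
long <- melt(sales, id.vars = c("store", "obs"), measure.vars = c("q1", "q2"),
             variable.name = "quarter", value.name = "units")
wide <- dcast(long, store + obs ~ quarter, value.var = "units")

# Two ways to join: merge(), or the keyed DT_1[DT_2] form.
regions <- data.table(store = c("A", "B"), region = c("north", "south"))
merged <- merge(sales, regions, by = "store")
setkey(sales, store)
setkey(regions, store)
joined <- sales[regions]
print(joined)
```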
  • 21. DT[i, j, by] format. Source: https://campus.datacamp.com/courses/data-table-data-manipulation-r-tutorial
  • 22. DT[i, j, by] format. Source: https://campus.datacamp.com/courses/data-table-data-manipulation-r-tutorial
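The usual reading of DT[i, j, by] is: take DT, subset rows with i, then compute j, grouped by by. A minimal sketch on a made-up table (the original slides showed a diagram that is not reproduced here):

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b", "b", "b"),
                 x = 1:5,
                 y = c(2, 4, 6, 8, 10))

rows_only <- DT[x > 1]                          # i only: filter rows
total_y   <- DT[, sum(y)]                       # j only: compute on all rows
by_group  <- DT[, .(total = sum(y)), by = g]    # j computed per group

# i, j and by together: filter first, then aggregate per group.
res <- DT[x > 1, .(total = sum(y), n = .N), by = g]
print(res)
```

Here `res` has one row per group that survives the filter, with the sum of `y` and the row count for that group.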
  • 23. Special Commands
      • .( )
      • .EACHI
      • .SD and .SDcols
      • c('x2', 'y2') := list(..., ...)
      • `:=`(x2=...,y2= ...) equivalent group assignment
      • DT[, plot(x)] will actually produce a plot
      • copy(): for when you do not want to update by reference
      • DT[, "colname", with=FALSE]
      • DT[…][…]: chaining
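Most of these commands fit in one short session. The table, the `lookup` table, and the result names are invented for this sketch; the plotting item is left out so the example has no graphics side effects.

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b"), x = c(1, 2, 5), y = c(10, 20, 30))

# .() is shorthand for list() inside DT[...].
picked <- DT[, .(g, x)]

# .SD is the subset of data for each group; .SDcols chooses its columns.
means <- DT[, lapply(.SD, mean), by = g, .SDcols = c("x", "y")]

# Two equivalent ways to assign several columns by reference.
DT[, c("x2", "y2") := list(x^2, y^2)]
DT[, `:=`(x3 = x^3, y3 = y^3)]

# copy() gives an independent table, so := on the copy leaves DT untouched.
snap <- copy(DT)
snap[, x := 0]

# with = FALSE treats j as column names rather than an expression.
only_x2 <- DT[, "x2", with = FALSE]

# Chaining: the second [...] works on the result of the first.
chained <- DT[, .(sx = sum(x)), by = g][sx > 3]

# by = .EACHI evaluates j once for each row of the table used in i.
lookup <- data.table(g = c("a", "b"), key = "g")
setkey(DT, g)
eachi <- DT[lookup, .(n = .N), by = .EACHI]
print(eachi)
```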
  • 25. Multithread R: Revolution R Open (RRO) 3.2.1, https://mran.revolutionanalytics.com/download/. Enhancements include multi-core processing…
  • 27. Cloud, Database, BI Tools
      • AWS
          • On-off deployment of memory-optimized instances for one-off heavy processing
          • AWS with RStudio Server + Shiny Server (Linux only)
      • R is increasingly integrated in BI tools and even databases
          • Pentaho EE has R integration (as do MicroStrategy, Microsoft SSRS, IBM, and even data discovery tools like Tableau, QlikView & Alteryx)
          • IBM dashDB has a built-in RStudio, MS SQL Server 2016 will have in-database R, Postgres has PL/R, etc.
  • 28. Specialize Your Use of R
      • R can do anything you can program (it is a Turing-complete programming language)
      • R should NOT do everything.
      • Push ETL to specialized software (like PDI)
      • Push computation to the DB (DBI and the rstats-db packages, https://github.com/rstats-db) & Hadoop (RHadoop, basically large-scale lapply)
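A small sketch of what pushing computation to the database can look like with DBI. An in-memory SQLite database (via the RSQLite package) stands in for a real database server here; that choice, the table name `obs`, and the data are assumptions made for illustration only.

```r
library(DBI)

# Assumption: in-memory SQLite is used as a stand-in for a production database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

set.seed(1)
local_df <- data.frame(id1 = sample(c("id001", "id002", "id003"), 1000, TRUE),
                       v1  = sample(5, 1000, TRUE),
                       stringsAsFactors = FALSE)
dbWriteTable(con, "obs", local_df)

# The aggregation runs inside the database; only the small result comes back to R.
agg <- dbGetQuery(con, "SELECT id1, SUM(v1) AS v1_sum FROM obs GROUP BY id1 ORDER BY id1")
print(agg)

dbDisconnect(con)
```

With a real server, only the `dbConnect()` call would change; the query and the way the result is fetched stay the same.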
  • 29. Pentaho PDI: Overview of Capabilities
  • 30. What PDI can do for you
      • Data integration without writing a single line of code
      • Heavily parallel streams (compare to base R's 1 core); can even push to a whole slave computing cluster
      • Java, JavaScript, SQL, R scripting (EE?)
      • Slowly changing dimensions made easy
  • 34. Additional Resources: PDI
      • Community Version: http://community.pentaho.com/projects/data-integration/
      • Enterprise Edition: http://www.pentaho.com/product/data-integration
  • 35. Additional Resources: data.table
      • data.table wiki: https://github.com/Rdatatable/data.table/wiki
      • data.table tutorial: https://campus.datacamp.com/courses/data-table-data-manipulation-r-tutorial/
  • 36. Thank you for your time! stanasa@sunstonescience.com
  • 39. Benchmark Test Data (timings per command and table size; the unit is presumably microseconds, as labeled on slide 15)
      Test  Command                                                   1.00E+03     1.00E+04     1.00E+05     1.00E+06     1.00E+07     1.00E+08     1.00E+09
            Size (GB)                                                 <0.01        <0.01        0.03         0.075        0.516        4.939        49.15
      1.1   DF %>% group_by(id1) %>% summarise(sum(v1))               6,729        7,074        10,737       49,100       468,973      5,076,499    51,998,307
      1.1   DT[, sum(v1), keyby=id1]                                  5,300        5,305        6,708        20,656       216,540      2,730,619    28,076,861
      1.2   DF %>% group_by(id1) %>% summarise(sum(v1))               1,188        1,642        5,344        43,524       455,865      5,128,423    51,528,819
      1.2   DT[, sum(v1), keyby=id1]                                  953          1,123        2,486        16,416       210,032      2,721,640    27,406,794
      2.1   DF %>% group_by(id1,id2) %>% summarise(sum(v1))           1,894        8,480        28,033       152,965      1,444,988    14,927,984   152,535,515
      2.1   DT[, sum(v1), keyby="id1,id2"]                            1,263        2,559        5,516        26,446       361,812      4,827,830    51,897,539
      2.2   DF %>% group_by(id1,id2) %>% summarise(sum(v1))           1,865        8,396        27,343       153,185      1,440,788    14,652,509   152,118,605
      2.2   DT[, sum(v1), keyby="id1,id2"]                            1,257        2,561        5,505        26,386       355,492      4,827,702    54,047,731
      3.1   DF %>% group_by(id3) %>% summarise(sum(v1),mean(v3))      1,130        1,697        10,323       129,391      1,805,955    45,748,652   693,700,832
      3.1   DT[, list(sum(v1),mean(v3)), keyby=id3]                   1,197        1,461        3,910        40,991       710,813      16,582,213   233,001,218
      3.2   DF %>% group_by(id3) %>% summarise(sum(v1),mean(v3))      1,100        1,701        10,299       125,585      1,824,419    44,038,141   627,247,199
      3.2   DT[, list(sum(v1),mean(v3)), keyby=id3]                   1,160        1,415        3,894        40,895       713,289      16,660,202   209,734,867
      4.1   DF %>% group_by(id4) %>% summarise_each(funs(mean), 7:9)  2,463        2,900        6,965        51,318       591,011      8,421,787    86,594,397
      4.1   DT[, lapply(.SD, mean), keyby=id4, .SDcols=7:9]           1,091        1,354        3,603        29,196       314,705      3,711,641    38,151,119
      4.2   DF %>% group_by(id4) %>% summarise_each(funs(mean), 7:9)  1,811        2,276        6,328        49,939       579,184      8,315,717    86,585,169
      4.2   DT[, lapply(.SD, mean), keyby=id4, .SDcols=7:9]           1,056        1,322        3,614        26,572       310,265      3,706,489    37,275,816
      5.1   DF %>% group_by(id6) %>% summarise_each(funs(sum), 7:9)   1,729        2,234        8,268        94,044       1,661,863    41,141,396   600,233,695
      5.1   DT[, lapply(.SD, sum), keyby=id6, .SDcols=7:9]            1,049        1,343        4,053        39,299       529,329      7,016,574    120,664,934
      5.2   DF %>% group_by(id6) %>% summarise_each(funs(sum), 7:9)   1,696        2,207        8,189        93,323       1,660,917    41,126,264   625,999,220
      5.2   DT[, lapply(.SD, sum), keyby=id6, .SDcols=7:9]            1,055        1,343        4,033        38,910       529,763      6,602,454    125,725,448