R-Package DescTools
Why and where to go?
Andri Signorell, Helsana Health Sciences,
Zurich R-Group 21.01.2016
Randomized clinical trials (RCTs)
do not reflect the reality of health care
• The population included in an RCT does not correspond to
the population that ultimately receives the treatment
Only 1/3 of the people ultimately treated
would fulfill the inclusion criteria at all
Elderly underrepresented in
clinical trials
Medication of one patient…
Is this evidence-based medicine?
Real example from
our database:
in 2013, Mrs. G. H. in G.
received drugs with
101 different agents
(ATC codes),
533 prescriptions
in total
Unnecessary cardiac catheterizations in
Switzerland
ni. A cardiac catheter can be used to detect and clear
dangerous blockages in a patient's coronary arteries.
But because the examination is expensive, invasive and not free of
complications, it should be performed only when there is a
well-founded suspicion of narrowing; that is what
the guidelines stipulate. Is this followed in Switzerland?
Researchers investigated this question in a study.
Their results, recently published in «Plos One»,
suggest that three out of ten cardiac catheterizations are
unnecessary. (NZZ, 5.3.2015)
Drawing: Felix Schaad
Orders of magnitude
• Analytical data warehouse (Teradata),
updated daily, with a bitemporal history
• 492 tables and 7494 attributes
• 1'468'893 insured persons in 2014
• complete treatment information since ~2005
• 201'875'131 claims with a total of
949'392'044 detailed line items
• Analysed with
Where's the pain point?
Cross-Industry Standard Process
for Data-Mining
Shearer C., The CRISP-DM model: the new blueprint for
data mining, J Data Warehousing (2000); 5:13-22.
80% of the analysts' resources
are lost to data understanding
and preparation … and no one
is doing anything about it!
Users, even expert statisticians, do not always
screen the data.
B. D. Ripley, Robust statistics (2004)
Get the Right Tool for the Job!
• Datasets with 150
variables and 500'000 rows
are not unusual
• R might not always be
optimal for this order of
magnitude (performance,
RAM)
• The programming paradigm lets
the screening code grow
and become confusing!
DescTools focus
• provide elaborate descriptive routines
– numeric, factor, logical, table, numeric ~ factor, ...
– data.frame, formula interface
• integrate descriptive plots
• easy output to an MS Word document
> Desc(d.pizza$temperature) # describe single variable
> wrd <- GetNewWrd()
> Desc(d.pizza, wrd=wrd) # describe data.frame and send
# it directly to Word
> Desc(. ~ driver, d.pizza)
> Desc(driver ~ ., d.pizza)
Describe numeric
> summary(d.pizza$temperature) # base R
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  19.30   42.22   50.00   47.94   55.30   64.80      40

> describe(d.pizza$temperature) # library(Hmisc)
d.pizza$temperature
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75     .90     .95
   1170      39     375       1   47.94   26.70   33.29   42.23   50.00   55.30   58.80   60.50
lowest : 19.30 19.40 20.00 20.20 20.35, highest: 63.80 64.10 64.60 64.70 64.80

> Desc(d.pizza$temperature) # library(DescTools)
--------------------------------------------------
d.pizza$temperature (numeric)

  length       n     NAs  unique      0s    mean  meanSE
   1'210   1'170      40     375       0  47.937   0.291

     .05     .10     .25  median     .75     .90     .95
  26.700  33.290  42.225      50  55.300  58.800  60.500

     rng      sd   vcoef     mad     IQR    skew    kurt
  45.500   9.938   0.207   9.192  13.075  -0.842   0.051

lowest : 19.3, 19.4, 20, 20.2 (2), 20.35
highest: 63.8, 64.1, 64.6, 64.7, 64.8
Screening questions:
• What happens at the edges?
• Are there missing values?
• Are all elements unique?
• Has 0 been misused as NA?
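The four screening questions can also be answered with a few lines of base R. The following is only a minimal sketch and not how Desc() works internally; the helper name screen_numeric and the sample vector are made up for illustration.

```r
# 'screen_numeric' is a hypothetical helper, not a DescTools function.
screen_numeric <- function(x, k = 5) {
  vals <- sort(unique(x))                       # sort() drops NAs by default
  list(
    lowest     = head(vals, k),                 # what happens at the lower edge?
    highest    = tail(vals, k),                 # what happens at the upper edge?
    n_missing  = sum(is.na(x)),                 # are there missing values?
    all_unique = anyDuplicated(x[!is.na(x)]) == 0,  # are all elements unique?
    n_zero     = sum(x == 0, na.rm = TRUE)      # has 0 been misused as NA?
  )
}

# Made-up sample data with one NA, one duplicate and one suspicious zero
x   <- c(19.3, 20, 20, 0, NA, 64.8, 47.9)
res <- screen_numeric(x, k = 2)
str(res)
```

A zero count alone does not prove misuse; it only flags the variable for a closer look.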
• Base R
plot(d.pizza$temperature)
• DescTools
plot(Desc(d.pizza$temperature))
Visualization excellence …
… is that which gives to the viewer the greatest number of ideas in the shortest
time with the least ink in the smallest space.
… requires telling the truth about the data.
Edward Tufte The Visual Display of Quantitative Information and Envisioning Information, Graphics Press, PO Box 430, Cheshire, CT 06410.
Describe table
> tab <- table(d.pizza$driver, d.pizza$area)
> summary(tab)
Number of cases in table: 1194
Number of factors: 2
Test for independence of all factors:
Chisq = 1009.5, df = 12, p-value = 1.697e-208
> describe(tab)
tab
 3 Variables   7 Observations
----------------------------------------------------
Brent
      n missing  unique    Info    Mean
      7       0       7       1   67.57

            6  19  29  42  72 128 177
Frequency   1   1   1   1   1   1   1
%          14  14  14  14  14  14  14
----------------------------------------------------
Camden
      n missing  unique    Info    Mean
      7       0       7       1   48.71

            1   4  19  41  47  87 142
Frequency   1   1   1   1   1   1   1
%          14  14  14  14  14  14  14
----------------------------------------------------
...
base R: reduced to the bare minimum…
Hmisc:
Oops! Misinterpreted… (the table is described as a data set of 3 variables with 7 observations)
> tab <- as.table(apply(HairEyeColor, c(1,2), sum))[
+ , c("Brown","Hazel","Green","Blue")]
> (z <- Desc(tab, row.vars=c(3, 1), rfrq="011",
+ plotit=FALSE, main="Hair ~ Eye"))
Hair ~ Eye
Summary:
n: 592, rows: 4, columns: 4
Pearson's Chi-squared test:
X-squared = 138.29, df = 9, p-value < 2.2e-16
Likelihood Ratio:
X-squared = 146.44, df = 9, p-value < 2.2e-16
Mantel-Haenszel Chi-squared:
X-squared = 109.64, df = 1, p-value < 2.2e-16

Phi-Coefficient     0.483
Contingency Coeff.  0.435
Cramer's V          0.279

               Eye
               Brown  Hazel  Green   Blue    Sum
       Hair
freq   Black      68     15      5     20    108
       Brown     119     54     29     84    286
       Red        26     14     14     17     71
       Blond       7     10     16     94    127
       Sum       220     93     64    215    592
p.row  Black     63%  13.9%   4.6%  18.5%      .
       Brown   41.6%  18.9%  10.1%  29.4%      .
       Red     36.6%  19.7%  19.7%  23.9%      .
       Blond    5.5%   7.9%  12.6%    74%      .
       Sum     37.2%  15.7%  10.8%  36.3%      .
p.col  Black   30.9%  16.1%   7.8%   9.3%  18.2%
       Brown   54.1%  58.1%  45.3%  39.1%  48.3%
       Red     11.8%  15.1%  21.9%   7.9%    12%
       Blond    3.2%  10.8%    25%  43.7%  21.5%
       Sum         .      .      .      .      .
> # do the plot by hand, while setting the colours
> cols1 <- SetAlpha(c("sienna4", "burlywood",
"chartreuse3", "slategray1"), 0.6)
> cols2 <- SetAlpha(c("moccasin", "salmon1", "wheat3",
"gray32"), 0.8)
> plot(z, col1=cols1, col2=cols2, horiz=FALSE)
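For comparison, the frequencies, row and column percentages and the Pearson test shown above can be pieced together by hand in base R. This is a sketch of the equivalent manual steps on the built-in HairEyeColor data, not a description of how Desc() is implemented; the variable names are chosen for illustration.

```r
# Collapse the built-in 3-way HairEyeColor array over Sex, then reorder eye colours
tab <- apply(HairEyeColor, c(1, 2), sum)[, c("Brown", "Hazel", "Green", "Blue")]
tab <- as.table(tab)

freq  <- addmargins(tab)                       # counts with row and column sums
p_row <- round(100 * prop.table(tab, 1), 1)    # row percentages
p_col <- round(100 * prop.table(tab, 2), 1)    # column percentages
chi   <- chisq.test(tab)                       # Pearson's chi-squared test

freq
p_row
chi
```

Each piece needs its own call and its own formatting, which is the kind of repetitive screening code the single Desc() call replaces.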
Describe factors in Word
Desc(d.pizza$driver, wrd=GetNewWrd())
factor ~ numeric
> Desc(diabetes ~ age, data=d.pima,
+ digits=2, breaks=5, margin=TRUE, conf.level=0.90)
Summary:
n pairs: 768, valid: 768 (100%), missings: 0 (0%), groups: 2

           neg     pos   Total
mean     31.19   37.07   33.24
median   27.00   36.00   29.00
sd       11.67   10.97   11.76
IQR      14.00   16.00   17.00
n          500     268     768
np       65.1%   34.9%    100%
NAs          0       0       0
0s           0       0       0

Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 73.253, df = 1, p-value < 2.2e-16

Proportions of diabetes in the quantiles of age:
         Q1      Q2      Q3      Q4      Q5
neg   86.7%   76.1%     57%   54.3%   46.8%
pos   13.3%   23.9%     43%   45.7%   53.2%

further:
factor ~ factor
numeric ~ factor
numeric ~ numeric
+ ~ 440 Functions
• Statistical functions and Confidence Intervals
Skew, Kurt, CramerV, SomersDelta, CohenKappa, HuberM, MeanCI,
BinomCI, …
• Additional Tests not found in base R
HotellingsT2Test, JarqueBeraTest, BreslowDayTest, DurbinWatsonTest,
LeveneTest, ScheffeTest, …
• Date functions
Today, AddMonths, Day, Month, Year, Weekday, IsWeekend, Zodiac, …
• String functions
StrAlign, StrTrim, StrDist, StrCountW, StrVal, …
• Operators and other
%()%, Untable, CollapseTable, Dummy, Large, Small, …
Pain Point «Speed»
> x <- runif(1e8)
> system.time(e1071::kurtosis(x))
   user  system elapsed
   5.67    0.55    6.21
> system.time(DescTools::Kurt(x))
   user  system elapsed
   0.47    0.00    0.47
http://www.noamross.net/blog/2013/4/25/faster-talk.html
-> Get a Bigger Computer
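To make the quantity being timed concrete, here is a minimal base R sketch of excess kurtosis computed from central moments, timed the same way. The helper excess_kurtosis is hypothetical; it uses the plain moment estimator m4 / m2^2 - 3, which is not necessarily the default estimator of e1071::kurtosis or DescTools::Kurt, and the vector is smaller than the 1e8 values used above.

```r
# 'excess_kurtosis' is a hypothetical helper, not part of DescTools or e1071.
excess_kurtosis <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  d  <- x - mean(x)
  m2 <- mean(d^2)    # second central moment
  m4 <- mean(d^4)    # fourth central moment
  m4 / m2^2 - 3      # excess kurtosis (0 for a normal distribution)
}

set.seed(1)
x <- runif(1e6)                    # assumption: 1e6 values to keep the run short
k <- excess_kurtosis(x)            # theoretical value for a uniform distribution: -1.2
timing <- system.time(excess_kurtosis(x))
k
timing
```

Timings depend on the machine and the R version, so they are only comparable when taken in the same session.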
Pain point «Import»
R Data Import/Export
This is a guide to importing and exporting data
to and from R.
This manual is for R, version 3.1.2 (2014-10-31).
Copyright © 2000–2014 R Core Team
DescTools::XLGetRange()
• Import directly from Excel (XL)
Can one be a good data analyst without being a half-good programmer?
The short answer to that is, 'No.' The long answer to that is, 'No.'
-- Frank Harrell, 1999 S-PLUS User Conference, New Orleans (October 1999)
Could you spontaneously produce the R code needed to present today's date?
“Donnerstag, 21. Januar 2016”
• Solution Base R*):
> format(Sys.Date(), "%A, %d. %B %Y")
[1] "Donnerstag, 21. Januar 2016"
• Solution DescTools:
> Format(Today(), fmt="dddd, dd. mmmm yyyy")
[1] "Donnerstag, 21. Januar 2016"
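One detail worth keeping in mind with the base R solution: format() takes the day and month names for %A and %B from the session's LC_TIME locale, so the German output above presupposes a German locale. The sketch below uses a fixed date and the "C" locale, which exists on every platform; the name of a German locale is platform-dependent and therefore only mentioned in a comment as an assumption.

```r
d <- as.Date("2016-01-21")                 # fixed date instead of Sys.Date()

old <- Sys.getlocale("LC_TIME")            # remember the current setting
Sys.setlocale("LC_TIME", "C")              # "C" yields English names
english <- format(d, "%A, %d. %B %Y")
Sys.setlocale("LC_TIME", old)              # restore the previous setting

english
# For the German string a German locale is needed, e.g. "de_DE.UTF-8"
# on many Linux systems (assumption: that locale is installed).
```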
Pain Point «User Interface»
The reasonable man adapts himself to the world; the
unreasonable one persists in trying to adapt the world
to himself.
Therefore, all progress depends on the unreasonable
man.
George Bernard Shaw
Be unreasonable and contact me
with feedback or feature ideas!
andri@signorell.net
Thanks to
• All the R Core members and R contributors
• Frank E Harrell Jr, with contributions from Charles Dupont and many others. (2014). Hmisc:
Harrell Miscellaneous. R package version 3.14-6. http://CRAN.R-project.org/package=Hmisc
• Revelle, W. (2015) psych: Procedures for Personality and Psychological Research,
Northwestern University, Evanston, Illinois, USA, http://CRAN.R-project.org/package=psych
Version = 1.5.1.
• Lemon, J. (2006) Plotrix: a package in the red light district of R. R-News, 6(4): 8-12.
• Hans Peter Wolf and Uni Bielefeld (2014). aplpack: Another Plot PACKage: stem.leaf, bagplot,
faces, spin3R, plotsummary, plothulls, and some slider functions. R package version 1.3.0.
http://CRAN.R-project.org/package=aplpack
• Martin Maechler et al. (2015). sfsmisc: Utilities from Seminar fuer Statistik ETH Zurich. R
package version 1.0-27. http://CRAN.R-project.org/package=sfsmisc
• Christian W. Hoffmann <http://www.echoffmann.ch> (2014). cwhmisc: Miscellaneous
Functions for math, plotting, printing, statistics, strings, and tools. R package version 5.0.
http://CRAN.R-project.org/package=cwhmisc
• And many more! See DescTools’ authors list!
