Understanding R for Epidemiologists 
Tomás J. Aragón, MD, DrPH
Faculty, Division of Epidemiology 
UC Berkeley School of Public Health 
Health Officer, City & County of San Francisco 
Director, Population Health Division (PHD) 
San Francisco Department of Public Health 
Blog: http://www.medepi.com 
Email: aragon@berkeley.edu 
September 8, 2014 
Tomás Aragón, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 1 / 60
Outline 
1 Background 
Cost 
Quality 
Community 
2 Getting started with R 
Full-function calculator/spreadsheet 
Extensible statistical packages 
High quality graphics tool 
Multi-use programming language 
3 Working with R data objects 
Atomic vs. recursive data objects 
Working with vectors, matrices, & arrays 
Working with lists, data frames, and functions 
Background 
Background: Major issues 
Cost 
Quality 
Community 
Functionality 
Background Cost 
Cost: Open Source vs. Proprietary Software 
Costs of software 
Costs of multi-platforms 
Costs of education and training 
Costs of adding solutions (e.g., packages) 
Costs of solving problems and sharing solutions 
Background Quality 
Quality: Open Source vs. Proprietary Software 
Core Development Team 
Large pool of users/testers 
Quality control process for packages 
Bug fixes based on need/demand, not profits 
Background Community 
Community: Open Source vs. Proprietary Software 
Large community of users 
Transparent development process 
Growing number of books and trainings 
Growing number of free tutorials and manuals 
Current R contributors 
Douglas Bates 
John Chambers 
Peter Dalgaard 
Seth Falcon 
Robert Gentleman 
Kurt Hornik 
Stefano Iacus 
Ross Ihaka 
Friedrich Leisch 
Uwe Ligges 
Thomas Lumley 
Martin Maechler 
Duncan Murdoch 
Paul Murrell 
Martyn Plummer 
Brian Ripley 
Deepayan Sarkar 
Duncan Temple Lang 
Luke Tierney 
Simon Urbanek 
Source: http://www.r-project.org/contributors.html 
Getting started with R 
What is R? 
Full-function calculator/spreadsheet 
Extensible statistical packages 
High-quality graphics tool 
Multi-use programming language 
Getting started with R Full-function calculator/spreadsheet 
Full-function calculator: Selected math operators 
Operator  Description                          Try these examples
+         addition                             5+4
-         subtraction                          5-4
*         multiplication                       5*4
/         division                             5/4
^         exponentiation                       5^4
-         unary minus (change current sign)    -5
abs       absolute value                       abs(-23)
exp       exponentiation (e to a power)        exp(8)
log       logarithm (default is natural log)   log(exp(8))
sqrt      square root                          sqrt(64)
%/%       integer divide                       10%/%3
%%        modulus                              10%%3
%*%       matrix multiplication                xx <- matrix(1:4, 2, 2)
                                               xx%*%c(1, 1)
                                               c(1, 1)%*%xx
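The last three operators are the least familiar, so here is a minimal sketch of how they behave. The variable names (a, b, quotient, and so on) are chosen only for illustration.

```r
# Integer division and modulus together recover the original number.
a <- 10
b <- 3
quotient  <- a %/% b                 # whole-number part of 10/3
remainder <- a %% b                  # what is left over
recovered <- quotient * b + remainder

# Matrix multiplication: R treats c(1, 1) as a column or a row vector,
# whichever makes the product conformable.
xx <- matrix(1:4, 2, 2)              # columns are (1, 2) and (3, 4)
row_sums <- xx %*% c(1, 1)           # 2 x 1 matrix holding the row sums
col_sums <- c(1, 1) %*% xx           # 1 x 2 matrix holding the column sums
```

Multiplying by a vector of ones is a compact way to get row or column totals, which comes up often with contingency tables.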
Getting started with R Extensible statistical packages 
Extensible statistical packages 
Generalized Linear Models (Base) 
Linear regression 
Logistic regression 
Poisson regression 
Cox Proportional Hazard models (Survival) 
Cox PH regression 
Conditional logistic regression (matched case-control studies) 
Meta-analysis (meta) 
Complex survey analysis (survey) 
Epidemiology packages 
epitools 
epicalc 
epibasix 
epiR 
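As a rough illustration of the base generalized linear model tools, the sketch below fits a logistic regression with glm() to the infert data set that ships with R. It deliberately ignores the matching in that study, so it is only a demonstration of the syntax, not a recommended analysis; the object names (mod, est, or.table) are assumptions made for this example.

```r
# Logistic regression with base R's glm(); infert is in the datasets package.
data(infert)
mod <- glm(case ~ spontaneous + induced, family = binomial, data = infert)

# Exponentiated coefficients are odds ratios; Wald limits use the standard errors.
est <- coef(summary(mod))
z   <- qnorm(0.975)
or.table <- cbind(
  OR  = exp(est[, "Estimate"]),
  LCL = exp(est[, "Estimate"] - z * est[, "Std. Error"]),
  UCL = exp(est[, "Estimate"] + z * est[, "Std. Error"])
)
round(or.table, 2)
```

Swapping family = binomial for family = poisson (with a count outcome) gives Poisson regression with the same interface.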
Getting started with R High quality graphics tool 
Graphics display of sample size curves 
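[Figure: overlapping null (H0) and alternative (H1) distributions centered at μ0 and μ1, with critical values -Z(1-α/2) and Z(1-α/2), the α/2 rejection area, the type II error β, and the power (1 - β).]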
Graphics display of P value function 
[Figure: P value function for a rate ratio. X-axis: Rate Ratio (ticks from 0.2 to 20.0). Left y-axis: P-value (0 to 1). Right y-axis: Confidence level (%) (0 to 100). Marked on the plot: the null hypothesis, the median unbiased estimate, and the 95% confidence interval (lower limit = 0.74, upper limit = 21.0).]
Graphical display of multiple linear regression 
[Figure: three-dimensional plot of a multiple linear regression, with axes x1, x2, and y.]
Epidemic curve using Color Brewer colors 
[Figure: epidemic curve titled "West Nile Virus Human Cases Reported in California by Disease Week as of December 14, 2004". Y-axis: Cases (0 to 80), colored by category (Unknown, WNF, WNND). X-axis: Disease Week and Calendar Month, 2004. Annotations: + Bird 2/24, + Mosquito 4/14, + Chicken 5/17, + Horse 6/20.]
Getting started with R Multi-use programming language 
Multi-use programming language 
Vectorized computations 
Functional programming language 
Object-oriented programming 
Text processing (e.g., using regular expressions) 
Links to C, Fortran, etc. 
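The first, second, and fourth items can be sketched in a few lines. All data values and names below (weight_kg, ages, labels, and so on) are made up for illustration.

```r
# Vectorized computation: one expression operates on every element.
weight_kg <- c(70, 82, 55)
height_m  <- c(1.75, 1.80, 1.60)
bmi <- weight_kg / height_m^2

# Functional programming: pass a function to another function.
ages <- list(cases = c(34, 23), controls = c(56, 45))
mean_age <- sapply(ages, mean)

# Text processing with regular expressions.
labels <- c("age_55plus", "age_under55", "sex")
is_age <- grepl("^age_", labels)     # which labels start with "age_"?
clean  <- sub("^age_", "", labels)   # strip that prefix where present
```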
Working with R data objects Atomic vs. recursive data objects 
Data objects in R 
Object types 
Vector 
Matrix 
Array 
List 
Data frame 
Function 
Operations 
Create 
Name 
Index 
Replace 
Manipulate 
Do computations 
Summary of types of data objects in R 
Data object    Possible mode (a)               Default class
Atomic
  vector       character, numeric, logical     NULL
  matrix       character, numeric, logical     NULL
  array        character, numeric, logical     NULL
Recursive
  list         list                            NULL
  data frame   list                            data frame
  function     function                        NULL
(a) We are ignoring complex numbers
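The table can be checked interactively. The sketch below uses small made-up objects and reads the "default class" column as the class attribute, which is an interpretation of the table; note that class() itself reports an implicit class such as "numeric" or "matrix" even when the attribute is NULL.

```r
v <- c(1, 2, 3)
m <- matrix(1:4, 2, 2)
l <- list(v, "a")
d <- data.frame(id = 1:2)
f <- function(x) x

# Atomic versus recursive
atomic    <- c(vector = is.atomic(v), matrix = is.atomic(m))
recursive <- c(list = is.recursive(l), data.frame = is.recursive(d),
               fun = is.recursive(f))

# Mode of each object
modes <- c(mode(v), mode(m), mode(l), mode(d), mode(f))

# Class attribute: NULL unless one has been set, as for a data frame
class.v <- attr(v, "class")
class.d <- attr(d, "class")
```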
Working with R data objects Working with vectors, matrices, & arrays
Understanding vectors 
A vector is a collection of like elements without dimensions (see note). The vector
elements are all of the same mode (either character, numeric, or logical).
> y <- c("Pedro", "Paulo", "Maria")
> y
[1] "Pedro" "Paulo" "Maria"
> x <- c(1, 2, 3, 4, 5)
> x
[1] 1 2 3 4 5
> x < 3
[1]  TRUE  TRUE FALSE FALSE FALSE
Note: In other programming languages, vectors are either row vectors or column vectors.
R does not make this distinction until it is necessary.
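Two consequences of this definition are worth trying: mixing modes forces R to coerce every element to a single mode, and a plain vector has no dim attribute. A short sketch with made-up values:

```r
# Mixed input is coerced to one common mode.
mixed <- c(1, "a", TRUE)      # everything becomes character
nums  <- c(1, TRUE, FALSE)    # logicals become numeric 1 and 0

mode(mixed)
mode(nums)

# A vector has length but no dimensions.
x <- c(1, 2, 3, 4, 5)
length(x)
dim(x)                        # NULL
```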
Understanding vectors: Indexing 
Indexing by   Try these examples
Position      x <- c(chol=234, sbp=148, dbp=78, age=54)
              x[2]               #positions to include
              x[c(2, 3)]
              x[-c(1, 3, 4)]     #positions to exclude
              x[-c(1, 4)]
Name          x["sbp"]
              x[c("sbp", "dbp")]
Logical       x < 100
              x[x < 100]
              (x < 150) & (x > 70)
              bp <- (x < 150) & (x > 70)
              x[bp]
Understanding vectors: Replacement 
Replacing by  Try these examples
Position      x <- c(chol=234, sbp=148, dbp=78, age=54)
              x[1]
              x[1] <- 250
              x
Name          x["sbp"]
              x["sbp"] <- 150
              x
Logical       x[x<100]
              x[x<100] <- NA
              x
Understanding vectors: Replacement 
> x <- c(chol = 234, sbp = 148, dbp = 78, age = 54)
> x[1] <- 250 #by position
> x
chol  sbp  dbp  age
 250  148   78   54
> x["sbp"] <- 150 #by name
> x
chol  sbp  dbp  age
 250  150   78   54
> x[x<100]
dbp age
 78  54
> x[x<100] <- NA #by logical
> x
chol  sbp  dbp  age
 250  150   NA   NA
Understanding matrices 
A matrix is a collection of like elements organized into a 2-dimensional
(tabular) data object. Matrix elements can be either numeric, character,
or logical. We can think of a matrix as a vector with a 2-dimensional
structure. Contingency tables in epidemiology are represented in R as
numeric matrices or arrays. An array is the generalization of matrices to 3
or more dimensions (commonly known as stratified tables). We cover
arrays later; for now we focus on 2-dimensional tables.
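The idea that a matrix is a vector with a 2-dimensional structure can be made concrete: assigning a dim attribute to a vector turns it into a matrix. A minimal sketch, using the counts from the tolbutamide trial purely as sample numbers:

```r
x <- c(30, 174, 21, 184)
dim(x) <- c(2, 2)               # the same four numbers, now 2 rows by 2 columns
is.matrix(x)

# matrix() does the same thing in one step (filling column by column).
y <- matrix(c(30, 174, 21, 184), 2, 2)
all(x == y)

# Dropping the dimensions gives back the plain vector.
as.vector(y)
```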
Understanding matrices 
When R returns a matrix, [n,] indicates the nth row and [,m]
indicates the mth column.
> x <- c("a", "b", "c", "d")
> y <- matrix(x, 2, 2)
> y
     [,1] [,2]
[1,] "a"  "c"
[2,] "b"  "d"
> y[1,]
[1] "a" "c"
> y[,2]
[1] "c" "d"
Understanding matrices 
> x <- c(30, 21, 170, 180)                       # creating
> y <- matrix(x, 2, 2, byrow = TRUE)             # creating
> y
     [,1] [,2]
[1,]   30   21
[2,]  170  180
> rownames(y) <- c("Deaths", "Survivors")        # naming
> colnames(y) <- c("Tolbutamide", "Placebo")     # naming
> y[2, 1] <- 174                                 # replace by position
> y["Survivors", "Placebo"] <- 184               # replace by name
> y
          Tolbutamide Placebo
Deaths             30      21
Survivors         174     184
Understanding matrices 
Consider the 2 × 2 table of crude data below. In this randomized clinical
trial (RCT), diabetic subjects were randomly assigned to receive either
tolbutamide, an oral hypoglycemic drug, or placebo. Because this was a
prospective study we can calculate risks, odds, a risk ratio, and an odds
ratio. We will do this using R as a calculator.
Table: Deaths among subjects who received tolbutamide and placebo in the
University Group Diabetes Program (1970)
            Tolbutamide   Placebo
Deaths               30        21
Survivors           174       184
Understanding matrices 
> dat <- matrix(c(30, 174, 21, 184), 2, 2)
> rownames(dat) <- c("Deaths", "Survivors")
> colnames(dat) <- c("Tolbutamide", "Placebo")
> coltot <- apply(dat, 2, sum)        #column totals
> risks <- dat["Deaths",]/coltot
> risk.ratio <- risks/risks[2]        #risk ratio
> odds <- risks/(1-risks)
> odds.ratio <- odds/odds[2]          #odds ratio
Understanding matrices 
> # display results
> dat
          Tolbutamide Placebo
Deaths             30      21
Survivors         174     184
> rbind(risks, risk.ratio, odds, odds.ratio)
           Tolbutamide   Placebo
risks        0.1470588 0.1024390
risk.ratio   1.4355742 1.0000000
odds         0.1724138 0.1141304
odds.ratio   1.5106732 1.0000000
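The same quantities can be reached by other routes. The sketch below is one alternative, not the approach used on the slides: it uses prop.table() for the column proportions and the cross-product formula for the odds ratio.

```r
dat <- matrix(c(30, 174, 21, 184), 2, 2,
              dimnames = list(c("Deaths", "Survivors"),
                              c("Tolbutamide", "Placebo")))

# Column proportions: the "Deaths" row holds the risk in each treatment group.
risks <- prop.table(dat, margin = 2)["Deaths", ]
risk.ratio <- risks / risks["Placebo"]

# Odds ratio as the cross-product ratio (a*d)/(b*c).
odds.ratio <- (dat[1, 1] * dat[2, 2]) / (dat[1, 2] * dat[2, 1])
```

Both routes should agree with the values in the rbind() display: a risk ratio of about 1.44 and an odds ratio of about 1.51 for tolbutamide versus placebo.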
Understanding arrays 
An array is a collection of like elements organized into an n-dimensional
data object. When R returns an array, [n,,] indicates the nth row,
[,m,] indicates the mth column, and so on.
> x <- 1:8
> y <- array(x, dim=c(2, 2, 2))
> y
, , 1
     [,1] [,2]
[1,]    1    3
[2,]    2    4
, , 2
     [,1] [,2]
[1,]    5    7
[2,]    6    8
Understanding arrays 
While a matrix is a 2-dimensional table of like elements, an array is the
generalization of matrices to n dimensions. Stratified contingency tables in
epidemiology are represented as array data objects in R. For example, the
RCT previously shown comparing the number of deaths among diabetic
subjects that received tolbutamide vs. placebo is now also stratified by age
group:
Table: Deaths among subjects who received tolbutamide and placebo in the
University Group Diabetes Program (1970), stratifying by age
              Age < 55        Age >= 55       Combined
              Tolb   Plac     Tolb   Plac     Tolb   Plac
Deaths           8      5       22     16       30     21
Survivors       98    115       76     69      174    184
Total          106    120       98     85      204    205
Understanding arrays 
> tdat <- c(8, 98, 5, 115, 22, 76, 16, 69)
> tdat <- array(tdat, c(2, 2, 2))
> dimnames(tdat) <- list(Outcome=c("Deaths", "Survivors"),
+     Treatment=c("Tolbutamide", "Placebo"),
+     "Age group"=c("Age<55", "Age>=55"))
> tdat
, , Age group = Age<55
           Treatment
Outcome     Tolbutamide Placebo
  Deaths              8       5
  Survivors          98     115
, , Age group = Age>=55
           Treatment
Outcome     Tolbutamide Placebo
  Deaths             22      16
  Survivors          76      69
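Once the stratified table is an array, apply() can collapse or summarize it along any dimension. The following sketch rebuilds the array and then recovers the combined table and the age-specific risk ratios; the object names combined, risks, and rr.by.age are introduced here for illustration.

```r
tdat <- array(c(8, 98, 5, 115, 22, 76, 16, 69), c(2, 2, 2),
              dimnames = list(Outcome = c("Deaths", "Survivors"),
                              Treatment = c("Tolbutamide", "Placebo"),
                              "Age group" = c("Age<55", "Age>=55")))

# Sum over the 3rd dimension (age) to get the crude 2 x 2 table.
combined <- apply(tdat, c(1, 2), sum)

# Risk of death within each treatment-by-age cell:
# deaths divided by the treatment-by-age totals.
risks <- tdat["Deaths", , ] / apply(tdat, c(2, 3), sum)

# Age-specific risk ratios, tolbutamide versus placebo.
rr.by.age <- risks["Tolbutamide", ] / risks["Placebo", ]
```

The combined table should match the "Combined" columns of the stratified table (30, 21, 174, 184).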
Table: Example of 4-dimensional array: Year 2000 population estimates by age,
ethnicity, sex, and county
                                      Ethnicity
County/Sex     Age      White  AfrAmer  AsianPI  Latino  Multirace  AmerInd
Alameda
  Female       <=19    58,160   31,765   40,653  49,738     10,120      839
               20-44  112,326   44,437   72,923  58,553      7,658    1,401
               45-64   82,205   24,948   33,236  18,534      2,922      822
               65+     49,762   12,834   16,004   7,548      1,014      246
  Male         <=19    61,446   32,277   42,922  53,097     10,102      828
               20-44  115,745   36,976   69,053  69,233      6,795    1,263
               45-64   81,332   20,737   29,841  17,402      2,506      687
               65+     33,994    8,087   11,855   5,416        711      156
San Francisco
  Female       <=19    14,355    6,986   23,265  13,251      2,940      173
               20-44   85,766   10,284   52,479  23,458      3,656      526
               45-64   35,617    6,890   31,478   9,184      1,144      282
               65+     27,215    5,172   23,044   5,773        554      121
  Male         <=19    14,881    6,959   24,541  14,480      2,851      165
               20-44  105,798   11,111   48,379  31,605      3,766      782
               45-64   43,694    7,352   26,404   8,674      1,220      354
               65+     20,072    3,329   17,190   3,428        450       76
Understanding arrays 
Figure : Schematic representation of a 4-dimensional array: Year 2000 population 
estimates by age (1), race (2), sex (3), and county (4) 
Understanding arrays 
Figure: Schematic of a theoretical 5-D array (e.g., data by age (1), race (2), sex
(3), party affiliation (4), and state (5)). We can see that the field "state" has 3
levels and the field "party affiliation" has 2 levels; however, the number of age,
race, and sex levels is not apparent. Although not displayed, age levels would be
represented by row names (along 1st dimension), race levels by column names
(along 2nd dimension), and sex levels by depth names (along 3rd dimension).
Working with R data objects Working with lists, data frames, and functions 
Understanding lists 
Up to now, we have been working with atomic data objects (vector, matrix,
array). In contrast, lists, data frames, and functions are recursive data
objects. Recursive data objects have more flexibility in combining diverse
data objects into one object. A list provides the most flexibility. Think of a
list object as a collection of "bins" that can contain any R object. Lists
are very useful for collecting results of an analysis or a function into one
data object where all its contents are readily accessible by indexing.
Understanding lists 
A list is a collection of data objects without any restrictions:
> x <- c(11, 22, 34)
> y <- c("Male", "Female", "Male")
> z <- matrix(c(67, 34, 56, 22), 2, 2)
> mylist <- list(x, y, z)
> mylist
[[1]]
[1] 11 22 34
[[2]]
[1] "Male"   "Female" "Male"
[[3]]
     [,1] [,2]
[1,]   67   56
[2,]   34   22
Understanding lists 
Names can be assigned to each bin of a list.
> names(mylist) <- c("Age", "Sex", "Data")
> mylist
$Age
[1] 11 22 34
$Sex
[1] "Male"   "Female" "Male"
$Data
     [,1] [,2]
[1,]   67   56
[2,]   34   22
> mylist$Sex
[1] "Male"   "Female" "Male"
Understanding lists 
Figure: Schematic representation of a list of length four. The first bin [1]
contains a smiling face [[1]], the second bin [2] contains a flower [[2]], the
third bin [3] contains a lightning bolt [[3]], and the fourth bin [4] contains
a heart [[4]]. When indexing a list object, single brackets [·] index the bin,
and double brackets [[·]] index the bin contents. If the bin has a name, then
$name also indexes the contents.
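The difference between the bin and its contents shows up in the type of object that comes back. A small sketch with a two-bin list (the values are reused from the earlier list example; the names bin, contents, and same are illustrative):

```r
mylist <- list(Age = c(11, 22, 34), Sex = c("Male", "Female", "Male"))

bin      <- mylist[1]      # single brackets: still a list, of length 1
contents <- mylist[[1]]    # double brackets: the numeric vector inside
same     <- mylist$Age     # $name reaches the same contents

is.list(bin)
is.numeric(contents)
identical(contents, same)
```

This is why mean(mylist[[1]]) works, whereas mean(mylist[1]) does not: the latter hands mean() a list rather than numbers.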
Understanding lists 
For example, using the UGDP clinical trial data, suppose we perform
Fisher's exact test for testing the null hypothesis of independence of rows
and columns in a contingency table with fixed marginals.
> udat <- read.csv("http://www.medepi.net/data/ugdp.txt")
> tab <- xtabs(~ Status + Treatment, data = udat)[,2:1]
> tab
          Treatment
Status     Tolbutamide Placebo
  Death             30      21
  Survivor         174     184
> ftab <- fisher.test(tab)
> ftab
Understanding lists 
> ftab
        Fisher's Exact Test for Count Data
data:  tab
p-value = 0.1813
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.8013768 2.8872863
sample estimates:
odds ratio
  1.509142
The default display only shows partial results. The total results are stored
in the object ftab. Let's evaluate the structure of ftab and extract some
results:
Understanding lists 
> str(ftab)
List of 7
 $ p.value    : num 0.181
 $ conf.int   : atomic [1:2] 0.801 2.887
  ..- attr(*, "conf.level")= num 0.95
 $ estimate   : Named num 1.51
  ..- attr(*, "names")= chr "odds ratio"
 $ null.value : Named num 1
  ..- attr(*, "names")= chr "odds ratio"
 $ alternative: chr "two.sided"
 $ method     : chr "Fisher's Exact Test for Count Data"
 $ data.name  : chr "tab"
 - attr(*, "class")= chr "htest"
Understanding lists 
Let's index some of the bins from ftab.
> ftab$estimate
odds ratio
    1.5091
> ftab$conf.int
[1] 0.80138 2.88729
attr(,"conf.level")
[1] 0.95
> ftab$conf.int[2]
[1] 2.887286
> ftab$p.value
[1] 0.18126
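Because the test result is a list, the pieces can be gathered into a compact summary. The sketch below assumes the table is typed in directly with matrix() rather than read from the web, so that it runs offline; the name results is illustrative.

```r
tab <- matrix(c(30, 174, 21, 184), 2, 2,
              dimnames = list(Status = c("Death", "Survivor"),
                              Treatment = c("Tolbutamide", "Placebo")))
ftab <- fisher.test(tab)

names(ftab)                      # the bins available for indexing

# Collect the estimate, confidence limits, and P value in one named vector.
results <- c(ftab$estimate,
             lower = ftab$conf.int[1],
             upper = ftab$conf.int[2],
             p = ftab$p.value)
round(results, 3)
```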
Understanding data frames 
A data frame is a list with a 2-dimensional (tabular) structure.
Epidemiologists are very experienced working with data frames where each
row usually represents data collected on individual subjects (also called
records or observations) and columns represent fields for each type of data
collected (also called variables).
> subjno <- c(1, 2, 3, 4)
> age <- c(34, 56, 45, 23)
> sex <- c("Male", "Male", "Female", "Male")
> case <- c("Yes", "No", "No", "Yes")
> mydat <- data.frame(subjno, age, sex, case)
> mydat
  subjno age    sex case
1      1  34   Male  Yes
2      2  56   Male   No
3      3  45 Female   No
4      4  23   Male  Yes
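Since a data frame is both a list and a table, it accepts list-style and matrix-style indexing. A short sketch using the same four subjects (the names cases and mean.age.cases are illustrative):

```r
mydat <- data.frame(subjno = 1:4,
                    age = c(34, 56, 45, 23),
                    sex = c("Male", "Male", "Female", "Male"),
                    case = c("Yes", "No", "No", "Yes"))

is.list(mydat)          # a data frame is a list of equal-length columns

mydat$age               # list-style: by name
mydat[["age"]]          # list-style: double brackets
mydat[2, "age"]         # matrix-style: row 2 of the age column

# Logical indexing on rows: keep only the cases.
cases <- mydat[mydat$case == "Yes", ]
mean.age.cases <- mean(cases$age)
```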
Understanding data frames 
Epidemiologists are familiar with tabular data sets where each row is a
record and each column is a field. A record can be data collected on
individuals or groups. We usually refer to the field name as a variable
(e.g., age, gender, ethnicity). Fields can contain numeric or character
data. In R, these types of data sets are handled by data frames. Each
column of a data frame is usually either a factor or a numeric vector,
although it can also be a complex, character, or logical vector. Data frames
have the functionality of matrices and lists. For example, here are the first
10 rows of the infert data set, a matched case-control study published in
1976 that evaluated whether infertility was associated with prior
spontaneous or induced abortions.
Understanding data frames 
> data(infert)
> str(infert)
'data.frame': 248 obs. of 8 variables:
 $ education     : Factor w/ 3 levels "0-5yrs",..: 1 1 ...
 $ age           : num NA 45 NA 23 35 36 23 32 21 28 ...
 $ parity        : num 6 1 6 4 3 4 1 2 1 2 ...
 $ induced       : num 1 1 2 2 1 2 0 0 0 0 ...
 $ case          : num 1 1 1 1 1 1 1 1 1 1 ...
 $ spontaneous   : num 2 0 0 0 1 1 0 0 1 0 ...
 $ stratum       : int 1 2 3 4 5 6 7 8 9 10 ...
 $ pooled.stratum: num 3 1 4 2 32 36 6 22 5 19 ...
Understanding data frames 
> infert[1:10, 1:6]
   education age parity induced case spontaneous
1     0-5yrs  NA      6       1    1           2
2     0-5yrs  45      1       1    1           0
3     0-5yrs  NA      6       2    1           0
4     0-5yrs  23      4       2    1           0
5    6-11yrs  35      3       1    1           1
6    6-11yrs  36      4       2    1           1
7    6-11yrs  23      1       0    1           0
8    6-11yrs  32      2       0    1           0
9    6-11yrs  21      1       0    1           1
10   6-11yrs  28      2       0    1           0
Understanding data frames 
The fields are vectors. Let's explore a few of these vectors to see
what we can learn about their structure in R.
> #age variable
> infert$age
  [1] 26 42 39 34 35 36 23 32 21 28 29 37 31 29 31 27 30 26
...
[235] 25 32 25 31 38 26 31 31 25 31 34 35 29 23
> mode(infert$age)
[1] "numeric"
> class(infert$age)
[1] "numeric"
Understanding data frames 
> # education variable
> infert$education
  [1] 0-5yrs  0-5yrs  0-5yrs  0-5yrs  6-11yrs 6-11yrs
...
[247] 12+ yrs 12+ yrs
Levels: 0-5yrs 6-11yrs 12+ yrs
> mode(infert$education)
[1] "numeric"
> class(infert$education)
[1] "factor"
Understanding data frames and factors 
A factor is R's representation of categorical fields and keeps track of all
possible category levels.
> sex <- sample(c("Male", "Female"), 100, replace = TRUE)
> mode(sex); class(sex)
[1] "character"
[1] "character"
> table(sex)
sex
Female   Male
    51     49
> sexf <- factor(sex, levels = c("Male", "Female", "Transgender"))
> table(sexf)
sexf
       Male      Female Transgender
         49          51           0
> mode(sexf); class(sexf)
[1] "numeric"
[1] "factor"
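The numeric mode of a factor comes from how it is stored: integer codes plus a levels attribute that maps each code to a label. A deterministic sketch (a fixed vector is used instead of sample() so the result is reproducible; codes and counts are illustrative names):

```r
sex  <- c("Male", "Female", "Male", "Male")
sexf <- factor(sex, levels = c("Male", "Female", "Transgender"))

codes <- as.integer(sexf)    # position of each value in the levels
levels(sexf)                 # the labels, in the order given

# table() reports every level, including those with zero observations.
counts <- table(sexf)
```

Setting the levels explicitly also controls the reference category and the display order in later tables and models.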
Understanding data frames and lists 
The infert data come from a matched case-control study evaluating the
association between history of abortions and infertility. Because the data are
matched, we use conditional logistic regression.
> library(survival)   # provides clogit() and strata()
> mod3 <- clogit(case ~ spontaneous + induced +
+     strata(stratum), data = infert)
> mod3
Call:
clogit(case ~ spontaneous + induced + strata(stratum), data =
            coef exp(coef) se(coef)    z       p
spontaneous 1.99      7.29    0.352 5.63 1.8e-08
induced     1.41      4.09    0.361 3.91 9.4e-05
> summod3 <- summary(mod3)
Understanding data frames and lists 
> summod3
  n= 248
              coef exp(coef) se(coef)     z Pr(>|z|)
spontaneous 1.9859    7.2854   0.3524 5.635 1.75e-08 ***
induced     1.4090    4.0919   0.3607 3.906 9.38e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
            exp(coef) exp(-coef) lower .95 upper .95
spontaneous     7.285     0.1373     3.651    14.536
induced         4.092     0.2444     2.018     8.298
Rsquare= 0.193   (max possible= 0.519 )
Likelihood ratio test= 53.15  on 2 df,   p=2.869e-12
Wald test            = 31.84  on 2 df,   p=1.221e-07
Score (logrank) test = 48.44  on 2 df,   p=3.032e-11
Understanding data frames and lists 
> str(summod3)
List of 12
 $ call        : language coxph(formula = Surv(rep(1, 248L), case) ~ spontaneous
 $ fail        : NULL
 $ na.action   : NULL
 $ n           : int 248
 $ loglik      : num [1:2] -90.8 -64.2
 $ coefficients: num [1:2, 1:5] 1.986 1.409 7.285 4.092 0.352 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "spontaneous" "induced"
  .. ..$ : chr [1:5] "coef" "exp(coef)" "se(coef)" "z" ...
 $ conf.int    : num [1:2, 1:4] 7.285 4.092 0.137 0.244 3.651 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "spontaneous" "induced"
  .. ..$ : chr [1:4] "exp(coef)" "exp(-coef)" "lower .95" "upper .95"
 $ logtest     : Named num [1:3] 5.32e+01 2.00 2.87e-12
... [output truncated]
Understanding data frame and lists 
> summod3$coef
                coef exp(coef)  se(coef)        z     Pr(>|z|)
spontaneous 1.985876  7.285423 0.3524435 5.634592 1.754734e-08
induced     1.409012  4.091909 0.3607124 3.906191 9.376245e-05
> summod3$coef[1, ]
        coef    exp(coef)     se(coef)            z     Pr(>|z|)
1.985876e+00 7.285423e+00 3.524435e-01 5.634592e+00 1.754734e-08
> summod3$coef[ ,2]
spontaneous     induced
   7.285423    4.091909
> summod3$coef[1,2]
[1] 7.285423
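The conf.int bin listed in the str() output can be indexed by column name in the same way, which is handy for building a results table. This sketch assumes the survival package is installed; or.ci is an illustrative name.

```r
library(survival)   # clogit() and strata() live here

mod3 <- clogit(case ~ spontaneous + induced + strata(stratum), data = infert)
summod3 <- summary(mod3)

# Odds ratios with 95% confidence limits, selected by column name.
or.ci <- summod3$conf.int[, c("exp(coef)", "lower .95", "upper .95")]
round(or.ci, 2)
```

Indexing by name rather than by position keeps the code readable and protects it if the column order ever changes.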
Understanding functions 
Risk Ratio confidence interval from baby Rothman, p. 135 
rr.wald <- function(x, conf.level = 0.95){
  ## prepare input
  ## x is a 2 x 2 table: rows = exposure groups, column 1 = outcome events
  x1 <- x[1,1]; n1 <- sum(x[1,])
  x0 <- x[2,1]; n0 <- sum(x[2,])
  ## do calculations
  p1 <- x1/n1   ##risk among exposed
  p0 <- x0/n0   ##risk among unexposed
  RR <- p1/p0
  logRR <- log(RR)
  SElogRR <- sqrt(1/x1 - 1/n1 + 1/x0 - 1/n0)
  Z <- qnorm(0.5*(1 + conf.level))
  LCL <- exp(logRR - Z*SElogRR)
  UCL <- exp(logRR + Z*SElogRR)
  ##collect output
  list(x = x, risks = c(p1 = p1, p0 = p0), risk.ratio = RR,
       conf.int = c(LCL, UCL), conf.level = conf.level)
}
Understanding functions 
Run the rr.wald function on the UGDP RCT data. Because rr.wald treats rows as
exposure groups and tab has Status in rows, the risks below are computed
within outcome rows (30/51 and 174/358); pass t(tab) to compare treatment groups.
> tab
          Treatment
Status     Tolbutamide Placebo
  Death             30      21
  Survivor         174     184
> rr.wald(tab)
$x
          Treatment
Status     Tolbutamide Placebo
  Death             30      21
  Survivor         174     184
$risks
       p1        p0
0.5882353 0.4860335
$risk.ratio
[1] 1.210277
$conf.int
[1] 0.9396227 1.5588927
$conf.level
[1] 0.95
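Since rr.wald reads the exposure groups from the rows of its input, transposing the table puts Treatment in the rows and yields the treatment-specific risks computed earlier with R as a calculator. The sketch below repeats the function so that it runs on its own and assumes the table is typed in directly rather than read from the web; res is an illustrative name.

```r
rr.wald <- function(x, conf.level = 0.95) {
  # x: 2 x 2 table, rows = exposure groups, column 1 = outcome events
  x1 <- x[1, 1]; n1 <- sum(x[1, ])
  x0 <- x[2, 1]; n0 <- sum(x[2, ])
  p1 <- x1 / n1                      # risk among exposed
  p0 <- x0 / n0                      # risk among unexposed
  RR <- p1 / p0
  SElogRR <- sqrt(1/x1 - 1/n1 + 1/x0 - 1/n0)
  Z <- qnorm(0.5 * (1 + conf.level))
  list(x = x, risks = c(p1 = p1, p0 = p0), risk.ratio = RR,
       conf.int = exp(log(RR) + c(-1, 1) * Z * SElogRR),
       conf.level = conf.level)
}

tab <- matrix(c(30, 174, 21, 184), 2, 2,
              dimnames = list(Status = c("Death", "Survivor"),
                              Treatment = c("Tolbutamide", "Placebo")))

# t(tab) has Treatment in rows and Status in columns.
res <- rr.wald(t(tab))
res$risks
res$risk.ratio
res$conf.int
```

With this orientation the risks are 30/204 and 21/205, and the risk ratio agrees with the value of about 1.44 obtained from the matrix calculations.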
The epitools package 
The following epidemiologists, directly or indirectly, contributed to
'epitools':
Tomás Aragón, MD, DrPH, UC Berkeley
Michael P. Fay, PhD, Mathematical Statistician, National Institute of
Allergy and Infectious Diseases
Wayne Enanoria, PhD, MPH, UC Berkeley
Travis Porco, PhD, MPH, UC San Francisco
Michael Samuel, DrPH, California Department of Public Health
Using epitools for outbreak investigations 
Using the epitab function (only arguments are displayed):
epitab(x, y = NULL,
       method = c("oddsratio", "riskratio", "rateratio"),
       conf.level = 0.95,
       rev = c("neither", "rows", "columns", "both"),
       oddsratio = c("wald", "fisher", "midp", "small"),
       riskratio = c("wald", "boot", "small"),
       rateratio = c("wald", "midp"),
       pvalue = c("fisher.exact", "midp.exact", "chi2"),
       correction = FALSE,
       verbose = FALSE)
Hypothesis testing using Oswego: Passing 2 vectors 
> library(epitools)   #load 'epitools' package
> data(oswego)        #load Oswego dataset
> attach(oswego)      #attach dataset
> round(epitab(jello, ill, method = "riskratio")$tab, 2)
         Outcome
Predictor  N   p0  Y   p1 riskratio lower upper p.value
        N 22 0.42 30 0.58      1.00    NA    NA      NA
        Y  7 0.30 16 0.70      1.21  0.84  1.72    0.44
> round(epitab(jello, ill, method = "oddsratio")$tab, 2)
         Outcome
Predictor  N   p0  Y   p1 oddsratio lower upper p.value
        N 22 0.76 30 0.65      1.00    NA    NA      NA
        Y  7 0.24 16 0.35      1.68  0.59  4.76    0.44
> detach(oswego)      #detach dataset
Hypothesis testing using Oswego: Passing a table 
> jello.tab1
     ill
jello  N  Y
    N 22 30
    Y  7 16
> round(epitab(jello.tab1)$tab, 2)
     ill
jello  N   p0  Y   p1 oddsratio lower upper p.value
    N 22 0.76 30 0.65      1.00    NA    NA      NA
    Y  7 0.24 16 0.35      1.68  0.59  4.76    0.44
> round(epitab(jello.tab1, method = "risk")$tab, 2)
     ill
jello  N   p0  Y   p1 riskratio lower upper p.value
    N 22 0.42 30 0.58      1.00    NA    NA      NA
    Y  7 0.30 16 0.70      1.21  0.84  1.72    0.44
Hypothesis testing using Oswego: Passing one vector 
> round(epitab(c(22, 30, 7, 16))$tab, 2)
          Outcome
Predictor  Disease1   p0 Disease2   p1 oddsratio lower upper p.value
  Exposed1       22 0.76       30 0.65      1.00    NA    NA      NA
  Exposed2        7 0.24       16 0.35      1.68  0.59  4.76    0.44
> round(epitab(c(22, 30, 7, 16), method = "risk")$tab, 2)
          Outcome
Predictor  Disease1   p0 Disease2   p1 riskratio lower upper p.value
  Exposed1       22 0.42       30 0.58      1.00    NA    NA      NA
  Exposed2        7 0.30       16 0.70      1.21  0.84  1.72    0.44
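Judging from the printed output, the four counts are read row by row into a 2 x 2 table. The sketch below mirrors that layout in base R and recomputes the odds ratio by hand, assuming a Wald interval on the log scale; the object names (counts, tab, or, or_ci) are illustrative and not part of epitools.

```r
counts <- c(22, 30, 7, 16)

# Fill by row so the layout matches the printed table (assumed from output)
tab <- matrix(counts, nrow = 2, byrow = TRUE,
              dimnames = list(Predictor = c("Exposed1", "Exposed2"),
                              Outcome   = c("Disease1", "Disease2")))

or <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])  # cross-product ratio
se_log_or <- sqrt(sum(1 / tab))                          # Wald SE, log scale
or_ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se_log_or)

# Rounds to oddsratio 1.68, lower 0.59, upper 4.76
round(c(oddsratio = or, lower = or_ci[1], upper = or_ci[2]), 2)
```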
Summary 
1 Background 
Cost 
Quality 
Community 
2 Getting started with R 
Full-function calculator/spreadsheet 
Extensible statistical packages 
High quality graphics tool 
Multi-use programming language 
3 Working with R data objects 
Atomic vs. recursive data objects 
Working with vectors, matrices, & arrays
Working with lists, data frames, and functions 
Understanding R for Epidemiologists

  • 1. Understanding R for Epidemiologists TomĀ“as J. AragĀ“on, MD, DrPH Faculty, Division of Epidemiology UC Berkeley School of Public Health Health Officer, City & County of San Francisco Director, Population Health Division (PHD) San Francisco Department of Public Health Blog: http://www.medepi.com Email: aragon@berkeley.edu September 8, 2014 TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 1 / 60
  • 2. Outline 1 Background Cost Quality Community 2 Getting started with R Full-function calculator/spreadsheet Extensible statistical packages High quality graphics tool Multi-use programming language 3 Working with R data objects Atomic vs. recursive data objects Working with vectors, matrices, & arrays Working with lists, data frames, and functions TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 2 / 60
  • 3. Background Background: Major issues Cost Quality Community Functionality TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 3 / 60
  • 4. Background Cost Cost: Open Source vs. Proprietary Software Costs of software Costs of multi-platforms Costs of education and training Costs of adding solutions (e.g., packages) Costs of solving problems and sharing solutions TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 4 / 60
  • 5. Background Quality Quality: Open Source vs. Proprietary Software Core Development Team Large pool of users/testers Quality control process for packages Bug fixes based on need/demand, not profits TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 5 / 60
  • 6. Background Community Community: Open Source vs. Proprietary Software Large community of users Transparent development process Growing number of books and trainings Growing number of free tutorials and manuals TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 6 / 60
  • 7. Background Community Current R contributors Douglas Bates John Chambers Peter Dalgaard Seth Falcon Robert Gentleman Kurt Hornik Stefano Iacus Ross Ihaka Friedrich Leisch Uwe Ligges Thomas Lumley Martin Maechler Duncan Murdoch Paul Murrell Martyn Plummer Brian Ripley Deepayan Sarkar Duncan Temple Lang Luke Tierney Simon Urbanek Source: http://www.r-project.org/contributors.html TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 7 / 60
  • 8. Getting started with R What is R? Full-function calculator/spreadsheet Extensible statistical packages High-quality graphics tool Multi-use programming language TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 8 / 60
  • 9. Getting started with R Full-function calculator/spreadsheet Full-function calculator: Selected math operators Operator Description Try these examples + addition 5+4 āˆ’ subtraction 5-4 multiplication 5*4 / division 5/4 Ė† exponentiation 5^4 āˆ’ unary minus (change current sign) -5 abs absolute value abs(-23) exp exponentiation (e to a power) exp(8) log logarithm (default is natural log) log(exp(8)) sqrt square root sqrt(64) %/% integer divide 10%/%3 %% modulus 10%%3 %*% matrix multiplication xx - matrix(1:4, 2, 2) xx%*%c(1, 1) c(1, 1)%*%xx TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 9 / 60
  • 10. Getting started with R Extensible statistical packages Extensible statistical packages Generalized Linear Models (Base) Linear regression Logistic regression Poisson regression Cox Proportional Hazard models (Survival) Cox PH regression Conditional logistic regression (matched case-control studies) Meta-analysis (meta) Complex survey analysis (survey) Epidemiology packages epitools epicalc epibasix epiR TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 10 / 60
  • 11. Getting started with R High quality graphics tool Graphics display of sample size curves Alternative distribution H1 Power (1 - b) Null distribution H0 b a 2 -Z1-a 2 m0 Z1-a 2 m1 TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 11 / 60
  • 12. Getting started with R High quality graphics tool Graphics display of P value function 0.2 0.5 1.0 2.0 2.9 5.0 10.0 20.0 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.05 0 0 10 20 30 40 50 60 70 80 90 95 100 Confidence level (%) Rate Ratio Pāˆ’value Null hypothesis Median unbiased estimate 95% Lower Confidence Limit = 0.74 95% Upper Confidence Limit = 21.0 95% Confidence Interval TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 12 / 60
  • 13. Getting started with R High quality graphics tool Graphical display of multiple linear regression 0 10 20 30 40 50 10 20 30 40 50 60 70 80 90 0 10 20 30 40 50 x1 x2 y TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 13 / 60
  • 14. Getting started with R High quality graphics tool Epidemic curve using Color Brewer colors Unknown WNF WNND 0 20 40 60 80 West Nile Virus Human Cases Reported in California by Disease Week as of December 14, 2004 Cases + Bird 2/24 + Horse 6/20 + Chicken 5/17 + Mosquito 4/14 52 03 06 09 12 15 18 21 24 27 30 33 36 39 42 45 48 51 Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Disease Week Calendar Month, 2004 TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 14 / 60
  • 15. Getting started with R Multi-use programming language Multi-use programming language Vectorized computations Functional programming language Object-oriented programming Text processing (e.g., using regular expressions) Links to C, Fortran, etc. TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 15 / 60
  • 16. Working with R data objects Atomic vs. recursive data objects Data objects in R Object types Vector Matrix Array List Data frame Function Operations Create Name Index Replace Manipulate Do computations TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 16 / 60
  • 17. Working with R data objects Atomic vs. recursive data objects Summary of types of data objects in R Data object Possible modea Default class Atomic vector character, numeric, logical NULL matrix character, numeric, logical NULL array character, numeric, logical NULL Recursive list list NULL data frame list data frame function function NULL a We are ignoring complex numbers TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 17 / 60
  • 18. Working with R data objects Working with vectors, matrices, arrays Understanding vectors A vector is a collection of like elements without dimensions1. The vector elements are all of the same mode (either character, numeric, or logical). y - c(Pedro, Paulo, Maria) y [1] Pedro Paulo Maria x - c(1, 2, 3, 4, 5) x [1] 1 2 3 4 5 x 3 [1] TRUE TRUE FALSE FALSE FALSE 1In other programming languages, vectors are either row vectors or column vectors. R does not make this distinction until it is necessary. TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 18 / 60
  • 19. Working with R data objects Working with vectors, matrices, arrays Understanding vectors: Indexing Indexing by Try these examples Position x - c(chol=234, sbp=148, dbp=78, age=54) x[2] #positions to include x[c(2, 3)] x[-c(1, 3, 4)] #positions to exclude x[-c(1, 4)] Name x[sbp] x[c(sbp, dbp)] Logical x 100 x[x 100] (x 150) (x 70) bp - (x 150) (x 70) x[bp] TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 19 / 60
  • 20. Working with R data objects Working with vectors, matrices, arrays Understanding vectors: Replacement Replacing by Try these examples Position x - c(chol=234, sbp=148, dbp=78, age=54) x[1] x[1] - 250 x Name x[sbp] x[sbp] - 150 x Logical x[x100] x[x100] - NA x TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 20 / 60
  • 21. Working with R data objects Working with vectors, matrices, arrays Understanding vectors: Replacement x - c(chol = 234, sbp = 148, dbp = 78, age = 54) x[1] - 250 #by position x chol sbp dbp age 250 148 78 54 x[sbp] - 150 #by name x chol sbp dbp age 250 150 78 54 x[x100] dbp age 78 54 x[x100] - NA #by logical x chol sbp dbp age 250 150 NA NA TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 21 / 60
  • 22. Working with R data objects Working with vectors, matrices, arrays Understanding matrices A matrix is a collection of like elements organized into a 2-dimensional (tabular) data object. Matrix elements can be either numeric, character, or logical. We can think of a matrix as a vector with a 2-dimensional structure. Contingency tables in epidemiology are represented in R as numeric matrices or arrays. An array is the generalization of matrices to 3 or more dimensions (commonly known as stratified tables). We cover arrays later, for now we will focus on 2-dimensional tables. TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 22 / 60
  • 23. Working with R data objects Working with vectors, matrices, arrays Understanding matrices When R returns a matrix the [n,] indicates the nth row and [,m] indicates the mth column. x - c(a, b, c, d) y - matrix(x, 2, 2) y [,1] [,2] [1,] a c [2,] b d y[1,] [1] a c y[,2] [1] c d TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 23 / 60
  • 24. Working with R data objects Working with vectors, matrices, arrays Understanding matrices x - c(30, 21, 170, 180) # creating y - matrix(x, 2, 2, byrow = TRUE) # creating y [,1] [,2] [1,] 30 21 [2,] 170 180 rownames(y) - c(Deaths, Survivors) # naming colnames(y) - c(Tolbutamide, Placebo) # naming y[2, 1] - 174 # replace by position y[Survivors, Placebo] - 184 # replace by name y Tolbutamide Placebo Deaths 30 21 Survivors 174 184 TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 24 / 60
  • 25. Working with R data objects Working with vectors, matrices, arrays Understanding matrices Consider the 2 Ɨ 2 table of crude data in Table. In this randomized clinical trial (RCT), diabetic subjects were randomly assigned to receive either tolbutamide, an oral hypoglycemic drug, or placebo. Because this was a prospective study we can calculate risks, odds, a risk ratio, and an odds ratio. We will do this using R as a calculator. Table : Deaths among subjects who received tolbutamide and placebo in the Unversity Group Diabetes Program (1970) Tolbutamide Placebo Deaths 30 21 Survivors 174 184 TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 25 / 60
  • 26. Working with R data objects Working with vectors, matrices, arrays Understanding matrices dat - matrix(c(30, 174, 21, 184), 2, 2) rownames(dat) - c(Deaths, Survivors) colnames(dat) - c(Tolbutamide, Placebo) coltot - apply(dat, 2, sum) #column totals risks - dat[Deaths,]/coltot risk.ratio - risks/risks[2] #risk ratio odds - risks/(1-risks) odds.ratio - odds/odds[2] #odds ratio TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 26 / 60
  • 27. Working with R data objects Working with vectors, matrices, arrays Understanding matrices # display results dat Tolbutamide Placebo Deaths 30 21 Survivors 174 184 rbind(risks, risk.ratio, odds, odds.ratio) Tolbutamide Placebo risks 0.1470588 0.1024390 risk.ratio 1.4355742 1.0000000 odds 0.1724138 0.1141304 odds.ratio 1.5106732 1.0000000 TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 27 / 60
  • 28. Working with R data objects Working with vectors, matrices, arrays Understanding arrays An array is a collection of like elements organized into a n-dimensional data object. When R returns an array the [n,,] indicates the nth row and [,m,] indicates the mth column, and so on. x - 1:8 y - array(x, dim=c(2, 2, 2)) y , , 1 [,1] [,2] [1,] 1 3 [2,] 2 4 , , 2 [,1] [,2] [1,] 5 7 [2,] 6 8 TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 28 / 60
  • 29. Working with R data objects Working with vectors, matrices, arrays Understanding arrays While a matrix is a 2-dimensional table of like elements, an array is the generalization of matrices to n-dimensions. Stratified contingency tables in epidemiology are represented as array data objects in R. For example, the RCT previously shown comparing the number deaths among diabetic subjects that received tolbutamide vs. placebo is now also stratified by age group: Table : Deaths among subjects who received tolbutamide and placebo in the Unversity Group Diabetes Program (1970), stratifying by age Age55 Age55 Combined Tolb Plac Tolb Plac Tolb Plac Deaths 8 5 22 16 30 21 Survivors 98 115 76 69 174 184 Total 106 120 98 85 204 205 TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 29 / 60
  • 30. Working with R data objects Working with vectors, matrices, arrays Understanding arrays tdat - c(8, 98, 5, 115, 22, 76, 16, 69) tdat - array(tdat, c(2, 2, 2)) dimnames(tdat) - list(Outcome=c(Deaths, Survivors), + Treatment=c(Tolbutamide, Placebo), + Age group=c(Age55, Age=55)) tdat , , Age group = Age55 Treatment Outcome Tolbutamide Placebo Deaths 8 5 Survivors 98 115 , , Age group = Age=55 Treatment Outcome Tolbutamide Placebo Deaths 22 16 Survivors 76 69 TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 30 / 60
  • 31. Working with R data objects Working with vectors, matrices, arrays Table : Example of 4-dimensional array: Year 2000 population estimates by age, ethnicity, sex, and county Ethnicity County/Sex Age White AfrAmer AsianPI Latino Multirace AmerInd Alameda Female =19 58,160 31,765 40,653 49,738 10,120 839 20ā€“44 112,326 44,437 72,923 58,553 7,658 1,401 45ā€“64 82,205 24,948 33,236 18,534 2,922 822 65+ 49,762 12,834 16,004 7,548 1,014 246 Male =19 61,446 32,277 42,922 53,097 10,102 828 20ā€“44 115,745 36,976 69,053 69,233 6,795 1,263 45ā€“64 81,332 20,737 29,841 17,402 2,506 687 65+ 33,994 8,087 11,855 5,416 711 156 San Francisco Female =19 14,355 6,986 23,265 13,251 2,940 173 20ā€“44 85,766 10,284 52,479 23,458 3,656 526 45ā€“64 35,617 6,890 31,478 9,184 1,144 282 65+ 27,215 5,172 23,044 5,773 554 121 Male =19 14,881 6,959 24,541 14,480 2,851 165 20ā€“44 105,798 11,111 48,379 31,605 3,766 782 45ā€“64 43,694 7,352 26,404 8,674 1,220 354 65+ 20,072 3,329 17,190 3,428 450 76 TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 31 / 60
  • 32. Working with R data objects Working with vectors, matrices, arrays Understanding arrays Figure : Schematic representation of a 4-dimensional array: Year 2000 population estimates by age (1), race (2), sex (3), and county (4) TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 32 / 60
  • 33. Working with R data objects Working with vectors, matrices, arrays Understanding arrays Figure : Schematic of a theoretical 5-D array (e.g., data by age (1), race (2), sex (3), party affiliation (4), and state (5)). We can see that the field ā€œstateā€ has 3 levels, and the field ā€œparty affiliationā€ has 2 levels; however, it is not apparent the number of age, race, and sex levels. Although not displayed, age levels would be represented by row names (along 1st dimension), race levels by column names (along 2nd dimension), and sex levels by depth names (along 3rd dimension). TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 33 / 60
  • 34. Working with R data objects Working with lists, data frames, and functions Understanding lists Up to now, we have been working with atomic data objects (vector, matrix, array). In contrast, lists, data frames, and functions are recursive data objects. Recursive data objects have more flexibility in combining diverse data objects into one object. A list provides the most flexibility. Think of a list object as a collection of ā€œbinsā€ that can contain any R object. Lists are very useful for collecting results of an analysis or a function into one data object where all its contents are readily accessible by indexing. TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 34 / 60
  • 35. Working with R data objects Working with lists, data frames, and functions Understanding lists A list is a collection of data objects without any restrictions: x - c(11, 22, 34) y - c(Male, Female, Male) z - matrix(c(67, 34, 56,22), 2, 2) mylist - list(x, y, z) mylist [[1]] [1] 11 22 34 [[2]] [1] Male Female Male [[3]] [,1] [,2] [1,] 67 56 [2,] 34 22 TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 35 / 60
  • 36. Working with R data objects Working with lists, data frames, and functions Understanding lists Names can be assigned to each bin of a list. names(mylist) - c(Age, Sex, Data) mylist $Age [1] 11 22 34 $Sex [1] Male Female Male $Data [,1] [,2] [1,] 67 56 [2,] 34 22 mylist$Sex [1] Male Female Male TomĀ“as AragĀ“on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 36 / 60
  • 37. Understanding lists
    Figure: Schematic representation of a list of length four. The first bin [1] contains a smiling face [[1]], the second bin [2] contains a flower [[2]], the third bin [3] contains a lightning bolt [[3]], and the fourth bin [4] contains a heart [[4]]. When indexing a list object, single brackets [·] index the bin, and double brackets [[·]] index the bin contents. If the bin has a name, then $name also indexes the contents.
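The difference between indexing a bin and indexing its contents can be checked directly. This is a minimal sketch; the list below is a hypothetical two-bin example, not part of the original slides.

```r
# Hypothetical list used only to illustrate the three indexing forms.
mylist <- list(Age = c(11, 22, 34), Sex = c("Male", "Female", "Male"))

bin      <- mylist[2]     # single brackets: a list of length 1 (the bin)
contents <- mylist[[2]]   # double brackets: the character vector inside
byname   <- mylist$Sex    # $name: the same contents as mylist[[2]]

class(bin)                # "list"
class(contents)           # "character"
identical(contents, byname)
```

Because `mylist[2]` is still a list, functions that expect a vector (for example `table()` or `mean()`) generally need the `[[ ]]` or `$` form.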
  • 38. Understanding lists
    For example, using the UGDP clinical trial data, suppose we perform Fisher's exact test of the null hypothesis of independence of rows and columns in a contingency table with fixed marginals.
        > udat <- read.csv("http://www.medepi.net/data/ugdp.txt")
        > tab <- xtabs(~ Status + Treatment, data = udat)[, 2:1]
        > tab
                  Treatment
        Status     Tolbutamide Placebo
          Death             30      21
          Survivor         174     184
        > ftab <- fisher.test(tab)
        > ftab
  • 39. Understanding lists
        > ftab
                Fisher's Exact Test for Count Data
        data:  tab
        p-value = 0.1813
        alternative hypothesis: true odds ratio is not equal to 1
        95 percent confidence interval:
         0.8013768 2.8872863
        sample estimates:
        odds ratio
          1.509142
    The default display shows only partial results. The complete results are stored in the object ftab. Let's evaluate the structure of ftab and extract some results:
  • 40. Understanding lists
        > str(ftab)
        List of 7
         $ p.value    : num 0.181
         $ conf.int   : atomic [1:2] 0.801 2.887
          ..- attr(*, "conf.level")= num 0.95
         $ estimate   : Named num 1.51
          ..- attr(*, "names")= chr "odds ratio"
         $ null.value : Named num 1
          ..- attr(*, "names")= chr "odds ratio"
         $ alternative: chr "two.sided"
         $ method     : chr "Fisher's Exact Test for Count Data"
         $ data.name  : chr "tab"
         - attr(*, "class")= chr "htest"
  • 41. Understanding lists
    Let's index some of the bins from ftab.
        > ftab$estimate
        odds ratio
            1.5091
        > ftab$conf.int
        [1] 0.80138 2.88729
        attr(,"conf.level")
        [1] 0.95
        > ftab$conf.int[2]
        [1] 2.887286
        > ftab$p.value
        [1] 0.18126
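Once the bins are known, selected results can be gathered into a single object. The sketch below assumes the UGDP counts shown in the table earlier and enters them by hand so that it does not depend on downloading the data file. The name `res` and its element labels are hypothetical.

```r
# Counts copied from the UGDP table (Status in rows, Treatment in columns).
tab <- matrix(c(30, 174, 21, 184), nrow = 2,
              dimnames = list(Status = c("Death", "Survivor"),
                              Treatment = c("Tolbutamide", "Placebo")))
ftab <- fisher.test(tab)

# Collect selected bins into one named numeric vector.
res <- c(or    = unname(ftab$estimate),
         lower = ftab$conf.int[1],
         upper = ftab$conf.int[2],
         p     = ftab$p.value)
round(res, 3)

names(ftab)   # every bin name that can be used with $ or [[ ]]
```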
  • 42. Understanding data frames
    A data frame is a list with a 2-dimensional (tabular) structure. Epidemiologists are experienced in working with data frames: each row usually represents data collected on an individual subject (also called a record or observation), and each column represents a field for one type of data collected (also called a variable).
        > subjno <- c(1, 2, 3, 4)
        > age <- c(34, 56, 45, 23)
        > sex <- c("Male", "Male", "Female", "Male")
        > case <- c("Yes", "No", "No", "Yes")
        > mydat <- data.frame(subjno, age, sex, case)
        > mydat
          subjno age    sex case
        1      1  34   Male  Yes
        2      2  56   Male   No
        3      3  45 Female   No
        4      4  23   Male  Yes
  • 43. Understanding data frames
    Epidemiologists are familiar with tabular data sets where each row is a record and each column is a field. A record can be data collected on individuals or groups. We usually refer to the field name as a variable (e.g., age, gender, ethnicity). Fields can contain numeric or character data. In R, these types of data sets are handled by data frames. Each column of a data frame is usually either a factor or a numeric vector, although it can also be a complex, character, or logical vector. Data frames have the functionality of both matrices and lists.
    For example, here are the first 10 rows of the infert data set, a matched case-control study published in 1976 that evaluated whether infertility was associated with prior spontaneous or induced abortions.
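The statement that data frames behave like both matrices and lists can be illustrated with a small sketch. The data frame below mirrors the four-subject example from the earlier slide and is for illustration only.

```r
# Small illustrative data frame (same values as the four-subject example).
mydat <- data.frame(subjno = 1:4,
                    age    = c(34, 56, 45, 23),
                    sex    = c("Male", "Male", "Female", "Male"),
                    case   = c("Yes", "No", "No", "Yes"))

mydat[2, "age"]                # matrix-style: row 2, column "age"
mydat[mydat$case == "Yes", ]   # matrix-style: logical row index
mydat$age                      # list-style: contents of the "age" bin
mydat[["sex"]]                 # list-style: same idea with double brackets
is.list(mydat); dim(mydat)     # a list that also has dimensions
```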
  • 44. Understanding data frames
        > data(infert)
        > str(infert)
        'data.frame':   248 obs. of  8 variables:
         $ education     : Factor w/ 3 levels "0-5yrs",..: 1 1 ...
         $ age           : num  NA 45 NA 23 35 36 23 32 21 28 ...
         $ parity        : num  6 1 6 4 3 4 1 2 1 2 ...
         $ induced       : num  1 1 2 2 1 2 0 0 0 0 ...
         $ case          : num  1 1 1 1 1 1 1 1 1 1 ...
         $ spontaneous   : num  2 0 0 0 1 1 0 0 1 0 ...
         $ stratum       : int  1 2 3 4 5 6 7 8 9 10 ...
         $ pooled.stratum: num  3 1 4 2 32 36 6 22 5 19 ...
  • 45. Understanding data frames
        > infert[1:10, 1:6]
           education age parity induced case spontaneous
        1     0-5yrs  NA      6       1    1           2
        2     0-5yrs  45      1       1    1           0
        3     0-5yrs  NA      6       2    1           0
        4     0-5yrs  23      4       2    1           0
        5    6-11yrs  35      3       1    1           1
        6    6-11yrs  36      4       2    1           1
        7    6-11yrs  23      1       0    1           0
        8    6-11yrs  32      2       0    1           0
        9    6-11yrs  21      1       0    1           1
        10   6-11yrs  28      2       0    1           0
  • 46. Understanding data frames
    Each field is a vector. Let's explore a few of these vectors to see what we can learn about their structure in R.
        > # age variable
        > infert$age
          [1] 26 42 39 34 35 36 23 32 21 28 29 37 31 29 31 27 30 26 ...
        [235] 25 32 25 31 38 26 31 31 25 31 34 35 29 23
        > mode(infert$age)
        [1] "numeric"
        > class(infert$age)
        [1] "numeric"
  • 47. Understanding data frames
        > # education variable
        > infert$education
          [1] 0-5yrs  0-5yrs  0-5yrs  0-5yrs  6-11yrs 6-11yrs ...
        [247] 12+ yrs 12+ yrs
        Levels: 0-5yrs 6-11yrs 12+ yrs
        > mode(infert$education)
        [1] "numeric"
        > class(infert$education)
        [1] "factor"
  • 48. Understanding data frames and factors
    A factor is R's representation of a categorical field; it keeps track of all possible category levels.
        > sex <- sample(c("Male", "Female"), 100, replace = TRUE)
        > mode(sex); class(sex)
        [1] "character"
        [1] "character"
        > table(sex)
        sex
        Female   Male
            51     49
        > sexf <- factor(sex, levels = c("Male", "Female", "Transgender"))
        > table(sexf)
        sexf
               Male      Female Transgender
                 49          51           0
        > mode(sexf); class(sexf)
        [1] "numeric"
        [1] "factor"
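A factor reports mode "numeric" because it is stored as integer codes plus a levels attribute. The short sketch below, using a hypothetical three-element vector, makes the codes and the unobserved level visible.

```r
# Hypothetical factor with one level that never occurs in the data.
sexf <- factor(c("Male", "Female", "Male"),
               levels = c("Male", "Female", "Transgender"))

levels(sexf)                # all possible categories, observed or not
as.integer(sexf)            # underlying integer codes: 1 2 1
table(sexf)                 # the unobserved level is counted as 0
nlevels(droplevels(sexf))   # dropping unused levels leaves 2
```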
  • 49. Understanding data frames and lists
    The infert data come from a matched case-control study evaluating the association between history of abortions and infertility. Because the data are matched, we use conditional logistic regression.
        > library(survival)  # provides clogit() and strata()
        > mod3 <- clogit(case ~ spontaneous + induced +
        +                strata(stratum), data = infert)
        > mod3
        Call:
        clogit(case ~ spontaneous + induced + strata(stratum), data =
                    coef exp(coef) se(coef)    z       p
        spontaneous 1.99      7.29    0.352 5.63 1.8e-08
        induced     1.41      4.09    0.361 3.91 9.4e-05
        > summod3 <- summary(mod3)
  • 50. Understanding data frames and lists
        > summod3
          n= 248
                      coef exp(coef) se(coef)     z Pr(>|z|)
        spontaneous 1.9859    7.2854   0.3524 5.635 1.75e-08 ***
        induced     1.4090    4.0919   0.3607 3.906 9.38e-05 ***
        ---
        Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
                    exp(coef) exp(-coef) lower .95 upper .95
        spontaneous     7.285     0.1373     3.651    14.536
        induced         4.092     0.2444     2.018     8.298
        Rsquare= 0.193   (max possible= 0.519 )
        Likelihood ratio test= 53.15  on 2 df,   p=2.869e-12
        Wald test            = 31.84  on 2 df,   p=1.221e-07
        Score (logrank) test = 48.44  on 2 df,   p=3.032e-11
  • 51. Understanding data frames and lists
        > str(summod3)
        List of 12
         $ call        : language coxph(formula = Surv(rep(1, 248L), case) ~ spontaneous
         $ fail        : NULL
         $ na.action   : NULL
         $ n           : int 248
         $ loglik      : num [1:2] -90.8 -64.2
         $ coefficients: num [1:2, 1:5] 1.986 1.409 7.285 4.092 0.352 ...
          ..- attr(*, "dimnames")=List of 2
          .. ..$ : chr [1:2] "spontaneous" "induced"
          .. ..$ : chr [1:5] "coef" "exp(coef)" "se(coef)" "z" ...
         $ conf.int    : num [1:2, 1:4] 7.285 4.092 0.137 0.244 3.651 ...
          ..- attr(*, "dimnames")=List of 2
          .. ..$ : chr [1:2] "spontaneous" "induced"
          .. ..$ : chr [1:4] "exp(coef)" "exp(-coef)" "lower .95" "upper .95"
         $ logtest     : Named num [1:3] 5.32e+01 2.00 2.87e-12
        ... [output truncated]
  • 52. Understanding data frames and lists
        > summod3$coef
                        coef exp(coef)  se(coef)        z     Pr(>|z|)
        spontaneous 1.985876  7.285423 0.3524435 5.634592 1.754734e-08
        induced     1.409012  4.091909 0.3607124 3.906191 9.376245e-05
        > summod3$coef[1, ]
                coef    exp(coef)     se(coef)            z     Pr(>|z|)
        1.985876e+00 7.285423e+00 3.524435e-01 5.634592e+00 1.754734e-08
        > summod3$coef[ , 2]
        spontaneous     induced
           7.285423    4.091909
        > summod3$coef[1, 2]
        [1] 7.285423
  • 53. Understanding functions
    Risk ratio confidence interval from "baby Rothman", p. 135:
        rr.wald <- function(x, conf.level = 0.95){
          ## prepare input: x is a 2x2 table with exposure groups in rows
          ## (exposed first) and outcome in columns (events first)
          x1 <- x[1,1]; n1 <- sum(x[1,])
          x0 <- x[2,1]; n0 <- sum(x[2,])
          ## do calculations
          p1 <- x1/n1  ## risk among exposed
          p0 <- x0/n0  ## risk among unexposed
          RR <- p1/p0; logRR <- log(RR)
          SElogRR <- sqrt(1/x1 - 1/n1 + 1/x0 - 1/n0)
          Z <- qnorm(0.5*(1 + conf.level))
          LCL <- exp(logRR - Z*SElogRR)
          UCL <- exp(logRR + Z*SElogRR)
          ## collect output
          list(x = x, risks = c(p1 = p1, p0 = p0), risk.ratio = RR,
               conf.int = c(LCL, UCL), conf.level = conf.level)
        }
  • 54. Understanding functions
    Run the rr.wald function on the UGDP RCT data.
        > tab
                  Treatment
        Status     Tolbutamide Placebo
          Death             30      21
          Survivor         174     184
        > rr.wald(tab)
        $x
                  Treatment
        Status     Tolbutamide Placebo
          Death             30      21
          Survivor         174     184
        $risks
               p1        p0
        0.5882353 0.4860335
        $risk.ratio
        [1] 1.210277
        $conf.int
        [1] 0.9396227 1.5588927
        $conf.level
        [1] 0.95
    Note that rr.wald() reads each row as an exposure group and the first column as the event count. Here tab has Status in rows, so p1 and p0 are the proportions of deaths and of survivors who received tolbutamide, not the risks of death by treatment; passing t(tab) gives the latter.
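Since rr.wald() computes risks along rows, a table with the outcome in rows needs to be transposed before it is passed in. The following sketch repeats the same Wald calculation in compact form so that it runs on its own, and enters the UGDP counts by hand rather than reading the data file. The name `ttab` is hypothetical.

```r
# Same Wald risk-ratio calculation as rr.wald() on the earlier slide,
# condensed so this sketch is self-contained.
rr.wald <- function(x, conf.level = 0.95) {
  x1 <- x[1, 1]; n1 <- sum(x[1, ])
  x0 <- x[2, 1]; n0 <- sum(x[2, ])
  p1 <- x1 / n1; p0 <- x0 / n0
  RR <- p1 / p0
  SElogRR <- sqrt(1/x1 - 1/n1 + 1/x0 - 1/n0)
  Z <- qnorm(0.5 * (1 + conf.level))
  list(x = x, risks = c(p1 = p1, p0 = p0), risk.ratio = RR,
       conf.int = exp(log(RR) + c(-1, 1) * Z * SElogRR),
       conf.level = conf.level)
}

# UGDP counts entered by hand (Status in rows, Treatment in columns).
tab <- matrix(c(30, 174, 21, 184), nrow = 2,
              dimnames = list(Status = c("Death", "Survivor"),
                              Treatment = c("Tolbutamide", "Placebo")))

ttab <- t(tab)     # Treatment in rows, Status in columns
out <- rr.wald(ttab)
out$risks          # risk of death: 30/204 (tolbutamide), 21/205 (placebo)
out$risk.ratio     # (30/204) / (21/205)
out$conf.int
```

With the table oriented this way, the risk ratio compares the risk of death in the tolbutamide group with that in the placebo group.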
  • 55. The epitools package
    The following epidemiologists, directly or indirectly, contributed to 'epitools':
        Tomás Aragón, MD, DrPH, UC Berkeley
        Michael P. Fay, PhD, Mathematical Statistician, National Institute of Allergy and Infectious Diseases
        Wayne Enanoria, PhD, MPH, UC Berkeley
        Travis Porco, PhD, MPH, UC San Francisco
        Michael Samuel, DrPH, California Department of Public Health
  • 56. Using epitools for outbreak investigations
    Using the epitab function (only the arguments are displayed):
        epitab(x, y = NULL,
               method = c("oddsratio", "riskratio", "rateratio"),
               conf.level = 0.95,
               rev = c("neither", "rows", "columns", "both"),
               oddsratio = c("wald", "fisher", "midp", "small"),
               riskratio = c("wald", "boot", "small"),
               rateratio = c("wald", "midp"),
               pvalue = c("fisher.exact", "midp.exact", "chi2"),
               correction = FALSE,
               verbose = FALSE)
  • 57. Hypothesis testing using Oswego: Passing 2 vectors
        > library(epitools)  # load 'epitools' package
        > data(oswego)       # load Oswego dataset
        > attach(oswego)     # attach dataset
        > round(epitab(jello, ill, method = "riskratio")$tab, 2)
                 Outcome
        Predictor  N   p0  Y   p1 riskratio lower upper p.value
                N 22 0.42 30 0.58      1.00    NA    NA      NA
                Y  7 0.30 16 0.70      1.21  0.84  1.72    0.44
        > round(epitab(jello, ill, method = "oddsratio")$tab, 2)
                 Outcome
        Predictor  N   p0  Y   p1 oddsratio lower upper p.value
                N 22 0.76 30 0.65      1.00    NA    NA      NA
                Y  7 0.24 16 0.35      1.68  0.59  4.76    0.44
        > detach(oswego)     # detach dataset
  • 58. Hypothesis testing using Oswego: Passing a table
        > jello.tab1
             ill
        jello  N  Y
            N 22 30
            Y  7 16
        > round(epitab(jello.tab1)$tab, 2)
             ill
        jello  N   p0  Y   p1 oddsratio lower upper p.value
            N 22 0.76 30 0.65      1.00    NA    NA      NA
            Y  7 0.24 16 0.35      1.68  0.59  4.76    0.44
        > round(epitab(jello.tab1, method = "risk")$tab, 2)
             ill
        jello  N   p0  Y   p1 riskratio lower upper p.value
            N 22 0.42 30 0.58      1.00    NA    NA      NA
            Y  7 0.30 16 0.70      1.21  0.84  1.72    0.44
  • 59. Hypothesis testing using Oswego: Passing one vector
        > round(epitab(c(22, 30, 7, 16))$tab, 2)
                  Outcome
        Predictor  Disease1   p0 Disease2   p1 oddsratio lower upper p.value
          Exposed1       22 0.76       30 0.65      1.00    NA    NA      NA
          Exposed2        7 0.24       16 0.35      1.68  0.59  4.76    0.44
        > round(epitab(c(22, 30, 7, 16), method = "risk")$tab, 2)
                  Outcome
        Predictor  Disease1   p0 Disease2   p1 riskratio lower upper p.value
          Exposed1       22 0.42       30 0.58      1.00    NA    NA      NA
          Exposed2        7 0.30       16 0.70      1.21  0.84  1.72    0.44
  • 60. Summary
    1 Background
        Cost
        Quality
        Community
    2 Getting started with R
        Full-function calculator/spreadsheet
        Extensible statistical packages
        High quality graphics tool
        Multi-use programming language
    3 Working with R data objects
        Atomic vs. recursive data objects
        Working with vectors, matrices, & arrays
        Working with lists, data frames, and functions