Introduction to R for the European Historical Population Sample summer school, Cluj-Napoca, Romania, 2015. Aimed at an audience of historians with few quantitative skills.
Introduction to R for historians (part 3: examine and import data)
1. Recap
Getting data in R
Do it yourself!
Plotting using ggplot2
Examining data and importing data in R
Richard L. Zijdeman
May 29, 2015
Richard L. Zijdeman Examining data and importing data in R
1 Recap
2 Getting data in R
3 Do it yourself!
4 Plotting using ggplot2
Recap
The structure of objects
Store just about anything in R: numbers, sentences, datasets
Objects
Study the structure of objects: str()
type of object
features of object
ships <- data.frame(year = c(1850, 1860, 1870, 1880),
                    inbound = c(215, 237, 237, NA),
                    outbound = c(212, 239, 260, 265))
Study the structure of object “ships”
str(ships)
## 'data.frame': 4 obs. of 3 variables:
## $ year : num 1850 1860 1870 1880
## $ inbound : num 215 237 237 NA
## $ outbound: num 212 239 260 265
Characteristics of objects
Class: class()
Length: length()
Dimensions: dim()
class(ships)
## [1] "data.frame"
length(ships)
## [1] 3
dim(ships) # rows, columns
## [1] 4 3
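Note that for a data.frame, length() counts columns rather than rows. A minimal sketch, re-creating the ships object and using nrow() and ncol() (base R functions that are not on the slide) to obtain the two parts of dim() separately:

```r
# Re-create the example data.frame from the earlier slide
ships <- data.frame(year = c(1850, 1860, 1870, 1880),
                    inbound = c(215, 237, 237, NA),
                    outbound = c(212, 239, 260, 265))

# A data.frame is a list of columns, so length() counts columns
n_columns <- length(ships)

# nrow() and ncol() return the two elements of dim() one at a time
n_rows <- nrow(ships)
n_cols <- ncol(ships)

# For a single column (a vector), length() counts the values
n_years <- length(ships$year)

print(c(columns = n_columns, rows = n_rows, years = n_years))
```

To count observations, use nrow() on the data.frame or length() on one of its columns.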
Closer inspection of data.frames
names of columns (variables): names()
top/bottom rows: head(), tail()
missing data: is.na()
names(ships)
## [1] "year" "inbound" "outbound"
is.na(ships)
## year inbound outbound
## [1,] FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE
## [4,] FALSE TRUE FALSE
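The slide lists head() and tail() without showing them. A small sketch on the same ships object; the argument n and the helper which() are base R, but their use here is an illustration rather than part of the original slides:

```r
# Re-create the example data.frame from the earlier slide
ships <- data.frame(year = c(1850, 1860, 1870, 1880),
                    inbound = c(215, 237, 237, NA),
                    outbound = c(212, 239, 260, 265))

# First two rows and last row of the data.frame
first_two <- head(ships, n = 2)
last_row <- tail(ships, n = 1)

# Row numbers where inbound is missing
missing_rows <- which(is.na(ships$inbound))

print(first_two)
print(last_row)
print(ships[missing_rows, ])
```

On large files, head() and tail() are a quick way to check that the import started and ended where you expected.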
Summarizing data in data.frames
descriptive statistics: summary()
calculations: e.g. min(), mean(), sum()
tabulated results: table()
summary(ships)
## year inbound outbound
## Min. :1850 Min. :215.0 Min. :212.0
## 1st Qu.:1858 1st Qu.:226.0 1st Qu.:232.2
## Median :1865 Median :237.0 Median :249.5
## Mean :1865 Mean :229.7 Mean :244.0
## 3rd Qu.:1872 3rd Qu.:237.0 3rd Qu.:261.2
## Max. :1880 Max. :237.0 Max. :265.0
## NA's :1
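summary() skips the missing value automatically, but functions such as mean() and sum() do not. A short sketch of the difference, using the base R argument na.rm (not shown on the slide):

```r
# Re-create the example data.frame from the earlier slide
ships <- data.frame(year = c(1850, 1860, 1870, 1880),
                    inbound = c(215, 237, 237, NA),
                    outbound = c(212, 239, 260, 265))

# With a missing value present, the result is itself missing
mean_with_na <- mean(ships$inbound)

# na.rm = TRUE drops missing values before calculating
mean_without_na <- mean(ships$inbound, na.rm = TRUE)

# outbound has no missing values, so no extra argument is needed
total_outbound <- sum(ships$outbound)

print(mean_with_na)
print(mean_without_na)
print(total_outbound)
```

The mean without missing values matches the 229.7 that summary() reports for inbound.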
is.na(ships)
## year inbound outbound
## [1,] FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE
## [4,] FALSE TRUE FALSE
table(is.na(ships))
##
## FALSE TRUE
## 11 1
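table(is.na(ships)) counts missing cells over the whole data.frame. To see which variable or which row they belong to, one option is colSums() and complete.cases(), both base R functions that are not on the slide. A sketch:

```r
# Re-create the example data.frame from the earlier slide
ships <- data.frame(year = c(1850, 1860, 1870, 1880),
                    inbound = c(215, 237, 237, NA),
                    outbound = c(212, 239, 260, 265))

# Number of missing values per column (TRUE counts as 1)
missing_per_column <- colSums(is.na(ships))

# TRUE for rows without any missing value
complete_rows <- complete.cases(ships)

print(missing_per_column)
print(complete_rows)
```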
Visualizing your data
Not just for analyses!
Data quality
representativeness
missing data
plot(ships)
[Figure: scatterplot matrix of the variables year, inbound and outbound]
Getting data in R
Data already in R
The “datasets” package
very slim datasets
specific example data
To obtain a list of datasets, type:
library(help = "datasets")
To obtain information on a specific dataset, type:
help(swiss) # thus: help(name_of_dataset)
or to just see the data:
swiss
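The functions from the recap work on built-in datasets as well. A brief sketch that inspects swiss without printing all of its rows:

```r
# swiss is part of the "datasets" package, which R loads by default
swiss_dims <- dim(swiss)          # rows, columns
swiss_names <- names(swiss)       # variable names
first_rows <- head(swiss, n = 3)  # first three rows only

print(swiss_dims)
print(swiss_names)
print(first_rows)
```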
Reading in data
Different functions for different files:
Base R: read.table() (read.csv())
foreign package: read.spss(), read.dta(), read.dbf()
openxlsx package: read.xlsx()
alternative packages:
xlsx (Java required)
gdata (Perl-based)
read.xlsx() from openxlsx package
file (argument name: xlsxFile): your file, including directory
sheet: name or index of sheet
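A hedged sketch of a round trip with openxlsx. It assumes the package is installed and skips itself otherwise. The temporary workbook and the ships data are for illustration only and are not the course files:

```r
# Assumption: openxlsx is installed; otherwise the block does nothing
if (requireNamespace("openxlsx", quietly = TRUE)) {
  ships <- data.frame(year = c(1850, 1860, 1870, 1880),
                      inbound = c(215, 237, 237, NA),
                      outbound = c(212, 239, 260, 265))

  # Write a temporary workbook so the example needs no external file
  xlsx_path <- tempfile(fileext = ".xlsx")
  openxlsx::write.xlsx(ships, file = xlsx_path)

  # Read the first sheet back; the file is passed as the first argument
  ships_from_xlsx <- openxlsx::read.xlsx(xlsx_path, sheet = 1)
  str(ships_from_xlsx)
}
```

For your own data, replace xlsx_path with the path to your workbook and set sheet to the sheet you need.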
read.csv()
file: your file, including directory
header: variable names or not?
sep: separator
read.csv default: “,”
read.csv2 default: “;”
skip: number of rows to skip
nrows: total number of rows to read
stringsAsFactors
encoding (e.g. “latin1” or “UTF-8”)
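To see how these arguments interact, the sketch below writes a small semicolon-separated file to a temporary location and reads it back. The file contents, including the descriptive first line, are made up for illustration:

```r
# Write a small semicolon-separated file with one descriptive line on top
csv_path <- tempfile(fileext = ".csv")
writeLines(c("exported from a hypothetical database",
             "year;inbound;outbound",
             "1850;215;212",
             "1860;237;239",
             "1870;237;260",
             "1880;;265"),
           csv_path)

# skip the descriptive line, use the next line as header, read 3 data rows
ships_csv <- read.csv(csv_path,
                      header = TRUE,
                      sep = ";",
                      skip = 1,
                      nrows = 3,
                      stringsAsFactors = FALSE)

# read.csv2() already assumes ";" as separator; the empty field becomes NA
ships_csv2 <- read.csv2(csv_path, skip = 1)

str(ships_csv)
str(ships_csv2)
```

Reading only a few rows with nrows is a cheap way to check the separator and header settings before importing a large file.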
Do it yourself!
Read in the following files as data.frames:
HSN_basic.xlsx
check the data.frame: using dim(), length()
check the variables: using summary(), min(), table()
Repeat for HSN_marriages.csv:
read in only 100 lines
Plotting using ggplot2
ggplot2
Package by Hadley Wickham
Generic plotting for a great range of plots
ggplot2 website: http://ggplot2.org
excellent tutorial:
https://jofrhwld.github.io/avml2012/#Section_1.1
Building your graph
Each plot consists of multiple layers
Think of a canvas on which you ‘paint’
data layer
geometries layer
statistics layer
Data layer
data.frame and aesthetics
ggplot(data.frame, aes(x = ..., y = ...))
Geometries layer
ggplot(..., aes(x = ..., y = ...)) +
  geom_...() # e.g. geom_line
Statistics layer
ggplot(..., aes(x = ..., y = ...)) +
  geom_...() +
  stat_...() # e.g. stat_smooth
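As a concrete instance of the three layers, here is a sketch that fills in the templates with the ships data from the recap. It assumes ggplot2 is installed. The choice of geom_point, geom_line and a linear stat_smooth is only an illustration:

```r
library(ggplot2)  # assumes ggplot2 is installed

# Re-create the example data.frame from the recap
ships <- data.frame(year = c(1850, 1860, 1870, 1880),
                    inbound = c(215, 237, 237, NA),
                    outbound = c(212, 239, 260, 265))

# Data layer: the data.frame plus the aesthetics
layered_plot <- ggplot(ships, aes(x = year, y = outbound)) +
  # Geometries layers: points and a connecting line
  geom_point() +
  geom_line() +
  # Statistics layer: a straight-line fit through the points
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(layered_plot)
```

Because the plot is stored in an object, you can add further layers later with layered_plot + ... and print it again.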
an example
Reading in the data
hmar <- read.csv("./../data/derived/HSN_marriages.csv",
                 stringsAsFactors = FALSE,
                 encoding = "latin1",
                 header = TRUE,
                 nrows = 100)
Plotting the data
install.packages("ggplot2") # the package name must be quoted
library(ggplot2)
ggplot(hmar, aes(x = M_year, y = Age_bride)) +
  geom_point()
[Figure: scatter plot of Age_bride (y-axis, 20 to 50) against M_year (x-axis, 1830 to 1870)]
Improving the plot
Specify characteristics of the geom_layer
ggplot(hmar, aes(x = M_year, y = Age_bride)) +
  geom_point(colour = "blue", size = 3, shape = 18)
See http://www.cookbook-r.com/Graphs/Shapes_and_line_types/
[Figure: the same scatter plot of Age_bride against M_year, drawn with larger blue diamond-shaped points]
A PTE example
Does age at marriage depend on educational attainment?
To marry you need resources
the higher the attainment, the longer it takes to acquire resources
ergo: brides with higher educational attainment marry later in life
Not a statistical test, but let’s graph this
A request from yesterday
Can I plot labels?
ggplot(hmar, aes(x = M_year, y = Age_bride,
                 label = SIgn_bride)) +
  geom_text()
Yes you can!
Not really useful though...
[Figure: Age_bride against M_year with each point drawn as its SIgn_bride label, "a" or "h"; many labels overlap]
Let’s try with colours...
ggplot(hmar, aes(x = M_year, y = Age_bride)) +
  geom_point(aes(colour = factor(SIgn_bride)),
             size = 3, shape = 18)
[Figure: scatter plot of Age_bride against M_year, points coloured by factor(SIgn_bride) with legend values "a" and "h"]
No real pattern, though...
Finalizing the graph
ggplot(hmar, aes(x = M_year, y = Age_bride)) +
  geom_point(aes(colour = factor(SIgn_bride)),
             size = 3,
             shape = 18) +
  labs(title = "Age of marriage over time",
       x = "time (years since A.D.)",
       y = "age of bride (years)",
       colour = "Signature")
# here we use colour since legend shows colour
# labels are passed to labs() directly; recent ggplot2 versions do not accept a list()
[Figure: "Age of marriage over time"; x-axis: time (years since A.D.), 1830 to 1870; y-axis: age of bride (years), 20 to 50; colour legend "Signature" with values "a" and "h"]
Satisfied?
Actually not... the points are plotted on top of each other...
Solution: geom_jitter
ggplot(hmar, aes(x = M_year, y = Age_bride)) +
  geom_jitter(aes(colour = factor(SIgn_bride)),
              size = 3,
              shape = 18) +
  labs(title = "Age of marriage over time",
       x = "time (years since A.D.)",
       y = "age of bride (years)",
       colour = "Signature")
# here we use colour since legend shows colour
# labels are passed to labs() directly; recent ggplot2 versions do not accept a list()
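How far geom_jitter moves the points can be controlled with its width and height arguments. A sketch on made-up toy data (not the HSN file), in which several brides share the same year and age; it assumes ggplot2 is installed:

```r
library(ggplot2)  # assumes ggplot2 is installed

# Hypothetical toy data: identical points that would overplot
toy <- data.frame(M_year = c(1850, 1850, 1850, 1860, 1860),
                  Age_bride = c(25, 25, 25, 30, 30))

# Each point is shifted randomly by at most 0.3 in either direction
jittered <- ggplot(toy, aes(x = M_year, y = Age_bride)) +
  geom_jitter(width = 0.3, height = 0.3, size = 3, shape = 18)

print(jittered)
```

Keep the jitter small relative to the measurement unit (here one year), so that readers can still tell which year and age a point belongs to.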
[Figure: "Age of marriage over time" with jittered points; x-axis: time (years since A.D.), 1830 to 1870; y-axis: age of bride (years), 20 to 50; colour legend "Signature" with values "a" and "h"]
Final remarks on ggplot2
We have just scratched the surface of ggplot2
Build your graph slowly
start with the basics
add complexity step-wise
Now it’s your turn!
A small PTE project
Look at the variables in the HSN files
Think of a research question
Provide a general mechanism and hypothesis
Plot your results