This document provides an overview of exploratory data analysis (EDA). It discusses how EDA is used to generate and refine questions from data by visualizing, transforming, and modeling the data. Questions can come from hypotheses, problems, or the data itself. EDA plays a role in developing, testing, and refining theories, solving problems, and asking interesting questions about the data. The document emphasizes being skeptical of assumptions and open to multiple interpretations during EDA to maximize learning from the data. It introduces the dplyr and ggplot2 packages for selecting, filtering, summarizing, and visualizing data during the EDA process.
1. The University of Sydney Page 1
Exploratory data
analysis
The basics
Presented by
Professor Peter Reimann
Centre for Research on Learning and
Cognition
EDA is an inquiry cycle
Generate questions → search for answers in the data → refine questions → (and around again).
At every step: visualize, transform, and model the data.
EDA is an important component of theory-driven, problem-driven, and curiosity-driven research.
Where do questions come from?
An important source of questions about data is hypotheses derived from theory:
Theory → Hypotheses → Data
Another source is problems:
Problem(s) → Questions → Data
A third source is the data themselves:
Data → Questions → Data
Models of data
EDA plays a role in all three scenarios.
– Theories do not get compared with data as such, but with models of data:
Theory → Hypotheses → Data model(s) ← EDA ← Data
And similarly for the other cases:
Problem(s) → Questions → Data model(s) ← EDA ← Data
Data → Questions → Data model(s) ← EDA ← Data
Data are not “objective”
– Measurements and observations are not theory- or assumption-free;
– There’s more than one way to build a (statistical) model of any data set;
– While the data may support a theory, they likely support many other theories;
– While a data set may support a theory, it could also contain relations that contradict the theory.
Hence, even if your data are carefully selected and measured, and you think you know them well, it is important to look for the unexpected!
The exploratory perspective
Key assumption: The more one knows about the data, the more effectively the data can be used to
– develop, test, and refine theory,
– solve problems, and
– ask interesting questions.
To maximise what is learned from data, one needs to adhere to two principles:
– scepticism, and
– openness.
One should be sceptical, for instance, about the assumption that specific statistical parameters (i.e., summaries of data, such as the mean) reflect the data faithfully, and open to different interpretations of what the data say.
Be sceptical! Be open!
One reason to be sceptical about statistics in particular is Anscombe’s Quartet:
– Four datasets with (almost) identical summary statistics, but very different shapes.
Image: By Anscombe, https://commons.wikimedia.org/w/index.php?curid=9838454
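Anscombe’s Quartet is built into base R as the `anscombe` data frame, so the near-identical statistics can be checked directly. A small sketch in base R (not part of the original slides):

```r
# Anscombe's Quartet ships with base R as `anscombe`
# (columns x1..x4 and y1..y4). Compute the headline statistics
# for each of the four x/y pairs.
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    var_x  = var(x),  cor_xy = cor(x, y))
})
round(stats, 2)
# All four columns agree: mean_x = 9, mean_y ~ 7.5, var_x = 11,
# cor_xy ~ 0.82 -- yet plot(x, y) looks very different for each pair.
```

Plotting each pair (e.g. `plot(anscombe$x4, anscombe$y4)`) makes the differences obvious, which is exactly the point of the quartet.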
(cont.)
– Statistics (= summative accounts of data) can be misleading.
– Data analysis is not identical with statistics:
– Visual analysis should precede statistical analysis.
Stay open to multiple interpretations!
– The confirmatory, or hypothesis-testing, approach to data analysis can keep one from seeing what other patterns might exist in the data.
In addition to asking:
– Do these data confirm or disconfirm my hypothesis about x?
Ask:
– What can these data tell me about x?
Model and outliers
The basic way of thinking about data:
Data = pattern + deviations
     = model + outliers
     = smooth + rough
Data analysis, including statistical analysis, means partitioning data into patterns/models/smooths and deviations/outliers/roughs. For any given data, there are in principle many ways to do this partitioning, and there is no logical reason to a priori prefer one over the other; the analysis process is incremental, not a single hypothesis-testing step.
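The smooth/rough partition can be made concrete with a simple linear model. This is an illustrative sketch with simulated toy data (the variables and numbers are hypothetical, not from the module’s dataset):

```r
# Toy data: a linear pattern plus noise.
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20, sd = 3)

fit    <- lm(y ~ x)        # the "smooth": a fitted pattern/model
smooth <- fitted(fit)
rough  <- residuals(fit)   # the "rough": deviations from the pattern

# The partition is exact: data = pattern + deviations.
all.equal(as.numeric(smooth + rough), y)  # TRUE
```

A different model choice (say, a quadratic fit) would partition the same data differently — which is the point: the smooth/rough split is a modelling decision, not a property of the data.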
Our tools for EDA
– dplyr: selecting, filtering, summarising data
– ggplot2: visualising data, patterns, trends.
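As a minimal taste of ggplot2 (using R’s built-in mtcars data purely for illustration; these variables are not from the module’s dataset):

```r
library(ggplot2)

# Map variables to aesthetics, then add layers: points for the raw
# data and a linear trend line for the pattern.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
print(p)
```

The layered grammar — data, aesthetic mappings, geometries — is what makes ggplot2 well suited to the iterative question/answer cycle of EDA.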
Data selection with dplyr
dplyr is built around five verbs, which operate on a data frame (rows = Observation 1 … Observation o, columns = Variable A … Variable v, cells = values):
(1) select: choose variables (columns)
(2) filter: choose observations by their values (rows)
(3) arrange: reorder rows
(4) mutate: create new variables
(5) summarize: collapse values into a summary
“Sentences” in dplyr
General format: verb(data frame, parameters)
– The result is a new data frame: new_frame <- verb(data, parameters).
Examples:
– filter(flights, month == 1, day == 1)
– arrange(flights, year, month, day)
– select(flights, year, month, day)
– mutate(flights, gain = arr_delay - dep_delay,
speed = distance / air_time * 60)
– summarize(flights, delay = mean(dep_delay))
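The `flights` data frame in these examples comes from the nycflights13 package. The same verb(data frame, parameters) pattern can be sketched on a small, self-contained toy frame (hypothetical data, mimicking a few columns of `flights`):

```r
library(dplyr)

# Hypothetical toy data standing in for `flights`.
toy <- data.frame(
  month     = c(1, 1, 2),
  day       = c(1, 2, 1),
  dep_delay = c(5, -2, 30),
  arr_delay = c(10, -5, 25)
)

jan <- filter(toy, month == 1)                  # keep January rows
jan <- mutate(jan, gain = arr_delay - dep_delay)
out <- summarize(jan, delay = mean(dep_delay))  # one-row summary
out  # delay = 1.5
```

Each verb takes a data frame as its first argument and returns a new data frame, which is why the verbs chain together so naturally.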
Boolean operations are supported for filtering and selecting
`!` is “not”, `|` is “or”, `&` is “and”.
These two return the same observations:
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
For more on these commands, see for instance
https://www.youtube.com/watch?v=aywFompr1F4
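The equivalence follows from De Morgan’s law: !(A | B) is the same as !A & !B. A quick check on a hypothetical toy data frame:

```r
library(dplyr)

# Toy delays; only the first row satisfies both conditions.
d <- data.frame(arr_delay = c(100, 150, 50),
                dep_delay = c( 50,  10, 200))

a <- filter(d, !(arr_delay > 120 | dep_delay > 120))
b <- filter(d, arr_delay <= 120, dep_delay <= 120)
identical(a, b)  # TRUE: both keep only the first row
```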
Workbook
– The rest of this module is mainly in the workbook.
Editor’s notes
https://en.wikipedia.org/wiki/Anscombe's_quartet. Part of the reason for this is that many statistics are very sensitive to outliers. See in particular datasets 3 and 4.