The University of Sydney Page 1
Exploratory data analysis
The basics
Presented by Professor Peter Reimann
Centre for Research on Learning and Cognition
EDA is an inquiry cycle
(1) Generate questions
(2) Search for answers in the data: visualize, transform, model the data
(3) Refine questions, then repeat the cycle
EDA is an important component of theory-driven, problem-driven, and curiosity-driven research.
Where do questions come from?
An important source of questions about data is hypotheses derived from theory:
Theory -> Hypotheses -> Data
Another source is problems:
Problem(s) -> Questions -> Data
A third source is the data themselves:
Data -> Questions -> Data
Models of data
EDA plays a role in all three scenarios.
– Theories do not get compared with data as such, but with models of data:
[Diagram: Theory, Hypotheses, Data model(s), Data; EDA connects Data and Data model(s).]
And similarly for the other cases:
[Diagrams for the problem-driven and data-driven cases: Questions are likewise related to Data model(s), with EDA connecting Data and Data model(s).]
Data are not “objective”
– Measurements and observations are not theory- or assumption-free;
– There’s more than one way to build a (statistical) model of any data set;
– While the data may support a theory, they likely support many other theories;
– While a data set may support a theory, it could also contain relations that contradict the theory.
Hence, even if your data are carefully selected and measured, and you think you know them well, it is important to look for the unexpected!
The exploratory perspective
Key assumption: The more one knows about the data, the more effectively the data can be used to
– develop, test and refine theory,
– solve problems, and
– ask interesting questions.
To maximise what is learned from data, one needs to adhere to two principles:
– scepticism, and
– openness.
One should be sceptical, for instance, about the assumption that specific statistical parameters (i.e., summaries of data, such as the mean) reflect the data faithfully, and open to different interpretations of what the data say.
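As a minimal sketch of why a summary such as the mean deserves scepticism, the R snippet below compares how the mean and the median react to a single extreme observation. The values in `scores` and `scores_with_outlier` are made up for illustration and are not taken from the workbook.

```r
# Hypothetical values, invented for illustration only.
scores <- c(52, 55, 57, 58, 60)
scores_with_outlier <- c(scores, 250)  # add one extreme observation

# Summaries before and after adding the extreme value.
summary_table <- data.frame(
  data   = c("without outlier", "with outlier"),
  mean   = c(mean(scores), mean(scores_with_outlier)),
  median = c(median(scores), median(scores_with_outlier))
)
print(summary_table)
```

The mean moves from 56.4 to about 88.7, while the median only moves from 57 to 57.5, so the mean alone no longer describes a typical value once the extreme observation is present.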
Be sceptical! Be open!
One reason to be sceptical about statistics in particular is Anscombe’s Quartet:
– Four datasets with (almost) identical statistics, but very different shapes.
Figure: Anscombe’s quartet. Source: https://commons.wikimedia.org/w/index.php?curid=9838454
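The quartet can be checked directly in R, which ships it as the built-in data frame `anscombe` (columns `x1` to `x4` and `y1` to `y4`). The sketch below computes the usual summaries per dataset. The names `quartet_summary` and `plot_quartet` are illustrative choices, not part of the slides or the workbook.

```r
# Anscombe's quartet is available in base R as `anscombe`.
quartet_summary <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    set    = i,
    mean_x = mean(x),
    mean_y = mean(y),
    var_x  = var(x),
    var_y  = var(y),
    cor_xy = cor(x, y)
  )
}))
print(round(quartet_summary, 2))

# Scatterplots with a least-squares line, one panel per dataset.
plot_quartet <- function() {
  old_par <- par(mfrow = c(2, 2))
  on.exit(par(old_par))
  for (i in 1:4) {
    x <- anscombe[[paste0("x", i)]]
    y <- anscombe[[paste0("y", i)]]
    plot(x, y, main = paste("Dataset", i), xlim = c(3, 20), ylim = c(2, 14))
    abline(lm(y ~ x), col = "red")
  }
}
if (interactive()) plot_quartet()
```

The rounded summaries are nearly the same in all four rows. Only the plots show that one dataset is curved and that two contain a single observation that dominates the fit.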
(cont.)
– Statistics (= summary accounts of data) can be misleading.
– Data analysis is not identical with statistics:
  – Visual analysis should precede statistical analysis.
Stay open to multiple interpretations!
– The confirmatory, or hypothesis-testing, mode of data analysis can keep one from seeing what other patterns might exist in the data.
In addition to asking:
– Do these data confirm or disconfirm my hypothesis about x?
Ask:
– What can these data tell me about x?
Model and outliers
The basic way of thinking about data:
Data = pattern + deviations
(model + outliers)
(smooth + rough)
Data analysis, including statistical analysis, means partitioning data into patterns/models/smooths and deviations/outliers/roughs.
For any given data, there are in principle many ways to do this partitioning, and there is no logical reason to prefer one over the other a priori. Hence the analysis process is incremental, not a single hypothesis-testing step.
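The partitioning idea can be sketched in R on the third dataset of Anscombe's quartet (built in as `anscombe`). The two partitions below, a least-squares line and a constant median, are illustrative choices and not a recommendation from the slides. Each one splits the same data into a smooth part and a rough part.

```r
# Third dataset of Anscombe's quartet (built into R).
x <- anscombe$x3
y <- anscombe$y3

# Partition 1: smooth = least-squares line, rough = residuals.
fit <- lm(y ~ x)
smooth_line <- unname(fitted(fit))
rough_line  <- unname(residuals(fit))

# Partition 2: smooth = a constant (the median of y), rough = what is left.
smooth_median <- rep(median(y), length(y))
rough_median  <- y - smooth_median

# Both partitions reproduce the data: data = smooth + rough.
print(all.equal(smooth_line + rough_line, y))
print(all.equal(smooth_median + rough_median, y))

# Index of the observation that deviates most from the fitted line.
print(which.max(abs(rough_line)))
```

Both partitions are valid decompositions of the same data. Which one is more useful depends on the question, which is why the analysis proceeds incrementally.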
Our tools for EDA
– dplyr: selecting, filtering, summarising data
– ggplot2: visualising data, patterns, trends.
Data selection with dplyr
dplyr is built around 5 verbs:
(1) select variables
(2) filter on values
(3) arrange rows
(4) mutate: create new variables
(5) summarize over values
They operate on a data frame with variables in columns and observations in rows:
               Variable A   (…)   Variable v
Observation 1  Value 1A     (…)   Value 1v
Observation 2  Value 2A     (…)   Value 2v
(…)            (…)          (…)   (…)
Observation o  Value oA     (…)   Value ov
“Sentences” in dplyr
General format: verb(data frame, parameters)
– The result is a new data frame: new_frame <- verb(data, parameters).
Examples:
– filter(flights, month == 1, day == 1)
– arrange(flights, year, month, day)
– select(flights, year, month, day)
– mutate(flights, gain = arr_delay - dep_delay, speed = distance / air_time * 60)
– summarize(flights, delay = mean(dep_delay, na.rm = TRUE))
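The five verbs can be tried without the full `flights` table. The sketch below assumes the dplyr package is installed and uses a small made-up data frame, `toy_flights`, with the same column names as the examples. Its values are hypothetical, not real flight records. `na.rm = TRUE` is added so that the missing delay does not turn the mean into `NA`.

```r
library(dplyr)

# Hypothetical toy data standing in for `flights`.
toy_flights <- data.frame(
  year      = 2013L,
  month     = c(1L, 1L, 2L, 2L),
  day       = c(1L, 2L, 1L, 1L),
  dep_delay = c(5, NA, 30, -2),
  arr_delay = c(10, 0, 20, -12),
  distance  = c(1000, 500, 750, 250),
  air_time  = c(120, 60, 100, 50)
)

# Each verb takes a data frame first and returns a new data frame.
jan_first <- filter(toy_flights, month == 1, day == 1)
sorted    <- arrange(toy_flights, year, month, day)
dates     <- select(toy_flights, year, month, day)
with_gain <- mutate(toy_flights,
                    gain  = arr_delay - dep_delay,
                    speed = distance / air_time * 60)
avg_delay <- summarize(toy_flights, delay = mean(dep_delay, na.rm = TRUE))

print(jan_first)
print(with_gain)
print(avg_delay)
```

`toy_flights` itself is unchanged after these calls. Every result lives in the new object it was assigned to, which is what the "new_frame <- verb(...)" pattern expresses.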
Boolean operations are supported for filtering and selecting
! is “not”, | is “or”, & is “and”
These two return the same observations:
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
For more on these commands, see for instance
https://www.youtube.com/watch?v=aywFompr1F4
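The equivalence of the two filters is De Morgan's law: "not (A or B)" is the same as "(not A) and (not B)". A minimal sketch, assuming dplyr is installed and using a made-up data frame `toy_delays` (hypothetical values, including one missing delay), checks that both calls keep the same rows.

```r
library(dplyr)

# Hypothetical delays in minutes; flight D has a missing arrival delay.
toy_delays <- data.frame(
  flight    = c("A", "B", "C", "D", "E"),
  arr_delay = c(10, 130, 90, NA, 200),
  dep_delay = c(5, 20, 150, 30, 180)
)

# "Not (late on arrival OR late on departure)" ...
not_either_late <- filter(toy_delays, !(arr_delay > 120 | dep_delay > 120))

# ... is the same as "on time on arrival AND on time on departure".
# Comma-separated conditions in filter() are combined with &.
both_on_time <- filter(toy_delays, arr_delay <= 120, dep_delay <= 120)

print(not_either_late)
print(identical(not_either_late, both_on_time))
```

Rows for which a condition evaluates to `NA` are dropped by `filter()`, so flight D is excluded by both versions.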
Workbook
– The rest of this module is mainly in the workbook.


Editor's notes

  1. https://en.wikipedia.org/wiki/Anscombe's_quartet. The reason for some of this is that many statistics are very sensitive to outliers. See in particular datasets 3 and 4.