DATA MANAGEMENT
SCI 2777 • Storytelling with Data • Spring 2014
Sister Edith Bogue • The College of St Scholastica
DISPOSABLE DATA MANAGEMENT
• Researchers know they need clean, reliable data
• The analysis is what really interests them
• When data arrive, they do a quick manual clean-up of any problems they see.
• Often cut-and-paste in spreadsheets
• Look for and fix anomalies

• If no errors crop up in the analysis, they make a clean archive copy and forget about the data.

The Perils of Disposable Data Management from Prometheus Research blog at
https://www.prometheusresearch.com/the-perils-of-disposable-data-management/
DISPOSABLE DATA MANAGEMENT
• PROBLEM #1: More data arrive and they have to do the same cut-and-paste / sorting / combining operations over again.
• PROBLEM #2: An anomaly appears in a later data set. The researcher has to check all the earlier data to find out if it’s there too. It was a cut-and-paste error.
• PROBLEM #3: The results look peculiar, or are opposite to the prediction. Was it the data handling or is it real?
GOOD DATA PRACTICES
• “It’s common to spend many tedious and frustrating hours cleaning and wrangling your data into a usable format, followed by careful exploration to provide context and reveal potential problems with the analyses you want to run.”
• “Data cleaning and data transformation are two major bottlenecks in data analysis.”

Good Data Management Practices for Data Analysis from Prometheus Research blog at
https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-introduction-part-1/

DATA CLEANING
It should be no surprise that it takes longer
to clean messier data. Unfortunately, there
are many ways that data can be messy.
Powerful tools and practices can help you
turn messy data into clean data.
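As one illustration of such a practice, the sketch below scripts a few common clean-up steps in Python instead of doing them by hand, so the same steps can be rerun when new data arrive. The function name `clean_rows`, the column names, the sample file, and the set of missing-value markers are all assumptions made for this example, not part of the course materials.

```python
import csv
import io

# Hypothetical markers that different sources might use for "no value".
MISSING_MARKERS = {"", "na", "n/a", "..", "-"}


def clean_rows(rows):
    """Return cleaned copies of the rows; the input rows are not modified."""
    cleaned = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            key = key.strip().lower()      # tidy up the column name
            value = value.strip()          # remove stray spaces
            # Represent every missing-value marker the same way: None.
            new_row[key] = None if value.lower() in MISSING_MARKERS else value
        cleaned.append(new_row)
    return cleaned


# A small, invented messy file: stray spaces, mixed case, mixed missing markers.
raw = io.StringIO(
    "Country , Year,GDP\n"
    " Chile,2010 ,217.5\n"
    "Ghana,2010,..\n"
    "Nepal, 2010,N/A\n"
)
rows = clean_rows(csv.DictReader(raw))
for row in rows:
    print(row)
```

Because the steps live in a function rather than in a series of mouse clicks, the clean-up is written down and can be repeated exactly.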

DATA TRANSFORMATION
“This is more subtle. It’s often important to visualize and model the data in various ways when conducting an analysis. I’m not talking about going on fishing expeditions, but rather about familiarizing yourself with the data… The point is that frequent data transformations are required to mediate changes between these representations, introducing an underappreciated amount of friction in analysis.”
TIDY DATA
• Each variable forms a column
• Each observation forms a row
• Each data set contains information on only one observational unit of analysis (e.g., families, participants, participant visits)

Good Data Management Practices for Data Analysis: Tidy Data (Part 2) from Prometheus Research blog at
https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-tidy-data-part-2/
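To make the three rules concrete, here is a small sketch in Python. The participant-visit table, its column names, and its values are invented for illustration, and the helpers `is_rectangular` and `column` are hypothetical, not standard functions.

```python
# One observational unit: participant visits.
# Each dict is one observation (row); each key is one variable (column).
visits = [
    {"participant_id": "P01", "visit": 1, "weight_kg": 70.2},
    {"participant_id": "P01", "visit": 2, "weight_kg": 69.8},
    {"participant_id": "P02", "visit": 1, "weight_kg": 82.5},
]


def is_rectangular(rows):
    """Check that every row records exactly the same set of variables."""
    if not rows:
        return True
    columns = set(rows[0])
    return all(set(row) == columns for row in rows)


def column(rows, name):
    """Pull one variable out of the table as a list of values."""
    return [row[name] for row in rows]


print(is_rectangular(visits))
print(column(visits, "weight_kg"))
```

Information about the participants themselves (for example, date of birth) would belong in a separate table, because participants are a different observational unit from visits.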
MESSY DATA
• Column names represent data values instead
of variable names
• A single column contains data on multiple
variables instead of a single variable
• Variables are contained in both rows and
columns instead of just columns
• A single table contains more than one
observational unit
• Data about an observational unit is spread
across multiple data sets
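The first problem in the list, column names that are really data values, is often fixed by reshaping the table from "wide" to "long". The sketch below does this in plain Python. The function name `melt`, the country rows, and the numbers are invented for illustration.

```python
def melt(rows, id_columns, variable_name, value_name):
    """Turn each non-id column into its own row (wide to long)."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key in id_columns:
                continue
            # Copy the identifying columns, then add one variable/value pair.
            new_row = {name: row[name] for name in id_columns}
            new_row[variable_name] = key   # the old column name becomes data
            new_row[value_name] = value
            long_rows.append(new_row)
    return long_rows


# Messy: the column names "2010" and "2011" are values of a "year" variable.
wide = [
    {"country": "Chile", "2010": 17.0, "2011": 17.2},
    {"country": "Ghana", "2010": 24.8, "2011": 25.4},
]

tidy = melt(
    wide,
    id_columns=["country"],
    variable_name="year",
    value_name="population_millions",
)
for row in tidy:
    print(row)
```

After reshaping, each row is one country-year observation and "year" is an ordinary column that can be sorted, filtered, or graphed.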
TIDY TOOLS
• Tidy tools are those that
accept, manipulate, and return tidy data.
• Tidy tools are like Lego blocks—individually
simple but flexible & powerful in combination.
• What tools are tidy?
• Most functions in R
• Most transformations in SPSS or SAS
• Relational databases (an entire skill of its own)

• Spreadsheets are not tidy tools
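
The Lego-block idea can be sketched in Python: each small function below accepts a tidy table (here, a list of row dictionaries) and returns another tidy table, so the steps can be chained in any order. The function names and the data are hypothetical, chosen only for this example.

```python
def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def add_column(rows, name, compute):
    """Return new rows with one extra variable computed from each row."""
    return [{**row, name: compute(row)} for row in rows]


def sort_rows(rows, key):
    """Return the rows ordered by one variable."""
    return sorted(rows, key=lambda row: row[key])


# Invented example data: one row per country.
countries = [
    {"country": "A", "gdp": 300.0, "population": 10.0},
    {"country": "B", "gdp": 120.0, "population": 2.0},
    {"country": "C", "gdp": 50.0, "population": 25.0},
]

# Each step takes tidy data in and hands tidy data on to the next step.
result = sort_rows(
    add_column(
        filter_rows(countries, lambda r: r["gdp"] > 100),
        "gdp_per_person",
        lambda r: r["gdp"] / r["population"],
    ),
    key="gdp_per_person",
)
print(result)
```

None of the functions changes the original `countries` list, which makes it easier to retrace a step when a result looks wrong.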

SCI 2777
• We will learn about cleaning data first with
untidy tools: spreadsheets and the like.
• They are more familiar and easy to use right away
• We will learn how to track the provenance even
with our untidy tools.

• Soon, we will use R for some tasks, and get some
basic skills for using a tidy tool for cleaning data.

A CAUTIONARY EXAMPLE
• THOMAS HERNDON
• Third-year economics grad student at UMass-Amherst (age 28)
• Class assignment: replicate the findings of a published study.
• Growth in a Time of Debt by Reinhart & Rogoff in American Economic Review
• Finding: Growth drops off sharply if debt is high
• Basis for austerity economics

• Could not replicate
• Found 3-4 errors.

Photo: The 28-Year-Old Who Caught the Excel Error Heard Round the World. In These Times http://bit.ly/Lz2eDm

Herndon et al. (2013) Does High Public Debt Consistently Stifle Economic Growth?
A Critique of Reinhart and Rogoff. PERI Working Papers Number 322.
http://www.peri.umass.edu/fileadmin/pdf/working_papers/working_papers_301-350/WP322.pdf
“There were actually four errors all together. Any one error by itself would not have been enough to cause the negative average. It was the combined effect of all four of them: They interacted with each other and amplified each other, almost like a perfect storm of errors.”
Quote from: The 28-Year-Old Who Caught the Excel Error Heard Round the World. In These Times http://bit.ly/Lz2eDm

Researchers Finally Replicated Reinhart-Rogoff, and There Are Serious Problems
from Next New Deal at http://bit.ly/1f1XUHG
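One way a spreadsheet calculation can go wrong is a formula whose range leaves out some rows. The sketch below shows how such a slip changes an average without producing any warning. The growth figures are invented for illustration and are not Reinhart and Rogoff's data.

```python
from statistics import mean

# Invented growth rates (percent) for ten hypothetical countries.
growth = [3.1, 2.6, 2.4, 2.9, 3.5, -0.2, 0.4, 1.0, -1.1, 0.3]

# Intended calculation: average over all ten rows.
full_average = mean(growth)

# Mistaken calculation: the selected range skips the first five rows,
# much like dragging a spreadsheet formula over too few cells.
partial_average = mean(growth[5:])

print(f"all ten rows:   {full_average:.2f}")
print(f"rows 6-10 only: {partial_average:.2f}")
```

Both numbers look plausible on their own, which is why a written record of each step matters: it lets someone else rerun the calculation and notice the difference.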
DATA PROVENANCE
• Main goals
• Keep a record
• Be able to replicate your steps
• Facilitate collaboration (most data work uses a team)

• Versioning
• Some software automatically keeps old versions of files
• Google Docs (online files) does this
• Dropbox also syncs files across all your devices and keeps a local copy on computers (i.e., one you can use when there is no internet)
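Even without special software, a record of your steps can be kept in a small script. The sketch below, in Python, stores a fingerprint (a SHA-256 hash) of the data after each step together with a note about what was done. The function names, the log format, and the sample data are assumptions made for this example.

```python
import hashlib
import json


def fingerprint(rows):
    """Hash the data so any later change to it can be detected."""
    text = json.dumps(rows, sort_keys=True)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def apply_step(rows, log, description, step):
    """Run one cleaning step and record what was done in the log."""
    result = step(rows)
    log.append({
        "step": len(log) + 1,
        "description": description,
        "rows_in": len(rows),
        "rows_out": len(result),
        "hash_out": fingerprint(result),
    })
    return result


def drop_duplicates(rows):
    """Remove exact duplicate rows, keeping the first copy of each."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def drop_missing_value(rows):
    """Remove rows whose 'value' field is empty."""
    return [row for row in rows if row["value"] != ""]


# Invented raw data with one duplicate row and one missing value.
raw = [
    {"country": "A", "value": "1.5"},
    {"country": "A", "value": "1.5"},
    {"country": "B", "value": ""},
]

log = []
data = apply_step(raw, log, "drop exact duplicate rows", drop_duplicates)
data = apply_step(data, log, "drop rows with missing value", drop_missing_value)

for entry in log:
    print(entry["step"], entry["description"], entry["rows_in"], "->", entry["rows_out"])
```

A teammate who reruns the same steps on the same raw file should get the same hashes, which is a quick check that the work was replicated.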
TODAY
• Look at the World Bank Data visually: what do we
notice?
• World Bank Data – computing variables in spreadsheet
using the School of Data instructions.
• Getting your first look at Graphs using the School of
Data instructions.

• Seeing versions of files in Google Drive
GOALS BY JANUARY 29
• Clean data from the World Bank
• First graphs of variables
• Practice in dreaming up analyses
• Beginning to find our own data
• Basic Descriptive Statistics in ALEKS
• Basic Graphics in ALEKS
• FUN with Design

• First thoughts about your projects

Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024Janet Corral
 

Dernier (20)

Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 

Data Management Practices for Clean Analysis

  • 1. DATA MANAGEMENT SCI 2777 • Storytelling with Data • Spring 2014 Sister Edith Bogue • The College of St Scholastica
  • 2. DISPOSABLE DATA MANAGEMENT • Researchers know they need clean, reliable data • The analysis is what really interests them • When data arrive, they do a quick manual clean-up of any problems they see. • Often cut-and-paste in spreadsheets • Look for and fix anomalies • If no errors crop up in the analysis, they make a clean archive copy and forget about the data. The Perils of Disposable Data Management from Prometheus Research blog at https://www.prometheusresearch.com/the-perils-of-disposable-data-management/
  • 3. DISPOSABLE DATA MANAGEMENT • PROBLEM #1: More data arrive and they have to do the same cut-and-paste / sorting / combining operations over again. • PROBLEM #2: An anomaly appears in a later data set. The researcher has to check all the earlier data to find out if it’s there too. It was a cut-and-paste error. • PROBLEM #3: The results look peculiar, or are opposite to the prediction. Was it the data handling or is it real? The Perils of Disposable Data Management from Prometheus Research blog at https://www.prometheusresearch.com/the-perils-of-disposable-data-management/
  • 4. GOOD DATA PRACTICES • “It’s common to spend many tedious and frustrating hours cleaning and wrangling your data into a usable format, followed by careful exploration to provide context and reveal potential problems with the analyses you want to run.” • “Data cleaning and data transformation are two major bottlenecks in data analysis.” Good Data Management Practices for Data Analysis from Prometheus Research blog at https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-introduction-part-1/
  • 5. DATA CLEANING It should be no surprise that it takes longer to clean messier data. Unfortunately, there are many ways that data can be messy. Powerful tools and practices can help you turn messy data into clean data. Good Data Management Practices for Data Analysis from Prometheus Research blog at https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-introduction-part-1/
  • 6. DATA TRANSFORMATION “This is more subtle. It’s often important to visualize and model the data in various ways when conducting an analysis. I’m not talking about going on fishing expeditions, but rather about familiarizing yourself with the data… The point is that frequent data transformations are required to mediate changes between these representations, introducing an underappreciated amount of friction in analysis.” Good Data Management Practices for Data Analysis from Prometheus Research blog at https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-introduction-part-1/
  • 7. TIDY DATA • Each variable forms a column • Each observation forms a row • Each data set contains information on only one observational unit of analysis (e.g., families, participants, participant visits) Good Data Management Practices for Data Analysis: Tidy Data (Part 2) from Prometheus Research blog at https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-tidy-data-part-2/
  • 8. MESSY DATA • Column names represent data values instead of variable names • A single column contains data on multiple variables instead of a single variable • Variables are contained in both rows and columns instead of just columns • A single table contains more than one observational unit • Data about an observational unit is spread across multiple data sets Good Data Management Practices for Data Analysis: Tidy Data (Part 2) from Prometheus Research blog at https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-tidy-data-part-2/
  • 9. TIDY TOOLS • Tidy tools are those that accept, manipulate, and return tidy data. • Tidy tools are like Lego blocks—individually simple but flexible & powerful in combination. • What tools are tidy? • Most functions in R • Most transformations in SPSS or SAS • Relational databases (an entire skill of its own) • Spreadsheets are not tidy tools Good Data Management Practices for Data Analysis: Tidy Data (Part 2) from Prometheus Research blog at https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-tidy-data-part-2/
  • 10. SCI 2777 • We will learn about cleaning data first with untidy tools: spreadsheets and the like. • They are more familiar and easy to use right away • We will learn how to track the provenance even with our untidy tools. • Soon, we will use R for some tasks, and get some basic skills for using a tidy tool for cleaning data. Good Data Management Practices for Data Analysis: Tidy Data (Part 2) from Prometheus Research blog at https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-tidy-data-part-2/
  • 12. THOMAS HERNDON • Third-year economics grad student at UMass-Amherst (age 28) • Class assignment: replicate the findings of a published study. • Growth in a Time of Debt by Reinhart & Rogoff in American Economic Review • Finding: Growth drops off sharply if debt is high • Basis for austerity economics • Could not replicate • Found 3-4 errors. Photo: The 28-Year-Old Who Caught the Excel Error Heard Round the World. In These Times http://bit.ly/Lz2eDm Herndon et al. (2013) Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff. PERI Working Papers Number 322. http://www.peri.umass.edu/fileadmin/pdf/working_papers/working_papers_301-350/WP322.pdf
  • 13. “There were actually four errors all together. Any one error by itself would not have been enough to cause the negative average. It was the combined effect of all four of them: They interacted with each other and amplified each other—almost like a perfect storm of errors.” Quote from: The 28-Year-Old Who Caught the Excel Error Heard Round the World. In These Times http://bit.ly/Lz2eDm Researchers Finally Replicated Reinhart-Rogoff, and There Are Serious Problems from Next New Deal at http://bit.ly/1f1XUHG
  • 14. DATA PROVENANCE • Main goals • Keep a record • Be able to replicate your steps • Facilitate collaboration (most data work is done by a team) • Versioning • Some software automatically keeps old versions of files • Google Docs (online files) does this • Dropbox also syncs files across all your devices and keeps a local copy on computers (i.e., one you can use when there is no internet)
  • 15. TODAY • Look at the World Bank Data visually: what do we notice? • World Bank Data – computing variables in spreadsheet using the School of Data instructions. • Getting your first look at Graphs using the School of Data instructions. • Seeing versions of files in Google Drive
  • 16. GOALS BY JANUARY 29 • Clean data from the World Bank • First graphs of variables • Practice in dreaming up analyses • Beginning to find our own data • Basic Descriptive Statistics in ALEKS • Basic Graphics in ALEKS • FUN with Design • First thoughts about your projects
  • 17. DATA MANAGEMENT SCI 2777 • Storytelling with Data • Spring 2014 Sister Edith Bogue • The College of St Scholastica
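
The TIDY DATA and MESSY DATA slides can be illustrated with a small sketch. It is not from the slides: the table, the numbers, and the helper name `melt` are invented for illustration, and the example uses only plain Python so it runs without extra packages. It shows the first kind of messiness listed (column names that are really data values, here years) and one way to reshape such a table so that each variable forms a column and each observation forms a row.

```python
# Hypothetical, made-up data: the column names "2010" and "2011" are data
# values (years), not variable names, so this table is "messy".
wide_rows = [
    {"country": "Chile", "2010": 5.8, "2011": 6.1},
    {"country": "Ghana", "2010": 7.9, "2011": 14.0},
]


def melt(rows, id_column, variable_name, value_name):
    """Reshape wide rows into tidy rows: one observation per row.

    `melt` is an illustrative helper written for this sketch, not a
    function from any particular library.
    """
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on each tidy row instead
            tidy.append(
                {
                    id_column: row[id_column],
                    variable_name: column,  # the old column name becomes a value
                    value_name: value,
                }
            )
    return tidy


tidy_rows = melt(wide_rows, "country", "year", "gdp_growth")
for tidy_row in tidy_rows:
    print(tidy_row)
```

Each tidy row now describes one country in one year, so a new year of data adds rows rather than a new column. Because the reshaping is a script rather than a cut-and-paste, the same step can be rerun unchanged when more data arrive.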