DATA JOURNALISM
Dr. Bahareh Heravi
@Bahareh360
Week 8

Cleaning and Analysing Data
 
DATA is often ugly & MESSY
Data Profiling
Assess current state of your data.
Data Cleaning
Correct the issues you found during ‘data profiling’.
Exploring data
Checking data
Filtering data
Cleaning data
Reshaping data
Annotating data
Linking data
Dataset
Powerhouse Museum objects collection
Download from:
http://data.freeyourmetadata.org/powerhouse-museum/phm-collection.tsv
Open Refine and load the dataset.
Sorting data
Faceting data
To select a subset of your data to work on.
To get useful insight into your data.
To apply a transformation to a subset of your data.
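As a rough illustration of the three uses above, a facet can be thought of as a count of rows per distinct value, combined with a filter on the value you pick. The sketch below shows that idea in plain Python. The rows and the column name `category` are invented for illustration, and this is not a description of how OpenRefine implements facets internally.

```python
from collections import Counter

# Hypothetical rows; column names and values are invented for illustration.
rows = [
    {"id": 1, "category": "Costume"},
    {"id": 2, "category": "Coins"},
    {"id": 3, "category": "Costume"},
    {"id": 4, "category": ""},
]

def text_facet(rows, column):
    """Count how many rows hold each distinct value in `column`."""
    return Counter(row[column] for row in rows)

def select(rows, column, value):
    """Keep only the rows whose `column` equals the chosen facet value."""
    return [row for row in rows if row[column] == value]

facet = text_facet(rows, "category")          # insight: value counts
subset = select(rows, "category", "Costume")  # subset to work on or transform
```

The counts give the overview, and the selected subset is what a later transformation would be applied to.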
Types of Facets

Text facets for text
Numeric facets for numbers and dates
Predefined/customised facets
Text facets

Text facets are used for faceting text.
Examples: County or city names, TD names
Numeric facets

Numeric facets are used for faceting numerical values and ranges.
Examples: Expenditure, crime rate
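A numeric facet groups values into ranges and lets you narrow the data to one range. The following is a minimal sketch of that idea, assuming fixed-width bins. The expenditure figures and the bin width of 500 are invented for illustration.

```python
from collections import Counter

def numeric_facet(values, bin_width):
    """Count values per range [start, start + bin_width)."""
    counts = Counter()
    for v in values:
        start = (v // bin_width) * bin_width  # lower edge of the bin holding v
        counts[start] += 1
    return dict(sorted(counts.items()))

def in_range(values, low, high):
    """Keep the values that fall inside the selected range [low, high)."""
    return [v for v in values if low <= v < high]

expenditure = [120, 450, 480, 990, 1010]  # invented figures
facet = numeric_facet(expenditure, 500)
selected = in_range(expenditure, 400, 1000)
```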
Detecting blanks
Removing blanks
Detecting duplicates
Removing duplicates
Warning:
If we remove all the rows flagged as duplicates, the original records will also be removed!
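The warning can be illustrated with a small Python sketch. Flagging every row whose key occurs more than once also flags the first (original) occurrence, so deleting all flagged rows loses the record entirely. Keeping the first occurrence and dropping only the later repeats avoids this. The column name `record_id` and the rows are hypothetical.

```python
from collections import Counter

# Hypothetical rows; "A1" appears twice.
rows = [
    {"record_id": "A1", "title": "Plough"},
    {"record_id": "B2", "title": "Dress"},
    {"record_id": "A1", "title": "Plough"},
]

def flag_duplicates(rows, key):
    """Return True for every row whose key occurs more than once (originals included)."""
    counts = Counter(row[key] for row in rows)
    return [counts[row[key]] > 1 for row in rows]

def drop_repeats(rows, key):
    """Keep the first row for each key and drop only the later repeats."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept

flags = flag_duplicates(rows, "record_id")
naive = [row for row, dup in zip(rows, flags) if not dup]  # loses record A1 entirely
safe = drop_repeats(rows, "record_id")                     # keeps one copy of A1
```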
Now you can remove.
Facet by blank
Congratulations, you have removed all blank and duplicate values.
Simple cell transformations
Advanced data operations
Clustering
Transformations
Multi-valued cells
Derived columns
Splitting data across columns
Regular Expressions
GREL (General Refine Expression Language)
Multi-valued cells
To split a cell in
Clustering
To cluster (syntactically) similar items together.
To be used to fix inconsistencies, typos, etc.
Examples in the dataset:
Agricultural equipment / Agricultural Equipment
Costume / Costumes
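One common way to find such variants is to reduce each value to a normalised key and group the values that share a key. The sketch below is a simplified, fingerprint-style version of that idea in Python. It is an illustration under that assumption, not a claim about which clustering method was used in the lecture.

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Build a normalised key: trim, lowercase, strip punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

def cluster(values):
    """Group values that share a key; return only groups with more than one variant."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return [sorted(g) for g in groups.values() if len(g) > 1]

values = ["Agricultural equipment", "Agricultural Equipment", "Costume", "Costumes"]
clusters = cluster(values)
```

With this key, the two spellings of "Agricultural equipment" fall into one cluster, while "Costume" and "Costumes" do not, because their keys differ. Catching that pair would need a looser key or an edit-distance comparison.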
Transforming cell values
GREL (General Refine Expression Language)
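A typical cell transformation trims stray whitespace and normalises case. In GREL this is roughly what expressions such as value.trim() and value.toTitlecase() are for. The Python sketch below shows a comparable clean-up outside OpenRefine. The function name `clean_cell` and the sample values are invented for illustration, and Python's title-casing may differ from GREL's in edge cases.

```python
import re

def clean_cell(value):
    """Trim the ends, collapse internal whitespace runs, and title-case the result."""
    collapsed = re.sub(r"\s+", " ", value.strip())
    return collapsed.title()

# Invented sample cells with inconsistent spacing and case.
cleaned = [clean_cell(v) for v in ["  agricultural   equipment ", "COSTUME"]]
```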
  
Resources
Using OpenRefine by Ruben Verborgh and Max De Wilde
http://freeyourmetadata.org/cleanup/
Cleaning Data with Refine, School of Data
The Bastard Book of Regular Expressions by Dan Nguyen
GREL: https://github.com/OpenRefine/OpenRefine/wiki/General-Refine-Expression-Language
 
Questions?
Bahareh R. Heravi
@Bahareh360

Data Journalism - Cleaning Data