1
DATA WRANGLING
FIND LOAD CLEAN
2
DATA WRANGLING
FIND LOAD CLEAN
WHERE CAN I GET DATA FROM?
THERE'S CLIENT DATA, AND THERE'S PUBLIC DATA
3
Client data isn't easy to get
Public data isn't relevant
“We have internal information. Getting information from outside is our challenge. There’s no way of doing that.”
– Senior Editor, Leading Media Company
INDIA’S RELIGIONS
5
If you search on google.co.in for "how do I convert to", these are the suggestions Google shows.
Popularity influences the order of the suggestions.
So there's a good chance that the religions at the top are searched for more often.
AUSTRALIA’S RELIGIONS
6
But be careful of how you interpret it.
In Australia, PDF is not a religion. Unless you're a data scientist.
7
USE MULTIPLE APPROACHES TO FIND YOUR DATA
8
1. Public data catalogues
https://github.com/caesar0301/awesome-public-datasets
https://github.com/rasbt/pattern_classification/blob/master/resources/dataset_collections.md
2. Government data websites
https://data.gov.in/
https://data.gov/
https://data.gov.uk/
https://data.gov.sg/
http://publicdata.eu/
3. or search on Google
https://www.google.com/
4. or ask people
Humans™
9
EXERCISE
LET'S FIND SOME DATASETS
(YOU PICK WHAT YOU WANT TO FIND. WE WILL SEARCH FOR IT)
10
DATA WRANGLING
FIND LOAD CLEAN
HOW DO I STORE & PROCESS DATA?
WE LOAD DATA INTO OUR PROGRAMS OR OTHERS'
11
Files
• Delimited text: CSV, TSV, PSV
• Formatted text: TXT, PRN
• Marked up text: HTML, XML, JSON, JSON Lines, YAML, SQL
• Spreadsheets: XLS*, ODS, MDB, ACCDB, DBF
• Specialised formats: HDF5, SQLite, DTA (Stata), C4.5, CDF
• Graph formats: GEXF, GDF, GML, GraphML, Graphviz DOT
• Unstructured: TXT, PDF, Images, Audio, Video, ...
Databases
• In-memory databases: DataFrames
• Relational databases: Oracle, MySQL, PostgreSQL, SQL Server, DB2, Sybase, Informix, ...
• Document databases: MongoDB, CouchDB, Elasticsearch, Firebase
• Distributed databases: HDFS, Spark
• Cloud data stores: BigQuery, DynamoDB, Redshift, Azure SQL Database, DocumentDB, ...
• APIs: Twitter, Facebook, Google, Wikipedia, YouTube, ...
Use CSV when sharing tabular data.
Use JSON for hierarchical data.
Use in-memory databases when the data fits; otherwise use relational databases.
Don't analyse big data. Shrink it.
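These recommendations can be sketched in Python using only the standard library. The records and field names below are invented for illustration. The sketch writes flat rows as CSV, nests the same rows as JSON, and "shrinks" a table by streaming it and keeping only the columns an analysis needs.

```python
import csv
import io
import json

# Hypothetical sample data, invented for this sketch.
rows = [
    {"state": "Tamil Nadu", "year": 2011, "party": "ADMK", "votes": 52000},
    {"state": "Tamil Nadu", "year": 2011, "party": "DMK", "votes": 48000},
    {"state": "Kerala", "year": 2011, "party": "INC", "votes": 61000},
]

# Tabular data: share it as CSV.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["state", "year", "party", "votes"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Hierarchical data: nest it (state -> year -> party) and share it as JSON.
nested = {}
for row in rows:
    by_year = nested.setdefault(row["state"], {})
    by_year.setdefault(str(row["year"]), {})[row["party"]] = row["votes"]
json_text = json.dumps(nested, indent=2)


def pick_columns(text, columns):
    """Stream CSV text and yield only the requested columns of each row."""
    for record in csv.DictReader(io.StringIO(text)):
        yield {name: record[name] for name in columns}


# Shrink before analysing: keep two of the four columns.
small = list(pick_columns(csv_text, ["party", "votes"]))
```

Values read back from CSV are strings, so types have to be restored after loading. JSON keeps numbers as numbers.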
12
EXERCISE
LET'S LOAD FROM A SITE
THE GOOGLE SEARCH DATA YOU SAW EARLIER
LET'S LOAD A BIG DATASET
A FEW COLUMNS FROM A LEAKED OK CUPID SURVEY
LET'S LOAD AN UNSTRUCTURED TABLE
A TABLE FROM THE MEDICAL CERTIFICATION OF CAUSE OF DEATH 2013 PDF
13
DATA WRANGLING
FIND LOAD CLEAN
HOW DO I FIX THE DATA ISSUES?
CHECK FOR ALL THESE DATA CLEANSING ACTIVITIES
14
Fix rows & columns
Fix missing values
Standardise values
Fix invalid values
Filter data
When we receive a dataset, the same kinds of things tend to go wrong. Each can be fixed in specific ways.
Here's a workflow / checklist of things to look out for and fix.
After this, check whether the data is complete and sufficient to solve the problem.
FIX ROWS AND COLUMNS
15
Fix rows
• Delete incorrect rows: header rows, footer rows
• Delete summary rows: total, subtotal rows
• Delete extra rows: column number indicators (1), (2), ...; blank rows
Fix columns
• Add column names if missing: files with missing header row
• Rename columns consistently: abbreviations, encoded columns
• Delete unnecessary columns: unidentified columns, irrelevant columns
• Split columns for more data: split http://host:port/path into [Host, Port, Path]
• Merge columns for identifiers: merge Firstname, Lastname into Name; merge State, District into FullDistrict
• Align misaligned columns: dataset may have shifted columns
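A few of these row and column fixes might look like the following in plain Python. This is a minimal sketch: the raw rows, the column names, and the rule for spotting junk rows are all assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical raw rows, as they might come out of a spreadsheet export.
raw = [
    ["Report generated on 1 Jan"],                  # note above the header
    ["Firstname", "Lastname", "URL", "Sales"],      # real column names
    ["(1)", "(2)", "(3)", "(4)"],                   # column number indicators
    ["Asha", "Rao", "http://example.com:8080/a", "10"],
    ["", "", "", ""],                               # blank row
    ["Ravi", "Kumar", "http://example.org:80/b/c", "20"],
    ["Total", "", "", "30"],                        # summary row
]


def is_junk(row, width):
    """True for rows that are not data: wrong width, blank, indicators, totals."""
    if len(row) != width or not any(cell.strip() for cell in row):
        return True
    if all(cell.startswith("(") and cell.endswith(")") for cell in row):
        return True
    return row[0].strip().lower() in {"total", "subtotal"}


header = raw[1]
data = [row for row in raw[2:] if not is_junk(row, len(header))]

records = []
for first, last, url, sales in data:
    parts = urlsplit(url)
    records.append({
        "Name": f"{first} {last}",   # merge two columns into one identifier
        "Host": parts.hostname,      # split one column into three
        "Port": parts.port,
        "Path": parts.path,
        "Sales": sales,
    })
```

The junk-row test is deliberately strict about row width, which also catches many shifted or truncated rows. Those rows should be inspected rather than silently dropped.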
FIX MISSING VALUES
16
Fix missing values
• Set values as missing values: treat blanks, "NA", "XX", "999", etc. as missing
• Fill missing values with: a constant (e.g. zero); another column (e.g. created date defaults to updated date); a function (e.g. average of rows/columns); external data
• Remove missing values: delete row; delete column
• Fill partial missing values: missing time zone, century, etc.
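The missing-value steps above can be sketched as follows. The sentinel set and the records are assumptions for this example. Which markers mean "missing" depends on the dataset, and a marker like "999" is only safe to treat as missing in columns where it cannot be a real value.

```python
from statistics import mean

MISSING_MARKERS = {"", "NA", "XX", "999"}   # assumed sentinels for this dataset


def to_missing(value):
    """Map blank and sentinel strings to None so they are treated as missing."""
    if value is None or value.strip() in MISSING_MARKERS:
        return None
    return value.strip()


# Hypothetical records, invented for this sketch.
rows = [
    {"id": "1", "age": "34", "created": "2016-01-05", "updated": "2016-02-01"},
    {"id": "2", "age": "NA", "created": "", "updated": "2016-03-10"},
    {"id": "3", "age": "40", "created": "2016-04-02", "updated": "XX"},
    {"id": "4", "age": "999", "created": " ", "updated": ""},
]

cleaned = [{key: to_missing(value) for key, value in row.items()} for row in rows]

# Fill from another column: created date defaults to updated date.
for row in cleaned:
    if row["created"] is None:
        row["created"] = row["updated"]

# Fill with a function: the mean of the ages that are present.
known_ages = [int(row["age"]) for row in cleaned if row["age"] is not None]
average_age = mean(known_ages)
for row in cleaned:
    row["age"] = int(row["age"]) if row["age"] is not None else average_age

# Remove what cannot be filled: drop rows that still lack a created date.
complete = [row for row in cleaned if row["created"] is not None]
```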
STANDARDISE VALUES
17
Standardise numbers
• Remove outliers: removing high and low values
• Standardise units: lbs to kgs, m/s for speed
• Scale values if required: fit to percentage scale
• Standardise precision: 2.1 to 2.10
Standardise text
• Remove extra characters: common prefix/suffix, leading/trailing/multiple spaces
• Standardise case: uppercase, lowercase, Title Case, Sentence case, etc.
• Standardise format: 23/10/16 to 2016/10/23; “Modi, Narendra” to “Narendra Modi”
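Several of these standardisation steps are one-liners in Python. The helper names below are hypothetical, and the date helper assumes the input really is day/month/two-digit-year.

```python
import re
from datetime import datetime

LB_TO_KG = 0.45359237  # kilograms in one international pound


def to_kg(value, unit):
    """Standardise weights to kilograms, rounded to two decimals."""
    kilograms = value * LB_TO_KG if unit == "lbs" else value
    return round(kilograms, 2)


def to_percent(values):
    """Scale values to a 0-100 range (min-max scaling)."""
    low, high = min(values), max(values)
    if high == low:                      # avoid dividing by zero
        return [0.0 for _ in values]
    return [100 * (value - low) / (high - low) for value in values]


def fixed_precision(value, places=2):
    """Render a number with a fixed number of decimals: 2.1 -> '2.10'."""
    return f"{value:.{places}f}"


def clean_text(text):
    """Trim the ends and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def standardise_date(text):
    """Convert dd/mm/yy to yyyy/mm/dd."""
    return datetime.strptime(text, "%d/%m/%y").strftime("%Y/%m/%d")


def standardise_name(text):
    """Convert 'Last, First' to 'First Last'; leave other names unchanged."""
    if "," not in text:
        return clean_text(text)
    last, first = (clean_text(part) for part in text.split(",", 1))
    return f"{first} {last}"
```

For example, `standardise_date("23/10/16")` gives `"2016/10/23"` and `standardise_name("Modi, Narendra")` gives `"Narendra Modi"`.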
FIX INVALID VALUES
18
Fix invalid values
• Encode Unicode properly: e.g. text decoded as CP1252 instead of UTF-8
• Convert incorrect data types: string to number ("12,300"); string to date ("2013-Aug"); number to string (PIN code 110001 to "110001")
• Correct values not in list: non-existent country, PIN code
• Correct wrong structure: phone number with over 10 digits
• Correct values beyond range: temperature below -273.15 °C (0 K)
• Validate internal rules: Gross sales > Net sales; Date of delivery > Date of ordering; if Title is "Mr" then Gender is "M"
In these cases, treat the value as "missing". Remove it, or fix it with a formula.
The formula may involve the value, row, column, entire dataset, or external data.
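The checks above can be sketched as small parsers plus a validator that reports which fields break a rule, so those fields can be set to missing. The field names and the sample row are assumptions for this example. Parsing the month name relies on the default English locale.

```python
from datetime import datetime


def parse_number(text):
    """String to number: '12,300' -> 12300."""
    return int(text.replace(",", ""))


def parse_month(text):
    """String to date: '2013-Aug' -> the first day of that month."""
    return datetime.strptime(text, "%Y-%b").date()


def pin_to_text(pin):
    """Number to string: a PIN code is an identifier, not a quantity."""
    return f"{pin:06d}"


def repair_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as CP1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not repairable this way; leave it unchanged


def validate(row):
    """Return the names of the fields that break a rule."""
    problems = []
    if row["temperature_c"] < -273.15:
        problems.append("temperature_c")   # below absolute zero
    if not (row["phone"].isdigit() and len(row["phone"]) == 10):
        problems.append("phone")           # wrong structure
    if row["gross_sales"] <= row["net_sales"]:
        problems.append("net_sales")       # which of the two is wrong is a judgment call
    if row["title"] == "Mr" and row["gender"] != "M":
        problems.append("gender")
    return problems


# Hypothetical row, invented for this sketch.
row = {"temperature_c": -300.0, "phone": "98765432101", "gross_sales": 90,
       "net_sales": 100, "title": "Mr", "gender": "M"}
bad_fields = validate(row)

# Treat every invalid value as missing.
fixed = {key: (None if key in bad_fields else value) for key, value in row.items()}
```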
FILTER DATA
19
Filter data
• Deduplicate data: remove identical rows; remove rows where some columns are identical
• Filter rows: filter by segments; filter by date period
• Filter columns: pick columns relevant to analysis
• Aggregate data: group by required keys, aggregate the rest
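The filtering steps above, sketched with the standard library. The sales records are invented for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales records, invented for this sketch.
rows = [
    {"region": "North", "day": date(2016, 1, 5), "product": "A", "sales": 10},
    {"region": "North", "day": date(2016, 1, 5), "product": "A", "sales": 10},  # duplicate
    {"region": "South", "day": date(2016, 2, 9), "product": "B", "sales": 7},
    {"region": "South", "day": date(2015, 12, 30), "product": "B", "sales": 4},
    {"region": "North", "day": date(2016, 3, 1), "product": "B", "sales": 5},
]


def deduplicate(records, keys):
    """Keep the first record for each combination of the given key columns."""
    seen, unique = set(), []
    for record in records:
        signature = tuple(record[key] for key in keys)
        if signature not in seen:
            seen.add(signature)
            unique.append(record)
    return unique


# Remove identical rows by comparing every column.
unique = deduplicate(rows, keys=["region", "day", "product", "sales"])

# Filter rows by date period, then keep only the columns the analysis needs.
in_2016 = [
    {"region": r["region"], "sales": r["sales"]}
    for r in unique
    if date(2016, 1, 1) <= r["day"] <= date(2016, 12, 31)
]

# Aggregate: group by region and sum the sales.
totals = defaultdict(int)
for r in in_2016:
    totals[r["region"]] += r["sales"]
```

Passing fewer keys to `deduplicate` covers the second case on the slide, where rows count as duplicates when only some columns are identical.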
20
EXERCISE
ASSEMBLY ELECTION DATA
SOMETHING WE DID A FEW YEARS AGO, AND WHICH IS WELL DOCUMENTED
The ECI website has this data.
21
… and most of the data is in PDFs
22
The PDF files have a reasonably clear structure
23
… that translates into text that can be parsed
24
… which, with some effort, can be converted into a structured format
… and at this point, we need to start checking for errors.
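As a rough illustration of turning extracted text into structured rows, here is a sketch with a regular expression. The line layout, the sample text, and the field names are all assumptions. The actual layout of the ECI PDFs is not reproduced here. Lines that do not match are kept rather than discarded, because they are the first place to look for errors.

```python
import re

# Assumed layout of one extracted text line (hypothetical):
#   <serial> <candidate name> <sex> <party> <votes>
LINE = re.compile(r"^\s*(\d+)\s+(.+?)\s+([MF])\s+([A-Z()]+)\s+(\d+)\s*$")

# Invented sample text, including two lines that will not parse.
text = """\
1 A. KUMAR M INC 45210
2 S. DEVI F ADMK 39877
Page 3 of 40
3 R. NAIDU M IND 1O23
"""

records, rejects = [], []
for number, line in enumerate(text.splitlines(), start=1):
    match = LINE.match(line)
    if match:
        serial, name, sex, party, votes = match.groups()
        records.append({"serial": int(serial), "name": name, "sex": sex,
                        "party": party, "votes": int(votes)})
    else:
        rejects.append((number, line))  # keep rejects: they show what went wrong
```

Here the rejects are a page footer and a vote count where the letter O stands in for a zero, the kind of extraction error that is easy to miss otherwise.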
25
At this point, we start checking what’s gone wrong
Each row here is one constituency. The number of candidates that have contested in each constituency in every year is shown as a table.
You can see that some patterns emerge here.
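A table like the one described, with one row per constituency and one column per year, can be built with a counter. The rows below are invented. The point is that a mis-spelt constituency shows up as a row with gaps.

```python
from collections import Counter

# Hypothetical parsed rows: (constituency, year, candidate), invented here.
rows = [
    ("BHADRACHALAM", 2004, "A"), ("BHADRACHALAM", 2004, "B"),
    ("BHADRACHALAM", 2009, "C"), ("BHADRACHALAM", 2009, "D"),
    ("BHADRACHELAM", 2009, "E"),          # a mis-spelling appears as a thin row
    ("KHAMMAM", 2004, "F"), ("KHAMMAM", 2009, "G"),
]

counts = Counter((constituency, year) for constituency, year, _ in rows)
years = sorted({year for _, year, _ in rows})
constituencies = sorted({constituency for constituency, _, _ in rows})

# One row per constituency, one column per year; 0 marks a gap worth checking.
table = {c: [counts[(c, y)] for y in years] for c in constituencies}

# Constituencies with no candidates in some year are worth a spelling check.
suspects = [c for c, row in table.items() if 0 in row]
```

A gap is only a hint. Constituencies are also created and abolished over time, so each suspect still needs a manual look.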
26
Not every spelling error is easily identifiable by the first letter
• Parties are mis-spelt: MADMK, MAMAK, MDMK
• Party names change: AIADMK, ADMK, ADK
• Parties restructure: INC(I), INC
• Constituency names are mis-spelt: BHADRACHALAM, BHADRACHELAM, BHADRAHCALAM
27
Fortunately, large scale data itself can provide a solution
28
… with modern tools that support machine learning
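One simple way to let the data itself suggest corrections is to map each rare spelling to the most frequent spelling that closely resembles it. The sketch below uses plain string similarity from Python's `difflib`, not the machine-learning tooling the slides refer to. The spellings come from the earlier slide, but the counts and the 0.85 cutoff are assumptions.

```python
from collections import Counter
from difflib import get_close_matches

# Invented counts; the spellings are the ones shown on the earlier slide.
names = (["BHADRACHALAM"] * 40 + ["BHADRACHELAM"] * 2 + ["BHADRAHCALAM"] * 1
         + ["KHAMMAM"] * 35)

frequency = Counter(names)


def canonical(name, cutoff=0.85):
    """Map a name to the most frequent spelling among its close matches."""
    candidates = get_close_matches(name, list(frequency), n=5, cutoff=cutoff)
    return max(candidates, key=frequency.__getitem__) if candidates else name


# Every distinct spelling mapped to its suggested correction.
mapping = {name: canonical(name) for name in frequency}
```

The mapping should be reviewed before it is applied. Short party codes such as MDMK and ADMK are nearly identical as strings but are different parties, so string similarity alone would merge them wrongly.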
29
30
DATA WRANGLING
FIND LOAD CLEAN

 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 

Data Wrangling

  • 2. DATA WRANGLING (FIND, LOAD, CLEAN): WHERE CAN I GET DATA FROM?
  • 3. THERE'S CLIENT DATA, AND THERE'S PUBLIC DATA. Client data isn't easy to get. Public data isn't relevant.
  • 4. “We have internal information. Getting information from outside is our challenge. There’s no way of doing that.” – Senior Editor, Leading Media Company
  • 5. INDIA’S RELIGIONS: If you search on google.co.in for "how do I convert to", here are the suggestions Google shows. Popularity influences the order, so there's a good chance that the religions on top are searched for more often.
  • 6. AUSTRALIA’S RELIGIONS: But be careful how you interpret it. In Australia, PDF is not a religion. Unless you're a data scientist.
  • 8. USE MULTIPLE APPROACHES TO FIND YOUR DATA
      1. Public data catalogues: https://github.com/caesar0301/awesome-public-datasets and https://github.com/rasbt/pattern_classification/blob/master/resources/dataset_collections.md
      2. Govt data websites: https://data.gov.in/, https://data.gov/, https://data.gov.uk/, https://data.gov.sg/, http://publicdata.eu/
      3. Or search on Google: https://www.google.com/
      4. Or ask people: Humans™
  • 9. EXERCISE: LET'S FIND SOME DATASETS (YOU PICK WHAT YOU WANT TO FIND. WE WILL SEARCH FOR IT)
  • 10. DATA WRANGLING (FIND, LOAD, CLEAN): HOW DO I STORE & PROCESS DATA?
  • 11. WE LOAD DATA INTO OUR PROGRAMS OR OTHERS'
      Files:
      - Delimited text: CSV, TSV, PSV
      - Formatted text: TXT, PRN
      - Marked up text: HTML, XML, JSON, JSON Line, YAML, SQL
      - Spreadsheets: XLS*, ODS, MDB, ACCDB, DBF
      - Specialised formats: HDF5, SQLite, DTA (Stata), C4.5, CDF
      - Graph formats: GEXF, GDF, GML, GraphML, GraphViz DOT
      - Unstructured: TXT, PDF, Images, Audio, Video, ...
      Databases:
      - In-memory databases: DataFrames
      - Relational databases: Oracle, MySQL, PostgreSQL, SQL Server, DB2, Sybase, Informix, ...
      - Document databases: MongoDB, CouchDB, ElasticSearch, Firebase
      - Distributed databases: HDFS, Spark
      - Cloud data stores: BigQuery, DynamoDB, RedShift, Azure SQL Database, DocumentDB, ...
      - APIs: Twitter, Facebook, Google, Wikipedia, YouTube, ...
      Use CSV when sharing tabular data. Use JSON for hierarchical data. Use in-memory databases where the data fits, else relational databases. Don't analyse big data. Shrink it.
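The CSV and JSON advice on this slide can be sketched with Python's standard library. This is a minimal illustration, not the code used in the talk: the sample text, the field names and the helper names `load_csv` and `load_json` are all made up for the example.

```python
import csv
import io
import json

# Hypothetical tabular data; in practice this would be a file on disk.
CSV_TEXT = "region,year,candidates\nNorth,2011,120\nSouth,2012,85\n"

# Hypothetical hierarchical data: one region holding a nested list.
JSON_TEXT = '{"region": "South", "elections": [{"year": 2012, "candidates": 85}]}'


def load_csv(text):
    """Read delimited text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def load_json(text):
    """Parse hierarchical data into nested dicts and lists."""
    return json.loads(text)


rows = load_csv(CSV_TEXT)   # every CSV value arrives as a string
tree = load_json(JSON_TEXT)  # JSON keeps numbers and nesting
```

Note that the CSV reader returns every value as a string, which is one reason the cleaning steps later in the deck are needed.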
  • 12. EXERCISE
      - LET'S LOAD FROM A SITE: THE GOOGLE SEARCH DATA YOU SAW EARLIER
      - LET'S LOAD A BIG DATASET: A FEW COLUMNS FROM A LEAKED OK CUPID SURVEY
      - LET'S LOAD AN UNSTRUCTURED TABLE: A TABLE FROM THE MEDICAL CERTIFICATION OF CAUSE OF DEATH 2013 PDF
  • 13. DATA WRANGLING (FIND, LOAD, CLEAN): HOW DO I FIX THE DATA ISSUES?
  • 14. CHECK FOR ALL THESE DATA CLEANSING ACTIVITIES: Fix rows & columns, Fix missing values, Standardise values, Fix invalid values, Filter data. When we receive a dataset, we find a pattern of things that go wrong. These can be fixed in specific ways. Here's a workflow / checklist of things to look out for and fix. After this, check if the data is complete, and sufficient to solve the problem.
  • 15. FIX ROWS AND COLUMNS
      Fix rows (with examples):
      - Delete incorrect rows: header rows, footer rows
      - Delete summary rows: total, subtotal rows
      - Delete extra rows: column number indicators (1), (2), ...; blank rows
      Fix columns (with examples):
      - Add column names if missing: files with missing header row
      - Rename columns consistently: abbreviations, encoded columns
      - Delete unnecessary columns: unidentified columns, irrelevant columns
      - Split columns for more data: split http://host:port/path into [Host, Port, Path]
      - Merge columns for identifiers: merge Firstname, Lastname into Name; merge State, District into FullDistrict
      - Align misaligned columns: dataset may have shifted columns
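A few of the row and column fixes above can be sketched as small functions. This is only an illustration using the standard library; the function names and the choice of "total" and "subtotal" as summary labels are assumptions made for the example.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split http://host:port/path into Host, Port and Path columns."""
    parts = urlsplit(url)
    return {"Host": parts.hostname, "Port": parts.port, "Path": parts.path}


def merge_name(row):
    """Merge Firstname and Lastname into a single Name column."""
    merged = dict(row)
    merged["Name"] = f"{row['Firstname']} {row['Lastname']}".strip()
    return merged


def drop_junk_rows(rows, label_column):
    """Delete blank rows and summary rows (total, subtotal)."""
    cleaned = []
    for row in rows:
        if not any(str(value).strip() for value in row.values()):
            continue  # blank row
        if str(row[label_column]).strip().lower() in {"total", "subtotal"}:
            continue  # summary row
        cleaned.append(row)
    return cleaned
```

For example, `split_url("http://example.com:8080/a/b")` yields the host `example.com`, the port `8080` and the path `/a/b` as three separate values.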
  • 16. FIX MISSING VALUES
      Fix missing values (with examples):
      - Set values as missing values: treat blanks, "NA", "XX", "999", etc. as missing
      - Fill missing values with...: a constant (e.g. zero); a column (e.g. created date defaults to updated date); a function (e.g. average of rows/columns); external data
      - Remove missing values: delete row; delete column
      - Fill partial missing values: missing time zone, century, etc.
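The first two steps on this slide (mark values as missing, then fill them with a function) might look like the sketch below. The marker set comes from the slide's examples; the function names and the sample values are assumptions, and filling with the mean is just one of the options listed.

```python
from statistics import mean

# Markers the slide suggests treating as missing.
MISSING_MARKERS = {"", "NA", "XX", "999"}


def parse_number(value):
    """Return a float, or None when the value is a missing-value marker."""
    text = str(value).strip()
    return None if text in MISSING_MARKERS else float(text)


def fill_with_mean(values):
    """Replace None with the average of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to average; leave the gaps alone
    average = mean(known)
    return [average if v is None else v for v in values]


raw = ["10", "NA", "20", "999", " "]          # hypothetical column
parsed = [parse_number(v) for v in raw]       # markers become None
filled = fill_with_mean(parsed)               # gaps take the column mean
```

Keeping the "mark as missing" step separate from the "fill" step makes it easy to switch to deleting the rows instead.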
  • 17. STANDARDISE VALUES
      Standardise numbers (with examples):
      - Remove outliers: removing high and low values
      - Standardise units: lbs to kgs, m/s for speed
      - Scale values if required: fit to percentage scale
      - Standardise precision: 2.1 to 2.10
      Standardise text (with examples):
      - Remove extra characters: common prefix/suffix, leading/trailing/multiple spaces
      - Standardise case: uppercase, lowercase, Title Case, Sentence case, etc.
      - Standardise format: 23/10/16 to 2016/10/23; "Modi, Narendra" to "Narendra Modi"
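Some of these standardisation steps can be sketched in a few lines. This is an illustration only: the function names are made up, and the date helper assumes the input is day/month/two-digit-year, which has to be confirmed for each dataset.

```python
import re
from datetime import datetime

KG_PER_LB = 0.45359237  # exact definition of the pound in kilograms


def lbs_to_kg(pounds):
    """Standardise units: pounds to kilograms, rounded to 2 decimals."""
    return round(pounds * KG_PER_LB, 2)


def clean_text(text):
    """Remove leading, trailing and repeated spaces."""
    return re.sub(r"\s+", " ", text).strip()


def standardise_date(text):
    """Convert DD/MM/YY to YYYY/MM/DD (assumes day-first input)."""
    return datetime.strptime(text, "%d/%m/%y").strftime("%Y/%m/%d")


def standardise_name(text):
    """Convert "Last, First" to "First Last"; leave other names as-is."""
    if "," not in text:
        return clean_text(text)
    last, first = text.split(",", 1)
    return clean_text(f"{first} {last}")
```

With these helpers, `standardise_name("Modi, Narendra")` gives `"Narendra Modi"`, matching the example on the slide.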
  • 18. FIX INVALID VALUES
      Fix invalid values (with examples):
      - Encode unicode properly: CP1252 instead of UTF-8
      - Convert incorrect data types: string to number ("12,300"); string to date ("2013-Aug"); number to string (PIN code 110001 to "110001")
      - Correct values not in list: non-existent country, PIN code
      - Correct wrong structure: phone number with over 10 digits
      - Correct values beyond range: temperature less than -273 °C (0 K)
      - Validate internal rules: Gross sales > Net sales; Date of delivery > Date of ordering; if Title is "Mr" then Gender is "M"
      In these cases, treat the value as "missing". Remove it, or fix it with a formula. The formula may involve the value, row, column, entire dataset, or external data.
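The type conversions, range check and internal rules above could be sketched as follows. The function names and the row keys (`gross_sales`, `net_sales`, `title`, `gender`) are hypothetical, and the month parser assumes English month abbreviations under the default locale.

```python
from datetime import datetime

ABSOLUTE_ZERO_C = -273.15  # nothing can be colder than this


def parse_number(text):
    """String to number: "12,300" becomes 12300."""
    return int(text.replace(",", ""))


def parse_year_month(text):
    """String to date: "2013-Aug" becomes 1 August 2013."""
    return datetime.strptime(text, "%Y-%b")


def pin_to_string(pin):
    """Number to string: PIN codes are identifiers, not quantities."""
    return f"{int(pin):06d}"


def valid_temperature(celsius):
    """Treat physically impossible temperatures as missing (None)."""
    return None if celsius < ABSOLUTE_ZERO_C else celsius


def violated_rules(row):
    """Return the internal rules that a row breaks."""
    problems = []
    if row["gross_sales"] < row["net_sales"]:
        problems.append("gross_sales < net_sales")
    if row["title"] == "Mr" and row["gender"] != "M":
        problems.append("title/gender mismatch")
    return problems
```

Returning `None` for an out-of-range value follows the slide's advice to treat invalid values as missing, so the missing-value handling can take over from there.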
  • 19. FILTER DATA
      Filter data (with examples):
      - Deduplicate data: remove identical rows; remove rows where some columns are identical
      - Filter rows: filter by segments; filter by date period
      - Filter columns: pick columns relevant to analysis
      - Aggregate data: group by required keys, aggregate the rest
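Deduplication and aggregation can be sketched on plain lists of dicts. The function names, the column names and the sample rows below are assumptions for illustration, and summing is only one possible way to aggregate.

```python
from collections import defaultdict


def deduplicate(rows, keys=None):
    """Keep the first row for each distinct combination of `keys`.

    With keys=None, only fully identical rows count as duplicates.
    """
    seen = set()
    unique = []
    for row in rows:
        fields = keys if keys is not None else sorted(row)
        signature = tuple((field, row[field]) for field in fields)
        if signature not in seen:
            seen.add(signature)
            unique.append(row)
    return unique


def aggregate(rows, key, value):
    """Group by `key` and sum the `value` column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


# Hypothetical rows: the second is an exact duplicate of the first.
rows = [
    {"party": "A", "year": 2011, "votes": 10},
    {"party": "A", "year": 2011, "votes": 10},
    {"party": "A", "year": 2016, "votes": 5},
    {"party": "B", "year": 2011, "votes": 7},
]
votes_by_party = aggregate(deduplicate(rows), "party", "votes")
```

Deduplicating before aggregating matters here: without it, party A's 2011 votes would be counted twice.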
  • 20. EXERCISE: ASSEMBLY ELECTION DATA. SOMETHING WE DID A FEW YEARS AGO, AND IS WELL DOCUMENTED
  • 21. The ECI website has this data.
  • 22. … and, most of the data is in PDFs
  • 23. The PDF files have a reasonably clear structure
  • 24. … that translates into text that can be parsed
  • 25. … which, with some effort, can be converted into a structured format … and at this point, we need to start checking for errors.
  • 26. At this point, we start checking what’s gone wrong. Each row here is one constituency. The number of candidates that have contested in each constituency in every year is shown as a table. You can see that some patterns emerge here.
  • 27. Not every spelling error is easily identifiable by the first letter
      - Parties are mis-spelt: MADMK, MAMAK, MDMK
      - Party names change: AIADMK, ADMK, ADK
      - Parties restructure: INC(I), INC
      - Constituency names mis-spelt: BHADRACHALAM, BHADRACHELAM, BHADRAHCALAM
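One simple way to surface spelling variants like these is to group names by string similarity. The sketch below uses `difflib` from the standard library as a stand-in; it is not the approach used in the deck, and the function names, the 0.8 threshold and the extra name KHAMMAM (added only for contrast) are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def group_similar(names, threshold=0.8):
    """Greedily group names that resemble the first name of a group."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])  # no close match: start a new group
    return groups


names = ["BHADRACHALAM", "BHADRACHELAM", "BHADRAHCALAM", "KHAMMAM"]
groups = group_similar(names)
```

Treat the groups as candidates for manual review rather than final answers. Short codes such as party abbreviations are especially risky, because two different parties can differ by a single letter.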
  • 28. Fortunately, large scale data itself can provide a solution
  • 29. … with modern tools that support machine learning