Martin Magdinier - @magdmartin 1
Iterative data discovery and transformation with OpenRefine
Martin Magdinier - @magdmartin
OpenRefine - @OpenRefine
http://openrefine.org

80% of data analysis time is spent on cleaning, transforming and integrating data

Data Quality & Integration Is Time Consuming
• Duplicate values & typos
• Multi-value cells
• Data in the wrong field
• Missing / partial values
• Encoding errors
• Change format (text, number, date)
• Flat to relational data set
• Schema alignment
• Transpose rows and columns
• Join data sets
• Enrichment from other sources (MDM, API calls)
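
Two of the issues listed above, duplicate values with typos and multi-value cells, can be illustrated with a small Python sketch. This is only an illustration under stated assumptions: the `fingerprint` function is a simplified key-collision normalizer written for this example (it is not OpenRefine's own code), and the `;` separator, function names and sample values are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key so near-duplicate spellings collide."""
    # Decompose accented characters and drop anything that is not ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase, turn punctuation into spaces, split on whitespace.
    tokens = re.sub(r"[^\w\s]", " ", ascii_text.lower()).split()
    # Sort and de-duplicate tokens so word order does not matter.
    return " ".join(sorted(set(tokens)))


def cluster(values):
    """Group values that share a fingerprint; keep groups with variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


def split_multi_value(rows, column, sep=";"):
    """Emit one row per value found in a multi-value cell."""
    result = []
    for row in rows:
        for part in row[column].split(sep):
            new_row = dict(row)
            new_row[column] = part.strip()
            result.append(new_row)
    return result


if __name__ == "__main__":
    # Hypothetical sample values, not taken from any real data set.
    print(cluster(["New Dwelling", "new dwelling ", "Dwelling, New", "Demolition"]))
    print(split_multi_value([{"id": "1", "use": "Retail; Office"}], "use"))
```

Candidate clusters found this way still need a human decision on which spelling to keep, which is where the interactive, iterative workflow comes in.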

OpenRefine Bridges The Skill Gap
DBA
ETL
Data Science
Spreadsheet User
Data Visualization / Interpretation
Data Preparation
Understand The Data (Business Skills)
Know How To Transform Data (Technical Skills)
User Base

• SaaS and on-premise solution for extra compute power, collaboration and lightweight ETL
• On-demand training
• Custom development

• Free & open source
• Community developed for 5 years
• Available on local machine only
• 5,000+ monthly downloads
• Strong user base in Open Data, Library, Semantic Web and Bio Science

Data Engineer
Scale & Automate Processes
Data Quality
Manage Master Data
Agile Data Process

Data Engineer
Scale & Automate Processes
Data Quality
Manage Master Data
Data Scientist
Develop Machine Learning & Data Analysis Model
Agile Data Process

Data Engineer
IT
Support
Governance
Access To Data
Scale & Automate Processes
Data Quality
Manage Master Data
Data Scientist
Discovery
Data Wrangling
Profiling
Preparation
Quality
Integration
Agile Data Process
Business Analyst
Develop Machine Learning & Data Analysis Model
Sense Making
Data Exploration
Reporting
Analysis
Scale
Real-Time
Lightweight ETL
Migration

Business Analyst
Data Engineer
IT
Support
Governance
Access To Data
Scale & Automate Processes
Data Quality
Manage Master Data
Data Scientist
Discovery
Data Wrangling
Profiling
Preparation
Quality
Integration
Agile Data Process
Develop Machine Learning & Data Analysis Model
ETL Tools

Demo: 2014 Toronto Cleared Building Permits
http://ow.ly/Js8GD
Data Discovery
1. What Permit Types are issued?
2. Explore Previous usage, Application Date and Dwelling Units Created
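
As a rough illustration of the discovery questions above, the following Python sketch counts rows per permit type (similar in spirit to a text facet) and tallies applications per year. The column names `PERMIT_TYPE`, `APPLICATION_DATE` and `DWELLING_UNITS_CREATED`, the date format and the sample rows are all assumptions made for this example; they are not taken from the Toronto file.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows standing in for the real permit file.
SAMPLE = """PERMIT_TYPE,APPLICATION_DATE,DWELLING_UNITS_CREATED
New Houses,2013-04-02,1
Building Additions/Alterations,2013-11-18,0
New Building,2012-06-30,24
New Houses,2014-01-09,2
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def text_facet(rows, column):
    """Count rows per distinct value, most frequent first."""
    return Counter(row[column] for row in rows).most_common()


def applications_per_year(rows, column="APPLICATION_DATE"):
    """Count rows per year, assuming ISO dates (YYYY-MM-DD)."""
    return Counter(row[column][:4] for row in rows)


if __name__ == "__main__":
    rows = load_rows(SAMPLE)
    print(text_facet(rows, "PERMIT_TYPE"))
    print(applications_per_year(rows))
```

Counts like these are usually the first step: they show which values dominate a column and which spellings or outliers need attention before any transformation.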
Data Preparation
1. Geocode with the Google Maps API
2. Map construction projects with more than 10 new Dwelling Units Created
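
The two preparation steps above could be sketched in Python as follows. This is a sketch under assumptions, not the demo's actual procedure: it assumes the Google Geocoding web service endpoint and its JSON response layout (`results[0].geometry.location`), an API key supplied by the caller, and a hypothetical column name `DWELLING_UNITS_CREATED`. No network request is made here; the code only builds the request URL and parses an already-decoded response.

```python
from urllib.parse import urlencode

# Google Geocoding web service endpoint (JSON output).
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def geocode_url(address, api_key):
    """Build the request URL for one address."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})


def extract_lat_lng(response):
    """Return (lat, lng) from a decoded response dict, or None if no match."""
    results = response.get("results") or []
    if response.get("status") != "OK" or not results:
        return None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]


def large_projects(rows, column="DWELLING_UNITS_CREATED", threshold=10):
    """Keep rows whose unit count is strictly above the threshold."""
    selected = []
    for row in rows:
        try:
            units = int(row[column])
        except (KeyError, ValueError):
            # Skip rows with a missing or non-numeric unit count.
            continue
        if units > threshold:
            selected.append(row)
    return selected


if __name__ == "__main__":
    # Hypothetical address and placeholder key, for illustration only.
    print(geocode_url("100 Queen St W, Toronto", "YOUR_API_KEY"))
```

Filtering before geocoding keeps the number of API calls small, which matters because geocoding services are typically rate limited.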
