Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions

Christian Gendreau, David Shorthouse & Peter Desmet

Game plan
•  Introduction to Canadensys
•  Data quality @ Canadensys
•  Canadensys processing solutions
•  Numbers from Canadensys
•  Hopes and expectations

A Network
Of people and collections

Canadensys Headquarters
Université de Montréal Biodiversity Centre

data.canadensys.net/vascan
data.canadensys.net/ipt
data.canadensys.net/explorer

Data quality related activities
From an aggregator perspective

During data entry
•  Help to avoid typographical errors
•  Help to convert verbatim data

Actor: data entry person

Before publication
•  Detect file character encoding issues
•  Detect duplicate or missing IDs

Actor: data publisher
Previous activity: Data entry

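The two pre-publication checks above can be sketched in a few lines of Python. This is a minimal illustration, not the Canadensys implementation: the function names `check_ids` and `is_valid_utf8` are hypothetical, and the sketch assumes that the target encoding is UTF-8 and that an ID is "missing" when it is absent or blank.

```python
from collections import Counter


def check_ids(ids):
    """Return (positions of missing IDs, sorted list of duplicated IDs).

    Assumption: an ID is missing when it is None or blank after stripping.
    """
    missing = [
        index for index, value in enumerate(ids)
        if value is None or not str(value).strip()
    ]
    counts = Counter(
        str(value).strip() for value in ids
        if value is not None and str(value).strip()
    )
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    return missing, duplicates


def is_valid_utf8(raw: bytes) -> bool:
    """Report whether the raw bytes of a file decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

For example, `check_ids(["A1", "A2", "", "A1", None])` reports positions 2 and 4 as missing and `A1` as duplicated, and `is_valid_utf8` rejects the Latin-1 bytes of "Québec".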

During aggregation
•  Process data: validation, cleaning
•  Produce structured reports: quality control

Actor: data aggregator
Previous activity: Before publication

After aggregation
•  Allow and facilitate community feedback
•  Help data publishers integrate corrections

Actor: users and community
Previous activity: Aggregation

Canadensys tools during data entry
data.canadensys.net/tools

Why do we process data?
•  Enrich our Explorer, http://data.canadensys.net
•  Provide structured reports to data providers
•  Help identify records that need re-examination
•  Help to improve data entry procedures

Data processing

Processing solutions
Narwhals to the rescue

Narwhal image Public Domain

The narwhal-processor approach
●  Single-field processing to allow complex processing (combined fields)
●  Processors with a common interface ease integration and usage
●  Collaboration

https://github.com/Canadensys/narwhal-processor

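The idea of processors sharing a common interface can be illustrated with a short Python sketch. It is not the narwhal-processor API (that library is a separate Java project whose classes are not shown in these slides): the `Processor` type, the two example processors, and `process_record` are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Optional

# Hypothetical common interface: every processor takes one raw string and
# returns a cleaned value, or None when the value cannot be interpreted.
Processor = Callable[[str], Optional[str]]


def clean_whitespace(raw: str) -> Optional[str]:
    """Collapse runs of whitespace; return None for an empty result."""
    value = " ".join(raw.split())
    return value or None


def upper_code(raw: str) -> Optional[str]:
    """Accept a two-letter code in any case and return it in upper case."""
    value = raw.strip().upper()
    return value if len(value) == 2 and value.isalpha() else None


def process_record(
    record: Dict[str, str], processors: Dict[str, Processor]
) -> Dict[str, Optional[str]]:
    """Apply the processor registered for each field; pass other fields through."""
    return {
        field: processors[field](value) if field in processors else value
        for field, value in record.items()
    }
```

Because every processor has the same shape, an aggregator can register one per field and run them all the same way, and a multi-field check can be built on top of the single-field results.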

Data usability before processing
[Bar chart: % of non-null clean verbatim data for country text, state/province text, coordinates, and dates. Bar labels: 96%, 92%, 60%, 44%; the extracted text does not preserve which label belongs to which field.]

Data usability after processing
•  7% of provided country text

USA → ISO 3166-2:US, United States

Data usability after processing
•  7% of provided country text
•  16% of provided state/province text

Qué → ISO 3166-2 CA-QC, Quebec

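A lookup of this kind can be sketched as follows. This is a minimal illustration under stated assumptions: the tables `COUNTRIES` and `PROVINCES` hold only a few illustrative entries (a real table would come from a maintained external resource), the codes shown are examples, and the helper names are hypothetical.

```python
import unicodedata


def _key(text: str) -> str:
    """Lowercase and drop accents, spaces, and punctuation so variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ch.isalnum()).lower()


# Illustrative entries only: normalized key -> (code, preferred name).
COUNTRIES = {
    "usa": ("US", "United States"),
    "unitedstates": ("US", "United States"),
    "canada": ("CA", "Canada"),
}
PROVINCES = {
    "que": ("CA-QC", "Quebec"),
    "qc": ("CA-QC", "Quebec"),
    "quebec": ("CA-QC", "Quebec"),
}


def normalize(text, table):
    """Return (code, preferred name) for a verbatim value, or None if unknown."""
    return table.get(_key(text))
```

With these tables, `normalize("Qué", PROVINCES)` and `normalize("Québec", PROVINCES)` resolve to the same entry, while an unknown value returns `None` so it can be reported rather than guessed.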
  
Data usability after processing
•  7% of provided country text
•  16% of provided state/province text
•  4% of provided coordinates

45° 32' 25" N, 129° 40' 31" W → 45.5402778, -129.6752778

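The conversion in this example is degrees + minutes/60 + seconds/3600, negated for the southern and western hemispheres. A minimal Python sketch, assuming each coordinate arrives as its own string ending in a hemisphere letter (the function name `dms_to_decimal` and the accepted notation are assumptions, not the Canadensys parser):

```python
import re

# Degrees are required; minutes and seconds are optional.
_DMS = re.compile(
    r"^\s*(\d{1,3})\s*°"
    r"(?:\s*(\d{1,2})\s*['′])?"
    r'(?:\s*(\d{1,2}(?:\.\d+)?)\s*["″])?'
    r"\s*([NSEWnsew])\s*$"
)


def dms_to_decimal(text: str) -> float:
    """Convert one degrees/minutes/seconds coordinate to decimal degrees."""
    match = _DMS.match(text)
    if match is None:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    minutes_value = float(minutes or 0)
    seconds_value = float(seconds or 0)
    if minutes_value >= 60 or seconds_value >= 60:
        raise ValueError(f"minutes or seconds out of range: {text!r}")
    value = int(degrees) + minutes_value / 60 + seconds_value / 3600
    hemisphere = hemisphere.upper()
    limit = 90 if hemisphere in "NS" else 180
    if value > limit:
        raise ValueError(f"coordinate out of range: {text!r}")
    if hemisphere in "SW":
        value = -value
    # Seven decimals matches the precision shown on the slide.
    return round(value, 7)
```

Applied to the slide's values, `dms_to_decimal("45° 32' 25\" N")` gives 45.5402778 and `dms_to_decimal("129° 40' 31\" W")` gives -129.6752778.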
  
Data usability after processing
•  7% of provided country text
•  16% of provided state/province text
•  4% of provided coordinates
•  42% of provided dates

2008 VI 13 → 2008-06-13

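Dates written with a Roman-numeral month, as in this example, can be converted to ISO 8601 with a small lookup. The sketch below assumes a year, month, day order and whitespace, dot, slash, or hyphen separators; the function name `parse_roman_month_date` is hypothetical, and other verbatim date layouts would need their own patterns.

```python
import re
from datetime import date

_ROMAN_MONTHS = {
    "I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
    "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12,
}

# Assumed layout: four-digit year, Roman-numeral month, one- or two-digit day.
_PATTERN = re.compile(
    r"^\s*(\d{4})[\s./-]+([IVX]+)[\s./-]+(\d{1,2})\s*$", re.IGNORECASE
)


def parse_roman_month_date(text: str) -> str:
    """Return the ISO 8601 form of a 'year roman-month day' date string."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized date: {text!r}")
    year, roman, day = match.groups()
    month = _ROMAN_MONTHS.get(roman.upper())
    if month is None:
        raise ValueError(f"unknown month numeral: {roman!r}")
    # date() itself raises ValueError for impossible days such as June 31.
    return date(int(year), month, int(day)).isoformat()
```

Letting `datetime.date` validate the result means an impossible date is rejected instead of being silently published as processed data.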
  
Data usability including processed data
[Stacked bar chart: % of non-null provided data for country text, state/province text, coordinates, and dates. Clean verbatim segments are labeled 96%, 92%, 60%, and 44%; processed segments are labeled 4%, 7%, 16%, and 42%.]

Projects With Data Quality Tools
•  Atlas of Living Australia
•  GBIF Norway, GBIF Spain, National Biodiversity Network, BioVeL …
•  GBIF libraries
•  Most nodes have their own data quality routine

Hopes and expectations

We do not want to
•  Maintain taxonomic authority files
•  Maintain country, province and city lists

We prefer to
•  Efficiently use specialized resources/services
•  Provide reports and quality indices

Help from Semantic Web
•  Data in other languages (French, Spanish, …) should not be flagged as errors
•  Misspellings should be shared as a common resource (e.g. SKOS)
•  Understand historical data (e.g. collected in USSR in 1980)

Reporting and log
•  DarwinCore annotations for processed data
•  Shared vocabulary for structured reports and quality indices

Summary
•  Tools available for sharing
•  Use, review, contribute
•  Opportunity for broad coordination and increased efficiencies

Thanks

Anne Bruneau, Institut de recherche en biologie végétale and Département de Sciences Biologiques, Université de Montréal

Contact
http://www.canadensys.net
http://github.com/Canadensys
@Canadensys

Gulo gulo, Larry Master (www.masterimages.org)

Multi-field processing

DwC Field            Raw data         Processed data
verbatimLatitude     45°30′N          45.5
verbatimLongitude    73°34′W          -73.5666667
country              Canada           Canada
stateProvince        QC               Quebec
municipality         Montreal City    Montreal

Multi-field processing
1.  Get information on coordinates 45.5, -73.5666667
2.  Compare with processed data
3.  Assert that these coordinates are in Montréal

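The three steps above can be sketched with a simple stand-in for the "get information on coordinates" step. The slides do not say which service or dataset supplies that information, so this sketch swaps in a bounding-box lookup; the `MUNICIPALITY_BOXES` table, its rough extent for Montreal, and the function name are all assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class BoundingBox(NamedTuple):
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float


# Rough, illustrative extent only; not an authoritative municipal boundary.
MUNICIPALITY_BOXES = {
    "Montreal": BoundingBox(45.40, 45.71, -73.98, -73.47),
}


def coordinates_match_municipality(
    lat: float, lon: float, municipality: str
) -> Optional[bool]:
    """Compare processed coordinates with the processed municipality.

    Returns True or False when the municipality is known, and None when
    there is no reference extent to compare against.
    """
    box = MUNICIPALITY_BOXES.get(municipality)
    if box is None:
        return None
    return (
        box.min_lat <= lat <= box.max_lat
        and box.min_lon <= lon <= box.max_lon
    )
```

With the processed values from the table, `coordinates_match_municipality(45.5, -73.5666667, "Montreal")` is `True`, so the assertion in step 3 holds. A `False` result would flag the record for re-examination, and `None` signals that no comparison was possible.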
  
