SlideShare a Scribd company logo
1 of 37
OMG! MY METADATA IS AS
  FRESH AS THE BACKSTREET
 BOYS: HOW GOOGLE REFINE
 CAN UPDATE, CLEAN UP AND
LINK YOUR METADATA TO THE
             WIDER WORLD
                 SARAH BETH WEEKS

   LIBRARY TECHNOLOGY CONFERENCE 2013

                   WEEKSS@STOLAF.EDU
                       @RASCALWHALE
SAMPLE PROJECT: NORDIC AMERICAN
                IMPRINTS

Situation: Wanted to match publishers of our books against a
list of important Nordic American Publishers (compiled by Penny
Huf fman) to find materials for our special collections.
Problem: Hard to compare when publication info is not
controlled:
ANSWER: GOOGLE REFINE!

Google Refine can “match and
 merge” messy data filled with:
 Random, leading or trailing spaces
 stray punctuation
 typos
 odd capitalization
  and more!
CREATE YOUR PROJECT USING ANY
        SPREADSHEET
USE “COMMON TRANSFORMS” TO FIX
“WHITESPACE” PROBLEMS IN A SINGLE CLICK
3. CLEAN UP STRAY CHARACTERS ([].?:) USING
   “TRANSFORM” AND REGULAR EXPRESSIONS
(OR JUST USE EXCEL FIND AND REPLACE FOR THIS)
4. REPEAT COMMON TRANSFORMS
5. CLUSTER AND EDIT
(THIS IS WHERE THE MAGIC HAPPENS)
FUNCTION 1: FINGERPRINT
    (MOST RELIABLE)
NGRAM METHOD
 (STILL RELIABLE: MORE MATCHES BUT LESS
RELIABILIT Y AS YOU DECREASE NGRAM SIZE)
PHONETIC MATCHING
(ESPECIALLY USEFUL WHEN DEALING WITH
          TRANSLATED TEXT)
(MORE FALSE MATCHES TO WATCH FOR
    WITH PHONETIC FUNCTIONS)
NEAREST NEIGHBOR (PPM) MATCHING
(SLOWER AND MORE FALSE MATCHES BUT
 CATCHES WHAT OTHER METHODS MISS)
(SET RADIUS HIGHER, BLOCK CHARACTERS
  LOWER TO GENERATE MORE MATCHES)
AFTER USING OTHER METHODS, RUN
THROUGH FINGERPRINT AND NGRAM AGAIN
BE AWARE THAT THINGS THAT WEREN’T
 CLUSTERED WON’T HAVE BEEN FIXED
6. USE THE TEXT FACET TO SEE ALL
         UNIQUE VALUES
YOU CAN SCROLL THROUGH THE LIST TO
     SPOT CHECK FOR PROBLEMS
CLICK EDIT TO T YPE NEW TEXT FOR ALL
       CELLS WITH THIS VALUE
OTHER CLEAN-UP WE DID:
     PUBLISHERS
OTHER CLEAN-UP WE DID:
      GIFT NOTES
ALSO WORKS FOR NUMBERS/DATES
END RESULT?

 Using Google Refine we were able to reduce the
  3230 unique values for city (260|a) to just 1153. For
  publishers (260|b) we went from 11342 unique
  names for publishers to approximately 6500.
 This project helped to identify over 2,000 potential
  candidates for our Nordic American Imprints
  collection. (These are still being evaluated).
 The controlled publishers, cities of publications and
  dates will be added to a local 9xx field for faceting in
  our future special collections discover tool. Users will
  be able to browse our Nordic American Imprints
  collection by publisher, city or state.
BUT WAIT! THERE’S MORE!!
     LINKED DATA!!!
FREEBASE IS THE DEFAULT SERVICE
(WIKIPEDIA -ESQUE DATA OWNED BY GOOGLE)
CHOOSE THE RIGHT “T YPE” AND MOST
   CELLS WILL BE AUTO-MATCHED
FOR THE REST CLICK THE OPTIONS TO
     SEE WHAT EACH REPRESENTS
 Then click “Match All Identical Cells” (or double checkmarks)
  to link all cells with this text to this Freebase topic
OR “SEARCH FOR MATCH” TO BRING UP
 AN AUTO-FILL LIST TO CHOOSE FROM
EVEN COOLER: NOW YOU CAN BRING
    DATA IN FROM FREEBASE!
CHOOSE WHAT INFO YOU WANT TO ADD
THIS NEW DATA IS NOW ADDED TO YOUR
           SPREADSHEET
TO SEE WHAT COLUMNS (DATA) YOU CAN
        ADD FROM FREEBASE:
Browse the properties at: http://schemas.freebaseapps.com /
MATCH LOCAL SUBJECT HEADING TO LC
    (FREEYOURMETADATA.ORG)
SPARQL ENDPOINTS

 Install the RDF Extension for Google Refine
  http://refine.deri.ie/




 SPARQL Endpoints
 http://labs.mondeca.com/sparqlEndpointsStatus/index.html
 CKAN Data Hub: http://datahub.io/dataset/
ADD SPARQL-BASED RECONCILIATION
            SERVICE
THANK YOU!

Questions?

Link to a public version of this presentation
 at my (personal) blog:
     gardenandalibrary.blogspot.com
I’m also happy to take questions by e-
 mail
              weekss@stolaf.edu

More Related Content

What's hot

Consuming Linked Data 4/5 Semtech2011
Consuming Linked Data 4/5 Semtech2011Consuming Linked Data 4/5 Semtech2011
Consuming Linked Data 4/5 Semtech2011
Juan Sequeda
 
Done reread detecting phrase-level duplication on the world wide we
Done reread detecting phrase-level duplication on the world wide weDone reread detecting phrase-level duplication on the world wide we
Done reread detecting phrase-level duplication on the world wide we
James Arnold
 
Web data from R
Web data from RWeb data from R
Web data from R
schamber
 

What's hot (20)

Do it on your own - From 3 to 5 Star Linked Open Data with RMLio
Do it on your own - From 3 to 5 Star Linked Open Data with RMLioDo it on your own - From 3 to 5 Star Linked Open Data with RMLio
Do it on your own - From 3 to 5 Star Linked Open Data with RMLio
 
The Lonesome LOD Cloud
The Lonesome LOD CloudThe Lonesome LOD Cloud
The Lonesome LOD Cloud
 
The Digital Cavemen of Linked Lascaux
The Digital Cavemen of Linked LascauxThe Digital Cavemen of Linked Lascaux
The Digital Cavemen of Linked Lascaux
 
Live DBpedia querying with high availability
Live DBpedia querying with high availabilityLive DBpedia querying with high availability
Live DBpedia querying with high availability
 
Semantic web application architecture
Semantic web   application architectureSemantic web   application architecture
Semantic web application architecture
 
Using entity extraction extension with OpenRefine and Dandelion API
Using entity extraction extension with OpenRefine and Dandelion APIUsing entity extraction extension with OpenRefine and Dandelion API
Using entity extraction extension with OpenRefine and Dandelion API
 
Querying data on the Web – client or server?
Querying data on the Web – client or server?Querying data on the Web – client or server?
Querying data on the Web – client or server?
 
Initial Usage Analysis of DBpedia's Triple Pattern Fragments
Initial Usage Analysis of DBpedia's Triple Pattern FragmentsInitial Usage Analysis of DBpedia's Triple Pattern Fragments
Initial Usage Analysis of DBpedia's Triple Pattern Fragments
 
Consuming Linked Data 4/5 Semtech2011
Consuming Linked Data 4/5 Semtech2011Consuming Linked Data 4/5 Semtech2011
Consuming Linked Data 4/5 Semtech2011
 
Querying datasets on the Web with high availability
Querying datasets on the Web with high availabilityQuerying datasets on the Web with high availability
Querying datasets on the Web with high availability
 
Creating 3rd Generation Web APIs with Hydra
Creating 3rd Generation Web APIs with HydraCreating 3rd Generation Web APIs with Hydra
Creating 3rd Generation Web APIs with Hydra
 
Done reread detecting phrase-level duplication on the world wide we
Done reread detecting phrase-level duplication on the world wide weDone reread detecting phrase-level duplication on the world wide we
Done reread detecting phrase-level duplication on the world wide we
 
The Future is Federated
The Future is FederatedThe Future is Federated
The Future is Federated
 
Web data from R
Web data from RWeb data from R
Web data from R
 
ISWC 2014 - Dandelion: from raw data to dataGEMs for developers
ISWC 2014 - Dandelion: from raw data to dataGEMs for developersISWC 2014 - Dandelion: from raw data to dataGEMs for developers
ISWC 2014 - Dandelion: from raw data to dataGEMs for developers
 
Asp.Net The Data List Control
Asp.Net   The Data List ControlAsp.Net   The Data List Control
Asp.Net The Data List Control
 
Talis Platform: A Linked Data Engine
Talis Platform: A Linked Data EngineTalis Platform: A Linked Data Engine
Talis Platform: A Linked Data Engine
 
Text Analytics Online Knowledge Base / Database
Text Analytics Online Knowledge Base / DatabaseText Analytics Online Knowledge Base / Database
Text Analytics Online Knowledge Base / Database
 
Reasoned SPARQL
Reasoned SPARQLReasoned SPARQL
Reasoned SPARQL
 
CEK KEMIRIPAN PADA CROSSREF
CEK KEMIRIPAN PADA CROSSREFCEK KEMIRIPAN PADA CROSSREF
CEK KEMIRIPAN PADA CROSSREF
 

Similar to OMG! My metadata is as fresh as the Backstreet Boys: How Google Refine can update, clean up and link your metadata to the wider world

Search Engines After The Semanatic Web
Search Engines After The Semanatic WebSearch Engines After The Semanatic Web
Search Engines After The Semanatic Web
samar_slideshare
 
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AGOLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
Lucidworks
 

Similar to OMG! My metadata is as fresh as the Backstreet Boys: How Google Refine can update, clean up and link your metadata to the wider world (20)

Joy Nelson - Workshop on BIBFRAME, RDF and SPAQL
Joy Nelson - Workshop on BIBFRAME, RDF and SPAQLJoy Nelson - Workshop on BIBFRAME, RDF and SPAQL
Joy Nelson - Workshop on BIBFRAME, RDF and SPAQL
 
Lecture 2 part 3
Lecture 2 part 3Lecture 2 part 3
Lecture 2 part 3
 
The Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataThe Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open Data
 
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
 
AnzoGraph DB - SPARQL 101
AnzoGraph DB - SPARQL 101AnzoGraph DB - SPARQL 101
AnzoGraph DB - SPARQL 101
 
A brief history of "big data"
A brief history of "big data"A brief history of "big data"
A brief history of "big data"
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Graph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DBGraph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DB
 
Splunk bsides
Splunk bsidesSplunk bsides
Splunk bsides
 
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
 
Search Engines After The Semanatic Web
Search Engines After The Semanatic WebSearch Engines After The Semanatic Web
Search Engines After The Semanatic Web
 
The Business Case for Semantic Web Ontology & Knowledge Graph
The Business Case for Semantic Web Ontology & Knowledge GraphThe Business Case for Semantic Web Ontology & Knowledge Graph
The Business Case for Semantic Web Ontology & Knowledge Graph
 
Why MongoDB over other Databases - Habilelabs
Why MongoDB over other Databases - HabilelabsWhy MongoDB over other Databases - Habilelabs
Why MongoDB over other Databases - Habilelabs
 
Big Data, a space adventure - Mario Cartia - Codemotion Milan 2014
Big Data, a space adventure - Mario Cartia -  Codemotion Milan 2014Big Data, a space adventure - Mario Cartia -  Codemotion Milan 2014
Big Data, a space adventure - Mario Cartia - Codemotion Milan 2014
 
Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and Answers
 
Visualizations using Visualbox
Visualizations using VisualboxVisualizations using Visualbox
Visualizations using Visualbox
 
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AGOLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To Hadoop
 
3 map reduce perspectives
3 map reduce perspectives3 map reduce perspectives
3 map reduce perspectives
 

Recently uploaded

Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
UXDXConf
 

Recently uploaded (20)

Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdf
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty Secure
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at Comcast
 
Top 10 Symfony Development Companies 2024
Top 10 Symfony Development Companies 2024Top 10 Symfony Development Companies 2024
Top 10 Symfony Development Companies 2024
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. Startups
 
THE BEST IPTV in GERMANY for 2024: IPTVreel
THE BEST IPTV in  GERMANY for 2024: IPTVreelTHE BEST IPTV in  GERMANY for 2024: IPTVreel
THE BEST IPTV in GERMANY for 2024: IPTVreel
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 
The UX of Automation by AJ King, Senior UX Researcher, Ocado
The UX of Automation by AJ King, Senior UX Researcher, OcadoThe UX of Automation by AJ King, Senior UX Researcher, Ocado
The UX of Automation by AJ King, Senior UX Researcher, Ocado
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 

OMG! My metadata is as fresh as the Backstreet Boys: How Google Refine can update, clean up and link your metadata to the wider world

  • 1. OMG! MY METADATA IS AS FRESH AS THE BACKSTREET BOYS: HOW GOOGLE REFINE CAN UPDATE, CLEAN UP AND LINK YOUR METADATA TO THE WIDER WORLD SARAH BETH WEEKS LIBRARY TECHNOLOGY CONFERENCE 2013 WEEKSS@STOLAF.EDU @RASCALWHALE
  • 2. SAMPLE PROJECT: NORDIC AMERICAN IMPRINTS Situation: Wanted to match publishers of our books against a list of important Nordic American Publishers (compiled by Penny Huf fman) to find materials for our special collections. Problem: Hard to compare when publication info is not controlled:
  • 3. ANSWER: GOOGLE REFINE! Google Refine can “match and merge” messy data filled with: Random, leading or trailing spaces stray punctuation typos odd capitalization  and more!
  • 4. CREATE YOUR PROJECT USING ANY SPREADSHEET
  • 5. USE “COMMON TRANSFORMS” TO FIX “WHITESPACE” PROBLEMS IN A SINGLE CLICK
  • 6. 3. CLEAN UP STRAY CHARACTERS ([].?:) USING “TRANSFORM” AND REGULAR EXPRESSIONS (OR JUST USE EXCEL FIND AND REPLACE FOR THIS)
  • 7. 4. REPEAT COMMON TRANSFORMS
  • 9. (THIS IS WHERE THE MAGIC HAPPENS)
  • 10. FUNCTION 1: FINGERPRINT (MOST RELIABLE)
  • 11. NGRAM METHOD (STILL RELIABLE: MORE MATCHES BUT LESS RELIABILIT Y AS YOU DECREASE NGRAM SIZE)
  • 12. PHONETIC MATCHING (ESPECIALLY USEFUL WHEN DEALING WITH TRANSLATED TEXT)
  • 13. (MORE FALSE MATCHES TO WATCH FOR WITH PHONETIC FUNCTIONS)
  • 14. NEAREST NEIGHBOR (PPM) MATCHING (SLOWER AND MORE FALSE MATCHES BUT CATCHES WHAT OTHER METHODS MISS)
  • 15. (SET RADIUS HIGHER, BLOCK CHARACTERS LOWER TO GENERATE MORE MATCHES)
  • 16. AFTER USING OTHER METHODS, RUN THROUGH FINGERPRINT AND NGRAM AGAIN
  • 17. BE AWARE THAT THINGS THAT WEREN’T CLUSTERED WON’T HAVE BEEN FIXED
  • 18. 6. USE THE TEXT FACET TO SEE ALL UNIQUE VALUES
  • 19. YOU CAN SCROLL THROUGH THE LIST TO SPOT CHECK FOR PROBLEMS
  • 20. CLICK EDIT TO T YPE NEW TEXT FOR ALL CELLS WITH THIS VALUE
  • 21. OTHER CLEAN-UP WE DID: PUBLISHERS
  • 22. OTHER CLEAN-UP WE DID: GIFT NOTES
  • 23. ALSO WORKS FOR NUMBERS/DATES
  • 24. END RESULT?  Using Google Refine we were able to reduce the 3230 unique values for city (260|a) to just 1153. For publishers (260|b) we went from 11342 unique names for publishers to approximately 6500.  This project helped to identify over 2,000 potential candidates for our Nordic American Imprints collection. (These are still being evaluated).  The controlled publishers, cities of publications and dates will be added to a local 9xx field for faceting in our future special collections discover tool. Users will be able to browse our Nordic American Imprints collection by publisher, city or state.
  • 25. BUT WAIT! THERE’S MORE!! LINKED DATA!!!
  • 26. FREEBASE IS THE DEFAULT SERVICE (WIKIPEDIA -ESQUE DATA OWNED BY GOOGLE)
  • 27. CHOOSE THE RIGHT “T YPE” AND MOST CELLS WILL BE AUTO-MATCHED
  • 28. FOR THE REST CLICK THE OPTIONS TO SEE WHAT EACH REPRESENTS  Then click “Match All Identical Cells” (or double checkmarks) to link all cells with this text to this Freebase topic
  • 29. OR “SEARCH FOR MATCH” TO BRING UP AN AUTO-FILL LIST TO CHOOSE FROM
  • 30. EVEN COOLER: NOW YOU CAN BRING DATA IN FROM FREEBASE!
  • 31. CHOOSE WHAT INFO YOU WANT TO ADD
  • 32. THIS NEW DATA IS NOW ADDED TO YOUR SPREADSHEET
  • 33. TO SEE WHAT COLUMNS (DATA) YOU CAN ADD FROM FREEBASE: Browse the properties at: http://schemas.freebaseapps.com /
  • 34. MATCH LOCAL SUBJECT HEADING TO LC (FREEYOURMETADATA.ORG)
  • 35. SPARQL ENDPOINTS  Install the RDF Extension for Google Refine http://refine.deri.ie/  SPARQL Endpoints  http://labs.mondeca.com/sparqlEndpointsStatus/index.html  CKAN Data Hub: http://datahub.io/dataset/
  • 37. THANK YOU! Questions? Link to a public version of this presentation at my (personal) blog: gardenandalibrary.blogspot.com I’m also happy to take questions by e- mail weekss@stolaf.edu