Bob Kasenchak
Project Coordinator
Access Innovations
bob_kasenchak@accessinn.com
@taxobob
DISCLAIMER
I Don’t Have Time for Metadata!
OUTLINE
• Data
• Structured Data
• Unstructured Data
• Metadata
• Subject Metadata
• Entity (author, institution) Metadata
• Document Type Metadata
• Automating Metadata
• Heuristic/Statistical/Inferential
• Rule-based
CASE STUDIES
STRUCTURED VS. UNSTRUCTURED DATA
Present different problems – and possible solutions – for
automatically adding metadata
STRUCTURED VS. UNSTRUCTURED DATA
Association,in view of abuses and lack of consistency
in published reports, has asserted that the all-inclusive
income statement,containing allincome items
recognized as determinantsof net income, is the answer
to these questions.2 The Securities and Exchange
Commission has also
strongly favored this solution.3 On the 1 Committeeon
Accounting Procedure, American
Instituteof Accountants, "Income and Earned Surplus,"
Accounting Research BulletinNo. 32 (December,
1947). 2 (1) "A TentativeStatementof Accounting
Principles Affecting Corporate Reports," THE
ACCOUNTING REvIEw, June, 1936, pp. 187-191; (2)
Accounting
STRUCTURED VS. UNSTRUCTURED DATA
<volume>325</volume>
<issue>5945</issue>
<fpage seq="c">1206</fpage>
<lpage>1206</lpage>
<history><date date-type="received"><day>26</day><month>02</month><year>2009</year></date>
<date date-type="accepted"><day>11</day><month>08</month><year>2009</year></date></history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2009</copyright-statement>
<copyright-year>2009</copyright-year>
<copyright-holder>Your name here</copyright-holder>
</permissions>
<abstract>
<p>Our extended ontogenetic growth model is a theoretical model based on conservation
of energy and general biological mechanisms underlying ontogenetic growth. We do not
believe that the comments of Makarieva <italic>et al</italic>. and Sousa <italic>et al</italic>.
expose substantive problems with our model. Nevertheless, they raise
interesting, still unresolved questions and point to philosophical differences about the role
of theory and of simple, general models as opposed to complicated, specific models.</p>
</abstract>
STRUCTURED VS. UNSTRUCTURED DATA
• Just extracting basic information
• Author
• Institution
• Title
• Document type
• Accession number(s)
…can be a challenge.
However…
STRUCTURED VS. UNSTRUCTURED DATA
• Predictability
• Positionality
[Figure: article page with labeled regions: journal name/issue/volume etc., article title, copyright info, author info, abstract]
UNSTRUCTURED DATA => STRUCTURED DATA!
<journal>Transactions on Vehicular Technology</journal>
<article-title>Relationship of Average Transmitted and Received Energies in Adaptive Transmission</article-title>
<authors><author-surname>Kotelba</author-surname><author-firstname>Adrian</author-firstname><affiliation>Member, IEEE</affiliation></authors>
<copyright-info><copyright-date>2009</copyright-date></copyright-info>
<abstract><p>This paper studies the…</p></abstract>
NOTE: Some cleanup may be required
STRUCTURED VS. UNSTRUCTURED DATA
• Basic information already tagged, labeled, and easy to
extract
• Author info
• Title
• Journal/Volume/Issue etc.
• We can add semantic (or subject) metadata
• Targeting only those parts of the text we require
• Title
• Abstract
• Full text body
• Exclude references, etc.
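As a rough illustration of targeting only the parts of a tagged record that should be indexed, the sketch below pulls the title, abstract, and body text out of a small XML record and skips the reference list. The record, its element names (`article`, `body`, `ref-list`), and the function `text_to_index` are assumptions made for this example, not a particular publisher's schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the element names are assumptions for illustration only.
SAMPLE = """<article>
  <article-title>Hypothetical title about adaptive transmission</article-title>
  <abstract><p>Hypothetical abstract text.</p></abstract>
  <body><p>Hypothetical full text.</p></body>
  <ref-list><ref>Hypothetical cited work, not to be indexed.</ref></ref-list>
</article>"""

# Only these elements are passed on for subject indexing; references are left out.
INDEXABLE_TAGS = ("article-title", "abstract", "body")


def text_to_index(xml_string):
    """Return the text of the targeted elements, one element per line."""
    root = ET.fromstring(xml_string)
    parts = []
    for tag in INDEXABLE_TAGS:
        for element in root.iter(tag):
            # Join nested text (e.g. inside <p>) and collapse whitespace.
            parts.append(" ".join("".join(element.itertext()).split()))
    return "\n".join(parts)


print(text_to_index(SAMPLE))
```

Because the fields are already tagged, excluding the references is a matter of not asking for them; no guessing about page layout is needed.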
SEMANTIC METADATA
• Uncontrolled
• Automatic keyword extraction
• Crowdsourced/folksonomic tags
• Controlled – from a Thesaurus (or Taxonomy…)
• Inferential (Heuristic; Statistical)
• Rule-based
SEMANTIC METADATA: HOW?
• Controlled – from a Thesaurus (or Taxonomy…)
• Inferential (Heuristic; Statistical)
• Rule-based
• Manual tagging
• Automatic tagging
SEMANTIC METADATA: MANUAL ENTRY
SEMANTIC METADATA: MANUAL ENTRY
A Thought Experiment
• Let’s say a manual indexer can index 10 records/hour
• Let’s say the manual indexers are perfectly consistent (they’re not)
• Let’s say your manual indexers are paid $10/hour (good luck with that)
If you have 10,000 articles/pieces of content:
It would take a manual indexer 1,000 hours (25 weeks at 40 hours/week) and cost $10,000
If you have 100,000 articles:
It would take a manual indexer 10,000 hours (250 weeks, or almost 5 years)
and cost $100,000
If you have 1,000,000 articles:
It would take a manual indexer 100,000 hours (~48 years) and cost $1,000,000
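The arithmetic of the thought experiment can be reproduced in a few lines. The 10 records/hour and $10/hour figures come from the slide; the 40-hour week and 52-week year are assumptions that happen to match the slide's weeks and years, and the function name `manual_indexing_cost` is made up for this sketch.

```python
def manual_indexing_cost(records, records_per_hour=10, hourly_rate=10.0,
                         hours_per_week=40, weeks_per_year=52):
    """Estimate the effort and cost for one manual indexer.

    hours_per_week and weeks_per_year are assumptions, not slide figures.
    """
    hours = records / records_per_hour
    weeks = hours / hours_per_week
    return {
        "hours": hours,
        "weeks": weeks,
        "years": weeks / weeks_per_year,
        "cost": hours * hourly_rate,
    }


for size in (10_000, 100_000, 1_000_000):
    estimate = manual_indexing_cost(size)
    print(f"{size:>9,} records: {estimate['hours']:,.0f} hours, "
          f"{estimate['years']:.1f} years, ${estimate['cost']:,.0f}")
```

Changing the defaults shows how quickly the totals grow: effort and cost scale linearly with collection size, so a tenfold larger collection takes ten times as long for the same indexer.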
SEMANTIC METADATA: AUTOMATED
SEMANTIC METADATA: WHY?
• Disambiguate the ambiguous
• Specify most specific topics
• Improve information retrieval
• Search
• Browse
• Enable advanced analytics
SEMANTIC METADATA: DISAMBIGUATION
“Mercury”
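One way a rule-based indexer might tell the senses of "Mercury" apart is to look for supporting context words near the ambiguous term. The sketch below is a toy version of that idea; the candidate terms, the cue words, and the function `disambiguate` are all hypothetical and are not taken from any real thesaurus or product.

```python
import re

# Hypothetical rules: each candidate term lists context words that support it.
RULES = {
    "Mercury (planet)": {"orbit", "planet", "solar", "venus"},
    "Mercury (element)": {"toxicity", "thermometer", "metal", "poisoning"},
    "Mercury (mythology)": {"roman", "god", "myth", "messenger"},
}


def disambiguate(text, ambiguous="mercury"):
    """Pick the candidate term with the most supporting context words.

    Returns None when the ambiguous word is absent or no rule fires.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    if ambiguous not in words:
        return None
    scores = {term: len(cues & words) for term, cues in RULES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("Mercury has the shortest orbit of any planet."))
print(disambiguate("Mercury poisoning and toxicity in fish."))
```

A plain keyword search for "mercury" would return all three kinds of document; assigning the disambiguated term as metadata is what lets retrieval separate them.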
SEMANTIC METADATA: SPECIFICATION
Beyond exact string matches: Synonymy
Fiber optic gyroscopes      Fiber optic gyros
Fiber-optic gyroscopes      Fiber-optic gyros
Fibre optic gyroscopes      Fibre optic gyros
Fibre-optic gyroscopes      Fibre-optic gyros
Fiberoptic gyroscopes       Fiberoptic gyros
Optical fiber gyroscopes    Optical fiber gyros
Optical fibre gyroscopes    Optical fibre gyros
FOGs                        FOG’s
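The variants listed above can all be mapped to a single preferred term so that a search on any one of them retrieves the same documents. The sketch below does this with one regular expression; the choice of "Fiber optic gyroscopes" as the preferred label and the function name `subject_terms` are assumptions for illustration.

```python
import re

# Assumed preferred label; a real thesaurus would define its own.
PREFERRED = "Fiber optic gyroscopes"

# Spelled-out variants match in any case; the acronym stays case-sensitive
# so that ordinary "fogs" (the weather) is not tagged.
VARIANTS = re.compile(
    r"\b(?:"
    r"(?i:(?:fib(?:er|re)[- ]?optic|optical fib(?:er|re)) gyro(?:scope)?s)"
    r"|FOG['’]?s"
    r")\b"
)


def subject_terms(text):
    """Return the preferred term if any known variant occurs in the text."""
    return [PREFERRED] if VARIANTS.search(text) else []


print(subject_terms("We tested two fibre-optic gyros on the platform."))
print(subject_terms("Morning fogs delayed the flight."))
```

An exact string match on any single spelling would miss most of these documents; matching every variant and recording one controlled term is what makes retrieval consistent.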
SEMANTIC METADATA: SPECIFICATION
Beyond exact string matches: Context. Matters.
• Indexing to most specific term
- Microscopes
- Electron microscopes
- Scanning electron microscopes
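Indexing to the most specific term can be sketched as: find every thesaurus term that occurs in the text, then drop any term that is a broader term of another match. The three-level hierarchy below mirrors the microscope example; the `BROADER` mapping, the helper names, and the naive substring matching are assumptions made to keep the sketch short.

```python
# Hypothetical thesaurus fragment: each term points to its broader term.
BROADER = {
    "scanning electron microscopes": "electron microscopes",
    "electron microscopes": "microscopes",
    "microscopes": None,
}


def ancestors(term):
    """Collect every broader term above the given term."""
    result = set()
    parent = BROADER.get(term)
    while parent is not None:
        result.add(parent)
        parent = BROADER.get(parent)
    return result


def most_specific_terms(text):
    """Return matched terms, excluding any that are broader than another match."""
    lowered = text.lower()
    # Naive substring matching; a production indexer would tokenize properly.
    matched = {term for term in BROADER if term in lowered}
    redundant = set()
    for term in matched:
        redundant |= ancestors(term)
    return sorted(matched - redundant)


print(most_specific_terms("Samples were imaged with scanning electron microscopes."))
```

A string match alone would tag this sentence with all three terms; using the hierarchy keeps only the narrowest one, and the broader terms remain reachable through the taxonomy.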
SEMANTIC METADATA: WHY?
Improving information retrieval (Search, Browse)
SEARCH ≠ BROWSE
SEMANTIC METADATA: WHY?
Improving information retrieval: Search
• Allows user to search by tags
• Ensures consistent and reliable retrieval
• Speeds electronic search
SEMANTIC METADATA: WHY?
Improving information retrieval: Search
[Figure callout: subject metadata]
SEMANTIC METADATA: WHY?
Improving information retrieval: Search
[Figure callouts: metadata-based search; results based on metadata]
SEMANTIC METADATA: WHY?
Improving information retrieval: Browse
[Figure callouts: taxonomy browse; results based on metadata]
SEMANTIC METADATA: WHY?
Improving information retrieval: Browse
[Figure callouts: taxonomy browse; additional search filters]
SEMANTIC METADATA: WHY?
Improving information retrieval: Analytics
• Combine subject metadata with metadata about
• Authors
• Institutions
• Publications (Journals, Magazines, etc.)
• Publication Types
…to create detailed informatics about your data, users,
authors, and whatever else is relevant or useful
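As a small sketch of what combining subject metadata with author metadata can look like, the code below counts how often each author appears under each subject term. The records, author names, and the function `authors_by_subject` are invented for illustration; real records would come from the tagged content itself.

```python
from collections import Counter, defaultdict

# Hypothetical tagged records; names and subjects are made up for this sketch.
RECORDS = [
    {"authors": ["A. Author", "B. Writer"], "subjects": ["Electron microscopes"]},
    {"authors": ["A. Author"], "subjects": ["Electron microscopes", "Fiber optic gyroscopes"]},
    {"authors": ["C. Scholar"], "subjects": ["Fiber optic gyroscopes"]},
]


def authors_by_subject(records):
    """Map each subject term to a count of records per author."""
    counts = defaultdict(Counter)
    for record in records:
        for subject in record["subjects"]:
            counts[subject].update(record["authors"])
    return counts


for subject, authors in sorted(authors_by_subject(RECORDS).items()):
    print(subject, authors.most_common())
```

The same grouping works for institutions, publications, or publication types by swapping the field that is counted.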
SEMANTIC METADATA: WHY?
Improving information retrieval: Analytics
[Figure callouts: taxonomy term; narrower terms; broader term(s); authors who publish on this topic]
I DON’T HAVE TIME FOR METADATA!
Since metadata allows you to do things you already have to, want to, or need to do:
It’s always time for metadata.
Bob Kasenchak
Project Coordinator
Access Innovations
bob_kasenchak@accessinn.com
@taxobob
Thank you!

Automating metadata to improve information retrieval
