Presented by Bob Kasenchak of Access Innovations, Inc. at the 2014 Special Libraries Association (SLA) annual meeting in Vancouver, British Columbia on June 7, 2014.
3. OUTLINE
• Data
• Structured Data
• Unstructured Data
• Metadata
• Subject Metadata
• Entity (author, institution) Metadata
• Document Type Metadata
• Automating Metadata
• Heuristic/Statistical/Inferential
• Rule-based
I Don’t Have Time for Metadata!
5. STRUCTURED VS. UNSTRUCTURED DATA
Present different problems – and possible solutions – for
automatically adding metadata
6. STRUCTURED VS. UNSTRUCTURED DATA
Association,in view of abuses and lack of consistency
in published reports, has asserted that the all-inclusive
income statement,containing allincome items
recognized as determinantsof net income, is the answer
to these questions.2 The Securities and Exchange
Commission has also
strongly favored this solution.3 On the 1 Committeeon
Accounting Procedure, American
Instituteof Accountants, "Income and Earned Surplus,"
Accounting Research BulletinNo. 32 (December,
1947). 2 (1) "A TentativeStatementof Accounting
Principles Affecting Corporate Reports," THE
ACCOUNTING REvIEw, June, 1936, pp. 187-191; (2)
Accounting
7. STRUCTURED VS. UNSTRUCTURED DATA
<volume>325</volume>
<issue>5945</issue>
<fpage seq="c">1206</fpage>
<lpage>1206</lpage>
<history><date date-type="received"><day>26</day><month>02</month><year>2009
</year></date><date date-type="accepted"><day>11</day><month>08</month>
<year>2009</year></date></history>
<permissions>
<copyright-statement>Copyright © 2009</copyright-statement>
<copyright-year>2009</copyright-year>
<copyright-holder>Your name here</copyright-holder>
</permissions>
<abstract>
<p>Our extended ontogenetic growth model is a theoretical model based on conservation
of energy and general biological mechanisms underlying ontogenetic growth. We do not
believe that the comments of Makarieva <italic>et al</italic>. and Sousa <italic>et al
</italic>. expose substantive problems with our model. Nevertheless, they raise
interesting, still unresolved questions and point to philosophical differences about the role
of theory and of simple, general models as opposed to complicated, specific models.</p>
</abstract>
8. STRUCTURED VS. UNSTRUCTURED DATA
• Just extracting basic information
• Author
• Institution
• Title
• Document type
• Accession number(s)
…can be a challenge.
However…
9. STRUCTURED VS. UNSTRUCTURED DATA
• Predictability
• Positionality
[Figure: sample article page with callouts for journal name/issue/volume, article title, copyright info, author info, and abstract]
10. UNSTRUCTURED DATA => STRUCTURED DATA!
<journal>Transactions on Vehicular Technology</journal>
<article-title>Relationship of Average Transmitted and Received Energies in Adaptive
Transmission</article-title>
<authors><author-surname>Kotelba</author-surname><author-firstname>Adrian</author-firstname>
<affiliation>Member, IEEE</affiliation></authors>
<copyright-info><copyright-date>2009</copyright-date></copyright-info>
<abstract><p>This paper studies the…</p></abstract>
NOTE: Some cleanup may be required
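As a rough illustration of how predictability and positionality can turn unstructured text into tagged fields, here is a minimal Python sketch. The input layout (journal on line 1, title on line 2, author on line 3, then labeled copyright and abstract lines) is an assumption made for the example, as are the function names and the flat `author` field; real front matter usually needs more rules and some cleanup.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical front-matter layout, assumed for illustration:
# line 1 = journal, line 2 = article title, line 3 = author info.
RAW = """Transactions on Vehicular Technology
Relationship of Average Transmitted and Received Energies in Adaptive Transmission
Adrian Kotelba, Member, IEEE
Copyright 2009
Abstract: This paper studies the…"""


def tag_front_matter(raw):
    """Assign fields by position first, then by predictable line prefixes."""
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    fields = {
        "journal": lines[0],
        "article-title": lines[1],
        "author": lines[2],
    }
    for line in lines[3:]:
        year = re.match(r"Copyright\D*(\d{4})", line)
        if year:
            fields["copyright-date"] = year.group(1)
        elif line.startswith("Abstract:"):
            fields["abstract"] = line[len("Abstract:"):].strip()
    return fields


def to_xml(fields):
    """Wrap each extracted value in a tag named after its field."""
    return "\n".join(
        f"<{tag}>{escape(value)}</{tag}>" for tag, value in fields.items()
    )


fields = tag_front_matter(RAW)
print(to_xml(fields))
```

The positional rules only hold while the layout stays predictable, which is why a cleanup pass is usually still needed.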
11. STRUCTURED VS. UNSTRUCTURED DATA
• Basic information already tagged, labeled, and easy to
extract
• Author info
• Title
• Journal/Volume/Issue etc.
• We can add semantic (or subject) metadata
• Targeting only those parts of the text we require
• Title
• Abstract
• Full text body
• Exclude references, etc.
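Once the data is structured, targeting only the parts you need is a matter of selecting elements by name. The sketch below uses Python's standard `xml.etree.ElementTree`; the `<article>` wrapper, the `<ref-list>` element, and the function name are assumptions added so the fragment parses as a single document.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the <article> wrapper and <ref-list> content are
# assumptions made for this example.
RECORD = """
<article>
  <journal>Transactions on Vehicular Technology</journal>
  <article-title>Relationship of Average Transmitted and Received Energies in Adaptive Transmission</article-title>
  <abstract><p>This paper studies the…</p></abstract>
  <ref-list><ref>Some cited work</ref></ref-list>
</article>
"""

# Fields to pass to the indexer; references are deliberately left out.
TARGET_FIELDS = ("article-title", "abstract")


def text_for_indexing(xml_string, fields=TARGET_FIELDS):
    """Return the text of the targeted elements only, in field order."""
    root = ET.fromstring(xml_string)
    parts = []
    for name in fields:
        for element in root.iter(name):
            # itertext() also picks up text inside child tags such as <p>.
            text = "".join(element.itertext())
            parts.append(" ".join(text.split()))
    return parts


indexable = text_for_indexing(RECORD)
print(indexable)
```

Anything not named in `TARGET_FIELDS`, such as the reference list, never reaches the indexer.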
12. SEMANTIC METADATA
Uncontrolled
Automatic keyword extraction
Crowdsourced/folksonomic tags
Controlled – from a Thesaurus (or Taxonomy…)
Inferential (Heuristic; Statistical)
Rule-based
13. SEMANTIC METADATA: HOW?
Controlled – from a Thesaurus (or Taxonomy…)
Inferential (Heuristic; Statistical)
Rule-based
Manual tagging
Automatic tagging
15. SEMANTIC METADATA: MANUAL ENTRY
A Thought Experiment
• Let’s say a manual indexer can index 10 records/hour
• Let’s say the manual indexers are perfectly consistent (they’re not)
• Let’s say your manual indexers are paid $10/hour (good luck with that)
If you have 10,000 articles/pieces of content:
It would take a manual indexer 1,000 hours (25 weeks) and cost $10,000
If you have 100,000 articles:
It would take a manual indexer 10,000 hours (250 weeks, or almost 5 years)
and cost $100,000
If you have 1,000,000 articles:
It would take a manual indexer 100,000 hours (~48 years) and cost $1,000,000
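The arithmetic behind the thought experiment can be checked with a few lines of Python. The 40-hour week and 52-week year are assumptions (the slide does not state them, but they reproduce its figures), and the function name is illustrative.

```python
HOURS_PER_WEEK = 40  # assumption: a 40-hour working week
WEEKS_PER_YEAR = 52  # assumption: no holidays or leave


def manual_indexing_cost(records, records_per_hour=10, hourly_rate=10):
    """Estimate time and cost for one manual indexer."""
    hours = records / records_per_hour
    weeks = hours / HOURS_PER_WEEK
    return {
        "hours": hours,
        "weeks": weeks,
        "years": weeks / WEEKS_PER_YEAR,
        "cost": hours * hourly_rate,
    }


for n in (10_000, 100_000, 1_000_000):
    e = manual_indexing_cost(n)
    print(
        f"{n:>9,} records: {e['hours']:,.0f} hours, {e['weeks']:,.0f} weeks, "
        f"{e['years']:.1f} years, ${e['cost']:,.0f}"
    )
```

Changing `records_per_hour` or `hourly_rate` to more realistic values only makes the case for automation stronger.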
17. SEMANTIC METADATA: WHY?
Disambiguate the ambiguous
Specify most specific topics
Improve information retrieval
Search
Browse
Enable advanced analytics
20. SEMANTIC METADATA: SPECIFICATION
Beyond exact string matches: Context. Matters.
Indexing to most specific term
- Microscopes
- Electron microscopes
- Scanning electron microscopes
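One simple way to sketch "indexing to the most specific term" is a rule-based matcher that finds every thesaurus term in the text and then drops any term that is a broader term of another match. The three-term hierarchy below mirrors the slide; the data structure and function names are assumptions, and a production indexer would also handle singular/plural forms, synonyms, and context rules.

```python
import re

# Hypothetical thesaurus fragment: each term maps to its broader term
# (None marks the top of the hierarchy).
BROADER = {
    "microscopes": None,
    "electron microscopes": "microscopes",
    "scanning electron microscopes": "electron microscopes",
}


def ancestors(term):
    """Return every broader term above the given term."""
    found = set()
    parent = BROADER[term]
    while parent is not None:
        found.add(parent)
        parent = BROADER[parent]
    return found


def index_most_specific(text):
    """Match thesaurus terms, then keep only the most specific ones."""
    lowered = text.lower()
    matched = {
        term
        for term in BROADER
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    }
    implied = set()
    for term in matched:
        implied |= ancestors(term)
    return sorted(matched - implied)


print(index_most_specific("Images were taken with scanning electron microscopes."))
print(index_most_specific("Optical microscopes were used."))
```

An exact string match alone would tag the first sentence with all three terms; the hierarchy is what lets the indexer keep only the narrowest one.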
22. SEMANTIC METADATA: WHY?
Improving information retrieval: Search
Allows user to search by tags
Ensures consistent and reliable retrieval
Speeds electronic search
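Searching by tags can be pictured as a lookup in an inverted index that maps each tag to the records carrying it. The sketch below is a minimal Python version; the record IDs, tags, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical tagged records: record ID -> subject tags.
RECORDS = {
    "doc1": ["Scanning electron microscopes", "Materials science"],
    "doc2": ["Electron microscopes"],
    "doc3": ["Materials science"],
}


def build_tag_index(records):
    """Map each lowercased tag to the set of record IDs that carry it."""
    index = defaultdict(set)
    for doc_id, tags in records.items():
        for tag in tags:
            index[tag.lower()].add(doc_id)
    return index


def search_by_tags(index, *tags):
    """Return the records that carry every requested tag."""
    sets = [index.get(tag.lower(), set()) for tag in tags]
    return sorted(set.intersection(*sets)) if sets else []


index = build_tag_index(RECORDS)
print(search_by_tags(index, "Materials science"))
print(search_by_tags(index, "Materials science", "Scanning electron microscopes"))
```

Because every record is tagged from the same controlled vocabulary, the same query returns the same set every time, and the lookup avoids scanning full text.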
24. SEMANTIC METADATA: WHY?
Improving information retrieval: Search
[Screenshot: metadata-based search, with results based on metadata]
25. SEMANTIC METADATA: WHY?
Improving information retrieval: Browse
[Screenshot: taxonomy browse, with results based on metadata]
26. SEMANTIC METADATA: WHY?
Improving information retrieval: Browse
[Screenshot: taxonomy browse with additional search filters]
27. SEMANTIC METADATA: WHY?
Improving information retrieval: Analytics
Combine subject metadata with metadata about
Authors
Institutions
Publications (Journals, Magazines, etc.)
Publication Types
…to create detailed informatics about your data, users,
authors, and whatever else is relevant or useful
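As a small illustration of combining subject metadata with author metadata, the Python sketch below counts how often each author publishes on each subject. The records, names, and function name are hypothetical; the same pattern would apply to institutions, publications, or publication types.

```python
from collections import Counter, defaultdict

# Hypothetical tagged records combining author and subject metadata.
RECORDS = [
    {"authors": ["A. Author"], "subjects": ["Electron microscopes"]},
    {
        "authors": ["A. Author", "B. Writer"],
        "subjects": ["Electron microscopes", "Materials science"],
    },
    {"authors": ["B. Writer"], "subjects": ["Materials science"]},
]


def authors_by_subject(records):
    """Count, for each subject term, how many records each author has."""
    table = defaultdict(Counter)
    for record in records:
        for subject in record["subjects"]:
            table[subject].update(record["authors"])
    return table


table = authors_by_subject(RECORDS)
for subject, counts in sorted(table.items()):
    print(subject, counts.most_common())
```

A table like this is enough to answer questions such as "who publishes most on this topic?" without reading any full text.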
28. SEMANTIC METADATA: WHY?
Improving information retrieval: Analytics
[Screenshot: taxonomy term page showing narrower terms, broader term(s), and authors who publish on this topic]
29. I DON’T HAVE TIME FOR METADATA!
Since metadata allows you to do things you already have to, want to, or need to do:
It’s always time for metadata.