SlideShare a Scribd company logo
1 of 34
Download to read offline
Taxonomies
           for Text Analytics
          and Auto-Indexing
Heather Hedden
Hedden Information Management
Text Analytics World, Boston, MA
October 4, 2012
Introduction: Text Analytics and Taxonomies
    Text analytics can be used to index content without the
    use of taxonomies/controlled vocabularies.

    Text analytics can be used to index content with
    taxonomies/controlled vocabularies for better results.

Text analytics can generate terms from text to be used:
1. As a source to manually build taxonomies
2. To auto-categorize/classify content against existing
   taxonomies



2                   © 2012 Hedden Information Management
Outline
    Taxonomy Introduction: Definitions & Types
    Taxonomy Introduction: Purposes & Benefits
    Synonyms for Terms
    Auto-Indexing and Auto-Categorization
    Taxonomies for Auto-Categorization
    Taxonomy Resources




3                  © 2012 Hedden Information Management
Outline
    Taxonomy Introduction: Definitions & Types
    Taxonomy Introduction: Purposes & Benefits
    Synonyms for Terms
    Auto-Indexing and Auto-Categorization
    Taxonomies for Auto-Categorization
    Taxonomy Resources




4                  © 2012 Hedden Information Management
Definitions & Types
Broad designations                                  Specific types
(essentially the same meaning                       (different meanings):
   – used interchangeably):
    Controlled Vocabularies (CV)                         Term Lists/Pick lists
    Knowledge Organization Systems                       Synonym Rings
    Taxonomies                                           Authority Files
                                                         Taxonomies
                                                            Hierarchical
                                                            Faceted
                                                         Thesauri
                                                         Ontologies



5                     © 2012 Hedden Information Management
Definitions & Types

Broad Designations:
Controlled vocabulary, knowledge organization system, taxonomy
  An authoritative, restricted list of terms (words or phrases)
  Each term for a single unambiguous concept
  (synonyms/nonpreferred terms, as cross-references, may be
  included)
  Policies (control) for who, when, and how new terms can be added
  May or may not have structured relationships between terms
  To support indexing/tagging/metadata management of content to
  facilitate content management and retrieval




  6                  © 2012 Hedden Information Management
Definitions & Types

Specific types

    Term Lists/Pick lists
    Synonym Rings
    Authority Files
    Taxonomies
    Thesauri
    Ontologies




7                      © 2012 Hedden Information Management
Definitions & Types: Specific Types
Term List
  A simple list of terms
  Lacking synonyms, usually short
  enough for browsing
  Often displayed in drop-down scroll
  boxes




8                  © 2012 Hedden Information Management
Definitions & Types: Specific Types

Synonym Ring
    A controlled vocabulary with
    synonyms or near-synonyms for
    each concept
    No designated “preferred” term:
    All terms are equal and point to
    each other, as in a ring.
    Table for terms does not display
    to the user




9                     © 2012 Hedden Information Management
Definitions & Types: Specific Types
Taxonomy
  A controlled vocabulary with internal structure.
  Terms are grouped or have hierarchical relationships.
  Emphasizes categories and classification for end-user
  display.
  May or may not have synonyms.

     Hierarchical – all terms have broader/narrower
     relationships to each other to form one big hierarchy
     Faceted – terms are grouped by attribute/aspect and are
     used in combination for indexing and search


10                   © 2012 Hedden Information Management
Definitions & Types: Specific Types Leisure and culture
                                                              .   Arts and entertainment venues
                                                              .   . Museums and galleries
Hierarchical Taxonomy                                         .   Children's activities
                                                              .   Culture and creativity
example  (UK’s IPSV):                                         .   . Architecture
                                                              .   . Crafts
 Top Level Headings                                           .   . Heritage
                                                              .   . Literature
       Business and industry                                  .   . Music
       Economics and finance                                  .   . Performing arts
       Education and skills                                   .   . Visual arts
       Employment, jobs and careers                           .   Entertainment and events
       Environment                                            .   Gambling and lotteries
       Government, politics and public                        .   Hobbies and interests
       administration                                         .   Parks and gardens
       Health, well-being and care                            .   Sports and recreation
       Housing                                                .   . Team sports
       Information and communication                          .   . . Cricket
       International affairs and defence                      .   . . Football
       Leisure and culture                                    .   . . Rugby
       Life in the community                                  .   . Water sports
       People and organisations                               .   . Winter sports
       Public order, justice and rights                       .   Sports and recreation facilities
       Science, technology and innovation                     .   Tourism
       Transport and infrastructure                           .   . Passports and visas
                                                              .   Young people's activities
  11                              © 2012 Hedden Information Management
Definitions & Types: Specific Types
Faceted Taxonomy examples




                   © 2012 Hedden Information Management
Outline
     Taxonomy Introduction: Definitions & Types
     Taxonomy Introduction: Purposes & Benefits
     Synonyms for Terms
     Auto-Indexing and Auto-Categorization
     Taxonomies for Auto-Categorization
     Taxonomy Resources




13                  © 2012 Hedden Information Management
Purposes & Benefits


1.   Controlled vocabulary aspect:
     Brings together different wordings (synonyms) for the same
     concept and disambiguates terms
        Helps people search for information by different
        names
        Helps people retrieve matching concepts, not just
        words

2.   Taxonomy or thesaurus structure aspect:
     Organizes information into a logical structure
        Helps people browse or navigate for information

                         © 2012 Hedden Information Management
Purposes & Benefits


  Helps people search for information by different names

  There are multiple ways to describe the same thing.
  A controlled vocabulary gathers synonyms, acronyms,
  variant spellings, etc.
  Without a controlled vocabulary keyword searches would
  miss some relevant documents, due to:
     Use of different words (e.g. Attorneys, instead of
     Lawyers)
     Use of different phrases (e.g. Deceptive acts or
     practices instead of Unfair practices)
     User does not knowing the spelling of unusual names
     (e.g. Condoleezza Rice)

                   © 2012 Hedden Information Management
Purposes & Benefits




                 © 2012 Hedden Information Management
Purposes & Benefits


  Helps people retrieve matching concepts, not just words

  A single term may have multiple meanings.
  Controlled vocabulary terms can be clarified/ disambiguated.
  Without a controlled vocabulary, too many irrelevant
  documents would be retrieved.
  A search restricted on the controlled vocabulary retrieves
  concepts not just words.
     Excludes document with mere text-string matches (e.g.
     monitors for computers, not the verb “observes”)




                   © 2012 Hedden Information Management
Outline
     Taxonomy Introduction: Definitions & Types
     Taxonomy Introduction: Purposes & Benefits
     Synonyms for Terms
     Auto-Indexing and Auto-Categorization
     Taxonomies for Auto-Categorization
     Taxonomy Resources




18                  © 2012 Hedden Information Management
Synonyms for Terms

  Supports search in most controlled vocabulary types:
  synonym rings, authority files, thesauri, (some taxonomies)
  Anticipating both:
     varied user search string entries
     varied forms in the text for the same content
  For both manual and automated indexing
  A concept may have any number of synonyms,
   but a synonym can point to only one preferred term
  Varied synonym sources:
     Search analytics records
     Interviews and use cases
     Legacy print indexes
     Obvious patterns (acronyms, phrase inversions, etc.)

                    © 2012 Hedden Information Management
Synonyms for Terms

Not all are “synonyms.”

Types include:
  synonyms: Cars USE Automobiles
  near-synonyms: Junior high USE Middle school
  variant spellings: Defence USE Defense
  lexical variants: Hair loss USE Baldness
  foreign language proper nouns: Luftwaffe USE German Air Force
  acronyms/spelled out forms: UN USE United Nations
  scientific/technical names: Neoplasms USE Cancer
  phrase variations (in print): Buses, school USE School buses
  antonyms: Misbehavior USE Behavior
  narrower terms: Alcoholism USE Substance abuse

Also called “variant terms,” “equivalence” terms, “non-preferred
                        © 2012 Hedden Information Management
   terms”
Outline
     Taxonomy Introduction: Definitions & Types
     Taxonomy Introduction: Purposes & Benefits
     Synonyms for Terms
     Auto-Indexing and Auto-Categorization
     Taxonomies for Auto-Categorization
     Taxonomy Resources




21                  © 2012 Hedden Information Management
Auto-Indexing and Auto-Categorization

Choosing human vs. automated indexing:

Human indexing                             Automated indexing
• Manageable number of docs                • Very large number of docs
• Higher accuracy in indexing              • Greater speed in indexing
• May include non-text files               • Text files only
• Invest in people                         • Invest in technology
• Low-tech: can build your own             • High-tech: must purchase
  indexing UI                                auto-indexing software
• Internal control                         • Software vendor relationship




  22                   © 2012 Hedden Information Management
Auto-Indexing and Auto-Categorization

Automated Indexing Technologies
   Entity extraction
   Text analytics and text mining, based on NLP
   Auto-categorization




  23                 © 2012 Hedden Information Management
Auto-Indexing and Auto-Categorization

Choosing auto-indexing methods:

Information extraction/text analytics         Auto-categorization
•   For varied and undifferentiated           •   For consistent doc types/formats
    document types                            •   For structured or pre-tagged
•   For unstructured content                      content
•   For varied subject areas                  •   For limited/focused subject
•   Terms may or may not be displayed         •   Displays categories to user
•   Not necessarily with taxonomy             •   Leverages a taxonomy


Combine both text analytics and auto-categorization:
1. Text analytics to extract concepts from unstructured varied content
2. Auto-categorization to apply benefits of a taxonomy/controlled vocabulary


    24                    © 2012 Hedden Information Management
Auto-Indexing and Auto-Categorization
auto-categorization = auto-classification = automated subject
   indexing

Auto-categorization makes use of the controlled vocabulary
   matched with extracted terms.

Primary auto-categorization technologies:
   1. Machine-learning and training documents
   2. Rules-based categorization




                     © 2012 Hedden Information Management
Auto-Indexing and Auto-Categorization

Machine-learning based auto-categorization:
      Automatically indexes based on previous examples
      Complex mathematical algorithms are created
      Taxonomist must then provide multiple representative sample
      documents for each CV term to “train” the system.
      Best if pre-indexed records exist (i.e. converting from human to
      automated indexing), then hundreds of varied documents can be
      used for each term.




 26                    © 2012 Hedden Information Management
Auto-Indexing and Auto-Categorization

Rules-based auto-categorization
       Taxonomist must write rules for each CV term
       Like advanced Boolean searching or regular expressions

Example:


Bush
IF (INITIAL CAPS AND (MENTIONS "president*" OR WITH administration*" OR
        AROUND "white house" OR NEAR "george"))
USE
        U.S. President
ELSE
        USE Shrubs
ENDIF
                                                 Data Harmony



  27                          © 2012 Hedden Information Management
Outline
     Taxonomy Introduction: Definitions & Types
     Taxonomy Introduction: Purposes & Benefits
     Synonyms for Terms
     Auto-Indexing and Auto-Categorization
     Taxonomies for Auto-Categorization
     Taxonomy Resources




28                  © 2012 Hedden Information Management
Taxonomies for Auto-Categorization

No matter which method of auto-indexing, auto-indexing impacts
    controlled vocabulary creation:
       Continual update work is needed (new training documents or
       new rules) for each new term created.
       Feeding training documents is easier for non-information
       professionals, than is writing rules




  29                    © 2012 Hedden Information Management
Taxonomies for Auto-Categorization

Taxonomies designed for auto-categorization:
    Need more, varied synonym/variant terms
    Need variant terms of different parts of speech
    Cannot have subtle differences between preferred terms
    Avoid creating many action-terms
    Taxonomy needs to be more content-tailored, content-based




  30                 © 2012 Hedden Information Management
Taxonomies for Auto-Categorization

Synonym/variant term differences:


For human-indexing                          For auto-categorization
Presidential candidates                     Presidential candidate
Candidates, presidential                    Presidential candidacy
                                            Candidate for president
                                            Candidacy for president
                                            Presidential hopeful
                                            Running for president
                                            Campaigning for president
                                            Presidential nominee



  31                    © 2012 Hedden Information Management
Outline
     Taxonomy Introduction: Definitions & Types
     Taxonomy Introduction: Purposes & Benefits
     Synonyms for Terms
     Auto-Indexing and Auto-Categorization
     Taxonomies for Auto-Categorization
     Taxonomy Resources




32                  © 2012 Hedden Information Management
Taxonomy Resources
     ANSI/NISO Z39.19 (2005) Guidelines for Construction, Format, and
     Management of Monolingual Controlled Vocabularies. Bethesda,
     MD: NISO Press. www.niso.org
     Hedden, Heather. (2010) The Accidental Taxonomist. Medford, NJ:
     Information Today Inc. www.accidental-taxonomist.com
     American Society for Indexing: Taxonomies and Controlled
     Vocabularies Special Interest Group www.taxonomies-sig.org
     Special Libraries Association (SLA): Taxonomy Division
     http://wiki.sla.org/display/SLATAX
     Taxonomy Community of Practice discussion group
     http://finance.groups.yahoo.com/group/TaxoCoP
     "Taxonomies and Controlled Vocabularies“ Simmons College
     Graduate School of Library and Information Science Continuing
     Education Program, 5 weeks. $250. November 2012, January 2013.
     http://alanis.simmons.edu/ceweb/byinstructor.php#9
33                                                                33
                         © 2012 Hedden Information Management
Questions/Contact




Heather Hedden
Hedden Information Management
Carlisle, MA
heather@hedden.net
978-467-5195
www.hedden-information.com
www.linkedin.com/in/hedden
twitter.com/hhedden
accidental-taxonomist.blogspot.com


34                      © 2012 Hedden Information Management

More Related Content

What's hot

Selecting Software for Taxonomy, Thesaurus and Ontology Management
Selecting Software for Taxonomy, Thesaurus and Ontology ManagementSelecting Software for Taxonomy, Thesaurus and Ontology Management
Selecting Software for Taxonomy, Thesaurus and Ontology ManagementHeather Hedden
 
Benefits of Taxonomies
Benefits of TaxonomiesBenefits of Taxonomies
Benefits of TaxonomiesHeather Hedden
 
Why Are Taxonomies Necessary?
Why Are Taxonomies Necessary?Why Are Taxonomies Necessary?
Why Are Taxonomies Necessary?Fred Leise
 
Managing Taxonomy Tagging
Managing Taxonomy TaggingManaging Taxonomy Tagging
Managing Taxonomy TaggingHeather Hedden
 
A Brief Introduction to SKOS
A Brief Introduction to SKOSA Brief Introduction to SKOS
A Brief Introduction to SKOSHeather Hedden
 
Mapping Taxonomies, Thesauri, and Ontologies
Mapping Taxonomies, Thesauri, and OntologiesMapping Taxonomies, Thesauri, and Ontologies
Mapping Taxonomies, Thesauri, and OntologiesHeather Hedden
 
A Brief Introduction to Knowledge Graphs
A Brief Introduction to Knowledge GraphsA Brief Introduction to Knowledge Graphs
A Brief Introduction to Knowledge GraphsHeather Hedden
 
The Role of Thesauri in Data Modeling
The Role of Thesauri in Data ModelingThe Role of Thesauri in Data Modeling
The Role of Thesauri in Data ModelingDanny Greefhorst
 
Taxonomy Design for SharePoint
Taxonomy Design for SharePointTaxonomy Design for SharePoint
Taxonomy Design for SharePointHeather Hedden
 
Customer-Focused Thesauri
Customer-Focused ThesauriCustomer-Focused Thesauri
Customer-Focused ThesauriHeather Hedden
 
Introduction To Controlled Vocabularies
Introduction To Controlled VocabulariesIntroduction To Controlled Vocabularies
Introduction To Controlled VocabulariesFred Leise
 

What's hot (20)

Selecting Software for Taxonomy, Thesaurus and Ontology Management
Selecting Software for Taxonomy, Thesaurus and Ontology ManagementSelecting Software for Taxonomy, Thesaurus and Ontology Management
Selecting Software for Taxonomy, Thesaurus and Ontology Management
 
Taxonomy Fundamentals Workshop 2013
Taxonomy Fundamentals Workshop 2013Taxonomy Fundamentals Workshop 2013
Taxonomy Fundamentals Workshop 2013
 
Taxonomy 101
Taxonomy 101Taxonomy 101
Taxonomy 101
 
Benefits of Taxonomies
Benefits of TaxonomiesBenefits of Taxonomies
Benefits of Taxonomies
 
Why Are Taxonomies Necessary?
Why Are Taxonomies Necessary?Why Are Taxonomies Necessary?
Why Are Taxonomies Necessary?
 
Managing Taxonomy Tagging
Managing Taxonomy TaggingManaging Taxonomy Tagging
Managing Taxonomy Tagging
 
Ontology and Other Semantic Options
Ontology and Other Semantic OptionsOntology and Other Semantic Options
Ontology and Other Semantic Options
 
A Brief Introduction to SKOS
A Brief Introduction to SKOSA Brief Introduction to SKOS
A Brief Introduction to SKOS
 
Taxonomy Fundamentals Workshop
Taxonomy Fundamentals WorkshopTaxonomy Fundamentals Workshop
Taxonomy Fundamentals Workshop
 
Taxonomy and seo sla 05-06-10(jc)
Taxonomy and seo   sla 05-06-10(jc)Taxonomy and seo   sla 05-06-10(jc)
Taxonomy and seo sla 05-06-10(jc)
 
Mapping Taxonomies, Thesauri, and Ontologies
Mapping Taxonomies, Thesauri, and OntologiesMapping Taxonomies, Thesauri, and Ontologies
Mapping Taxonomies, Thesauri, and Ontologies
 
A Brief Introduction to Knowledge Graphs
A Brief Introduction to Knowledge GraphsA Brief Introduction to Knowledge Graphs
A Brief Introduction to Knowledge Graphs
 
Taxonomies for Users
Taxonomies for UsersTaxonomies for Users
Taxonomies for Users
 
Machine Aided Indexer
Machine Aided IndexerMachine Aided Indexer
Machine Aided Indexer
 
Thesauri
ThesauriThesauri
Thesauri
 
The Role of Thesauri in Data Modeling
The Role of Thesauri in Data ModelingThe Role of Thesauri in Data Modeling
The Role of Thesauri in Data Modeling
 
Taxonomy Design for SharePoint
Taxonomy Design for SharePointTaxonomy Design for SharePoint
Taxonomy Design for SharePoint
 
Customer-Focused Thesauri
Customer-Focused ThesauriCustomer-Focused Thesauri
Customer-Focused Thesauri
 
The Myth of Topic Maps
The Myth of Topic MapsThe Myth of Topic Maps
The Myth of Topic Maps
 
Introduction To Controlled Vocabularies
Introduction To Controlled VocabulariesIntroduction To Controlled Vocabularies
Introduction To Controlled Vocabularies
 

Viewers also liked

Taxonomy, ontology, folksonomies & SKOS.
Taxonomy, ontology, folksonomies & SKOS.Taxonomy, ontology, folksonomies & SKOS.
Taxonomy, ontology, folksonomies & SKOS.Janet Leu
 
Taxonomy Development and Digital Projects
Taxonomy Development and Digital ProjectsTaxonomy Development and Digital Projects
Taxonomy Development and Digital Projects daniela barbosa
 
Taxonomy and UX lessons learned for e-commerce
Taxonomy and UX lessons learned for e-commerceTaxonomy and UX lessons learned for e-commerce
Taxonomy and UX lessons learned for e-commerceGary Carlson
 
Starting a Taxonomy Project (Presented at SLA 2013 Conference)
Starting a Taxonomy Project (Presented at SLA 2013 Conference)Starting a Taxonomy Project (Presented at SLA 2013 Conference)
Starting a Taxonomy Project (Presented at SLA 2013 Conference)Miraida Morales
 
Taming taxonomy—a practical intro
Taming taxonomy—a practical introTaming taxonomy—a practical intro
Taming taxonomy—a practical introAlberta Soranzo
 
homophone, homonomy, polysemy
homophone, homonomy, polysemyhomophone, homonomy, polysemy
homophone, homonomy, polysemystefani llontop
 
Taxonomies for E-commerce
Taxonomies for E-commerceTaxonomies for E-commerce
Taxonomies for E-commerceHeather Hedden
 
Understanding Website Taxonomy
Understanding Website TaxonomyUnderstanding Website Taxonomy
Understanding Website TaxonomyIksula
 
Taxonomy Is User Experience
Taxonomy Is User ExperienceTaxonomy Is User Experience
Taxonomy Is User ExperienceDave Cooksey
 
Academic word list synonyms vietnamese
Academic word list synonyms   vietnameseAcademic word list synonyms   vietnamese
Academic word list synonyms vietnamesePhan Thị Trai Anh
 
Polysemous and Homonymous expressions
Polysemous and Homonymous expressionsPolysemous and Homonymous expressions
Polysemous and Homonymous expressionsJuan Miguel Palero
 

Viewers also liked (20)

Taxonomy, ontology, folksonomies & SKOS.
Taxonomy, ontology, folksonomies & SKOS.Taxonomy, ontology, folksonomies & SKOS.
Taxonomy, ontology, folksonomies & SKOS.
 
Taxonomy Interoperability Standards
Taxonomy Interoperability StandardsTaxonomy Interoperability Standards
Taxonomy Interoperability Standards
 
Naming the nyms
Naming the nymsNaming the nyms
Naming the nyms
 
Taxonomy Development and Digital Projects
Taxonomy Development and Digital ProjectsTaxonomy Development and Digital Projects
Taxonomy Development and Digital Projects
 
User-Driven Taxonomies
User-Driven TaxonomiesUser-Driven Taxonomies
User-Driven Taxonomies
 
Taxonomy and UX lessons learned for e-commerce
Taxonomy and UX lessons learned for e-commerceTaxonomy and UX lessons learned for e-commerce
Taxonomy and UX lessons learned for e-commerce
 
Starting a Taxonomy Project (Presented at SLA 2013 Conference)
Starting a Taxonomy Project (Presented at SLA 2013 Conference)Starting a Taxonomy Project (Presented at SLA 2013 Conference)
Starting a Taxonomy Project (Presented at SLA 2013 Conference)
 
Nym S
Nym SNym S
Nym S
 
E-Commerce Taxonomies
E-Commerce TaxonomiesE-Commerce Taxonomies
E-Commerce Taxonomies
 
Taxonomy Governance and Iteration
Taxonomy Governance and IterationTaxonomy Governance and Iteration
Taxonomy Governance and Iteration
 
Taming taxonomy—a practical intro
Taming taxonomy—a practical introTaming taxonomy—a practical intro
Taming taxonomy—a practical intro
 
homophone, homonomy, polysemy
homophone, homonomy, polysemyhomophone, homonomy, polysemy
homophone, homonomy, polysemy
 
Taxonomy And Metadata
Taxonomy And MetadataTaxonomy And Metadata
Taxonomy And Metadata
 
Taxonomies for E-commerce
Taxonomies for E-commerceTaxonomies for E-commerce
Taxonomies for E-commerce
 
Understanding Website Taxonomy
Understanding Website TaxonomyUnderstanding Website Taxonomy
Understanding Website Taxonomy
 
Taxonomy Is User Experience
Taxonomy Is User ExperienceTaxonomy Is User Experience
Taxonomy Is User Experience
 
Synonyms
SynonymsSynonyms
Synonyms
 
Academic word list synonyms vietnamese
Academic word list synonyms   vietnameseAcademic word list synonyms   vietnamese
Academic word list synonyms vietnamese
 
Polysemous and Homonymous expressions
Polysemous and Homonymous expressionsPolysemous and Homonymous expressions
Polysemous and Homonymous expressions
 
Semantics
SemanticsSemantics
Semantics
 

Similar to Taxonomies for Text Analytics and Auto-indexing

Topic Maps - Human-oriented semantics?
Topic Maps - Human-oriented semantics?Topic Maps - Human-oriented semantics?
Topic Maps - Human-oriented semantics?Lars Marius Garshol
 
Dealing the Cards
Dealing the CardsDealing the Cards
Dealing the CardsTSoholt
 
Taxonomy design best practices
Taxonomy design best practices Taxonomy design best practices
Taxonomy design best practices voginip
 
Metadata practice and direction: a community perspective
Metadata practice and direction:a community perspectiveMetadata practice and direction:a community perspective
Metadata practice and direction: a community perspectivelisld
 
Deep Machine Reading for Customer Analytics
Deep Machine Reading for Customer AnalyticsDeep Machine Reading for Customer Analytics
Deep Machine Reading for Customer AnalyticsNaveen Ashish
 
Jarrar: Logical Foundation of Ontology Engineering
Jarrar: Logical Foundation of Ontology EngineeringJarrar: Logical Foundation of Ontology Engineering
Jarrar: Logical Foundation of Ontology EngineeringMustafa Jarrar
 
Chapter 9Enterprise Content and Record ManagementSt. Rit
Chapter 9Enterprise Content and Record ManagementSt. RitChapter 9Enterprise Content and Record ManagementSt. Rit
Chapter 9Enterprise Content and Record ManagementSt. RitJinElias52
 
Professional ethics
Professional ethicsProfessional ethics
Professional ethicsAbaid Ch
 
Themes identification techniques in qualitative research
Themes identification techniques in qualitative researchThemes identification techniques in qualitative research
Themes identification techniques in qualitative researchGhulam Qambar
 
Taxonomies and Metadata in Information Architecture
Taxonomies and Metadata in Information ArchitectureTaxonomies and Metadata in Information Architecture
Taxonomies and Metadata in Information ArchitectureAccess Innovations, Inc.
 
Should libraries discontinue using and maintaining controlled subject vocabul...
Should libraries discontinue using and maintaining controlled subject vocabul...Should libraries discontinue using and maintaining controlled subject vocabul...
Should libraries discontinue using and maintaining controlled subject vocabul...Ryan Scicluna
 
Culture and Usability - Cross Cultural Issues in HCI, IIT Guwahati
Culture and Usability - Cross Cultural Issues in HCI, IIT GuwahatiCulture and Usability - Cross Cultural Issues in HCI, IIT Guwahati
Culture and Usability - Cross Cultural Issues in HCI, IIT GuwahatiSameer Chavan
 
Glis Culture 03 20071029
Glis Culture 03 20071029Glis Culture 03 20071029
Glis Culture 03 20071029Jan Pawlowski
 
Evaluating Taxonomies
Evaluating TaxonomiesEvaluating Taxonomies
Evaluating TaxonomiesJoseph Busch
 
Sfre Medline 0910
Sfre Medline 0910Sfre Medline 0910
Sfre Medline 0910JanetMac
 
Expertise Networks
Expertise NetworksExpertise Networks
Expertise NetworksJoel Alleyne
 
Empowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic EnrichmentEmpowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic EnrichmentThe Digital Group
 

Similar to Taxonomies for Text Analytics and Auto-indexing (20)

Taxonomy made easy
Taxonomy made easyTaxonomy made easy
Taxonomy made easy
 
Topic Maps - Human-oriented semantics?
Topic Maps - Human-oriented semantics?Topic Maps - Human-oriented semantics?
Topic Maps - Human-oriented semantics?
 
Dealing the Cards
Dealing the CardsDealing the Cards
Dealing the Cards
 
Taxonomy design best practices
Taxonomy design best practices Taxonomy design best practices
Taxonomy design best practices
 
Metadata practice and direction: a community perspective
Metadata practice and direction:a community perspectiveMetadata practice and direction:a community perspective
Metadata practice and direction: a community perspective
 
Deep Machine Reading for Customer Analytics
Deep Machine Reading for Customer AnalyticsDeep Machine Reading for Customer Analytics
Deep Machine Reading for Customer Analytics
 
Jarrar: Logical Foundation of Ontology Engineering
Jarrar: Logical Foundation of Ontology EngineeringJarrar: Logical Foundation of Ontology Engineering
Jarrar: Logical Foundation of Ontology Engineering
 
Chapter 9Enterprise Content and Record ManagementSt. Rit
Chapter 9Enterprise Content and Record ManagementSt. RitChapter 9Enterprise Content and Record ManagementSt. Rit
Chapter 9Enterprise Content and Record ManagementSt. Rit
 
Controlled Vocabulary.pptx
Controlled Vocabulary.pptxControlled Vocabulary.pptx
Controlled Vocabulary.pptx
 
Professional ethics
Professional ethicsProfessional ethics
Professional ethics
 
Themes identification techniques in qualitative research
Themes identification techniques in qualitative researchThemes identification techniques in qualitative research
Themes identification techniques in qualitative research
 
Taxonomies and Metadata in Information Architecture
Taxonomies and Metadata in Information ArchitectureTaxonomies and Metadata in Information Architecture
Taxonomies and Metadata in Information Architecture
 
Should libraries discontinue using and maintaining controlled subject vocabul...
Should libraries discontinue using and maintaining controlled subject vocabul...Should libraries discontinue using and maintaining controlled subject vocabul...
Should libraries discontinue using and maintaining controlled subject vocabul...
 
Culture and Usability - Cross Cultural Issues in HCI, IIT Guwahati
Culture and Usability - Cross Cultural Issues in HCI, IIT GuwahatiCulture and Usability - Cross Cultural Issues in HCI, IIT Guwahati
Culture and Usability - Cross Cultural Issues in HCI, IIT Guwahati
 
Glis Culture 03 20071029
Glis Culture 03 20071029Glis Culture 03 20071029
Glis Culture 03 20071029
 
Evaluating Taxonomies
Evaluating TaxonomiesEvaluating Taxonomies
Evaluating Taxonomies
 
Sfre Medline 0910
Sfre Medline 0910Sfre Medline 0910
Sfre Medline 0910
 
Expertise Networks
Expertise NetworksExpertise Networks
Expertise Networks
 
Empowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic EnrichmentEmpowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic Enrichment
 
How Ontologies Power Chatbots
How Ontologies Power ChatbotsHow Ontologies Power Chatbots
How Ontologies Power Chatbots
 

Recently uploaded

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 

Recently uploaded (20)

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 

Taxonomies for Text Analytics and Auto-indexing

  • 1. Taxonomies for Text Analytics and Auto-Indexing Heather Hedden Hedden Information Management Text Analytics World, Boston, MA October 4, 2012
  • 2. Introduction: Text Analytics and Taxonomies Text analytics can be used to index content without the use of taxonomies/controlled vocabularies. Text analytics can be used to index content with taxonomies/controlled vocabularies for better results. Text analytics can generate terms from text to be used: 1. As a source to manually build taxonomies 2. To auto-categorize/classify content against existing taxonomies 2 © 2012 Hedden Information Management
  • 3. Outline Taxonomy Introduction: Definitions & Types Taxonomy Introduction: Purposes & Benefits Synonyms for Terms Auto-Indexing and Auto-Categorization Taxonomies for Auto-Categorization Taxonomy Resources 3 © 2012 Hedden Information Management
  • 4. Outline Taxonomy Introduction: Definitions & Types Taxonomy Introduction: Purposes & Benefits Synonyms for Terms Auto-Indexing and Auto-Categorization Taxonomies for Auto-Categorization Taxonomy Resources 4 © 2012 Hedden Information Management
  • 5. Definitions & Types Broad designations Specific types (essentially the same meaning (different meanings): – used interchangeably): Controlled Vocabularies (CV) Term Lists/Pick lists Knowledge Organization Systems Synonym Rings Taxonomies Authority Files Taxonomies Hierarchical Faceted Thesauri Ontologies 5 © 2012 Hedden Information Management
  • 6. Definitions & Types Broad Designations: Controlled vocabulary, knowledge organization system, taxonomy An authoritative, restricted list of terms (words or phrases) Each term for a single unambiguous concept (synonyms/nonpreferred terms, as cross-references, may be included) Policies (control) for who, when, and how new terms can be added May or may not have structured relationships between terms To support indexing/tagging/metadata management of content to facilitate content management and retrieval 6 © 2012 Hedden Information Management
  • 7. Definitions & Types Specific types Term Lists/Pick lists Synonym Rings Authority Files Taxonomies Thesauri Ontologies 7 © 2012 Hedden Information Management
  • 8. Definitions & Types: Specific Types Term List A simple list of terms Lacking synonyms, usually short enough for browsing Often displayed in drop-down scroll boxes 8 © 2012 Hedden Information Management
  • 9. Definitions & Types: Specific Types Synonym Ring A controlled vocabulary with synonyms or near-synonyms for each concept No designated “preferred” term: All terms are equal and point to each other, as in a ring. Table for terms does not display to the user 9 © 2012 Hedden Information Management
  • 10. Definitions & Types: Specific Types Taxonomy A controlled vocabulary with internal structure. Terms are grouped or have hierarchical relationships. Emphasizes categories and classification for end-user display. May or may not have synonyms. Hierarchical – all terms have broader/narrower relationships to each other to form one big hierarchy Faceted – terms are grouped by attribute/aspect and are used in combination for indexing and search 10 © 2012 Hedden Information Management
  • 11. Definitions & Types: Specific Types Leisure and culture . Arts and entertainment venues . . Museums and galleries Hierarchical Taxonomy  . Children's activities . Culture and creativity example  (UK’s IPSV): . . Architecture . . Crafts Top Level Headings . . Heritage . . Literature Business and industry . . Music Economics and finance . . Performing arts Education and skills . . Visual arts Employment, jobs and careers . Entertainment and events Environment . Gambling and lotteries Government, politics and public . Hobbies and interests administration . Parks and gardens Health, well-being and care . Sports and recreation Housing . . Team sports Information and communication . . . Cricket International affairs and defence . . . Football Leisure and culture . . . Rugby Life in the community . . Water sports People and organisations . . Winter sports Public order, justice and rights . Sports and recreation facilities Science, technology and innovation . Tourism Transport and infrastructure . . Passports and visas . Young people's activities 11 © 2012 Hedden Information Management
  • 12. Definitions & Types: Specific Types Faceted Taxonomy examples © 2012 Hedden Information Management
  • 13. Outline Taxonomy Introduction: Definitions & Types Taxonomy Introduction: Purposes & Benefits Synonyms for Terms Auto-Indexing and Auto-Categorization Taxonomies for Auto-Categorization Taxonomy Resources 13 © 2012 Hedden Information Management
  • 14. Purposes & Benefits 1. Controlled vocabulary aspect: Brings together different wordings (synonyms) for the same concept and disambiguates terms Helps people search for information by different names Helps people retrieve matching concepts, not just words 2. Taxonomy or thesaurus structure aspect: Organizes information into a logical structure Helps people browse or navigate for information © 2012 Hedden Information Management
  • 15. Purposes & Benefits Helps people search for information by different names There are multiple ways to describe the same thing. A controlled vocabulary gathers synonyms, acronyms, variant spellings, etc. Without a controlled vocabulary keyword searches would miss some relevant documents, due to: Use of different words (e.g. Attorneys, instead of Lawyers) Use of different phrases (e.g. Deceptive acts or practices instead of Unfair practices) User does not knowing the spelling of unusual names (e.g. Condoleezza Rice) © 2012 Hedden Information Management
  • 16. Purposes & Benefits © 2012 Hedden Information Management
  • 17. Purposes & Benefits Helps people retrieve matching concepts, not just words A single term may have multiple meanings. Controlled vocabulary terms can be clarified/ disambiguated. Without a controlled vocabulary, too many irrelevant documents would be retrieved. A search restricted on the controlled vocabulary retrieves concepts not just words. Excludes document with mere text-string matches (e.g. monitors for computers, not the verb “observes”) © 2012 Hedden Information Management
  • 18. Outline Taxonomy Introduction: Definitions & Types Taxonomy Introduction: Purposes & Benefits Synonyms for Terms Auto-Indexing and Auto-Categorization Taxonomies for Auto-Categorization Taxonomy Resources 18 © 2012 Hedden Information Management
  • 19. Synonyms for Terms Supports search in most controlled vocabulary types: synonym rings, authority files, thesauri, (some taxonomies) Anticipating both: varied user search string entries varied forms in the text for the same content For both manual and automated indexing A concept may have any number of synonyms, but a synonym can point to only one preferred term Varied synonym sources: Search analytics records Interviews and use cases Legacy print indexes Obvious patterns (acronyms, phrase inversions, etc.) © 2012 Hedden Information Management
  • 20. Synonyms for Terms Not all are “synonyms.” Types include: synonyms: Cars USE Automobiles near-synonyms: Junior high USE Middle school variant spellings: Defence USE Defense lexical variants: Hair loss USE Baldness foreign language proper nouns: Luftwaffe USE German Air Force acronyms/spelled out forms: UN USE United Nations scientific/technical names: Neoplasms USE Cancer phrase variations (in print): Buses, school USE School buses antonyms: Misbehavior USE Behavior narrower terms: Alcoholism USE Substance abuse Also called “variant terms,” “equivalence” terms, “non-preferred © 2012 Hedden Information Management terms”
  • 21. Outline Taxonomy Introduction: Definitions & Types Taxonomy Introduction: Purposes & Benefits Synonyms for Terms Auto-Indexing and Auto-Categorization Taxonomies for Auto-Categorization Taxonomy Resources 21 © 2012 Hedden Information Management
  • 22. Auto-Indexing and Auto-Categorization Choosing human vs. automated indexing: Human indexing Automated indexing • Manageable number of docs • Very large number of docs • Higher accuracy in indexing • Greater speed in indexing • May include non-text files • Text files only • Invest in people • Invest in technology • Low-tech: can build your own • High-tech: must purchase indexing UI auto-indexing software • Internal control • Software vendor relationship 22 © 2012 Hedden Information Management
  • 23. Auto-Indexing and Auto-Categorization Automated Indexing Technologies Entity extraction Text analytics and text mining, based on NLP Auto-categorization 23 © 2012 Hedden Information Management
  • 24. Auto-Indexing and Auto-Categorization Choosing auto-indexing methods: Information extraction/text analytics Auto-categorization • For varied and undifferentiated • For consistent doc types/formats document types • For structured or pre-tagged • For unstructured content content • For varied subject areas • For limited/focused subject • Terms may or may not be displayed • Displays categories to user • Not necessarily with taxonomy • Leverages a taxonomy Combine both text analytics and auto-categorization: 1. Text analytics to extract concepts from unstructured varied content 2. Auto-categorization to apply benefits of a taxonomy/controlled vocabulary 24 © 2012 Hedden Information Management
  • 25. Auto-Indexing and Auto-Categorization auto-categorization = auto-classification = automated subject indexing Auto-categorization makes use of the controlled vocabulary matched with extracted terms. Primary auto-categorization technologies: 1. Machine-learning and training documents 2. Rules-based categorization © 2012 Hedden Information Management
  • 26. Auto-Indexing and Auto-Categorization Machine-learning based auto-categorization: Automatically indexes based on previous examples Complex mathematical algorithms are created Taxonomist must then provide multiple representative sample documents for each CV term to “train” the system. Best if pre-indexed records exist (i.e. converting from human to automated indexing), then hundreds of varied documents can be used for each term. 26 © 2012 Hedden Information Management
  • 27. Auto-Indexing and Auto-Categorization Rules-based auto-categorization Taxonomist must write rules for each CV term Like advanced Boolean searching or regular expressions Example: Bush IF (INITIAL CAPS AND (MENTIONS "president*" OR WITH administration*" OR AROUND "white house" OR NEAR "george")) USE U.S. President ELSE USE Shrubs ENDIF Data Harmony 27 © 2012 Hedden Information Management
  • 28. Outline Taxonomy Introduction: Definitions & Types Taxonomy Introduction: Purposes & Benefits Synonyms for Terms Auto-Indexing and Auto-Categorization Taxonomies for Auto-Categorization Taxonomy Resources 28 © 2012 Hedden Information Management
  • 29. Taxonomies for Auto-Categorization No matter which method of auto-indexing, auto-indexing impacts controlled vocabulary creation: Continual update work is needed (new training documents or new rules) for each new term created. Feeding training documents is easier for non-information professionals, than is writing rules 29 © 2012 Hedden Information Management
  • 30. Taxonomies for Auto-Categorization Taxonomies designed for auto-categorization: Need more, varied synonym/variant terms Need variant terms of different parts of speech Cannot have subtle differences between preferred terms Avoid creating many action-terms Taxonomy needs to be more content-tailored, content-based 30 © 2012 Hedden Information Management
  • 31. Taxonomies for Auto-Categorization Synonym/variant term differences: For human-indexing For auto-categorization Presidential candidates Presidential candidate Candidates, presidential Presidential candidacy Candidate for president Candidacy for president Presidential hopeful Running for president Campaigning for president Presidential nominee 31 © 2012 Hedden Information Management
  • 32. Outline Taxonomy Introduction: Definitions & Types Taxonomy Introduction: Purposes & Benefits Synonyms for Terms Auto-Indexing and Auto-Categorization Taxonomies for Auto-Categorization Taxonomy Resources 32 © 2012 Hedden Information Management
  • 33. Taxonomy Resources ANSI/NISO Z39.19 (2005) Guidelines for Construction, Format, and Management of Monolingual Controlled Vocabularies. Bethesda, MD: NISO Press. www.niso.org Hedden, Heather. (2010) The Accidental Taxonomist. Medford, NJ: Information Today Inc. www.accidental-taxonomist.com American Society for Indexing: Taxonomies and Controlled Vocabularies Special Interest Group www.taxonomies-sig.org Special Libraries Association (SLA): Taxonomy Division http://wiki.sla.org/display/SLATAX Taxonomy Community of Practice discussion group http://finance.groups.yahoo.com/group/TaxoCoP "Taxonomies and Controlled Vocabularies“ Simmons College Graduate School of Library and Information Science Continuing Education Program, 5 weeks. $250. November 2012, January 2013. http://alanis.simmons.edu/ceweb/byinstructor.php#9 33 33 © 2012 Hedden Information Management
  • 34. Questions/Contact Heather Hedden Hedden Information Management Carlisle, MA heather@hedden.net 978-467-5195 www.hedden-information.com www.linkedin.com/in/hedden twitter.com/hhedden accidental-taxonomist.blogspot.com 34 © 2012 Hedden Information Management