SlideShare une entreprise Scribd logo
1  sur  35
Automatic Taxonomy Generation
for a News Group
Anna Divoli
Pingar Research
@annadivoli
San Francisco Apr 2013
A Case Study
Why Automatic Generation?
Dynamic
Fast
Cheap
Consistent
RDF / Flexible
…
Why from a Document Collection?
Focused/specific
Optimal for those documents
…
Why?
The Team
The Process
News Group Case Study
Evaluation
Other Use Cases
Summary
Talk Overview
San Francisco Apr 2013
Taxonomy Generation Research Team
Olena Medelyan, Steve Manion, Jeen Broekstra, Anna Divoli, Anna Lan Huang and Ian Witten
Constructing a Focused Taxonomy from a Document Collection
To appear in Proceedings of the Extended Semantic Web Conference 2013,
ESWC, Montpellier, France
?
How Taxonomy Generation Works
Input:
Documents
stored somewhere
Analysis:
Using variety of tools*
and datasets, extract
concepts,
entities, relations
Grouping & Output:
A taxonomy is created
that groups resulting
taxonomy terms
hierarchically
Custom
Taxonomy
Taxonomy Generation Overview
Taxonomy Generation - Detailed
Document
Database
Solr
Concepts &
Relations Database
Sesame
1. Import
& convert to text
2. Extract concepts
3. Annotate
with Linked Data
4. Disambiguate
clashing concepts
5. Consolidate
taxonomy
Input
Docs
Preferred
top-level terms
Focused
SKOS
Taxonomy
Taxonomy Generation in 5 Steps!
Input
Documents Document
Database
1. Convert to text
Current input:
• Directory path read
recursively
Other possible inputs:
• Docs in a database or a DMS
• Emails +attachments
(Exchange)
• Website URL
• RSS feed
External tool to
convert different file
formats to text
Database to store
document content
Step 1. Document input & conversion
Documents
Database
Concepts
Database
2. Extract concepts
http://localhost/solr/select?q=path:mycollectiondocument456.txt
Pingar API:
Taxonomy Terms:
Climate and Weather
Leaders
Agreements
People:
Yvo de Boer
Maite Nkoana-Mashabane
Organizations:
Associated Press
South African Council of Churches
Locations:
South Africa
Wikify:
Wikipedia Terms:
South Africa
Yvo de Boer
U.N.
Climate agreements
Associated Press
Specific terminology:
green policies; climate diplomacy
Step 2. Extracting concepts
Annotations
Database
3. Annotate with
Linked Data
mycollection/document456.txt
Pingar API:
People:
Yvo de Boer
Maite Nkoana-Mashabane
Organizations:
Associated Press
South African Council of Churches
Locations:
South Africa
Later this additional info
will help create
e-Discovery & semantic search
solutions
Concepts
Database
Step 3. Annotation with meaning
Final Concepts
Database
4. Disambiguate
clashing concepts
wikipedia.org/wiki/Ocean
wikipedia.org/wiki/Apple_Corps freebase.com/view/en/apple_inc
www.fao.org/aos/agrovoc#c_4607
Over the past three years, Apple has acquired three mapping companies
For millions of years, the oceans have been filled with sounds from natural sources.
Two concepts were extracted,
that are dissimilar
Discard the incorrect one
Two concepts were extracted,
that are similar
Accept both correct
Agrovoc term:
Marine areas
Concepts
Database
Step 4. Discarding irrelevant meanings
5a. Add relationsConcepts &
Relations Database
felines tiger bird
horse family
zebra donkey pigeonhorselizard
Category:Carnivorous animals Category:Animals
animals Building the taxonomy
bottom up
Broader: Sqamata/Reptiles/Tetrapods/Vertebrates/Chordates/Animals
Focused
SKOS
Taxonomy
Step 5a. Group taxonomy
Films and film making
Film stars
Mila Kunis
Daniel Radcliffe
Sally Hawkins
Julianna Margulies
Association football clubs
Former Football League clubs
Manchester United F.C.
Manchester United F.C.
Manchester City F.C.
Finance
Economics and finance
Personal finance
Commercial finance
Tax
Capital gains tax
Tax
Capital gains tax
5b. Prune relationsConcepts &
Relations Database
Focused
SKOS
Taxonomy
Step 5b. Consolidating taxonomy
Analysis:
Using variety of tools*
and datasets, extract
concepts, entities, relations
Custom
Taxonomy
Taxonomy Generation Process
Input:
Documents
stored somewhere
Output:
A taxonomy is created
that groups resulting
taxonomy terms hierarhically
* Pingar API for People, Organization, Locations & Taxonomy Terms from
related taxonomies;
Wikification for related Wikipedia articles and category relations;
Linked Data analysis for creating links to Freebase & DBpedia
File-share
SharePoint
Exchange
Etc
?
How Does It Look Like?
Fairfax NZ
This taxonomy was created from 2000 news
articles by Fairfax New Zealand around
Christmas 2011.
Taxonomy Statistics
Concept Count: 10158
Edges Count: 12668
Intermediate Count: 1383
Leaves Count: 8748
Labels Count: 11545
Nesting Counts
0: 27, 1: 6102, 2: 2903, 3: 2891
4: 2057, 5: 1202, 6: 745, 7: 354
8: 179, 9: 41, 10: 10
Average Depth: 2.65
Case Study: A News Group
Case Study: A News Group
Case Study: A News Group
Case Study: A News Group
Case Study: A News Group
Case Study: A News Group
Case Study: A News Group
Labels & Relations
Case Study: A News Group
Case Study: A News Group
Fairfax - 4 Days from Sep 2001
Excerpt of the taxonomy generated from:
Fairfax articles taken from
- Sep 9th & 10th (1242 articles) and
- Sep 13th & 14th (1667 articles) NZT!
Colors of terms:
- proposed to group other terms
- found in both document collections
- in 9-10 Sep 2001 docs
- in 13-14 Sep 2001 docs
- search match
Taxonomy Statistics:
Concept Count: 12699
Edges Count: 13755
Intermediate Count: 709
Leaves Count: 11985
Labels Count: 12741
Case Study: A News Group
proposed to group other terms
in both document collections
in 9-10 Sep 2001 docs
in 13-14 Sep 2001 docs
……………………………………………………………….
……………………………………………………………….
Case Study: A News Group
proposed to group other terms
in both document collections
in 9-10 Sep 2001 docs
in 13-14 Sep 2001 docs
FairFax NZ - 4 Days from Sep 2001
Excerpt of the taxonomy generated from:
Fairfax articles taken from
- Sep 9th & 10th (1242 articles) and
- Sep 13th & 14th (1667 articles) NZT!
Colors of terms:
- proposed to group other terms
- found in both document collections
- in 9-10 Sep 2001 docs
- in 13-14 Sep 2001 docs
- search match
Taxonomy Statistics:
Concept Count: 12699
Edges Count: 13755
Intermediate Count: 709
Leaves Count: 11985
Labels Count: 12741
Average Depth: 1.85
( 0: 5 - 1: 4082 - 2: 8980 - 3: 755
4: 333 - 5: 132 - 6: 31 - 7: 6 - 8: 1 )
Including NZPSV
Taxonomy Statistics:
Concept Count: 13970
Edges Count: 15020
Intermediate Count: 1277
Leaves Count: 12677
Labels Count: 15407
Average Depth: 3
(0: 16 - 1: 10153 - 2: 1888 - 3: 1400
4: 1203 - 5: 1053 - 6: 756 - 7: 427
8: 252 - 9: 267 - 10: 341 - 11: 315
12: 330 - 13: 149 - 14: 134 - 15: 87
16: 10 )
Case Study: A News Group
September 2001 Christmas 2011
Case Study: A News Group
proposed to group other terms
in both document collections
in 9-10 Sep 2001 docs
in 13-14 Sep 2001 docs
Evaluation
Sources of error in concept identification
Type Number Errors Rate
People 1145 37 3.2%
Organizations 496 51 10.3%
Locations 988 114 11.5%
Wikipedia named entities 832 71 8.5%
Wikipedia other entities 99 16 16.4%
Taxonomy 868 229 26.4%
DBPedia 868 81 8.1%
Freebase 135 12 8.9%
Overall 3447 393 11.4%
Recall: 75%
(comparing with manually generated taxonomy for the same domain)
Precision:
89% for concepts
90% for relations
(15 human judges based evaluation)
Other Use Cases
How to refine search by metadata?What’s in these files / emails?
What to include into our
corporate taxonomy?
How to find all docs on a given topic?
Content Audit
Information Architecture
Better search with facets
Better browsing
proposed to group other concepts
in two or more document collections
in the bipolar document collection
in the breast cancer document collection
in the neither cancer or bipolar doc. collection
Other Use Cases: Discovery
Summary
Entity Extraction
Linked Data
Disambiguation
Consolidation
News Group Case Study
Other Use Cases
More?
 bit.ly/f-step
pingar.com
@PingarHQ
anna.divoli@pingar.com
@annadivoli
Focused SKOS Taxonomy Extraction Process (F-STEP) wiki

Contenu connexe

Similaire à Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Faceted Navigation (LACASIS Fall Workshop 2005)
Faceted Navigation (LACASIS Fall Workshop 2005)Faceted Navigation (LACASIS Fall Workshop 2005)
Faceted Navigation (LACASIS Fall Workshop 2005)
Bradley Allen
 
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
Julien PLU
 
Just keep clicking Till You Find It: Building a Library Digital Collection In...
Just keep clicking Till You Find It: Building a Library Digital Collection In...Just keep clicking Till You Find It: Building a Library Digital Collection In...
Just keep clicking Till You Find It: Building a Library Digital Collection In...
Gretchen Gueguen
 
Aggregation for searching complex information spaces
Aggregation for searching complex information spacesAggregation for searching complex information spaces
Aggregation for searching complex information spaces
Mounia Lalmas-Roelleke
 
Running Head ANNOTATED BIBLIOGRAPHY 1ANNOTATED BIBLIOGRAP.docx
Running Head ANNOTATED BIBLIOGRAPHY 1ANNOTATED BIBLIOGRAP.docxRunning Head ANNOTATED BIBLIOGRAPHY 1ANNOTATED BIBLIOGRAP.docx
Running Head ANNOTATED BIBLIOGRAPHY 1ANNOTATED BIBLIOGRAP.docx
toddr4
 

Similaire à Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study (20)

Faceted Navigation (LACASIS Fall Workshop 2005)
Faceted Navigation (LACASIS Fall Workshop 2005)Faceted Navigation (LACASIS Fall Workshop 2005)
Faceted Navigation (LACASIS Fall Workshop 2005)
 
Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity...
Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity...Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity...
Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity...
 
Frontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisFrontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text Analysis
 
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
 
Just keep clicking Till You Find It: Building a Library Digital Collection In...
Just keep clicking Till You Find It: Building a Library Digital Collection In...Just keep clicking Till You Find It: Building a Library Digital Collection In...
Just keep clicking Till You Find It: Building a Library Digital Collection In...
 
Open Calais Release 4.0
Open Calais Release 4.0Open Calais Release 4.0
Open Calais Release 4.0
 
Riding the wave - Paradigm shifts in information access
Riding the wave - Paradigm shifts in information accessRiding the wave - Paradigm shifts in information access
Riding the wave - Paradigm shifts in information access
 
Anna Divoli (Pingar Research): Extracting and Mapping SharePoint Content to C...
Anna Divoli (Pingar Research): Extracting and Mapping SharePoint Content to C...Anna Divoli (Pingar Research): Extracting and Mapping SharePoint Content to C...
Anna Divoli (Pingar Research): Extracting and Mapping SharePoint Content to C...
 
Finding sci tech grey literature information
Finding sci tech grey literature informationFinding sci tech grey literature information
Finding sci tech grey literature information
 
Ands National Identifier Solution
Ands National Identifier SolutionAnds National Identifier Solution
Ands National Identifier Solution
 
Aggregation for searching complex information spaces
Aggregation for searching complex information spacesAggregation for searching complex information spaces
Aggregation for searching complex information spaces
 
Research Data Curation _ Grad Humanities Class
Research Data Curation _ Grad Humanities ClassResearch Data Curation _ Grad Humanities Class
Research Data Curation _ Grad Humanities Class
 
Knowledge Graph Maintenance
Knowledge Graph MaintenanceKnowledge Graph Maintenance
Knowledge Graph Maintenance
 
Web Today, Good Tomorrow? Transactional archiving of web content [Long Version]
Web Today, Good Tomorrow? Transactional archiving of web content [Long Version]Web Today, Good Tomorrow? Transactional archiving of web content [Long Version]
Web Today, Good Tomorrow? Transactional archiving of web content [Long Version]
 
Assessing Annotated Corpora As Research Output
Assessing Annotated Corpora As Research OutputAssessing Annotated Corpora As Research Output
Assessing Annotated Corpora As Research Output
 
Connecting people, places and things
Connecting people, places and things Connecting people, places and things
Connecting people, places and things
 
Topic models, vector semantics and applications
Topic models, vector semantics and applicationsTopic models, vector semantics and applications
Topic models, vector semantics and applications
 
Standard Datasets in Information Retrieval
Standard Datasets in Information Retrieval Standard Datasets in Information Retrieval
Standard Datasets in Information Retrieval
 
Text mining and analytics v6 - p1
Text mining and analytics   v6 - p1Text mining and analytics   v6 - p1
Text mining and analytics v6 - p1
 
Running Head ANNOTATED BIBLIOGRAPHY 1ANNOTATED BIBLIOGRAP.docx
Running Head ANNOTATED BIBLIOGRAPHY 1ANNOTATED BIBLIOGRAP.docxRunning Head ANNOTATED BIBLIOGRAPHY 1ANNOTATED BIBLIOGRAP.docx
Running Head ANNOTATED BIBLIOGRAPHY 1ANNOTATED BIBLIOGRAP.docx
 

Plus de Anna Divoli

How computers understand text content - by Anna Divoli
How computers understand text content - by Anna DivoliHow computers understand text content - by Anna Divoli
How computers understand text content - by Anna Divoli
Anna Divoli
 
Ebi apr2011 usability-part
Ebi apr2011 usability-partEbi apr2011 usability-part
Ebi apr2011 usability-part
Anna Divoli
 

Plus de Anna Divoli (7)

AI for information management: why and how
AI for information management: why and howAI for information management: why and how
AI for information management: why and how
 
How computers understand text content - by Anna Divoli
How computers understand text content - by Anna DivoliHow computers understand text content - by Anna Divoli
How computers understand text content - by Anna Divoli
 
NLP Tales in Biomedicine (introductory presentation for the Auckland NLP Meet...
NLP Tales in Biomedicine (introductory presentation for the Auckland NLP Meet...NLP Tales in Biomedicine (introductory presentation for the Auckland NLP Meet...
NLP Tales in Biomedicine (introductory presentation for the Auckland NLP Meet...
 
"Findability and usability lessons learnt from text analytics" By: Anna Div...
"Findability and usability   lessons learnt from text analytics" By: Anna Div..."Findability and usability   lessons learnt from text analytics" By: Anna Div...
"Findability and usability lessons learnt from text analytics" By: Anna Div...
 
Anna Divoli (Pingar Research) "How taxonomies and facets bring end-users clos...
Anna Divoli (Pingar Research) "How taxonomies and facets bring end-users clos...Anna Divoli (Pingar Research) "How taxonomies and facets bring end-users clos...
Anna Divoli (Pingar Research) "How taxonomies and facets bring end-users clos...
 
Divoli Presentation at EBI Apr2011 Usability Part
Divoli Presentation at EBI Apr2011 Usability PartDivoli Presentation at EBI Apr2011 Usability Part
Divoli Presentation at EBI Apr2011 Usability Part
 
Ebi apr2011 usability-part
Ebi apr2011 usability-partEbi apr2011 usability-part
Ebi apr2011 usability-part
 

Dernier

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Dernier (20)

Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 

Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

  • 1. Automatic Taxonomy Generation for a News Group Anna Divoli Pingar Research @annadivoli San Francisco Apr 2013 A Case Study
  • 2. Why Automatic Generation? Dynamic Fast Cheap Consistent RDF / Flexible … Why from a Document Collection? Focused/specific Optimal for those documents … Why?
  • 3. The Team The Process News Group Case Study Evaluation Other Use Cases Summary Talk Overview San Francisco Apr 2013
  • 4. Taxonomy Generation Research Team Olena Medelyan, Steve Manion, Jeen Broekstra, Anna Divoli, Anna Lan Huang and Ian Witten Constructing a Focused Taxonomy from a Document Collection To appear in Proceedings of the Extended Semantic Web Conference 2013, ESWC, Montpellier, France
  • 6. Input: Documents stored somewhere Analysis: Using variety of tools* and datasets, extract concepts, entities, relations Grouping & Output: A taxonomy is created that groups resulting taxonomy terms hierarchically Custom Taxonomy Taxonomy Generation Overview
  • 8. Document Database Solr Concepts & Relations Database Sesame 1. Import & convert to text 2. Extract concepts 3. Annotate with Linked Data 4. Disambiguate clashing concepts 5. Consolidate taxonomy Input Docs Preferred top-level terms Focused SKOS Taxonomy Taxonomy Generation in 5 Steps!
  • 9. Input Documents Document Database 1. Convert to text Current input: • Directory path read recursively Other possible inputs: • Docs in a database or a DMS • Emails +attachments (Exchange) • Website URL • RSS feed External tool to convert different file formats to text Database to store document content Step 1. Document input & conversion
  • 10. Documents Database Concepts Database 2. Extract concepts http://localhost/solr/select?q=path:mycollectiondocument456.txt Pingar API: Taxonomy Terms: Climate and Weather Leaders Agreements People: Yvo de Boer Maite Nkoana-Mashabane Organizations: Associated Press South African Council of Churches Locations: South Africa Wikify: Wikipedia Terms: South Africa Yvo de Boer U.N. Climate agreements Associated Press Specific terminology: green policies; climate diplomacy Step 2. Extracting concepts
  • 11. Annotations Database 3. Annotate with Linked Data mycollection/document456.txt Pingar API: People: Yvo de Boer Maite Nkoana-Mashabane Organizations: Associated Press South African Council of Churches Locations: South Africa Later this additional info will help create e-Discovery & semantic search solutions Concepts Database Step 3. Annotation with meaning
  • 12. Final Concepts Database 4. Disambiguate clashing concepts wikipedia.org/wiki/Ocean wikipedia.org/wiki/Apple_Corps freebase.com/view/en/apple_inc www.fao.org/aos/agrovoc#c_4607 Over the past three years, Apple has acquired three mapping companies For millions of years, the oceans have been filled with sounds from natural sources. Two concepts were extracted, that are dissimilar Discard the incorrect one Two concepts were extracted, that are similar Accept both correct Agrovoc term: Marine areas Concepts Database Step 4. Discarding irrelevant meanings
  • 13. 5a. Add relationsConcepts & Relations Database felines tiger bird horse family zebra donkey pigeonhorselizard Category:Carnivorous animals Category:Animals animals Building the taxonomy bottom up Broader: Sqamata/Reptiles/Tetrapods/Vertebrates/Chordates/Animals Focused SKOS Taxonomy Step 5a. Group taxonomy
  • 14. Films and film making Film stars Mila Kunis Daniel Radcliffe Sally Hawkins Julianna Margulies Association football clubs Former Football League clubs Manchester United F.C. Manchester United F.C. Manchester City F.C. Finance Economics and finance Personal finance Commercial finance Tax Capital gains tax Tax Capital gains tax 5b. Prune relationsConcepts & Relations Database Focused SKOS Taxonomy Step 5b. Consolidating taxonomy
  • 15. Analysis: Using variety of tools* and datasets, extract concepts, entities, relations Custom Taxonomy Taxonomy Generation Process Input: Documents stored somewhere Output: A taxonomy is created that groups resulting taxonomy terms hierarhically * Pingar API for People, Organization, Locations & Taxonomy Terms from related taxonomies; Wikification for related Wikipedia articles and category relations; Linked Data analysis for creating links to Freebase & DBpedia File-share SharePoint Exchange Etc
  • 16. ? How Does It Look Like?
  • 17. Fairfax NZ This taxonomy was created from 2000 news articles by Fairfax New Zealand around Christmas 2011. Taxonomy Statistics Concept Count: 10158 Edges Count: 12668 Intermediate Count: 1383 Leaves Count: 8748 Labels Count: 11545 Nesting Counts 0: 27, 1: 6102, 2: 2903, 3: 2891 4: 2057, 5: 1202, 6: 745, 7: 354 8: 179, 9: 41, 10: 10 Average Depth: 2.65 Case Study: A News Group
  • 18. Case Study: A News Group
  • 19. Case Study: A News Group
  • 20. Case Study: A News Group
  • 21. Case Study: A News Group
  • 22. Case Study: A News Group
  • 23. Case Study: A News Group
  • 25. Case Study: A News Group
  • 26. Case Study: A News Group Fairfax - 4 Days from Sep 2001 Excerpt of the taxonomy generated from: Fairfax articles taken from - Sep 9th & 10th (1242 articles) and - Sep 13th & 14th (1667 articles) NZT! Colors of terms: - proposed to group other terms - found in both document collections - in 9-10 Sep 2001 docs - in 13-14 Sep 2001 docs - search match Taxonomy Statistics: Concept Count: 12699 Edges Count: 13755 Intermediate Count: 709 Leaves Count: 11985 Labels Count: 12741
  • 27. Case Study: A News Group proposed to group other terms in both document collections in 9-10 Sep 2001 docs in 13-14 Sep 2001 docs ………………………………………………………………. ……………………………………………………………….
  • 28. Case Study: A News Group proposed to group other terms in both document collections in 9-10 Sep 2001 docs in 13-14 Sep 2001 docs
  • 29. FairFax NZ - 4 Days from Sep 2001 Excerpt of the taxonomy generated from: Fairfax articles taken from - Sep 9th & 10th (1242 articles) and - Sep 13th & 14th (1667 articles) NZT! Colors of terms: - proposed to group other terms - found in both document collections - in 9-10 Sep 2001 docs - in 13-14 Sep 2001 docs - search match Taxonomy Statistics: Concept Count: 12699 Edges Count: 13755 Intermediate Count: 709 Leaves Count: 11985 Labels Count: 12741 Average Depth: 1.85 ( 0: 5 - 1: 4082 - 2: 8980 - 3: 755 4: 333 - 5: 132 - 6: 31 - 7: 6 - 8: 1 ) Including NZPSV Taxonomy Statistics: Concept Count: 13970 Edges Count: 15020 Intermediate Count: 1277 Leaves Count: 12677 Labels Count: 15407 Average Depth: 3 (0: 16 - 1: 10153 - 2: 1888 - 3: 1400 4: 1203 - 5: 1053 - 6: 756 - 7: 427 8: 252 - 9: 267 - 10: 341 - 11: 315 12: 330 - 13: 149 - 14: 134 - 15: 87 16: 10 ) Case Study: A News Group
  • 30. September 2001 Christmas 2011 Case Study: A News Group proposed to group other terms in both document collections in 9-10 Sep 2001 docs in 13-14 Sep 2001 docs
  • 31. Evaluation Sources of error in concept identification Type Number Errors Rate People 1145 37 3.2% Organizations 496 51 10.3% Locations 988 114 11.5% Wikipedia named entities 832 71 8.5% Wikipedia other entities 99 16 16.4% Taxonomy 868 229 26.4% DBPedia 868 81 8.1% Freebase 135 12 8.9% Overall 3447 393 11.4% Recall: 75% (comparing with manually generated taxonomy for the same domain) Precision: 89% for concepts 90% for relations (15 human judges based evaluation)
  • 32. Other Use Cases How to refine search by metadata?What’s in these files / emails? What to include into our corporate taxonomy? How to find all docs on a given topic? Content Audit Information Architecture Better search with facets Better browsing
  • 33. proposed to group other concepts in two or more document collections in the bipolar document collection in the breast cancer document collection in the neither cancer or bipolar doc. collection Other Use Cases: Discovery