SlideShare une entreprise Scribd logo
1  sur  25
Télécharger pour lire hors ligne
TEXT TAGGING WITH FINITE STATE
TRANSDUCERS
David Smiley
Software Systems Engineer, Lead
Text Tagging with
Finite State Transducers
David Smiley
Lucene/Solr Revolution 2013
© 2012 The MITRE Corporation. All rights reserved.
About David Smiley
 Working at MITRE, for 13 years
 web development, Java, search
 Published 1st book on Solr; then 2nd edition (2009, 2011)
 Apache Lucene / Solr committer/PMC member (2012)
 Presented at Lucene Revolution (2010) & Basis O.S. Search
Conference (2011, 2012)
 Taught Solr classes at MITRE (2010, 2011, 2012)
 Solr search consultant within MITRE and its sponsors, and
privately
3
What is “Text Tagging” and “FSTs”?
 First, I need to establish the context:
 JIEDDO’s OpenSextant project
 Though this presentation is not about OpenSextant or
geotagging
 Ultimately, I want to convey how cool Lucene’s FSTs are
 And you may have a need for a text tagger
 Or a geotagger (like OpenSextant)
OpenSextant
A DoD Funded Project: JIEDDO/COIC & NGA
Open Source approval recently obtained
OpenSextant Project
 A geotagging solution for unstructured text
 Finds place name references in natural language
 “… I live near Boston … ”
 Finds “Boston” with input character offset #s
 Often resolves to multiple gazetteer entries: “Boston” has 73
 What’s a Gazetteer?
 A dictionary of place names with metadata like latitude &
longitude
How does it work?
The “Naïve” Tagger
 AKA “Text Tagger”
 Simply consults a dictionary/gazetteer; no fancy NLP
 There’s nothing geospatial about it
 Subsequent NLP processing eliminates low-confidence tags
 Actually, not so simple
 Names vary in word length
 Must find overlapping names
 but not names within names
The Gazetteer
 13 million place name records
 8.1M distinct place names
 Why not 13M?
 Ambiguous names (e.g. San Diego)
 Text analysis normalization (e.g. diacritic removal, etc.)
 2.8M are single-word names (1/3rd)
 2.3 avg. words / name
 14 avg. chars / name
3 Naïve Tagger Implementations
 GATE’s Tagger
 In-memory Aho-Corasick string-matching algorithm
 Requires an estimated 80 GB RAM !! (for our data)
 FAST
 A JIEDDO developed MySQL based Tagger
 “Reasonable” RAM requirements ~4GB
 SLOW (~15x, 20x? not certain). ~1 doc/second
 A JIEDDO developed Solr/FST based Tagger …
Finite State Transducers
Applied to text tagging
Finite State Automata (FSA)
 SortedSet<char[]>:
 mop, moth, pop, slop, sloth, stop, top
Note: a “Trie” data structure is similar but only shares prefixes
Finite State Transducer (FST)
 Adds optional output to each arc
 SortedMap<char[],int>
 mop: 0, moth: 1, pop: 2, slop: 3, sloth: 4, stop: 5, top: 6
Lucene’s FST Implementation
 FST encoded as a byte[]
 Memory efficient! And fast to load from disk.
 Write-once API (immutable)
 Build minimal, acyclic FST from pre-sorted inputs
 Fast (linear time with input size), low memory
 Optional two-pass packing can shrink by ~25%
 SortedMap<int[],T>: arcs are sorted by label
 getByOutput also possible if outputs are sorted
 http://s.apache.org/LuceneFSTs
Based on a
Mihov & Maurel
paper, 2001
FSTs and Text Tagging
 My approach involves two layers of FSTs:
 A word dictionary FST to hold each unique word
 Enables using integers as substitutes for char[]
 Via getByOutput(12345) -> “New”
 Ex: “New” -> 12345, “York” -> 5522111, “City” -> 345
 A word phrase FST comprised of word id string keys
 Ex: “New York City” -> [12345, 5522111, 345]
 Value are arrays of gazetteer primary keys
Memory Use
 Word Dict FST:
 3.3M words with ordinal ids in 26MB of RAM
 Name Phrase FST:
 8.1M word id phrases in 90 MB of RAM
 Plus 82MB of arrays of gazetteer primary key ids
 Total: 198 MB (compare to 80GB GATE Aho-Corasick)
 Building it consumes ~1.5GB Java heap, for 2 minutes
Experimental measurements
 Single FST Experiment
 1 FST of analyzed character word phrase -> int id
 “new york city” -> 6344207
 Theory: more than 2x the memory
 Result: 69 MB! (compare to 26+90) 41% reduction
 Retrospective: What I would have done differently
 Index a field of concatenated terms (custom TokenFilter).
 More disk needed but reduces build time & memory
requirements. Unclear effect on tagging performance.
 Potential to use MemoryPostingsFormat, a Lucene Codec that
uses an FST internally + vInt doc ids, instead of custom FST code.
Tagging Algorithm
It’s complicated! Single-pass (streaming) algorithm
 For each input word, lookup its ordinal id, then:
1. Create an FST arc iterator for name phrase
2. Append the iterator onto a queue of active ones
3. Try to advance all iterators
 Remove those that don’t advance
Iterator linked-list queue:
Head: New, York, City ✔
Head+1: York, City
Head+2: City …
Speed Benchmarks
Docs/Sec RAM (GB)
OpenSextant: GATE Tagger ? 80
OpenSextant: MySQL based Tagger 1.1 4
OpenSextant: Solr/FST Tagger 15.9 2*
Measures single-threaded performance of geotagging 428
documents in the “ACE” collection. OpenSextant tests all had
the same gazetteer.
Integrated with Solr
 As a custom Solr Request Handler
 Builds the FSTs from the index (the gazetteer)
 Configurable
 Text analysis (e.g. phonetic)
 Exclude gazetteer docs by configured query
 Optional partial word phrase matching
 Optional sub-tags tagging
 Solr integration benefits
 Solr as a taxonomy manager! Web-service, searchable,
scalable, easy to update, …
~$ curl -XPOST 'http://localhost:8983/solr/tag
?fl=*&wt=json&indent=2' -H 'Content-Type:text/plain' -d "I live near Boston"
{
"responseHeader":{
"status":0,
"QTime":1898},
"tagsCount":1,
"tags":[[
"startOffset",12,
"endOffset",18,
"ids",[1190927,
1099063,
2562742,
2667203,
2684629,
2695904,
2653982,
2657690,
2585165,
2597292,
…
… 11890986,
11891415]]],
"matchingDocs":{"numFound":73,"start":0,"docs":[
{
"id":12719030,
"place_id":"USGS1893700",
"name":"Boston",
"lat":65.01667,
"lon":-163.28333,
"feat_class":"L",
"feat_code":"AREA",
"FIPS_cc":"US",
"ISO_cc":["US"],
"cc":"US",
"ISO3_cc":"USA",
"adm1":"US02",
"adm2":"US02.0180",
"name_bias":0.0,
"id_bias":0.04,
"geo":"65.01667,-163.28333"},
…
Where can you get this?
 https://github.com/openSextant/SolrTextTagger
 An independent module of OpenSextant
 Might seek incubator status at http://www.osgeo.org
 Includes documentation, tests
Concluding Remarks
 Lucene FSTs are awesome!
 Great for storing large amounts of strings in-memory
 Or other string-like data: e.g. IP addresses, geohashes
 The API is hard to use, however
 The text tagger should be useful independent of
OpenSextant
 Tag people/org names or special keywords
 Might be ported to Lucene as an alternative to its synonym
token filter
 I’ve got an idea on applying these concepts to Lucene
“Shingling” as a codec to make it more scalable
CONFERENCE PARTY
The Tipsy Crow: 770 5th Ave
Starts after Stump The Chump
Your conference badge gets
you in the door
TOMORROW
Breakfast starts at 7:30
Keynotes start at 8:30
CONTACT
David Smiley
dsmiley@mitre.org

Contenu connexe

Tendances

Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
Chandler Huang
 
Hug Hbase Presentation.
Hug Hbase Presentation.Hug Hbase Presentation.
Hug Hbase Presentation.
Jack Levin
 
Data Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet EncryptionData Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet Encryption
Databricks
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
enissoz
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
Simplilearn
 

Tendances (20)

Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
ClassLoader Leaks
ClassLoader LeaksClassLoader Leaks
ClassLoader Leaks
 
Big Data Security in Apache Projects by Gidon Gershinsky
Big Data Security in Apache Projects by Gidon GershinskyBig Data Security in Apache Projects by Gidon Gershinsky
Big Data Security in Apache Projects by Gidon Gershinsky
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
Hug Hbase Presentation.
Hug Hbase Presentation.Hug Hbase Presentation.
Hug Hbase Presentation.
 
Data Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet EncryptionData Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet Encryption
 
MongodB Internals
MongodB InternalsMongodB Internals
MongodB Internals
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
An Overview of Apache Cassandra
An Overview of Apache CassandraAn Overview of Apache Cassandra
An Overview of Apache Cassandra
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
 
MongoDB WiredTiger Internals: Journey To Transactions
MongoDB WiredTiger Internals: Journey To TransactionsMongoDB WiredTiger Internals: Journey To Transactions
MongoDB WiredTiger Internals: Journey To Transactions
 
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
 
Elasticsearch for beginners
Elasticsearch for beginnersElasticsearch for beginners
Elasticsearch for beginners
 
Deep Dive Into Elasticsearch
Deep Dive Into ElasticsearchDeep Dive Into Elasticsearch
Deep Dive Into Elasticsearch
 
Tracking Huge Files with Git LFS
Tracking Huge Files with Git LFSTracking Huge Files with Git LFS
Tracking Huge Files with Git LFS
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
 
Edge architecture ieee international conference on cloud engineering
Edge architecture   ieee international conference on cloud engineeringEdge architecture   ieee international conference on cloud engineering
Edge architecture ieee international conference on cloud engineering
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
 

En vedette

Class 5, adlt 671 developmental theorists
Class 5, adlt 671 developmental theoristsClass 5, adlt 671 developmental theorists
Class 5, adlt 671 developmental theorists
tjcarter
 
Class 5 adult development theories___longer_version
Class 5 adult development theories___longer_versionClass 5 adult development theories___longer_version
Class 5 adult development theories___longer_version
tjcarter
 
Current state and future state using VE
Current state and future state using VECurrent state and future state using VE
Current state and future state using VE
Charles Palus
 
VE plus graphic facilitation for currrent / future states
VE plus graphic facilitation for currrent / future statesVE plus graphic facilitation for currrent / future states
VE plus graphic facilitation for currrent / future states
Charles Palus
 

En vedette (10)

Adult Development
Adult DevelopmentAdult Development
Adult Development
 
Class 5, adlt 671 developmental theorists
Class 5, adlt 671 developmental theoristsClass 5, adlt 671 developmental theorists
Class 5, adlt 671 developmental theorists
 
Class 5 adult development theories___longer_version
Class 5 adult development theories___longer_versionClass 5 adult development theories___longer_version
Class 5 adult development theories___longer_version
 
Adult Development
Adult Development Adult Development
Adult Development
 
Adult development theory
Adult development theoryAdult development theory
Adult development theory
 
Automata Invasion
Automata InvasionAutomata Invasion
Automata Invasion
 
類義語検索と類義語ハイライト
類義語検索と類義語ハイライト類義語検索と類義語ハイライト
類義語検索と類義語ハイライト
 
Current state and future state using VE
Current state and future state using VECurrent state and future state using VE
Current state and future state using VE
 
VE plus graphic facilitation for currrent / future states
VE plus graphic facilitation for currrent / future statesVE plus graphic facilitation for currrent / future states
VE plus graphic facilitation for currrent / future states
 
HMC Conference 2011 Scotland
HMC Conference 2011 ScotlandHMC Conference 2011 Scotland
HMC Conference 2011 Scotland
 

Similaire à Text tagging with finite state transducers

Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)
Ravi Okade
 

Similaire à Text tagging with finite state transducers (20)

Mark Logic StrangeLoop 2010
Mark Logic StrangeLoop 2010Mark Logic StrangeLoop 2010
Mark Logic StrangeLoop 2010
 
About "Apache Cassandra"
About "Apache Cassandra"About "Apache Cassandra"
About "Apache Cassandra"
 
Basics of XML
Basics of XMLBasics of XML
Basics of XML
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to Elasticsearch
 
MongoDB at the Silicon Valley iPhone and iPad Developers' Meetup
MongoDB at the Silicon Valley iPhone and iPad Developers' MeetupMongoDB at the Silicon Valley iPhone and iPad Developers' Meetup
MongoDB at the Silicon Valley iPhone and iPad Developers' Meetup
 
Processing XML with Java
Processing XML with JavaProcessing XML with Java
Processing XML with Java
 
Xml processing-by-asfak
Xml processing-by-asfakXml processing-by-asfak
Xml processing-by-asfak
 
eXtensible Markup Language (XML)
eXtensible Markup Language (XML)eXtensible Markup Language (XML)
eXtensible Markup Language (XML)
 
Why databases cry at night
Why databases cry at nightWhy databases cry at night
Why databases cry at night
 
Open source Technology
Open source TechnologyOpen source Technology
Open source Technology
 
Hatkit Project - Datafiddler
Hatkit Project - DatafiddlerHatkit Project - Datafiddler
Hatkit Project - Datafiddler
 
Clojure talk at Münster JUG
Clojure talk at Münster JUGClojure talk at Münster JUG
Clojure talk at Münster JUG
 
User-space Network Processing
User-space Network ProcessingUser-space Network Processing
User-space Network Processing
 
Dictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit PalDictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit Pal
 
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLPDictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
MongoDB Auto-Sharding at Mongo Seattle
MongoDB Auto-Sharding at Mongo SeattleMongoDB Auto-Sharding at Mongo Seattle
MongoDB Auto-Sharding at Mongo Seattle
 
MongoDB @ fliptop
MongoDB @ fliptopMongoDB @ fliptop
MongoDB @ fliptop
 
Component Framework Primer for JSF Users
Component Framework Primer for JSF UsersComponent Framework Primer for JSF Users
Component Framework Primer for JSF Users
 
Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)
 

Plus de lucenerevolution

Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
lucenerevolution
 

Plus de lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Dernier

Dernier (20)

COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxCOMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 

Text tagging with finite state transducers

  • 1. TEXT TAGGING WITH FINITE STATE TRANSDUCERS David Smiley Software Systems Engineer, Lead
  • 2. Text Tagging with Finite State Transducers David Smiley Lucene/Solr Revolution 2013 © 2012 The MITRE Corporation. All rights reserved.
  • 3. About David Smiley  Working at MITRE, for 13 years  web development, Java, search  Published 1st book on Solr; then 2nd edition (2009, 2011)  Apache Lucene / Solr committer/PMC member (2012)  Presented at Lucene Revolution (2010) & Basis O.S. Search Conference (2011, 2012)  Taught Solr classes at MITRE (2010, 2011, 2012)  Solr search consultant within MITRE and its sponsors, and privately 3
  • 4. What is “Text Tagging” and “FSTs”?  First, I need to establish the context:  JIEDDO’s OpenSextant project  Though this presentation is not about OpenSextant or geotagging  Ultimately, I want to convey how cool Lucene’s FSTs are  And you may have a need for a text tagger  Or a geotagger (like OpenSextant)
  • 5. OpenSextant A DoD Funded Project: JIEDDO/COIC & NGA Open Source approval recently obtained
  • 6. OpenSextant Project  A geotagging solution for unstructured text  Finds place name references in natural language  “… I live near Boston … ”  Finds “Boston” with input character offset #s  Often resolves to multiple gazetteer entries: “Boston” has 73  What’s a Gazetteer?  A dictionary of place names with metadata like latitude & longitude
  • 7.
  • 8. How does it work?
  • 9. The “Naïve” Tagger  AKA “Text Tagger”  Simply consults a dictionary/gazetteer; no fancy NLP  There’s nothing geospatial about it  Subsequent NLP processing eliminates low-confidence tags  Actually, not so simple  Names vary in word length  Must find overlapping names  but not names within names
  • 10. The Gazetteer  13 million place name records  8.1M distinct place names  Why not 13M?  Ambiguous names (e.g. San Diego)  Text analysis normalization (e.g. diacritic removal, etc.)  2.8M are single-word names (1/3rd)  2.3 avg. words / name  14 avg. chars / name
  • 11. 3 Naïve Tagger Implementations  GATE’s Tagger  In-memory Aho-Corasick string-matching algorithm  Requires an estimated 80 GB RAM !! (for our data)  FAST  A JIEDDO developed MySQL based Tagger  “Reasonable” RAM requirements ~4GB  SLOW (~15x, 20x? not certain). ~1 doc/second  A JIEDDO developed Solr/FST based Tagger …
  • 13. Finite State Automata (FSA)  SortedSet<char[]>:  mop, moth, pop, slop, sloth, stop, top Note: a “Trie” data structure is similar but only shares prefixes
  • 14. Finite State Transducer (FST)  Adds optional output to each arc  SortedMap<char[],int>  mop: 0, moth: 1, pop: 2, slop: 3, sloth: 4, stop: 5, top: 6
  • 15. Lucene’s FST Implementation  FST encoded as a byte[]  Memory efficient! And fast to load from disk.  Write-once API (immutable)  Build minimal, acyclic FST from pre-sorted inputs  Fast (linear time with input size), low memory  Optional two-pass packing can shrink by ~25%  SortedMap<int[],T>: arcs are sorted by label  getByOutput also possible if outputs are sorted  http://s.apache.org/LuceneFSTs Based on a Mihov & Maurel paper, 2001
  • 16. FSTs and Text Tagging  My approach involves two layers of FSTs:  A word dictionary FST to hold each unique word  Enables using integers as substitutes for char[]  Via getByOutput(12345) -> “New”  Ex: “New” -> 12345, “York” -> 5522111, “City” -> 345  A word phrase FST comprised of word id string keys  Ex: “New York City” -> [12345, 5522111, 345]  Value are arrays of gazetteer primary keys
  • 17. Memory Use  Word Dict FST:  3.3M words with ordinal ids in 26MB of RAM  Name Phrase FST:  8.1M word id phrases in 90 MB of RAM  Plus 82MB of arrays of gazetteer primary key ids  Total: 198 MB (compare to 80GB GATE Aho-Corasick)  Building it consumes ~1.5GB Java heap, for 2 minutes
  • 18. Experimental measurements  Single FST Experiment  1 FST of analyzed character word phrase -> int id  “new york city” -> 6344207  Theory: more than 2x the memory  Result: 69 MB! (compare to 26+90) 41% reduction  Retrospective: What I would have done differently  Index a field of concatenated terms (custom TokenFilter).  More disk needed but reduces build time & memory requirements. Unclear effect on tagging performance.  Potential to use MemoryPostingsFormat, a Lucene Codec that uses an FST internally + vInt doc ids, instead of custom FST code.
  • 19. Tagging Algorithm It’s complicated! Single-pass (streaming) algorithm  For each input word, lookup its ordinal id, then: 1. Create an FST arc iterator for name phrase 2. Append the iterator onto a queue of active ones 3. Try to advance all iterators  Remove those that don’t advance Iterator linked-list queue: Head: New, York, City ✔ Head+1: York, City Head+2: City …
  • 20. Speed Benchmarks Docs/Sec RAM (GB) OpenSextant: GATE Tagger ? 80 OpenSextant: MySQL based Tagger 1.1 4 OpenSextant: Solr/FST Tagger 15.9 2* Measures single-threaded performance of geotagging 428 documents in the “ACE” collection. OpenSextant tests all had the same gazetteer.
  • 21. Integrated with Solr  As a custom Solr Request Handler  Builds the FSTs from the index (the gazetteer)  Configurable  Text analysis (e.g. phonetic)  Exclude gazetteer docs by configured query  Optional partial word phrase matching  Optional sub-tags tagging  Solr integration benefits  Solr as a taxonomy manager! Web-service, searchable, scalable, easy to update, …
  • 22. ~$ curl -XPOST 'http://localhost:8983/solr/tag ?fl=*&wt=json&indent=2' -H 'Content-Type:text/plain' -d "I live near Boston" { "responseHeader":{ "status":0, "QTime":1898}, "tagsCount":1, "tags":[[ "startOffset",12, "endOffset",18, "ids",[1190927, 1099063, 2562742, 2667203, 2684629, 2695904, 2653982, 2657690, 2585165, 2597292, … … 11890986, 11891415]]], "matchingDocs":{"numFound":73,"start":0,"docs":[ { "id":12719030, "place_id":"USGS1893700", "name":"Boston", "lat":65.01667, "lon":-163.28333, "feat_class":"L", "feat_code":"AREA", "FIPS_cc":"US", "ISO_cc":["US"], "cc":"US", "ISO3_cc":"USA", "adm1":"US02", "adm2":"US02.0180", "name_bias":0.0, "id_bias":0.04, "geo":"65.01667,-163.28333"}, …
  • 23. Where can you get this?  https://github.com/openSextant/SolrTextTagger  An independent module of OpenSextant  Might seek incubator status at http://www.osgeo.org  Includes documentation, tests
  • 24. Concluding Remarks  Lucene FSTs are awesome!  Great for storing large amounts of strings in-memory  Or other string-like data: e.g. IP addresses, geohashes  The API is hard to use, however  The text tagger should be useful independent of OpenSextant  Tag people/org names or special keywords  Might be ported to Lucene as an alternative to its synonym token filter  I’ve got an idea on applying these concepts to Lucene “Shingling” as a codec to make it more scalable
  • 25. CONFERENCE PARTY The Tipsy Crow: 770 5th Ave Starts after Stump The Chump Your conference badge gets you in the door TOMORROW Breakfast starts at 7:30 Keynotes start at 8:30 CONTACT David Smiley dsmiley@mitre.org