Language Computer Corporation provides customizable natural language processing software solutions. It has developed several information extraction systems over the years, including named entity recognition, relationship extraction, event extraction, question answering, and summarization. Its most recent system, CiceroCustom, allows users to easily create custom extractors for any type of entity, attribute, or event within minutes using an active learning approach. This provides more flexibility than traditional extraction methods.
Language Computer Corporation: Text Extraction Profile
1. Language Computer Corporation:
Knowledge Supremacy through
Customizable Text Extraction Products
Andrew Hickl, CEO / President
Language Computer Corporation
December 2008
2. Language Computer Corporation (LCC)
• “Boutique” provider of next‐generation natural language processing software solutions for Government and commercial customers
• Founded 1995
• Based in Richardson, Texas
• 25 developers and researchers
• Strong track record: top marks at more than 20 different Government evaluations since 1999
  – Question Answering (TREC, 1999‐2008)
  – Summarization (DUC, 2003‐2008)
  – Information Extraction (ACE, 2005‐2006)
  – Textual Inference (RTE, 2006‐2008)
3. A Brief History of LCC
• 1996‐2004: Closed‐Domain Information Extraction
  – MUC / Tipster (precursors to ACE)
  – Grammar‐ or rule‐based systems
  – Entity Extraction (100+ types, English)
  – Relationship Extraction (50+ types, English)
  – Event Extraction (5‐8 types, English)
• 1999 – : First Automatic Question‐Answering Systems
  – TREC Question Answering evaluations
  – Factoid: What is Britney Spears’s middle name?
  – Complex: What impact did Hurricane Gustav have on the Dallas economy?
  – Yes/No: Did Lindsay Lohan’s album reach #1?
  – How‐To: How do I file an extension on my 2008 Federal Income Taxes?
  – Why: Why did John McCain name Sarah Palin as his running mate?
4. A Brief History of LCC
• 2002 – : Wide‐Coverage Entity Extraction System
  – Used a maximum entropy‐based framework to categorize more than 350 different name categories in text
    • English: 368 types
    • French, Spanish, German, Dutch, Russian, Japanese: ~100 types
    • Arabic, Chinese, Farsi, Korean: ~50 types
  – Dependent on sources of training data
• 2004 – : First Open‐Domain, Customizable Event Extraction System
  – Used active learning to leverage feedback gathered from a user
  – Allows users to define event extractors for any event of interest
  – Deployed for multiple languages: English, Arabic, Chinese, Korean, Farsi
  – Completely ontology‐independent
5. A Brief History of LCC
• 2007 – : First Customizable Information Extraction Systems
  – Allows users to define extractors for any entity, attribute, relationship, or sentiment / attitude expressed in text
  – Used active learning to leverage feedback gathered from a user
  – Leverages automatic candidate generation techniques to find new instances for extractor training
  – Deployed for multiple languages: English, Arabic, Chinese
  – Completely ontology‐independent
• 2007 – : Truly Domain‐Independent Extraction
  – Allows extractors to maintain high levels of performance, regardless of training or testing domain
  – Reduces “overfitting” to a particular domain
  – Reduces “tag spam”: overtagging of certain (frequent) categories in out‐of‐domain documents
6. A Brief History of LCC
• 2008 – : First Automatic Dossier / Infobox Generation System
  – Learns what attributes and relationships are inherently relevant for an entity from information stored in unstructured text
  – Generates either Wikipedia‐style infoboxes or prose descriptions (a.k.a. “dossiers”) for each entity
  – Capable of analogizing from existing structured data resources or learning from feedback provided by users
• 2008 – : Robust Textual Inference for NLP Applications
  – Deployed state‐of‐the‐art system for recognizing textual entailment to validate content stored in large databases
  – Developed temporal inference systems capable of accurately timestamping events mentioned in text / message traffic
7. Our Mission
Provide customers with the knowledge supremacy necessary to support analytic operations in any domain.
Make it easy (and cost‐effective) to unlock knowledge from collections of unstructured text in any language or domain.
Develop “game changing” search and discovery tools which turn knowledge into value.
Build the premier information extraction brand.
8. Key Delineators
• Scalable.
  – LCC’s entity, relationship, attribute, and event extraction tools provide access to more types of information than any other provider.
• Customizable.
  – LCC’s customization framework allows content providers to add value to existing repositories quickly and cheaply.
• Flexible.
  – LCC’s learning‐based extraction tools won’t degrade when run on “new” types of documents.
• Deployable.
  – LCC offers distributable and parallelizable components which can be run in any environment, big or small.
• Integrate‐able.
  – LCC’s products are designed to interoperate with a customer’s existing text and knowledge management tools.
• Reliable.
  – 10+ years of excellence in providing USG customers with high‐tech NLP solutions that just work.
9. How do you achieve knowledge supremacy?
• Wide Coverage (enough for most applications)
• Customizable (in minutes, or less)
• Trainable (by application builders or end‐users)
• Domain Portable (with next to no human intervention)
• Fast (enough to index TBs of text)
• Manageable (demonstrated value‐add)
Challenge: Is it possible to build an extraction system which can learn hundreds of types?
10. Solving (part of) the Coverage Problem: CiceroLite
• LCC’s wide‐coverage named entity recognizer, CiceroLite, categorizes 8 high‐frequency NE classes with over 90% precision and recall.
• But it’s capable of much more: the English language version of CiceroLite can also categorize 368 different NE classes, including:
11. How do you achieve knowledge supremacy?
• Wide Coverage (enough for most applications)
• Customizable (in minutes, or less)
• Trainable (by application builders or end‐users)
• Domain Portable (with next to no human intervention)
• Fast (enough to index TBs of text)
• Manageable (demonstrated value‐add)
Challenge: Is it possible to build an extraction system which can allow users to create new extractors?
12. Introducing… CiceroCustom
• CiceroCustom can be used to extract nearly any type of entity, attribute, relationship, or event information from text without the need for hand‐crafted rules or pre‐specified extraction templates.
• Three steps to customized information extraction:
  – Step 1. Use CiceroCustom to define a customized extractor which specifies the type of information that a user is most interested in.
  – Step 2. Use the CiceroCustom GUI to “train” each extractor:
    • Mark instances as “relevant” or “irrelevant”
    • Correct annotations supplied by CiceroCustom
    • Accurate results seen after < 15 minutes of training
  – Step 3. Use extractors to extract information from new texts
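The “mark instances as relevant or irrelevant” loop in Step 2 can be illustrated with a generic active learning sketch. This is not CiceroCustom’s algorithm, which the slides do not describe; it shows plain uncertainty sampling over a toy bag‐of‐words scorer. The names `TinyExtractorModel`, `most_uncertain`, `train_interactively`, and the `ask_user` callback are all hypothetical and stand in for the real extractor and GUI.

```python
import math
from collections import Counter


def featurize(text):
    """Bag-of-words features: lowercase token -> count."""
    return Counter(text.lower().split())


class TinyExtractorModel:
    """Hypothetical stand-in for an extractor: a logistic bag-of-words scorer."""

    def __init__(self):
        self.weights = Counter()

    def score(self, text):
        """Estimated probability that `text` is a relevant instance."""
        z = sum(self.weights[w] * c for w, c in featurize(text).items())
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, text, relevant, learning_rate=0.5):
        """One gradient step toward the user's relevant/irrelevant label."""
        error = (1.0 if relevant else 0.0) - self.score(text)
        for w, c in featurize(text).items():
            self.weights[w] += learning_rate * error * c


def most_uncertain(model, pool):
    """Uncertainty sampling: pick the candidate scored closest to 0.5."""
    return min(pool, key=lambda text: abs(model.score(text) - 0.5))


def train_interactively(model, pool, ask_user, rounds):
    """Repeatedly show the least certain candidate to the user and learn from it."""
    remaining = list(pool)  # copy so the caller's pool is left untouched
    for _ in range(min(rounds, len(remaining))):
        candidate = most_uncertain(model, remaining)
        remaining.remove(candidate)
        label = ask_user(candidate)  # True = "relevant", False = "irrelevant"
        for _ in range(5):  # a few passes so each label has a visible effect
            model.update(candidate, label)
    return model
```

In this sketch the user is asked only about the candidates the model is least sure of, which is the usual reason active learning needs fewer labels than annotating a corpus up front.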
13. Traditional Text Extraction vs. CiceroCustom
Criterion | Traditional Extraction | CiceroCustom
Ontology Required? | Fixed set of templates | User‐defined templates
Techniques used? | Heuristics / Classifiers | Active Learning
Information considered? | Limited to information found in a single sentence | Inter‐ and intra‐sentential extraction
Access to discourse information? | N/A | Automatic Discourse Parsing
Domain portability? | Domain‐Dependent | Domain‐Independent
Applicable to new genres? | Performance degrades when applied to new genres | Robust performance across document genres
Representation of information? | Fixed, Immutable | Dynamically created
Discovery of new, essential information? | User | Automatic
Coreference? | User | Automatic
Level of expertise required? | Extraction Experts | Any End User
Time to create extractors? | Days, Weeks | Minutes, Hours
14. CiceroCustom: Innovations
• First open‐domain extraction system that can be customized in minutes
  – Active learning‐based framework makes it possible for novices to train high‐performance extractors in under an hour
  – Extractors can be refined / split / fused as needs change
• State‐of‐the‐art inference‐based instance fusion
  – State‐of‐the‐art temporal, spatial, and textual inference components make it possible to fuse partial representations into coherent instances that can be used operationally
• Automatic Discovery of Essential Information Related to Candidates
  – Rich semantic substrate helps extraction models identify all of the information needed for extraction
• First Extraction System to Leverage Multiple Semantic Parsers
  – Combines dependency information from PropBank, NomBank, and FrameNet to automatically create semantic representations for entities, attributes, relationships, or events of interest
  – First work leveraging semantic parsing for extraction was done at LCC (Surdeanu et al. 2003)
• State‐of‐the‐Art Discourse Parsing
  – Identification of relations between sentences or events provides for greater recall of extractors
  – Extraction can go beyond a single sentence
16. What does it mean to be “domain portable”?
• Performance of most learning‐based extraction systems (entity, event, etc.) suffers when trained and tested on different types of documents
  – Most IE systems suffer degradation of more than 30% when ported to new domains (e.g. newswire to message traffic)
• LCC is pioneering new unsupervised and lightly‐supervised approaches to reduce the amount of degradation observed when testing on out‐of‐domain documents
With ~15 minutes of input from a user, LCC reduces extractor error by an average of 25%.
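To make the 25% figure concrete, here is a small worked sketch. It assumes the reduction is relative to the extractor’s baseline error rate; the slide does not say whether the figure is relative or absolute. The function name `error_after_feedback` and the baseline values are invented for illustration and are not LCC results.

```python
def error_after_feedback(baseline_error, relative_reduction=0.25):
    """Return the error rate left after a relative error reduction.

    Assumption: the 25% on the slide is a *relative* reduction of the
    baseline error rate; the slide itself does not specify this.
    """
    if not 0.0 <= baseline_error <= 1.0:
        raise ValueError("baseline_error must be a rate between 0 and 1")
    return baseline_error * (1.0 - relative_reduction)


# Illustrative baselines only; none of these numbers come from LCC.
for baseline in (0.20, 0.40):
    remaining = error_after_feedback(baseline)
    print(f"baseline error {baseline:.0%} -> {remaining:.0%} after feedback")
```

Under this reading, an extractor that is wrong on 40 of every 100 out‐of‐domain instances would be wrong on about 30 after the user’s input.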
20. LCC Text Processing Cycle
[Figure: cycle diagram. Stage labels: Info Need, Data Collection, Processing, Analysis, Situational Awareness, Analytic Output.]
• Access and output: Question Answering, Semantic Search, Keyword Expansion, Open APIs, Web Services, Java RMI
• Analysis: Predictive Analysis, Socio-Cultural Analysis, Dossier Generation
• Inference: Geocoding, Spatial Inference, Timestamping, Temporal Inference
• Processing: Data Ingestion & Indexing, Named Entity Recognition, Information Extraction, Coreference Resolution
21. Dossier Generation (2009)
• Need for tools which can automatically assemble high‐quality knowledge resources from information extracted from text
• LCC is developing an integrated, unsupervised Dossier Generation capability which can assemble relevant entity profiles (either as unstructured text or Intellipedia‐style structured infoboxes)
  – Hundreds of Entity, Relation, Attribute, and Event types
  – Implicit Relations from Data Mining Systems
  – Normalized Dates / Times / Locations
  – Learning‐based relevance detection algorithms capable of learning what’s relevant for each individual or category of individuals
22. Database Validation (2009)
[Figure: database validation diagram. Stage labels: Information Retrieval, Commitment Extraction, Content Validation.]
Example commitments shown in the figure:
  – The attack took place in the morning.
  – The attack killed 2 caretakers.
  – The attack damaged 50 cars.
  – The attack damaged 20 buildings.
  – The mosque was in Mariengasse.
  – Anas Shakfeh said the attack was a protest by rightist circles against the Islam conference.
23. Knowledge Acquisition for Link Analysis (2009)
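[Figure: knowledge acquisition pipeline for link analysis.]
• Extraction inputs: Entity Extraction, Relationship Extraction, Event Extraction, Untyped Dependency Extraction
• Other labels in the figure: Semantic Triples, Model Feedback, Weights, Entailment Validation, Graph Pruning, Graph Population, Candidate Relations, Graph Edges, Inference, Enrichment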
24. LCC Services
• Custom End‐to‐End Application Development
• Custom Component Development
• Corporate R&D
• Production Services
• Data Verification Services
• Support and Maintenance
25. Who is LCC’s customer base?
• Target Markets
  – Government, Intelligence, and Defense
  – Commodity Search Providers
  – Company, Credit, and Financial Information
  – News and Trade Publishers
  – General Aggregators and Distributors
  – Pharma
• Emerging Markets
  – Legal
  – CRM
  – Supply Chain Management
  – Business Intelligence Providers
  – Healthcare
26. Who are LCC’s partners?
• Strategic Partners
  – Application Developers (with complementary S&D interests)
  – Visualization Developers
  – Commodity Search Providers
  – Mobile App Developers
• Technology Partners
  – Extraction Providers
  – Data Mining Providers
  – Database Providers
  – Inference Providers
• Integration Partners
  – Large Government integrators with access to customers, systems of record
  – Large software vendors with interest in extraction technology
• Channel Partners
  – Content Providers
    • News
    • Education
    • Financial
    • Business Intelligence
27. CiceroLite
• High‐performance named entity recognition for multiple languages
• Languages:
  – English (3/2009: > 1000 types)
  – Spanish, French, Dutch, German, Russian, Japanese (~100 types)
  – Arabic, Chinese, Farsi, Korean (~50 types)
• Available as server or standalone application
28. PinPoint
• Geocoding of more than 10M place names
  – Absolute Expressions
  – Relative Expressions
  – Street Addresses
  – Latitude / Longitude or MGRS
• Timestamping for events and event‐denoting nominals
  – Absolute Expressions
  – Relative Expressions
  – Duration Estimation
• Available as a server app only
29. CiceroCustom
• Open‐domain, customizable Entity, Attribute, Relationship, and Event Extraction
• Foreign language support:
  – Arabic, Chinese
• Available as a server or standalone application
30. IndexManager
• Distributable annotation and indexing that’s compatible with all of LCC’s products
• Can index annotations from multiple providers into single open‐standard index format
• Document formats supported: .xml, .html, .pdf, .doc, .ppt, .txt, e‐mail, etc.
• Available as a server or a desktop application
31. Sentiment Tracking
• Identifies sentiment, opinions, and other subjective attitudes held by individuals towards any of a set of “target” products or issues.
• Only available for English
• Only available as a server app
• Can be run with LCC’s indexes or any standard Apache Lucene index.
32. Ferret
• State‐of‐the‐art question answering for factoid, list, and complex questions
• Language Support:
  – English, Arabic, Chinese, Farsi, Korean, Turkish, Spanish, French, Dutch, German, Japanese
• Available as a server or standalone application
• Can be run with LCC’s indexes or any standard Apache Lucene index.
33. GistTexter
• Summarization for document clusters or search results
• Language Support:
  – English, Arabic, Chinese, Farsi, and Korean
• Available as a server or standalone application
• Can be run with LCC’s indexes or any standard Apache Lucene index.
34. For More Information
• For more information, contact us:
  – Andrew Hickl, CEO/President
    andy@languagecomputer.com
    tel: (972) 231‐0052, Extension 114
    cell: (858) 366‐8424
• Websites:
  – Corporate: http://www.languagecomputer.com
  – Labs: http://labs.languagecomputer.com
  – Online Demos: http://www.getferret.com