Language Computer Corporation: Text Extraction Profile

Language Computer Corporation:
Knowledge Supremacy through
Customizable Text Extraction Products

Andrew Hickl, CEO / President
Language Computer Corporation
December 2008

Language Computer Corporation (LCC)

• “Boutique” provider of next‐generation natural language processing
software solutions for Government and commercial customers
• Founded 1995
• Based in Richardson, Texas
• 25 developers and researchers
• Strong track record: top marks at more than 20 different Government
evaluations since 1999
– Question Answering (TREC, 1999‐2008)
– Summarization (DUC, 2003‐2008)
– Information Extraction (ACE, 2005‐2006)
– Textual Inference (RTE, 2006‐2008)

A Brief History of LCC

• 1996‐2004:  Closed‐Domain Information Extraction
– MUC / Tipster (precursors to ACE)
– Grammar‐ or rule‐based systems
– Entity Extraction (100+ types, English)
– Relationship Extraction (50+ types, English)
– Event Extraction (5‐8 types, English)

• 1999 – : First Automatic Question‐Answering Systems
– TREC Question Answering evaluations
– Factoid:  What is Britney Spears’s middle name?
– Complex:  What impact did Hurricane Gustav have on the Dallas economy?
– Yes/No:  Did Lindsay Lohan’s album reach #1?
– How‐To:  How do I file an extension on my 2008 Federal Income Taxes?
– Why:  Why did John McCain name Sarah Palin as his running mate?


• 2002 – : Wide‐Coverage Entity Extraction System
– Used a maximum entropy‐based framework to sort names in text into more than
350 different categories
• English:  368 types
• French, Spanish, German, Dutch, Russian, Japanese:  ~100 types
• Arabic, Chinese, Farsi, Korean:  ~50 types
– Coverage depends on available sources of training data
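
As a rough illustration of what a maximum entropy‐based categorizer computes, the sketch below scores one candidate name against a few classes with a log‐linear model. The class labels, feature names, and weights are all hypothetical and chosen only for illustration; nothing here is taken from LCC's system.

```python
import math

# Hypothetical weights (illustration only): WEIGHTS[label][feature] -> float.
WEIGHTS = {
    "PERSON":       {"is_capitalized": 0.5, "prev_word=Mr.": 2.0, "suffix=Corp.": -1.0},
    "ORGANIZATION": {"is_capitalized": 0.5, "prev_word=Mr.": -1.0, "suffix=Corp.": 2.5},
    "LOCATION":     {"is_capitalized": 0.5, "prev_word=in": 1.5},
}


def maxent_probabilities(active_features, weights=WEIGHTS):
    """Return P(label | features) under a log-linear (maximum entropy) model."""
    scores = {
        label: sum(w.get(feature, 0.0) for feature in active_features)
        for label, w in weights.items()
    }
    # Subtract the largest score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    z = sum(exps.values())
    return {label: value / z for label, value in exps.items()}


def classify(active_features, weights=WEIGHTS):
    """Pick the label with the highest conditional probability."""
    probs = maxent_probabilities(active_features, weights)
    return max(probs, key=probs.get)


if __name__ == "__main__":
    features = ["is_capitalized", "suffix=Corp."]
    print(classify(features), maxent_probabilities(features))
```

In a real system the weights would be estimated from annotated training data, which is why coverage per language tracks the training data available.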

• 2004 – :  First Open‐Domain, Customizable Event Extraction System
– Uses active learning to leverage feedback gathered from a user
– Allows users to define event extractors for any event of interest
– Deployed for multiple languages:  English, Arabic, Chinese, Korean, Farsi
– Completely ontology‐independent


• 2007 – :  First Customizable Information Extraction Systems
– Allows users to define extractors for any entity, attribute, relationship, or
sentiment / attitude expressed in text
– Uses active learning to leverage feedback gathered from a user
– Leverages automatic candidate generation techniques to find new instances
for extractor training
– Deployed for multiple languages:  English, Arabic, Chinese
– Completely ontology‐independent

• 2007 – :  Truly Domain‐Independent Extraction
– Allows extractors to maintain high levels of performance, regardless of
training or testing domain
– Reduces “overfitting” to a particular domain
– Reduces “tag spam”:  overtagging of certain (frequent) categories in
out‐of‐domain documents


• 2008 – : First Automatic Dossier / Infobox Generation System
– Learns what attributes and relationships are inherently relevant for an entity
from information stored in unstructured text
– Generates either Wikipedia‐style infoboxes or prose descriptions
(a.k.a. “dossiers”) for each entity
– Capable of analogizing from existing structured data resources or learning
from feedback provided by users

• 2008 – : Robust Textual Inference for NLP Applications
– Deployed state‐of‐the‐art system for recognizing textual entailment to
validate content stored in large databases
– Developed temporal inference systems capable of accurately timestamping
events mentioned in text / message traffic

Our Mission

Provide customers with the knowledge supremacy necessary to
support analytic operations in any domain.

Make it easy (and cost‐effective) to unlock knowledge from collections
of unstructured text in any language or domain.

Develop “game changing” search and discovery tools which
turn knowledge into value.

Build the premier information extraction brand.

Key Delineators

• Scalable.
– LCC’s entity, relationship, attribute, and event extraction tools provide access to
more types of information than any other provider.
• Customizable.
– LCC’s customization framework allows content providers to add value to existing
repositories quickly – and cheaply.
• Flexible.
– LCC’s learning‐based extraction tools won’t degrade when run on “new” types of
documents.
• Deployable.
– LCC offers distributable and parallelizable components which can be run
in any environment – big or small.
• Integrate‐able.
– LCC’s products are designed to interoperate with a customer’s existing text and
knowledge management tools.
• Reliable.
– 10+ years of excellence in providing USG customers with high‐tech
NLP solutions that just work.

How do you achieve knowledge supremacy?

• Wide Coverage (enough for most applications)
• Customizable (in minutes, or less)
• Trainable (by application builders or end‐users)
• Domain Portable (with next to no human intervention)
• Fast (enough to index TBs of text)
• Manageable (demonstrated value‐add)

Challenge: Is it possible to build an extraction system
which can learn hundreds of types?

Solving (part of) the Coverage Problem: CiceroLite

• LCC’s wide‐coverage named entity recognizer, CiceroLite, categorizes 8
high‐frequency NE classes with over 90% precision and recall.

But it’s capable of much more: the English language version of CiceroLite
can also categorize 368 different NE classes.

Challenge: Is it possible to build an extraction system
which can allow users to create new extractors?

Introducing… CiceroCustom

• CiceroCustom can be used to extract nearly any type of entity, attribute,
relationship, or event information from text without the need for
hand‐crafted rules or pre‐specified extraction templates.

• Three steps to customized information extraction:
– Step 1. Use CiceroCustom to define a customized extractor which specifies
the type of information that a user is most interested in.
– Step 2. Use the CiceroCustom GUI to “train” each extractor:
• Mark instances as “relevant” or “irrelevant”
• Correct annotations supplied by CiceroCustom
• Accurate results seen after < 15 minutes of training
– Step 3. Use extractors to extract information from new texts
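
The "train by marking instances" step can be pictured as an active learning loop. The sketch below is a generic uncertainty‐sampling loop over a tiny naive Bayes relevance model, not CiceroCustom's actual algorithm or API: the function names, the toy classifier, and the simulated user (`ask_user`) are all assumptions made for illustration.

```python
import math
from collections import Counter


def train(labeled):
    """Collect per-class word counts from (text, is_relevant) pairs."""
    counts = {True: Counter(), False: Counter()}
    docs = Counter()
    for text, label in labeled:
        counts[label].update(text.lower().split())
        docs[label] += 1
    return counts, docs


def p_relevant(model, text):
    """Naive Bayes estimate of P(relevant | text) with add-one smoothing."""
    counts, docs = model
    vocab_size = len(set(counts[True]) | set(counts[False])) + 1
    total_t = sum(counts[True].values())
    total_f = sum(counts[False].values())
    log_odds = math.log((docs[True] + 1) / (docs[False] + 1))
    for word in text.lower().split():
        p_t = (counts[True][word] + 1) / (total_t + vocab_size)
        p_f = (counts[False][word] + 1) / (total_f + vocab_size)
        log_odds += math.log(p_t / p_f)
    return 1.0 / (1.0 + math.exp(-log_odds))


def active_learning(seed, pool, ask_user, rounds):
    """Repeatedly ask the user about the candidate the model is least sure of."""
    labeled, pool = list(seed), list(pool)
    for _ in range(min(rounds, len(pool))):
        model = train(labeled)
        # Uncertainty sampling: choose the candidate whose score is closest to 0.5.
        query = min(pool, key=lambda t: abs(p_relevant(model, t) - 0.5))
        pool.remove(query)
        labeled.append((query, ask_user(query)))
    return train(labeled), labeled


if __name__ == "__main__":
    seed = [("Acme acquired Widget Corp", True), ("The weather was sunny", False)]
    pool = ["Globex acquired Initech", "Rain is expected tomorrow",
            "Initech shares rose after the merger"]
    # Simulated user: marks acquisition-related sentences as relevant.
    user = lambda text: any(w in text.lower() for w in ("acquired", "merger"))
    model, labeled = active_learning(seed, pool, user, rounds=3)
    print(round(p_relevant(model, "Hooli acquired Pied Piper"), 2))
```

The design point is that the system, not the user, decides which instance to show next, so a short training session is spent on the examples that are most informative for the model.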

Traditional Text Extraction vs. CiceroCustom

Criterion | Traditional Extraction | CiceroCustom
Ontology required? | Fixed set of templates | User‐defined templates
Techniques used? | Heuristics / Classifiers | Active Learning
Information considered? | Limited to information found in a single sentence | Inter‐ and intra‐sentential extraction
Access to discourse information? | N/A | Automatic Discourse Parsing
Domain portability? | Domain‐Dependent | Domain‐Independent
Applicable to new genres? | Performance degrades when applied to new genres | Robust performance across document genres
Representation of information? | Fixed, Immutable | Dynamically created
Discovery of new, essential information? | User | Automatic
Coreference? | User | Automatic
Level of expertise required? | Extraction Experts | Any End User
Time to create extractors? | Days, Weeks | Minutes, Hours

CiceroCustom: Innovations

• First open‐domain extraction system that can be customized in minutes
– Active learning‐based framework makes it possible for novices to train high‐performance extractors
in under an hour
– Extractors can be refined / split / fused as needs change
• State‐of‐the‐art inference‐based instance fusion
– State‐of‐the‐art temporal, spatial, and textual inference components make it possible to fuse partial
representations into coherent instances that can be used operationally
• Automatic Discovery of Essential Information Related to Candidates
– Rich semantic substrate helps extraction models identify all of the information needed for extraction
• First Extraction System to Leverage Multiple Semantic Parsers
– Combines dependency information from PropBank, NomBank, and FrameNet to automatically
create semantic representations for entities, attributes, relationships, or events of interest
– The first work leveraging semantic parsing for extraction was done at LCC (Surdeanu et al. 2003)
• State‐of‐the‐Art Discourse Parsing
– Identification of relations between sentences or events provides for greater recall of extractors
– Extraction can go beyond a single sentence



What does it mean to be “domain portable”?

• Performance of most learning‐based extraction systems (entity, event,
etc.) suffers when trained and tested on different types of documents
– Most IE systems suffer degradation of more than 30% when ported to new
domains (e.g. newswire to message traffic)

• LCC is pioneering new unsupervised and lightly‐supervised approaches to
reduce the amount of degradation observed when testing on
out‐of‐domain documents

With ~15 minutes of input from a user,
LCC reduces extractor error by an average of 25%.
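
To make the figure concrete, the snippet below assumes the 25% is a relative reduction in error rate; the slide does not say whether the reduction is relative or absolute, and the 40% baseline used in the example is hypothetical.

```python
def error_after_reduction(baseline_error, relative_reduction=0.25):
    """Error rate remaining after a relative reduction (assumed interpretation)."""
    if not 0.0 <= baseline_error <= 1.0:
        raise ValueError("baseline_error must be between 0 and 1")
    if not 0.0 <= relative_reduction <= 1.0:
        raise ValueError("relative_reduction must be between 0 and 1")
    return baseline_error * (1.0 - relative_reduction)


if __name__ == "__main__":
    # Hypothetical out-of-domain extractor with a 40% error rate.
    print(f"{error_after_reduction(0.40):.2f}")
```

Under that reading, an extractor that is wrong 40% of the time on a new domain would be wrong about 30% of the time after the user's input.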

Performance Profile: 2 GHz, single core, 2 GB RAM

LCC Text Processing Cycle

[Figure: cycle diagram. Stage labels: Info Need, Data Collection, Processing, Analysis, Situational Awareness, Analytic Output.]
Component labels: Data Ingestion & Indexing, Named Entity Recognition, Information Extraction, Coreference Resolution, Geocoding, Spatial Inference, Timestamping, Temporal Inference, Predictive Analysis, Socio-Cultural Analysis, Dossier Generation, Question Answering, Semantic Search, Keyword Expansion.
Access labels: Open APIs, Web Services, Java RMI.

Dossier Generation (2009)

• Need for tools which can automatically assemble
high‐quality knowledge resources from information
extracted from text

• LCC is developing an integrated, unsupervised
Dossier Generation capability which can assemble
relevant entity profiles (either as unstructured text or
Intellipedia‐style structured infoboxes)
– Hundreds of Entity, Relation, Attribute, and Event types
– Implicit Relations from Data Mining Systems
– Normalized Dates / Times / Locations
– Learning‐based relevance detection algorithms capable
of learning what’s relevant for each individual or
category of individuals

Database Validation (2009)

[Figure: database validation workflow. Component labels: Information Retrieval, Commitment Extraction, Content Validation.]

Example statements shown in the figure:
• The attack took place in the morning.
• The attack killed 2 caretakers.
• The attack damaged 50 cars.
• The attack damaged 20 buildings.
• The mosque was in Mariengasse.
• Anas Shakfeh said the attack was a protest by rightist circles against the Islam conference.

Knowledge Acquisition for Link Analysis (2009)
[Figure: knowledge acquisition workflow for link analysis.]
Extraction labels: Entity Extraction, Relationship Extraction, Event Extraction, Untyped Dependency Extraction.
Other labels: Semantic Triples, Candidate Relations, Entailment Validation, Model Feedback, Weights, Pruning, Graph Edges, Graph Population, Inference, Enrichment.

LCC Services

• Custom End‐to‐End Application Development
• Custom Component Development
• Corporate R&D
• Production Services
• Data Verification Services
• Support and Maintenance

Who is LCC’s customer base?

• Target Markets
– Government, Intelligence, and Defense
– Commodity Search Providers
– Company, Credit, and Financial Information
– News and Trade Publishers
– General Aggregators and Distributors
– Pharma

• Emerging Markets
– Legal
– CRM
– Supply Chain Management
– Business Intelligence Providers
– Healthcare

Who are LCC’s partners?

• Strategic Partners
– Application Developers (with complementary S&D interests)
– Visualization Developers
– Commodity Search Providers
– Mobile App Developers

• Technology Partners
– Extraction Providers
– Data Mining Providers
– Database Providers
– Inference Providers

• Integration Partners
– Large Government integrators with access to customers, systems of record
– Large software vendors with interest in extraction technology

• Channel Partners
– Content Providers
• News
• Education
• Financial
• Business Intelligence

CiceroLite

• High‐performance named entity recognition for multiple languages

• Languages supported:
– English (3/2009: > 1000 types)
– Spanish, French, Dutch, German,
Russian, Japanese (~100 types)
– Arabic, Chinese, Farsi, Korean
(~50 types)

• Available as server or standalone application

PinPoint

• Geocoding of more than 10M place names
– Absolute Expressions
– Relative Expressions
– Street Addresses
– Latitude / Longitude or MGRS

• Timestamping for events and event‐denoting nominals
– Absolute Expressions
– Relative Expressions
– Duration Estimation

• Available as a server app only
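
As a small illustration of what timestamping a relative expression involves, the sketch below resolves a handful of relative time expressions against a document date. It is a toy, rule‐based example with a hypothetical function name and a deliberately tiny pattern set, and it does not reflect how PinPoint is implemented.

```python
import re
from datetime import date, timedelta


def resolve_relative(expression, document_date):
    """Map a few relative time expressions to a calendar date.

    Returns None for anything outside this toy pattern set.
    """
    text = expression.strip().lower()
    fixed_offsets = {"today": 0, "yesterday": -1, "tomorrow": 1}
    if text in fixed_offsets:
        return document_date + timedelta(days=fixed_offsets[text])
    match = re.fullmatch(r"(\d+) (day|week)s? ago", text)
    if match:
        count, unit = int(match.group(1)), match.group(2)
        days = count * (7 if unit == "week" else 1)
        return document_date - timedelta(days=days)
    return None  # unsupported expression


if __name__ == "__main__":
    anchor = date(2008, 12, 15)
    print(resolve_relative("2 weeks ago", anchor))
```

Absolute expressions need no anchor date, which is why the two cases are listed separately; relative ones are only meaningful once the document or message date is known.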

CiceroCustom

• Open‐domain, customizable entity, attribute, relationship,
and event extraction

• Foreign language support:
– Arabic, Chinese

• Available as a server or standalone application

IndexManager

• Distributable annotation and indexing that’s compatible with
all of LCC’s products
– Can index annotations from multiple providers into a single
open‐standard index format

• Document formats supported:
.xml, .html, .pdf, .doc, .ppt, .txt, e‐mail, etc.

• Available as a server or a desktop application

Sentiment Tracking

• Identifies sentiment, opinions, and other subjective attitudes
held by individuals towards any of a set of “target” products or
issues.

• Only available for English

• Only available as a server app

• Can be run with LCC’s indexes – or any standard Apache Lucene
index.

Ferret

• State‐of‐the‐art question answering for factoid, list, and
complex questions

• Foreign Language Support:
– English, Arabic, Chinese, Farsi,
Korean, Turkish, Spanish, French,
Dutch, German, Japanese

• Can be run with LCC’s indexes – or any standard Apache Lucene
index.

GistTexter

• Summarization for document clusters or search results

• Foreign Language Support:
– English, Arabic, Chinese, Farsi,
and Korean

• Can be run with LCC’s indexes – or any standard Apache Lucene
index.

For More Information

• For more information, contact us:

– Andrew Hickl, CEO/President
andy@languagecomputer.com
tel:  (972) 231‐0052, Extension 114
cell:  (858) 366‐8424

• Websites:
– Corporate:  http://www.languagecomputer.com
– Labs:  http://labs.languagecomputer.com
– Online Demos:  http://www.getferret.com
