2. Who we are
Researchers from the Search and Analytics group
at IBM Almaden Research Center
– Frederick Reiss
– Yunyao Li
– Laura Chiticariu
– Sriram Raghavan (virtual)
Working on information extraction since 2006-08
– SystemT project
– Code shipping with 8 IBM products
© 2009 IBM Corporation
3. Road Map
What is Information Extraction? (Fred Reiss) ← You are here
Declarative Information Extraction (Fred Reiss)
What the Declarative Approach Enables
– Scalable Infrastructure (Yunyao Li)
– Development Support (Laura Chiticariu)
Conclusion / Q&A (Fred Reiss)
4. Obligatory “What is Information Extraction?” Slide
Distill structured data from unstructured and semi-structured text
Exploit the extracted data in your applications
For years, Microsoft
Corporation CEO Bill Gates
was against open source. But
today he appears to have
changed his mind. "We can be
open source. We love the
concept of shared source,"
said Bill Veghte, a Microsoft
VP. "That's a super-important
shift for us in terms of code
access.“
Annotations
Name              Title     Organization
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  Founder   Free Soft...
Richard Stallman, founder of
the Free Software Foundation,
countered saying…
(from Cohen’s IE tutorial, 2003)
5. SIGMOD 2006 Tutorial [Doan06] in One Slide
(Bibliography at the end of the slide deck.)
Information extraction has been an area of study in Natural
Language Processing and AI for years
Core ideas from database research are not part of existing work in this area
– Declarative languages
– Well-defined semantics
– Cost-based optimization
The challenge: Can we build a “System R” for information
extraction?
Survey of early-stage projects attacking this problem
6. What’s new?
New enterprise-focused applications…
…driving new requirements…
…leading to declarative approaches
7. Enterprise Applications of Information Extraction
Previous tutorial showed research prototypes
– Avatar: Semantic search on personal emails
– DBLife: Use IE to build a knowledge base about
database researchers
– AliBaba: IE over medical research papers
Since then, IE has gone mainstream
– Enterprise Semantic Search
– Enterprise Data as a Service
– Business Intelligence
– Data-driven Enterprise Mashups
8. Enterprise Semantic Search
Use information extraction to improve accuracy and
presentation of search results
– Extract geographical information
– Extract acronyms and their meanings
– Identify pages in different parts of the intranet that are about the same topic
Gumshoe (IBM) [Zhu07,Li06]
9. Enterprise Data as a Service
Extract and clean useful information
hidden in publicly available
documents
Rent the extracted information
over the Internet
DBLife [1]
Midas (IBM)
(Demo today!)
...<issuer>
  <issuerCik>0000070858</issuerCik>
  <issuerName>BANK OF AMERICA CORP /DE/</issuerName>
  <issuerTradingSymbol>BAC</issuerTradingSymbol>
</issuer>
<reportingOwner>
  <reportingOwnerId>
    <rptOwnerCik>0001090355</rptOwnerCik>
    <rptOwnerName>THAIN JOHN A</rptOwnerName>
  </reportingOwnerId>
  <reportingOwnerAddress>
    <rptOwnerStreet1>C/O GOLDMAN SACHS GROUP</rptOwnerStreet1>
    <rptOwnerStreet2>85 BROAD STREET</rptOwnerStreet2>
    <rptOwnerCity>NEW YORK</rptOwnerCity>
    ...
  </reportingOwnerAddress>
  <reportingOwnerRelationship>
    <isOfficer>1</isOfficer>
    <officerTitle>Pres Glbl Bkg Sec & Wlth Mgmt</officerTitle>
  </reportingOwnerRelationship>
</reportingOwner> ...
10. Business Intelligence

[diagram: public data (social networks, blogs, government data) and enterprise data (emails, call center records, legacy data) flow through Information Extraction into a Data Warehouse, feeding traditional and new BI tools]

Important applications
– Marketing: Customer sentiment, brand management
– Legal: Electronic legal discovery, identifying product pipeline problems
– Strategy: Important economic events, monitoring competitors
11. IBM eDiscovery Analyzer
[same Business Intelligence data-flow diagram as the previous slide, with an IBM eDiscovery Analyzer screenshot highlighted]
12. Data-Driven Mashups
Extract structured
information from
unstructured feeds
Join extracted information
with other structured
enterprise data
IBM Lotus Notes
Live Text
IBM InfoSphere MashupHub
[Simmen09]
13. Enterprise Information Extraction
IE has become increasingly important to emerging enterprise
applications
Set of requirements driven by enterprise apps that use information
extraction
– Scalability
• Large data volumes, often orders of magnitude larger than classical NLP
corpora
– Accuracy
• Garbage-in garbage-out: Usefulness of application is often tied to quality
of extraction
– Usability
• Building an accurate IE system is labor-intensive
• Professional programmers are much more expensive than grad students!
14. A Canonical IE System
Text → Feature Selection → Features → Entity Identification → Entities and Relationships → Entity Resolution → Structured Information
15. A Canonical IE System
Text → Feature Selection → Features → Entity Identification → Entities and Relationships → Entity Resolution → Structured Information
Boundaries between these stages are not clear-cut
This diagram shows a simplified logical data flow
– Traditionally, the physical data flow was the same as the logical flow
– But the systems we’ll talk about take a very different
approach to the actual order of execution
16. Feature Selection
Identify features
– Very simple, “atomic” entities
– Inputs for other stages
Examples of features
– Dictionary match
– Regular expression match
– Part of speech
Typical components used
– Off-the-shelf morphology package
– Many simple rules
Very time-consuming and underappreciated
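These feature extractors can be sketched in a few lines of Python; the dictionary contents and regular expressions below are toy assumptions for illustration, not the deck's actual rules:

```python
import re

# Toy first-name dictionary; a real system would load a curated word list.
FIRST_NAMES = {"john", "laura", "peter"}

def dictionary_matches(text, dictionary):
    """Dictionary feature: (start, end, token) for each token found in the dictionary."""
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(r"\w+", text)
            if m.group().lower() in dictionary]

def regex_matches(text, pattern=r"\b[A-Z][a-z]+\b"):
    """Regular expression feature: capitalized words."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]

text = "Call John Merker at 555-1212."
print(dictionary_matches(text, FIRST_NAMES))  # [(5, 9, 'John')]
print(regex_matches(text))
```

Downstream stages consume these (type, span) features rather than raw text.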
17. Entity Identification
Use basic features to build more complex features
– Example: "…was done by Mr. Jack Gurbingal at the…"
Dictionary match (common first name) + Regular expression match (capitalized word) = Complex feature (potential person name)
Use other features to determine which of the complex
features are instances of entities and relationships
Most information extraction research focuses on this stage
– Variety of different techniques
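The first-name-plus-capitalized-word combination above can be sketched as follows (the first-name dictionary is a toy assumption):

```python
import re

FIRST_NAMES = {"jack", "john"}  # toy dictionary, not the deck's actual list

def candidate_persons(text):
    """Complex feature: <common first name> <capitalized word> => potential person."""
    out = []
    tokens = list(re.finditer(r"\w+", text))
    for a, b in zip(tokens, tokens[1:]):  # adjacent token pairs
        if a.group().lower() in FIRST_NAMES and re.fullmatch(r"[A-Z][a-z]+", b.group()):
            out.append(text[a.start():b.end()])
    return out

print(candidate_persons("was done by Mr. Jack Gurbingal at the"))  # ['Jack Gurbingal']
```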
18. Entity Resolution
Perform complex analyses over entities and
relationships
Examples
– Identify entities that refer to the same person or thing
– Join extracted information with external structured data
Not the main focus of this tutorial
– But interacts with other parts of information extraction
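One of the simplest such analyses, linking mentions that refer to the same person, might look like this toy heuristic (an illustration, not any particular system's algorithm):

```python
def resolve(mentions):
    """Toy coreference heuristic: map each mention to a longer mention
    whose token set contains it (e.g. 'John' -> 'John Merker')."""
    canonical = {}
    for m in mentions:
        matches = [c for c in mentions if m != c and set(m.split()) <= set(c.split())]
        canonical[m] = matches[0] if matches else m
    return canonical

print(resolve(["John Merker", "John"]))
# {'John Merker': 'John Merker', 'John': 'John Merker'}
```

Real resolution must also handle nicknames, typos, and ambiguity (two different Johns), which is why this stage interacts with the rest of the pipeline.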
22. Person-Phone Example: Entities and Relationships
[pipeline diagram: Text → Feature Selection → Features → Entity Identification → Entities, Rels. → Entity Resolution → Structured Information]

Input text: "Call John Merker at 555-1212. John also has a cell #: 555-1234"

Entity Identification produces: Person ("John Merker", "John"), Phone ("555-1212", "555-1234"), NumType ("cell")
23. Person-Phone Example: Entities and Relationships
[same pipeline diagram, focused on Entity Resolution: a "Same Person" link connects the two person mentions, and the extracted entities are joined with the office phone directory]

Input text: "Call John Merker at 555-1212. John also has a cell #: 555-1234"
24. Road Map
What is Information Extraction?
Declarative Information Extraction ← You are here
What the Declarative Approach Enables
– Scalable Infrastructure (Yunyao Li)
– Development Support (Laura Chiticariu)
Conclusion / Q&A (Fred Reiss)
25. Declarative Information Extraction
Overview of traditional approaches to information
extraction
Practical issues with applying traditional
approaches
How recent work has used declarative approaches
to address these issues
Different types of declarative approaches
26. Traditional Approaches to Information Extraction
Two dominant types:
– Rule-Based
– Machine Learning-Based
Distinction is based on how Entity Identification is
performed
[pipeline diagram: Text → Feature Selection → Features → Entity Identification → Entities and Relationships → Entity Resolution → Structured Information]
27. Anatomy of a Rule-Based System
Example documents are used to develop Feature Selection Rules and Entity Identification Rules.

[diagram: Text → Feature Selection (driven by the Feature Selection Rules) → Features → Entity Identification (driven by the Entity Identification Rules) → Entities, Rels. → Entity Resolution → Structured Information]
28. Anatomy of a Machine Learning-Based System
Labeled documents provide features and labels that are fed into Training, which produces a Model. Feature Selection Rules are still developed by hand from example documents.

[diagram: Text → Feature Selection (driven by the Feature Selection Rules) → Features → Entity Identification (driven by the trained Model) → Entities, Rels. → Entity Resolution → Structured Information]
29. A Brief History of IE in the NLP Community
Rule-Based
1978-1997: Early rule-based systems and MUC (Message Understanding Conference), a DARPA competition from 1987 to 1997
– FRUMP [DeJong82]
– FASTUS [Appelt93]
– TextPro, PROTEUS
1998: Common Pattern
Specification Language (CPSL)
standard [Appelt98]
– Standard for subsequent rule-based systems
1999-2010: Commercial products,
GATE
Machine Learning
At first: Simple techniques like
Naive Bayes
1990’s: Learning Rules
– AUTOSLOG [Riloff93]
– CRYSTAL [Soderland98]
– SRV [Freitag98]
2000’s: More specialized models
– Hidden Markov Models [Leek97]
– Maximum Entropy Markov
Models [McCallum00]
– Conditional Random Fields
[Lafferty01]
– Automatic feature expansion
For further reading:
Sunita Sarawagi’s Survey [Sarawagi08], Claire Cardie’s Survey [Cardie97]
30. Tying the System Together: Traditional IE Frameworks
Traditional approach:
Workflow system
– Sequence of discrete steps
– Data only flows forward
GATE1 and UIMA2 are the most
popular frameworks
– Type systems and standard
data formats
Web services and Hadoop also
in common use
– No standard data format
Workflow for the ANNIE system [Cunningham09]
1. GATE (General Architecture for Text Engineering) official web site: http://gate.ac.uk/
2. Apache UIMA (Unstructured Information Management Architecture) official web site: http://uima.apache.org/
31. Sequential Execution in CPSL Rules
[figure: cascaded CPSL grammar levels applied to a sample document]

Level 0 (Feature Selection): dictionary and regular expression matches produce basic annotations such as <FirstName>, <CapsWord>, and <Digits>.

Level 1:
〈FirstName〉 〈CapsWord〉 → 〈Person〉
〈Digits〉 〈Token〉[~ "-"] 〈Digits〉 → 〈Phone〉

Level 2:
〈Person〉 〈Token〉[~ "at"] 〈Phone〉 → 〈PersonPhone〉

Each level runs over the entire document before the next level starts.
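The cascade can be sketched in plain Python, each level consuming the previous level's annotations in a fixed order (the regexes and the example sentence are illustrative assumptions):

```python
import re

text = "sed John Smith at 555-1212 hendrerit"

# Level 0 (feature selection): atomic features as (type, (start, end)) spans.
feats = [("FirstName", m.span()) for m in re.finditer(r"\bJohn\b", text)]
feats += [("CapsWord", m.span()) for m in re.finditer(r"\b[A-Z][a-z]+\b", text)]
feats += [("Digits", m.span()) for m in re.finditer(r"\d+", text)]

def seq(annots, types, text, sep):
    """All left-to-right chains of the given annotation types whose gaps match sep."""
    chains = [[a] for a in annots if a[0] == types[0]]
    for t in types[1:]:
        chains = [c + [b] for c in chains for b in annots
                  if b[0] == t and b[1][0] > c[-1][1][1]
                  and re.fullmatch(sep, text[c[-1][1][1]:b[1][0]])]
    return [(c[0][1][0], c[-1][1][1]) for c in chains]

# Level 1: <FirstName><CapsWord> -> Person ; <Digits>-<Digits> -> Phone
persons = [("Person", s) for s in seq(feats, ["FirstName", "CapsWord"], text, r"\s+")]
phones = [("Phone", s) for s in seq(feats, ["Digits", "Digits"], text, r"-")]

# Level 2: <Person> "at" <Phone> -> PersonPhone
person_phone = seq(persons + phones, ["Person", "Phone"], text, r"\s+at\s+")
print([text[s:e] for s, e in person_phone])  # ['John Smith at 555-1212']
```

Note how the execution order is baked in: Level 2 can only see what Levels 0 and 1 already produced, which is exactly the rigidity the next slide criticizes.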
32. Problems with Traditional IE Approaches
Complex, fixed pipelines and rule sets; semantics tied to order of execution

– Scalability: Data only flows forward, leading to wasted work in early stages.
– Accuracy: Lots of custom procedural code.
– Usability: Hard to understand why the system produces a particular result.
33. Declarative to the Rescue!
Define the logical constraints between rules/components; the system determines the order of execution

– Scalability: Optimizer avoids wasted work
– Accuracy: More expressive rule languages; combine different tools easily
– Usability: Describe what to extract, instead of how to extract it
34. What do we mean by “declarative”?
Common vision:
– Separate semantics from order of execution
– Build the system around a language like SQL or Datalog
Different systems have different interpretations
Three main categories
– High-Level Declarative
• Most common approach
– Completely Declarative
– Mixed Declarative
35. High-Level Declarative
Replace the overall IE framework with a declarative language
Each individual extraction component is still a “black box”
Example 1: SQoUT [Jain08]

[diagram: a SQL query and a Catalog of Extraction Modules feed the Optimizer; the resulting query plan combines extraction modules with scan and index access to data]
36. High-Level Declarative
Replace the overall IE framework with a declarative language
Each individual extraction component is still a “black box”
Example 1: SQoUT[Jain08]
Example 2: PSOX[Bohannon08]
37. High-Level Declarative
Replace the overall IE framework with a declarative language
Each individual extraction component is still a “black box”
Example 1: SQoUT[Jain08]
Example 2: PSOX[Bohannon08]
Advantages:
– Allows use of many existing “black box” packages
– High-level performance optimizations possible
– Clear semantics for using different packages for the same task
Drawbacks:
– Doesn’t address issues that occur within a given “black box”
– Limited opportunities for optimization, unless “black boxes” can
provide hints
38. Completely Declarative
One declarative language covers all stages of extraction
Example 1: AQL language in SystemT [Chiticariu10]
Feature Selection:
-- Find all matches of a dictionary
create view Name as
extract dictionary CommonFirstName
on D.text as name
from Document D;

Entity Identification:
-- Match people with their phone numbers
create view PersonPhone as
select P.name as person,
       N.num as phone
from Person P, PhoneNum N
where …

Entity Resolution:
-- Find pairs of references to the same person
create view SamePerson as
select P1.name as name1,
       P2.name as name2
from Person P1, Person P2
where …
39. Sequential Execution in CPSL Rules
[figure: cascaded CPSL grammar levels applied to a sample document, repeated from slide 31]

Level 0 (Feature Selection): dictionary and regular expression matches produce basic annotations such as <FirstName>, <CapsWord>, and <Digits>.

Level 1:
〈FirstName〉 〈CapsWord〉 → 〈Person〉
〈Digits〉 〈Token〉[~ "-"] 〈Digits〉 → 〈Phone〉

Level 2:
〈Person〉 〈Token〉[~ "at"] 〈Phone〉 → 〈PersonPhone〉
40. Declarative Semantics Example:
Identifying Musician-Instrument Relationships
Rules:
Instrument: (pipe | guitar | hammond organ | …)
Person: (Person Annotator)
PersonPlaysInstrument: 〈Person〉 〈0-5 tokens〉 〈Instrument〉

Example text: "John Pipe plays the guitar"

"Pipe" is matched both by the Person annotator (as part of "John Pipe") and by the Instrument dictionary, so several bindings of the rule are possible:
– 〈Person: John Pipe〉 plays the 〈Instrument: guitar〉
– 〈Person: John〉 〈Instrument: Pipe〉 plays the 〈Instrument: guitar〉
The declarative semantics define exactly which of these matches the rule produces.
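To see why a precise semantics matters, a small sketch can enumerate the candidate pairs for the ambiguous sentence (the spans are hand-coded to mimic the annotator outputs, and the no-overlap policy is an illustrative assumption):

```python
import re

text = "John Pipe plays the guitar"
# Toy annotator outputs; 'Pipe' is ambiguous (part of a name and an instrument).
persons = [(0, 9), (0, 4)]          # "John Pipe", "John"
instruments = [(5, 9), (20, 26)]    # "Pipe", "guitar"

def plays(text, persons, instruments, max_gap_tokens=5):
    """<Person> <0-5 tokens> <Instrument>, with the instrument strictly after the person."""
    out = []
    for ps, pe in persons:
        for s, e in instruments:
            if s >= pe:  # no overlap, person first
                gap = text[pe:s]
                if len(re.findall(r"\S+", gap)) <= max_gap_tokens:
                    out.append((text[ps:pe], text[s:e]))
    return out

print(plays(text, persons, instruments))
# [('John Pipe', 'guitar'), ('John', 'Pipe'), ('John', 'guitar')]
```

Three candidate pairs survive even this restrictive policy; a declarative language must pin down which ones the rule actually produces (e.g. via overlap and containment rules).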
41. Completely Declarative
One declarative language covers all stages of extraction
Example 1: AQL language in SystemT [Chiticariu10]
Example 2: Conditional Random Fields in SQL [Wang10]
42. Completely Declarative
One declarative language covers all stages of extraction
Example 1: AQL language in SystemT [Chiticariu10]
Example 2: Conditional Random Fields in SQL [Wang10]
Advantages:
– Unified language with clear semantics from top to bottom
– Optimizer has full control over low-level operations
– Can incorporate existing packages using user-defined
functions
Drawbacks:
– Code inside UDFs doesn’t benefit from declarativeness
43. Mixed Declarative
Language provides declarativeness for some, but not all, of the extraction operations, at both the individual-operation and the pipeline level
Example: Xlog (CIMPLE) [Shen07]
[figure: extraction program for talk extraction, from [1]. One Datalog predicate represents a large, opaque block of extraction code; another predicate is defined in Datalog, using low-level operations.]
44. Mixed Declarative
Language provides declarativeness for some, but not all, of the extraction operations, at both the individual-operation and the pipeline level
Example: Xlog (CIMPLE) [Shen08]
Advantages:
– Ability to reuse existing “black box” packages
– Optimizer gets some flexibility to reorder low-level operations
Drawbacks:
– Challenging to build an optimizer that does both “high-level”
and “low-level” optimizations
45. Declarative to the Rescue!
Different notions of declarativeness in different systems; all of them address the major issues in enterprise IE, but in different ways

– Scalability: Optimizer avoids wasted work
– Accuracy: More expressive rule languages; combine different tools easily
– Usability: Describe what to extract, instead of how to extract it
46. Road Map
What is Information Extraction? (Fred Reiss)
Declarative Information Extraction (Fred Reiss)
What the Declarative Approach Enables
– Scalable Infrastructure (Yunyao Li) ← You are here
– Development Support (Laura Chiticariu)
Conclusion/Questions
48. Declarative to the Rescue!
Define the logical constraints between rules/components; the system determines the order of execution

– Scalability: Optimizer avoids wasted work
– Accuracy: More expressive rule languages; combine different tools easily
– Usability: Describe what to extract, instead of how to extract it
49. Conventional vs. Declarative IE Infrastructure
Conventional:
– Operational semantics and implementation are hard-coded and interconnected
– Extraction Pipeline → Runtime Environment

Declarative:
– Separate semantics from implementation
– Database-style design: Optimizer + Runtime
– Declarative Language → Optimizer → Plan → Runtime Environment
50. Why Declarative IE for Scalability
An informal experimental study [Reiss08]
– Collection of 4.5 million web logs
– Band Review Annotator: identify informal reviews of concerts
– The declarative implementation ran 20x faster than the CPSL-based implementation
51. Different Aspects of Design for Scalability
Optimization
– Granularity
• High-level: annotator composition
• Low-level: basic extraction operators
– Strategy:
• Rewrite-based
• Cost-based
Runtime Model
– Document-Centric vs. Collection-Centric
52. Optimization Granularity for Declarative IE
Annotator Composition (high-level declarative)
– Each annotator extracts one or more entities or relationships (e.g., a Person annotator)
– Black-box assumption on how an annotator works
– Optimize the composition of the extraction pipeline

Basic Extraction Operator (completely declarative)
– Each operator represents an atomic extraction operation (e.g., dictionary matching, regular expression, join, …)
– System is fully aware of how each extraction operator works
– Optimize each basic extraction operator

Mixed declarative systems combine both granularities.
53. Optimization Strategies for Declarative IE
Rewrite-based
– Apply rewrite rules to transform the declarative form of the annotators into an equivalent form that is more efficient

Cost-based
– Enumerate all possible physical execution plans, estimate their cost, and choose the one with the minimum expected cost

Systems may mix these two approaches
54. Runtime Model for Declarative IE
Document-Centric: Input Document Stream → Runtime Environment → Annotated Document Stream

Collection-Centric: Document Collection (+ auxiliary index) → Runtime Environment → Annotations
56. Cimple
Rewrite-based optimization [Shen07]
– Inverted-index based simple pattern matching
• Shared document scan

Simple patterns:
P1 = "(Jeff|Jeffery)\s\s*Ullman"
P2 = "(Jeff|Jeffery)\s\s*Naughton"
P3 = "Laura\s\s*Haas"
P4 = "Peter\s\s*Haas"

[figure: parse trees for p1-p4, and an inverted index mapping terms (Jeff, Jeffery, Ullman, Naughton, Laura, Peter, Haas) to the patterns that require them]
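A rough sketch of this shared-scan idea: probe the inverted index for a word the pattern must contain, and run the regex only over the candidate documents (the documents, the index construction, and the last-literal-word analysis are toy assumptions):

```python
import re

docs = {
    1: "Jeff Ullman and Jeffery Naughton met.",
    2: "Laura Haas spoke; Peter Haas listened.",
    3: "Nothing relevant here.",
}
patterns = {
    "P1": r"(Jeff|Jeffery)\s\s*Ullman",
    "P3": r"Laura\s\s*Haas",
}

# Build an inverted index from words to the documents containing them.
index = {}
for doc_id, text in docs.items():
    for word in re.findall(r"\w+", text):
        index.setdefault(word, set()).add(doc_id)

def required_words(pattern):
    """Literal words any match of this simple pattern must contain.
    (Toy analysis: take the last literal word; real systems parse the pattern.)"""
    return [re.findall(r"\w+", pattern)[-1]]

def matches(pattern):
    """Probe the index first, then run the regex only on candidate documents."""
    candidates = set.intersection(*(index.get(w, set()) for w in required_words(pattern)))
    return {d for d in candidates if re.search(pattern, docs[d])}

print(matches(patterns["P1"]))  # {1}
print(matches(patterns["P3"]))  # {2}
```

Document 3 is never scanned by any regex, which is where the savings come from on large collections.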
57. Cimple
Pushing down text properties [Shen07]
– E.g., to find an all-capitalized line:
Plan a: σ_allcaps(x)(lines(d, x, n))
Plan b: σ_allcaps(x)(lines(σ_containCaps(d), x, n))

Scoping [Shen07]
– Imposing location conditions on where to extract spans
• E.g., check for names only within two lines of the occurrence of titles

Incorporating a cost model to decide how to apply the rewrite.
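The pushdown can be sketched as follows; `contains_caps` stands in for the document-level predicate σ_containCaps (the predicates and documents are illustrative assumptions):

```python
def lines(doc):
    return doc.split("\n")

def allcaps(line):
    return line.isupper() and line.strip() != ""

def contains_caps(doc):
    """Cheap document-level filter: could this document contain an all-caps line?"""
    return any(c.isupper() for c in doc)

docs = ["hello\nWORLD", "nothing here\nat all"]

# Plan a: split every document into lines, then filter.
plan_a = [l for d in docs for l in lines(d) if allcaps(l)]

# Plan b: push the cheap document-level predicate below the line extraction,
# so documents that cannot match are never split at all.
plan_b = [l for d in docs if contains_caps(d) for l in lines(d) if allcaps(l)]

assert plan_a == plan_b == ["WORLD"]
```

Both plans return the same answer; Plan b simply avoids running the line extractor on documents that the cheap filter rules out.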
58. Cimple
Collection-centric runtime model
– Document collection (or snapshots of document collection)
– Previous extraction results
Reusing previous extraction results
[Chen08][Chen09]
• Similar to maintaining materialized views
• Cyclex: IE program viewed as
one big blackbox [Chen08]
• Delex: IE program viewed as a
workflow of blackboxes [Chen09]
59. RAD [Khaitan09]
Query language: a declarative subset of CPSL specification
– Regular expressions over features and existing annotations
Query language: a declarative subset of the CPSL specification
– Regular expressions over features and existing annotations

Offline process (Document Collection → Document Inverted Index):
– Tokenization, chunking, sentence detection
– Generating indexed features
• Dictionary lookup (e.g., first name)
• Part-of-speech lookup (e.g., noun, verb)
• Regular expression on tokens (e.g., CapsWord, Alphanum)

Query time (Optimizer):
– Generating derived entities over the index using a series of join operators (e.g., Person, Organization)
– Results stored back as Document Inverted Index + Annotations
60. RAD
Cost-based optimization based on posting-list statistics
• E.g., matching ANYWORD@ANYWORD.com for Email

[figure: two candidate plans for the same pattern. Plan a: a zig-zag join over the inverted index combines ANYWORD, "@", ANYWORD (intermediate results R1, R2), then another zig-zag join adds ".", "c", "o", "m" (R3). Plan b: the same postings joined in a different order (R1-R4).]
62. Declarative to the Rescue!
Define the logical constraints between rules/components; the system determines the order of execution

– Scalability: Optimizer avoids wasted work
– Accuracy: More expressive rule languages; combine different tools easily
– Usability: Describe what to extract, instead of how to extract it
63. Conventional vs. Declarative IE Infrastructure
Conventional:
– Operational semantics and implementation are hard-coded and interconnected
– Extraction Pipeline → Runtime Environment

Declarative:
– Separate semantics from implementation
– Database-style design: Optimizer + Runtime
– Declarative Language → Optimizer → Plan → Runtime Environment
64. Different Aspects of Design for Scalability
Optimization
– Granularity
• High-level: annotator composition
• Low-level: basic extraction operators
– Strategy:
• Rewrite-based
• Cost-based
Runtime Model
– Document-Centric vs. Collection-Centric
66. SQoUT [Ipeirotis07][Jain07,08,09]
Focus on composition of extraction systems

[diagram: a SQL query specifies the entities/relations to extract; extraction systems E0 … Em from the Extraction System Repository, each paired with a retrieval strategy, run over the Document Collection; extraction results pass through Data Cleaning into an Extracted View that answers the query]
67. SQoUT
Cost-based Query Optimization
New Plan Enumeration Strategies
– Document retrieval strategies
• E.g., filtered scan: running the annotator only over potentially relevant docs
– Join execution
• Independent join, outer/inner join, zig-zag join: extraction results of one relation can determine the docs retrieved for another relation

Efficiency vs. Quality Cost Model
– A goodness measure weighs extraction quality against efficiency
68. SystemT [Reiss08] [Krishnamurthy08] [Chiticariu10]
[compilation pipeline: Rules → Preprocessor → Blocks → Plan Enumerator (consulting a Cost Model) → Block Plans → Postprocessor → Final Plan]

• Preprocessor: divide rules into compilation blocks; rewrite-based optimization within each block
• Plan Enumerator: System R style cost-based optimization within each block
• Postprocessor: merge block plans into a single operator graph; rewrite-based optimization across blocks
69. Example: Restricted Span Evaluation (RSE)
Leverage the sequential nature of text
– Join predicates on character or token distance
Only evaluate the inner operand on the relevant portions of the document
Limited applicability
– Need to guarantee exactly the same results

Example: "…John Smith at 555-1212…"
RSEJoin: the Regex operator finds "555-1212"; the Dictionary operator then looks for matches only in the vicinity of the phone number, yielding "John Smith".
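A minimal sketch of the restricted-span idea (the dictionary, window size, and phone regex are toy assumptions):

```python
import re

FIRST_NAMES = {"john"}  # toy dictionary (assumption)
text = "Lorem ipsum John Smith at 555-1212 dolor sit amet"

def rse_person_phone(text, window=30):
    """Restricted span evaluation: run the (expensive) dictionary matcher only
    in a window of characters before each phone-number match."""
    results = []
    for phone in re.finditer(r"\d{3}-\d{4}", text):
        region = text[max(0, phone.start() - window):phone.start()]
        for tok in re.finditer(r"\w+", region):
            if tok.group().lower() in FIRST_NAMES:
                results.append((tok.group(), phone.group()))
    return results

print(rse_person_phone(text))  # [('John', '555-1212')]
```

The dictionary never touches text far from a phone number, which is where the speedup comes from; the transformation is only safe when it provably returns the same results as the unrestricted plan.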
70. Example: Shared Dictionary Matching (SDM)
Rewrite-based optimization
– Applied to the algebraic plan during postprocessing
Evaluate multiple dictionaries in a single pass:

Dict(D1) → Dict(D2) → subplan becomes SDMDict(D1, D2) → subplan
(SDM = Shared Dictionary Matching operator)
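A toy sketch of single-pass shared dictionary matching (the dictionary contents are illustrative assumptions):

```python
import re

# Two dictionaries, merged so the text is scanned only once.
dicts = {
    "first_names": {"john", "laura"},
    "cities": {"york", "almaden"},
}

def shared_dictionary_match(text, dicts):
    """Single pass over the tokens; each hit is tagged with its source dictionary."""
    merged = {}
    for name, words in dicts.items():
        for w in words:
            merged.setdefault(w, []).append(name)
    hits = []
    for tok in re.finditer(r"\w+", text):
        for source in merged.get(tok.group().lower(), []):
            hits.append((source, tok.group(), tok.start()))
    return hits

print(shared_dictionary_match("John lives in York", dicts))
# [('first_names', 'John', 0), ('cities', 'York', 14)]
```

With N dictionaries the naive plan tokenizes the text N times; the merged lookup table makes the cost one scan plus a hash probe per token.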
71. SystemT
Document-centric Runtime Model:
– One document at a time
– Entities extracted are associated with their source document

[diagram: Input Document Stream → Runtime Environment → Annotated Document Stream]

Why one document at a time?
72. Scaling SystemT: From Laptop to Cluster
In Lotus Notes Live Text: the Lotus Notes client passes each email message through the SystemT Runtime (Input Adapter → SystemT Runtime → Output Adapter) and displays the annotated email.

In Cognos Toro Text Analytics: documents are processed with Hadoop Map-Reduce; the Jaql runtime wraps the same SystemT Runtime (Input Adapter → SystemT Runtime → Output Adapter) in Jaql functions that run in parallel across the Hadoop cluster.
73. BayesStore [Wang10]
Probabilistic declarative IE
– In-database machine learning for efficiency and scalability

Text Data and Conditional Random Fields (CRF) Model
[diagram: each document is tokenized into a Token table; the CRF model is stored as a Factor table]
74. BayesStore
Viterbi Inference SQL Implementation
– Implementing dynamic programming algorithm using recursive
queries
Rewrite-based
optimization.
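The recursive-query implementation follows the standard Viterbi dynamic program, which in plain Python (with a toy scoring model, not BayesStore's actual CRF) looks like:

```python
def viterbi(tokens, tags, score):
    """score(prev_tag, tag, token) -> float; returns the best-scoring tag sequence."""
    best = {t: (score(None, t, tokens[0]), [t]) for t in tags}
    for tok in tokens[1:]:
        best = {
            t: max(
                ((s + score(p, t, tok), path + [t]) for p, (s, path) in best.items()),
                key=lambda x: x[0],
            )
            for t in tags
        }
    return max(best.values(), key=lambda x: x[0])[1]

# Toy model: digits look like PHONE, other words like NAME.
def score(prev, tag, token):
    emit = 1.0 if (tag == "PHONE") == token[0].isdigit() else 0.0
    trans = 0.5 if prev in (None, tag) else 0.1  # mild preference for staying
    return emit + trans

print(viterbi(["John", "Merker", "555-1212"], ["NAME", "PHONE"], score))
# ['NAME', 'NAME', 'PHONE']
```

Each loop iteration extends the best path per tag by one token, which maps naturally onto one level of a recursive SQL query over the factor table.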
76. Road Map
What is Information Extraction? (Fred Reiss)
Declarative Information Extraction (Fred Reiss)
What the Declarative Approach Enables
– Scalable Infrastructure (Yunyao Li)
– Development Support (Laura Chiticariu) ← You are here
78. Declarative to the Rescue!
Define the logical constraints between rules/components; the system determines the order of execution

– Scalability: Optimizer avoids wasted work
– Accuracy: More expressive rule languages; combine different tools easily
– Usability: Describe what to extract, instead of how to extract it
79. A Canonical IE System
(Pipeline: Text → Feature Selection → Features → Entity Identification → Entities and Relationships → Entity Resolution → Structured Information)

Developing IE systems is an extremely time-consuming, error-prone process
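The canonical pipeline above is just three composed stages. The sketch below makes that concrete in Python; the stage implementations are toy stand-ins, not a real extractor.

```python
# Sketch of the canonical IE pipeline:
# text -> features -> entities and relationships -> structured information.

def canonical_ie(text, select_features, identify_entities, resolve_entities):
    features = select_features(text)        # e.g. tokens, dictionary hits
    entities = identify_entities(features)  # e.g. rule or model matches
    return resolve_entities(entities)       # e.g. merge/dedup mentions
```

Each stage is a point where a developer must write, test, and refine rules or models, which is where the development cost concentrates.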
80. The Life Cycle of an IE System
Development: the developer iterates Develop → Test → Refine,
supplying (1) features and (2) rules / labeled data

Usage / Maintenance: the user iterates Use → Test → Analyze
81. Example 1: Explaining Extraction Results
---------------------------------------- Document Preprocessing
--------------------------------------create view Doc as
select D.text as text
from DocScan D;
------------------------------------------------------------------------------- Document Preprocessing
-- Basic Named Entity Annotators
-----------------------------------------------------------------------------create view Doc as
select D.text as text
-- Find initial words
from DocScan D;
create view InitialWord1 as
select R.match as word
-----------------------------------------from Regex(/b([p{Upper}].s*){1,5}b/, Doc.text) R
-- Basic Named Entity Annotators 10, Doc.text) R
from RegexTok(/([p{Upper}].s*){1,5}/,
----------------------------------------- added on 04/18/2008
where Not(MatchesRegex(/M.D./, R.match));
-- Find initial words
-- Yunyao: view InitialW ord1 as capture names with prefix
create added on 11/21/2008 to
(we use it asR.match as word
select initial
-- to avoid adding too many commplex rules)
--from Regex(/b([p{Upper}].s*){1,5}b/, Doc.text)
create view InitialWord2 as
R
select D.match as word
from RegexTok(/([p{Upper}].s*){1,5}/, 10,
from Dictionary('specialNamePrefix.dict', Doc.text) D;
Doc.text) R
create view InitialWord as
-- added on 04/18/2008
(select I.word as word from InitialWord1R.match));
where Not(MatchesRegex(/M.D./, I)
union all
(select I.word as word from InitialWord2 I);
-- Yunyao: added on 11/21/2008 to capture names
with prefix (we use it as initial
-- Find weak initial words
-- to avoid adding too many
create view WeakInitialWord as commplex rules)
select R.match as word ord2 as
create view InitialW
--from Regex(/b([p{Upper}].?s*){1,5}b/, Doc.text) R;
select D.match as word
from RegexTok(/([p{Upper}].?s*){1,5}/, 10, Doc.text) R
from Dictionary('specialNamePrefix.dict', Doc.text)
-D;added on 05/12/2008
-- Do not allow weak initial word to be a word longer than
three characters
create view InitialW ord as
where Not(ContainsRegex(/[p{Upper}]{3}/, R.match))
(select I.word as
-- added on 04/14/2009 word from InitialWord1 I)
union all
-- Do not allow weak initial words to match the timezon
and Not(ContainsDict('timeZone.dict', R.match)); I);
(select I.word as word from InitialWord2
------------------------------------------------ Strong Phone Numbers
-- Find weak initial words
----------------------------------------------create view W eakInitialWord as
create dictionary StrongPhoneVariantDictionary as (
select
'phone', R.match as word
--from Regex(/b([p{Upper}].?s*){1,5}b/, Doc.text)
'cell',
R;
'contact',
'direct', RegexTok(/([p{Upper}].?s*){1,5}/, 10,
from
'office',
Doc.text) R
-- Yunyao: Added new strong clues for phone numbers
-- added on 05/12/2008
'tel', Do not allow weak initial word to be a word
-'dial',
longer than three characters
'Telefon',
where
'mobile', Not(ContainsRegex(/[p{Upper}]{3}/,
R.match))
'Ph',
'Phone Number',
-- added on 04/14/2009
'Direct Line', allow weak initial words to match the
-- Do not
'Telephone
timezon No',
'TTY', Not(ContainsDict('timeZone.dict', R.match));
and
'Toll Free',
'Toll-free',
------------------------------------------------ German
-- Strong Phone Numbers
'Fon',
----------------------------------------------'Telefon Geschaeftsstelle',
'Telefon Geschäftsstelle',
create dictionary StrongPhoneVariantDictionary as (
'Telefon Zweigstelle',
'phone',
'Telefon Hauptsitz',
'cell',
'Telefon (Geschaeftsstelle)',
'contact',
'Telefon (Geschäftsstelle)',
'direct',
'Telefon (Zweigstelle)',
'office',
'Telefon (Hauptsitz)',
-- Yunyao: Added new strong clues for phone
'Telefonnummer',
numbers
'Telefon Geschaeftssitz',
'Telefon Geschäftssitz',
'tel',
'Telefon (Geschaeftssitz)',
'dial',
'Telefon (Geschäftssitz)',
'Telefon',
'Telefon Persönlich',
'mobile',
'Telefon persoenlich',
'Ph',
'Telefon (Persönlich)',
'Phone Number',
'Telefon (persoenlich)',
'Direct
'Handy', Line',
'Handy-Nummer',
'Telephone No',
'Telefon arbeit',
'TTY',
'TelefonFree',
'Toll (arbeit)'
);
'Toll-free',
create view Initial as
--'Junior' (Yunyao: comments out to avoid mismatches such as Junior National [team player],
-- If we can have large negative dictionary to eliminate such mismatches,
-- then this may be recovered
--'Name:' ((Yunyao: comments out to avoid mismatches such as 'Name: Last Name')
-- for German names
-- TODO: need further test
,'herr', 'Fraeulein', 'Doktor', 'Herr Doktor', 'Frau Doktor',
'Herr Professor', 'Frau professor', 'Baron', 'graf'
-- Find dictionary matches for all title initials
create view LastName as
select C.lastname as lastname
--from Consolidate(ValidLastNameAll.lastname) C;
from ValidLastNameAll C
consolidate on C.lastname;
select D.match as initial
--'Name:' ((Yunyao: comments out to avoid mismatches such as 'Name: Last Name')
-- for German names
-- TODO: need further test
,'herr', 'Fraeulein', 'Doktor', 'Herr Doktor', 'Frau Doktor',
'Herr Professor', 'Frau professor', 'Baron', 'graf'
);
-- Find dictionary matches for all first names
-- Mostly US first names
create view StrictFirstName1 as
select D.match as firstname
from Dictionary('strictFirst.dict', Doc.text) D
--where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/,
D.match);
-- changed to enable unicode match
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);
);
-- German first names
create view StrictFirstName2 as
select D.match as firstname
from Dictionary('strictFirst_german.dict', Doc.text) D
--where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/,
D.match);
--where MatchesRegex(/p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);
-- Find dictionary matches for all title initials
from Dictionary('InitialDict', Doc.text) D;
-- Yunyao: added 05/09/2008 to capture person name suffix
create dictionary PersonSuffixDict as
(
',jr.', ',jr', 'III', 'IV', 'V', 'VI'
);
create view PersonSuffix as
select D.match as suffix
from Dictionary('PersonSuffixDict', Doc.text) D;
-- Find capitalized words that look like person names and not in the non-name dictionary
create view CapsPersonCandidate as
select R.match as name
--from Regex(/bp{Upper}p{Lower}[p{Alpha}]{1,20}b/, Doc.text) R
--from Regex(/bp{Upper}p{Lower}[p{Alpha}]{0,10}(['-][p{Upper}])?[p{Alpha}]{1,10}b/, Doc.text) R
-- change to enable unicode match
--from Regex(/bp{Lu}p{M}*[p{Ll}p{Lo}]p{M}*[p{L}p{M}*]{0,10}(['-][p{Lu}p{M}*])?[p{L}p{M}*]{1,10}b/, Doc.text) R
--from Regex(/bp{Lu}p{M}*[p{Ll}p{Lo}]p{M}*[p{L}p{M}*]{0,10}(['-][p{Lu}p{M}*])?(p{L}p{M}*){1,10}b/, Doc.text) R
-- Allow fully capitalized words
--from Regex(/bp{Lu}p{M}*(p{L}p{M}*){0,10}(['-][p{Lu}p{M}*])?(p{L}p{M}*){1,10}b/, Doc.text) R
from RegexTok(/p{Lu}p{M}*(p{L}p{M}*){0,10}(['-][p{Lu}p{M}*])?(p{L}p{M}*){1,10}/, 4, Doc.text) R --'
where Not(ContainsDicts(
'FilterPersonDict',
'filterPerson_position.dict',
'filterPerson_german.dict',
'InitialDict',
'StrongPhoneVariantDictionary',
'stateList.dict',
'organization_suffix.dict',
'industryType_suffix.dict',
'streetSuffix_forPerson.dict',
'wkday.dict',
'nationality.dict',
'stateListAbbrev.dict',
'stateAbbrv.ChicagoAPStyle.dict', R.match));
create view CapsPerson as
select C.name as name
from CapsPersonCandidate C
where Not(MatchesRegex(/(p{Lu}p{M}*)+-.*([p{Ll}p{Lo}]p{M}*).*/, C.name))
and Not(MatchesRegex(/.*([p{Ll}p{Lo}]p{M}*).*-(p{Lu}p{M}*)+/, C.name));
create view CapsPersonNoP as
select CP.name as name
from CapsPerson CP
where Not(ContainsRegex(/'/, CP.name)); --'
create dictionary InitialDict as
( 'Pro','Bono','Enterprises','Group','Said','Says','Assista
nt','Vice','Warden','Contribution',
'rev.', 'col.', 'reverend', 'prof.', 'professor.',
'lady', 'miss.', 'mrs.', 'mrs', 'mr.', 'pt.', 'ms.', 'Sales',
'Research', 'Development', 'Product',
'messrs.', 'dr.', 'master.', 'marquis', 'monsieur',
'Support', 'Manager', 'Telephone', 'Phone', 'Contact',
'ds', 'di'
'Information',
--'Dear' (Yunyao: comments out to avoid mismatches such as
'Electronics','Managed','West','East','North','South',
Dear Member),
'Teaches','Ministry', 'Church', avoid mismatches such
--'Junior' (Yunyao: comments out to'Association',
as'Laboratories', [team player],
Junior National 'Living', 'Community', 'Visiting',
-- 'Officer', have large negative'Only', 'Additionally', such
If we can 'After', 'Pls', 'FYI', dictionary to eliminate
mismatches, 'Acquire', 'Addition', 'America',
'Adding',
-- then this phrases that are likely to be at the start of a
-- short may be recovered
sentence
'Yes', 'No', 'Ja', 'Nein','Kein', 'Keine', 'Gegenstimme',
-- TODO: to be double checked
'Another', 'Anyway','Associate', 'At', 'Athletes', 'It',
'Enron', 'EnronXGate', 'Have', 'However',
'Company', 'Companies', 'IBM','Annual',
-- common verbs appear with person names in
financial reports
-- ideally we want to have a general comprehensive
verb list to use as a filter dictionary
'Joins', 'Downgrades', 'Upgrades', 'Reports', 'Sees',
'Warns', 'Announces', 'Reviews'
-- Laura 06/02/2009: new filter dict for title for SEC
domain in filterPerson_title.dict
);
create dictionary GreetingsDict as
(
'Hey', 'Hi', 'Hello', 'Dear',
-- German greetings
'Liebe', 'Lieber', 'Herr', 'Frau', 'Hallo',
-- Italian
'Ciao',
-- Spanish
'Hola',
-- French
'Bonjour'
);
81
create dictionary InitialDict as
(
'rev.', 'col.', 'reverend', 'prof.', 'professor.',
'lady', 'miss.', 'mrs.', 'mrs', 'mr.', 'pt.', 'ms.',
'messrs.', 'dr.', 'master.', 'marquis', 'monsieur',
'ds', 'di'
--'Dear' (Yunyao: comments out to avoid
mismatches such as Dear Member),
-- Spain first name from blue pages
create view StrictFirstName7 as
select D.match as firstname
from Dictionary('names/strictFirst_spain.dict', Doc.text) D
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);
--============================================================
-- Find strict capitalized words
--create view StrictCapsPerson as
create view StrictCapsPerson as
select R.name as name
from StrictCapsPersonR R
where MatchesRegex(/bp{Lu}p{M}*[p{Ll}p{Lo}]p{M}*(p{L}p{M}*){1,20}b/, R.name);
-- Find dictionary matches for all last names
create view StrictLastName1 as
select D.match as lastname
from Dictionary('strictLast.dict', Doc.text) D
--where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match);
create view StrictLastName3 as
select D.match as lastname
from Dictionary('strictLast_german_bluePages.dict', Doc.text) D
--where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/, D.match);
--where MatchesRegex(/p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match);
create view StrictLastName4 as
select D.match as lastname
from Dictionary('uniqMostCommonSurname.dict', Doc.text) D
--where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/, D.match);
--where MatchesRegex(/p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match);
create view StrictLastName6 as
select D.match as lastname
from Dictionary('names/strictLast_france.dict', Doc.text) D
where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match);
create view StrictLastName7 as
select D.match as lastname
from Dictionary('names/strictLast_spain.dict', Doc.text) D
where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match);
create view StrictLastName8 as
select D.match as lastname
from Dictionary('names/strictLast_india.partial.dict', Doc.text) D
where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match);
create view StrictLastName9 as
select D.match as lastname
from Dictionary('names/strictLast_israel.dict', Doc.text) D
where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match);
create view StrictLastName as
(select S.lastname as lastname from StrictLastName1 S)
union all
(select S.lastname as lastname from StrictLastName2 S)
union all
(select S.lastname as lastname from StrictLastName3 S)
union all
(select S.lastname as lastname from StrictLastName4 S)
union all
(select S.lastname as lastname from StrictLastName5 S)
union all
(select S.lastname as lastname from StrictLastName6 S)
union all
(select S.lastname as lastname from StrictLastName7 S)
union all
(select S.lastname as lastname from StrictLastName8 S)
union all
(select S.lastname as lastname from StrictLastName9 S);
-- Relaxed version of last name
create view RelaxedLastName1 as
select CombineSpans(SL.lastname, CP.name) as lastname
from StrictLastName SL,
StrictCapsPerson CP
where FollowsTok(SL.lastname, CP.name, 1, 1)
and MatchesRegex(/-/, SpanBetween(SL.lastname, CP.name));
create view RelaxedLastName2 as
select CombineSpans(CP.name, SL.lastname) as lastname
from StrictLastName SL,
StrictCapsPerson CP
where FollowsTok(CP.name, SL.lastname, 1, 1)
and MatchesRegex(/-/, SpanBetween(CP.name, SL.lastname));
-- all the last names
create view LastNameAll as
(select N.lastname as lastname from StrictLastName N)
union all
(select N.lastname as lastname from RelaxedLastName1 N)
union all
(select N.lastname as lastname from RelaxedLastName2 N);
from Dictionary('names/name_israel.dict', Doc.text) D
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);
from FirstName FN,
InitialWord IW,
CapsPerson CP
where FollowsTok(FN.firstname, IW.word, 0, 0)
and FollowsTok(IW.word, CP.name, 0, 0);
create view NamesAll as
(select P.name as name from NameDict P)
union all
(select P.name as name from NameDict1 P)
union all
(select P.name as name from NameDict2 P)
union all
(select P.name as name from NameDict3 P)
union all
(select P.name as name from NameDict4 P)
union all
(select P.firstname as name from FirstName P)
union all
/**
* Translation for Rule 3r2
*
* This relaxed version of rule '3' will find person names like
Thomas B.M . David
* But it only insists that the second word is in the person
dictionary
*/
/*
<rule annotation=Person id=3r2>
<internal>
<token attribute={etc}>CAPSPERSON</token>
<token attribute={etc}>INITIALWORD</token>
<token
attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token>
</internal>
</rule>*/
create view PersonDict as
select C.name as name
--from Consolidate(NamesAll.name) C;
from NamesAll C
consolidate on C.name;
create view Person3r2 as
select CombineSpans(CP.name, LN.lastname) as person
from LastName LN,
InitialWord IW,
CapsPerson CP
where FollowsTok(CP.name, IW.word, 0, 0)
and FollowsTok(IW.word, LN.lastname, 0, 0);
--==========================================================
-- Actual Rules
--==========================================================
/**
* Translation for Rule 4
*
* This rule will find person names like David Thomas
*/
/*
<rule annotation=Person id=4>
<internal>
<token
attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token>
<token
attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person4WithNewLine as
select CombineSpans(FN.firstname, LN.lastname) as person
from FirstName FN,
LastName LN
where FollowsTok(FN.firstname, LN.lastname, 0, 0);
-- For 3-part Person names
create view Person3P1 as
select CombineSpans(F.firstname, L.lastname) as person
from StrictFirstName F,
StrictCapsPersonR S,
StrictLastName L
where FollowsTok(F.firstname, S.name, 0, 0)
--and FollowsTok(S.name, L.lastname, 0, 0)
and FollowsTok(F.firstname, L.lastname, 1, 1)
and Not(Equals(GetText(F.firstname), GetText(L.lastname)))
and Not(Equals(GetText(F.firstname), GetText(S.name)))
and Not(Equals(GetText(S.name), GetText(L.lastname)))
and Not(ContainsRegex(/[nrt]/, SpanBetween(F.firstname, L.lastname)));
create view Person3P2 as
select CombineSpans(P.name, L.lastname) as person
from PersonDict P,
StrictCapsPersonR S,
StrictLastName L
where FollowsTok(P.name, S.name, 0, 0)
--and FollowsTok(S.name, L.lastname, 0, 0)
and FollowsTok(P.name, L.lastname, 1, 1)
and Not(Equals(GetText(P.name), GetText(L.lastname)))
and Not(Equals(GetText(P.name), GetText(S.name)))
and Not(Equals(GetText(S.name), GetText(L.lastname)))
and Not(ContainsRegex(/[nrt]/, SpanBetween(P.name, L.lastname)));
-- Yunyao: 05/20/2008 revised to Person4WrongCandidates due
to performance reason
-- NOTE: current optimizer execute Equals first thus make
Person4Wrong very expensive
--create view Person4Wrong as
--select CombineSpans(FN.firstname, LN.lastname) as person
--from FirstName FN,
-LastName LN
--where FollowsTok(FN.firstname, LN.lastname, 0, 0)
-- and ContainsRegex(/[nr]/, SpanBetween(FN.firstname,
LN.lastname))
-- and Equals(GetText(FN.firstname), GetText(LN.lastname));
create view Person3P3 as
select CombineSpans(F.firstname, P.name) as person
from PersonDict P,
StrictCapsPersonR S,
StrictFirstName F
where FollowsTok(F.firstname, S.name, 0, 0)
--and FollowsTok(S.name, P.name, 0, 0)
and FollowsTok(F.firstname, P.name, 1, 1)
and Not(Equals(GetText(P.name), GetText(F.firstname)))
and Not(Equals(GetText(P.name), GetText(S.name)))
and Not(Equals(GetText(S.name), GetText(F.firstname)))
and Not(ContainsRegex(/[nrt]/, SpanBetween(F.firstname, P.name)));
create view Person4WrongCandidates as
select FN.firstname as firstname, LN.lastname as lastname
from FirstName FN,
LastName LN
where FollowsTok(FN.firstname, LN.lastname, 0, 0)
and ContainsRegex(/[nr]/, SpanBetween(FN.firstname,
LN.lastname));
/**
* Translation for Rule 1
* Handles names of persons like Mr. Vladimir E. Putin
*/
/*
<rule annotation=Person id=1>
<token attribute={etc}INITIAL{etc}>CANYWORD</token>
<internal>
<token attribute={etc}>CAPSPERSON</token>
<token attribute={etc}>INITIALW ORD</token>
<token attribute={etc}>CAPSPERSON</token>
</internal>
</rule>
*/
SystemT’s Person extractor
SystemT’s Person extractor
create view StrictCapsPersonR as
select R.match as name
--from Regex(/bp{Lu}p{M}*(p{L}p{M}*){1,20}b/, CapsPersonNoP.name) R;
from RegexTok(/p{Lu}p{M}*(p{L}p{M}*){1,20}/, 1, CapsPersonNoP.name) R;
create view StrictLastName5 as
select D.match as lastname
from Dictionary('names/strictLast_italy.dict', Doc.text) D
where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match);
-- new entries
-- France first name from blue pages
create view StrictFirstName6 as
select D.match as firstname
from Dictionary('names/strictFirst_france.dict', Doc.text) D
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);
-- Israel first name from blue pages
create view StrictFirstName9 as
select D.match as firstname
from Dictionary('names/strictFirst_israel.dict', Doc.text) D
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);
'Pro','Bono','Enterprises','Group','Said','Says','Assistant','Vice
'Let', 'Corp', 'Memorial', 'You', 'Your', 'Our', 'My',
','Warden','Contribution',
'His','Her',
'Research', 'Development', 'Product', 'Sales', 'Support',
'Their','Popcorn', 'Name', 'July', 'June','Join',
'Manager', 'Telephone', 'Phone', 'Contact', 'Information',
'Business', 'Administrative', 'South', 'Members',
'Electronics','Managed','West','East','North','South',
'Address', 'Please', 'List',
'Teaches','Ministry', 'Church', 'Association', 'Laboratories',
'Public', 'Inc', 'Parkway',
'Living', 'Community', 'Visiting', 'Brother', 'Buy', 'Then',
'Officer', 'After', 'Pls', 'FYI', 'Only', 'Additionally', 'Adding',
'Services', 'Statements',
'Acquire', 'Addition', 'America', 'Commissioner',
'President', 'Governor',
-- short phrases that are likely to be at the start of a sentence
'Commitment', 'Commits', 'Hey',
'Yes', 'No', 'Ja','End', 'Exit', 'Experiences', 'Finance',
'Director', 'Nein','Kein', 'Keine', 'Gegenstimme',
-- TODO: to be double checked
'Elementary', 'W ednesday', 'At', 'Athletes', 'It', 'Enron',
'Another', 'Anyway','Associate',
'Nov', 'Infrastructure', 'Inside', 'Convention',
'EnronXGate', 'Have', 'However',
'Judge', 'Lady', 'Friday', 'Project',
'Company', 'Companies', 'IBM','Annual', 'Projected',
'Recalls', 'Regards', 'Recently', 'Administration',
-- common verbs appear with person names in financial
reports
'Independence', 'Denied',
-- ideally we want to have a general comprehensive verb list
'Unfortunately', 'Under', 'Uncle', 'Utility', 'Unlike',
to 'W as', a filter dictionary
use as 'Were', 'Secretary',
'Joins', 'Downgrades', 'Upgrades', 'Reports', 'Sees',
'Speaker', 'Chairman', 'Consider', 'Consultant',
'Warns', 'Announces', 'Reviews'
'County', 'Court', 'Defensive',
-- Laura 06/02/2009: new filter dict for title for SEC domain in
'Northwestern',
filterPerson_title.dict 'Place', 'Hi', 'Futures', 'Athlete',
); 'Invitational', 'System',
'International', 'Main', 'Online', 'Ideally'
-- Italy first name from blue pages
create view StrictFirstName5 as
select D.match as firstname
from Dictionary('names/strictFirst_italy.dict', Doc.text) D
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);
--============================================================
--TODO: need to think through how to deal with hypened name
-- one way to do so is to run Regex(pattern, CP.name) and enforce CP.name does not contain '
-- need more testing before confirming the change
create view StrictLastName2 as
select D.match as lastname
from Dictionary('strictLast_german.dict', Doc.text) D
--where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/, D.match);
--where MatchesRegex(/p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match);
create dictionary GreetingsDict as
-- more entries
(
,'If','Our', 'About', 'Analyst', 'On', 'Of', 'By', 'HR',
'Hey', 'Hi', 'Hello', 'Dear',
'Mkt', 'Pre', 'Post',
-- German greetings 'Ice', 'Surname', 'Lastname',
'Condominium',
'Liebe', 'Lieber', 'Herr', 'Frau', 'Hallo',
'firstname', 'Name', 'familyname',
-- Italian
-- Italian greeting
'Ciao',
'Ciao',
-- Spanish
'Hola',
-- Spanish greeting
-- French
'Hola',
'Bonjour'
-- French greeting
); 'Bonjour',
-- german first name from blue page
create view StrictFirstName4 as
select D.match as firstname
from Dictionary('strictFirst_german_bluePages.dict', Doc.text)
D
--where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/,
D.match);
--where MatchesRegex(/p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);
-- Find strict capitalized words with two letter or more (relaxed version of StrictCapsPerson)
'President', 'Governor', 'Commissioner', 'Commitment',
--include 'core/GenericNE/Person.aql';
'Commits', 'Hey',
'Director', 'End', 'Exit', 'Experiences', 'Finance',
'Elementary', 'Wednesday',
'Nov', 'Infrastructure', 'Inside', 'Convention',
'Judge', 'Lady', 'Friday', 'Project', 'Projected',
create dictionary FilterPersonDict as
'Recalls', 'Regards', 'Recently', 'Administration',
(
'Independence', 'Denied',
'Travel', 'Fellow', 'Sir', 'IBMer', 'Researcher',
'Unfortunately', 'Under', 'Uncle', 'Utility', 'Unlike', 'Was',
'All','Tell',
'Were', 'Secretary',
'Speaker', 'Chairman', 'Consider', 'Consultant', 'County',
'Friends', 'Friend', 'Colleague', 'Colleagues',
'Court', 'Defensive',
'Managers','If',
'Northwestern', 'Place', 'Hi', 'Futures', 'Athlete', 'Invitational',
'Customer', 'Users', 'User', 'Valued', 'Executive',
'System',
'Chairs',
'International', 'Main', 'Online', 'Ideally'
'New', 'Owner', 'Conference', 'Please', 'Outlook',
-- more entries
'Lotus', 'Notes', 'Analyst', 'On', 'Of', 'By', 'HR', 'Mkt', 'Pre',
,'If','Our', 'About',
'This', 'That', 'There', 'Here', 'Subscribers', 'W hat',
'Post',
'W hen', 'Where', 'Which',
'Condominium', 'Ice', 'Surname', 'Lastname', 'firstname',
'Name', 'familyname', 'Thanks', 'Thanksgiving','Senator',
'W ith', 'While',
-- Italian greeting
'Platinum', 'Perspective',
'Ciao',
'Manager', 'Ambassador', 'Professor', 'Dear',
-- Spanish greeting 'Athelet',
'Contact', 'Cheers',
'Hola',
'And', 'Act', 'But', 'Hello', 'Call', 'From', 'Center',
-- French greeting
'The', 'Take', 'Junior',
'Bonjour',
'Both', 'Communities', 'Greetings', 'Hope',
-- new entries
'Restaurants', 'Properties',
-- nick names for US first names
create view StrictFirstName3 as
select D.match as firstname
from Dictionary('strictNickName.dict', Doc.text) D
--where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/,
D.match);
--where MatchesRegex(/p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);
-- Indian first name from blue pages
-- TODO: still need to clean up the remaining entries
create view StrictFirstName8 as
select D.match as firstname
from Dictionary('names/strictFirst_india.partial.dict', Doc.text)
D
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);
-- German
--include 'core/GenericNE/Person.aql';
'Fon',
'Telefon Geschaeftsstelle',
'Telefon Geschäftsstelle',
create dictionary FilterPersonDict as
'Telefon Zweigstelle',
(
'Telefon Hauptsitz',
'Travel', 'Fellow', 'Sir', 'IBMer', 'Researcher', 'All','Tell',
'Telefon (Geschaeftsstelle)',
'Friends', 'Friend', 'Colleague', 'Colleagues', 'Managers','If',
'Telefon (Geschäftsstelle)',
'Customer', 'Users', 'User', 'Valued', 'Executive', 'Chairs',
'Telefon (Zweigstelle)',
'New', 'Owner', 'Conference', 'Please', 'Outlook', 'Lotus',
'Telefon (Hauptsitz)',
'Notes',
'Telefonnummer',
'This', 'That', 'There', 'Here', 'Subscribers', 'What', 'When',
'Where', 'Which',
'Telefon Geschaeftssitz',
'With', 'While', 'Thanks', 'Thanksgiving','Senator', 'Platinum',
'Telefon Geschäftssitz',
'Perspective', (Geschaeftssitz)',
'Telefon
'Manager', 'Ambassador', 'Professor', 'Dear', 'Contact',
'Telefon (Geschäftssitz)',
'Cheers', 'Athelet',
'Telefon Persönlich',
'And', 'Act', 'But', 'Hello', 'Call', 'From', 'Center', 'The', 'Take',
'Telefon persoenlich',
'Junior',
'Telefon (Persönlich)',
'Both', 'Communities', 'Greetings', 'Hope', 'Restaurants',
'Properties', (persoenlich)',
'Telefon
'Let', 'Corp', 'Memorial', 'You', 'Your', 'Our', 'My', 'His','Her',
'Handy',
'Their','Popcorn', 'Name', 'July', 'June','Join',
'Handy-Nummer',
'Business', 'Administrative', 'South', 'Members', 'Address',
'Telefon arbeit',
'Please', 'List',(arbeit)'
'Telefon
'Public', 'Inc', 'Parkway', 'Brother', 'Buy', 'Then', 'Services',
);
'Statements',
--------------------------------------create view ValidLastNameAll as
select N.lastname as lastname
from LastNameAll N
-- do not allow partially all capitalized words
where Not(MatchesRegex(/(p{Lu}p{M}*)
+-.*([p{Ll}p{Lo}]p{M}*).*/, N.lastname))
and Not(MatchesRegex(/.*([p{Ll}p{Lo}]p{M}*).*(p{Lu}p{M}*)+/, N.lastname));
-- union all the dictionary matches for first names
create view StrictFirstName as
(select S.firstname as firstname from StrictFirstName1 S)
union all
(select S.firstname as firstname from StrictFirstName2 S)
union all
(select S.firstname as firstname from StrictFirstName3 S)
union all
(select S.firstname as firstname from StrictFirstName4 S)
union all
(select S.firstname as firstname from StrictFirstName5 S)
union all
(select S.firstname as firstname from StrictFirstName6 S)
union all
(select S.firstname as firstname from StrictFirstName7 S)
union all
(select S.firstname as firstname from StrictFirstName8 S)
union all
(select S.firstname as firstname from StrictFirstName9 S);
-- Relaxed versions of first name
create view RelaxedFirstName1 as
select CombineSpans(S.firstname, CP.name) as firstname
from StrictFirstName S,
StrictCapsPerson CP
where FollowsTok(S.firstname, CP.name, 1, 1)
and MatchesRegex(/-/, SpanBetween(S.firstname, CP.name));
create view Person1 as
select CombineSpans(CP1.name, CP2.name) as person
from Initial I,
CapsPerson CP1,
InitialWord IW ,
CapsPerson CP2
where FollowsTok(I.initial, CP1.name, 0, 0)
and FollowsTok(CP1.name, IW.word, 0, 0)
and FollowsTok(IW .word, CP2.name, 0, 0);
--and Not(ContainsRegex(/[nr]/, SpanBetween(I.initial, CP2.name)));
-- all the first names
create view FirstNameAll as
(select N.firstname as firstname from StrictFirstName N)
union all
(select N.firstname as firstname from RelaxedFirstName1 N)
union all
(select N.firstname as firstname from RelaxedFirstName2 N);
create view ValidFirstNameAll as
select N.firstname as firstname
from FirstNameAll N
where Not(MatchesRegex(/(p{Lu}p{M}*)
+-.*([p{Ll}p{Lo}]p{M}*).*/, N.firstname))
and Not(MatchesRegex(/.*([p{Ll}p{Lo}]p{M}*).*(p{Lu}p{M}*)+/, N.firstname));
create view FirstName as
select C.firstname as firstname
--from Consolidate(ValidFirstNameAll.firstname) C;
from ValidFirstNameAll C
consolidate on C.firstname;
-- Combine all dictionary matches for both last names and first
names
create view NameDict as
select D.match as name
from Dictionary('name.dict', Doc.text) D
--where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/,
D.match);
--where MatchesRegex(/p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);
create view NameDict1 as
select D.match as name
from Dictionary('names/name_italy.dict', Doc.text) D
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);
create view NameDict2 as
select D.match as name
from Dictionary('names/name_france.dict', Doc.text) D
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);
create view NameDict3 as
select D.match as name
from Dictionary('names/name_spain.dict', Doc.text) D
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);
create view NameDict4 as
select D.match as name
-- relaxed version of Rule4a
-- Yunyao: split the following rules into two to improve
performance
-- TODO: Test case for optimizer
-- create view Person4ar1 as
-- select CombineSpans(CP.name, FN.firstname) as person
--from FirstName FN,
-CapsPerson CP
--where FollowsTok(CP.name, FN.firstname, 1, 1)
--and ContainsRegex(/,/,SpanBetween(CP.name, FN.firstname))
--and Not(M atchesRegex(/(.|n|r)*(.|?|!|'|sat|sin)( )*/,
LeftContext(CP.name, 10)))
--and Not(M atchesRegex(/(?i)(.+fully)/, CP.name))
--and GreaterThan(GetBegin(CP.name), 10);
/**
* Translation for Rule 1a
* Handles names of persons like Mr. Vladimir Putin
*/
/*
<rule annotation=Person id=1a>
<token attribute={etc}INITIAL{etc}>CANYWORD</token>
<internal>
<token attribute={etc}>CAPSPERSON</token>{1,3}
</internal>
</rule>*/
~250 AQL rules
~250 AQL rules
create view RelaxedFirstName2 as
select CombineSpans(CP.name, S.firstname) as firstname
from StrictFirstName S,
StrictCapsPerson CP
where FollowsTok(CP.name, S.firstname, 1, 1)
and MatchesRegex(/-/, SpanBetween(CP.name, S.firstname));
create view Person4ar1temp as
select FN.firstname as firstname, CP.name as name
from FirstName FN,
CapsPerson CP
where FollowsTok(CP.name, FN.firstname, 1, 1)
and ContainsRegex(/,/,SpanBetween(CP.name, FN.firstname));
-- Split into two rules so that single token annotations are serperated from others
-- Single token annotations
create view Person1a1 as
select CP1.name as person
from Initial I,
CapsPerson CP1
where FollowsTok(I.initial, CP1.name, 0, 0)
--- start changing this block
--- disallow allow newline
and Not(ContainsRegex(/[nt]/,SpanBetween(I.initial,CP1.name)))
--- end changing this block
;
-- Yunyao: added 05/09/2008 to match patterns such as "Mr. B. B. Buy"
/*
create view Person1a2 as
select CombineSpans(name.block, CP1.name) as person
from Initial I,
BlockTok(0, 1, 2, InitialW ord.word) name,
CapsPerson CP1
where FollowsTok(I.initial, name.block, 0, 0)
and FollowsTok(name.block, CP1.name, 0, 0)
and Not(ContainsRegex(/[nt]/,CombineSpans(I.initial, CP1.name)));
*/
create view Person1a as
-- (
select P.person as person from Person1a1 P
-- )
-- union all
-- (select P.person as person from Person1a2 P)
;
/*
create view Person1a_more as
select name.block as person
from Initial I,
BlockTok(0, 2, 3, CapsPerson.name) name
where FollowsTok(I.initial, name.block, 0, 0)
and Not(ContainsRegex(/[nt]/,name.block))
--- start changing this block
-- disallow newline
and Not(ContainsRegex(/[nt]/,SpanBetween(I.initial,name.block)))
--- end changing this block
;
*/
/**
* Translation for Rule 3
* Find person names like Thomas B.M. David
*/
/*
<rule annotation=Person id=3>
<internal>
<token attribute={etc}PERSON{etc}>CAPSPERSON</token>
<token attribute={etc}>INITIALWORD</token>
<token attribute={etc}PERSON{etc}>CAPSPERSON</token>
</internal>
</rule>*/
create view Person3 as
select CombineSpans(P1.name, P2.name) as person
from PersonDict P1,
--InitialWord IW,
WeakInitialWord IW,
PersonDict P2
where FollowsTok(P1.name, IW.word, 0, 0)
and FollowsTok(IW.word, P2.name, 0, 0)
and Not(Equals(GetText(P1.name), GetText(P2.name)));
/**
* Translation for Rule 3r1
*
* This relaxed version of rule '3' will find person names like Thomas B.M. David
* But it only insists that the first word is in the person dictionary
*/
/*
<rule annotation=Person id=3r1>
<internal>
<token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token>
<token attribute={etc}>INITIALWORD</token>
<token attribute={etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person4 as
(select P.person as person from Person4WithNewLine P)
minus
(select CombineSpans(P.firstname, P.lastname) as person
from Person4WrongCandidates P
where Equals(GetText(P.firstname), GetText(P.lastname)));
/**
* Translation for Rule4a
* This rule will find person names like Thomas, David
*/
/*
<rule annotation=Person id=4a>
<internal>
<token attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token>
<token attribute={etc}>,</token>
<token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person4a as
select CombineSpans(LN.lastname, FN.firstname) as person
from FirstName FN,
LastName LN
where FollowsTok(LN.lastname, FN.firstname, 1, 1)
and ContainsRegex(/,/, SpanBetween(LN.lastname, FN.firstname));
create view Person4ar1 as
select CombineSpans(P.name, P.firstname) as person
from Person4ar1temp P
where Not(MatchesRegex(/(.|\n|\r)*(\.|\?|!|'|\sat|\sin)( )*/,
LeftContext(P.name, 10))) --'
and Not(MatchesRegex(/(?i)(.+fully)/, P.name))
and GreaterThan(GetBegin(P.name), 10);
create view Person4ar2 as
select CombineSpans(LN.lastname, CP.name) as person
from CapsPerson CP,
LastName LN
where FollowsTok(LN.lastname, CP.name, 0, 1)
and ContainsRegex(/,/,SpanBetween(LN.lastname, CP.name));
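Rules 4a/4ar2 above implement the "&lt;LastName&gt;, &lt;FirstName&gt;" pattern: a last-name match and a first-name match one token apart, with a comma in the gap, combined into a single span. A minimal Python sketch of the same join — the dictionaries and tokenizer here are simplified assumptions for illustration, not SystemT's dictionary machinery:

```python
import re

def person4a(text, first_names, last_names):
    """Sketch of rule 4a: '<LastName>, <FirstName>' joined into one span."""
    # Tokenize into (word, begin, end) triples; punctuation is its own token
    toks = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]
    results = []
    for i in range(len(toks) - 2):
        w1, w2, w3 = toks[i], toks[i + 1], toks[i + 2]
        # FollowsTok(LN, FN, 1, 1) with ',' in SpanBetween(LN, FN)
        if w1[0] in last_names and w2[0] == "," and w3[0] in first_names:
            results.append(text[w1[1]:w3[2]])  # CombineSpans(LN, FN)
    return results
```

The subsequent Person4ar1/Person4ar2 views relax this by requiring only one of the two words to be in a dictionary, then filter false positives with context regexes.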
/**
* Translation for Rule2
*
* This rule handles names of persons like B.M. Thomas
David, where Thomas occurs in some person dictionary
*/
/*
<rule annotation=Person id=2>
<internal>
<token attribute={etc}>INITIALWORD</token>
<token attribute={etc}PERSON{etc}>CAPSPERSON</token>
<token attribute={etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person2 as
select CombineSpans(IW.word, CP.name) as person
from InitialWord IW,
PersonDict P,
CapsPerson CP
where FollowsTok(IW.word, P.name, 0, 0)
and FollowsTok(P.name, CP.name, 0, 0);
/**
* Translation for Rule 2a
*
* The rule handles names of persons like B.M. Thomas David,
where David occurs in some person dictionary
*/
/*
<rule annotation=Person id=2a>
<internal>
<token attribute={etc}>INITIALWORD</token>
<token attribute={etc}>CAPSPERSON</token>
<token attribute={etc}>NEWLINE</token>?
<token attribute={etc}PERSON{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person2a as
select CombineSpans(IW.word, P.name) as person
from InitialWord IW,
CapsPerson CP,
PersonDict P
where FollowsTok(IW.word, CP.name, 0, 0)
and FollowsTok(CP.name, P.name, 0, 0);
/*
<rule annotation=Person id=4r1>
<internal>
<token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token>
<token attribute={etc}>NEWLINE</token>?
<token attribute={etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person4r1 as
select CombineSpans(FN.firstname, CP.name) as person
from FirstName FN,
CapsPerson CP
where FollowsTok(FN.firstname, CP.name, 0, 0);
/**
* Translation for Rule 4r2
*
* This relaxed version of rule '4' will find person names like Thomas, David
* But it only insists that the SECOND word is in some person dictionary
*/
/*
<rule annotation=Person id=4r2>
<token attribute={etc}>ANYWORD</token>
<internal>
<token attribute={etc}>CAPSPERSON</token>
<token attribute={etc}>NEWLINE</token>?
<token attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person4r2 as
select CombineSpans(CP.name, LN.lastname) as person
from CapsPerson CP,
LastName LN
where FollowsTok(CP.name, LN.lastname, 0, 0);
/**
* Translation for Rule 5
*
* This rule will find other single token person first names
*/
/*
<rule annotation=Person id=5>
<internal>
<token attribute={etc}>INITIALWORD</token>?
<token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person5 as
select CombineSpans(IW.word, FN.firstname) as person
from InitialWord IW,
FirstName FN
where FollowsTok(IW.word, FN.firstname, 0, 0);
/**
* Translation for Rule 6
*
* This rule will find other single token person last names
*/
/*
<rule annotation=Person id=6>
<internal>
<token attribute={etc}>INITIALWORD</token>?
<token attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person6 as
select CombineSpans(IW.word, LN.lastname) as person
from InitialWord IW,
LastName LN
where FollowsTok(IW.word, LN.lastname, 0, 0);
--==========================================================
-- End of rules
-- Create final list of names based on all the matches extracted
--==========================================================
/**
* Union all matches found by strong rules, except the ones that come
* directly from dictionary matches
*/
create view PersonStrongWithNewLine as
(select P.person as person from Person1 P)
--union all
-- (select P.person as person from Person1a_more P)
union all
(select P.person as person from Person3 P)
union all
(select P.person as person from Person4 P)
union all
(select P.person as person from Person3P1 P);
create view PersonStrongSingleTokenOnly as
(select P.person as person from Person5 P)
union all
(select P.person as person from Person6 P)
union all
(select P.firstname as person from FirstName P)
union all
(select P.lastname as person from LastName P)
union all
(select P.person as person from Person1a P);
-- Yunyao: added 05/09/2008 to expand person names with suffix
create view PersonStrongSingleTokenOnlyExpanded1 as
select CombineSpans(P.person,S.suffix) as person
from
PersonStrongSingleTokenOnly P,
PersonSuffix S
where
FollowsTok(P.person, S.suffix, 0, 0);
-- Yunyao: added 04/14/2009 to expand single token person name with a single initial
-- extend single token person with a single initial
create view PersonStrongSingleTokenOnlyExpanded2 as
select CombineSpans(R.person, RightContext(R.person, 2)) as person
from PersonStrongSingleTokenOnly R
where MatchesRegex(/ +[\p{Upper}]\b\s*/, RightContext(R.person, 3));
create view PersonStrongSingleToken as
(select P.person as person from PersonStrongSingleTokenOnly P)
union all
(select P.person as person from PersonStrongSingleTokenOnlyExpanded1 P)
union all
(select P.person as person from PersonStrongSingleTokenOnlyExpanded2 P);
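The two "Expanded" views above widen a single-token match to absorb an immediately following suffix (via a PersonSuffix dictionary) or a single trailing initial (via a regex over RightContext). A hedged Python sketch of the same widening — the suffix list, span representation, and function name are assumptions for illustration only:

```python
import re

def expand_person(text, person_span, suffixes=("Jr.", "Sr.", "III")):
    """Widen a (begin, end) person span to absorb a following suffix or a
    single trailing initial, mirroring CombineSpans over RightContext."""
    rest = text[person_span[1]:]
    # Case 1: suffix immediately after (FollowsTok 0,0), e.g. "John Jr."
    for s in suffixes:
        m = re.match(r"\s+" + re.escape(s), rest)
        if m:
            return (person_span[0], person_span[1] + m.end())
    # Case 2: a lone uppercase initial in the 3-char right context, e.g. "John F";
    # the AQL rule extends the span by RightContext(person, 2)
    if re.match(r" +[A-Z]\b\s*", rest[:3]):
        return (person_span[0], person_span[1] + 2)
    return person_span
```
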
/**
* Union all matches found by weak rules
*/
create view PersonWeak1WithNewLine as
(select P.person as person from Person3r1 P)
union all
(select P.person as person from Person3r2 P)
union all
(select P.person as person from Person4r1 P)
union all
(select P.person as person from Person4r2 P)
union all
(select P.person as person from Person2 P)
union all
(select P.person as person from Person2a P)
union all
(select P.person as person from Person3P2 P)
union all
(select P.person as person from Person3P3 P);
-- weak rules that identify (LastName, FirstName)
create view PersonWeak2WithNewLine as
(select P.person as person from Person4a P)
union all
(select P.person as person from Person4ar1 P)
union all
(select P.person as person from Person4ar2 P);
--include 'core/GenericNE/Person-FilterNewLineSingle.aql';
--include 'core/GenericNE/Person-Filter.aql';
create view PersonBase as
(select P.person as person from PersonStrongWithNewLine P)
union all
(select P.person as person from PersonWeak1WithNewLine P)
union all
(select P.person as person from PersonWeak2WithNewLine P);
output view PersonBase;
Editor's notes
To update the collection-centric model, add an auxiliary index + annotation store
Each extraction result is stored with its source document and its associated positions in the document
Basically:
Convert JAPE rule into a relational calculus expression => Big self-join over a table of <word, position> pairs
Generate efficient join plan using (inverted) index access when possible
Some parts still require going back to the document --- we want these high in the operator graph
At a high level, the optimization strategy is very similar to the one in System R, but with novel access methods, novel join algorithms, and a two-dimensional cost model
The document-centric model enables embedding SystemT in a wide variety of applications.
For instance, in Lotus Notes, when a user opens an email, that email message is simultaneously sent to the SystemT runtime, which generates annotations on the fly.
When the email is displayed for the user, the annotations just generated will be displayed as well.
Meanwhile, SystemT can also be embedded as a Map job in a map-reduce framework, which allows the system to scale out and process large volumes of documents.
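The map-reduce embedding described in this note amounts to running the annotator independently inside each map call, since documents are annotated one at a time with no cross-document state. A minimal Python sketch — `annotate` is a toy stand-in for the SystemT runtime, not its real API:

```python
import re

def annotate(document):
    """Toy stand-in for the SystemT runtime: extracts salutation + name
    mentions from a single document."""
    return [m.group() for m in re.finditer(r"\b(?:Mr|Ms|Dr)\. [A-Z][a-z]+", document)]

def map_fn(doc_id, document):
    """A MapReduce map function: each mapper invokes the annotator on its own
    document, so a large collection is annotated in parallel with no shuffle
    needed until the reduce side aggregates results."""
    for annotation in annotate(document):
        yield (doc_id, annotation)
```

Because the annotator is a pure per-document function, the same operator graph can be embedded unchanged in an email client, a mapper, or any other host application.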