This is the deck I made while taking SI/EECS 767 at the University of Michigan in 2006. Although it is a few years old, it remains a useful introduction for people who are new to information extraction.
3. What is “Information Extraction”
As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

Target template (still empty): NAME | TITLE | ORGANIZATION

Courtesy of William W. Cohen
4. What is “Information Extraction”
As a task: filling slots in a database from sub-segments of text.

(Same news excerpt as above.)

Running IE over the text fills the template:

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Software Foundation

Courtesy of William W. Cohen
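To make the target concrete, here is a minimal sketch in Python of the records this example should produce. The dict schema and field names are invented for illustration; the slide prescribes no particular format.

# Hypothetical record format for the filled template; field names are
# illustrative, not from the deck or any particular IE toolkit.
records = [
    {"name": "Bill Gates",       "title": "CEO",     "organization": "Microsoft"},
    {"name": "Bill Veghte",      "title": "VP",      "organization": "Microsoft"},
    {"name": "Richard Stallman", "title": "founder", "organization": "Free Software Foundation"},
]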
5. What is “Information Extraction”
Information Extraction = segmentation + classification + association + clustering

(Same news excerpt as above.)

Segmenting the text and classifying the segments (this step is aka “named entity extraction”) yields the typed mentions:

Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation

Courtesy of William W. Cohen
6. What is “Information Extraction”
Information Extraction = segmentation + classification + association + clustering

(Same news excerpt and extracted mentions as on the previous slide.)

Courtesy of William W. Cohen
7. What is “Information Extraction”
Information Extraction = segmentation + classification + association + clustering

(Same news excerpt and extracted mentions as on the previous slide.)

Courtesy of William W. Cohen
8. What is “Information Extraction”
Information Extraction = segmentation + classification + association + clustering

(Same news excerpt as above.)

Clustering resolves coreferent mentions; the starred entries below all denote the same organization:

* Microsoft Corporation
CEO
Bill Gates
* Microsoft
Gates
* Microsoft
Bill Veghte
* Microsoft
VP
Richard Stallman
founder
Free Software Foundation

The resulting database:

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Software Foundation

Courtesy of William W. Cohen
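The four-step decomposition maps naturally onto a pipeline. The skeleton below is a hypothetical sketch of how the stages compose; every function name and signature is invented for illustration and comes from no real library.

# A hypothetical skeleton of the slide's IE pipeline.

def segment(text):
    """Find candidate sub-segments, e.g. 'Bill Gates', 'Microsoft'."""
    ...

def classify(segments):
    """Type each segment as NAME, TITLE, or ORGANIZATION."""
    ...

def associate(entities):
    """Link entities that fill slots of the same record (who, which title, where)."""
    ...

def cluster(records):
    """Merge coreferent mentions, e.g. 'Microsoft' and 'Microsoft Corporation'."""
    ...

def extract(text):
    return cluster(associate(classify(segment(text))))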
10. Landscape of IE Techniques
(The original slide is a figure contrasting model families, each applied to the sentence "Abraham Lincoln was born in Kentucky.")

Lexicons: look the candidate string up in a list (Alabama, Alaska, …, Wisconsin, Wyoming); is it a member?
Classify Pre-segmented Candidates: a classifier asks "which class?" for each candidate span.
Sliding Window: a classifier asks "which class?" while trying alternate window sizes.
Boundary Models: classifiers detect BEGIN and END boundaries of an entity.
Finite State Machines: find the most likely state sequence (our focus today!).
Context Free Grammars: find the most likely parse (NNP, V, P, NP, VP, PP, S).

Courtesy of William W. Cohen
11. Markov Property
States: S1 = rain, S2 = cloud, S3 = sun

The state of the system at time t+1, q(t+1), is conditionally independent of {q(t-1), q(t-2), …, q1, q0} given q(t).

In other words, the current state alone determines the probability distribution for the next state.

(The original slide shows the three states with transition arrows: rain → cloud with probability 1; cloud → rain and cloud → sun with probability 1/2 each; sun → rain with probability 2/3 and sun → cloud with probability 1/3.)
12. Markov Property
States: S1 = rain, S2 = cloud, S3 = sun

State-transition probabilities (rows: current state; columns: next state, in the order rain, cloud, sun):

  A = [ 0     1     0   ]
      [ 0.5   0     0.5 ]
      [ 0.67  0.33  0   ]

Q: given that today is sunny (i.e., q1 = sun), what is the probability of "sun, cloud" with the model?
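A short Python sketch answering the slide's question with the transition matrix reconstructed above, reading "sun, cloud" as the given sunny day followed by a cloudy one:

# Transition matrix A (rows = current state, columns = next state,
# ordered rain, cloud, sun), as on the slide.
A = [
    [0.0, 1.0, 0.0],    # rain  -> {rain, cloud, sun}
    [0.5, 0.0, 0.5],    # cloud -> {rain, cloud, sun}
    [2/3, 1/3, 0.0],    # sun   -> {rain, cloud, sun}
]
RAIN, CLOUD, SUN = 0, 1, 2

def sequence_prob(states):
    """P(q2, ..., qT | q1) under the first-order Markov assumption."""
    p = 1.0
    for prev, nxt in zip(states, states[1:]):
        p *= A[prev][nxt]
    return p

# Given today is sunny, the probability that tomorrow is cloudy:
print(sequence_prob([SUN, CLOUD]))  # 1/3 ≈ 0.333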
14. IE with Hidden Markov Model
Given a sequence of observations:

  SI/EECS 767 is held weekly at SIN2 .

and a trained HMM with states for course name, location name, and background, find the most likely state sequence (Viterbi):

  argmax_s P(s, o)

Any words generated by the designated "course name" state are extracted as a course name:

  Course name: SI/EECS 767
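Below is a minimal, self-contained Viterbi sketch in Python. The states and probability tables are toy values invented for illustration (the deck specifies none); a real extractor would learn them from labeled course announcements.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence argmax_s P(s, o) for an HMM."""
    # best[s] = (probability, path) of the best path ending in state s
    best = {s: (start_p[s] * emit_p[s].get(obs[0], 1e-9), [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            p, prev = max((best[r][0] * trans_p[r][s], r) for r in states)
            new_best[s] = (p * emit_p[s].get(o, 1e-9), best[prev][1] + [s])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]

# Toy parameters: "course" = course name, "loc" = location, "bg" = background.
states = ["course", "bg", "loc"]
start_p = {"course": 0.6, "bg": 0.3, "loc": 0.1}
trans_p = {
    "course": {"course": 0.5, "bg": 0.4, "loc": 0.1},
    "bg":     {"course": 0.1, "bg": 0.7, "loc": 0.2},
    "loc":    {"course": 0.1, "bg": 0.6, "loc": 0.3},
}
emit_p = {
    "course": {"SI/EECS": 0.5, "767": 0.5},
    "bg":     {"is": 0.3, "held": 0.3, "weekly": 0.2, "at": 0.2},
    "loc":    {"SIN2": 0.9, ".": 0.1},
}
tokens = "SI/EECS 767 is held weekly at SIN2".split()
print(viterbi(tokens, states, start_p, trans_p, emit_p))
# -> ['course', 'course', 'bg', 'bg', 'bg', 'bg', 'loc']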
15. Named Entity Extraction [Bikel et al., 1998]
(The original slide shows the hidden-state topology: start-of-sentence and end-of-sentence states connected to one hidden state per name class, e.g. Person and Org, five other name classes, plus an Other state.)
20. Learning HMM for IE [Seymore, 1999]
Considers labeled, unlabeled, and distantly-labeled data.
21. Some Issues with HMM
• Need to enumerate all possible observation sequences
• Not practical to represent multiple interacting features or long-range dependencies of the observations
• Very strict independence assumptions on the observations
22. Maximum Entropy Markov Models [Lafferty, 2001]

Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations:

  Pr(s_t | x_t) = …

Observation features can be rich and overlapping:
• identity of word
• ends in "-ski"
• is capitalized
• is part of a noun phrase
• is in a list of city names
• is under node X in WordNet
• is in bold font
• is indented
• is in hyperlink anchor
• …

(The original slide shows the chain S_{t-1} → S_t → S_{t+1} over observations O_{t-1}, O_t, O_{t+1}, annotated with example features such as: is "Wisniewski", part of noun phrase, ends in "-ski".)

Courtesy of William W. Cohen
23. MEMM

Same setup and feature set as the previous slide, but now the state depends on the observations and the previous state history:

  Pr(s_t | x_t, s_{t-1}, s_{t-2}, …) = …

Courtesy of William W. Cohen
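A sketch of the maxent idea in Python: one multinomial logistic model scores the next state from binary features of the current observation plus the previous state. The feature names and weights below are invented for illustration; real weights would come from maximum-likelihood training.

import math

def features(token, prev_state):
    # Binary observation features like those listed on the slide,
    # plus the previous state; the names are illustrative.
    return {
        "word=" + token.lower(): 1.0,
        "ends_in_ski": 1.0 if token.endswith("ski") else 0.0,
        "is_capitalized": 1.0 if token[:1].isupper() else 0.0,
        "prev=" + prev_state: 1.0,
    }

def maxent_next_state(token, prev_state, weights, states):
    # Pr(s_t | o_t, s_{t-1}) as a softmax over per-state linear scores.
    f = features(token, prev_state)
    scores = {s: sum(weights.get((s, name), 0.0) * v for name, v in f.items())
              for s in states}
    z = sum(math.exp(v) for v in scores.values())
    return {s: math.exp(v) / z for s, v in scores.items()}

weights = {("person", "is_capitalized"): 1.2, ("person", "ends_in_ski"): 2.0,
           ("other", "prev=other"): 0.5}
print(maxent_next_state("Wisniewski", "other", weights, ["person", "other"]))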
24. HMM vs. MEMM

HMM (generative) models the joint distribution over the state chain S_{t-1} → S_t → S_{t+1} and its observations O_{t-1}, O_t, O_{t+1}:

  Pr(s, o) = ∏_i Pr(s_i | s_{i-1}) · Pr(o_i | s_i)

MEMM (conditional) models the state chain given the observations:

  Pr(s | o) = ∏_i Pr(s_i | s_{i-1}, o_i)
25. Label Bias Problem with MEMM
Consider an MEMM in which state 1 has a single outgoing transition to state 2 (cf. the "rib"/"rob" example in Lafferty et al., 2001):

  Pr(1 2 | r o) = Pr(2 | 1, o) · Pr(1 | r)
  Pr(1 2 | r i) = Pr(2 | 1, i) · Pr(1 | r)

Because state 1 has only one successor, per-state normalization forces Pr(2 | 1, o) = Pr(2 | 1, i) = 1, so

  Pr(1 2 | r o) = Pr(1 2 | r i)

But it should be Pr(1 2 | r o) < Pr(1 2 | r i)!
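A numeric illustration of the argument above, with an assumed score of 0.5 for entering state 1 on observation r:

# State 1 has a single successor, so after per-state normalization
# Pr(2 | 1, o) = 1 for every observation o, and the path score can
# never prefer "ri" over "ro".
Pr_1_given_r = 0.5                      # assumed, for illustration
Pr_2_given_1 = {"o": 1.0, "i": 1.0}     # forced to 1 by local normalization

print(Pr_2_given_1["o"] * Pr_1_given_r)  # Pr(1 2 | r o) = 0.5
print(Pr_2_given_1["i"] * Pr_1_given_r)  # Pr(1 2 | r i) = 0.5  (equal!)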
26. Solve the Label Bias Problem
• Change the state-transition structure of the model
  – Not always practical to change the set of states
• Start with a fully-connected model and let the training procedure figure out a good structure
  – Precludes the use of prior structural knowledge, which is very valuable
29. Conditional Distribution
If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, by the fundamental theorem of random fields is:

  p_θ(y | x) ∝ exp( ∑_{e∈E,k} λ_k f_k(e, y|e, x) + ∑_{v∈V,k} µ_k g_k(v, y|v, x) )

where:
• x is a data sequence
• y is a label sequence
• v is a vertex from the vertex set V (the set of label random variables)
• e is an edge from the edge set E over V
• f_k and g_k are given and fixed; g_k is a Boolean vertex feature, f_k is a Boolean edge feature
• k indexes the features
• θ = (λ_1, λ_2, …, λ_n; µ_1, µ_2, …, µ_n); the λ_k and µ_k are parameters to be estimated
• y|e is the set of components of y defined by edge e
• y|v is the set of components of y defined by vertex v
30. Conditional Distribution
• CRFs use the observation-dependent normalization Z(x) for the conditional distributions:

  p_θ(y | x) = (1 / Z(x)) · exp( ∑_{e∈E,k} λ_k f_k(e, y|e, x) + ∑_{v∈V,k} µ_k g_k(v, y|v, x) )

  Z(x) is a normalization factor over the data sequence x
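For a short chain, this distribution can be computed by brute force: score every label sequence and normalize by Z(x). The Python sketch below does exactly that; the flat edge weight and the capitalization cue stand in for the weighted feature sums λ_k·f_k and µ_k·g_k and are invented for illustration.

import math
from itertools import product

LABELS = ["B", "I", "O"]

def score(y_seq, x_seq):
    # Stand-ins for the weighted sums: a flat edge score per transition
    # (the lambda_k f_k part) and a capitalization cue per vertex
    # (the mu_k g_k part).
    s = 0.1 * (len(y_seq) - 1)
    s += sum(1.0 for y, x in zip(y_seq, x_seq)
             if (y != "O") == x[:1].isupper())
    return s

def p_y_given_x(y_seq, x_seq):
    # p_theta(y | x) = exp(score(y, x)) / Z(x), Z(x) summed by brute force.
    z = sum(math.exp(score(y, x_seq))
            for y in product(LABELS, repeat=len(x_seq)))
    return math.exp(score(tuple(y_seq), x_seq)) / z

x = ["Bill", "Gates", "spoke"]
print(p_y_given_x(("B", "I", "O"), x))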
31. HMM-like CRF
To train a CRF that behaves like an HMM, use a single feature for each state-state pair (y′, y) and each state-observation pair in the data:

  f_{y′,y}(e = (u, v), y|e, x) = 1 if y_u = y′ and y_v = y, 0 otherwise
  g_{y,x}(v, y|v, x) = 1 if y_v = y and x_v = x, 0 otherwise

(The original slide shows the chain Y_{t-1} → Y_t → Y_{t+1} over observations X_{t-1}, X_t, X_{t+1}.)

λ_{y′,y} and µ_{y,x} are equivalent to the logarithms of the HMM transition probability Pr(y′ | y) and observation probability Pr(x | y).
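A small sketch of these indicator features in Python; the feature templates and the example weights (set to logs of assumed HMM probabilities, per the slide) are illustrative:

import math

def edge_feature(y_prime, y):
    """f_{y',y}: fires when the edge's labels are exactly (y', y)."""
    return lambda yu, yv: 1.0 if (yu, yv) == (y_prime, y) else 0.0

def vertex_feature(y, x):
    """g_{y,x}: fires when the vertex label is y and its observation is x."""
    return lambda yv, xv: 1.0 if (yv, xv) == (y, x) else 0.0

# Weights set to log HMM probabilities reproduce the HMM's scores
# (probability values here are assumed, for illustration only).
lam = {("other", "person"): math.log(0.2)}   # λ_{y',y}: log transition prob
mu  = {("person", "Bill"): math.log(0.4)}    # µ_{y,x}:  log observation prob

f = edge_feature("other", "person")
print(f("other", "person"))   # 1.0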
32. HMM-like CRF
For a chain structure, the conditional probability of a label sequence can be expressed in matrix form. For each position i in the observed sequence x, define the matrix

  M_i(y′, y | x) = exp( ∑_k λ_k f_k(e_i, y|e_i = (y′, y), x) + ∑_k µ_k g_k(v_i, y|v_i = y, x) )

where e_i is the edge with labels (y_{i-1}, y_i) and v_i is the vertex with label y_i.
33. HMM-like CRF
The normalization function is the (start, stop) entry of the product of these matrices:

  Z(x) = ( M_1(x) M_2(x) ⋯ M_{n+1}(x) )_{start, stop}

The conditional probability of a label sequence y is then:

  p_θ(y | x) = (1 / Z(x)) ∏_{i=1}^{n+1} M_i(y_{i-1}, y_i | x)

where y_0 = start and y_{n+1} = stop.
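A Python sketch of the matrix form: build one matrix per position from an illustrative potential (a stand-in for the trained weighted feature sums, not real values), then read Z(x) off the (start, stop) entry of their product.

import math

LABELS = ["start", "B", "O", "stop"]

def psi(y_prev, y, tok):
    # Illustrative potential standing in for sum_k lambda_k f_k + sum_k mu_k g_k.
    if y == "start" or y_prev == "stop":
        return -1e9                    # never re-enter start or leave stop
    if tok is None:                    # position n+1 must move into "stop"
        return 0.0 if y == "stop" else -1e9
    if y == "stop":
        return -1e9                    # "stop" is reserved for the end
    return 0.1 + (1.0 if (tok[:1].isupper() and y == "B") else 0.0)

def M(tok):
    # M_i[y', y] = exp(edge score + vertex score) at this position.
    return [[math.exp(psi(a, b, tok)) for b in LABELS] for a in LABELS]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def Z(x_seq):
    prod = M(x_seq[0])
    for tok in x_seq[1:] + [None]:     # None marks the (n+1)-st position
        prod = matmul(prod, M(tok))
    return prod[LABELS.index("start")][LABELS.index("stop")]

print(Z(["Bill", "Gates", "spoke"]))   # (start, stop) entry of M_1 ... M_{n+1}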
34. Parameter Estimation
The problem: determine the parameters θ = (λ_1, λ_2, …; µ_1, µ_2, …) from training data D = {(x^(j), y^(j))}, j = 1…N, with empirical distribution p̃(x, y).

The goal: maximize the log-likelihood objective function

  O(θ) = ∑_{x,y} p̃(x, y) log p_θ(y | x)
35. Parameter Estimation – Iterative Scaling Algorithms
Update the weights as λ_k ← λ_k + δλ_k and µ_k ← µ_k + δµ_k, for appropriately chosen δλ_k and δµ_k.

The update δλ_k for an edge feature f_k is the solution of

  Ẽ[f_k] = ∑_{x,y} p̃(x) p(y | x) ∑_i f_k(e_i, y|e_i, x) · exp(δλ_k · T(x, y))

T(x, y) is a global property of (x, y), and efficiently computing the right-hand side of this equation is a problem.
36. Algorithm S
Define the slack feature

  s(x, y) = S − ∑_i ∑_k f_k(e_i, y|e_i, x) − ∑_i ∑_k g_k(v_i, y|v_i, x)

where S is a constant large enough that s(x, y) ≥ 0 for all y and all training observations x, so that T(x, y) = S becomes a global constant.

For each index i = 0, …, n+1, define the forward vectors α_i(x), with base case α_0(y | x) = 1 if y = start and 0 otherwise, and recurrence α_i(x) = α_{i-1}(x) M_i(x); and the backward vectors β_i(x), with base case β_{n+1}(y | x) = 1 if y = stop and 0 otherwise, and recurrence β_i(x) = M_{i+1}(x) β_{i+1}(x).
37. Algorithm S
=
=
∑
x
n
p (x )∑
n =1
∑
y 'y
p ( y ' , y , x ) f k ( e i , y | e i = ( y ', y ) , x )
α i−1 ( y '| x ) M i ( y ' , y | x ) β i ( y | x )
p ( y ', y , x ) =
Z (x)
=
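Continuing the matrix sketch from slide 33, the forward and backward recurrences look like this in Python (the functions take any list of per-position matrices, such as those produced by M above):

def forward_vectors(Ms, start_idx):
    # alpha_0 is one-hot at "start"; alpha_i = alpha_{i-1} * M_i.
    alpha = [[0.0] * len(Ms[0])]
    alpha[0][start_idx] = 1.0
    for Mi in Ms:
        prev = alpha[-1]
        alpha.append([sum(prev[a] * Mi[a][b] for a in range(len(Mi)))
                      for b in range(len(Mi))])
    return alpha

def backward_vectors(Ms, stop_idx):
    # beta_{n+1} is one-hot at "stop"; beta_i = M_{i+1} * beta_{i+1}.
    beta = [[0.0] * len(Ms[0])]
    beta[0][stop_idx] = 1.0
    for Mi in reversed(Ms):
        nxt = beta[0]
        beta.insert(0, [sum(Mi[a][b] * nxt[b] for b in range(len(Mi)))
                        for a in range(len(Mi))])
    return beta

# Sanity check: alpha[-1][stop_idx] == beta[0][start_idx] == Z(x), and the
# pairwise marginal above is alpha[i-1][y'] * M_i[y'][y] * beta[i][y] / Z(x).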
38. Algorithm S
The rate of convergence is governed by the step size, which is inversely proportional to the constant S. But S is generally quite large, resulting in slow convergence.
39. Algorithm T
Keeps track of partial T totals: it accumulates feature expectations into counters indexed by T(x).

Uses forward-backward recurrences to compute the expectations a_{k,t} of feature f_k and b_{k,t} of feature g_k given that T(x) = t.
40. Experiments
• Modeling the label bias problem
  – 2,000 training and 500 test samples generated by an HMM
  – CRF error is 4.6%
  – MEMM error is 42%
⇒ The CRF solves the label bias problem.
44. CRF vs. HMM
Each open square represents a data set with α < ½, and a solid square indicates a data set with α ≥ ½. When the data is mostly second-order (α ≥ ½), the discriminatively trained CRF usually outperforms the MEMM.
45. POS Tagging Experiments
• First-order HMM, MEMM, and CRF models
• Data set: Penn Treebank
• 50%-50% train-test split
• The MEMM parameter vector is used as a starting point for training the corresponding CRF, to accelerate convergence.
46. Interactive IE using CRF
An interactive parser updates the IE results according to the user's changes. Color coding is used to alert the user to ambiguity in the IE output.
47. Some IE Tools Available
• MALLET (UMass)
  – statistical natural language processing
  – document classification
  – clustering
  – information extraction
  – other machine learning applications to text
• Sample application:
  GeneTaggerCRF, a gene-entity tagger based on MALLET (MAchine Learning for LanguagE Toolkit). It uses conditional random fields to find genes in a text file.
48. MinorThird
• http://minorthird.sourceforge.net/
• "a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text"
• Stored documents can be annotated in independent files using TextLabels (denoting, say, part-of-speech and semantic information)
49. GATE
• http://gate.ac.uk/ie/annie.html
• A leading toolkit for text mining
• Distributed with an information extraction component set called ANNIE (demo)
• Used in many research projects
  – A long list can be found on its website
  – Integration with IBM UIMA is in progress
50. Sunita Sarawagi's CRF Package
• http://crf.sourceforge.net/
• A Java implementation of conditional random fields for sequential labeling
51. UIMA (IBM)
• Unstructured Information Management Architecture
  – A platform for building unstructured-information management solutions from combinations of semantic analysis (IE) and search components
52. Some Interesting Websites Based on IE
• ZoomInfo
• CiteSeer.org (some of us use it every day!)
• Google Local, Google Scholar
• and many more…