This is the deck I made while taking SI/EECS 767 at the University of Michigan in 2006. Although it is a few years old, it remains a useful introduction for people who are new to information extraction.
3. What is “Information Extraction”
As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

Target template (still empty): NAME | TITLE | ORGANIZATION

Courtesy of William W. Cohen
4. What is “Information Extraction”
As a task: filling slots in a database from sub-segments of text.

(Same news excerpt as above.)

Running IE over the text fills the template:

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Software Foundation

Courtesy of William W. Cohen
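To make the target concrete, here is a minimal sketch in Python of the records this example should produce. The dict schema and field names are invented for illustration; the slide prescribes no particular format.

# Hypothetical record format for the filled template; field names are
# illustrative, not from the deck or any particular IE toolkit.
records = [
    {"name": "Bill Gates",       "title": "CEO",     "organization": "Microsoft"},
    {"name": "Bill Veghte",      "title": "VP",      "organization": "Microsoft"},
    {"name": "Richard Stallman", "title": "founder", "organization": "Free Software Foundation"},
]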
5. What is “Information Extraction”
Information Extraction = segmentation + classification + association + clustering

(Same news excerpt as above.)

Segmenting the text and classifying the segments (this step is aka “named entity extraction”) yields the typed mentions:

Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation

Courtesy of William W. Cohen
6. What is “Information Extraction”
Information Extraction = segmentation + classification + association + clustering

(Same news excerpt and extracted mentions as on the previous slide.)

Courtesy of William W. Cohen
7. What is “Information Extraction”
Information Extraction = segmentation + classification + association + clustering

(Same news excerpt and extracted mentions as on the previous slide.)

Courtesy of William W. Cohen
8. What is “Information Extraction”
Information Extraction = segmentation + classification + association + clustering

(Same news excerpt as above.)

Clustering resolves coreferent mentions; the starred entries below all denote the same organization:

* Microsoft Corporation
CEO
Bill Gates
* Microsoft
Gates
* Microsoft
Bill Veghte
* Microsoft
VP
Richard Stallman
founder
Free Software Foundation

The resulting database:

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Software Foundation

Courtesy of William W. Cohen
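The four-step decomposition maps naturally onto a pipeline. The skeleton below is a hypothetical sketch of how the stages compose; every function name and signature is invented for illustration and comes from no real library.

# A hypothetical skeleton of the slide's IE pipeline.

def segment(text):
    """Find candidate sub-segments, e.g. 'Bill Gates', 'Microsoft'."""
    ...

def classify(segments):
    """Type each segment as NAME, TITLE, or ORGANIZATION."""
    ...

def associate(entities):
    """Link entities that fill slots of the same record (who, which title, where)."""
    ...

def cluster(records):
    """Merge coreferent mentions, e.g. 'Microsoft' and 'Microsoft Corporation'."""
    ...

def extract(text):
    return cluster(associate(classify(segment(text))))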
10. Landscape of IE Techniques
(The original slide is a figure contrasting model families, each applied to the sentence "Abraham Lincoln was born in Kentucky.")

Lexicons: look the candidate string up in a list (Alabama, Alaska, …, Wisconsin, Wyoming); is it a member?
Classify Pre-segmented Candidates: a classifier asks "which class?" for each candidate span.
Sliding Window: a classifier asks "which class?" while trying alternate window sizes.
Boundary Models: classifiers detect BEGIN and END boundaries of an entity.
Finite State Machines: find the most likely state sequence (our focus today!).
Context Free Grammars: find the most likely parse (NNP, V, P, NP, VP, PP, S).

Courtesy of William W. Cohen
11. Markov Property
States: S1 = rain, S2 = cloud, S3 = sun

The state of the system at time t+1, q(t+1), is conditionally independent of {q(t-1), q(t-2), …, q1, q0} given q(t).

In other words, the current state alone determines the probability distribution for the next state.

(The original slide shows the three states with transition arrows: rain → cloud with probability 1; cloud → rain and cloud → sun with probability 1/2 each; sun → rain with probability 2/3 and sun → cloud with probability 1/3.)
12. Markov Property
States: S1 = rain, S2 = cloud, S3 = sun

State-transition probabilities (rows: current state; columns: next state, in the order rain, cloud, sun):

  A = [ 0     1     0   ]
      [ 0.5   0     0.5 ]
      [ 0.67  0.33  0   ]

Q: given that today is sunny (i.e., q1 = sun), what is the probability of "sun, cloud" with the model?
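A short Python sketch answering the slide's question with the transition matrix reconstructed above, reading "sun, cloud" as the given sunny day followed by a cloudy one:

# Transition matrix A (rows = current state, columns = next state,
# ordered rain, cloud, sun), as on the slide.
A = [
    [0.0, 1.0, 0.0],    # rain  -> {rain, cloud, sun}
    [0.5, 0.0, 0.5],    # cloud -> {rain, cloud, sun}
    [2/3, 1/3, 0.0],    # sun   -> {rain, cloud, sun}
]
RAIN, CLOUD, SUN = 0, 1, 2

def sequence_prob(states):
    """P(q2, ..., qT | q1) under the first-order Markov assumption."""
    p = 1.0
    for prev, nxt in zip(states, states[1:]):
        p *= A[prev][nxt]
    return p

# Given today is sunny, the probability that tomorrow is cloudy:
print(sequence_prob([SUN, CLOUD]))  # 1/3 ≈ 0.333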
14. IE with Hidden Markov Model
Given a sequence of observations:

  SI/EECS 767 is held weekly at SIN2 .

and a trained HMM with states for course name, location name, and background, find the most likely state sequence (Viterbi):

  argmax_s P(s, o)

Any words generated by the designated "course name" state are extracted as a course name:

  Course name: SI/EECS 767
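Below is a minimal, self-contained Viterbi sketch in Python. The states and probability tables are toy values invented for illustration (the deck specifies none); a real extractor would learn them from labeled course announcements.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence argmax_s P(s, o) for an HMM."""
    # best[s] = (probability, path) of the best path ending in state s
    best = {s: (start_p[s] * emit_p[s].get(obs[0], 1e-9), [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            p, prev = max((best[r][0] * trans_p[r][s], r) for r in states)
            new_best[s] = (p * emit_p[s].get(o, 1e-9), best[prev][1] + [s])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]

# Toy parameters: "course" = course name, "loc" = location, "bg" = background.
states = ["course", "bg", "loc"]
start_p = {"course": 0.6, "bg": 0.3, "loc": 0.1}
trans_p = {
    "course": {"course": 0.5, "bg": 0.4, "loc": 0.1},
    "bg":     {"course": 0.1, "bg": 0.7, "loc": 0.2},
    "loc":    {"course": 0.1, "bg": 0.6, "loc": 0.3},
}
emit_p = {
    "course": {"SI/EECS": 0.5, "767": 0.5},
    "bg":     {"is": 0.3, "held": 0.3, "weekly": 0.2, "at": 0.2},
    "loc":    {"SIN2": 0.9, ".": 0.1},
}
tokens = "SI/EECS 767 is held weekly at SIN2".split()
print(viterbi(tokens, states, start_p, trans_p, emit_p))
# -> ['course', 'course', 'bg', 'bg', 'bg', 'bg', 'loc']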
15. Named Entity Extraction [Bikel et al., 1998]
(The original slide shows the hidden-state topology: start-of-sentence and end-of-sentence states connected to one hidden state per name class, e.g. Person and Org, five other name classes, plus an Other state.)
20. Learning HMM for IE [Seymore, 1999]
Considers labeled, unlabeled, and distantly-labeled data.
21. Some Issues with HMM
• Need to enumerate all possible observation sequences
• Not practical to represent multiple interacting features or long-range dependencies of the observations
• Very strict independence assumptions on the observations
22. Maximum Entropy Markov Models [Lafferty, 2001]

Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations:

  Pr(s_t | x_t) = …

Observation features can be rich and overlapping:
• identity of word
• ends in "-ski"
• is capitalized
• is part of a noun phrase
• is in a list of city names
• is under node X in WordNet
• is in bold font
• is indented
• is in hyperlink anchor
• …

(The original slide shows the chain S_{t-1} → S_t → S_{t+1} over observations O_{t-1}, O_t, O_{t+1}, annotated with example features such as: is "Wisniewski", part of noun phrase, ends in "-ski".)

Courtesy of William W. Cohen
23. MEMM

Same setup and feature set as the previous slide, but now the state depends on the observations and the previous state history:

  Pr(s_t | x_t, s_{t-1}, s_{t-2}, …) = …

Courtesy of William W. Cohen
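A sketch of the maxent idea in Python: one multinomial logistic model scores the next state from binary features of the current observation plus the previous state. The feature names and weights below are invented for illustration; real weights would come from maximum-likelihood training.

import math

def features(token, prev_state):
    # Binary observation features like those listed on the slide,
    # plus the previous state; the names are illustrative.
    return {
        "word=" + token.lower(): 1.0,
        "ends_in_ski": 1.0 if token.endswith("ski") else 0.0,
        "is_capitalized": 1.0 if token[:1].isupper() else 0.0,
        "prev=" + prev_state: 1.0,
    }

def maxent_next_state(token, prev_state, weights, states):
    # Pr(s_t | o_t, s_{t-1}) as a softmax over per-state linear scores.
    f = features(token, prev_state)
    scores = {s: sum(weights.get((s, name), 0.0) * v for name, v in f.items())
              for s in states}
    z = sum(math.exp(v) for v in scores.values())
    return {s: math.exp(v) / z for s, v in scores.items()}

weights = {("person", "is_capitalized"): 1.2, ("person", "ends_in_ski"): 2.0,
           ("other", "prev=other"): 0.5}
print(maxent_next_state("Wisniewski", "other", weights, ["person", "other"]))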
24. HMM vs. MEMM

HMM (generative) models the joint distribution over the state chain S_{t-1} → S_t → S_{t+1} and its observations O_{t-1}, O_t, O_{t+1}:

  Pr(s, o) = ∏_i Pr(s_i | s_{i-1}) · Pr(o_i | s_i)

MEMM (conditional) models the state chain given the observations:

  Pr(s | o) = ∏_i Pr(s_i | s_{i-1}, o_i)
25. Label Bias Problem with MEMM
Consider an MEMM in which state 1 has a single outgoing transition to state 2 (cf. the "rib"/"rob" example in Lafferty et al., 2001):

  Pr(1 2 | r o) = Pr(2 | 1, o) · Pr(1 | r)
  Pr(1 2 | r i) = Pr(2 | 1, i) · Pr(1 | r)

Because state 1 has only one successor, per-state normalization forces Pr(2 | 1, o) = Pr(2 | 1, i) = 1, so

  Pr(1 2 | r o) = Pr(1 2 | r i)

But it should be Pr(1 2 | r o) < Pr(1 2 | r i)!
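A numeric illustration of the argument above, with an assumed score of 0.5 for entering state 1 on observation r:

# State 1 has a single successor, so after per-state normalization
# Pr(2 | 1, o) = 1 for every observation o, and the path score can
# never prefer "ri" over "ro".
Pr_1_given_r = 0.5                      # assumed, for illustration
Pr_2_given_1 = {"o": 1.0, "i": 1.0}     # forced to 1 by local normalization

print(Pr_2_given_1["o"] * Pr_1_given_r)  # Pr(1 2 | r o) = 0.5
print(Pr_2_given_1["i"] * Pr_1_given_r)  # Pr(1 2 | r i) = 0.5  (equal!)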
26. Solve the Label Bias Problem
• Change the state-transition structure of the model
  – Not always practical to change the set of states
• Start with a fully-connected model and let the training procedure figure out a good structure
  – Precludes the use of prior structural knowledge, which is very valuable
29. Conditional Distribution
If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, by the fundamental theorem of random fields is:

  p_θ(y | x) ∝ exp( ∑_{e∈E,k} λ_k f_k(e, y|e, x) + ∑_{v∈V,k} µ_k g_k(v, y|v, x) )

where:
• x is a data sequence
• y is a label sequence
• v is a vertex from the vertex set V (the set of label random variables)
• e is an edge from the edge set E over V
• f_k and g_k are given and fixed; g_k is a Boolean vertex feature, f_k is a Boolean edge feature
• k indexes the features
• θ = (λ_1, λ_2, …, λ_n; µ_1, µ_2, …, µ_n); the λ_k and µ_k are parameters to be estimated
• y|e is the set of components of y defined by edge e
• y|v is the set of components of y defined by vertex v
30. Conditional Distribution
• CRFs use the observation-dependent normalization Z(x) for the conditional distributions:

  p_θ(y | x) = (1 / Z(x)) · exp( ∑_{e∈E,k} λ_k f_k(e, y|e, x) + ∑_{v∈V,k} µ_k g_k(v, y|v, x) )

  Z(x) is a normalization factor over the data sequence x
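For a short chain, this distribution can be computed by brute force: score every label sequence and normalize by Z(x). The Python sketch below does exactly that; the flat edge weight and the capitalization cue stand in for the weighted feature sums λ_k·f_k and µ_k·g_k and are invented for illustration.

import math
from itertools import product

LABELS = ["B", "I", "O"]

def score(y_seq, x_seq):
    # Stand-ins for the weighted sums: a flat edge score per transition
    # (the lambda_k f_k part) and a capitalization cue per vertex
    # (the mu_k g_k part).
    s = 0.1 * (len(y_seq) - 1)
    s += sum(1.0 for y, x in zip(y_seq, x_seq)
             if (y != "O") == x[:1].isupper())
    return s

def p_y_given_x(y_seq, x_seq):
    # p_theta(y | x) = exp(score(y, x)) / Z(x), Z(x) summed by brute force.
    z = sum(math.exp(score(y, x_seq))
            for y in product(LABELS, repeat=len(x_seq)))
    return math.exp(score(tuple(y_seq), x_seq)) / z

x = ["Bill", "Gates", "spoke"]
print(p_y_given_x(("B", "I", "O"), x))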
31. HMM-like CRF
To train a CRF that behaves like an HMM, use a single feature for each state-state pair (y′, y) and each state-observation pair in the data:

  f_{y′,y}(e = (u, v), y|e, x) = 1 if y_u = y′ and y_v = y, 0 otherwise
  g_{y,x}(v, y|v, x) = 1 if y_v = y and x_v = x, 0 otherwise

(The original slide shows the chain Y_{t-1} → Y_t → Y_{t+1} over observations X_{t-1}, X_t, X_{t+1}.)

λ_{y′,y} and µ_{y,x} are equivalent to the logarithms of the HMM transition probability Pr(y′ | y) and observation probability Pr(x | y).
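A small sketch of these indicator features in Python; the feature templates and the example weights (set to logs of assumed HMM probabilities, per the slide) are illustrative:

import math

def edge_feature(y_prime, y):
    """f_{y',y}: fires when the edge's labels are exactly (y', y)."""
    return lambda yu, yv: 1.0 if (yu, yv) == (y_prime, y) else 0.0

def vertex_feature(y, x):
    """g_{y,x}: fires when the vertex label is y and its observation is x."""
    return lambda yv, xv: 1.0 if (yv, xv) == (y, x) else 0.0

# Weights set to log HMM probabilities reproduce the HMM's scores
# (probability values here are assumed, for illustration only).
lam = {("other", "person"): math.log(0.2)}   # λ_{y',y}: log transition prob
mu  = {("person", "Bill"): math.log(0.4)}    # µ_{y,x}:  log observation prob

f = edge_feature("other", "person")
print(f("other", "person"))   # 1.0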
32. HMM-like CRF
For a chain structure, the conditional probability of a label sequence can be expressed in matrix form. For each position i in the observed sequence x, define the matrix

  M_i(y′, y | x) = exp( ∑_k λ_k f_k(e_i, y|e_i = (y′, y), x) + ∑_k µ_k g_k(v_i, y|v_i = y, x) )

where e_i is the edge with labels (y_{i-1}, y_i) and v_i is the vertex with label y_i.
33. HMM-like CRF
The normalization function is the (start, stop) entry of the product of these matrices:

  Z(x) = ( M_1(x) M_2(x) ⋯ M_{n+1}(x) )_{start, stop}

The conditional probability of a label sequence y is then:

  p_θ(y | x) = (1 / Z(x)) ∏_{i=1}^{n+1} M_i(y_{i-1}, y_i | x)

where y_0 = start and y_{n+1} = stop.
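A Python sketch of the matrix form: build one matrix per position from an illustrative potential (a stand-in for the trained weighted feature sums, not real values), then read Z(x) off the (start, stop) entry of their product.

import math

LABELS = ["start", "B", "O", "stop"]

def psi(y_prev, y, tok):
    # Illustrative potential standing in for sum_k lambda_k f_k + sum_k mu_k g_k.
    if y == "start" or y_prev == "stop":
        return -1e9                    # never re-enter start or leave stop
    if tok is None:                    # position n+1 must move into "stop"
        return 0.0 if y == "stop" else -1e9
    if y == "stop":
        return -1e9                    # "stop" is reserved for the end
    return 0.1 + (1.0 if (tok[:1].isupper() and y == "B") else 0.0)

def M(tok):
    # M_i[y', y] = exp(edge score + vertex score) at this position.
    return [[math.exp(psi(a, b, tok)) for b in LABELS] for a in LABELS]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def Z(x_seq):
    prod = M(x_seq[0])
    for tok in x_seq[1:] + [None]:     # None marks the (n+1)-st position
        prod = matmul(prod, M(tok))
    return prod[LABELS.index("start")][LABELS.index("stop")]

print(Z(["Bill", "Gates", "spoke"]))   # (start, stop) entry of M_1 ... M_{n+1}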
34. Parameter Estimation
The problem: determine the parameters θ = (λ_1, λ_2, …; µ_1, µ_2, …) from training data D = {(x^(j), y^(j))}, j = 1…N, with empirical distribution p̃(x, y).

The goal: maximize the log-likelihood objective function

  O(θ) = ∑_{x,y} p̃(x, y) log p_θ(y | x)
35. Parameter Estimation – Iterative Scaling Algorithms
Update the weights as λ_k ← λ_k + δλ_k and µ_k ← µ_k + δµ_k, for appropriately chosen δλ_k and δµ_k.

The update δλ_k for an edge feature f_k is the solution of

  Ẽ[f_k] = ∑_{x,y} p̃(x) p(y | x) ∑_i f_k(e_i, y|e_i, x) · exp(δλ_k · T(x, y))

T(x, y) is a global property of (x, y), and efficiently computing the right-hand side of this equation is a problem.
36. Algorithm S
Define the slack feature

  s(x, y) = S − ∑_i ∑_k f_k(e_i, y|e_i, x) − ∑_i ∑_k g_k(v_i, y|v_i, x)

where S is a constant large enough that s(x, y) ≥ 0 for all y and all training observations x, so that T(x, y) = S becomes a global constant.

For each index i = 0, …, n+1, define the forward vectors α_i(x), with base case α_0(y | x) = 1 if y = start and 0 otherwise, and recurrence α_i(x) = α_{i-1}(x) M_i(x); and the backward vectors β_i(x), with base case β_{n+1}(y | x) = 1 if y = stop and 0 otherwise, and recurrence β_i(x) = M_{i+1}(x) β_{i+1}(x).
37. Algorithm S
=
=
∑
x
n
p (x )∑
n =1
∑
y 'y
p ( y ' , y , x ) f k ( e i , y | e i = ( y ', y ) , x )
α i−1 ( y '| x ) M i ( y ' , y | x ) β i ( y | x )
p ( y ', y , x ) =
Z (x)
=
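Continuing the matrix sketch from slide 33, the forward and backward recurrences look like this in Python (the functions take any list of per-position matrices, such as those produced by M above):

def forward_vectors(Ms, start_idx):
    # alpha_0 is one-hot at "start"; alpha_i = alpha_{i-1} * M_i.
    alpha = [[0.0] * len(Ms[0])]
    alpha[0][start_idx] = 1.0
    for Mi in Ms:
        prev = alpha[-1]
        alpha.append([sum(prev[a] * Mi[a][b] for a in range(len(Mi)))
                      for b in range(len(Mi))])
    return alpha

def backward_vectors(Ms, stop_idx):
    # beta_{n+1} is one-hot at "stop"; beta_i = M_{i+1} * beta_{i+1}.
    beta = [[0.0] * len(Ms[0])]
    beta[0][stop_idx] = 1.0
    for Mi in reversed(Ms):
        nxt = beta[0]
        beta.insert(0, [sum(Mi[a][b] * nxt[b] for b in range(len(Mi)))
                        for a in range(len(Mi))])
    return beta

# Sanity check: alpha[-1][stop_idx] == beta[0][start_idx] == Z(x), and the
# pairwise marginal above is alpha[i-1][y'] * M_i[y'][y] * beta[i][y] / Z(x).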
38. Algorithm S
The rate of convergence is governed by the step size, which is inversely proportional to the constant S. But S is generally quite large, resulting in slow convergence.
39. Algorithm T
Keeps track of partial T totals: it accumulates feature expectations into counters indexed by T(x).

Uses forward-backward recurrences to compute the expectations a_{k,t} of feature f_k and b_{k,t} of feature g_k given that T(x) = t.
40. Experiments
• Modeling the label bias problem
  – 2,000 training and 500 test samples generated by an HMM
  – CRF error is 4.6%
  – MEMM error is 42%
⇒ The CRF solves the label bias problem.
44. CRF vs. HMM
Each open square represents a data set with α < ½, and a solid square indicates a data set with α ≥ ½. When the data is mostly second-order (α ≥ ½), the discriminatively trained CRF usually outperforms the MEMM.
45. POS Tagging Experiments
• First-order HMM, MEMM, and CRF models
• Data set: Penn Treebank
• 50%-50% train-test split
• The MEMM parameter vector is used as a starting point for training the corresponding CRF, to accelerate convergence.
46. Interactive IE using CRF
An interactive parser updates the IE results according to the user's changes. Color coding is used to alert the user to ambiguity in the IE output.
47. Some IE Tools Available
• MALLET (UMass)
  – statistical natural language processing
  – document classification
  – clustering
  – information extraction
  – other machine learning applications to text
• Sample application:
  GeneTaggerCRF, a gene-entity tagger based on MALLET (MAchine Learning for LanguagE Toolkit). It uses conditional random fields to find genes in a text file.
48. MinorThird
• http://minorthird.sourceforge.net/
• "a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text"
• Stored documents can be annotated in independent files using TextLabels (denoting, say, part-of-speech and semantic information)
49. GATE
• http://gate.ac.uk/ie/annie.html
• A leading toolkit for text mining
• Distributed with an information extraction component set called ANNIE (demo)
• Used in many research projects
  – A long list can be found on its website
  – Integration with IBM UIMA is in progress
50. Sunita Sarawagi's CRF Package
• http://crf.sourceforge.net/
• A Java implementation of conditional random fields for sequential labeling
51. UIMA (IBM)
• Unstructured Information Management Architecture
  – A platform for building unstructured-information management solutions from combinations of semantic analysis (IE) and search components
52. Some Interesting Websites Based on IE
• ZoomInfo
• CiteSeer.org (some of us use it every day!)
• Google Local, Google Scholar
• and many more…