Information Extraction
Yunyao Li
EECS /SI 767
03/29/2006
The Problem
Date
Time: Start - End
Location
Speaker

Person
What is “Information Extraction”
As a task:

Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME              TITLE     ORGANIZATION

Courtesy of William W. Cohen
What is “Information Extraction”
As a task:

Filling slots in a database from sub-segments of text.

IE

NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..

Courtesy of William W. Cohen
What is “Information Extraction”
Information Extraction =
segmentation + classification + association + clustering
aka “named entity extraction”

Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
Courtesy of William W. Cohen
What is “Information Extraction”
Information Extraction =
segmentation + classification + association + clustering

* Microsoft Corporation
CEO
Bill Gates
* Microsoft
Gates
* Microsoft
Bill Veghte
* Microsoft
VP
Richard Stallman
founder
Free Software Foundation

NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..

Courtesy of William W. Cohen
Live Example: Seminar
Landscape of IE Techniques
[Figure: six families of IE techniques, each illustrated on the sentence "Abraham Lincoln was born in Kentucky."]
• Lexicons: is the candidate a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
• Classify Pre-segmented Candidates: a classifier decides which class
• Sliding Window: a classifier decides which class; try alternate window sizes
• Boundary Models: classifiers mark BEGIN and END
• Finite State Machines: most likely state sequence?
• Context Free Grammars: most likely parse? (tags NNP, NNP, V, V, P; constituents NP, PP, VP, S)

Our focus today!
Courtesy of William W. Cohen
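The lexicon and sliding-window ideas above can be combined in a few lines. The sketch below is only an illustration: the function name `lexicon_matches`, the window limit, and the partial state list (including Kentucky, which falls in the part the slide elides) are assumptions, not part of the original deck.

```python
def lexicon_matches(tokens, lexicon, max_window=3):
    """Slide windows of 1..max_window tokens and keep those found in the lexicon."""
    matches = []
    for size in range(1, max_window + 1):
        for start in range(len(tokens) - size + 1):
            candidate = " ".join(tokens[start:start + size])
            if candidate in lexicon:
                # (start index, end index exclusive, matched text)
                matches.append((start, start + size, candidate))
    return matches


tokens = "Abraham Lincoln was born in Kentucky .".split()
# Partial, illustrative lexicon of U.S. state names (an assumption).
us_states = {"Alabama", "Alaska", "Kentucky", "Wisconsin", "Wyoming"}
found = lexicon_matches(tokens, us_states)
print(found)
```

A lexicon alone cannot label "Abraham Lincoln" as a person, which is one reason the remaining techniques use classifiers or sequence models.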
Markov Property
S1: rain
S2: cloud
S3: sun

[Figure: state-transition diagram over the states, with edge probabilities 1/2, 1/2, 1/3, 2/3 and 1]

The state of a system at time t+1, $q_{t+1}$, is conditionally independent of $\{q_{t-1}, q_{t-2}, \ldots, q_1, q_0\}$ given $q_t$.

In other words, the current state determines the probability distribution for the next state.
Markov Property
S1: rain
S2: cloud
S3: sun

[Figure: state-transition diagram over S1, S2, S3 with edge probabilities 1/2, 1/2, 1/3, 2/3 and 1]

State-transition probabilities:

$$A = \begin{pmatrix} 0 & 1 & 0 \\ 0.5 & 0.5 & 0 \\ 0.67 & 0.33 & 0 \end{pmatrix}$$

Q: given today is sunny (i.e., $q_1 = 3$), what is the probability of “sun-cloud” with the model?
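A minimal sketch of how the question can be answered: the probability of a state sequence, given its first state, is the product of the transition probabilities along it. The matrix below is read off the slide, with 2/3 and 1/3 used for the rounded 0.67 and 0.33. The first row is hard to recover from the extracted text, so treat it as an assumption. The function name is hypothetical.

```python
STATES = ["rain", "cloud", "sun"]  # S1, S2, S3

# Row i holds P(next state | current state i). Row 1 is an assumption (see above).
A = [
    [0.0, 1.0, 0.0],      # from rain
    [0.5, 0.5, 0.0],      # from cloud
    [2 / 3, 1 / 3, 0.0],  # from sun
]


def sequence_probability(path, transition):
    """P(path[1:] | path[0]) under the first-order Markov property."""
    p = 1.0
    for cur, nxt in zip(path, path[1:]):
        p *= transition[STATES.index(cur)][STATES.index(nxt)]
    return p


p_sun_cloud = sequence_probability(["sun", "cloud"], A)
print(p_sun_cloud)
```

Under these numbers the answer is simply the entry in row 3, column 2 of A, that is 1/3.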
Hidden Markov Model
S1: rain
S2: cloud
S3: sun

[Figure: state diagram over S1, S2, S3 with probability labels 1/2, 1/10, 1/2, 4/5, 9/10, 1/3, 2/3, 1, 1/5, 3/10, 7/10. The hidden state sequences generate the observations O1, O2, O3, O4, O5.]
IE with Hidden Markov Model
Given a sequence of observations:
SI/EECS 767 is held weekly at SIN2 .

and a trained HMM:

Find the most likely state sequence (Viterbi):

$$\arg\max_{s} P(s, o)$$

States: course name, location name, background

SI/EECS 767 is held weekly at SIN2
Any words generated by the designated “course name” state are extracted as a course name:
Course name: SI/EECS 767
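The decoding step can be sketched as follows. This is a toy Viterbi implementation in log space. All probabilities, the `unseen` floor for unknown words, and the names `viterbi` and `extract` are made-up assumptions for illustration, not a trained model and not the values used in the talk.

```python
import math


def viterbi(tokens, states, start, trans, emit, unseen=1e-4):
    """Return the state sequence maximizing P(s, o) for the given tokens."""
    def log_emit(state, word):
        return math.log(emit[state].get(word, unseen))

    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start[s]) + log_emit(s, tokens[0]) for s in states}]
    back = []
    for word in tokens[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + math.log(trans[p][s]))
            scores[s] = best[-1][prev] + math.log(trans[prev][s]) + log_emit(s, word)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def extract(tokens, path, target):
    """Join the words that were generated by the target state."""
    return " ".join(w for w, s in zip(tokens, path) if s == target)


states = ["background", "course", "location"]
start = {"background": 0.5, "course": 0.4, "location": 0.1}
trans = {
    "background": {"background": 0.6, "course": 0.2, "location": 0.2},
    "course": {"background": 0.4, "course": 0.5, "location": 0.1},
    "location": {"background": 0.7, "course": 0.1, "location": 0.2},
}
# Toy emission tables; the leftover mass stands in for unseen words.
emit = {
    "course": {"SI/EECS": 0.5, "767": 0.4},
    "location": {"SIN2": 0.8},
    "background": {"is": 0.2, "held": 0.2, "weekly": 0.2, "at": 0.2, ".": 0.1},
}

tokens = "SI/EECS 767 is held weekly at SIN2 .".split()
path = viterbi(tokens, states, start, trans, emit)
print(extract(tokens, path, "course"))
```

With these toy numbers the words assigned to the "course" state are "SI/EECS 767", matching the extraction shown on the slide.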
Named Entity Extraction
[Bikel, et al 1998]

[Figure: HMM whose hidden states are Person, Org, (five other name classes) and Other, placed between a start-of-sentence state and an end-of-sentence state]
Named Entity Extraction
Transition probabilities: $P(s_t \mid s_{t-1}, o_{t-1})$

Observation probabilities: $P(o_t \mid s_t, s_{t-1})$ or $P(o_t \mid s_t, o_{t-1})$

(1) Generating first word of a name-class

(2) Generating the rest of words in the name-class

(3) Generating “+end+” in a name-class
Training: Estimating Probabilities
Back-Off
“unknown words” and insufficient training data

Transition probabilities back off to: $P(s_t \mid s_{t-1})$, then $P(s_t)$

Observation probabilities back off to: $P(o_t \mid s_t)$, then $P(o_t)$
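To make the motivation concrete, here is a small sketch of smoothing an observation probability with a less specific estimate. It uses plain linear interpolation with a fixed weight `lam` and an add-one unigram estimate. This is a simplification chosen for illustration, not the weighting scheme of Bikel et al., and every name and number in it is an assumption.

```python
from collections import Counter


def train_counts(tagged):
    """Count (state, word) pairs, states, and words in a labeled sample."""
    pair = Counter(tagged)
    state = Counter(s for s, _ in tagged)
    word = Counter(w for _, w in tagged)
    return pair, state, word, len(tagged)


def smoothed_emission(s, w, counts, vocab_size, lam=0.8):
    """Mix the specific estimate P(w | s) with the general estimate P(w)."""
    pair, state, word, n = counts
    p_word_given_state = pair[(s, w)] / state[s] if state[s] else 0.0
    p_word = (word[w] + 1) / (n + vocab_size)  # add-one smoothed unigram
    return lam * p_word_given_state + (1 - lam) * p_word


tagged = [("Person", "Bill"), ("Person", "Gates"), ("Other", "said"), ("Org", "Microsoft")]
counts = train_counts(tagged)
vocab_size = 5  # four seen words plus one slot reserved for unknown words

p_seen = smoothed_emission("Person", "Gates", counts, vocab_size)
p_unknown = smoothed_emission("Person", "Veghte", counts, vocab_size)
print(p_seen, p_unknown)
```

The point of the fallback is that an unknown word such as "Veghte" still receives a small non-zero probability instead of ruling out every path through the Person state.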
HMM-Experimental Results
Train on ~500k words of news wire text.
Results:
Learning HMM for IE
[Seymore, 1999]

Consider labeled, unlabeled, and distantly-labeled data
Some Issues with HMM
• Need to enumerate all possible observation
sequences
• Not practical to represent multiple interacting
features or long-range dependencies of the
observations
• Very strict independence assumptions on the
observations
Maximum Entropy Markov Models
[Lafferty, 2001]
Observation features:
• identity of word
• ends in “-ski”
• is capitalized
• is part of a noun phrase
• is in a list of city names
• is under node X in WordNet
• is in bold font
• is indented
• is in hyperlink anchor
• …

[Figure: state chain … S_{t-1}, S_t, S_{t+1} … over observations O_{t-1}, O_t, O_{t+1}; an observation is annotated with the features: is “Wisniewski”, part of noun phrase, ends in “-ski”]

Idea: replace generative model in HMM with a maxent model, where state depends on observations

$\Pr(s_t \mid x_t) = \ldots$
Courtesy of William W. Cohen
MEMM
[Figure: the same observation-feature list and state/observation chain as on the previous slide]

Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state history

$\Pr(s_t \mid x_t, s_{t-1}, s_{t-2}, \ldots) = \ldots$
Courtesy of William W. Cohen
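A rough sketch of what such a maxent model for the next state can look like: overlapping features of the observation and the previous state are weighted, summed per candidate state, and normalized with a softmax. The feature names, the weights, and the two-state label set below are hypothetical, and the weights are set by hand rather than trained.

```python
import math


def features(prev_state, word):
    """Overlapping features of the previous state and the current word."""
    return {
        f"prev={prev_state}": 1.0,
        "capitalized": 1.0 if word[:1].isupper() else 0.0,
        "ends_in_ski": 1.0 if word.endswith("ski") else 0.0,
    }


def next_state_distribution(prev_state, word, weights, states):
    """Pr(s_t | s_{t-1}, x_t) as a softmax over weighted feature sums."""
    feats = features(prev_state, word)
    scores = {
        s: sum(weights.get((s, name), 0.0) * value for name, value in feats.items())
        for s in states
    }
    z = sum(math.exp(v) for v in scores.values())  # per-state normalizer
    return {s: math.exp(v) / z for s, v in scores.items()}


# Hand-set weights, keyed by (candidate state, feature name).
weights = {
    ("person", "capitalized"): 1.5,
    ("person", "ends_in_ski"): 2.0,
    ("other", "prev=other"): 0.5,
}
dist = next_state_distribution("other", "Wisniewski", weights, ["person", "other"])
print(dist)
```

Unlike an HMM emission table, nothing here requires the features to be independent of each other, which is the main attraction of the approach.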
HMM vs. MEMM
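HMM:
[Figure: chain S_{t-1} → S_t → S_{t+1} → ..., with each state S_t generating its observation O_t]

$$\Pr(s, o) = \prod_i \Pr(s_i \mid s_{i-1}) \Pr(o_i \mid s_i)$$

MEMM:
[Figure: chain S_{t-1} → S_t → S_{t+1} → ..., with each state S_t conditioned on its observation O_t]

$$\Pr(s \mid o) = \prod_i \Pr(s_i \mid s_{i-1}, o_i)$$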
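As a small worked example of the generative side of this comparison, the sketch below multiplies out the standard HMM factorization, a transition term times an emission term per position, for one labeled sequence. The two-state weather model and all of its numbers are invented for illustration.

```python
def hmm_joint(states, obs, start, trans, emit):
    """Pr(s, o): the start distribution plays the role of Pr(s_1 | s_0)."""
    p = start[states[0]] * emit[states[0]][obs[0]]
    for prev, cur, o in zip(states, states[1:], obs[1:]):
        p *= trans[prev][cur] * emit[cur][o]
    return p


start = {"rain": 0.5, "sun": 0.5}
trans = {"rain": {"rain": 0.7, "sun": 0.3}, "sun": {"rain": 0.4, "sun": 0.6}}
emit = {
    "rain": {"umbrella": 0.9, "none": 0.1},
    "sun": {"umbrella": 0.2, "none": 0.8},
}

joint = hmm_joint(
    ["rain", "rain", "sun"], ["umbrella", "umbrella", "none"], start, trans, emit
)
print(joint)  # (0.5 * 0.9) * (0.7 * 0.9) * (0.3 * 0.8)
```

An MEMM replaces each transition-times-emission pair with a single conditional distribution over the next state, so it scores labels given the observations rather than modeling the observations themselves.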
Label Bias Problem with MEMM
Consider this MEMM

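$\Pr(1, 2 \mid ro) = \Pr(2 \mid 1, ro)\,\Pr(1 \mid ro) = \Pr(2 \mid 1, o)\,\Pr(1 \mid r)$
$\Pr(1, 2 \mid ri) = \Pr(2 \mid 1, ri)\,\Pr(1 \mid ri) = \Pr(2 \mid 1, i)\,\Pr(1 \mid r)$
$\Pr(2 \mid 1, o) = \Pr(2 \mid 1, i) = 1$

$\Pr(1, 2 \mid ro) = \Pr(1, 2 \mid ri)$

But it should be $\Pr(1, 2 \mid ro) < \Pr(1, 2 \mid ri)$!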
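The effect can be reproduced numerically. The sketch below assumes a small state graph in the spirit of the well-known rib/rob illustration, in which state 1 has a single successor. That structure, the scoring function, and its weights are assumptions, since the slide's figure is not available. Because each state normalizes over its own successors, a state with one successor passes on all of its probability mass no matter how badly the observation fits.

```python
import math

# Hypothetical transition structure: state 0 branches, states 1 and 4 do not.
SUCCESSORS = {0: [1, 4], 1: [2], 4: [5]}


def local_prob(prev, nxt, obs, score):
    """Per-state (locally normalized) transition probability, as in an MEMM."""
    z = sum(math.exp(score(prev, s, obs)) for s in SUCCESSORS[prev])
    return math.exp(score(prev, nxt, obs)) / z


def path_prob(path, observations, score):
    p = 1.0
    for prev, nxt, obs in zip(path, path[1:], observations):
        p *= local_prob(prev, nxt, obs, score)
    return p


def score(prev, nxt, obs):
    # The 1 -> 2 transition strongly prefers "i" over "o" ...
    if (prev, nxt) == (1, 2):
        return 5.0 if obs == "i" else -5.0
    return 0.0


p_ri = path_prob([0, 1, 2], "ri", score)
p_ro = path_prob([0, 1, 2], "ro", score)
print(p_ri, p_ro)  # ... yet both paths receive the same probability
```

The preference encoded in `score` is cancelled by the local normalizer, which is exactly the equality derived above.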
Solve the Label Bias Problem
• Change the state-transition structure of the model
– Not always practical to change the set of states
• Start with a fully-connected model and let the training procedure figure out a good structure
– Precludes the use of prior knowledge, which is very valuable
Random Field

Courtesy of Rongkun Shen
Conditional Random Field

Courtesy of Rongkun Shen
Conditional Distribution
If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, is, by the fundamental theorem of random fields:

$$p_\theta(y \mid x) \propto \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big)$$

x is a data sequence
y is a label sequence
v is a vertex from vertex set V = set of label random variables
e is an edge from edge set E over V
$f_k$ and $g_k$ are given and fixed. $g_k$ is a Boolean vertex feature; $f_k$ is a Boolean edge feature
k indexes the features
$\theta = (\lambda_1, \lambda_2, \ldots, \lambda_n; \mu_1, \mu_2, \ldots, \mu_n)$; $\lambda_k$ and $\mu_k$ are parameters to be estimated
$y|_e$ is the set of components of y defined by edge e
$y|_v$ is the set of components of y defined by vertex v
Conditional Distribution
• CRFs use the observation-dependent normalization Z(x) for the conditional distributions:

$$p_\theta(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big)$$

Z(x) is a normalization factor that depends on the data sequence x
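For a very short chain, Z(x) can be computed by brute force, which makes the definition easy to check. In the sketch below the two labels, the single capitalization feature, and the weights `lam` (edge) and `mu` (vertex) are illustrative assumptions. Real implementations avoid the enumeration by using dynamic programming.

```python
import itertools
import math

LABELS = ["NAME", "OTHER"]


def score(y, x, lam, mu):
    """Sum of weighted edge and vertex features for one labeling of a chain."""
    total = 0.0
    for i in range(len(x)):
        # Vertex feature: the label paired with "is the word capitalized?"
        total += mu.get((y[i], x[i][:1].isupper()), 0.0)
        if i > 0:
            # Edge feature: the pair of adjacent labels
            total += lam.get((y[i - 1], y[i]), 0.0)
    return total


def conditional(x, lam, mu):
    """p(y | x) for every labeling y, normalized by Z(x)."""
    labelings = list(itertools.product(LABELS, repeat=len(x)))
    unnormalized = {y: math.exp(score(y, x, lam, mu)) for y in labelings}
    z = sum(unnormalized.values())  # Z(x): depends on x, sums over all y
    return {y: v / z for y, v in unnormalized.items()}


x = ["Bill", "Gates", "railed"]
mu = {("NAME", True): 2.0, ("OTHER", False): 1.0}
lam = {("NAME", "NAME"): 0.5}
dist = conditional(x, lam, mu)
best = max(dist, key=dist.get)
print(best, dist[best])
```

Because the normalizer is computed once per input sequence rather than once per state, scores along a whole path can trade off against each other.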
HMM like CRF
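A single feature for each state-state pair (y', y) and each state-observation pair (y, x) in the data is used to train the CRF:

$f_{y',y}(\langle u, v \rangle, y|_{\langle u, v \rangle}, x) = 1$ if $y_u = y'$ and $y_v = y$, 0 otherwise

$g_{y,x}(v, y|_v, x) = 1$ if $y_v = y$ and $x_v = x$, 0 otherwise

[Figure: chain Y_{t-1}, Y_t, Y_{t+1}, ... over observations X_{t-1}, X_t, X_{t+1}]

$\lambda_{y',y}$ and $\mu_{y,x}$ are equivalent to the logarithm of the HMM transition probability Pr(y'|y) and observation probability Pr(x|y)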
HMM like CRF
For a chain structure, the conditional probability of a label
sequence can be expressed in matrix form.
For each position i in the observed sequence x, define matrix

Where ei is the edge with label (yi-1, yi) and vi is the vertex with
label yi
HMM like CRF
The normalization function is the (start, stop) entry of the product
of these matrices

The conditional probability of label sequence y is:

where, y0 = start and yn+1 = stop
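The matrix formulas on these slides did not survive extraction. The sketch below assumes the usual chain-CRF construction, in which entry (y', y) of the matrix for position i is the exponential of the summed edge and vertex weights, and checks that the (start, stop) entry of the matrix product equals a brute-force sum over all label sequences. Labels, weights, and function names are hypothetical.

```python
import math
from itertools import product

LABELS = ["A", "B"]


def potential(prev, cur, obs, lam, mu):
    """One matrix entry: exp of edge weight plus vertex weight."""
    return math.exp(lam.get((prev, cur), 0.0) + mu.get((cur, obs), 0.0))


def z_matrix_product(x, lam, mu):
    """(start, stop) entry of the product of the n+1 position matrices."""
    # Row vector for the 'start' row after multiplying in the first matrix.
    alpha = {y: potential("start", y, x[0], lam, mu) for y in LABELS}
    for obs in x[1:]:
        alpha = {
            y: sum(alpha[p] * potential(p, y, obs, lam, mu) for p in LABELS)
            for y in LABELS
        }
    # The last matrix moves every label to 'stop' and has no observation.
    return sum(alpha[p] * potential(p, "stop", None, lam, mu) for p in LABELS)


def z_brute_force(x, lam, mu):
    """Same quantity, summing the path weight of every label sequence."""
    total = 0.0
    for y in product(LABELS, repeat=len(x)):
        path = ("start",) + y + ("stop",)
        weight = 1.0
        for prev, cur, obs in zip(path, path[1:], list(x) + [None]):
            weight *= potential(prev, cur, obs, lam, mu)
        total += weight
    return total


x = ["a", "b", "a"]
lam = {("A", "B"): 0.7, ("B", "B"): -0.3, ("start", "A"): 0.2}
mu = {("A", "a"): 1.0, ("B", "b"): 0.5}
z_fast = z_matrix_product(x, lam, mu)
z_slow = z_brute_force(x, lam, mu)
print(z_fast, z_slow)
```

The matrix product needs work linear in the sequence length, while the brute-force sum grows exponentially with it.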
Parameter Estimation
The problem: determine the parameters
From training data
with empirical distribution

The goal: maximize the log-likelihood objective function
Parameter Estimation –
Iterative Scaling Algorithms
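Update the weights as $\lambda_k \leftarrow \lambda_k + \delta\lambda_k$ and $\mu_k \leftarrow \mu_k + \delta\mu_k$ for appropriately chosen $\delta\lambda_k$ and $\delta\mu_k$.

$\delta\lambda_k$ for edge feature $f_k$ is the solution of

T(x, y) is a global property of (x, y), and efficiently computing the right-hand side of the above equation is a problem.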
Algorithm S
Define slack feature:

For each index i = 0, …, n+1 we define forward vectors

And backward vectors
Algorithm S
$$E[f_k] = \sum_{x} \tilde{p}(x) \sum_{i=1}^{n+1} \sum_{y', y} p(y', y, x)\, f_k(e_i, y|_{e_i} = (y', y), x)$$

$$p(y', y, x) = \frac{\alpha_{i-1}(y' \mid x)\, M_i(y', y \mid x)\, \beta_i(y \mid x)}{Z(x)}$$
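The pairwise term above can be computed with forward and backward vectors. The sketch below assumes the usual recurrences, which are missing from the extracted slide: each forward vector is the previous one times the next matrix, and each backward vector is a matrix times the following backward vector. The matrices are arbitrary positive numbers, not a trained model, and the function names are hypothetical.

```python
def forward_backward(mats, start, stop):
    """Return forward vectors, backward vectors and Z for a list of square matrices."""
    k = len(mats[0])
    alphas = [[1.0 if s == start else 0.0 for s in range(k)]]
    for m in mats:
        prev = alphas[-1]
        alphas.append([sum(prev[a] * m[a][b] for a in range(k)) for b in range(k)])
    betas = [[1.0 if s == stop else 0.0 for s in range(k)]]
    for m in reversed(mats):
        nxt = betas[0]
        betas.insert(0, [sum(m[a][b] * nxt[b] for b in range(k)) for a in range(k)])
    return alphas, betas, alphas[-1][stop]


def pair_marginal(mats, alphas, betas, z, i, a, b):
    """alpha_{i-1}(a) * M_i(a, b) * beta_i(b) / Z, with positions i counted from 1."""
    return alphas[i - 1][a] * mats[i - 1][a][b] * betas[i][b] / z


# Three positions, three states (index 0 = start, index 2 = stop); toy values.
mats = [
    [[1.0, 2.0, 0.5], [0.3, 1.5, 1.0], [0.2, 0.4, 1.0]],
    [[0.7, 1.1, 0.9], [2.0, 0.5, 0.6], [1.0, 1.0, 1.0]],
    [[0.4, 0.8, 1.2], [1.3, 0.9, 2.0], [0.5, 0.5, 0.5]],
]
alphas, betas, z = forward_backward(mats, start=0, stop=2)
totals = [
    sum(pair_marginal(mats, alphas, betas, z, i, a, b) for a in range(3) for b in range(3))
    for i in (1, 2, 3)
]
print(z, totals)
```

At every position the pairwise terms sum to one, which is a convenient sanity check when implementing the expectation computation.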
Algorithm S

The rate of convergence is governed by the step size, which is inversely proportional to the constant S. S is generally quite large, resulting in slow convergence.
Algorithm T
Keeps track of partial T totals. It accumulates feature expectations into counters indexed by T(x).

Uses forward-backward recurrences to compute the expectation $a_{k,t}$ of feature $f_k$ and $b_{k,t}$ of feature $g_k$ given that T(x) = t
Experiments
• Modeling label bias problem
– 2000 training and 500 test samples generated by
HMM
– CRF error is 4.6%
– MEMM error is 42%

CRF solves label bias problem
Experiments
• Modeling mixed order sources
– CRF converges in 500 iterations
– MEMM converges in 100 iterations
MEMM vs. HMM
The HMM outperforms the MEMM
CRF vs. MEMM
CRF usually outperforms the MEMM
CRF vs. HMM
Each open square represents a data set with α < ½, and a solid square indicates a data set with α ≥ ½. When the data is mostly second order (α ≥ ½), the discriminatively trained CRF usually outperforms the HMM
POS Tagging Experiments
• First-order HMM, MEMM and CRF model
• Data set: Penn Treebank
• 50%-50% train-test split

• Uses the MEMM parameter vector as a starting point for training the corresponding CRF, to accelerate convergence.
Interactive IE using CRF

The interactive parser updates the IE results according to the user's changes. Color coding is used to flag ambiguity in the IE results.
Some IE tools Available
• MALLET (UMass)
– statistical natural language processing,
– document classification,
– clustering,
– information extraction
– other machine learning applications to text.
• Sample Application:
GeneTaggerCRF: a gene-entity tagger based on MALLET (MAchine Learning for LanguagE Toolkit). It uses conditional random fields to find genes in a text file.
MinorThird
• http://minorthird.sourceforge.net/
• “a collection of Java classes for storing
text, annotating text, and learning to
extract entities and categorize text”
• Stored documents can be annotated in
independent files using TextLabels
(denoting, say, part-of-speech and
semantic information)
GATE
• http://gate.ac.uk/ie/annie.html
• leading toolkit for Text Mining
• distributed with an Information Extraction
component set called ANNIE (demo)
• Used in many research projects
– Long list can be found on its website
– Under integration of IBM UIMA
Sunita Sarawagi's CRF package
• http://crf.sourceforge.net/
• A Java implementation of conditional
random fields for sequential labeling.
UIMA (IBM)
• Unstructured Information
Management Architecture.
– A platform for unstructured
information management solutions
from combinations of semantic
analysis (IE) and search
components.
Some Interesting Websites Based on IE
• ZoomInfo
• CiteSeer.org (some of us use it every day!)
• Google Local, Google Scholar
• and many more…

How to convert PDF to text with Nanonetsnaman860154
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Dernier (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Information Extraction: A One-Hour Summary

  • 2. The Problem Date Time: Start - End Location Speaker Person
  • 3. What is “Information Extraction”
      As a task: filling slots in a database from sub-segments of text.
      Source text: October 14, 2002, 4:00 a.m. PT. For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
      Slots to fill: NAME, TITLE, ORGANIZATION
      Courtesy of William W. Cohen
  • 4. What is “Information Extraction”
      As a task: filling slots in a database from sub-segments of text.
      IE maps the news passage of slide 3 (October 14, 2002, 4:00 a.m. PT) to the following table:
      NAME              TITLE     ORGANIZATION
      Bill Gates        CEO       Microsoft
      Bill Veghte       VP        Microsoft
      Richard Stallman  founder   Free Soft..
      Courtesy of William W. Cohen
  • 5, 6, 7. What is “Information Extraction” (the same content appears on three consecutive slides)
      Information Extraction = segmentation + classification + association + clustering
      Segments found in the news passage of slide 3 (aka “named entity extraction”): Microsoft Corporation; CEO; Bill Gates; Microsoft; Gates; Microsoft; Bill Veghte; Microsoft; VP; Richard Stallman; founder; Free Software Foundation
      Courtesy of William W. Cohen
  • 8. What is “Information Extraction”
      Information Extraction = segmentation + classification + association + clustering
      Segments from the news passage of slide 3, with * as marked on the slide: * Microsoft Corporation; CEO; Bill Gates; * Microsoft; Gates; * Microsoft; Bill Veghte; * Microsoft; VP; Richard Stallman; founder; Free Software Foundation
      Resulting table:
      NAME              TITLE     ORGANIZATION
      Bill Gates        CEO       Microsoft
      Bill Veghte       VP        Microsoft
      Richard Stallman  founder   Free Soft..
      Courtesy of William W. Cohen
  • 10. Landscape of IE Techniques
      Example sentence used for every technique: "Abraham Lincoln was born in Kentucky."
      – Lexicons: is the candidate a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
      – Classify Pre-segmented Candidates: a classifier decides which class each candidate belongs to.
      – Sliding Window: a classifier decides which class each window belongs to; try alternate window sizes.
      – Boundary Models: classifiers mark BEGIN and END boundaries.
      – Finite State Machines (our focus today!): most likely state sequence?
      – Context Free Grammars: most likely parse? (node labels on the slide: NNP, V, P, NP, PP, VP, S)
      Courtesy of William W. Cohen
  • 11. Markov Property
      States: S1: rain, S2: cloud, S3: sun
      The state of a system at time t+1, q_{t+1}, is conditionally independent of {q_{t-1}, q_{t-2}, …, q_1, q_0} given q_t. In other words, the current state determines the probability distribution for the next state.
      [Figure: state-transition diagram over the three states with edge probabilities 1/2, 1/3, 1/2, 2/3, 1]
  • 12. Markov Property
      States: S1: rain, S2: cloud, S3: sun
      State-transition probabilities:
      $$A = \begin{pmatrix} 0 & 1 & 0 \\ 0.5 & 0.5 & 0 \\ 0.67 & 0.33 & 0 \end{pmatrix}$$
      Q: given that today is sunny (i.e., q_1 = 3), what is the probability of “sun-cloud” with the model?
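The question on slide 12 can be worked through with a few lines of Python. This is only a sketch: the matrix values are those read from the slide transcript, and the convention that row i holds the distribution over the next state given current state i is an assumption, as are the names `A` and `sequence_probability`.

```python
# Transition matrix as read from the slide transcript (the row/column
# convention is an assumption). Index 0 = S1 (rain), 1 = S2 (cloud), 2 = S3 (sun).
A = [
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.67, 0.33, 0.0],
]

def sequence_probability(A, states):
    """P(q_2, ..., q_T | q_1) under the first-order Markov property."""
    prob = 1.0
    for prev, nxt in zip(states, states[1:]):
        # Only the current state matters for the next-state distribution.
        prob *= A[prev][nxt]
    return prob

SUN, CLOUD = 2, 1
p_sun_cloud = sequence_probability(A, [SUN, CLOUD])
print(p_sun_cloud)  # 0.33 under the assumed row convention
```

Longer sequences simply multiply more entries of `A`, which is all the Markov property requires.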
  • 13. Hidden Markov Model
      Hidden states (the state sequence): S1: rain, S2: cloud, S3: sun. Observations: O1, O2, O3, O4, O5.
      [Figure: HMM diagram; probabilities shown on the slide: 1/2, 1/10, 1/2, 4/5, 9/10, 1/3, 2/3, 1, 1/5, 3/10, 7/10]
  • 14. IE with Hidden Markov Model
      Given a sequence of observations: "SI/EECS 767 is held weekly at SIN2."
      and a trained HMM with states: course name, location name, background
      Find the most likely state sequence (Viterbi): $\hat{s} = \arg\max_{\vec{s}} P(\vec{s}, \vec{o})$
      Any words said to be generated by the designated “course name” state are extracted as a course name:
      Course name: SI/EECS 767
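Slide 14's decoding step can be sketched as a small Viterbi implementation. All probabilities below are made-up illustrative values, not a trained model, and the helper names (`emission`, `viterbi`) are hypothetical; only the sentence, the three state names and the extraction rule come from the slide.

```python
import math

STATES = ["course name", "location name", "background"]
VOCAB = ["SI/EECS", "767", "is", "held", "weekly", "at", "SIN2"]

def emission(preferred, mass=0.94):
    """Spread `mass` over the preferred words and the remainder over the others."""
    others = [w for w in VOCAB if w not in preferred]
    table = {w: mass / len(preferred) for w in preferred}
    table.update({w: (1.0 - mass) / len(others) for w in others})
    return table

# Illustrative (untrained) HMM parameters.
START = {"course name": 0.5, "location name": 0.1, "background": 0.4}
TRANS = {
    "course name": {"course name": 0.5, "location name": 0.1, "background": 0.4},
    "location name": {"course name": 0.1, "location name": 0.4, "background": 0.5},
    "background": {"course name": 0.2, "location name": 0.2, "background": 0.6},
}
EMIT = {
    "course name": emission(["SI/EECS", "767"]),
    "location name": emission(["SIN2"]),
    "background": emission(["is", "held", "weekly", "at"]),
}

def viterbi(obs):
    """Return the state sequence maximizing P(s, o), computed in log space."""
    best = [{s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in STATES:
            # Best predecessor for state s at time t.
            prev = max(STATES, key=lambda p: best[t - 1][p] + math.log(TRANS[p][s]))
            best[t][s] = (best[t - 1][prev] + math.log(TRANS[prev][s])
                          + math.log(EMIT[s][obs[t]]))
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(STATES, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

tokens = "SI/EECS 767 is held weekly at SIN2".split()
path = viterbi(tokens)
course_name = " ".join(w for w, s in zip(tokens, path) if s == "course name")
print(course_name)  # SI/EECS 767
```

The extraction rule is the one stated on the slide: every token whose decoded state is “course name” becomes part of the extracted field.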
  • 15. Named Entity Extraction [Bikel et al., 1998]
      Hidden states: start-of-sentence, Person, Org, (five other name classes), Other, end-of-sentence
  • 16. Named Entity Extraction
      Transition probabilities: P(s_t | s_{t-1}, o_{t-1})
      Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1})
      (1) Generating the first word of a name-class (2) Generating the rest of the words in the name-class (3) Generating “+end+” in a name-class
  • 18. Back-Off
      Handles “unknown words” and insufficient training data.
      Transition probabilities: back off from P(s_t | s_{t-1}) to P(s_t)
      Observation probabilities: back off from P(o_t | s_t) to P(o_t)
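One simple way to realize the back-off idea on slide 18 is linear interpolation between the state-specific estimate P(o_t | s_t) and the state-independent estimate P(o_t). The sketch below is not the weighting scheme of the cited system: the fixed weight `lam`, the add-one smoothing of the unigram estimate, the toy training pairs and the function name `backoff_emission` are all assumptions made for illustration.

```python
from collections import Counter

def backoff_emission(tagged_words, lam=0.9):
    """Build p(word | state) that backs off toward p(word) for sparse or unseen events."""
    pair_counts = Counter(tagged_words)
    state_counts = Counter(state for _, state in tagged_words)
    word_counts = Counter(word for word, _ in tagged_words)
    total = len(tagged_words)
    vocab_size = len(word_counts) + 1  # one extra slot shared by all unknown words

    def prob(word, state):
        # Specific estimate: relative frequency of the word within the state.
        if state_counts[state]:
            p_specific = pair_counts[(word, state)] / state_counts[state]
        else:
            p_specific = 0.0
        # Backed-off estimate: add-one smoothed unigram, nonzero for unknown words.
        p_general = (word_counts[word] + 1) / (total + vocab_size)
        return lam * p_specific + (1.0 - lam) * p_general

    return prob

# Toy (word, state) pairs, hypothetical training data.
train = [("Bill", "Person"), ("Gates", "Person"), ("Microsoft", "Org"), ("said", "Other")]
p = backoff_emission(train)
print(p("Bill", "Person"), p("Veghte", "Person"))
```

An unseen word such as "Veghte" still receives a small nonzero probability, so a decoder does not rule out a state just because a word never appeared with it in training.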
  • 19. HMM-Experimental Results Train on ~500k words of news wire text. Results:
  • 20. Learning HMM for IE [Seymore, 1999] Consider labeled, unlabeled, and distantly-labeled data
  • 21. Some Issues with HMM
      • Need to enumerate all possible observation sequences
      • Not practical to represent multiple interacting features or long-range dependencies of the observations
      • Very strict independence assumptions on the observations
  • 22. Maximum Entropy Markov Models [Lafferty, 2001]
      Example observation features: identity of word; ends in “-ski”; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; …
      Chain: … S_{t-1}, S_t, S_{t+1} … over observations O_{t-1}, O_t, O_{t+1}; features shown for one observation: is “Wisniewski”; part of noun phrase; ends in “-ski”
      Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations: Pr(s_t | x_t) = ...
      Courtesy of William W. Cohen
  • 23. MEMM
      Same observation features and chain as on slide 22.
      Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history: Pr(s_t | x_t, s_{t-1}, s_{t-2}, ...) = ...
      Courtesy of William W. Cohen
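  • 24. HMM vs. MEMM
      HMM (chain … S_{t-1}, S_t, S_{t+1} … with observations O_{t-1}, O_t, O_{t+1}): $\Pr(s, o) = \prod_i \Pr(s_i \mid s_{i-1}) \Pr(o_i \mid s_{i-1})$
      MEMM (same chain, with each state conditioned on the previous state and an observation): $\Pr(s \mid o) = \prod_i \Pr(s_i \mid s_{i-1}, o_{i-1})$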
  • 25. Label Bias Problem with MEMM
      Consider this MEMM:
      Pr(12|ro) = Pr(2|1,ro) Pr(1,ro) = Pr(2|1,o) Pr(1,r)
      Pr(12|ri) = Pr(2|1,ri) Pr(1,ri) = Pr(2|1,i) Pr(1,r)
      The transition out of state 1 is independent of the observation, so Pr(2|1,o) = Pr(2|1,i) = 1, and therefore Pr(12|ro) = Pr(12|ri).
      But it should be Pr(12|ro) < Pr(12|ri)!
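The equality on slide 25 can be reproduced numerically. The sketch below assumes a hypothetical state graph in which state 0 branches to state 1 (the path meant for "ri") or state 4 (the path meant for "ro"), and each of those has exactly one successor; the scores are arbitrary illustrative numbers. The only MEMM-specific ingredient is that probabilities are normalized per source state.

```python
import math

# Hypothetical state graph: 0 -> {1, 4}, 1 -> 2, 4 -> 5.
SUCCESSORS = {0: [1, 4], 1: [2], 4: [5]}

# Transitions that "fit" the observation get a high score, all others a low one.
FITS = {(0, 1, "r"), (0, 4, "r"), (1, 2, "i"), (4, 5, "o")}

def score(prev, nxt, obs):
    """Unnormalized compatibility of a transition with the current observation."""
    return 2.0 if (prev, nxt, obs) in FITS else -2.0

def local_prob(prev, nxt, obs):
    """MEMM-style Pr(next | prev, obs): softmax over the successors of prev only."""
    z = sum(math.exp(score(prev, s, obs)) for s in SUCCESSORS[prev])
    return math.exp(score(prev, nxt, obs)) / z

def path_prob(path, observations):
    """Probability of a state path given the observation sequence."""
    prob = 1.0
    for prev, nxt, obs in zip(path, path[1:], observations):
        prob *= local_prob(prev, nxt, obs)
    return prob

p_ri = path_prob([0, 1, 2], "ri")
p_ro = path_prob([0, 1, 2], "ro")
print(p_ri, p_ro)  # both 0.5: the second observation cannot change the outcome
```

Because state 1 has a single successor, its local distribution is 1 whatever the score says, so the evidence in the second character is discarded. A model normalized over whole sequences does not have this constraint.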
  • 26. Solve the Label Bias Problem
      • Change the state-transition structure of the model
          – Not always practical to change the set of states
      • Start with a fully-connected model and let the training procedure figure out a good structure
          – Precludes the use of prior knowledge, which is very valuable
  • 27. Random Field Courtesy of Rongkun Shen
  • 29. Conditional Distribution
      If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, is, by the fundamental theorem of random fields:
      $$p_\theta(y \mid x) \propto \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big)$$
      – x is a data sequence
      – y is a label sequence
      – v is a vertex from the vertex set V = set of label random variables
      – e is an edge from the edge set E over V
      – f_k and g_k are given and fixed; g_k is a Boolean vertex feature; f_k is a Boolean edge feature
      – k indexes the features
      – $\theta = (\lambda_1, \lambda_2, \ldots, \lambda_n; \mu_1, \mu_2, \ldots, \mu_n)$; $\lambda_k$ and $\mu_k$ are parameters to be estimated
      – y|_e is the set of components of y defined by edge e
      – y|_v is the set of components of y defined by vertex v
  • 30. Conditional Distribution
      • CRFs use the observation-dependent normalization Z(x) for the conditional distributions:
      $$p_\theta(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big)$$
      Z(x) is a normalization over the data sequence x
  • 31. HMM-like CRF
      A single feature for each state-state pair (y', y) and each state-observation pair (y, x) in the data is used to train the CRF (chain … Y_{t-1}, Y_t, Y_{t+1} … over X_{t-1}, X_t, X_{t+1}):
      $f_{y',y}(\langle u,v \rangle, y|_{\langle u,v \rangle}, x) = 1$ if $y_u = y'$ and $y_v = y$, and 0 otherwise
      $g_{y,x}(v, y|_v, x) = 1$ if $y_v = y$ and $x_v = x$, and 0 otherwise
      $\lambda_{y',y}$ and $\mu_{y,x}$ are equivalent to the logarithms of the HMM transition probability Pr(y'|y) and observation probability Pr(x|y)
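  • 32. HMM-like CRF
      For a chain structure, the conditional probability of a label sequence can be expressed in matrix form. For each position i in the observed sequence x, define the matrix
      $$M_i(y', y \mid x) = \exp\Big( \sum_k \lambda_k f_k(e_i, y|_{e_i} = (y', y), x) + \sum_k \mu_k g_k(v_i, y|_{v_i} = y, x) \Big)$$
      where e_i is the edge with labels (y_{i-1}, y_i) and v_i is the vertex with label y_i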
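  • 33. HMM-like CRF
      The normalization function is the (start, stop) entry of the product of these matrices:
      $$Z(x) = \big( M_1(x)\, M_2(x) \cdots M_{n+1}(x) \big)_{\text{start},\,\text{stop}}$$
      The conditional probability of label sequence y is:
      $$p_\theta(y \mid x) = \frac{\prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x)}{Z(x)}$$
      where y_0 = start and y_{n+1} = stop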
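The matrix form on slides 32 and 33 can be checked on a toy chain: the (start, stop) entry of the matrix product should equal the sum of path scores over every label sequence. In this sketch each entry M_i(y', y | x) is taken to be the exponential of an edge weight plus a vertex weight; the label set, the weights in `lam` and `mu`, and all function names are hypothetical.

```python
import math
from itertools import product

LABELS = ["NAME", "OTHER"]              # hypothetical label set
EXT = ["start"] + LABELS + ["stop"]     # labels augmented with start and stop

# Hypothetical weights: lam for edge features, mu for vertex features.
lam = {("start", "NAME"): 0.5, ("NAME", "NAME"): 1.0,
       ("NAME", "OTHER"): 0.3, ("OTHER", "OTHER"): 0.8}
mu = {("NAME", "Bill"): 2.0, ("NAME", "Gates"): 1.5, ("OTHER", "said"): 1.0}

def M(i, x):
    """Matrix M_i(y', y | x) for position i = 1..n+1, stored as nested dicts."""
    n = len(x)
    m = {yp: {y: 0.0 for y in EXT} for yp in EXT}
    targets = ["stop"] if i == n + 1 else LABELS
    for yp in EXT:
        if yp == "stop":
            continue  # nothing follows stop
        for y in targets:
            s = lam.get((yp, y), 0.0)
            if i <= n:
                s += mu.get((y, x[i - 1]), 0.0)
            m[yp][y] = math.exp(s)
    return m

def matmul(a, b):
    return {r: {c: sum(a[r][k] * b[k][c] for k in EXT) for c in EXT} for r in EXT}

def partition(x):
    """Z(x): the (start, stop) entry of M_1 M_2 ... M_{n+1}."""
    prod = M(1, x)
    for i in range(2, len(x) + 2):
        prod = matmul(prod, M(i, x))
    return prod["start"]["stop"]

def path_score(y, x):
    """Product of M_i(y_{i-1}, y_i | x) with y_0 = start and y_{n+1} = stop."""
    seq = ["start"] + list(y) + ["stop"]
    score = 1.0
    for i in range(1, len(seq)):
        score *= M(i, x)[seq[i - 1]][seq[i]]
    return score

def conditional(y, x):
    return path_score(y, x) / partition(x)

x = ["Bill", "Gates", "said"]
brute_force_z = sum(path_score(y, x) for y in product(LABELS, repeat=len(x)))
total_prob = sum(conditional(y, x) for y in product(LABELS, repeat=len(x)))
print(partition(x), brute_force_z, total_prob)
```

The brute-force sum grows exponentially with the sequence length, while the matrix product needs only n+1 multiplications, which is why the matrix form matters in practice.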
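  • 34. Parameter Estimation
      The problem: determine the parameters $\theta = (\lambda_1, \lambda_2, \ldots; \mu_1, \mu_2, \ldots)$ from training data with empirical distribution $\tilde{p}(x, y)$.
      The goal: maximize the log-likelihood objective function
      $$O(\theta) = \sum_{x,y} \tilde{p}(x, y) \log p_\theta(y \mid x)$$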
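  • 35. Parameter Estimation – Iterative Scaling Algorithms
      Update the weights as $\lambda_k \leftarrow \lambda_k + \delta\lambda_k$ and $\mu_k \leftarrow \mu_k + \delta\mu_k$ for appropriately chosen $\delta\lambda_k$ and $\delta\mu_k$.
      $\delta\lambda_k$ for edge feature f_k is the solution of
      $$\tilde{E}[f_k] = \sum_{x,y} \tilde{p}(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(e_i, y|_{e_i}, x)\, e^{\delta\lambda_k T(x, y)}$$
      where $\tilde{E}[f_k]$ is the expectation of f_k under the empirical distribution. T(x, y) is a global property of (x, y), and efficiently computing the right-hand side of the above equation is a problem.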
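  • 36. Algorithm S
      Define the slack feature:
      $$s(x, y) = S - \sum_i \sum_k f_k(e_i, y|_{e_i}, x) - \sum_i \sum_k g_k(v_i, y|_{v_i}, x)$$
      For each index i = 0, …, n+1, define the forward vectors
      $$\alpha_0(y \mid x) = \begin{cases} 1 & \text{if } y = \text{start} \\ 0 & \text{otherwise} \end{cases} \qquad \alpha_i(x) = \alpha_{i-1}(x)\, M_i(x)$$
      and the backward vectors
      $$\beta_{n+1}(y \mid x) = \begin{cases} 1 & \text{if } y = \text{stop} \\ 0 & \text{otherwise} \end{cases} \qquad \beta_i(x)^\top = M_{i+1}(x)\, \beta_{i+1}(x)$$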
  • 37. Algorithm S
      $$E[f_k] = \sum_x \tilde{p}(x) \sum_{i=1}^{n+1} \sum_{y', y} p(y', y, x)\, f_k(e_i, y|_{e_i} = (y', y), x)$$
      $$p(y', y, x) = \frac{\alpha_{i-1}(y' \mid x)\, M_i(y', y \mid x)\, \beta_i(y \mid x)}{Z(x)}$$
  • 38. Algorithm S The rate of convergence is governed by the step size, which is inversely proportional to the constant S. S is generally quite large, resulting in slow convergence.
  • 39. Algorithm T Keeps track of partial T totals. It accumulates feature expectations into counters indexed by T(x). Uses forward-backward recurrences to compute the expectations a_{k,t} of feature f_k and b_{k,t} of feature g_k, given that T(x) = t.
  • 40. Experiments
      • Modeling the label bias problem
          – 2000 training and 500 test samples generated by HMM
          – CRF error is 4.6%
          – MEMM error is 42%
      CRF solves the label bias problem
  • 41. Experiments
      • Modeling mixed-order sources
          – CRF converges in 500 iterations
          – MEMM converges in 100 iterations
  • 42. MEMM vs. HMM The HMM outperforms the MEMM
  • 43. CRF vs. MEMM CRF usually outperforms the MEMM
  • 44. CRF vs. HMM Each open square represents a data set with α < ½, and a solid square indicates a data set with α ≥ ½. When the data is mostly second order (α ≥ ½), the discriminatively trained CRF usually outperforms the HMM.
  • 45. POS Tagging Experiments
      • First-order HMM, MEMM and CRF models
      • Data set: Penn Treebank
      • 50%-50% train-test split
      • Uses the MEMM parameter vector as a starting point for training the corresponding CRF, to accelerate convergence
  • 46. Interactive IE using CRF The interactive parser updates IE results according to the user's changes. Color coding is used to flag ambiguous extractions.
  • 47. Some IE Tools Available
      • MALLET (UMass)
          – statistical natural language processing
          – document classification
          – clustering
          – information extraction
          – other machine learning applications to text
      • Sample application: GeneTaggerCRF, a gene-entity tagger based on MALLET (MAchine Learning for LanguagE Toolkit). It uses conditional random fields to find genes in a text file.
  • 48. MinorThird
      • http://minorthird.sourceforge.net/
      • “a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text”
      • Stored documents can be annotated in independent files using TextLabels (denoting, say, part-of-speech and semantic information)
  • 49. GATE
      • http://gate.ac.uk/ie/annie.html
      • Leading toolkit for text mining
      • Distributed with an information extraction component set called ANNIE (demo)
      • Used in many research projects
          – A long list can be found on its website
          – Integration with IBM UIMA is under way
  • 50. Sunita Sarawagi's CRF package
      • http://crf.sourceforge.net/
      • A Java implementation of conditional random fields for sequential labeling
  • 51. UIMA (IBM)
      • Unstructured Information Management Architecture
          – A platform for unstructured information management solutions built from combinations of semantic analysis (IE) and search components
  • 52. Some Interesting Websites Based on IE
      • ZoomInfo
      • CiteSeer.org (some of us use it every day!)
      • Google Local, Google Scholar
      • and many more…

Editor's notes

  1. Pr(2|1,r): Independent of observation.