Semantic Analysis in Language Technology
http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm

Information Extraction (I)
Named Entity Recognition (NER)

Marina Santini
santinim@stp.lingfil.uu.se

Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden

Spring 2016
Previous Lecture: Distributional Semantics

•  Starting from Shakespeare and IR (term-document matrix)…
•  Moving to context "windows" taken from the Brown corpus…
•  Ending up with PPMI to weigh word distributions…
•  Mentioning the cosine metric to compare vectors…
IR: Term-document matrix

•  Each cell: the count of term t in a document d, N_{t,d} (the term frequency of t in d)
•  Each document is a count vector in ℕ^V: a column below

              As You Like It   Twelfth Night   Julius Caesar   Henry V
battle        1                1               8               15
soldier       2                2               12              36
fool          37               58              1               5
clown         6                117             0               0
Document similarity: Term-document matrix

•  Two documents are similar if their vectors are similar

              As You Like It   Twelfth Night   Julius Caesar   Henry V
battle        1                1               8               15
soldier       2                2               12              36
fool          37               58              1               5
clown         6                117             0               0
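Document similarity here is typically measured with the cosine metric recalled from the previous lecture. A minimal plain-Python sketch over the column vectors of the table above (variable names are illustrative):

```python
import math

# Term-document counts from the table: rows are terms, columns are
# As You Like It, Twelfth Night, Julius Caesar, Henry V.
counts = {
    "battle":  [1, 1, 8, 15],
    "soldier": [2, 2, 12, 36],
    "fool":    [37, 58, 1, 5],
    "clown":   [6, 117, 0, 0],
}

def doc_vector(j):
    """Column j of the matrix: one count per term."""
    return [row[j] for row in counts.values()]

def cosine(u, v):
    """Cosine of the angle between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# The two history/tragedy plays resemble each other far more than
# a comedy resembles a tragedy.
print(cosine(doc_vector(2), doc_vector(3)))  # Julius Caesar vs Henry V
print(cosine(doc_vector(0), doc_vector(2)))  # As You Like It vs Julius Caesar
```

With these counts, Julius Caesar and Henry V come out highly similar (cosine near 0.98), while a comedy and a tragedy score much lower.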
The words in a term-document matrix

•  Two words are similar if their vectors (the rows) are similar

              As You Like It   Twelfth Night   Julius Caesar   Henry V
battle        1                1               8               15
soldier       2                2               12              36
fool          37               58              1               5
clown         6                117             0               0
Term-context matrix for word similarity

•  Two words are similar in meaning if their context vectors are similar

              aardvark   computer   data   pinch   result   sugar   …
apricot       0          0          0      1       0        1
pineapple     0          0          0      1       0        1
digital       0          2          1      0       1        0
information   0          1          6      0       4        0
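The same cosine comparison applies to the rows of a term-context matrix. A small sketch with the counts above (the all-zero aardvark column is omitted; names are illustrative):

```python
import math

# Term-context counts from the table above.
rows = {
    "apricot":     [0, 0, 1, 0, 1],   # computer, data, pinch, result, sugar
    "pineapple":   [0, 0, 1, 0, 1],
    "digital":     [2, 1, 0, 1, 0],
    "information": [1, 6, 0, 4, 0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# apricot and pineapple occur in exactly the same contexts (cosine 1),
# while apricot and digital share no contexts at all (cosine 0).
print(cosine(rows["apricot"], rows["pineapple"]))
print(cosine(rows["apricot"], rows["digital"]))
```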
	
  
Computing PPMI on a term-context matrix

•  Matrix F with W rows (words) and C columns (contexts)
•  f_{ij} is the number of times w_i occurs in context c_j
p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}

p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}

p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}

pmi_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\, p_{*j}}

ppmi_{ij} = \begin{cases} pmi_{ij} & \text{if } pmi_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}

The numerator of p_{i*} (\sum_{j=1}^{C} f_{ij}) is the count of all the contexts where the word appears; the numerator of p_{*j} (\sum_{i=1}^{W} f_{ij}) is the count of all the words that occur in that context; the shared denominator (\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}) is the sum of all words in all contexts, i.e., all the numbers in the matrix.
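The formulas above translate directly into code. A minimal sketch over a small W x C count matrix F, using the term-context counts from the earlier table (plain Python, no libraries):

```python
import math

def ppmi(F):
    """Compute the PPMI matrix from a W x C count matrix F,
    following the formulas above."""
    total = sum(sum(row) for row in F)              # sum of all cells
    p_i = [sum(row) / total for row in F]           # p_{i*}: row marginals
    p_j = [sum(col) / total for col in zip(*F)]     # p_{*j}: column marginals
    out = []
    for i, row in enumerate(F):
        out_row = []
        for j, f in enumerate(row):
            p_ij = f / total
            if p_ij > 0:
                pmi = math.log2(p_ij / (p_i[i] * p_j[j]))
                out_row.append(max(pmi, 0.0))       # clip negatives to 0
            else:
                out_row.append(0.0)                 # zero count: PPMI 0
        out.append(out_row)
    return out

# Rows: apricot, pineapple, digital, information;
# columns: computer, data, pinch, result, sugar.
F = [[0, 0, 1, 0, 1],
     [0, 0, 1, 0, 1],
     [2, 1, 0, 1, 0],
     [1, 6, 0, 4, 0]]
for row in ppmi(F):
    print([round(x, 2) for x in row])
```

For instance, ppmi(information, data) comes out around 0.57 with these counts, since p(information, data) = 6/19 against marginals 11/19 and 7/19.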
  
Summation: Sigma Notation (i)

It means: sum whatever appears after the Sigma, so we sum n.
What is the value of n? The values are shown below and above the Sigma:
below --> the index variable and its starting value (e.g., n = 1);
above --> the end of the range of the sum (e.g., up to 4).
In this case n goes from 1 to 4, i.e., \sum_{n=1}^{4} n = 1 + 2 + 3 + 4 = 10.
(http://www.mathsisfun.com/algebra/sigma-notation.html)

Note that in p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}} we can't delete f_{ij}! The numerator is a single cell, while the denominator sums f_{ij} over all i and j.
Summation: Sigma Notation (ii)

•  Additional examples
•  Sums can be nested
  
Alternative notations… (Levy, 2012)

•  When the range of the sum can be understood from context, it can be left out;
•  or we may want to be vague about the precise range of the sum. For example, suppose that there are n variables, x_1 through x_n.
•  In order to say that the sum of all n variables is equal to 1, we might simply write: \sum_i x_i = 1
Formulas: Sigma Notation

p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}

•  Numerator: f_{ij} = a single cell
•  Denominator: sum the cells of all the words and the cells of all the contexts

p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}

•  Numerator: sum the cells of all contexts (all the columns)

p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}

•  Numerator: sum the cells of all the words (all the rows)
Living lexicon: built upon an underlying continuously updated corpus

Drawbacks: updated but unstable & incomplete: missing words, missing linguistic information, etc.
Multilinguality, function words, etc.
Similarity:

•  Given the underlying statistical model, these words are similar
Fredrik Olsson
Gavagai blog

•  Further reading (Magnus Sahlgren): https://www.gavagai.se/blog/2015/09/30/a-brief-history-of-word-embeddings/
End of previous lecture
Acknowledgements

Most slides borrowed or adapted from:
Dan Jurafsky and Christopher Manning, Coursera
Dan Jurafsky and James H. Martin

J&M (2015, draft): https://web.stanford.edu/~jurafsky/slp3/
Preliminary: What's Information Extraction (IE)?

•  IE = text analytics = text mining = e-discovery, etc.
•  The ultimate goal is to convert unstructured text into structured information (so information of interest can easily be picked up).
•  unstructured data/text: email, PDF files, social media posts, tweets, text messages, blogs, basically any running text...
•  structured data/text: databases (XML, SQL, etc.), ontologies, dictionaries, etc.
Information Extraction and Named Entity Recognition

Introducing the tasks: getting simple structured information out of text
Information Extraction

•  Information extraction (IE) systems
   •  Find and understand limited relevant parts of texts
   •  Gather information from many pieces of text
   •  Produce a structured representation of relevant information:
      •  relations (in the database sense), a.k.a.
      •  a knowledge base
•  Goals:
   1. Organize information so that it is useful to people
   2. Put information in a semantically precise form that allows further inferences to be made by computer algorithms
Information Extraction: factual info

•  IE systems extract clear, factual information
•  Roughly: Who did what to whom when?
•  E.g.:
   •  Gathering earnings, profits, board members, headquarters, etc. from company reports
      •  The headquarters of BHP Billiton Limited, and the global headquarters of the combined BHP Billiton Group, are located in Melbourne, Australia.
      •  headquarters("BHP Billiton Limited", "Melbourne, Australia")
   •  Learning drug-gene product interactions from medical research literature
Low-level information extraction

•  Is now available – and I think popular – in applications like Apple or Google mail, and web indexing
•  Often seems to be based on regular expressions and name lists
Low-level information extraction

•  A very important sub-task: find and classify names in text.
•  An entity is a discrete thing like "IBM Corporation"
•  "Named" means called "IBM" or "Big Blue", not "it" or "the company"
•  Often extended in practice to things like dates, instances of products and chemical/biological substances that aren't really entities…
•  But also used for times, dates, proteins, etc., which aren't entities but are easy-to-recognize semantic classes
Named Entity Recognition (NER)

•  A very important sub-task: find and classify names in text. You have a text, and you want to:
   1. find things that are names: European Commission, John Lloyd Jones, etc.
   2. give them labels: ORG, PERS, etc.
•  For example:
   •  The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.
	
  
•  In the example, the names are tagged with classes such as Person, Organization, Location and Date: Andrew Wilkie, Rob Oakeshott and Tony Windsor are Persons; Labor and the Greens are Organizations; 2010 is a Date.
Named Entity Recognition (NER)

•  The uses:
   •  Named entities can be indexed, linked off, etc.
   •  Sentiment can be attributed to companies or products
   •  A lot of IE relations are associations between named entities
   •  For question answering, answers are often named entities.
•  Concretely:
   •  Many web pages tag various entities, with links to bio or topic pages, etc.
   •  Reuters' OpenCalais, Evri, AlchemyAPI, Yahoo's Term Extraction, …
   •  Apple/Google/Microsoft/… smart recognizers for document content
Summary:
Getting simple structured information out of text
Evaluation of Named Entity Recognition

The extension of Precision, Recall, and the F measure to sequences
The Named Entity Recognition Task

Task: predict entities in a text

    Foreign     ORG
    Ministry    ORG
    spokesman   O
    Shen        PER
    Guofang     PER
    told        O
    Reuters     ORG
    :           :

Standard evaluation is per entity, not per token.
P/R

P = TP / (TP + FP);  R = TP / (TP + FN)

FP = false alarm (it is not a NE, but it has been classified as a NE)
FN = it is in fact a NE, but the system failed to recognise it
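Since standard NER evaluation is per entity, a scorer first extracts entity spans from the token labels and then compares span sets. A minimal sketch assuming simple IO-style labels (names and the example sequence are illustrative, based on the Shen Guofang slide above):

```python
def spans(labels):
    """Extract entity spans (start, end, type) from a sequence of
    IO-style token labels; 'O' means outside any entity."""
    out, start = set(), None
    labels = list(labels) + ["O"]          # sentinel to close the last span
    for i, lab in enumerate(labels):
        if start is not None and lab != labels[start]:
            out.add((start, i, labels[start]))
            start = None
        if lab != "O" and start is None:
            start = i
    return out

def prf1(gold_labels, pred_labels):
    """Entity-level precision, recall and F1."""
    gold, pred = spans(gold_labels), spans(pred_labels)
    tp = len(gold & pred)                  # exact span + type match
    fp = len(pred - gold)                  # false alarms
    fn = len(gold - pred)                  # missed entities
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = ["ORG", "ORG", "O", "PER", "PER", "O", "ORG"]  # Foreign Ministry ... Shen Guofang ... Reuters
pred = ["ORG", "ORG", "O", "PER", "O",   "O", "ORG"]  # system truncates "Shen Guofang" to "Shen"
print(prf1(gold, pred))
```

Note how the truncated "Shen" span costs both a fp (wrong span returned) and a fn (gold span missed), which is exactly the boundary-error behaviour discussed next.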
  
Precision/Recall/F1 for IE/NER

•  Recall and precision are straightforward for tasks like IR and text categorization, where there is only one grain size (documents)
•  The measure behaves a bit funnily for IE/NER when there are boundary errors (which are common):
   •  First Bank of Chicago announced earnings …
   •  If the system extracts, say, Bank of Chicago instead of the gold entity First Bank of Chicago, this counts as both a fp (a wrong span was returned) and a fn (the gold span was missed)
   •  Selecting nothing would have been better
•  Some other metrics (e.g., the MUC scorer) give partial credit (according to complex rules)
Summary:
Be careful when interpreting the P/R/F1 measures
Sequence Models for Named Entity Recognition

The ML sequence model approach to NER
Training
1. Collect a set of representative training documents
2. Label each token for its entity class or other (O)
3. Design feature extractors appropriate to the text and classes
4. Train a sequence classifier to predict the labels from the data

Testing
1. Receive a set of testing documents
2. Run sequence model inference to label each token
3. Appropriately output the recognized entities
NER pipeline

Representative documents → (human annotation) → Annotated documents → (feature extraction) → Training data → Sequence classifiers → NER system
Encoding classes for sequence labeling

             IO encoding   IOB encoding
Fred         PER           B-PER
showed       O             O
Sue          PER           B-PER
Mengqiu      PER           B-PER
Huang        PER           I-PER
's           O             O
new          O             O
painting     O             O
Features for sequence labeling

•  Words
   •  Current word (essentially like a learned dictionary)
   •  Previous/next word (context)
•  Other kinds of inferred linguistic classification
   •  Part-of-speech tags
•  Label context
   •  Previous (and perhaps next) label
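A token's feature map along these lines might look as follows; this is a sketch, and the feature names are illustrative rather than taken from any particular toolkit:

```python
def token_features(tokens, i, prev_label=None):
    """Features for token i: the word itself, its neighbours, and the
    previous label (POS tags would be added in the same way)."""
    feats = {
        "word": tokens[i].lower(),
        "is_capitalized": tokens[i][0].isupper(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<S>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</S>",
    }
    if prev_label is not None:
        feats["prev_label"] = prev_label      # label context
    return feats

tokens = ["Shen", "Guofang", "told", "Reuters"]
print(token_features(tokens, 1, prev_label="PER"))
```

A sequence classifier would be trained on such feature maps, one per token.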
  
Features:	
  Word	
  substrings	
  
drug
company
movie
place
person
Cotrimoxazole	
   Wethersfield	
  
Alien	
  Fury:	
  Countdown	
  to	
  Invasion	
  
0
0
0
18
0
oxa
708
0
0
06
:
0 8
6
68
14
field
Features: Word shapes

•  Word Shapes
•  Map words to a simplified representation that encodes attributes such as length, capitalization, numerals, Greek letters, internal punctuation, etc.

Varicella-zoster   Xx-xxx
mRNA               xXXX
CPA1               XXXd
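A word-shape mapping can be sketched like this. The collapsed variant (runs of the same shape character merged) is one common choice and is an assumption here, not necessarily the exact mapping behind the slide's examples:

```python
def word_shape(word):
    """Map each character class: uppercase -> X, lowercase -> x,
    digit -> d; other characters (punctuation) are kept as-is."""
    out = []
    for ch in word:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)
    return "".join(out)

def short_shape(word):
    """Collapse runs of the same shape character, e.g.
    Varicella-zoster -> Xx-x."""
    shape = word_shape(word)
    out = []
    for ch in shape:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

print(word_shape("mRNA"))               # xXXX
print(word_shape("CPA1"))               # XXXd
print(short_shape("Varicella-zoster"))  # Xx-x
```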
Sequence models

•  Once you have designed the features, apply a sequence classifier (cf. PoS tagging), such as:
   •  Maximum Entropy Markov Models
   •  Conditional Random Fields
   •  etc.
The end

Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 

IE: Named Entity Recognition (NER)

Term-context matrix for word similarity
•  Two words are similar in meaning if their context vectors are similar

                aardvark  computer  data  pinch  result  sugar  ...
   apricot      0         0         0     1      0       1
   pineapple    0         0         0     1      0       1
   digital      0         2         1     0      1       0
   information  0         1         6     0      4       0
6
Computing PPMI on a term-context matrix
•  Matrix F with W rows (words) and C columns (contexts)
•  f_ij is the number of times w_i occurs in context c_j

   p_ij = f_ij / Σ_{i=1..W} Σ_{j=1..C} f_ij
   p_i* = ( Σ_{j=1..C} f_ij ) / Σ_{i=1..W} Σ_{j=1..C} f_ij
   p_*j = ( Σ_{i=1..W} f_ij ) / Σ_{i=1..W} Σ_{j=1..C} f_ij
   pmi_ij = log2( p_ij / (p_i* · p_*j) )
   ppmi_ij = pmi_ij if pmi_ij > 0, otherwise 0

•  The denominator sums all the numbers in the matrix: the count of all words in all contexts
•  The numerator of p_i* is the count of all the contexts where the word appears (a row)
•  The numerator of p_*j is the count of all the words that occur in that context (a column)
7
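The PPMI computation above can be sketched in Python on the term-context counts from the earlier slide (a minimal sketch: the all-zero aardvark column is dropped, and the variable names are my own):

```python
import math

# Term-context counts from the slide.
# Columns: computer, data, pinch, result, sugar (all-zero columns omitted).
F = {
    "apricot":     [0, 0, 1, 0, 1],
    "pineapple":   [0, 0, 1, 0, 1],
    "digital":     [2, 1, 0, 1, 0],
    "information": [1, 6, 0, 4, 0],
}

total = sum(sum(row) for row in F.values())        # sum of all cells in the matrix
row_sum = {w: sum(row) for w, row in F.items()}    # all contexts where the word appears
col_sum = [sum(c) for c in zip(*F.values())]       # all words occurring in that context

def ppmi(word, j):
    """ppmi_ij = max(0, log2(p_ij / (p_i* * p_*j)))."""
    f = F[word][j]
    if f == 0:          # log2(0) is undefined; PPMI clips these cells to 0 anyway
        return 0.0
    pmi = math.log2((f / total) /
                    ((row_sum[word] / total) * (col_sum[j] / total)))
    return max(0.0, pmi)
```

For example, `ppmi("information", 1)` (the *data* context) comes out at about 0.57, while cells with zero counts get a PPMI of 0.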
Summation: Sigma Notation (i)

   Σ_{n=1}^{4} n      "sum from n = 1 to 4"

It means: sum whatever appears after the Sigma, so here we sum n. Which values does n take? They are shown below and above the Sigma: below is the index variable and its starting value (e.g. n = 1); above is the upper limit of the sum (e.g. up to 4). So n goes from 1 to 4, i.e. 1, 2, 3 and 4.
(http://www.mathsisfun.com/algebra/sigma-notation.html)

Note: in p_ij = f_ij / Σ_{i=1..W} Σ_{j=1..C} f_ij we can't delete f_ij!
8
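In code, a Sigma is just a loop, which makes the notation easy to sanity-check (a tiny illustrative sketch):

```python
# Sum of n for n = 1..4: the index starts below the Sigma, the range ends above it.
s = sum(n for n in range(1, 5))   # 1 + 2 + 3 + 4 = 10

# A nested (double) sum walks every cell f_ij of a matrix,
# exactly like the denominator in the PPMI formula.
f = [[1, 2, 3],
     [4, 5, 6]]
double = sum(f[i][j] for i in range(len(f)) for j in range(len(f[0])))  # = 21
```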
Summation: Sigma Notation (ii)
•  Additional examples
•  Sums can be nested
9
Alternative notations... (Levy, 2012)
•  When the range of the sum can be understood from context, it can be left out;
•  or we may want to be vague about the precise range of the sum. For example, suppose that there are n variables, x_1 through x_n.
•  In order to say that the sum of all n variables is equal to 1, we might simply write:

   Σ_i x_i = 1
10
Formulas: Sigma Notation

   p_ij = f_ij / Σ_{i=1..W} Σ_{j=1..C} f_ij
   p_i* = ( Σ_{j=1..C} f_ij ) / Σ_{i=1..W} Σ_{j=1..C} f_ij
   p_*j = ( Σ_{i=1..W} f_ij ) / Σ_{i=1..W} Σ_{j=1..C} f_ij

•  Numerator of p_ij: f_ij = a single cell
•  Denominators: sum the cells over all the words and all the contexts
•  Numerator of p_i*: sum the cells of all contexts (all the columns)
•  Numerator of p_*j: sum the cells of all the words (all the rows)
11
Living lexicon: built upon an underlying continuously updated corpus
•  Drawbacks: updated but unstable & incomplete: missing words, missing linguistic information, etc.
•  Multilinguality, function words, etc.
12
Similarity:
•  Given the underlying statistical model, these words are similar
Fredrik Olsson
13
Gavagai blog
•  Further reading (Magnus Sahlgren):
   https://www.gavagai.se/blog/2015/09/30/a-brief-history-of-word-embeddings/
14
End of previous lecture
15
Acknowledgements
Most slides borrowed or adapted from:
•  Dan Jurafsky and Christopher Manning, Coursera
•  Dan Jurafsky and James H. Martin
J&M (2015, draft): https://web.stanford.edu/~jurafsky/slp3/
16
Preliminary: What's Information Extraction (IE)?
•  IE = text analytics = text mining = e-discovery, etc.
•  The ultimate goal is to convert unstructured text into structured information (so that information of interest can easily be picked up).
•  Unstructured data/text: email, PDF files, social media posts, tweets, text messages, blogs, basically any running text...
•  Structured data/text: databases (xml, sql, etc.), ontologies, dictionaries, etc.
17
Information Extraction and Named Entity Recognition
Introducing the tasks: getting simple structured information out of text
18
Information Extraction
•  Information extraction (IE) systems
   •  Find and understand limited relevant parts of texts
   •  Gather information from many pieces of text
   •  Produce a structured representation of relevant information:
      •  relations (in the database sense), a.k.a.
      •  a knowledge base
•  Goals:
   1.  Organize information so that it is useful to people
   2.  Put information in a semantically precise form that allows further inferences to be made by computer algorithms
19
Information Extraction: factual info
•  IE systems extract clear, factual information
   •  Roughly: Who did what to whom when?
•  E.g.:
   •  Gathering earnings, profits, board members, headquarters, etc. from company reports
      •  The headquarters of BHP Billiton Limited, and the global headquarters of the combined BHP Billiton Group, are located in Melbourne, Australia.
      •  headquarters("BHP Billiton Limited", "Melbourne, Australia")
   •  Learn drug-gene product interactions from medical research literature
20
Low-level information extraction
•  Is now available (and, I think, popular) in applications like Apple or Google mail, and web indexing
•  Often seems to be based on regular expressions and name lists
21
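A toy extractor in this "regular expressions and name lists" style might look like the sketch below. The patterns and the organization list are invented for illustration, not taken from any real mail client:

```python
import re

# Hypothetical name list; real systems use large gazetteers.
ORG_LIST = ["BHP Billiton Limited", "Reuters"]

# Dates like "3 May 2016" and simple email addresses.
DATE_RE = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|"
    r"August|September|October|November|December) \d{4}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

def extract(text):
    """Return (string, label) pairs found by the patterns and the list."""
    found = [(m.group(0), "DATE") for m in DATE_RE.finditer(text)]
    found += [(m.group(0), "EMAIL") for m in EMAIL_RE.finditer(text)]
    found += [(org, "ORG") for org in ORG_LIST if org in text]
    return found
```

This is exactly the "low-level" flavour described above: cheap, precise for what the patterns cover, and blind to everything else.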
Named Entity Recognition (NER)
•  A very important sub-task: find and classify names in text.
•  An entity is a discrete thing like "IBM Corporation"
•  "Named" means called "IBM" or "Big Blue", not "it" or "the company"
•  In practice often extended to things like dates, instances of products and chemical/biological substances that aren't really entities...
•  But also used for times, dates, proteins, etc., which aren't entities, just easy-to-recognize semantic classes
23
Named Entity Recognition (NER)
•  A very important sub-task: find and classify names in text, for example:
   The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.
You have a text, and you want to:
1.  find things that are names: European Commission, John Lloyd Jones, etc.
2.  give them labels: ORG, PERS, etc.
24
Named Entity Recognition (NER)
•  A very important sub-task: find and classify names in text, for example:
   The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.
Label legend: Person, Date, Location, Organization
25
Named Entity Recognition (NER)
•  The uses:
   •  Named entities can be indexed, linked off, etc.
   •  Sentiment can be attributed to companies or products
   •  A lot of IE relations are associations between named entities
   •  For question answering, answers are often named entities.
•  Concretely:
   •  Many web pages tag various entities, with links to bio or topic pages, etc.
      •  Reuters' OpenCalais, Evri, AlchemyAPI, Yahoo's Term Extraction, ...
   •  Apple/Google/Microsoft/... smart recognizers for document content
26
Summary:
Getting simple structured information out of text
27
Evaluation of Named Entity Recognition
The extension of Precision, Recall, and the F measure to sequences
28
The Named Entity Recognition Task
Task: Predict entities in a text

   Foreign     ORG
   Ministry    ORG
   spokesman   O
   Shen        PER
   Guofang     PER
   told        O
   Reuters     ORG

Standard evaluation is per entity, not per token
29
P/R
•  P = TP / (TP + FP);  R = TP / (TP + FN)
•  FP = false alarm (it is not a NE, but it has been classified as a NE)
•  FN = it really is a NE, but the system failed to recognize it
30
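These definitions can be checked with a small helper (a sketch; the counts in the example are made up):

```python
def precision_recall_f1(tp, fp, fn):
    """P = TP/(TP+FP), R = TP/(TP+FN), F1 = harmonic mean of P and R."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```

Note that a single boundary error adds 1 to fp and 1 to fn at once, so it lowers precision and recall simultaneously.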
Precision/Recall/F1 for IE/NER
•  Recall and precision are straightforward for tasks like IR and text categorization, where there is only one grain size (documents)
•  The measures behave a bit funnily for IE/NER when there are boundary errors (which are common):
   •  First Bank of Chicago announced earnings ...
•  This counts as both a fp and a fn
•  Selecting nothing would have been better
•  Some other metrics (e.g., the MUC scorer) give partial credit (according to complex rules)
31
Summary:
Be careful when interpreting the P/R/F1 measures
32
Sequence Models for Named Entity Recognition
33
The ML sequence model approach to NER
Training:
1.  Collect a set of representative training documents
2.  Label each token for its entity class or other (O)
3.  Design feature extractors appropriate to the text and classes
4.  Train a sequence classifier to predict the labels from the data
Testing:
1.  Receive a set of testing documents
2.  Run sequence model inference to label each token
3.  Appropriately output the recognized entities
34
NER pipeline

   Representative documents → Human annotation → Annotated documents
   → Feature extraction → Training data → Sequence classifiers → NER system
35
Encoding classes for sequence labeling

               IO encoding   IOB encoding
   Fred        PER           B-PER
   showed      O             O
   Sue         PER           B-PER
   Mengqiu     PER           B-PER
   Huang       PER           I-PER
   's          O             O
   new         O             O
   painting    O             O
36
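A minimal IO-to-IOB converter shows why the two encodings differ. From plain IO tags, adjacent same-class tokens cannot be split into two entities, which is exactly what happens to "Sue" and "Mengqiu" above: the gold IOB has B-PER on Mengqiu (two people), but IO has lost that boundary. The sketch below makes the (lossy) single-entity assumption:

```python
def io_to_iob(io_tags):
    """Convert IO tags (e.g. 'PER' or 'O') to IOB, assuming adjacent
    same-class tokens form a single entity (the information IO loses)."""
    iob, prev = [], "O"
    for tag in io_tags:
        if tag == "O":
            iob.append("O")
        elif tag == prev:
            iob.append("I-" + tag)   # continuation of the previous entity
        else:
            iob.append("B-" + tag)   # first token of a new entity
        prev = tag
    return iob

# The slide's example: Fred showed Sue Mengqiu Huang 's new painting
tags = ["PER", "O", "PER", "PER", "PER", "O", "O", "O"]
# io_to_iob(tags) gives I-PER on Mengqiu, not the gold B-PER:
# IO cannot recover the boundary between the two names.
```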
Features for sequence labeling
•  Words
   •  Current word (essentially like a learned dictionary)
   •  Previous/next word (context)
•  Other kinds of inferred linguistic classification
   •  Part-of-speech tags
•  Label context
   •  Previous (and perhaps next) label
37
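An extractor for these feature families might look like the following sketch (the function and feature names are illustrative, not from the lecture):

```python
def token_features(tokens, pos_tags, prev_label, i):
    """Features for token i: the word itself, its neighbours, its
    PoS tag, and the previously predicted label."""
    return {
        "word": tokens[i],                               # learned-dictionary feature
        "prev_word": tokens[i - 1] if i > 0 else "<S>",  # left context
        "next_word": tokens[i + 1] if i + 1 < len(tokens) else "</S>",
        "pos": pos_tags[i],                              # inferred linguistic class
        "prev_label": prev_label,                        # label context
    }
```

A sequence classifier would receive one such feature dictionary per token and predict the label sequence.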
Features: Word substrings
•  Example words: Cotrimoxazole (a drug), Wethersfield (a place), Alien Fury: Countdown to Invasion (a movie)
•  Substring counts over classes (drug, company, movie, place, person) show that substrings like "oxa", ":" and "field" are each strongly skewed towards one class: "oxa" towards drugs, ":" towards movies, "field" towards places
38
Features: Word shapes
•  Map words to a simplified representation that encodes attributes such as length, capitalization, numerals, Greek letters, internal punctuation, etc.

   Varicella-zoster   Xx-xxx
   mRNA               xXXX
   CPA1               XXXd
39
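A basic word-shape function is a one-pass character mapping. This sketch produces the full (non-collapsed) shape; the slide's "Xx-xxx" for Varicella-zoster comes from a shortened variant that additionally collapses repeated characters, which is not implemented here:

```python
def word_shape(word):
    """Map uppercase -> X, lowercase -> x, digits -> d; keep internal
    punctuation (and any other character) as-is."""
    out = []
    for ch in word:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)
    return "".join(out)
```

For instance, `word_shape("mRNA")` gives "xXXX" and `word_shape("CPA1")` gives "XXXd", matching the table above.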
Sequence models
•  Once you have designed the features, apply a sequence classifier (cf. PoS tagging), such as:
   •  Maximum Entropy Markov Models
   •  Conditional Random Fields
   •  etc.
40