CS 124/LINGUIST 180: From Languages to Information
  Dan Jurafsky
  Lecture 14: Information Extraction and Semantic Relation Learning
  Lots of slides from many people, including Rion Snow, Jim Martin, Chris Manning, and William Cohen
Background: Information Extraction
  Extract information from text
  Sometimes called text analytics commercially
  Extract entities (the people, organizations, locations, times, dates, genes, diseases, medicines, etc. in a text)
  Extract the relations between entities
  Figure out the larger events that are taking place
Information Extraction
  Creating knowledge bases and ontologies
  Implications for cognitive modeling
  Digital Libraries
    Google Scholar, Citeseer need to extract the title, author and references
  Bioinformatics
  Patent analysis
  Specific market segments for stock analysis
  SEC filings
  Intelligence analysis
Outline
  Reminder: Named Entity Tagging
  Relation Extraction
    Hand-built patterns
    Seed (bootstrap) methods
    Supervised classification
    Distant supervision
What is “Information Extraction”
  As a task: Filling slots in a database from sub-segments of text.
  October 14, 2002, 4:00 a.m. PT
  For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
  Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
  "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
  Richard Stallman, founder of the Free Software Foundation, countered saying…
  IE output:
    NAME              TITLE    ORGANIZATION
    Bill Gates        CEO      Microsoft
    Bill Veghte       VP       Microsoft
    Richard Stallman  founder  Free Soft..
  Slide from William Cohen
What is “Information Extraction”
  As a family of techniques: Information Extraction = segmentation + classification + association + clustering
  Applied to the same October 14, 2002 article:
  Segmented and classified spans (“named entity extraction”): Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
  After association and clustering:
    NAME              TITLE    ORGANIZATION
    Bill Gates        CEO      Microsoft
    Bill Veghte       VP       Microsoft
    Richard Stallman  founder  Free Soft..
  Slide from William Cohen
Extracting Structured Knowledge
  Each article can contain hundreds or thousands of items of knowledge...
  “The Lawrence Livermore National Laboratory (LLNL) in Livermore, California is a scientific research laboratory founded by the University of California in 1952.”
    LLNL EQ Lawrence Livermore National Laboratory
    LLNL LOC-IN California
    Livermore LOC-IN California
    LLNL IS-A scientific research laboratory
    LLNL FOUNDED-BY University of California
    LLNL FOUNDED-IN 1952
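One way to hold the extracted facts from the slide is as (subject, relation, object) triples. The sketch below is only an illustration: the `Triple` type and the `objects_of` helper are hypothetical names, not part of any system mentioned in the lecture.

```python
from collections import namedtuple

# A (subject, relation, object) triple; the field names are illustrative.
Triple = namedtuple("Triple", ["subject", "relation", "object"])

# The six facts listed on the slide for the LLNL sentence.
TRIPLES = [
    Triple("LLNL", "EQ", "Lawrence Livermore National Laboratory"),
    Triple("LLNL", "LOC-IN", "California"),
    Triple("Livermore", "LOC-IN", "California"),
    Triple("LLNL", "IS-A", "scientific research laboratory"),
    Triple("LLNL", "FOUNDED-BY", "University of California"),
    Triple("LLNL", "FOUNDED-IN", "1952"),
]


def objects_of(subject, relation, triples=TRIPLES):
    """Return every object linked to `subject` by `relation`."""
    return [t.object for t in triples
            if t.subject == subject and t.relation == relation]


print(objects_of("LLNL", "FOUNDED-BY"))  # ['University of California']
```

Storing facts this way makes the "summary for machine" directly queryable, which is the point of the slide.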
Goal: Machine-readable summaries
  Structured knowledge extraction: Summary for machine
  Textual abstract: Summary for human
From Unstructured Text to Structured Knowledge
  Unstructured text: news articles, blog posts, scientific journal articles, tweets, instant messages, chat logs...
  [Figure: unstructured text transformed into structured knowledge]
  Slides from Rion Snow
Reminder: Task 1: Named Entity Tagging
  <PER> John Hennessy </PER> is a professor at <ORG> Stanford University </ORG>, in <LOC> Palo Alto </LOC>.
  <RNA> TAR </RNA> independent transactivation by <PROTEIN> Tat </PROTEIN> in cells derived from the <CELL> CNS </CELL> - a novel mechanism of <DNA> HIV-1 gene </DNA> regulation.
  Slide from Chris Manning
Reminder: Maximum Entropy Markov Model
  [Figure: MEMM assigning DNA and O labels to the tokens “of HIV-1 gene regulation”]
  Slide from Chris Manning
Task II: Relation Extraction
Relations between words
  Language understanding applications need word meaning!
    Question answering
    Conversational agents
    Summarization
  One key meaning component: word relations
    Hierarchical (ontological) relations
      “San Francisco” ISA “city”
    Other relations between words
      “alternator” is a part of a “car”
Relation Prediction
  “...works by such authors as Herrick, Goldsmith, and Shakespeare.”
  “If you consider authors like Shakespeare...”
  “Some authors (including Shakespeare)...”
  “Shakespeare was the author of several...”
  “Shakespeare, author of The Tempest...”
  Shakespeare IS-A author (0.87)
  How can we capture the variability of expression of a relation in natural text from a large, unannotated corpus?
Hyponymy
  One sense is a hyponym of another if the first sense is more specific, denoting a subclass of the other
    car is a hyponym of vehicle
    dog is a hyponym of animal
    mango is a hyponym of fruit
  Conversely
    vehicle is a hypernym/superordinate of car
    animal is a hypernym of dog
    fruit is a hypernym of mango
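The hyponym/hypernym definitions can be sketched with a toy lookup table. This is a minimal illustration, not WordNet's data model: the dictionary `HYPERNYM_OF` and the function `is_hyponym` are hypothetical, and the `fruit -> food` link is an assumption added only to show that the relation is followed transitively.

```python
# Toy taxonomy: each word maps to its direct hypernym.
# The first three links are the slide's examples; "fruit" -> "food" is an
# assumed extra link used only to demonstrate transitivity.
HYPERNYM_OF = {
    "car": "vehicle",
    "dog": "animal",
    "mango": "fruit",
    "fruit": "food",
}


def is_hyponym(word, candidate):
    """True if `candidate` is reachable from `word` by following hypernym links."""
    seen = set()
    current = word
    while current in HYPERNYM_OF and current not in seen:
        seen.add(current)  # guard against accidental cycles in the table
        current = HYPERNYM_OF[current]
        if current == candidate:
            return True
    return False


print(is_hyponym("car", "vehicle"))   # True
print(is_hyponym("vehicle", "car"))   # False: the relation is directional
print(is_hyponym("mango", "food"))    # True, via fruit
```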
WordNet relations
  X is-a-kind-of Y (hyponym / hypernym)
  X is-a-part-of Y (meronym / holonym)
  Slide from Rion Snow
WordNet is incomplete; ontological relations are missing for many words This is especially true for specific domains (restaurants, auto parts, finance)
Other kinds of Relations: Disease Outbreaks
  Extract structured information from text
  “May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…”
  [Figure: an Information Extraction System (e.g., NYU’s Proteus) producing the table “Disease Outbreaks in The New York Times”]
  Slide from Eugene Agichtein
More relations: Protein Interactions
  “We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex.”
  [Figure: relation graph with edges “interact”, “complex” and “associates” over CBF-A, CBF-C, CBF-B and the CBF-A-CBF-C complex]
  Slide from Rosario and Hearst
Yet More Relations
  CHICAGO (AP) -- Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.
  Slide from Jim Martin
Relation Types For generic news texts... Slide from Jim Martin
More relations: UMLS
  Unified Medical Language System integrates linguistic, terminological and semantic information
  Semantic Network consists of 134 semantic types and 54 relations between types
    Pharmacologic Substance   affects      Pathologic Function
    Pharmacologic Substance   causes       Pathologic Function
    Pharmacologic Substance   complicates  Pathologic Function
    Pharmacologic Substance   diagnoses    Pathologic Function
    Pharmacologic Substance   prevents     Pathologic Function
    Pharmacologic Substance   treats       Pathologic Function
  Slide from Paul Buitelaar
Relations in Ontologies: GO (Gene Ontology)
  Aligns descriptions of gene products in different databases, including plant, animal and microbial genomes
  Organizing principles are molecular function, biological process and cellular component
  Accession:   GO:0009292
  Ontology:    biological process
  Synonyms:    broad: genetic exchange
  Definition:  In the absence of a sexual life cycle, the processes involved in the introduction of genetic information to create a genetically different individual.
  Term Lineage:
    all : all (164142)
      GO:0008150 : biological process (115947)
        GO:0007275 : development (11892)
          GO:0009292 : genetic transfer (69)
  Slide from Paul Buitelaar
Relations in Ontologies: geographical
  Ontology (F-Logic similar)
  [Figure: classes Geographical Entity (GE), Inhabited GE, Natural GE, country, city, river and mountain linked by is-a; relations flow_through, located_in and capital_of; instances (instance_of) Neckar (length 367 km), Zugspitze (height 2962 m), Germany, Berlin and Stuttgart]
  Design: Philipp Cimiano
  Slide from Paul Buitelaar
MeSH (Medical Subject Headings) Thesaurus
  [Figure: a MeSH record showing the MeSH Descriptor, its Definition and its Synonym set]
  Slide from Illhoi Yoo, Xiaohua (Tony) Hu, and Il-Yeol Song
MeSH Tree
  MeSH Ontology
    Hierarchically arranged from most general to most specific.
    Actually a graph rather than a tree: descriptors normally appear in more than one place in the tree
  Slide from Illhoi Yoo, Xiaohua (Tony) Hu, and Il-Yeol Song
Types of ACE Relations, 2003
  ROLE - relates a person to an organization or a geopolitical entity
    Subtypes: member, owner, affiliate, client, citizen
  PART - generalized containment
    Subtypes: subsidiary, physical part-of, set membership
  AT - permanent and transient locations
    Subtypes: located, based-in, residence
  SOCIAL - social relations among persons
    Subtypes: parent, sibling, spouse, grandparent, associate
  Slide from Doug Appelt
Frequent Freebase Relations
Predicting the “is-a” relation
  “...works by such authors as Herrick, Goldsmith, and Shakespeare.”
  “If you consider authors like Shakespeare...”
  “Some authors (including Shakespeare)...”
  “Shakespeare was the author of several...”
  “Shakespeare, author of The Tempest...”
  Shakespeare IS-A author (0.87)
  How can we capture the variability of expression of a relation in natural text from a large, unannotated corpus?
Why this is hard: Ambiguity!
  Which relations hold between 2 entities (Treatment, Disease)?
    Cure?
    Prevent?
    Side Effect?
Different relations between Disease (Hepatitis) and Treatment
  Cure
    These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135.
  Prevent
    A two-dose combined hepatitis A and B vaccine would facilitate immunization programs
  Vague
    Effect of interferon on hepatitis B
  Slide from Barbara Rosario and Marti Hearst
5 easy methods for relation extraction Hand-built patterns Supervised methods Bootstrapping (seed) methods Unsupervised methods Distant supervision
A complex hand-built extraction rule [NYU Proteus]
Goal: Add hyponyms to WordNet directly from text.
  Intuition from Hearst (1992)
  “Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use”
  What does Gelidium mean?
  How do you know?
Hearst’s Hand-Designed Lexico-Syntactic Patterns
  (Hearst, 1992): Automatic Acquisition of Hyponyms
    “Y such as X ((, X)* (, and/or) X)”
    “such Y as X…”
    “X… or other Y”
    “X… and other Y”
    “Y including X…”
    “Y, especially X…”
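Two of the patterns above ("such Y as X…" and "Y such as X…") can be approximated with regular expressions. This is a rough sketch under a strong assumption: a single alphabetic word stands in for a noun phrase, whereas Hearst's patterns are defined over noun phrases. The names `NP`, `NP_LIST`, `PATTERNS` and `extract_hyponyms` are hypothetical.

```python
import re

# Assumption: one alphabetic word stands in for a noun phrase (NP).
# "and"/"or" are excluded so they are read as list conjunctions, not as NPs.
NP = r"(?!(?:and|or)\b)[A-Za-z][A-Za-z-]*"
# X ((, X)* (, and/or) X)
NP_LIST = rf"{NP}(?:, {NP})*(?:,? (?:and|or) {NP})?"

PATTERNS = [
    re.compile(rf"such (?P<hyper>{NP}) as (?P<hypos>{NP_LIST})"),    # such Y as X...
    re.compile(rf"(?P<hyper>{NP}),? such as (?P<hypos>{NP_LIST})"),  # Y such as X...
]


def extract_hyponyms(text):
    """Return (hyponym, hypernym) pairs found by the patterns above."""
    pairs = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            # Split the matched list on ", " and on "and"/"or" conjunctions.
            hypos = re.split(r",? (?:and|or) |, ", match.group("hypos"))
            pairs.extend((hypo, match.group("hyper")) for hypo in hypos)
    return pairs


sentence = "...works by such authors as Herrick, Goldsmith, and Shakespeare."
print(extract_hyponyms(sentence))
# [('Herrick', 'authors'), ('Goldsmith', 'authors'), ('Shakespeare', 'authors')]
```

Because a word is only a stand-in for a noun phrase, this sketch over-generates when a comma after the list introduces unrelated text (as in the Agar sentence, where ", for" follows "Gelidium"). Matching over chunked or parsed text avoids that.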
Hearst’s hand-built patterns for Relation Extraction
Problem with hand-built patterns
  Requires that we hand-build patterns for each relation!
    Don’t want to have to do this for all possible relations!
  We’d like better accuracy
5 easy methods for relation extraction Hand-built patterns Supervised methods Bootstrapping (seed) methods Unsupervised methods Distant supervision
2. Supervised Relation Extraction
  Sometimes done in 3 steps
    Find all pairs of named entities
    Decide if 2 entities are related
    If yes, classify the relation
  Why the extra step?
    Cuts down on training time for classification by eliminating most pairs
    Produces separate feature sets that are appropriate for each task
Relation Analysis Usually just run on named entities within the same sentence Slide from Jim Martin
Relation Extraction
  Task definition: to label the semantic relation between a pair of entities in a sentence (fragment)
    …[leader]arg-1 of a minority [government]arg-2…
  Candidate labels: PHYS, PER-SOC, EMP-ORG, NIL
    PHYS: Physical
    PER-SOC: Personal / Social
    EMP-ORG: Employment / Membership / Subsidiary
  Slide from Jing Jiang
Supervised Learning
  Supervised machine learning (e.g. [Zhou et al. 2005], [Bunescu & Mooney 2005], [Zhang et al. 2006], [Surdeanu & Ciaramita 2007])
  Training data is needed for each relation type
    …[leader]arg-1 of a minority [government]arg-2…
    Features:
      arg-1 word: leader
      arg-2 type: ORG
      dependency: arg-1 of arg-2
    Candidate labels: EMP-ORG, PHYS, PER-SOC, NIL
  Slide from Jing Jiang
ACE 2008 Six Relations
Features: Words
  Headwords of M1 and M2, and combination
    George Washington Bridge
  Bag of words and bigrams in M1 and M2
  Words or bigrams in particular positions to the left and right of M1 and M2
    +/- 1, 2, 3
  Bag of words or bigrams between the two entities
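The word features listed above can be sketched as a function from a tokenized sentence and two mention spans to a set of feature strings. This is an illustration under stated assumptions: the function name `word_features` and the feature-name prefixes are hypothetical, the last token of a mention is used as a crude headword, and only the +/-1 context positions are shown.

```python
def word_features(tokens, m1, m2):
    """Word-level features for a mention pair.

    tokens: list of str; m1, m2: (start, end) token spans, end exclusive,
    with m1 occurring before m2 in the sentence.
    """
    (s1, e1), (s2, e2) = m1, m2
    # Assumption: the last token of a mention approximates its headword.
    head1, head2 = tokens[e1 - 1], tokens[e2 - 1]
    feats = {
        f"head_m1={head1}",
        f"head_m2={head2}",
        f"head_pair={head1}|{head2}",
    }
    # Bag of words inside each mention.
    feats |= {f"m1_word={w}" for w in tokens[s1:e1]}
    feats |= {f"m2_word={w}" for w in tokens[s2:e2]}
    # Bag of words and bigrams between the two mentions.
    between = tokens[e1:s2]
    feats |= {f"between_word={w}" for w in between}
    feats |= {f"between_bigram={a}_{b}" for a, b in zip(between, between[1:])}
    # Words at position -1 of M1 and +1 of M2, when they exist.
    if s1 > 0:
        feats.add(f"left_of_m1={tokens[s1 - 1]}")
    if e2 < len(tokens):
        feats.add(f"right_of_m2={tokens[e2]}")
    return feats


tokens = ["American", "Airlines", ",", "a", "unit", "of", "AMR", ",",
          "immediately", "matched", "the", "move", ",", "spokesman",
          "Tim", "Wagner", "said", "."]
feats = word_features(tokens, m1=(0, 2), m2=(14, 16))
print(sorted(f for f in feats if f.startswith("head")))
```

The "George Washington Bridge" example on the slide is a reminder that picking the headword is not always this simple; a parser-based head finder would replace the last-token shortcut.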
Features: Named Entity Type and Mention Level Named-entity types (ORG, LOC, etc) Concatenation of the types Entity Level of M1 and M2  (NAME, NOMINAL, PRONOUN)
Features: Parse Tree and Base Phrases Syntactic environment Constituent path through the tree from one to the other Base syntactic chunk sequence from one to the other Dependency path Slide from Jim Martin
Features: Gazetteers and trigger words
  Personal relative trigger list from WordNet: parent, wife, husband, grandparent, etc.
  Country name list
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
Classifiers for supervised methods
  Now you can use any classifier you like
    SVM
    Logistic regression
    Naïve Bayes
    etc.
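As one concrete instance of "any classifier you like", here is a minimal Naive Bayes over sets of feature strings, written in plain Python. It is a sketch, not the setup used in any of the cited papers: the class name, the feature strings and the six training examples are all made up for illustration, using the label set PHYS / PER-SOC / EMP-ORG / NIL from the earlier slide.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over bags of string features, add-one smoothing."""

    def fit(self, examples):
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in examples:
            self.label_counts[label] += 1
            for f in feats:
                self.feat_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for f in feats:
                if f in self.vocab:  # ignore features never seen in training
                    score += math.log((self.feat_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data (feature set, relation label).
TRAIN = [
    ({"head_m1=leader", "between=of", "types=PER-ORG"}, "EMP-ORG"),
    ({"head_m1=spokesman", "between=for", "types=PER-ORG"}, "EMP-ORG"),
    ({"between=in", "types=ORG-LOC"}, "PHYS"),
    ({"between=near", "types=PER-LOC"}, "PHYS"),
    ({"between=wife", "types=PER-PER"}, "PER-SOC"),
    ({"between=and", "types=PER-PER"}, "NIL"),
]

model = NaiveBayes().fit(TRAIN)
print(model.predict({"head_m1=leader", "between=of", "types=PER-ORG"}))  # EMP-ORG
```

Swapping in an SVM or logistic regression changes only the model class; the feature extraction stays the same.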
Summary
  Can get high accuracies with enough hand-labeled training data
    If test data looks exactly like the training data
  But
    labeling 5000 relations (and named entities) is expensive
    the approach doesn’t generalize to different genres
5 easy methods for relation extraction Hand-built patterns Supervised methods Bootstrapping (seed) methods Unsupervised methods Distant supervision
Bootstrapping Approaches
  What if you don’t have enough annotated text to train on?
    But you might have some seed tuples
    Or you might have some patterns that work pretty well
  Can you use those seeds to do something useful?
    Co-training and active learning use the seeds to train classifiers to tag more data to train better classifiers...
    Bootstrapping tries to learn directly (populate a relation) through direct use of the seeds
  Slide from Jim Martin
Bootstrapping Example: Seed Tuple
  <Mark Twain, Elmira> Seed tuple
  Grep (google)
    “Mark Twain is buried in Elmira, NY.”
      X is buried in Y
    “The grave of Mark Twain is in Elmira”
      The grave of X is in Y
    “Elmira is Mark Twain’s final resting place”
      Y is X’s final resting place.
  Use those patterns to grep for new tuples that you don’t already know
  Slide from Jim Martin
Hearst (1992) proposal for bootstrapping
  Choose lexical relation R.
  Gather a set of pairs that have this relation
  Find places in the corpus where these expressions occur near each other and record the environment.
  Find the commonalities among these environments and hypothesize that common ones yield patterns that indicate the relation of interest.
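The loop described above (seed tuples, then patterns, then new tuples) can be sketched in a few lines. This is a simplification made for illustration: a pattern is just the text found between X and Y, entities are approximated by runs of capitalized words, and the fourth corpus sentence is a hypothetical addition so that the loop has something new to find. The names `ENTITY`, `learn_patterns`, `apply_patterns` and `bootstrap` are not from the lecture.

```python
import re

# Assumption: a run of capitalized words stands in for a named entity.
ENTITY = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

CORPUS = [
    "Mark Twain is buried in Elmira, NY.",
    "The grave of Mark Twain is in Elmira",
    "Elmira is Mark Twain’s final resting place",
    "Emily Dickinson is buried in Amherst.",  # hypothetical extra sentence
]
SEEDS = {("Mark Twain", "Elmira")}


def learn_patterns(corpus, tuples):
    """Record the text between X and Y wherever a known tuple occurs (X first)."""
    patterns = set()
    for sentence in corpus:
        for x, y in tuples:
            i, j = sentence.find(x), sentence.find(y)
            if i != -1 and j != -1 and i + len(x) < j:
                patterns.add(sentence[i + len(x):j])
    return patterns


def apply_patterns(corpus, patterns):
    """Find new (X, Y) pairs whose connecting text equals a learned pattern."""
    found = set()
    for middle in patterns:
        regex = re.compile(rf"({ENTITY}){re.escape(middle)}({ENTITY})")
        for sentence in corpus:
            for m in regex.finditer(sentence):
                found.add((m.group(1), m.group(2)))
    return found


def bootstrap(corpus, seeds, rounds=2):
    tuples = set(seeds)
    for _ in range(rounds):
        tuples |= apply_patterns(corpus, learn_patterns(corpus, tuples))
    return tuples


print(sorted(bootstrap(CORPUS, SEEDS)))
# [('Emily Dickinson', 'Amherst'), ('Mark Twain', 'Elmira')]
```

Note that this sketch also learns the very general pattern " is in " from the second sentence. On a larger corpus such a pattern would pull in unrelated pairs, which is one motivation for scoring patterns and constraining argument types.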
Slide from Jim Martin Bootstrapping Relations
DIPRE (Brin 1998)
  Extract <author, book> pairs.
  Start with these 5 seeds
  Learn these patterns:
  Now iterate, using these patterns to get more instances and patterns…
Snowball (Agichtein and Gravano 2000) New idea: require that X and Y be named entities of particular types {<’s 0.7> <in 0.7> <headquarters 0.7>} LOCATION  ORGANIZATION {<- 0.75> <based 0.75>} LOCATION  ORGANIZATION
5 easy methods for relation extraction Hand-built patterns Supervised methods Bootstrapping (seed) methods Unsupervised methods Distant supervision
Distant supervision paradigm
  Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17
  Mintz, Bills, Snow, Jurafsky. 2009. Distant supervision for relation extraction without labeled data. ACL-2009.
  Instead of hand-creating 5 seed examples
    Use a large database to get our seed examples
      lots of examples
      supervision from a database, not a corpus!
      Not genre-dependent!
  Create lots and lots of noisy features from all these examples
  Combine in a classifier
Distant supervision paradigm
  Has advantages of supervised classification:
    use of rich hand-created knowledge
  Has advantages of unsupervised classification:
    infinite amounts of data
    allows for very large number of weak features
    not sensitive to training corpus
Relation Classification with “Distant Supervision”
  We construct a noisy training set consisting of occurrences from our corpus that contain an IS-A pair according to WordNet.
  This leads to high-signal examples like:
    “...consider authors like Shakespeare...”
    “Some authors (including Shakespeare)...”
    “Shakespeare was the author of several...”
    “Shakespeare, author of The Tempest...”
  But noisy examples like:
    “The author of Shakespeare in Love...”
    “...authors at the Shakespeare Festival...”
  Training set (TREC and Wikipedia): 14,000 hypernym pairs, ~600,000 total pairs
  Slide from Rion Snow
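The labeling step can be sketched as a pure database lookup: every sentence containing a known IS-A pair becomes a positive example, whether or not the sentence actually expresses the relation. In the sketch below a small Python set stands in for WordNet, and the `("Tempest", "author")` pair is a made-up negative; `KNOWN_IS_A`, `OCCURRENCES` and `distant_labels` are hypothetical names.

```python
# Stand-in for WordNet: known (hyponym, hypernym) pairs. Illustrative only.
KNOWN_IS_A = {("Shakespeare", "author")}

# (noun, noun, sentence) co-occurrences, as a noun-pair collector might emit.
# The first four sentences are from the slide; the last pair is a made-up negative.
OCCURRENCES = [
    ("Shakespeare", "author", "...consider authors like Shakespeare..."),
    ("Shakespeare", "author", "Shakespeare, author of The Tempest..."),
    ("Shakespeare", "author", "The author of Shakespeare in Love..."),
    ("Shakespeare", "author", "...authors at the Shakespeare Festival..."),
    ("Tempest", "author", "Shakespeare, author of The Tempest..."),
]


def distant_labels(occurrences, known_pairs):
    """Label every occurrence by lookup on the pair; the sentence is never read."""
    return [((hypo, hyper) in known_pairs, sentence)
            for hypo, hyper, sentence in occurrences]


training_set = distant_labels(OCCURRENCES, KNOWN_IS_A)
positives = [sentence for label, sentence in training_set if label]
print(len(positives))  # 4: two high-signal sentences and two noisy ones
```

The two noisy sentences are labeled positive because the label comes from the pair alone. The bet behind the approach is that, over a large corpus, a classifier with many weak features can still separate informative contexts from accidental ones.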
How to learn patterns
  Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17
  1. Take corpus sentences
       … doubly heavy hydrogen atom called deuterium…
  2. Collect noun pairs
       e.g. (atom, deuterium)
       752,311 pairs from 6M words of newswire
  3. Is pair an IS-A in WordNet?
       e.g. YES
       14,387 yes, 737,924 no
  4. Parse the sentences
  5. Extract patterns
       69,592 dependency paths (>5 pairs)
  6. Train classifier on these patterns
       Logistic regression with 70K features (actually converted to 974,288 bucketed binary features)
One of 70,000 patterns
  “<superordinate> ‘called’ <subordinate>”
  Learned from cases such as:
    “sarcoma / cancer”: …an uncommon bone cancer called osteogenic sarcoma and to…
    “deuterium / atom”: ….heavy water rich in the doubly heavy hydrogen atom called deuterium.
  New pairs discovered:
    “efflorescence / condition”: …and a condition called efflorescence are other reasons for…
    “’neal_inc / company”: …The company, now called O'Neal Inc., was sole distributor of E-Ferol…
    “hat_creek_outfit / ranch”: …run a small ranch called the Hat Creek Outfit.
    “hiv-1 / aids_virus”: …infected by the AIDS virus, called HIV-1.
    “bateau_mouche / attraction”: …local sightseeing attraction called the Bateau Mouche...
    “kibbutz_malkiyya / collective_farm”: …an Israeli collective farm called Kibbutz Malkiyya…
Hypernym Precision / Recall for all Features
  [Figure: precision/recall curves for all features]
  Slide from Rion Snow
Idea: use each pattern as a feature!
  Precision/Recall for Hypernym Classification
  [Figure: precision/recall curves; logistic regression, 10-fold cross validation on 14,000 WordNet-labeled pairs]
  Slide from Rion Snow
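For reference, precision and recall over a set of predicted hypernym pairs can be computed as below. The pairs are toy values chosen for illustration, not results from the paper, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data (illustrative only): (hyponym, hypernym) pairs.
gold = {("deuterium", "atom"), ("sarcoma", "cancer"), ("mango", "fruit")}
predicted = {("deuterium", "atom"), ("sarcoma", "cancer"),
             ("Tempest", "author"), ("car", "fruit")}

p, r = precision_recall(predicted, gold)
print(p)  # 0.5 (2 of 4 predictions are correct)
```

Sweeping the classifier's decision threshold and recomputing these two numbers at each setting is what produces a precision/recall curve.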
Outline
  Reminder: Named Entity Tagging
  Relation Extraction
    Hand-built patterns
    Seed (bootstrap) methods
    Supervised classification
    Distant supervision

Contenu connexe

En vedette

Near-optimal Character Animation with Continuous Control
Near-optimal Character Animation with Continuous ControlNear-optimal Character Animation with Continuous Control
Near-optimal Character Animation with Continuous Controlbutest
 
What s an Event ? How Ontologies and Linguistic Semantics ...
What s an Event ? How Ontologies and Linguistic Semantics ...What s an Event ? How Ontologies and Linguistic Semantics ...
What s an Event ? How Ontologies and Linguistic Semantics ...butest
 
What s an Event ? How Ontologies and Linguistic Semantics ...
What s an Event ? How Ontologies and Linguistic Semantics ...What s an Event ? How Ontologies and Linguistic Semantics ...
What s an Event ? How Ontologies and Linguistic Semantics ...butest
 
2009 HEP Science Network Requirements Workshop Final Report
2009 HEP Science Network Requirements Workshop Final Report2009 HEP Science Network Requirements Workshop Final Report
2009 HEP Science Network Requirements Workshop Final Reportbutest
 
mingdraft2.doc
mingdraft2.docmingdraft2.doc
mingdraft2.docbutest
 
cs-project.doc
cs-project.doccs-project.doc
cs-project.docbutest
 
JENIS – JENIS SISTEM OPERASI PADA KOMPUTER DAN HANDPHONE NAMA ...
JENIS – JENIS SISTEM OPERASI PADA KOMPUTER DAN HANDPHONE NAMA ...JENIS – JENIS SISTEM OPERASI PADA KOMPUTER DAN HANDPHONE NAMA ...
JENIS – JENIS SISTEM OPERASI PADA KOMPUTER DAN HANDPHONE NAMA ...butest
 
Noshir Contractor - WebSci'09 - Society On-Line
Noshir Contractor - WebSci'09 - Society On-LineNoshir Contractor - WebSci'09 - Society On-Line
Noshir Contractor - WebSci'09 - Society On-Linebutest
 
Introduction to Machine Learning Aristotelis Tsirigos
Introduction to Machine Learning Aristotelis Tsirigos Introduction to Machine Learning Aristotelis Tsirigos
Introduction to Machine Learning Aristotelis Tsirigos butest
 
Automatic Hypernym Classification: Towards the Induction of ...
Automatic Hypernym Classification: Towards the Induction of ...Automatic Hypernym Classification: Towards the Induction of ...
Automatic Hypernym Classification: Towards the Induction of ...butest
 

En vedette (10)

Near-optimal Character Animation with Continuous Control
Near-optimal Character Animation with Continuous ControlNear-optimal Character Animation with Continuous Control
Near-optimal Character Animation with Continuous Control
 
What s an Event ? How Ontologies and Linguistic Semantics ...
What s an Event ? How Ontologies and Linguistic Semantics ...What s an Event ? How Ontologies and Linguistic Semantics ...
What s an Event ? How Ontologies and Linguistic Semantics ...
 
What s an Event ? How Ontologies and Linguistic Semantics ...
What s an Event ? How Ontologies and Linguistic Semantics ...What s an Event ? How Ontologies and Linguistic Semantics ...
What s an Event ? How Ontologies and Linguistic Semantics ...
 
2009 HEP Science Network Requirements Workshop Final Report
2009 HEP Science Network Requirements Workshop Final Report2009 HEP Science Network Requirements Workshop Final Report
2009 HEP Science Network Requirements Workshop Final Report
 
mingdraft2.doc
mingdraft2.docmingdraft2.doc
mingdraft2.doc
 
cs-project.doc
cs-project.doccs-project.doc
cs-project.doc
 
JENIS – JENIS SISTEM OPERASI PADA KOMPUTER DAN HANDPHONE NAMA ...
JENIS – JENIS SISTEM OPERASI PADA KOMPUTER DAN HANDPHONE NAMA ...JENIS – JENIS SISTEM OPERASI PADA KOMPUTER DAN HANDPHONE NAMA ...
JENIS – JENIS SISTEM OPERASI PADA KOMPUTER DAN HANDPHONE NAMA ...
 
Noshir Contractor - WebSci'09 - Society On-Line
Noshir Contractor - WebSci'09 - Society On-LineNoshir Contractor - WebSci'09 - Society On-Line
Noshir Contractor - WebSci'09 - Society On-Line
 
Introduction to Machine Learning Aristotelis Tsirigos
Introduction to Machine Learning Aristotelis Tsirigos Introduction to Machine Learning Aristotelis Tsirigos
Introduction to Machine Learning Aristotelis Tsirigos
 
Automatic Hypernym Classification: Towards the Induction of ...
Automatic Hypernym Classification: Towards the Induction of ...Automatic Hypernym Classification: Towards the Induction of ...
Automatic Hypernym Classification: Towards the Induction of ...
 


  • 1. CS 124/LINGUIST 180: From Languages to Information Dan Jurafsky Lecture 14: Information Extraction and Semantic Relation Learning Lots of slides from many people, including Rion Snow, Jim Martin, Chris Manning, and William Cohen
  • 2. Background: Information Extraction Extract information from text Sometimes called text analytics commercially Extract entities (the people, organizations, locations, times, dates, genes, diseases, medicines, etc. in a text) Extract the relations between entities Figure out the larger events that are taking place
  • 3. Information Extraction Creating knowledge bases and ontologies Implications for cognitive modeling Digital Libraries Google Scholar, Citeseer need to extract the title, author and references Bioinformatics Patent analysis Specific market segments for stock analysis SEC filings Intelligence analysis
  • 4. Outline Reminder: Named Entity Tagging Relation Extraction Hand-built patterns Seed (bootstrap) methods Supervised classification Distant supervision
  • 5. What is “Information Extraction” As a task: Filling slots in a database from sub-segments of text. October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… NAME TITLE ORGANIZATION Slide from William Cohen
  • 6. What is “Information Extraction” As a task: Filling slots in a database from sub-segments of text. October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… IE output: NAME | TITLE | ORGANIZATION: Bill Gates | CEO | Microsoft; Bill Veghte | VP | Microsoft; Richard Stallman | founder | Free Soft.. Slide from William Cohen
  • 7. What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + clustering + association October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation (“named entity extraction”) Slide from William Cohen
  • 8. What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + association + clustering October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation Slide from William Cohen
  • 9. What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + association + clustering October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation Slide from William Cohen
  • 10. What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + association + clustering October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation Resulting table: NAME | TITLE | ORGANIZATION: Bill Gates | CEO | Microsoft; Bill Veghte | VP | Microsoft; Richard Stallman | founder | Free Soft.. Slide from William Cohen
  • 11. Extracting Structured Knowledge Each article can contain hundreds or thousands of items of knowledge... “The Lawrence Livermore National Laboratory (LLNL) in Livermore, California is a scientific research laboratory founded by the University of California in 1952.” LLNL EQ Lawrence Livermore National Laboratory; LLNL LOC-IN California; Livermore LOC-IN California; LLNL IS-A scientific research laboratory; LLNL FOUNDED-BY University of California; LLNL FOUNDED-IN 1952
  • 12. Goal: Machine-readable summaries Structured knowledge extraction: Summary for machine Textual abstract: Summary for human
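The relations listed on this slide can be stored as (subject, relation, object) triples. The sketch below is only one possible representation; the `Triple` type and the `facts_about` helper are illustrative names and do not come from the lecture.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


# The six relations the slide lists for the LLNL sentence.
TRIPLES = [
    Triple("LLNL", "EQ", "Lawrence Livermore National Laboratory"),
    Triple("LLNL", "LOC-IN", "California"),
    Triple("Livermore", "LOC-IN", "California"),
    Triple("LLNL", "IS-A", "scientific research laboratory"),
    Triple("LLNL", "FOUNDED-BY", "University of California"),
    Triple("LLNL", "FOUNDED-IN", "1952"),
]


def facts_about(subject, triples=TRIPLES):
    """Group the objects of every triple about `subject` by relation name."""
    facts = {}
    for t in triples:
        if t.subject == subject:
            facts.setdefault(t.relation, []).append(t.obj)
    return facts
```

Calling `facts_about("LLNL")` gives a slot-filled record for one entity, which is the "summary for machine" view of the sentence.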
  • 13. From Unstructured Text to Structured Knowledge Unstructured Text News articles... slide from Rion Snow
  • 14. From Unstructured Text to Structured Knowledge Unstructured Text Blog posts.... slide from Rion Snow
  • 15. From Unstructured Text to Structured Knowledge Unstructured Text Scientific journal articles... slide from Rion Snow
  • 16. From Unstructured Text to Structured Knowledge Unstructured Text Tweets, instant messages, chat logs... slide from Rion Snow
  • 17. From Unstructured Text to Structured Knowledge Unstructured Text slide from Rion Snow
  • 18. From Unstructured Text to Structured Knowledge Structured Knowledge Unstructured Text slide from Rion Snow
  • 25. Reminder: Maximum Entropy Markov Model [Figure: MEMM tagging example over the tokens "regulation", "of", "HIV−1", "gene" with the labels DNA and O] Slide from Chris Manning
  • 26. Task II: Relation Extraction
  • 27. Relations between words Language understanding applications need word meaning! Question answering Conversational agents Summarization One key meaning component: word relations Hierarchical (ontological) relations “San Francisco” ISA “city” Other relations between words “alternator” is a part of a “car”
  • 28. Relation Prediction “...works by such authors as Herrick, Goldsmith, and Shakespeare.” “If you consider authors like Shakespeare...” “Some authors (including Shakespeare)...” “Shakespeare was the author of several...” “Shakespeare, author of The Tempest...” Shakespeare IS-A author (0.87) How can we capture the variability of expression of a relation in natural text from a large, unannotated corpus?
  • 29. Hyponymy One sense is a hyponym of another if the first sense is more specific, denoting a subclass of the other car is a hyponym of vehicle dog is a hyponym of animal mango is a hyponym of fruit Conversely vehicle is a hypernym/superordinate of car animal is a hypernym of dog fruit is a hypernym of mango
  • 30. WordNet relations X is-a-kind-of Y (hyponym / hypernym) X is-a-part-of Y (meronym / holonym) slide from Rion Snow
  • 31. WordNet is incomplete; ontological relations are missing for many words This is especially true for specific domains (restaurants, auto parts, finance)
  • 32. Other kinds of Relations: Disease Outbreaks Extract structured information from text May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis… Disease Outbreaks in The New York Times Information Extraction System (e.g., NYU’s Proteus) Slide from Eugene Agichtein
  • 33. More relations: Protein Interactions “We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex.” [Figure: CBF-A and CBF-C interact to form the CBF-A-CBF-C complex; CBF-B associates with the complex] Slide from Rosario and Hearst
  • 34. Yet More Relations CHICAGO (AP) -- Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York. Slide from Jim Martin
  • 35. Relation Types For generic news texts... Slide from Jim Martin
  • 36. More relations: UMLS Unified Medical Language System integrates linguistic, terminological and semantic information Semantic Network consists of 134 semantic types and 54 relations between types Pharmacologic Substance affects Pathologic Function Pharmacologic Substance causes Pathologic Function Pharmacologic Substance complicates Pathologic Function Pharmacologic Substance diagnoses Pathologic Function Pharmacologic Substance prevents Pathologic Function Pharmacologic Substance treats Pathologic Function Slide from Paul Buitelaar
  • 37. Relations in Ontologies: GO (Gene Ontology) GO (Gene Ontology) Aligns descriptions of gene products in different databases, including plant, animal and microbial genomes Organizing principles are molecular function, biological process and cellular component Accession: GO:0009292 Ontology: biological process Synonyms: broad: genetic exchange Definition: In the absence of a sexual life cycle, the processes involved in the introduction of genetic information to create a genetically different individual. Term Lineage all : all (164142) GO:0008150 : biological process (115947) GO:0007275 : development (11892) GO:0009292 : genetic transfer (69) Slide from Paul Buitelaar
  • 38. Relations in Ontologies: geographical [Figure: geographical ontology, labeled "F-Logic similar". Concepts: Geographical Entity (GE), Inhabited GE, Natural GE, country, city, river, mountain. Relations: is-a, instance_of, located_in, capital_of, flow_through. Attributes: length (km), height (m). Instances and values shown: Neckar, Zugspitze, Germany, Berlin, Stuttgart, 367, 2962] Design: Philipp Cimiano Slide from Paul Buitelaar
  • 39. MeSH (Medical Subject Headings) Thesaurus Definition MeSH Descriptor Synonym set Slide from Illhoi Yoo, Xiaohua (Tony) Hu, and Il-Yeol Song
  • 40. MeSH Tree MeSH Ontology Hierarchically arranged from most general to most specific. Actually a graph rather than a tree: terms normally appear in more than one place in the tree Slide from Illhoi Yoo, Xiaohua (Tony) Hu, and Il-Yeol Song
  • 41. Types of ACE Relations, 2003 ROLE - relates a person to an organization or a geopolitical entity Subtypes: member, owner, affiliate, client, citizen PART - generalized containment Subtypes: subsidiary, physical part-of, set membership AT - permanent and transient locations Subtypes: located, based-in, residence SOCIAL - social relations among persons Subtypes: parent, sibling, spouse, grandparent, associate Slide from Doug Appelt
  • 43. Predicting the “is-a” relation “...works by such authors as Herrick, Goldsmith, and Shakespeare.” “If you consider authors like Shakespeare...” “Some authors (including Shakespeare)...” “Shakespeare was the author of several...” “Shakespeare, author of The Tempest...” Shakespeare IS-A author (0.87) How can we capture the variability of expression of a relation in natural text from a large, unannotated corpus?
  • 44. Why this is hard: Ambiguity! Which relations hold between 2 entities (Treatment, Disease)? Cure? Prevent? Side Effect?
  • 45. Different relations between Disease (Hepatitis) and Treatment Cure: These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135. Prevent: A two-dose combined hepatitis A and B vaccine would facilitate immunization programs Vague: Effect of interferon on hepatitis B Slide from Barbara Rosario and Marti Hearst
  • 46. 5 easy methods for relation extraction Hand-built patterns Supervised methods Bootstrapping (seed) methods Unsupervised methods Distant supervision
  • 47. 5 easy methods for relation extraction Hand-built patterns Bootstrapping (seed) methods Supervised methods Unsupervised methods Distant supervision
  • 48. A complex hand-built extraction rule [NYU Proteus]
  • 49. Goal: Add hyponyms to WordNet directly from text. Intuition from Hearst (1992) “Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use” What does Gelidium mean? How do you know?
  • 51. Hearst’s Hand-Designed Lexico-Syntactic Patterns (Hearst, 1992): Automatic Acquisition of Hyponyms “Y such as X ((, X)* (, and/or) X)” “such Y as X…” “X… or other Y” “X… and other Y” “Y including X…” “Y, especially X…”
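The patterns on this slide can be approximated with regular expressions. The following is a minimal sketch, not Hearst's implementation: it assumes a noun phrase is a single word (a real system would use a chunker or parser), and the names `ITEM`, `PATTERNS`, and `hearst_pairs` are illustrative. Because of the one-word assumption it will over-extract when a list is followed by another comma-separated clause.

```python
import re

# One-word stand-in for a noun phrase (an assumption). The lookahead keeps
# the conjunctions "and"/"or" from being read as list items.
ITEM = r"(?!(?:and|or)\b)[A-Za-z][\w-]*"
# "X", "X, X", or "X, X, and X"
ITEMS = rf"{ITEM}(?:\s*,\s*{ITEM})*(?:\s*,?\s*(?:and|or)\s+{ITEM})?"

# Group "y" is the hypernym, group "xs" the list of hyponyms.
PATTERNS = [
    re.compile(rf"\b(?P<y>{ITEM})\s*,?\s+such\s+as\s+(?P<xs>{ITEMS})"),
    re.compile(rf"\bsuch\s+(?P<y>{ITEM})\s+as\s+(?P<xs>{ITEMS})"),
    re.compile(rf"\b(?P<y>{ITEM})\s*,?\s+(?:including|especially)\s+(?P<xs>{ITEMS})"),
    re.compile(rf"\b(?P<xs>{ITEMS})\s*,?\s+(?:and|or)\s+other\s+(?P<y>{ITEM})"),
]
SEPARATOR = re.compile(r"\s*,?\s*\b(?:and|or)\b\s*|\s*,\s*")


def hearst_pairs(text):
    """Return the set of (hyponym, hypernym) pairs matched by any pattern."""
    pairs = set()
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            for hyponym in SEPARATOR.split(match.group("xs")):
                pairs.add((hyponym, match.group("y")))
    return pairs
```

On the lecture's own example, "works by such authors as Herrick, Goldsmith, and Shakespeare.", the second pattern fires and each of the three names is paired with "authors".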
  • 52. Hearst’s hand-built patterns for Relation Extraction
  • 53. Problem with hand-built patterns Requires that we hand-build patterns for each relation! We don’t want to have to do this for all possible relations! Plus, we’d like better accuracy
  • 54. 5 easy methods for relation extraction Hand-built patterns Supervised methods Bootstrapping (seed) methods Unsupervised methods Distant supervision
  • 55. 2. Supervised Relation Extraction Sometimes done in 3 steps: Find all pairs of named entities Decide if 2 entities are related If yes, classify the relation Why the extra step? Cuts down on training time for classification by eliminating most pairs Allows separate feature sets that are appropriate for each task.
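The three steps above can be sketched as a small pipeline. This is a sketch under stated assumptions: the two classifiers are passed in as plain functions, and `toy_is_related` and `toy_classify` are hand-written stand-ins for trained models, not anything described in the lecture. The mentions and their types are typed in by hand rather than produced by a named-entity tagger.

```python
from itertools import combinations


def extract_relations(mentions, is_related, classify):
    """mentions: list of (text, entity_type) pairs from one sentence."""
    relations = []
    for m1, m2 in combinations(mentions, 2):   # step 1: all entity pairs
        if not is_related(m1, m2):             # step 2: cheap binary filter
            continue
        label = classify(m1, m2)               # step 3: label surviving pairs
        relations.append((m1[0], label, m2[0]))
    return relations


# Toy stand-ins for the two trained classifiers (assumptions for illustration).
def toy_is_related(m1, m2):
    return m1[1] == "ORG" and m2[1] == "ORG"


def toy_classify(m1, m2):
    return "PART"


mentions = [("American Airlines", "ORG"), ("AMR", "ORG"), ("Tim Wagner", "PER")]
relations = extract_relations(mentions, toy_is_related, toy_classify)
```

Keeping the filter and the labeler as separate callables mirrors the slide's point that each step can use its own feature set.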
  • 56. Relation Analysis Usually just run on named entities within the same sentence Slide from Jim Martin
  • 57. Relation Extraction Task definition: to label the semantic relation between a pair of entities in a sentence (fragment) …[leader]arg-1 of a minority [government]arg-2… Candidate labels: PHYS, PER-SOC, EMP-ORG, NIL PHYS: Physical PER-SOC: Personal / Social EMP-ORG: Employment / Membership / Subsidiary Slide from Jing Jiang
  • 58. Supervised Learning Supervised machine learning (e.g. [Zhou et al. 2005], [Bunescu & Mooney 2005], [Zhang et al. 2006], [Surdeanu & Ciaramita 2007]) Training data is needed for each relation type …[leader]arg-1 of a minority [government]arg-2… Features: arg-1 word: leader; arg-2 type: ORG; dependency: arg-1  of  arg-2 Labels: EMP-ORG, PHYS, PER-SOC, NIL Slide from Jing Jiang
  • 59. ACE 2008 Six Relations
  • 60. Features: Words Headwords of M1 and M2, and combination George Washington Bridge Bag of words and bigrams in M1 and M2 Words or bigrams in particular positions to the left and right of the M1 and M2 +/- 1, 2, 3 Bag of words or bigrams between the two entities
  • 61. Features: Named Entity Type and Mention Level Named-entity types (ORG, LOC, etc) Concatenation of the types Entity Level of M1 and M2 (NAME, NOMINAL, PRONOUN)
  • 62. Features: Parse Tree and Base Phrases Syntactic environment Constituent path through the tree from one to the other Base syntactic chunk sequence from one to the other Dependency path Slide from Jim Martin
  • 63. Features: Gazetteers and trigger words Personal relative trigger list from WordNet: parent, wife, husband, grandparent, etc. Country name list
  • 64. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
  • 65. Classifiers for supervised methods Now you can use any classifier you like SVM Logistic regression Naïve Bayes etc
  • 66. Summary Can get high accuracies with enough hand-labeled training data, if the test data looks exactly like the training data. But labeling 5000 relations (and named entities) is expensive, and the approach doesn’t generalize to different genres
  • 67. 5 easy methods for relation extraction Hand-built patterns Supervised methods Bootstrapping (seed) methods Unsupervised methods Distant supervision
  • 68. Bootstrapping Approaches What if you don’t have enough annotated text to train on? But you might have some seed tuples, or you might have some patterns that work pretty well. Can you use those seeds to do something useful? Co-training and active learning use the seeds to train classifiers to tag more data to train better classifiers... Bootstrapping tries to learn directly (populate a relation) through direct use of the seeds. Slide from Jim Martin
  • 69. Bootstrapping Example: Seed Tuple <Mark Twain, Elmira>. Grep (google) for the seed tuple and generalize each hit into a pattern: “Mark Twain is buried in Elmira, NY.” → X is buried in Y; “The grave of Mark Twain is in Elmira” → The grave of X is in Y; “Elmira is Mark Twain’s final resting place” → Y is X’s final resting place. Use those patterns to grep for new tuples that you don’t already know. Slide from Jim Martin
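One round of this seed-to-pattern-to-tuple loop can be sketched with regular expressions. This is a toy illustration under several assumptions: the helper names and the two small corpora are invented for the example, the first seed sentence drops the trailing ", NY" so the whole sentence generalizes cleanly, and a slot is assumed to match a run of capitalized words.

```python
import re

# Assumed slot shape: one or more capitalized words, e.g. "Mark Twain".
SLOT = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

def induce_patterns(seed, sentences):
    """Turn each sentence that mentions both seed arguments into a regex pattern."""
    x, y = seed
    patterns = []
    for s in sentences:
        if x in s and y in s:
            template = re.escape(s.rstrip("."))
            # Replace the escaped seed strings with named capture slots.
            template = template.replace(re.escape(x), "(?P<x>" + SLOT + ")", 1)
            template = template.replace(re.escape(y), "(?P<y>" + SLOT + ")", 1)
            patterns.append(re.compile(template))
    return patterns

def apply_patterns(patterns, sentences):
    """Grep new text with the induced patterns and collect <X, Y> tuples."""
    found = set()
    for s in sentences:
        for p in patterns:
            m = p.search(s)
            if m:
                found.add((m.group("x"), m.group("y")))
    return found

seed_corpus = [
    "Mark Twain is buried in Elmira.",
    "The grave of Mark Twain is in Elmira.",
    "Elmira is Mark Twain's final resting place.",
]
new_corpus = [
    "Emily Dickinson is buried in Amherst.",
    "The grave of Karl Marx is in London.",
    "Concord is Henry Thoreau's final resting place.",
    "Karl Marx wrote in London.",
]
patterns = induce_patterns(("Mark Twain", "Elmira"), seed_corpus)
new_tuples = apply_patterns(patterns, new_corpus)
```

Feeding `new_tuples` back in as seeds would start the next iteration. Whole-sentence templates like these are very specific, so a real system would generalize the context and score patterns before trusting them.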
  • 70. Hearst (1992) proposal for bootstrapping Choose lexical relation R. Gather a set of pairs that have this relation Find places in the corpus where these expressions occur near each other and record the environment. Find the commonalities among these environments and hypothesize that common ones yield patterns that indicate the relation of interest.
  • 71. Slide from Jim Martin Bootstrapping Relations
  • 72. DIPRE (Brin 1998) Extract <author, book> pairs. Start with 5 seed pairs. Learn patterns from them. Now iterate, using these patterns to get more instances and patterns…
  • 73. Snowball (Agichtein and Gravano 2000) New idea: require that X and Y be named entities of particular types. Example patterns: ORGANIZATION {<’s 0.7> <in 0.7> <headquarters 0.7>} LOCATION; LOCATION {<- 0.75> <based 0.75>} ORGANIZATION
  • 74. 5 easy methods for relation extraction Hand-built patterns Supervised methods Bootstrapping (seed) methods Unsupervised methods Distant supervision
  • 75. Distant supervision paradigm Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17. Mintz, Bills, Snow, Jurafsky. 2009. Distant supervision for relation extraction without labeled data. ACL-2009. Instead of hand-creating 5 seed examples, use a large database to get our seed examples: lots of examples; supervision from a database, not a corpus! Not genre-dependent! Create lots and lots of noisy features from all these examples. Combine in a classifier
  • 76. Distant supervision paradigm Has advantages of supervised classification: use of rich hand-created knowledge. Has advantages of unsupervised classification: infinite amounts of data; allows for a very large number of weak features; not sensitive to training corpus
  • 77. Relation Classification with “Distant Supervision” We construct a noisy training set consisting of occurrences from our corpus that contain an IS-A pair according to WordNet. Slide from Rion Snow
  • 82. Relation Classification with “Distant Supervision” We construct a noisy training set consisting of occurrences from our corpus that contain an IS-A pair according to WordNet. This leads to high-signal examples like: “...consider authors like Shakespeare...” “Some authors (including Shakespeare)...” “Shakespeare was the author of several...” “Shakespeare, author of The Tempest...” But noisy examples like: “The author of Shakespeare in Love...” “...authors at the Shakespeare Festival...” Training set (TREC and Wikipedia): 14,000 hypernym pairs, ~600,000 total pairs. Slide from Rion Snow
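The labeling step can be sketched as a simple lookup: every corpus occurrence of a noun pair inherits its label from the knowledge base, whether or not the sentence actually expresses the relation. In this sketch the set `KNOWN_ISA` is a hand-written stand-in for WordNet lookups, and `distant_label` is a hypothetical helper name.

```python
# Hand-written stand-in for WordNet: (hyponym, hypernym) pairs known to be IS-A.
KNOWN_ISA = {("shakespeare", "author")}

def distant_label(occurrences, known_isa):
    """Label each (hyponym, hypernym, sentence) occurrence by database lookup only."""
    labeled = []
    for hypo, hyper, text in occurrences:
        is_positive = (hypo.lower(), hyper.lower()) in known_isa
        labeled.append((text, hypo, hyper, is_positive))
    return labeled

occurrences = [
    ("Shakespeare", "author", "...consider authors like Shakespeare..."),
    ("Shakespeare", "author", "Shakespeare was the author of several..."),
    # Noisy: the pair is in the database, but this sentence does not express IS-A.
    ("Shakespeare", "author", "The author of Shakespeare in Love..."),
    # Negative: the pair is not in the database.
    ("Tempest", "author", "Shakespeare, author of The Tempest..."),
]
training_set = distant_label(occurrences, KNOWN_ISA)
```

The third occurrence ends up labeled positive even though it is noise. The approach accepts such errors and relies on the volume of examples and many weak features to outweigh them.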
  • 83. How to learn patterns Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17. Running example: “… doubly heavy hydrogen atom called deuterium…” (1) Take corpus sentences. (2) Collect noun pairs: 752,311 pairs from 6M words of newswire, e.g. (atom, deuterium). (3) Is the pair an IS-A in WordNet? 14,387 yes, 737,924 no; for the example, YES. (4) Parse the sentences. (5) Extract patterns: 69,592 dependency paths (>5 pairs). (6) Train classifier on these patterns: logistic regression with 70K features (actually converted to 974,288 bucketed binary features)
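The pattern-extraction and feature-building steps can be sketched as counting which patterns connect which noun pairs, dropping rare patterns, and turning the surviving counts into one feature vector per pair. The pattern strings below are surface stand-ins for dependency paths, `build_features` is a hypothetical helper, and the toy threshold `min_pairs=2` replaces the slide's cutoff of more than 5 pairs.

```python
from collections import defaultdict

def build_features(observations, min_pairs=2):
    """observations: iterable of (noun_pair, pattern) seen in parsed sentences.

    Returns (vocab, vectors): the kept patterns, and per-pair count vectors
    aligned with vocab.
    """
    pairs_per_pattern = defaultdict(set)
    counts = defaultdict(lambda: defaultdict(int))
    for pair, pattern in observations:
        pairs_per_pattern[pattern].add(pair)
        counts[pair][pattern] += 1

    # Keep only patterns that connect enough distinct noun pairs.
    vocab = sorted(p for p, pairs in pairs_per_pattern.items()
                   if len(pairs) >= min_pairs)
    vectors = {pair: [c.get(p, 0) for p in vocab] for pair, c in counts.items()}
    return vocab, vectors

# Pairs are (hyponym, hypernym); X is the hyponym slot, Y the hypernym slot.
observations = [
    (("deuterium", "atom"), "Y called X"),
    (("sarcoma", "cancer"), "Y called X"),
    (("deuterium", "atom"), "Y called X"),
    (("sarcoma", "cancer"), "X and other Y"),
    (("shakespeare", "author"), "Y like X"),
]
vocab, vectors = build_features(observations)
```

Each vector, together with the pair's IS-A label from the database lookup, would form one training example for the logistic regression classifier. The slide's bucketing of counts into binary features is not reproduced here.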
  • 84. One of 70,000 patterns “<superordinate> ‘called’ <subordinate>” Learned from cases such as: “sarcoma / cancer”: …an uncommon bone cancer called osteogenic sarcoma and to… “deuterium / atom”: …heavy water rich in the doubly heavy hydrogen atom called deuterium. New pairs discovered: “efflorescence / condition”: …and a condition called efflorescence are other reasons for… “o’neal_inc / company”: …The company, now called O'Neal Inc., was sole distributor of E-Ferol… “hat_creek_outfit / ranch”: …run a small ranch called the Hat Creek Outfit. “hiv-1 / aids_virus”: …infected by the AIDS virus, called HIV-1. “bateau_mouche / attraction”: …local sightseeing attraction called the Bateau Mouche... “kibbutz_malkiyya / collective_farm”: …an Israeli collective farm called Kibbutz Malkiyya…
  • 85. Hypernym Precision / Recall for all Features Slide from Rion Snow
  • 89. Idea: use each pattern as a feature! Precision/Recall for Hypernym Classification: logistic regression, 10-fold cross validation on 14,000 WordNet-labeled pairs. Slide from Rion Snow
  • 90. Outline Reminder: Named Entity Tagging Relation Extraction Hand-built patterns Seed (bootstrap) methods Supervised classification Distant supervision