SOFIE is a unified approach for ontology-based information extraction that combines pattern-based extraction from text with facts from an ontology like YAGO. It expresses extraction patterns and rules as logical facts and uses weighted MAX SAT solving to test hypotheses against the rules and facts to perform word sense disambiguation, extract new patterns, and infer new knowledge. Evaluation experiments on Wikipedia and news articles show it can effectively extract information at a large scale and disambiguate entities, though it has limitations for reasoning over ontologies with open world assumptions.
SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Reasoning
1. SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Reasoning. Tobias Wunner, Unit for Natural Language Processing (UNLP), firstname.lastname@deri.org. Wednesday, 22nd June 2011, DERI Reading Group.
2. Based On: “SOFIE: A Self-Organizing Framework for Information Extraction”. Authors: Fabian Suchanek, Mauro Sozio, Gerhard Weikum. Published: World Wide Web Conference (WWW), Madrid, 2009.
4. Motivation. Classical IE on text is pattern-based (80%). The semi-structured approach uses Wikipedia infoboxes (95%). Idea of the paper: combine text (hypotheses) with an ontology (trusted facts).
5. Example. Document 1: “Einstein attended secondary school in Germany.” YAGO ontology: familyName(AlbertEinstein, Einstein); bornIn(AlbertEinstein, Germany). New knowledge: attendedSchoolIn(AlbertEinstein, Germany).
6. General Idea. Express extraction patterns as facts: patternOcc(“X went to school in Y”, Einstein, Switzerland). Add rules to understand the usage of terms: patternOcc(Pattern,X,Y) and R(X,Y) ⇒ express(Pattern,R). Add restrictions.
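The rule on this slide can be tried out on a toy fact base. The following is a minimal sketch, not SOFIE's implementation: the `PatternOcc` type, the `expressed_relations` helper, and the trusted fact `attendedSchoolIn(Einstein, Switzerland)` are all assumptions introduced for illustration.

```python
from typing import NamedTuple


class PatternOcc(NamedTuple):
    """One pattern occurrence: patternOcc(pattern, x, y)."""
    pattern: str
    x: str
    y: str


def expressed_relations(occurrences, facts):
    """Apply patternOcc(P, X, Y) and R(X, Y) => express(P, R).

    `facts` is a set of (relation, x, y) triples; the result is a set of
    (pattern, relation) hypotheses.
    """
    hypotheses = set()
    for occ in occurrences:
        for relation, x, y in facts:
            if (x, y) == (occ.x, occ.y):
                hypotheses.add((occ.pattern, relation))
    return hypotheses


occurrences = [PatternOcc("X went to school in Y", "Einstein", "Switzerland")]
# Hypothetical trusted fact, used only to trigger the rule.
facts = {("attendedSchoolIn", "Einstein", "Switzerland")}
hypotheses = expressed_relations(occurrences, facts)
print(hypotheses)
```

In SOFIE the conclusion is a weighted hypothesis rather than a hard fact; this sketch only shows which hypotheses the rule would propose.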
7. Contribution. A unified approach to pattern matching, word sense disambiguation, and reasoning. Works at large scale on unstructured data.
8. Pattern extraction with WICs. Extract patterns based on ‘interesting’ entities, the WICs (Words in Context). Documents: “Einstein was born at Ulm in Württemberg, Germany, on March 18, 1879. When Albert was around four, his father gave him a magnetic compass. When Albert became older, he went to a school in Switzerland. After he graduated, he got a job in the patent office there…” Knowledge base: patternOcc(“Einstein was born in Ulm”, Einstein@D1, Ulm@D1) [1]; patternOcc(“Ulm is in Württemberg, Germany”, Ulm@D1, Germany@D1) [1]; patternOcc(“Albert .. Switzerland”, Albert@D1, Switzerland@D1) [1].
9. Grounding. Test rules. How? Find an instance which satisfies the formulae. Rules with variables: bornIn(X,Ulm) ⇒ ¬bornIn(X,Timbuktu); studiedIn(X,Ulm). Grounded instances: bornIn(Einstein,Ulm) ⇒ ¬bornIn(Einstein,Timbuktu); studiedIn(Einstein,Ulm).
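Grounding as described on this slide amounts to substituting constants for variables. A minimal sketch follows, assuming rules are kept as format strings; the `ground` helper and the second constant `Newton` are hypothetical and only serve the illustration.

```python
from itertools import product


def ground(rule_template, variables, constants):
    """Substitute every combination of constants for the rule's variables."""
    grounded = []
    for values in product(constants, repeat=len(variables)):
        binding = dict(zip(variables, values))
        grounded.append(rule_template.format(**binding))
    return grounded


rule = "bornIn({X},Ulm) ⇒ ¬bornIn({X},Timbuktu)"
# "Newton" is a hypothetical second entity, not taken from the slides.
instances = ground(rule, ["X"], ["Einstein", "Newton"])
for instance in instances:
    print(instance)
```

The number of grounded instances grows with the number of constants raised to the number of variables, which is one reason the later slides care about scalability.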
10. Rules (Hypotheses). Disambiguation: disambiguatesAs(Albert@D, AlbertEinstein) [?]. Expresses a new fact: expresses(P, livedIn(Einstein,Switzerland)) [?]. New facts: CityIn(Ulm,Germany) [?].
11. New fact rule ...with disambiguation. “Pattern P expresses relation R when the analyzed WICs are disambiguated”: patternOcc(P, WX, WY) and disambiguatesAs(WX, X) and disambiguatesAs(WY, Y) and R(X,Y) ⇒ express(P, R)
12. Restrictions: Disambiguation. The disambiguation prior should influence the choice of disambiguation: disambPrior(W, X, N) ⇒ disambiguatedAs(W, X), where N can be any disambiguation function, e.g. |words(D1) ∩ rel(AlbertEinstein)| / |words(D1)|.
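The overlap fraction on this slide can be computed directly. This is a minimal sketch under the assumption that both words(D1) and rel(AlbertEinstein) are available as bags of lowercase tokens; the `disamb_prior` function name and the word lists are hypothetical. How the slides scale this fraction into weights such as 10 or 3 is not stated, so the sketch returns the raw fraction.

```python
def disamb_prior(document_words, related_words):
    """Return |words(D) ∩ rel(E)| / |words(D)| over distinct words."""
    doc = set(document_words)
    if not doc:
        return 0.0
    return len(doc & set(related_words)) / len(doc)


# Hypothetical token sets, for illustration only.
words_d1 = ["einstein", "born", "ulm", "school"]
rel_albert_einstein = ["ulm", "physics", "school"]
prior = disamb_prior(words_d1, rel_albert_einstein)
print(prior)
```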
14. SOFIE Rules. Framework to test the hypotheses. Question: “How to satisfy all of them?” Rules + trusted facts: disambPrior(Albert@D1, AlbertEinstein, 10) ⇒ disambiguatesAs(Albert@D1, AlbertEinstein); disambPrior(Albert@D1, HermannEinstein, 3) ⇒ disambiguatesAs(Albert@D1, HermannEinstein); patternOcc(P, X, Y) and R(X,Y) ⇒ express(P, R); Country(Germany); livedIn(AlbertEinstein,Ulm); …
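Before rules like these can be handed to a SAT-style solver, each implication has to become a clause: "A and B ⇒ C" is logically equivalent to "¬A or ¬B or C". The sketch below shows that standard conversion and checks it against the truth table; the helper names and the (atom, is_positive) literal encoding are assumptions for illustration, not SOFIE's data structures.

```python
from itertools import product


def implication_to_clause(premises, conclusion):
    """Turn p1 and ... and pn => c into the clause ¬p1 or ... or ¬pn or c.

    A literal is a pair (atom, is_positive).
    """
    return [(atom, False) for atom in premises] + [(conclusion, True)]


def clause_holds(clause, assignment):
    """A clause holds if at least one literal matches the assignment."""
    return any(assignment[atom] == positive for atom, positive in clause)


premises = ["patternOcc(P,X,Y)", "R(X,Y)"]
conclusion = "express(P,R)"
clause = implication_to_clause(premises, conclusion)

# The clause is violated in exactly one row of the truth table:
# all premises true and the conclusion false.
atoms = premises + [conclusion]
violations = [
    dict(zip(atoms, row))
    for row in product([False, True], repeat=len(atoms))
    if not clause_holds(clause, dict(zip(atoms, row)))
]
print(violations)
```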
15. SAT / MAX SAT. SAT (satisfiability): prove that a formula can be TRUE. Complexity classes: P is good, for example N^k; NP is bad, c^N, e.g. a naive algorithm for 100 variables takes 2^100 rows × 10^-10 s per row ≈ 4 × 10^12 years. Not always that bad: 3SAT in (4/3)^N. SAT solver examples: F = (X or Y or Z) and (¬X or Y or Z) and (¬X or ¬Y or ¬Z), whose truth table has 2^3 rows; G = (X or Y) and (¬X or ¬Y) and (X). Details: Schöning 2010.
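The naive truth-table approach mentioned on this slide is easy to write down for the two example formulas. This is a minimal sketch of brute-force enumeration, not a practical SAT solver; the function name `satisfying_assignments` is an assumption.

```python
from itertools import product


def satisfying_assignments(formula, names):
    """Enumerate all 2^n truth-table rows and keep those where formula holds."""
    rows = product([False, True], repeat=len(names))
    return [dict(zip(names, row)) for row in rows if formula(*row)]


def F(x, y, z):
    return (x or y or z) and ((not x) or y or z) and ((not x) or (not y) or (not z))


def G(x, y):
    return (x or y) and ((not x) or (not y)) and x


f_models = satisfying_assignments(F, ["X", "Y", "Z"])
g_models = satisfying_assignments(G, ["X", "Y"])
print(len(f_models), g_models)
```

Each clause of F rules out exactly one of the 2^3 rows, so five rows remain; G forces X to be true and therefore Y to be false. The same enumeration over 100 variables would visit 2^100 rows, which is the blow-up the slide warns about.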
17. Weighted MAX SAT in SOFIE. ...back to SOFIE: this is MAX SAT, but with weights. Rules + trusted facts: Country(Germany); livedIn(AlbertEinstein,Ulm); …; disambPrior(Albert@D1, AlbertEinstein, 10) ⇒ disambiguatesAs(Albert@D1, AlbertEinstein); disambPrior(Albert@D1, HermannEinstein, 3) ⇒ disambiguatesAs(Albert@D1, HermannEinstein); patternOcc(P, X, Y) and R(X,Y) ⇒ express(P, R)
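To make "MAX SAT, but with weights" concrete, here is a minimal brute-force sketch: it tries every assignment and keeps the one with the largest total weight of satisfied clauses. The two unit clauses reuse the priors 10 and 3 from the slide; the third clause (Albert@D1 cannot be disambiguated as both entities) and its weight of 100 are assumptions added for illustration, as are the variable names.

```python
from itertools import product


def weighted_max_sat(clauses, variables):
    """Exhaustive weighted MAX SAT; only feasible for a handful of variables.

    `clauses` is a list of (literals, weight); a literal is (variable, is_positive).
    """
    best_weight, best_assignment = -1, None
    for row in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, row))
        weight = sum(
            w for literals, w in clauses
            if any(assignment[var] == positive for var, positive in literals)
        )
        if weight > best_weight:
            best_weight, best_assignment = weight, assignment
    return best_weight, best_assignment


clauses = [
    ([("albert", True)], 10),    # disambiguatesAs(Albert@D1, AlbertEinstein)
    ([("hermann", True)], 3),    # disambiguatesAs(Albert@D1, HermannEinstein)
    # Assumed exclusion constraint with an assumed high weight.
    ([("albert", False), ("hermann", False)], 100),
]
weight, assignment = weighted_max_sat(clauses, ["albert", "hermann"])
print(weight, assignment)
```

With these weights the best assignment picks AlbertEinstein and rejects HermannEinstein, since giving up the weight-3 clause costs less than giving up the weight-10 one.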
18. Weighted MAX SAT in SOFIE. Weighted MAX SAT is NP-hard: finding the optimal solution is impractical, so only approximation algorithms are used. SAT solver: Johnson’s algorithm, with an approximation guarantee of 2/3.
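A greedy approximation in the spirit of Johnson's algorithm can be sketched with the method of conditional expectations: fix variables one by one, each time choosing the value that keeps the expected satisfied weight (under fair coin flips for the remaining variables) at least as high. This formulation, the helper names, and the toy clauses are assumptions for illustration; the slides do not spell out the algorithm.

```python
def expected_weight(clauses, assignment):
    """Expected satisfied weight if unassigned variables are set by fair coin flips.

    Assumes each variable occurs at most once per clause.
    """
    total = 0.0
    for literals, weight in clauses:
        unassigned = 0
        satisfied = False
        for var, positive in literals:
            if var in assignment:
                if assignment[var] == positive:
                    satisfied = True
                    break
            else:
                unassigned += 1
        if satisfied:
            total += weight
        else:
            total += weight * (1 - 0.5 ** unassigned)
    return total


def greedy_assignment(clauses, variables):
    """Fix each variable to the value with the higher conditional expectation."""
    assignment = {}
    for var in variables:
        if_true = expected_weight(clauses, {**assignment, var: True})
        if_false = expected_weight(clauses, {**assignment, var: False})
        assignment[var] = if_true >= if_false
    return assignment


# Toy instance: (X or Y) [2], (not X) [1], (not Y) [1].
clauses = [
    ([("X", True), ("Y", True)], 2),
    ([("X", False)], 1),
    ([("Y", False)], 1),
]
assignment = greedy_assignment(clauses, ["X", "Y"])
satisfied = expected_weight(clauses, assignment)  # all variables fixed: exact weight
print(assignment, satisfied)
```

On this toy instance the greedy choice happens to reach the optimum; in general it only guarantees a fraction of the optimal weight.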
19. Weighted MAX SAT in SOFIE: Functional MAX SAT. Specialized reasoning (support for functional properties). Approximation guarantee: 1/2. Propagates dominating unit clauses. Considers only unit clauses. Clauses carry weights: A v B [w1], A v B [w2], B v C [w3], C [w4]. Example of a dominating unit clause: ¬A v B [10], ¬A [10], A [30] gives A = true, because 30 > 10 + 10.
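The "dominating unit clause" test can be sketched as follows, assuming the slide's example reads ¬A v B [10], ¬A [10], A [30]: a unit clause dominates when its weight exceeds the combined weight of all clauses that contain the opposite literal. The function name, the literal encoding, and the simplification of treating each unit clause separately are assumptions; this is not the full Functional MAX SAT algorithm.

```python
def find_dominating_unit_clause(clauses):
    """Return (variable, value) for the first dominating unit clause, else None.

    `clauses` is a list of (literals, weight); a literal is (variable, is_positive).
    """
    for literals, weight in clauses:
        if len(literals) != 1:
            continue
        var, positive = literals[0]
        # Total weight of the clauses that would benefit from the opposite value.
        opposing = sum(w for lits, w in clauses if (var, not positive) in lits)
        if weight > opposing:
            return var, positive
    return None


clauses = [
    ([("A", False), ("B", True)], 10),  # ¬A v B [10]
    ([("A", False)], 10),               # ¬A [10]
    ([("A", True)], 30),                # A [30]
]
decision = find_dominating_unit_clause(clauses)
print(decision)
```

Setting A to true loses at most 10 + 10 in weight but gains 30, so fixing it early cannot hurt the final result.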
21. Controlled experiment. Large-scale: corpus of 2000 Wikipedia articles; 13 frequent relations from YAGO. Parsing: 87 min. Reasoning: 77 min.
22. Unstructured text sources. 150 newspaper articles. Relation under test: headquarterOf (a functional relation). YAGO (modified with relation seeds). Parsing: 87 min. Weighted MAX SAT: 77 min. Disambiguated entries (provenance) could be manually assessed.
23. Unstructured text sources. Large-scale: 10 biographies for each of 400 US senators; 5 relationships. Disambiguation was not ideal for YAGO (13 James Watsons). Parsing: 7 h. Weighted MAX SAT: 9 h. Results: 4 good, 1 bad (misleading patterns).
24. MAX SAT can’t do OWL per se (Open World Assumption). Reformulate OWL in propositional logic: OWL → FOL → Skolem normal form → propositional logic. Might find OWL-inconsistent ontologies due to the Open World Assumption. Example: define a student as a subclass of “attends some course” ⇒ ∀x ∃y: attends(x,y), Course(y) → Student(x) ⇒ ∀x: attends(x,k), Course(k) → Student(x); ∃k ⇒ ¬attends(xi, ki) or ¬Course(ki) or Student(xi); k = x1 .. xn. Inferred ontology: { Student(alex), Student(bob), Student subClassOf attends some Course, attends(alex, SemanticWeb) }. Details: JMC 2010.
25. Conclusions. Ontology-based IE (OBIE) is reformulated as a weighted MAX SAT problem. The approximation algorithm has a guarantee of 1/2. It works and scales (large corpus + YAGO).
26. Limitations. Specialized approximation algorithm: accounts for SOFIE rules, NOT OWL. MAX SAT restrictions: ∈ propositional logic, ∉ first-order logic. Ontology population approach (can’t infer new relations).
27. References. F. Suchanek et al., SOFIE: A Self-Organizing Framework for Information Extraction, Proceedings of the 18th International Conference on World Wide Web (WWW '09). John McCrae, Automatic Extraction of Logically Consistent Ontologies from Text, PhD thesis, National Institute of Informatics, Japan, 2009. Uwe Schöning, Das SAT-Problem, Informatik Spektrum 33(5): 479-483, 2010. F. Suchanek, Automated Construction and Growth of a Large Ontology, PhD thesis, Saarland University, Saarbrücken, Germany, 2009.