Paper presentation at ICWE2013.
Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents
http://icwe2013.webengineering.org/accepted-full-papers
1. Motivation
Data on the Web
09/07/13 ICWE 2013, Aalborg, Denmark
Some eye-catching opener illustrating the growth and/or diversity of Web data
Summaries on the fly:
Query-based Extraction of Structured Knowledge
from Web Documents
ICWE 2013: International Conference on Web Engineering
8-12 July 2013, Aalborg, Denmark
Besnik Fetahu, Bernardo Pereira Nunes, Stefan Dietze
(L3S Research Center, DE)
3. Introduction
• Motivation
– Large amounts of textual Web Documents
– Efficient techniques for querying relevant information
– Extraction of chunks of text: relations, named entities, etc.
– Summaries as a means of highlighting the most important chunks of text
• Issues:
– Summaries as non-structured text
– Weak relationship between user interests and the importance of specific chunks of text in a corpus
4. Prominent Text Summarisation Approaches
• Heuristics for relation extraction
• Extraction of information based on predefined templates
• Sentence inclusion based on inclusion of specific terms
• Latent Semantic Analysis (LSA) for measuring importance of specific terms
• Tree Kernels encoding relevant information for event detection
• Latent Dirichlet Allocation (LDA) for topic modelling
• Populating ontologies based on extracted information from text
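Areas covered: Information Extraction (IE), Information Retrieval (IR), Machine Learning (ML), Semantic Web (SW)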
5. Focused Knowledge Extraction
Overview
• Structured Summary Generation Components:
– Query Expansion and Reformulation
– Named Entity Definition and Co-Reference Resolution
– Pattern Generation
– Contextual Structure of Summaries
6. Focused Knowledge Extraction
Pipeline
Figure: pipeline from user query to structured summary.
• User query: “Stem Cell”
• Query typing and expansion: Anatomical structure, Biotechnology, Cloning, Cell biology, Developmental Biology, Stem Cell
• Corpus filtering: OR/AND of expanded query terms, yielding the filtered documents
• Annotation of the filtered documents: NER, POS
• Patterns
• Structured summary (Entities, Actions), for example:
Democrats → applauded → Mr. Spitzer Eliot (Gov) calls → insure → 500 000 children → lack → health insurance → enroll → 900 000 adults → are → eligible Medicaid → enrolled → issue debt → pay → stem cell research.
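The corpus filtering step of the pipeline can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `matches` and `filter_corpus`, the case-insensitive substring matching, and the three toy documents are assumptions. Only the expanded query terms are taken from the slide.

```python
def matches(doc: str, terms: list[str], mode: str = "OR") -> bool:
    """Return True if the document contains any (OR) or all (AND) of the terms."""
    text = doc.lower()
    hits = [term.lower() in text for term in terms]
    return any(hits) if mode == "OR" else all(hits)


def filter_corpus(corpus: list[str], terms: list[str], mode: str = "OR") -> list[str]:
    """Keep only the documents that satisfy the OR/AND combination of terms."""
    return [doc for doc in corpus if matches(doc, terms, mode)]


# Expanded query terms from the slide; the documents are made up for illustration.
expanded_terms = ["stem cell", "cloning", "biotechnology"]
corpus = [
    "Congress debates funding for stem cell research.",
    "Biotechnology firms study cloning and stem cell lines.",
    "The Super Bowl drew a record audience.",
]

or_docs = filter_corpus(corpus, expanded_terms, "OR")
and_docs = filter_corpus(corpus, expanded_terms, "AND")
print(len(or_docs), len(and_docs))
```

OR keeps any document mentioning at least one expanded term, which favours recall. AND keeps only documents mentioning all of them, which favours precision.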
8. Focused Knowledge Extraction -
Named Entity Definitions & Co-Reference Resolution
• Entities recognised using NER & NED tools (Stanford’s NLP toolkit)
• Construct a co-occurrence matrix of proper nouns appearing consecutively
• Sample entities: “Chicago Bears”, “playoff games”
• Co-reference resolution crucial for accurate knowledge extraction
$$entity_{Misc}[i] = \sum_{i=1}^{k} \text{co-occurr}(term_i, term_{i+1})$$
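A minimal sketch of counting consecutive proper nouns, assuming the input is already POS-tagged with Penn Treebank tags (NNP, NNPS). The helper names `consecutive_proper_noun_counts` and `entity_score` and the two tagged sentences are hypothetical, and the slides do not specify how the matrix is stored, so a `Counter` keyed by word pairs is used here.

```python
from collections import Counter


def consecutive_proper_noun_counts(tagged_sentences):
    """Count how often two proper nouns appear directly next to each other."""
    counts = Counter()
    for sentence in tagged_sentences:
        for (word1, tag1), (word2, tag2) in zip(sentence, sentence[1:]):
            if tag1.startswith("NNP") and tag2.startswith("NNP"):
                counts[(word1, word2)] += 1
    return counts


def entity_score(terms, counts):
    """Sum the co-occurrence counts of each consecutive term pair in a candidate."""
    return sum(counts[(a, b)] for a, b in zip(terms, terms[1:]))


# Hand-tagged toy sentences (assumption, for illustration only).
tagged = [
    [("The", "DT"), ("Chicago", "NNP"), ("Bears", "NNPS"), ("won", "VBD")],
    [("Chicago", "NNP"), ("Bears", "NNPS"), ("fans", "NNS"), ("cheered", "VBD")],
]
counts = consecutive_proper_noun_counts(tagged)
print(entity_score(["Chicago", "Bears"], counts))
```

A candidate whose terms co-occur repeatedly, such as “Chicago Bears”, receives a higher score than an accidental pairing seen once.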
9. Focused Knowledge Extraction
Pattern Generation
• Determine topic terms (LDA) from the
underlying filtered corpus
• Annotate topic terms using POS taggers
• Pattern items:
– POS tags from topic terms
– Query terms (incl. terms after expansion)
Example (figure):
• Topic terms (LDA): police found women men dr death people drug medical officers man problems study killed heart hospital test sex patients evidence dead drugs officer…
• POS-annotated topic terms: police_NN found_VBD women_NNS men_NNS dr_VBP death_NN people_NNS drug_NN medical_JJ officers_NNS man_NN problems_NNS study_NN killed_VBD heart_NN hospital_NN test_NN sex_NN patients_NNS evidence_NN dead_NN drugs_NNS officer_NN
• POS tags as pattern items: NN → VBD → NNS → VBP → NN…
• Query terms as pattern items: Stem Cell → Anatomical structure → Biotechnology → Cloning → Cell Biology → Developmental Biology
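The step from annotated topic terms to a POS tag sequence can be sketched as below. The function name `tag_sequence` is hypothetical. Collapsing consecutive repeated tags is an assumption, made because the sequence on the slide lists NNS once although two NNS terms are adjacent.

```python
from itertools import groupby


def tag_sequence(annotated: str) -> list[str]:
    """Extract POS tags from 'word_TAG' tokens, collapsing consecutive repeats."""
    tags = [token.rsplit("_", 1)[1] for token in annotated.split()]
    return [tag for tag, _ in groupby(tags)]


# First six annotated topic terms from the slide.
annotated = "police_NN found_VBD women_NNS men_NNS dr_VBP death_NN"
pattern_items = tag_sequence(annotated)
print(" → ".join(pattern_items))  # NN → VBD → NNS → VBP → NN
```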
10. Focused Knowledge Extraction
Pattern Generation (I)
• Construct co-occurrence matrix of pattern items (POS tags, Query terms)
• Automatically generate emerging patterns that reflect the syntactic relevance of chunks of text
• Patterns are sequences of co-occurring items, modelled as directed tree graphs
• For each pattern item, generate a directed tree graph with that item as the root node
• The pattern score conveys a pattern's importance for a given corpus and query
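One way to read the tree construction is to treat every root-to-node path as a candidate pattern. The sketch below follows that reading under stated assumptions: the co-occurrence matrix is a nested dict of counts, an item may appear only once per pattern, and `max_len` bounds the depth. The name `patterns_from_root` and the toy counts are hypothetical and not taken from the paper.

```python
def patterns_from_root(root, cooc, max_len=4):
    """Enumerate root-to-node paths of the tree grown from one pattern item."""
    patterns = []

    def walk(path):
        if len(path) > 1:
            patterns.append(tuple(path))
        if len(path) == max_len:
            return
        # Follow every item that co-occurs after the last item on the path.
        for nxt, count in sorted(cooc.get(path[-1], {}).items()):
            if count > 0 and nxt not in path:
                walk(path + [nxt])

    walk([root])
    return patterns


# Toy co-occurrence counts between pattern items (assumption).
cooc = {
    "NN": {"VB": 3, "JJ": 2},
    "VB": {"RB": 1},
    "JJ": {},
    "RB": {},
}
for pattern in patterns_from_root("NN", cooc):
    print(" → ".join(pattern))
```

Calling the function once per pattern item yields one tree per root, matching the bullet above.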
11. Focused Knowledge Extraction
Pattern Generation (II)
Generated Patterns                    Pattern Score ψscore
NN → JJ → VB → RB                     0.28571429
NN → VB → JJ → RB                     0.19949495
Stem Cell → NN → VB → RB → JJ         0.17361111
JJ → RB → VB → NN → Stem Cell         0.17347462
RB → JJ → NN → Stem Cell              0.16466599
NN → Stem Cell → RB → VB → JJ         0.16155811
RB → VB → Stem Cell → NN → JJ         0.16129665
Automatically generated patterns show the sequence of important syntactic items expected to appear in a sentence.
Scoring mechanism: a pattern's score is the marginal probability of its co-occurring pattern items, estimated from the filtered corpus. It combines:
– the prior probability of a pattern item, taken as the head node of the directed tree graph
– the conditional probability of each pair of consecutive pattern items
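The scoring description can be illustrated with a small sketch. It assumes the score is the product ψscore(x_1 … x_n) = P(x_1) · Π P(x_{i+1} | x_i), with both probabilities estimated from raw counts. The exact formula is not recoverable from the slide text, so this form, the names `fit_counts` and `pattern_score`, and the toy sequences are assumptions.

```python
from collections import Counter


def fit_counts(sequences):
    """Collect item counts and consecutive-pair counts from item sequences."""
    unigrams, bigrams, firsts = Counter(), Counter(), Counter()
    for seq in sequences:
        unigrams.update(seq)
        for a, b in zip(seq, seq[1:]):
            bigrams[(a, b)] += 1
            firsts[a] += 1  # how often `a` is followed by anything
    return unigrams, bigrams, firsts


def pattern_score(pattern, unigrams, bigrams, firsts):
    """Prior of the head item times the conditionals of consecutive items."""
    total = sum(unigrams.values())
    if total == 0 or not pattern:
        return 0.0
    score = unigrams[pattern[0]] / total  # prior P(x_1)
    for a, b in zip(pattern, pattern[1:]):
        if firsts[a] == 0:
            return 0.0
        score *= bigrams[(a, b)] / firsts[a]  # conditional P(b | a)
    return score


# Toy sequences of pattern items (assumption).
sequences = [["NN", "VB", "JJ"], ["NN", "JJ", "VB"]]
unigrams, bigrams, firsts = fit_counts(sequences)
score = pattern_score(["NN", "VB", "JJ"], unigrams, bigrams, firsts)
print(round(score, 4))
```

Under this reading, a pattern scores highly when its head item is frequent and each following item is a common successor of the one before it.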
12. Focused Knowledge Extraction
Contextual Structure of Summaries
• Summaries generated as structured knowledge
• Decomposition of summaries into two structures:
– global (Entities, Actions) for entire corpus
– local (entity-context, action-context) for particular document
• Multiple summary perspectives based on generated context
• Enrichment with additional information from reference datasets (DBpedia)
13. Focused Knowledge Extraction
Contextual Structure of Summaries
Contextual structure of summaries, with global and local structures enabling multiple summary perspectives. Example sentence:
“The kinds of stem cell therapies being researched for the most part do not involve the politically sensitive use of embryonic stem cells.”
• Global structure:
– Entities: Stem cell, Therapies
– Actions: researched, involve
• Local structure:
– Stem Cell: Embryonic, sensitive
– researched: Stem cell therapies ↔ most part
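The global and local structures of the example can be held in a small data structure. The sketch below is only one possible representation: the class name `StructuredSummary`, its field names, and the `entity_perspective` helper are hypothetical, while the values are those of the example sentence.

```python
from dataclasses import dataclass, field


@dataclass
class StructuredSummary:
    # Global structure: entities and actions.
    entities: list[str]
    actions: list[str]
    # Local structure: context attached to one entity or one action.
    entity_context: dict[str, list[str]] = field(default_factory=dict)
    action_context: dict[str, tuple[str, str]] = field(default_factory=dict)

    def entity_perspective(self, entity: str) -> dict[str, list[str]]:
        """View the summary from the perspective of a single entity."""
        return {entity: self.entity_context.get(entity, [])}


summary = StructuredSummary(
    entities=["Stem cell", "Therapies"],
    actions=["researched", "involve"],
    entity_context={"Stem cell": ["embryonic", "sensitive"]},
    action_context={"researched": ("Stem cell therapies", "most part")},
)
print(summary.entity_perspective("Stem cell"))
```

Keeping the two levels separate lets the same summary be viewed per entity or per action, which is what the slide calls multiple summary perspectives.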
14. Evaluation Setup
• Dataset: New York Times, year 2007
• 40,000 articles with manually generated summaries
• Summary relevance w.r.t. the generated context (query)
• Coverage of the manually generated NYT summaries
• ROUGE-n metric to measure coverage of structured vs. manually generated summaries
ROUGE-n = (matching n-grams from structured and manually generated summaries) / (total n-grams)
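A minimal sketch of an n-gram overlap measure in the spirit of ROUGE-n, returning precision, recall and F1. Whitespace tokenisation, lowercasing, the function names and the two example strings are assumptions. The evaluation in the talk may have used a standard ROUGE toolkit with different preprocessing.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 1):
    """Return (precision, recall, F1) of n-gram overlap with clipped counts."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # matching n-grams
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up candidate (structured) and reference (manual) summaries.
p, r, f1 = rouge_n("stem cell research advances", "new stem cell research", n=1)
print(p, r, f1)
```

Recall divides the matches by the n-grams of the manually generated summary, and precision divides them by the n-grams of the structured summary.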
15. Results
• 10 queries used for evaluation (prominent events of 2007 from Time magazine [1])
• Human evaluation for summary relevance: 76% correctly generated
• 17 evaluators, with an average of 20 summaries evaluated each
[1] http://www.time.com/time/specials/2007/0,28757,1686204,00.html
Query               #Q. Terms   #Doc.   #Summ.
European Union              7     157      129
Super Bowl                 13     370      325
US Congress                17      13       19
Virginia Tech              28      12       11
Stem Cell                   5     105       86
Protest                     2     129      103
Harry Potter               22      10        7
Global Warming              5     198      170
National Security           0     250      207
Terrorist Attacks           0      57       52
Table: Generated structured summaries for the different queries.
16. Results
• ROUGE-1 evaluation results for the 10 queries
• Best-performing results for ROUGE-1: 25% precision and 32% recall
Figure: P/R/F1 measures based on the ROUGE-1 metric for the 10 queries used for evaluation
17. Results
Sample Generated Summaries
Query: “Stem Cell”
Democrats → applauded → Mr. Spitzer Eliot (Gov) calls → insure → 500,000 children → lack → health
insurance → enrol → 900,000 adults → are → eligible Medicaid → enrolled → issue debt → pay → stem cell
research.
Congress’s Shift in Power → revives → Medicare Debate House Democrats → try to rush → legislation →
requiring → government → negotiate → lower drug prices for Medicare beneficiaries → overturning →
President Bush’s restrictions on embryonic stem cell research.
The nation → welcome → ambitious agenda → being offered → today by the new Congress Democratic
majority → raising → minimum wage → advancing → stem cell research → restoring → oversight of the
executive branch.
New study → suggesting → useful stem cells → be derived → amniotic fluid without → destroying →
embryos.
Swarns, Rachel L → announced → 9 Aug. federal government → pays → studies on stem cell colonies, lines
→ created before → that date, government → does not encourage → destruction of additional embryos.
Stem cell research → has not produced → a single medical treatment → is morally wrong → to create human
life → to destroy → for research.
The measure → allow → scientists → receiving → federal funds → use → embryonic stem cells from surplus
embryos → generated → fertility clinics, after cell lines → had been derived → by others → using → nonfederal
funds.
18. Conclusions
• Query-based generated summaries
• Contextualised Structured Summaries
– Typing and expansion of queries using reference datasets
– Automated pattern generation
• Incorporated user interests and syntactical relevance of chunks of text
• Multiple summary perspectives
• Good overall accuracy of generated summaries (76% judged correct in the human evaluation)
• Infer new knowledge by interlinking summaries of different/same contexts