2. What is Information Extraction?
Goal:
Extract structured information from unstructured (or loosely formatted) text.
Typical description of task:
Identify named entities
Identify relations between entities
Populate a database
May also include:
Event extraction
Resolution of temporal expressions
Wrapper induction (automatic construction of templates)
Applications:
Natural language understanding
Question answering, summarization, etc.
3. Information Extraction
IE extracts pieces of information that are salient to the user's needs
Find named entities such as persons and organizations
Find attributes of those entities or events they participate in
Contrast with IR, which indicates which documents need to be read by
a user
Links between the extracted information and the original documents
are maintained to allow the user to reference context.
6. Relevant IE Definitions
Entities:
Entities are the basic building blocks that can be found in text
documents (an object of interest)
Examples: people, companies, locations, genes, and drugs.
Attributes:
Attributes are features of the extracted entities. (A property of an entity
such as its name, alias, descriptor or type)
Examples: the title of a person, the age of a person, and the type of an
organization.
7. Relevant IE Definitions
Facts:
Facts are the relations that exist between entities. (a relationship held
between two or more entities such as the position of a person in a
company)
Example: Employment relationship between a person and a company
or phosphorylation between two proteins.
Events:
An event is an activity or occurrence of interest in which entities
participate
An activity involving several entities, such as a terrorist act, airline crash,
management change, new product introduction, a merger between two
companies, a birthday, and so on.
9. IE - Method
Extract raw text (HTML, PDF, PS, GIF, etc.)
Tokenize
Detect term boundaries
We extracted alpha 1 type XIII collagen from …
Their house council recommended…
Detect sentence boundaries
Tag parts of speech (POS)
John/noun saw/verb Mary/noun.
Tag named entities
Person, place, organization, gene, chemical.
Parse
Determine co-reference
Extract knowledge
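These steps can be prototyped with an off-the-shelf NLP toolkit. A minimal sketch using spaCy (one possible toolkit choice, not prescribed by the slides; assumes the en_core_web_sm model is installed):

import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("John saw Mary at the World Health Organization in Geneva.")

for sent in doc.sents:                  # sentence boundary detection
    print(sent.text)
for token in doc:                       # tokenization + POS tagging
    print(token.text, token.pos_)
for ent in doc.ents:                    # named-entity tagging
    print(ent.text, ent.label_)         # e.g., PERSON, ORG, GPE
for token in doc:                       # dependency parse
    print(token.text, "<-" + token.dep_ + "-", token.head.text)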
10. Architecture: Components of IE Systems
Core linguistic components, adapted to, or useful for, NLP tasks in general
IE-specific components, which address the core IE tasks
Domain-independent components
Domain-specific components
The following steps are performed in the domain-independent part:
Meta-data analysis:
Extraction of the title, body, structure of the body (identification of
paragraphs), and the date of the document.
Tokenization:
Segmentation of the text into word-like units, called tokens, and
classification of their type, e.g., identification of capitalized words,
words written in lowercase letters, hyphenated words, punctuation
signs, numbers, etc.
11. Architecture: Components of IE Systems
Morphological analysis:
Extraction of morphological information from tokens which constitute
potential word forms: the base form (or lemma), part of speech, and other
morphological tags depending on the part of speech,
e.g., verbs have features such as tense, mood, aspect, person, etc.
Words which are ambiguous with respect to certain morphological categories
may undergo disambiguation; typically, part-of-speech disambiguation is
performed.
Sentence/Utterance boundary detection:
Segmentation of text into a sequence of sentences or utterances, each of which
is represented as a sequence of lexical items together with their features.
Common Named-entity extraction:
Detection of domain-independent named entities, such as temporal
expressions, numbers and currency, geographical references, etc.
12. Architecture: Components of IE Systems
Phrase recognition:
Recognition of small-scale, local structures such as noun phrases, verb groups,
prepositional phrases, acronyms, and abbreviations.
Syntactic analysis:
Computation of a dependency structure (parse tree) of the sentence based on the
sequence of lexical items and small-scale structures.
Syntactic analysis may be deep or shallow.
In the former case, all possible interpretations (parse trees) and
grammatical relations within the sentence are computed.
In the latter case, the analysis is restricted to identifying non-recursive
structures, or structures with a limited amount of structural recursion,
which can be identified with a high degree of certainty; linguistic
phenomena which cause problems (ambiguities) are not handled and are
represented with underspecified structures.
13. Architecture: Components of IE Systems
The core IE tasks:
NER,
Co-reference resolution, and
Detection of relations and events
These tasks are typically domain-specific, and are supported by domain-specific
system components and resources.
Domain-specific processing is also supported on a lower level by detection of
specialized terms in text.
Architecture: IE System
In the domain specific core of the processing chain, a NER component is
applied to identify the entities relevant in a given domain.
Patterns may then be applied to:
Identify text fragments, which describe the target relations and events, and
Extract the key attributes to fill the slots in the template representing the
relation/event.
16. Architecture: Components of IE Systems
A co-reference component identifies mentions that refer to the same entity.
Partially-filled templates are fused and validated using domain-specific
inference rules in order to create full-fledged relation/event descriptions.
Several software packages provide various tools that can be used in the
process of developing an IE system, ranging from core linguistic processing
modules (e.g., language detectors, sentence splitters) to general IE-oriented
NLP frameworks.
18. Named Entity Recognition
Named Entity Recognition (NER) addresses the problem of the identification
(detection) and classification of predefined types of named entities,
such as organizations (e.g., 'World Health Organisation'), persons (e.g., 'Mohamad
Gouse'), place names (e.g., 'the Baltic Sea'), temporal expressions (e.g., '15 January
1984'), numerical and currency expressions (e.g., '20 Million Euros'), etc.
The NER task may include extracting descriptive information from the text about the
detected entities through filling of a small-scale template.
For example, in the case of persons, it may include extracting the title, position,
nationality, gender, and other attributes of the person.
NER also involves lemmatization (normalization) of the named entities, which is
particularly crucial in highly inflective languages.
For example, in Polish there are six inflected forms of the name 'Mohamad Gouse'
depending on grammatical case: 'Mohamad Gouse' (nominative), 'Mohamad
Gouseego' (genitive), 'Mohamad Gouseemu' (dative), 'Mohamad Gouseiego'
(accusative), 'Mohamad Gousem' (instrumental and locative), and
'Mohamad Gouse' (vocative).
19. Co-Reference
Co-reference Resolution (CO) requires the identification of multiple (coreferring) mentions of
the same entity in the text.
Entity mentions can be:
(a) Named, in case an entity is referred to by name
e.g., ‘General Electric’ and ‘GE’ may refer to the same real-world entity.
(b) Pronominal, in case an entity is referred to with a pronoun
e.g., in ‘John bought food. But he forgot to buy drinks.’, the pronoun he refers to John.
(c) Nominal, in case an entity is referred to with a nominal phrase
e.g., in ‘Microsoft revealed its earnings. The company also unveiled future plans.’ the
definite noun phrase The company refers to Microsoft.
(d) Implicit, as in the case of zero anaphora
e.g., in the Italian text fragment 'Berlusconi ha visitato il luogo del disastro. Ha
sorvolato, con l'elicottero.'
(Berlusconi has visited the place of the disaster. [He] flew over with a helicopter.), the
second sentence does not have an explicit realization of the reference to
Berlusconi.
20. Relation Extraction
Relation Extraction (RE) is the task of detecting and classifying predefined
relationships between entities identified in text.
For example:
EmployeeOf(Steve Jobs,Apple): a relation between a person and an
organisation, extracted from ‘Steve Jobs works for Apple’
LocatedIn(Smith,New York): a relation between a person and location,
extracted from ‘Mr. Smith gave a talk at the conference in New York’,
SubsidiaryOf(TVN,ITI Holding): a relation between two companies,
extracted from 'Listed broadcaster TVN said its parent company, ITI
Holdings, is considering various options for the potential sale.'
While the set of relations that may be of interest is unlimited, the set of
relations within a given task is predefined and fixed, as part of the
specification of the task.
21. Event Extraction
Event Extraction (EE) refers to the task of identifying events in free text and
deriving detailed and structured information about them, ideally identifying who
did what to whom, when, where, through what methods (instruments), and why.
Usually, event extraction involves extraction of several entities and relationships
between them.
For instance, extraction of information on terrorist attacks from the text
fragment 'Masked gunmen armed with assault rifles and grenades attacked a
wedding party in mainly Kurdish southeast Turkey, killing at least 44 people.'
involves identification of perpetrators (masked gunmen), victims (people),
number of killed/injured (at least 44), weapons and means used (rifles and
grenades), and location (southeast Turkey).
Another example is the extraction of information on new joint ventures, where
the aim is to identify the partners, products, profits and capitalization of the joint
venture.
EE is considered to be the hardest of the four IE tasks.
22. IE Subtask: Named Entity Recognition
Detect and classify all proper names mentioned in text
What is a proper name? Depends on application.
People, places, organizations, times, amounts, etc.
Names of genes and proteins
Names of college courses
23. NER Example
Find extent of each mention
Classify each mention
Sources of ambiguity
Different strings that map to the same entity
Equivalent strings that map to different entities (e.g., U.S. Grant)
24. Approaches to NER
Early systems: hand-written rules
Statistical systems
Supervised learning (HMMs, Decision Trees, MaxEnt, SVMs, CRFs)
Semi-supervised learning (bootstrapping)
Unsupervised learning (rely on lexical resources, lexical patterns, and
corpus statistics)
25. A Sequence-Labeling Approach using CRFs
Input: Sequence of observations (tokens/words/text)
Output: Sequence of states (labels/classes)
B: Begin
I: Inside
O: Outside
Some evidence that including L (Last) and U (Unit length) is
advantageous (Ratinov and Roth 09)
A CRF defines a conditional probability p(Y|X) over label sequences Y
given an observation sequence X
No effort is wasted modeling the observations (in contrast to joint
models like HMMs)
Arbitrary features of the observations may be captured by the model
26. Linear Chain CRFs
Simplest and most common graph structure, used for
sequence modeling
Inference can be done efficiently using dynamic
programming: O(|X|·|Y|²)
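As a concrete illustration, a linear-chain CRF tagger can be trained with sklearn-crfsuite (an assumed toolkit choice; the sentence, labels, and features are toy placeholders):

import sklearn_crfsuite

def token_features(sent, i):
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),  # capitalization pattern
        "word.isdigit": word.isdigit(),  # digit pattern
    }

train_sents = [["John", "lives", "in", "New", "York"]]
train_labels = [["B-PER", "O", "O", "B-LOC", "I-LOC"]]  # BIO labels

X = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, train_labels)          # learns p(Y|X) directly
print(crf.predict(X)[0])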
28. NER Features
Several feature families used, all time-shifted by -2, -1, 0, 1, 2:
The word itself
Capitalization and digit patterns (shape patterns)
8 lexicons entered by hand (e.g., honorifics, days, months)
15 lexicons obtained from web sites (e.g., countries, publicly-traded
companies, surnames, stopwords, universities)
25 lexicons automatically induced from the web (people names,
organizations, NGOs, nationalities)
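The time-shifting can be implemented by computing every feature family at offsets -2..+2; a sketch (the lexicons below are tiny placeholders for the hand-entered and web-derived lists):

HONORIFICS = {"mr.", "mrs.", "dr."}        # placeholder hand-entered lexicon
MONTHS = {"january", "february", "march"}  # placeholder hand-entered lexicon

def shape(word):
    # Crude capitalization/digit "shape" pattern, e.g. "Smith" -> "Xxxxx"
    return "".join("X" if c.isupper() else "x" if c.islower()
                   else "d" if c.isdigit() else c for c in word)

def features(tokens, i):
    feats = {}
    for offset in (-2, -1, 0, 1, 2):       # time-shift each feature family
        j = i + offset
        if 0 <= j < len(tokens):
            w = tokens[j]
            feats["word@%d" % offset] = w.lower()
            feats["shape@%d" % offset] = shape(w)
            feats["honorific@%d" % offset] = w.lower() in HONORIFICS
            feats["month@%d" % offset] = w.lower() in MONTHS
    return feats

print(features(["Dr.", "Smith", "arrived", "in", "March"], 1))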
29. Limitations of Conventional NER (and IE)
Supervised learning
Expensive
Inconsistent
Worse for relations and events!
Fixed, narrow, pre-specified sets of entity types
Small, homogeneous corpora (newswire, seminar announcements)
30. Evaluating Named Entity Recognition
Recall that recall is the ratio of the number of correctly labeled responses to the
total that should have been labeled.
Precision is the ratio of the number of correctly labeled responses to the total
labeled.
The F-measure provides a way to combine these two measures into a single
metric.
recall = N_correct / N_key
precision = N_correct / (N_correct + N_incorrect)
F = ((β² + 1) · precision · recall) / (β² · precision + recall)
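A small helper that computes these metrics (the argument names are illustrative):

def evaluate(n_correct, n_incorrect, n_key, beta=1.0):
    """Precision, recall, and F-measure as defined above."""
    precision = n_correct / (n_correct + n_incorrect)
    recall = n_correct / n_key
    f = ((beta**2 + 1) * precision * recall) / (beta**2 * precision + recall)
    return precision, recall, f

# e.g., 80 correct labels, 20 spurious, 100 entities in the answer key
print(evaluate(80, 20, 100))  # (0.8, 0.8, 0.8)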
31. What is Relation Extraction?
Typically defined as identifying relations between two entities
Relations | Subtypes | Examples
Affiliations | Personal; Organizational; Artifactual | married to, mother of; spokesman for, president of; owns, invented, produces
Geospatial | Proximity; Directional | near, on outskirts; southeast of
Part-of | Organizational; Political | a unit of, parent of; annexed, acquired
32. Typical (Supervised) Approach
FindEntities( ): Named entity recognizer
Related?( ): Binary classifier that says whether two entities are involved in
a relation
ClassifyRelation( ): Classifier that labels relations discovered by
Related?( )
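A sketch of how the three components compose (find_entities, related, and classify_relation stand in for trained models; this is illustrative pseudocode, not a specific system's API):

from itertools import combinations

def extract_relations(text, find_entities, related, classify_relation):
    entities = find_entities(text)                   # named entity recognition
    relations = []
    for e1, e2 in combinations(entities, 2):         # candidate entity pairs
        if related(text, e1, e2):                    # binary "Related?" filter
            label = classify_relation(text, e1, e2)  # relation type
            relations.append((label, e1, e2))
    return relations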
34. NELL: Never-Ending Language Learner
NELL: Can computers learn to read?
Goal: create a system that learns to read the web
Reading task: Extract facts from text found on the web
Learning task: Iteratively improve reading competence.
http://rtw.ml.cmu.edu/rtw/
35. Approach
Inputs
Ontology with target categories and relations (i.e., predicates)
Small number of seed examples for each
Set of constraints that couple the predicates
Large corpus of unlabeled documents
Output: new predicate instances
Semi-supervised bootstrap learning methods
Couple the learning of functions to constrain the problem
Exploit redundancy of information on the web.
37. Types of Coupling
1. Mutual Exclusion (output constraint)
Mutually exclusive predicates can't both be satisfied by the same input x
E.g., x cannot be a Person and a Sport
2. Relation Argument Type-Checking (compositional constraint)
Arguments of relations declared to be of certain categories
E.g., CompanyIsInEconomicSector(Company, EconomicSector)
3. Unstructured and Semi-Structured Text Features
(multi-view-agreement constraint)
Look at different views (like co-training)
Require classifiers agree
E.g., freeform textual contexts and semi-structured contexts
39. Coupled Pattern Learner (CPL)
Free-text extractor that learns contextual patterns to extract predicate
instances
Uses mutual exclusion and type-checking constraints to filter candidate
instances
Rank instances and patterns by leveraging redundancy: if an instance or
pattern occurs more frequently, it's ranked higher
40. Coupled SEAL (CSEAL)
SEAL (Set Expander for Any Language) is a wrapper induction algorithm
Operates over semi-structured text such as web pages
Constructs page-specific extraction rules (wrappers) that are human- and
markup-language independent
CSEAL adds mutual-exclusion and type-checking constraints
41. CSEAL Wrappers
Seeds: Ford, Nissan, Toyota
arg1 is a placeholder for extracting instances
42. Open IE and TextRunner
Motivations:
Web corpora are massive, introducing scalability concerns
Relations of interest are unanticipated, diverse, and abundant
Use of “heavy” linguistic technology (NERs and parsers) doesn't work
well
Input: a large, heterogeneous Web corpus
9M web pages, 133M sentences
No pre-specified set of relations
Output: huge set of extracted relations
60.5M tuples, 11.3M high-probability tuples
Tuples are indexed for searching
43. TextRunner Architecture
Learner outputs a classifier that labels trustworthy extractions
Extractor finds and outputs trustworthy extractions
Assessor normalizes and scores the extractions
44. Architecture: Self-Supervised Learner
1. Automatically labels training data
Uses a parser to induce dependency structures
Parses a small corpus of several thousand sentences
Identifies and labels a set of positive and negative extractions using
relation-independent heuristics
An extraction is a tuple t = (ei, ri,j, ej), where ei and ej are entities and ri,j is the relation between them
Entities are base noun phrases
Uses parse to identify potential relations
2. Trains a classifier
Domain-independent, simple non-parse features
E.g., POS tags, phrase chunks, regexes, stopwords, etc.
46. Architecture: Redundancy-Based Assessor
Take the tuples and perform
Normalization, deduplication, synonym resolution
Assessment
Number of distinct sentences from which each extraction was found serves
as a measure of confidence
Entities and relations indexed using Lucene
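A sketch of the redundancy-based confidence score (illustrative, not TextRunner's actual code):

from collections import defaultdict

def assess(extractions):
    # extractions: iterable of (normalized_tuple, sentence_id) pairs
    support = defaultdict(set)
    for tup, sent_id in extractions:
        support[tup].add(sent_id)      # deduplicates per sentence
    # more distinct supporting sentences -> higher confidence
    return {tup: len(sents) for tup, sents in support.items()}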
47. Template Filling
The task of template filling is to find documents that
evoke such situations and then fill the slots in templates with appropriate
material.
These slot fillers may consist of
Text segments extracted directly from the text, or
Concepts that have been inferred from text elements via some
additional processing (times, amounts, entities from an ontology, etc.).
48. Applications of IE
Infrastructure for IR and for Categorization
Information Routing
Event Based Summarization
Automatic Creation of Databases
Company acquisitions
Sports scores
Terrorist activities
Job listings
Corporate titles and addresses
49. Inductive Algorithms for IE
Rule Induction algorithms produce symbolic IE rules based on a corpus of
annotated documents.
WHISK
BWI
The (LP)2 Algorithm
The inductive algorithms are suitable for semi-structured domains, where
the rules are fairly simple, whereas when dealing with free text documents
(such as news articles) the probabilistic algorithms perform much better.
50. WHISK
WHISK is a supervised learning algorithm that uses hand-tagged examples
for learning information extraction rules.
Works for structured, semi-structured, and free text.
Extracts both single-slot and multi-slot information.
Doesn't require syntactic preprocessing for structured and semi-structured
text; a syntactic analyzer and semantic tagger are recommended for free text.
The extraction patterns learned by WHISK are limited regular
expressions, balancing expressiveness against efficiency.
Example: the IE task of extracting the neighborhood, number of bedrooms,
and price from the text of rental ads
51. WHISK
An Example from the Rental Ads domain
An example extraction pattern which can be learned by WHISK is,
*(Neighborhood) *(Bedroom) * ‘$’(Number)
Neighborhood, Bedroom, and Number – Semantic classes specified by
domain experts.
WHISK learns the extraction rules using a top-down covering algorithm.
The algorithm begins learning a single rule by starting with an empty rule;
then one term is added at a time until either no negative examples are covered
by the rule or the pre-pruning criterion has been satisfied.
52. Terms are added to specialize a rule in order to reduce its Laplacian error.
The Laplacian expected error is defined as:
Laplacian = (e + 1) / (n + 1)
where e is the number of negative extractions and
n is the number of positive extractions on the training instances.
Example:
From the text “3 BR, upper flr of turn of ctry. Incl gar, grt N. Hill
loc 995$. (206)-999-9999,” the rule would extract the frame Bedrooms – 3,
Price – 995.
The “*” character in the pattern matches any number of characters (an
unlimited jump).
Patterns enclosed in parentheses become numbered elements in the output
pattern; hence (Digit) is $1 and (Number) is $2.
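A rough regex analogue of such a rule, using Python's re module rather than WHISK's own pattern engine (the pattern is an approximation of "* (Digit) 'BR' * '$' (Number)"):

import re

rule = re.compile(r"(\d+)\s*BR.*?(\d+)\s*\$", re.DOTALL)
ad = ("3 BR, upper flr of turn of ctry. Incl gar, grt N. Hill loc "
      "995$. (206)-999-9999")
m = rule.search(ad)
if m:
    # (Digit) is $1 and (Number) is $2, as in the WHISK output pattern
    print({"Bedrooms": m.group(1), "Price": m.group(2)})
    # -> {'Bedrooms': '3', 'Price': '995'}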
53. Boosted Wrapper Induction (BWI)
BWI is a system that utilizes wrapper induction techniques for
traditional Information Extraction.
IE is treated as a classification problem that entails trying to approximate
two boundary functions Xbegin(i) and Xend(i).
Xbegin(i) is equal to 1 if the i-th token starts a field that is part of the frame to
be extracted, and 0 otherwise.
Xend(i) is defined in a similar way for tokens that end a field.
The learning algorithm approximates each X function by taking a set of
pairs of the form (i, X(i)) as training data.
54. Each field is extracted by a wrapper W = <F, A, H>, where
F is a set of begin boundary detectors,
A is a set of end boundary detectors, and
H(k) is the probability that the field has length k.
A boundary detector is just a sequence of tokens with wildcards (a kind of
regular expression).
W(i, j) is a naïve Bayesian approximation of the probability that a field
begins at token i and ends at token j:
W(i, j) = F(i) · A(j) · H(j − i + 1) if begin and end detectors fire at i and j,
and 0 otherwise,
where F(i) = Σk CFk · Fk(i) and A(i) = Σk CAk · Ak(i) are weighted sums of
the individual detectors.
The BWI algorithm learns the two detectors by using a greedy algorithm that
extends the prefix and suffix patterns while there is an improvement in
accuracy.
The sets F(i) and A(i) are generated from the detectors by using the AdaBoost
algorithm.
A detector pattern can include specific words and regular expressions that
work on a set of wildcards such as <num>, <Cap>, <LowerCase>,
<Punctuation>, and <Alpha>.
55. The (LP)2 Algorithm
The (LP)2 algorithm learns from an annotated corpus and induces two sets
of rules:
Tagging rules, generated by a bottom-up generalization process, and
Correction rules, which correct mistakes and omissions made by the
tagging rules.
A tagging rule is a pattern that contains conditions on the words preceding the
place where a tag is to be inserted and conditions on the words that follow
the tag.
Conditions can be words, lemmas, lexical categories (such as digit,
noun, verb), case (lower or upper), or semantic categories (such as
time-id, cities).
The (LP)2 algorithm is a covering algorithm that tries to cover all training
examples.
The initial tagging rules are generalized by dropping conditions.
56. IE and Text Summarization
From the user's perspective,
IE can be glossed as "I know what specific pieces of information I want – just
find them for me!", while
Summarization can be glossed as "What's in the text that is interesting?".
Technically, from the system builder's perspective, the two applications blend
into each other.
The most pertinent technical aspects are:
Are the criteria of interestingness specified at run-time or by the system builder?
Is the input a single document or multiple documents?
Is the extracted information manipulated, either by simple content delineation
routines or by complex inferences, or just delivered verbatim?
What is the grain size of the extracted units of information–individual entities
and events, or blocks of text?
Is the output formulated in language, or in a computer-internal knowledge
representation?
57. Text Summarization
An information access technology that, given a
document or set of related documents, extracts the
most important content from the source(s), taking into
account the user or task at hand, and presents this
content in a well-formed and concise text
58. Text Summarization Techniques
Topic Representation
Influence of Context
Indicator Representations
Pattern Extraction
59. Text Summarization
Input: one or more text documents
Output: paragraph length summary
Sentence extraction is the standard method
Using features such as key words, sentence position in document,
cue phrases
Identify sentences within documents that are salient
Extract and string sentences together
Machine learning for extraction
Corpus of document/summary pairs
Learn the features that best determine important sentences
Summarization of scientific articles
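A minimal extractive-summarization sketch combining two of the features above, keyword frequency and sentence position (the scoring is illustrative, not a published system):

import re
from collections import Counter

def summarize(text, n=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z]+", text.lower()))  # keyword counts
    def score(pair):
        idx, sent = pair
        words = re.findall(r"[a-z]+", sent.lower())
        density = sum(freq[w] for w in words) / (len(words) or 1)
        return density + (1.0 if idx == 0 else 0.0)      # position bonus
    top = sorted(enumerate(sentences), key=score, reverse=True)[:n]
    return " ".join(s for _, s in sorted(top))           # document order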
60. A Summarization Machine
[Diagram: a summarization machine takes a DOC, a QUERY, or MULTIDOCS as
input and produces extracts and abstracts. Output can be indicative or
informative; generic or query-oriented; background or "just the news"; and
compressed to various lengths (headline, very brief at 10%, brief at 50%,
long at 100%). Intermediate representations include case frames, templates,
core concepts, core events, relationships, clause fragments, and index terms.]
61. The Modules of the Summarization Machine
[Diagram: EXTRACTION turns the input DOC into extracts; INTERPRETATION maps
extracts into representations such as case frames, templates, core concepts,
core events, relationships, clause fragments, and index terms; GENERATION
produces abstracts from these representations; FILTERING yields multidoc
extracts.]
62. What is Summarization?
Data as input (database, software trace, expert system), text summary as output
Text as input (one or more articles), paragraph summary as output
Multimedia in input or output
Summaries must convey maximal information in minimal space
Typically involves three stages:
Content identification
Find/extract the most important material
Conceptual organization
Realization
63. Types of summaries
Purpose
Indicative, informative, and critical summaries
Form
Extracts (representative paragraphs/sentences/phrases)
Abstracts: “a concise summary of the central subject matter of a
document”.
Dimensions
Single-document vs. multi-document
Context
Query-specific vs. query-independent
Generic vs. query-oriented
provides author’s view vs. reflects user’s interest.
65. Aspects that Describe Summaries
Input
subject type: domain
genre: newspaper articles, editorials, letters, reports...
form: regular text structure, free-form
source size: single doc, multiple docs (few,many)
Purpose
situation: embedded in larger system (MT, IR) or not?
audience: focused or general
usage: IR, sorting, skimming...
Output
completeness: include all aspects, or focus on some?
format: paragraph, table, etc.
style: informative, indicative, aggregative, critical...
66. Single Document Summarization: System Architecture
[Diagram: components include extraction, sentence reduction, sentence
combination, and generation; the input is a single document, from which
sentences are extracted to produce the output summary; supporting resources
include a corpus (with decomposition), a lexicon, a parser, and co-reference
resolution.]
67. Multi-Document Summarization
Monitor variety of online information sources
News, multilingual
Email
Gather information on events across source and time
Same day, multiple sources
Across time
Summarize
Highlighting similarities, new information, different perspectives,
user specified interests in real-time
69. From Extract to Abstract:
topic interpretation or concept fusion.
Experiment (Marcu, 98):
Took 10 newspaper texts, with
human abstracts.
Asked 14 judges to extract the
corresponding clauses from the texts, to
cover the same content.
Compared word lengths of extracts
to abstracts: extract_length ≈ 2.76 ×
abstract_length!!
70. Some Types of Interpretation
Concept generalization:
Sue ate apples, pears, and bananas → Sue ate fruit
Meronymy replacement:
Both wheels, the pedals, saddle, chain… → the bike
Script identification:
He sat down, read the menu, ordered, ate, paid, and left → He ate at the
restaurant
Metonymy:
A spokesperson for the US Government announced that… → Washington
announced that...
71. General Aspects of Interpretation
Interpretation occurs at the conceptual level...
…words alone are polysemous (bat → animal or sports
instrument) and combine for meaning (alleged murderer ≠
murderer).
For interpretation, you need world knowledge...
…the fusion inferences are not in the text!
72. Pattern Extraction
Extract a pattern for each event in training data
part of speech & mention tags
Example: Japanese political leaders → GPE JJ PER
Text:    Japanese  political  leaders
Ents:    GPE       —          PER
POS:     NN        JJ         NN
Pattern: GPE       JJ         PER
73. Summarization - Scope
Data preparation:
Collect large sets of texts with abstracts, all genres.
Build large corpora of <Text, Abstract, Extract> tuples.
Investigate relationships between extracts and abstracts (using <Extract,
Abstract> tuples).
Types of summary:
Determine characteristics of each type.
Topic Identification:
Develop new identification methods (discourse, etc.).
Develop heuristics for method combination (train heuristics on <Text,
Extract> tuples).
74. Summarization - Scope
Concept Interpretation (Fusion):
Investigate types of fusion (semantic, evaluative…).
Create large collections of fusion knowledge/rules (e.g., signature
libraries, generalization and partonymic hierarchies, metonymy
rules…).
Study incorporation of User’s knowledge in interpretation.
Generation:
Develop Sentence Planner rules for dense packing of content into
sentences (using <Extract, Abstract> pairs).
Evaluation:
Develop better evaluation metrics, for types of summaries.
75. Apriori Algorithm
In computer science and data mining, Apriori is a classic algorithm for
learning association rules.
Apriori is designed to operate on databases containing transactions.
Examples: collections of items bought by customers, or details of website
visits.
The algorithm attempts to find subsets which are common to at least a
minimum number C (the cutoff, or support threshold) of the itemsets.
Apriori uses a "bottom up" approach, where frequent subsets are extended
one item at a time (a step known as candidate generation), and groups of
candidates are tested against the data.
The algorithm terminates when no further successful extensions are found.
Apriori uses breadth-first search and a hash tree structure to count candidate
item sets efficiently.
76. Find Rules in Two Stages
Agrawal et al. divided the problem of finding good rules into two phases:
1. Find all itemsets with a specified minimal support (coverage). An itemset
is just a specific set of items, e.g. {apples, cheese}. The Apriori algorithm
can efficiently find all itemsets whose coverage is above a given
minimum.
2. Use these itemsets to help generate interesting rules. Having done stage
1, we have considerably narrowed down the possibilities, and can do
reasonably fast processing of the large itemsets to generate candidate
rules.
77. Terminology
k-itemset : a set of k items. E.g.
{beer, cheese, eggs} is a 3-itemset
{cheese} is a 1-itemset
{honey, ice-cream} is a 2-itemset
support: an itemset has support s% if s% of the records in the DB contain that
itemset.
minimum support: the Apriori algorithm starts with the specification of a
minimum level of support, and will focus on itemsets with this level or
above.
78. Terminology
large itemset: doesn’t mean an itemset with many items. It means one
whose support is at least minimum support.
Lk : the set of all large k-itemsets in the DB.
Ck : a set of candidate large k-itemsets. In the algorithm we will look at, it
generates this set, which contains all the k-itemsets that might be large,
and then eventually generates the set above.
79. Terminology
sets: Let A be a set (A = {cat, dog}) and
let B be a set (B = {dog, eel, rat}) and
let C = {eel, rat}
I use ‘A + B’ to mean A union B.
So A + B = {cat, dog, eel, rat}
When X is a subset of Y, I use Y – X to mean the set of things in Y which
are not in X.
E.g. B – C = {dog}
80. Apriori Algorithm
Find all large 1-itemsets
For (k = 2; while Lk-1 is non-empty; k++)
{ Ck = apriori-gen(Lk-1)
For each c in Ck, initialise c.count to zero
For all records r in the DB
{ Cr = subset(Ck, r); For each c in Cr, c.count++ }
Set Lk := all c in Ck whose count >= minsup
} /* end – return all of the Lk sets */
The algorithm returns all of the (non-empty) Lk sets, which gives us an
excellent start in finding interesting rules (although the large itemsets
themselves will usually be very interesting and useful).
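A compact, illustrative Python rendering of this level-wise search (counts are recomputed with a full scan per pass; no hash tree optimization):

from itertools import combinations

def apriori(db, minsup):
    # db: list of transactions (sets); minsup: minimum support count
    items = {i for t in db for i in t}
    L = [{frozenset([i]) for i in items
          if sum(i in t for t in db) >= minsup}]           # large 1-itemsets
    k = 2
    while L[-1]:
        # candidate generation: join large (k-1)-itemsets
        Ck = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k}
        # prune candidates having an infrequent (k-1)-subset
        Ck = {c for c in Ck
              if all(frozenset(s) in L[-1] for s in combinations(c, k - 1))}
        counts = {c: sum(c <= t for t in db) for c in Ck}  # scan the DB
        L.append({c for c, cnt in counts.items() if cnt >= minsup})
        k += 1
    return [s for level in L for s in level]               # all large itemsets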
81. Example: Generation of candidate itemsets and frequent
itemsets, where the minimum support count is 2.
82. Apriori Merits/Demerits
Merits
Uses large itemset property
Easily parallelized
Easy to implement
Demerits
Assumes transaction database is memory resident.
Requires many database scans.
83. Summary
Association Rules are a widely applied data mining approach.
Association Rules are derived from frequent itemsets.
The Apriori algorithm is an efficient algorithm for finding all frequent
itemsets.
The Apriori algorithm implements level-wise search using the frequent
itemset property.
The Apriori algorithm can be additionally optimized.
There are many measures for association rules.
85. Frequent Pattern Mining: An Example
Given a transaction database DB and a minimum support threshold ξ,
Find all frequent patterns (item sets) with support no less than ξ.
TID Items bought
100 {f, a, c, d, g, i, m, p}
200 {a, b, c, f, l, m, o}
300 {b, f, h, j, o}
400 {b, c, k, s, p}
500 {a, f, c, e, l, p, m, n}
DB:
Minimum support: ξ = 3
Input: the transaction database DB above and the minimum support ξ
Output: all frequent patterns, i.e., f, a, …, fa, fac, fam, fm, am, …
Problem Statement: How can we efficiently find all frequent patterns?
86. Overview of FP-Growth: Ideas
Compress a large database into a compact Frequent-Pattern tree (FP-tree)
structure
Highly compacted, but complete for frequent pattern mining
Avoids costly repeated database scans
Develop an efficient, FP-tree-based frequent pattern mining method
(FP-growth)
A divide-and-conquer methodology: decompose mining tasks into
smaller ones
Avoid candidate generation: sub-database test only.
88. Construct FP-tree
Two Steps:
1. Scan the transaction DB for the first time, find frequent items (single
item patterns) and order them into a list L in frequency descending order.
e.g., L={f:4, c:4, a:3, b:3, m:3, p:3}
In the format of (item-name, support)
2. For each transaction, order its frequent items according to the order in L;
Scan DB the second time, construct FP-tree by putting each frequency
ordered transaction onto it.
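A minimal sketch of these two construction steps (illustrative data structures, not a particular library's API):

from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fptree(db, minsup):
    # Step 1: first scan -- frequent items in descending frequency order
    freq = Counter(i for t in db for i in t)
    L = [i for i, n in freq.most_common() if n >= minsup]
    # Step 2: second scan -- insert each ordered transaction as a path
    root, header = Node(None, None), defaultdict(list)  # header: node-links
    for t in db:
        node = root
        for item in [i for i in L if i in t]:           # order items by L
            if item in node.children:
                node.children[item].count += 1          # shared prefix
            else:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)              # extend node-link
            node = node.children[item]
    return root, header, L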
89. FP-tree Example: Step 1
Item frequency
f 4
c 4
a 3
b 3
m 3
p 3
TID Items bought
100 {f, a, c, d, g, i, m, p}
200 {a, b, c, f, l, m, o}
300 {b, f, h, j, o}
400 {b, c, k, s, p}
500 {a, f, c, e, l, p, m, n}
L
Step 1: Scan the DB for the first time to generate L (a by-product of the
first scan of the database)
90. FP-tree Example: Step 2
TID Items bought (ordered) frequent items
100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}
200 {a, b, c, f, l, m, o} {f, c, a, b, m}
300 {b, f, h, j, o} {f, b}
400 {b, c, k, s, p} {c, b, p}
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}
Step 2: scan the DB for the second time, order frequent items
in each transaction
91. FP-tree Example: Step 2
Step 2: construct the FP-tree by inserting each ordered transaction as a path.
After inserting {f, c, a, m, p}:
{}
  f:1
    c:1
      a:1
        m:1
          p:1
After inserting {f, c, a, b, m}, the counts along the shared prefix f, c, a are
incremented, and a new branch b:1 → m:1 splits off below a:2:
{}
  f:2
    c:2
      a:2
        m:1
          p:1
        b:1
          m:1
NOTE: Each transaction corresponds to one path in the FP-tree
94. FP-Tree Definition
FP-tree is a frequent pattern tree.
Formally, the FP-tree is a tree structure defined below:
1. One root labeled as "null", a set of item prefix sub-trees as the children of
the root, and a frequent-item header table.
2. Each node in the item prefix sub-trees has three fields:
item-name : register which item this node represents,
count, the number of transactions represented by the portion of the path
reaching this node,
node-link that links to the next node in the FP-tree carrying the same
item-name, or null if there is none.
3. Each entry in the frequent-item header table has two fields,
item-name, and
head of node-link that points to the first node in the FP-tree carrying the
item-name.
95. Advantages of the FP-tree Structure
The most significant advantage of the FP-tree
Scan the DB only twice and twice only.
Completeness:
The FP-tree contains all the information related to mining frequent patterns
(given the min-support threshold).
Compactness:
The size of the tree is bounded by the occurrences of frequent items
The height of the tree is bounded by the maximum number of items in a
transaction
97. Mining Frequent Patterns Using FP-tree
General idea (divide-and-conquer)
Recursively grow frequent patterns using the FP-tree: looking for shorter
ones recursively and then concatenating the suffix:
For each frequent item, construct its conditional pattern base, and then its
conditional FP-tree;
Repeat the process on each newly created conditional FP-tree until the
resulting FP-tree is empty, or it contains only one path (single path will
generate all the combinations of its sub-paths, each of which is a frequent
pattern)
98. 3 Major Steps
Starting the processing from the end of list L:
Step 1:
Construct conditional pattern base for each item in the header table
Step 2:
Construct conditional FP-tree from each conditional pattern base
Step 3:
Recursively mine conditional FP-trees and grow frequent patterns
obtained so far. If the conditional FP-tree contains a single path, simply
enumerate all the patterns
99. Step 1: Construct Conditional Pattern Base
Starting at the bottom of frequent-item header table in the FP-tree
Traverse the FP-tree by following the link of each frequent item
Accumulate all of the transformed prefix paths of that item to form a
conditional pattern base
Conditional pattern bases
item cond. pattern base
p fcam:2, cb:1
m fca:2, fcab:1
b fca:1, f:1, c:1
a fc:3
c f:3
f { }
The complete FP-tree, with a header table linking each item (f, c, a, b, m, p)
to its node-links:
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
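Building on the FP-tree sketch earlier, an item's conditional pattern base can be read off its node-links (illustrative):

def conditional_pattern_base(header, item):
    base = []
    for node in header[item]:              # follow the item's node-links
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)            # accumulate the prefix path
            p = p.parent
        if path:
            base.append((path[::-1], node.count))
    return base
# e.g., conditional_pattern_base(header, "m")
#   -> [(['f', 'c', 'a'], 2), (['f', 'c', 'a', 'b'], 1)]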
100. Properties of FP-Tree
Node-link property
For any frequent item ai, all the possible frequent patterns that contain ai
can be obtained by following ai's node-links, starting from ai's head in
the FP-tree header.
Prefix path property
To calculate the frequent patterns for a node ai in a path P, only the
prefix sub-path of ai in P need to be accumulated, and its frequency
count should carry the same count as node ai.
101. Step 2: Construct Conditional FP-tree
For each pattern base
Accumulate the count for each item in the base
Construct the conditional FP-tree for the frequent items of the pattern
base
m-conditional pattern base:
fca:2, fcab:1
Accumulated item counts in the base: f:3, c:3, a:3, b:1; b falls below the
minimum support and is dropped.
m-conditional FP-tree:
{} → f:3 → c:3 → a:3
102. Step 3: Recursively Mine the Conditional FP-trees
Starting from the m-conditional FP-tree, "m": (fca:3) → frequent pattern m
add "f" → conditional FP-tree of "fm": 3 → frequent pattern fm
add "c" → conditional FP-tree of "cm": (f:3) → frequent pattern cm
add "f" → conditional FP-tree of "fcm": 3 → frequent pattern fcm
add "a" → conditional FP-tree of "am": (fc:3) → frequent pattern am
add "f" → conditional FP-tree of "fam": 3 → frequent pattern fam
add "c" → conditional FP-tree of "cam": (f:3) → frequent pattern cam
add "f" → conditional FP-tree of "fcam" → frequent pattern fcam
103. Principles of FP-Growth
Pattern growth property
Let α be a frequent itemset in DB, B be α's conditional pattern base,
and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is
frequent in B.
Is "fcabm" a frequent pattern?
"fcab" is a branch of m's conditional pattern base
"b" is NOT frequent in transactions containing "fcab"
So "bm" is NOT a frequent itemset, and neither is "fcabm".
104. Conditional Pattern Bases and Conditional FP-Trees
Item | Conditional pattern base | Conditional FP-tree   (in the order of L)
f | Empty | Empty
c | {(f:3)} | {(f:3)}|c
a | {(fc:3)} | {(f:3, c:3)}|a
b | {(fca:1), (f:1), (c:1)} | Empty
m | {(fca:2), (fcab:1)} | {(f:3, c:3, a:3)}|m
p | {(fcam:2), (cb:1)} | {(c:3)}|p
105. Single FP-tree Path Generation
Suppose an FP-tree T has a single path P.
The complete set of frequent patterns of T can be generated by enumerating
all the combinations of the sub-paths of P.
m-conditional FP-tree: {} → f:3 → c:3 → a:3
All frequent patterns concerning m are the combinations of {f, c, a} with m:
m,
fm, cm, am,
fcm, fam, cam,
fcam
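The single-path case amounts to enumerating all sub-path combinations; a sketch for the m-conditional tree above:

from itertools import combinations

path, suffix = ["f", "c", "a"], "m"
patterns = [set(combo) | {suffix}
            for r in range(len(path) + 1)
            for combo in combinations(path, r)]
print(patterns)   # {'m'}, {'f','m'}, {'c','m'}, {'a','m'}, ..., {'f','c','a','m'}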
106. Summary of FP-Growth Algorithm
Mining frequent patterns can be viewed as first mining 1-itemsets and then
progressively growing each 1-itemset by mining its conditional pattern base
recursively
Transform a frequent k-itemset mining problem into a sequence of k frequent
1-itemset mining problems via a set of conditional pattern bases
107. Efficiency Analysis
Facts: usually
1. FP-tree is much smaller than the size of the DB
2. Pattern base is smaller than original FP-tree
3. Conditional FP-tree is smaller than pattern base
Mining process works on a set of usually much smaller pattern
bases and conditional FP-trees
Divide-and-conquer and dramatic scale of shrinking
108. Performance Improvement
Projected DBs: partition the DB into a set of projected DBs, then construct
an FP-tree and mine it in each projected DB.
Disk-resident FP-tree: store the FP-tree on hard disks using a B+-tree
structure to reduce I/O cost.
FP-tree materialization: a low ξ may usually satisfy most of the mining
queries in the FP-tree construction.
FP-tree incremental update: how to update an FP-tree when there is new
data?
• Reconstruct the FP-tree
• Or do not update the FP-tree