INFORMATION EXTRACTION
What is Information Extraction?
Goal:
Extract structured information from unstructured (or loosely formatted) text.
Typical description of task:
 Identify named entities
 Identify relations between entities
 Populate a database
May also include:
 Event extraction
 Resolution of temporal expressions
 Wrapper induction (automatic construction of templates)
Applications:
 Natural language understanding,
 Question-answering, summarization, etc.
Information Extraction
 IE extracts pieces of information that are salient to the user's needs
 Find named entities such as persons and organizations
 Find attributes of those entities or events they participate in
 Contrast with IR, which merely indicates which documents need to be read by
a user
 Links between the extracted information and the original documents
are maintained to allow the user to reference context.
Schematic view of the Information
Extraction Process
Information Extraction
Relevant IE Definitions
Entities:
 Entities are the basic building blocks that can be found in text
documents (an object of interest)
 Examples: people, companies, locations, genes, and drugs.
Attributes:
 Attributes are features of the extracted entities. (A property of an entity
such as its name, alias, descriptor or type)
 Examples: the title of a person, the age of a person, and the type of an
organization.
Relevant IE Definitions
Facts:
 Facts are the relations that exist between entities. (a relationship held
between two or more entities such as the position of a person in a
company)
 Example: Employment relationship between a person and a company
or phosphorylation between two proteins.
Events:
 An event is an activity or occurrence of interest in which entities
participate
 An activity involving several entities such as a terrorist act, airline crash,
management change, new product introduction a merger between two
companies, a birthday and so on.
IE - Method
 Extract raw text (from HTML, PDF, PS, GIF, etc.)
 Tokenize
 Detect term boundaries
 We extracted alpha 1 type XIII collagen from …
 Their house council recommended…
 Detect sentence boundaries
 Tag parts of speech (POS)
 John/noun saw/verb Mary/noun.
 Tag named entities
 Person, place, organization, gene, chemical.
 Parse
 Determine co-reference
 Extract knowledge
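The steps above can be sketched in miniature. The tokenizer and entity tagger below are crude, hypothetical rule-based stand-ins for the trained components a real pipeline would use; they are only meant to show the shape of the processing chain.

```python
import re

def tokenize(text):
    # Split into word tokens and punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def tag_named_entities(tokens):
    """Label capitalized, non-sentence-initial tokens as PERSON.
    A crude illustrative heuristic, not a real NER model."""
    tags = []
    for i, tok in enumerate(tokens):
        if tok[0].isupper() and i > 0 and tokens[i - 1] not in {".", "!", "?"}:
            tags.append((tok, "PERSON"))
        else:
            tags.append((tok, "O"))
    return tags

tokens = tokenize("John saw Mary in Boston.")
print(tag_named_entities(tokens))
```

Each downstream step (parsing, co-reference, knowledge extraction) would consume the output of the previous one in the same fashion.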
Architecture: Components of
IE Systems
 Core linguistic components, adapted to or directly useful for NLP tasks in general
 IE-specific components, which address the core IE tasks.
 Domain-Independent
 Domain-specific components
 The following steps are performed in Domain-Independent part:
 Meta-data analysis:
 Extraction of the title, body, structure of the body (identification of
paragraphs), and the date of the document.
 Tokenization:
 Segmentation of the text into word-like units, called tokens and
classification of their type, e.g., identification of capitalized words,
words written in lowercase letters, hyphenated words, punctuation
signs, numbers, etc.
Architecture: Components of
IE Systems
 Morphological analysis:
 Extraction of morphological information from tokens which constitute
potential word forms: the base form (or lemma), part of speech, and other
morphological tags depending on the part of speech.
 e.g., verbs have features such as tense, mood, aspect, person, etc.
 Words which are ambiguous with respect to certain morphological categories
may undergo disambiguation. Typically part-of-speech disambiguation is
performed.
 Sentence/Utterance boundary detection:
 Segmentation of text into a sequence of sentences or utterances, each of which
is represented as a sequence of lexical items together with their features.
 Common Named-entity extraction:
 Detection of domain-independent named entities, such as temporal
expressions, numbers and currency, geographical references, etc.
Architecture: Components of
IE Systems
 Phrase recognition:
 Recognition of small-scale, local structures such as noun phrases, verb groups,
prepositional phrases, acronyms, and abbreviations.
 Syntactic analysis:
 Computation of a dependency structure (parse tree) of the sentence based on the
sequence of lexical items and small-scale structures.
 Syntactic analysis may be deep or shallow.
 In the former case, compute all possible interpretations (parse trees) and
grammatical relations within the sentence.
 In the latter case, the analysis is restricted to identification of non-recursive
structures or structures with limited amount of structural recursion, which
can be identified with a high degree of certainty, and linguistic phenomena
which cause problems (ambiguities) are not handled and represented with
underspecified structures.
Architecture: Components of
IE Systems
 The core IE tasks:
 NER,
 Co-reference resolution, and
 Detection of relations and events
These tasks are typically domain-specific, and are supported by domain-specific
system components and resources.
 Domain-specific processing is also supported on a lower level by detection of
specialized terms in text.
 Architecture: IE System
 In the domain specific core of the processing chain, a NER component is
applied to identify the entities relevant in a given domain.
 Patterns may then be applied to:
 Identify text fragments, which describe the target relations and events, and
 Extract the key attributes to fill the slots in the template representing the
relation/event.
IE System - Architecture
Typical Architecture of an Information
Extraction System
Architecture: Components of
IE Systems
 A co-reference component identifies mentions that refer to the same entity.
 Partially-filled templates are fused and validated using domain-specific
inference rules in order to create full-fledged relation/event descriptions.
 Several software packages provide various tools that can be used in the
process of developing an IE system, ranging from core linguistic processing
modules (e.g., language detectors, sentence splitters) to general IE-oriented
NLP frameworks.
IE Task Types
 Named Entity Recognition (NER)
 Co-reference Resolution (CO)
 Relation Extraction (RE)
 Event Extraction (EE)
Named Entity Recognition
 Named Entity Recognition (NER) addresses the problem of the identification
(detection) and classification of predefined types of named entities,
 Such as organizations (e.g., ‘World Health Organisation’), persons (e.g., ‘Mohamad
Gouse’), place names (e.g., ‘the Baltic Sea’), temporal expressions (e.g., ‘15 January
1984’), numerical and currency expressions (e.g., ‘20 Million Euros’), etc.
 The NER task may also include extracting descriptive information from the text about
the detected entities through filling of a small-scale template.
 For example, in the case of persons, this may include the title, position,
nationality, gender, and other attributes of the person.
 NER also involves lemmatization (normalization) of the named entities, which is
particularly crucial in highly inflective languages.
 For example, in Polish there are seven inflected forms of the name ‘Mohamad Gouse’
depending on grammatical case: ‘Mohamad Gouse’ (nominative), ‘Mohamad
Gouseego’ (genitive), ‘Mohamad Gouseemu’ (dative), ‘Mohamad Gouseiego’
(accusative), ‘Mohamad Gousem’ (instrumental), ‘Mohamad Gousem’ (locative),
‘Mohamad Gouse’ (vocative).
Co-Reference
 Co-reference Resolution (CO) requires the identification of multiple (coreferring) mentions of
the same entity in the text.
 Entity mentions can be:
 (a) Named, in case an entity is referred to by name
 e.g., ‘General Electric’ and ‘GE’ may refer to the same real-world entity.
 (b) Pronominal, in case an entity is referred to with a pronoun
 e.g., in ‘John bought food. But he forgot to buy drinks.’, the pronoun he refers to John.
 (c) Nominal, in case an entity is referred to with a nominal phrase
 e.g., in ‘Microsoft revealed its earnings. The company also unveiled future plans.’ the
definite noun phrase The company refers to Microsoft.
 (d) Implicit, as in the case of zero-anaphora
 e.g., in the Italian text fragment ‘Berlusconi ha visitato il luogo del disastro. Ha
sorvolato con l’elicottero.’
 (Berlusconi has visited the place of the disaster. [He] flew over with a helicopter.) the
second sentence does not have an explicit realization of the reference to
Berlusconi.
Relation Extraction
 Relation Extraction (RE) is the task of detecting and classifying predefined
relationships between entities identified in text.
 For example:
 EmployeeOf(Steve Jobs,Apple): a relation between a person and an
organisation, extracted from ‘Steve Jobs works for Apple’
 LocatedIn(Smith,New York): a relation between a person and location,
extracted from ‘Mr. Smith gave a talk at the conference in New York’,
 SubsidiaryOf(TVN,ITI Holding): a relation between two companies,
extracted from ‘Listed broadcaster TVN said its parent company, ITI
Holdings, is considering various options for the potential sale.’
 While the set of relations that may be of interest is unlimited, the set of relations
within a given task is predefined and fixed, as part of the specification of the
task.
Event Extraction
 Event Extraction (EE) refers to the task of identifying events in free text and
deriving detailed and structured information about them, ideally identifying who
did what to whom, when, where, through what methods (instruments), and why.
 Usually, event extraction involves extraction of several entities and relationships
between them.
 For instance, extraction of information on terrorist attacks from the text
fragment ‘Masked gunmen armed with assault rifles and grenades attacked a
wedding party in mainly Kurdish southeast Turkey, killing at least 44 people.’
 Involves identification of perpetrators (masked gunmen), victims (people),
number of killed/injured (at least 44), weapons and means used (rifles and
grenades), and location (southeast Turkey).
 Another example is the extraction of information on new joint ventures, where
the aim is to identify the partners, products, profits and capitalization of the joint
venture.
 EE is considered to be the hardest of the four IE tasks.
IE Subtask: Named Entity Recognition
 Detect and classify all proper names mentioned in text
 What is a proper name? Depends on application.
 People, places, organizations, times, amounts, etc.
 Names of genes and proteins
 Names of college courses
NER Example
 Find extent of each mention
 Classify each mention
 Sources of ambiguity
 Different strings that map to the same entity
 Equivalent strings that map to different entities (e.g., U.S. Grant)
Approaches to NER
 Early systems: hand-written rules
 Statistical systems
 Supervised learning (HMMs, Decision Trees, MaxEnt, SVMs, CRFs)
 Semi-supervised learning (bootstrapping)
 Unsupervised learning (rely on lexical resources, lexical patterns, and
corpus statistics)
A Sequence-Labeling Approach using
CRFs
 Input: Sequence of observations (tokens/words/text)
 Output: Sequence of states (labels/classes)
 B: Begin
 I: Inside
 O: Outside
 Some evidence that including L (Last) and U (Unit length) is
advantageous (Ratinov and Roth 09)
 A CRF defines a conditional probability p(Y|X) over label sequences Y
given an observation sequence X
 No effort wasted modeling the observations (in contrast to joint
models like HMMs)
 Arbitrary features of the observations may be captured by the model
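Whatever model produces the B/I/O labels, the label sequence must be decoded back into entity spans. Below is a minimal sketch of that decoding step, assuming simple B-TYPE / I-TYPE / O tags (not part of any particular CRF library):

```python
def bio_to_spans(tokens, labels):
    """Decode a B/I/O label sequence into (entity_type, start, end) spans,
    where end is exclusive."""
    spans, start, etype = [], None, None
    for i, lab in enumerate(labels + ["O"]):  # sentinel flushes the last span
        if lab.startswith("B-") or lab == "O" or (lab.startswith("I-") and lab[2:] != etype):
            if start is not None:
                spans.append((etype, start, i))
            start, etype = (i, lab[2:]) if lab.startswith("B-") else (None, None)
        # an I- tag continuing the current entity needs no action
    return spans

tokens = ["Steve", "Jobs", "works", "for", "Apple"]
labels = ["B-PER", "I-PER", "O", "O", "B-ORG"]
print(bio_to_spans(tokens, labels))  # [('PER', 0, 2), ('ORG', 4, 5)]
```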
Linear Chain CRFs
 Simplest and most common graph structure, used for
sequence modeling
 Inference can be done efficiently using dynamic
programming in O(|X||Y|²) time
Linear Chain CRFs
NER Features
 Several feature families used, all time-shifted by -2, -1, 0, 1, 2:
 The word itself
 Capitalization and digit patterns (shape patterns)
 8 lexicons entered by hand (e.g., honorifics, days, months)
 15 lexicons obtained from web sites (e.g., countries, publicly-traded
companies, surnames, stopwords, universities)
 25 lexicons automatically induced from the web (people names,
organizations, NGOs, nationalities)
Limitations of Conventional
NER(and IE)
 Supervised learning
 Expensive
 Inconsistent
 Worse for relations and events!
 Fixed, narrow, pre-specified sets of entity types
 Small, homogeneous corpora (newswire, seminar announcements)
Evaluating Named Entity Recognition
 Recall that recall is the ratio of the number of correctly labeled responses to the
total that should have been labeled.
 Precision is the ratio of the number of correctly labeled responses to the total
labeled.
 The F-measure provides a way to combine these two measures into a single
metric.
recall = N_correct / N_key

precision = N_correct / (N_correct + N_incorrect)

F = ((β² + 1) × precision × recall) / (β² × precision + recall)
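The recall, precision, and F-measure definitions above translate directly into code; a small sketch with illustrative counts:

```python
def precision(n_correct, n_incorrect):
    # fraction of labeled responses that are correct
    return n_correct / (n_correct + n_incorrect)

def recall(n_correct, n_key):
    # fraction of gold-standard ("key") items that were found
    return n_correct / n_key

def f_measure(p, r, beta=1.0):
    # weighted harmonic mean; beta=1 gives the common F1 score
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

# A system labels 8 entities, 6 of them correctly; the key contains 10.
p = precision(6, 2)   # 0.75
r = recall(6, 10)     # 0.6
print(round(f_measure(p, r), 3))  # 0.667
```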
What is Relation Extraction?
 Typically defined as identifying relations between two entities
Relations     Subtypes         Examples
Affiliations  Personal         married to, mother of
              Organizational   spokesman for, president of
              Artifactual      owns, invented, produces
Geospatial    Proximity        near, on outskirts
              Directional      southeast of
Part-of       Organizational   a unit of, parent of
              Political        annexed, acquired
Typical (Supervised) Approach
 FindEntities( ): named entity recognizer
 Related?( ): binary classifier that says whether two entities are involved in
a relation
 ClassifyRelation( ): classifier that labels relations discovered by
Related?( )
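The three-stage pipeline above can be sketched as follows. The three functions here are hypothetical stubs standing in for trained models; only the control flow reflects the approach.

```python
def find_entities(sentence):
    # stand-in for a real NER component: capitalized tokens
    return [tok for tok in sentence.split() if tok[0].isupper()]

def related(e1, e2, sentence):
    # stand-in for the binary Related? classifier
    return True

def classify_relation(e1, e2, sentence):
    # stand-in for the relation-label classifier
    return "EmployeeOf"

def extract_relations(sentence):
    entities = find_entities(sentence)
    triples = []
    for i, e1 in enumerate(entities):
        for e2 in entities[i + 1:]:
            if related(e1, e2, sentence):
                triples.append((classify_relation(e1, e2, sentence), e1, e2))
    return triples

print(extract_relations("Steve Jobs works for Apple"))
```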
Typical (Semi-Supervised) Approach
NELL: Never-Ending Language
Learner
NELL: Can computers learn to read?
 Goal: create a system that learns to read the web
 Reading task: Extract facts from text found on the web
 Learning task: Iteratively improve reading competence.
 http://rtw.ml.cmu.edu/rtw/
Approach
 Inputs
 Ontology with target categories and relations (i.e., predicates)
 Small number of seed examples for each
 Set of constraints that couple the predicates
 Large corpus of unlabeled documents
 Output: new predicate instances
 Semi-supervised bootstrap learning methods
 Couple the learning of functions to constrain the problem
 Exploit redundancy of information on the web.
Coupled Semi-Supervised Learning
Types of Coupling
1. Mutual Exclusion (output constraint)
 Mutually exclusive predicates can't both be satisfied by the same input x
 E.g., x cannot be a Person and a Sport
2. Relation Argument Type-Checking (compositional constraint)
 Arguments of relations declared to be of certain categories
 E.g., CompanyIsInEconomicSector(Company, EconomicSector)
3. Unstructured and Semi-Structured Text Features
(multi-view-agreement constraint)
 Look at different views (like co-training)
 Require classifiers agree
 E.g., freeform textual contexts and semi-structured contexts
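The first two coupling constraints can be sketched as simple filters over candidate extractions. The mini-ontology below is invented for illustration; NELL's real ontology and scoring are far richer.

```python
# Hypothetical mini-ontology: mutually exclusive category pairs and
# declared argument types for one relation.
MUTEX = {("Person", "Sport"), ("Sport", "Person")}
RELATION_TYPES = {"CompanyIsInEconomicSector": ("Company", "EconomicSector")}

def violates_mutex(candidate_labels):
    """Reject a candidate labeled with two mutually exclusive categories."""
    return any((a, b) in MUTEX for a in candidate_labels for b in candidate_labels)

def type_checks(relation, arg1_type, arg2_type):
    """Relation argument type-checking: do the argument categories match
    the relation's declared signature?"""
    return RELATION_TYPES.get(relation) == (arg1_type, arg2_type)

print(violates_mutex({"Person", "Sport"}))                                  # True
print(type_checks("CompanyIsInEconomicSector", "Company", "EconomicSector"))  # True
```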
System Architecture
Coupled Pattern Learner (CPL)
 Free-text extractor that learns contextual patterns to extract predicate
instances
 Use mutual exclusion and type-checking constraints to filter candidates
instances
 Rank instances and patterns by leveraging redundancy: if an instance or
pattern occurs more frequently, it's ranked higher
Coupled SEAL (CSEAL)
 SEAL (Set Expander for Any Language) is a wrapper induction algorithm
 Operates over semi-structured text such as web pages
 Constructs page-specific extraction rules (wrappers) that are human- and
markup-language independent
 CSEAL adds mutual-exclusion and type-checking constraints
CSEAL Wrappers
 Seeds: Ford, Nissan, Toyota
 arg1 is a placeholder for extracting instances
Open IE and TextRunner
 Motivations:
 Web corpora are massive, introducing scalability concerns
 Relations of interest are unanticipated, diverse, and abundant
 Use of “heavy” linguistic technology (NER systems and parsers) doesn't work
well
 Input: a large, heterogeneous Web corpus
 9M web pages, 133M sentences
 No pre-specified set of relations
 Output: huge set of extracted relations
 60.5M tuples, 11.3M high-probability tuples
 Tuples are indexed for searching
TextRunner Architecture
 Learner outputs a classifier that labels trustworthy extractions
 Extractor finds and outputs trustworthy extractions
 Assessor normalizes and scores the extractions
Architecture: Self-Supervised Learner
1. Automatically labels training data
 Uses a parser to induce dependency structures
 Parses a small corpus of several thousand sentences
 Identifies and labels a set of positive and negative extractions using
relation-independent heuristics
 An extraction is a tuple t = (e_i, r_{i,j}, e_j)
 Entities are base noun phrases
 Uses parse to identify potential relations
2. Trains a classifier
 Domain-independent, simple non-parse features
 E.g., POS tags, phrase chunks, regexes, stopwords, etc.
Architecture: Single-Pass Extractor
1. POS tag each word
2. Identify entities using lightweight NP chunker
3. Identify relations
4. Classify them
Architecture: Redundancy-Based
Assessor
 Take the tuples and perform
 Normalization, deduplication, synonym resolution
 Assessment
 Number of distinct sentences from which each extraction was found serves
as a measure of confidence
 Entities and relations indexed using Lucene
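The redundancy-based confidence idea above can be sketched in a few lines: a tuple seen in more distinct sentences scores higher. Normalization, deduplication, and synonym resolution are omitted here, and the tuples are invented examples.

```python
from collections import Counter

# One entry per distinct sentence an extraction was found in.
extractions = [
    ("Apple", "acquired", "NeXT"),       # sentence 1
    ("Apple", "acquired", "NeXT"),       # sentence 2
    ("Paris", "capital_of", "France"),   # sentence 3
]

counts = Counter(extractions)
ranked = sorted(counts.items(), key=lambda kv: -kv[1])
print(ranked[0])  # (('Apple', 'acquired', 'NeXT'), 2)
```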
Template Filling
 The task of template filling is to find documents that evoke such situations
and then fill the slots in templates with appropriate material.
 These slot fillers may consist of
 Text segments extracted directly from the text, or
 Concepts that have been inferred from text elements via some
additional processing (times, amounts, entities from an ontology, etc.).
Applications of IE
 Infrastructure for IR and for Categorization
 Information Routing
 Event Based Summarization
 Automatic Creation of Databases
 Company acquisitions
 Sports scores
 Terrorist activities
 Job listings
 Corporate titles and addresses
Inductive Algorithms for IE
 Rule Induction algorithms produce symbolic IE rules based on a corpus of
annotated documents.
 WHISK
 BWI
 The (LP)2 Algorithm
 The inductive algorithms are suitable for semi-structured domains, where
the rules are fairly simple, whereas when dealing with free text documents
(such as news articles) the probabilistic algorithms perform much better.
WHISK
 WHISK is a supervised learning algorithm that uses hand-tagged examples
for learning information extraction rules.
 Works for structured, semi-structured and free text.
 Extract both single-slot and multi-slot information.
 Doesn’t require syntactic preprocessing for structured and semi-structured
text; a syntactic analyzer and semantic tagger are recommended for free text.
 The extraction pattern learned by WHISK is in the form of limited regular
expression, considering tradeoff between expressiveness and efficiency.
 Example: IE task of extracting neighborhood, number of bedrooms and
price from the text
WHISK
 An Example from the Rental Ads domain
 An example extraction pattern which can be learned by WHISK is,
*(Neighborhood) *(Bedroom) * ‘$’(Number)
Neighborhood, Bedroom, and Number – Semantic classes specified by
domain experts.
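A WHISK pattern of this shape can be approximated as a regular expression. The alternations standing in for the semantic classes below are hypothetical; in WHISK they would come from the domain expert's class definitions.

```python
import re

# Hypothetical expansions of the semantic classes in the pattern
# *(Neighborhood) *(Bedroom) * '$'(Number)
NEIGHBORHOOD = r"(Capitol Hill|Queen Anne|Fremont)"
BEDROOM = r"(\d+\s*BR)"
NUMBER = r"(\d+)"

# '*' in WHISK is an unlimited (lazy) skip, rendered here as '.*?'
rule = re.compile(rf".*?{NEIGHBORHOOD}.*?{BEDROOM}.*?\${NUMBER}")

m = rule.search("Spacious Capitol Hill apt, 2 BR, hardwood floors, $700/mo")
print(m.groups())  # ('Capitol Hill', '2 BR', '700')
```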
 WHISK learns the extraction rules using a top-down covering algorithm.
 The algorithm begins learning a single rule by starting with an empty rule;
 Then add one term at a time until either no negative examples are covered
by the rule or the pre-pruning criterion has been satisfied.
 We add terms to specialize the rule in order to reduce its Laplacian error.
The Laplacian expected error is defined as shown below, where e is the
number of negative extractions and n is the number of positive extractions
on the training instances.
 Example:
 For instance, from the text “3 BR, upper flr of turn of ctry. Incl gar, grt N. Hill
loc 995$. (206)-999-9999,” the rule would extract the frame Bedrooms – 3,
Price – 995.
 The “*” char in the pattern will match any number of characters (unlimited
jump).
 Patterns enclosed in parentheses become numbered elements in the output
pattern; hence (Digit) is $1 and (Number) is $2.
Laplacian = (e + 1) / (n + 1)
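The Laplacian expected error, (e + 1) / (n + 1), can be used to compare candidate rule specializations; a minimal sketch:

```python
def laplacian_error(e, n):
    """Laplacian expected error of a rule: e is the number of negative
    extractions, n the number of positive extractions it makes."""
    return (e + 1) / (n + 1)

# WHISK prefers the specialization with the lower expected error:
print(laplacian_error(0, 9))  # 0.1
print(laplacian_error(2, 9))  # 0.3
```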
Boosted Wrapper Induction (BWI)
 The BWI is a system that utilizes wrapper induction techniques for
traditional Information Extraction.
 IE is treated as a classification problem that entails trying to approximate
two boundary functions Xbegin(i ) and Xend(i ).
 Xbegin(i ) is equal to 1 if the ith token starts a field that is part of the frame to
be extracted and 0 otherwise.
 Xend(i ) is defined in a similar way for tokens that end a field.
 The learning algorithm approximates each X function by taking a set of
pairs of the form {(i, X(i))} as training data.
 Each field is extracted by a wrapper W = <F, A, H> where
 F is a set of begin boundary detectors
 A is a set of end boundary detectors
 H(k) is the probability that the field has length k
 A boundary detector is just a sequence of tokens with wildcards (a kind of
regular expression).
 W(i, j) is a naïve Bayesian approximation of the probability that a field
begins at token i and ends at token j.
 The BWI algorithm learns two detectors by using a greedy algorithm that extends the
prefix and suffix patterns while there is an improvement in the accuracy.
 The sets F(i) and A(i) are generated from the detectors by using the AdaBoost
algorithm.
 The detector pattern can include specific words and regular expressions that work on a
set of wildcards such as <num>, <Cap>, <LowerCase>, <Punctuation> and <Alpha>.
W(i, j) = F(i) · A(j) · H(j − i + 1), and 0 otherwise

F(i) = Σ_k C_F^k F_k(i)
A(i) = Σ_k C_A^k A_k(i)
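A toy rendering of the wrapper score: W(i, j) multiplies the begin-detector confidence F(i), the end-detector confidence A(j), and the length probability H(j − i + 1). The numeric values below are invented for illustration; in BWI, F, A, and H are learned with AdaBoost.

```python
# Hypothetical learned scores, keyed by token position / field length.
F = {2: 0.8}   # F(i): confidence that token i begins the field
A = {4: 0.7}   # A(j): confidence that token j ends the field
H = {3: 0.6}   # H(k): probability that the field is k tokens long

def W(i, j):
    """W(i, j) = F(i) * A(j) * H(j - i + 1); 0 where any factor is absent."""
    return F.get(i, 0.0) * A.get(j, 0.0) * H.get(j - i + 1, 0.0)

print(round(W(2, 4), 3))  # 0.336
print(W(0, 4))            # 0.0
```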
(LP)2 Algorithm
 The (LP)2 algorithm learns from an annotated corpus and induces two sets
of rules:
 Tagging rules, generated by a bottom-up generalization process
 Correction rules, which correct mistakes and omissions made by the
tagging rules.
 A tagging rule is a pattern that contains conditions on words preceding the
place where a tag is to be inserted and conditions on the words that follow
the tag.
 Conditions can be words, lemmas, lexical categories (such as digit,
noun, or verb), case (lower or upper), or semantic categories (such as
time-id or cities).
 The (LP)2 algorithm is a covering algorithm that tries to cover all training
examples.
 The initial tagging rules are generalized by dropping conditions.
IE and Text Summarization
 User’s perspective,
 IE can be glossed as "I know what specific pieces of information I want–just
find them for me!",
 Summarization can be glossed as "What’s in the text that is interesting?".
 Technically, from the system builder’s perspective, the two applications blend into
each other.
 The most pertinent technical aspects are:
 Are the criteria of interestingness specified at run-time or by the system builder?
 Is the input a single document or multiple documents?
 Is the extracted information manipulated, either by simple content delineation
routines or by complex inferences, or just delivered verbatim?
 What is the grain size of the extracted units of information–individual entities
and events, or blocks of text?
 Is the output formulated in language, or in a computer-internal knowledge
representation?
Text Summarization
 An information access technology that given a
document or sets of related documents, extracts the
most important content from the source(s) taking into
account the user or task at hand, and presents this
content in a well formed and concise text
Text Summarization Techniques
 Topic Representation
 Influence of Context
 Indicator Representations
 Pattern Extraction
Text Summarization
Input: one or more text documents
Output: paragraph length summary
 Sentence extraction is the standard method
 Using features such as key words, sentence position in document,
cue phrases
 Identify sentences within documents that are salient
 Extract and string sentences together
 Machine learning for extraction
 Corpus of document/summary pairs
 Learn the features that best determine important sentences
 Summarization of scientific articles
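Sentence extraction as described above can be sketched with two of the listed features: keyword frequency and sentence position. The scoring weights here are arbitrary illustrative choices, not learned values.

```python
import re
from collections import Counter

def summarize(text, n=1):
    """Extract the n highest-scoring sentences, in document order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(idx, sent):
        # keyword score, dampened by position (earlier sentences favored)
        kw = sum(freq[w] for w in re.findall(r"\w+", sent.lower()))
        return kw / (1 + idx * 0.1)

    ranked = sorted(enumerate(sentences), key=lambda p: -score(*p))
    chosen = sorted(idx for idx, _ in ranked[:n])
    return " ".join(sentences[i] for i in chosen)

doc = "Apple acquired NeXT. The weather was nice. Apple shipped products."
print(summarize(doc))  # Apple acquired NeXT.
```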
A Summarization Machine
[Diagram: the summarization machine takes one or more documents (DOC,
MULTIDOCS) plus a QUERY and produces extracts or abstracts. Output varies by
type (indicative vs. informative), orientation (generic vs. query-oriented,
background vs. just-the-news), and target length (headline, very brief, brief,
long; 10%, 50%, 100%). Intermediate representations include case frames,
templates, core concepts, core events, relationships, clause fragments, and
index terms.]
The Modules of the Summarization
Machine
[Diagram: DOC → EXTRACTION → extracts; extracts → INTERPRETATION → case
frames, templates, core concepts, core events, relationships, clause
fragments, index terms; these → GENERATION → abstracts; FILTERING reduces
multi-document input to MULTIDOC EXTRACTS.]
What is Summarization?
 Data as input (database, software trace, expert system), text summary as output
 Text as input (one or more articles), paragraph summary as output
 Multimedia in input or output
 Summaries must convey maximal information in minimal space
 Involves: Three stages (typically)
 Content identification
 Find/Extract the most important material
 Conceptual organization
 Realization
Types of summaries
 Purpose
 Indicative, informative, and critical summaries
 Form
 Extracts (representative paragraphs/sentences/phrases)
 Abstracts: “a concise summary of the central subject matter of a
document”.
 Dimensions
 Single-document vs. multi-document
 Context
 Query-specific vs. query-independent
 Generic vs. query-oriented
 provides author’s view vs. reflects user’s interest.
Genres
 Headlines
 Outlines
 Minutes
 Biographies
 Abridgments
 Sound bites
 Movie summaries
 Chronologies, etc.
Aspects that Describe Summaries
 Input
 subject type: domain
 genre: newspaper articles, editorials, letters, reports...
 form: regular text structure, free-form
 source size: single doc, multiple docs (few,many)
 Purpose
 situation: embedded in larger system (MT, IR) or not?
 audience: focused or general
 usage: IR, sorting, skimming...
 Output
 completeness: include all aspects, or focus on some?
 format: paragraph, table, etc.
 style: informative, indicative, aggregative, critical...
Single Document Summarization
System Architecture
[Diagram: input document → extraction → extracted sentences → sentence
reduction → sentence combination → generation → output summary; supporting
resources include a corpus, decomposition rules, a lexicon, a parser, and a
co-reference component.]
Multi-Document Summarization
 Monitor variety of online information sources
 News, multilingual
 Email
 Gather information on events across source and time
 Same day, multiple sources
 Across time
 Summarize
 Highlighting similarities, new information, different perspectives,
user specified interests in real-time
Example System: SUMMARIST
Three stages:
1. Topic Identification Modules: Positional Importance, Cue Phrases (under
construction), Word Counts, Discourse Structure (under construction), ...
2. Topic Interpretation Modules: Concept Counting /Wavefront, Concept
Signatures (being extended)
3. Summary Generation Modules (not yet built): Keywords, Template Gen, Sent.
Planner & Realizer
SUMMARY = TOPIC ID + INTERPRETATION + GENERATION
 From extract to abstract:
topic interpretation or concept fusion.
 Experiment (Marcu, 98):
 Got 10 newspaper texts, with
human abstracts.
 Asked 14 judges to extract
corresponding clauses from texts, to
cover the same content.
 Compared word lengths of extracts
to abstracts: extract_length ≈ 2.76 ×
abstract_length !!
Topic Interpretation
[Diagram: a long source text is condensed, via topic interpretation, into a
much shorter text.]
Some Types of Interpretation
 Concept generalization:
Sue ate apples, pears, and bananas → Sue ate fruit
 Meronymy replacement:
Both wheels, the pedals, saddle, chain… → the bike
 Script identification:
He sat down, read the menu, ordered, ate, paid, and left → He ate at the
restaurant
 Metonymy:
A spokesperson for the US Government announced that… → Washington
announced that...
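Concept generalization can be sketched with a hypernym table; the tiny table below is invented for illustration, where a real system would consult a resource such as WordNet.

```python
# Hypothetical hypernym table.
HYPERNYM = {"apples": "fruit", "pears": "fruit", "bananas": "fruit"}

def generalize(items):
    """If every item shares a single hypernym, replace the list with it;
    otherwise leave the list unchanged."""
    parents = {HYPERNYM.get(it) for it in items}
    if len(parents) == 1 and None not in parents:
        return parents.pop()
    return items

print(generalize(["apples", "pears", "bananas"]))  # fruit
print(generalize(["apples", "shoes"]))             # ['apples', 'shoes']
```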
General Aspects of Interpretation
 Interpretation occurs at the conceptual level...
…words alone are polysemous (bat → animal or sports
instrument) and combine for meaning (alleged murderer ≠
murderer).
 For interpretation, you need world knowledge...
…the fusion inferences are not in the text!
 Extract a pattern for each event in training data
 part of speech & mention tags
 Example: Japanese political leaders → GPE JJ PER

Text:     Japanese   political   leaders
Ents:     GPE        -           PER
POS:      NN         JJ          NN
Pattern:  GPE        JJ          PER
Pattern Extraction
Summarization - Scope
 Data preparation:
 Collect large sets of texts with abstracts, all genres.
 Build large corpora of <Text, Abstract, Extract> tuples.
 Investigate relationships between extracts and abstracts (using <Extract,
Abstract> tuples).
 Types of summary:
 Determine characteristics of each type.
 Topic Identification:
 Develop new identification methods (discourse, etc.).
 Develop heuristics for method combination (train heuristics on <Text,
Extract> tuples).
Summarization - Scope
 Concept Interpretation (Fusion):
 Investigate types of fusion (semantic, evaluative…).
 Create large collections of fusion knowledge/rules (e.g., signature
libraries, generalization and partonymic hierarchies, metonymy
rules…).
 Study incorporation of User’s knowledge in interpretation.
 Generation:
 Develop Sentence Planner rules for dense packing of content into
sentences (using <Extract, Abstract> pairs).
 Evaluation:
 Develop better evaluation metrics, for types of summaries.
Apriori Algorithm
 In computer science and data mining, Apriori is a classic algorithm for
learning association rules.
 Apriori is designed to operate on databases containing transactions.
 For example, collections of items bought by customers, or details of website
visits.
 The algorithm attempts to find subsets which are common to at least a
minimum number C (the cutoff, or confidence threshold) of the itemsets.
 Apriori uses a "bottom up" approach, where frequent subsets are extended
one item at a time (a step known as candidate generation), and groups of
candidates are tested against the data.
 The algorithm terminates when no further successful extensions are found.
 Apriori uses breadth-first search and a hash tree structure to count candidate
item sets efficiently.
Find rules in two stages
Agrawal et al. divided the problem of finding good rules into two phases:
1. Find all itemsets with a specified minimal support (coverage). An itemset
is just a specific set of items, e.g. {apples, cheese}. The Apriori algorithm
can efficiently find all itemsets whose coverage is above a given
minimum.
2. Use these itemsets to help generate interesting rules. Having done stage
1, we have considerably narrowed down the possibilities, and can do
reasonably fast processing of the large itemsets to generate candidate
rules.
Terminology
k-itemset : a set of k items. E.g.
{beer, cheese, eggs} is a 3-itemset
{cheese} is a 1-itemset
{honey, ice-cream} is a 2-itemset
support: an itemset has support s% if s% of the records in the DB contain that
itemset.
minimum support: the Apriori algorithm starts with the specification of a
minimum level of support, and will focus on itemsets with this level or
above.
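Computing the support of an itemset is a one-liner over the transaction database; the toy database below reuses the itemset examples above.

```python
# Toy transaction database built from the terminology examples.
DB = [
    {"beer", "cheese", "eggs"},
    {"beer", "cheese"},
    {"honey", "ice-cream"},
    {"beer", "eggs"},
]

def support(itemset, db):
    """Fraction of records in the DB that contain every item in the itemset."""
    return sum(itemset <= t for t in db) / len(db)

print(support({"beer", "cheese"}, DB))  # 0.5
```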
Terminology
large itemset: doesn’t mean an itemset with many items. It means one
whose support is at least minimum support.
Lk : the set of all large k-itemsets in the DB.
Ck : a set of candidate large k-itemsets. In the algorithm we will look at, it
generates this set, which contains all the k-itemsets that might be large,
and then eventually generates the set above.
Terminology
sets: Let A be a set (A = {cat, dog}) and
let B be a set (B = {dog, eel, rat}) and
let C = {eel, rat}
I use ‘A + B’ to mean A union B.
So A + B = {cat, dog, eel, rat}
When X is a subset of Y, I use Y – X to mean the set of things in Y which
are not in X.
E.g. B – C = {dog}
Apriori Algorithm
Find all large 1-itemsets
For (k = 2; Lk-1 is non-empty; k++)
{ Ck = apriori-gen(Lk-1)
  For each c in Ck, initialise c.count to zero
  For all records r in the DB
  { Cr = subset(Ck, r); For each c in Cr, c.count++ }
  Set Lk := all c in Ck whose count >= minsup
} /* end -- return all of the Lk sets */
The algorithm returns all of the (non-empty) Lk sets, which gives us an
excellent start in finding interesting rules (although the large itemsets
themselves will usually be very interesting and useful).
Example: Generation of candidate itemsets and frequent
itemsets, where the minimum support count is 2.
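The pseudocode above might be realized as the following Python sketch, where apriori-gen is implemented as a self-join of L(k-1) followed by the Apriori prune step; the transaction data are invented for illustration:

```python
from itertools import combinations

def apriori(db, minsup):
    """Return all large itemsets (support count >= minsup) as frozensets."""
    # Find all large 1-itemsets
    counts = {}
    for r in db:
        for item in r:
            c = frozenset([item])
            counts[c] = counts.get(c, 0) + 1
    large = {s for s, n in counts.items() if n >= minsup}
    all_large, k = set(large), 2
    while large:                                   # while L(k-1) is non-empty
        # apriori-gen: join step ...
        cands = {a | b for a in large for b in large if len(a | b) == k}
        # ... and prune step: every (k-1)-subset must itself be large
        cands = {c for c in cands
                 if all(frozenset(s) in large for s in combinations(c, k - 1))}
        # one DB pass to count the surviving candidates
        counts = {c: sum(c <= r for r in db) for c in cands}
        large = {c for c, n in counts.items() if n >= minsup}
        all_large |= large
        k += 1
    return all_large

# Hypothetical transaction DB, minimum support count 3
db = [{"bread", "milk"},
      {"bread", "diaper", "beer", "eggs"},
      {"milk", "diaper", "beer", "cola"},
      {"bread", "milk", "diaper", "beer"},
      {"bread", "milk", "diaper", "cola"}]
result = apriori(db, minsup=3)
# {diaper, beer} is large (count 3); {bread, milk, diaper} is not (count 2)
```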
Apriori Merits/Demerits
 Merits
 Uses large itemset property
 Easily parallelized
 Easy to implement
 Demerits
 Assumes transaction database is memory resident.
 Requires many database scans.
Summary
 Association Rules are a widely applied data mining approach.
 Association Rules are derived from frequent itemsets.
 The Apriori algorithm is an efficient algorithm for finding all frequent
itemsets.
 The Apriori algorithm implements level-wise search using the frequent
itemset property.
 The Apriori algorithm can be additionally optimized.
 There are many measures for association rules.
FP-Growth Algorithm
Frequent Pattern Mining:
An Example
Given a transaction database DB and a minimum support threshold ξ, find
all frequent patterns (item sets) with support no less than ξ.
Input:
DB:
TID Items bought
100 {f, a, c, d, g, i, m, p}
200 {a, b, c, f, l, m, o}
300 {b, f, h, j, o}
400 {b, c, k, s, p}
500 {a, f, c, e, l, p, m, n}
Minimum support: ξ = 3
Output: all frequent patterns, i.e., f, a, …, fa, fac, fam, fm, am, …
Problem Statement: How to efficiently find all frequent patterns?
Overview of FP-Growth: Ideas
 Compress a large database into a compact, Frequent-Pattern tree (FP-tree)
structure
 highly compacted, but complete for frequent pattern mining
 avoid costly repeated database scans
 Develop an efficient, FP-tree-based frequent pattern mining method
(FP-growth)
 A divide-and-conquer methodology: decompose mining tasks into
smaller ones
 Avoid candidate generation: sub-database test only.
FP-tree:
Construction and Design
Construct FP-tree
Two Steps:
1. Scan the transaction DB for the first time, find frequent items (single
item patterns) and order them into a list L in frequency descending order.
e.g., L={f:4, c:4, a:3, b:3, m:3, p:3}
In the format of (item-name, support)
2. For each transaction, order its frequent items according to the order in L;
Scan DB the second time, construct FP-tree by putting each frequency
ordered transaction onto it.
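The two steps above can be sketched compactly in Python. The class and variable names are my own, and ties among equally frequent items may be ordered differently than in the slide's L:

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fptree(db, minsup):
    # Step 1: first DB scan -- frequent items, ordered into L by descending frequency
    freq = Counter(i for t in db for i in t)
    L = [i for i, c in freq.most_common() if c >= minsup]
    # Step 2: second DB scan -- insert each frequency-ordered transaction as a path
    root, header = Node(None, None), {i: [] for i in L}   # header keeps node-links
    for t in db:
        node = root
        for item in [i for i in L if i in t]:
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)     # link the new node for this item
            else:
                node.children[item].count += 1  # shared prefix: just bump the count
            node = node.children[item]
    return root, header, L

# The example DB from the slides, minimum support 3
db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
root, header, L = build_fptree(db, minsup=3)
# f heads 4 transactions along one path; p occurs 3 times across two branches
```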
FP-tree Example: step 1
Step 1: Scan DB for the first time to generate L (a by-product of the first
scan of the database).
TID Items bought
100 {f, a, c, d, g, i, m, p}
200 {a, b, c, f, l, m, o}
300 {b, f, h, j, o}
400 {b, c, k, s, p}
500 {a, f, c, e, l, p, m, n}
L:
Item frequency
f 4
c 4
a 3
b 3
m 3
p 3
FP-tree Example: step 2
TID Items bought (ordered) frequent items
100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}
200 {a, b, c, f, l, m, o} {f, c, a, b, m}
300 {b, f, h, j, o} {f, b}
400 {b, c, k, s, p} {c, b, p}
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}
Step 2: scan the DB for the second time, order frequent items
in each transaction
FP-tree Example: step 2
Step 2: construct FP-tree
After inserting transaction 100 ({f, c, a, m, p}), the tree is a single path
from the root: {} -> f:1 -> c:1 -> a:1 -> m:1 -> p:1.
After inserting transaction 200 ({f, c, a, b, m}), the shared prefix counts
become f:2, c:2, a:2, and a new branch b:1 -> m:1 splits off below a.
NOTE: Each transaction corresponds to one path in the FP-tree.
FP-tree Example: step 2 (continued)
After inserting transaction 300 ({f, b}), f's count becomes f:3 and a new
branch b:1 hangs directly off f.
After inserting transaction 400 ({c, b, p}), a second path
{} -> c:1 -> b:1 -> p:1 starts from the root.
After inserting transaction 500 ({f, c, a, m, p}), the main path counts
become f:4 -> c:3 -> a:3 -> m:2 -> p:2.
Nodes carrying the same item are connected by node-links.
Construction Example
Final FP-tree: root {} with children f:4 and c:1; under f:4, the path
c:3 -> a:3 with m:2 -> p:2 and b:1 -> m:1 below a, plus a separate b:1
child of f; under c:1, the path b:1 -> p:1.
Header Table: the items f, c, a, b, m, p, each with the head of its
node-link chain.
FP-Tree Definition
 FP-tree is a frequent pattern tree.
 Formally, FP-tree is a tree structure defined below:
1. One root labeled as "null", a set of item prefix sub-trees as the children of
the root, and a frequent-item header table.
2. Each node in the item prefix sub-trees has three fields:
 item-name: registers which item this node represents,
 count: the number of transactions represented by the portion of the path
reaching this node,
 node-link: links to the next node in the FP-tree carrying the same
item-name, or null if there is none.
3. Each entry in the frequent-item header table has two fields,
 item-name, and
 head of node-link that points to the first node in the FP-tree carrying the
item-name.
Advantages of the FP-tree Structure
 The most significant advantage of the FP-tree
 Scan the DB twice, and only twice.
 Completeness:
 The FP-tree contains all the information related to mining frequent patterns
(given the min-support threshold).
 Compactness:
 The size of the tree is bounded by the occurrences of frequent items
 The height of the tree is bounded by the maximum number of items in a
transaction
FP-growth:
Mining Frequent Patterns
Using FP-tree
Mining Frequent Patterns Using FP-tree
 General idea (divide-and-conquer)
Recursively grow frequent patterns using the FP-tree: looking for shorter
ones recursively and then concatenating the suffix:
 For each frequent item, construct its conditional pattern base, and then its
conditional FP-tree;
 Repeat the process on each newly created conditional FP-tree until the
resulting FP-tree is empty, or it contains only one path (single path will
generate all the combinations of its sub-paths, each of which is a frequent
pattern)
3 Major Steps
Starting the processing from the end of list L:
Step 1:
Construct conditional pattern base for each item in the header table
Step 2:
Construct conditional FP-tree from each conditional pattern base
Step 3:
Recursively mine conditional FP-trees and grow frequent patterns
obtained so far. If the conditional FP-tree contains a single path, simply
enumerate all the patterns
Step 1: Construct Conditional Pattern Base
 Starting at the bottom of frequent-item header table in the FP-tree
 Traverse the FP-tree by following the link of each frequent item
 Accumulate all of transformed prefix paths of that item to form a
conditional pattern base
Conditional pattern bases (read from the final FP-tree by following each
item's node-links from the header table):
item  cond. pattern base
p     fcam:2, cb:1
m     fca:2, fcab:1
b     fca:1, f:1, c:1
a     fc:3
c     f:3
f     { }
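Step 1 can also be sketched without walking the tree: aggregating the per-transaction prefixes of an item (in L-order) yields the same counts as accumulating the item's transformed prefix paths via node-links, since path counts are just sums over transactions. A sketch using the slide's example DB and L (function name is my own):

```python
from collections import Counter

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
L = ["f", "c", "a", "b", "m", "p"]        # frequency-descending order from the example

def cond_pattern_base(item, db, L):
    """Prefix paths preceding `item` in each frequency-ordered transaction."""
    base = Counter()
    for t in db:
        ordered = [i for i in L if i in t]   # keep only frequent items, in L-order
        if item in ordered:
            prefix = tuple(ordered[:ordered.index(item)])
            if prefix:
                base[prefix] += 1
    return base

cond_pattern_base("m", db, L)   # Counter({('f','c','a'): 2, ('f','c','a','b'): 1})
```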
Properties of FP-Tree
 Node-link property
 For any frequent item ai, all the possible frequent patterns that contain ai
can be obtained by following ai's node-links, starting from ai's head in
the FP-tree header.
 Prefix path property
 To calculate the frequent patterns for a node ai in a path P, only the
prefix sub-path of ai in P need to be accumulated, and its frequency
count should carry the same count as node ai.
Step 2: Construct Conditional FP-tree
 For each pattern base
 Accumulate the count for each item in the base
 Construct the conditional FP-tree for the frequent items of the pattern
base
m's conditional pattern base: fca:2, fcab:1.
Accumulating counts over this base gives f:3, c:3, a:3 (b:1 falls below the
minimum support), so the m-conditional FP-tree is the single path
{} -> f:3 -> c:3 -> a:3.
Step 3: Recursively mine the conditional FP-tree
Mining m's conditional FP-tree (pattern base fca:3) recursively adds "a",
"c", and "f" to grow the patterns:
 conditional FP-tree of "am": (fc:3); of "cm": (f:3); of "fm": 3
 conditional FP-tree of "cam": (f:3); of "fam": 3; of "fcm": 3
 conditional FP-tree of "fcam": 3
Each conditional FP-tree yields a frequent pattern: m, am, cm, fm, cam,
fam, fcm, fcam.
Principles of FP-Growth
 Pattern growth property
 Let α be a frequent itemset in DB, B be α's conditional pattern base,
and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β
is frequent in B.
 Is "fcabm" a frequent pattern?
 "fcab" is a branch of m's conditional pattern base
 "b" is NOT frequent in transactions containing "fcab"
 "bm" is NOT a frequent itemset.
Conditional Pattern Bases and
Conditional FP-Tree
Item  Conditional pattern base    Conditional FP-tree
f     Empty                       Empty
c     {(f:3)}                     {(f:3)}|c
a     {(fc:3)}                    {(f:3, c:3)}|a
b     {(fca:1), (f:1), (c:1)}     Empty
m     {(fca:2), (fcab:1)}         {(f:3, c:3, a:3)}|m
p     {(fcam:2), (cb:1)}          {(c:3)}|p
(items listed in the order of L)
Single FP-tree Path Generation
 Suppose an FP-tree T has a single path P.
 The complete set of frequent patterns of T can be generated by enumeration of
all the combinations of the sub-paths of P
m-conditional FP-tree: {} -> f:3 -> c:3 -> a:3
All frequent patterns concerning m: combination of {f, c, a} and m:
m,
fm, cm, am,
fcm, fam, cam,
fcam
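The single-path case can be sketched directly; the path below is the m-conditional FP-tree from the example (variable names are my own):

```python
from itertools import combinations

path = [("f", 3), ("c", 3), ("a", 3)]    # single path of the m-conditional FP-tree
suffix = ("m",)

patterns = [(suffix, 3)]                  # the suffix item itself is frequent
for k in range(1, len(path) + 1):
    for combo in combinations(path, k):
        items = tuple(i for i, _ in combo) + suffix
        supp = min(c for _, c in combo)   # support = min count along the chosen nodes
        patterns.append((items, supp))
# yields m, fm, cm, am, fcm, fam, cam, fcam -- all with support 3
```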
Summary of FP-Growth Algorithm
 Mining frequent patterns can be viewed as first mining 1-itemset and
progressively growing each 1-itemset by mining on its conditional pattern base
recursively
 Transform a frequent k-itemset mining problem into a sequence of k frequent 1-
itemset mining problems via a set of conditional pattern bases
Efficiency Analysis
Facts: usually
1. FP-tree is much smaller than the size of the DB
2. Pattern base is smaller than original FP-tree
3. Conditional FP-tree is smaller than pattern base
 Mining process works on a set of usually much smaller pattern
bases and conditional FP-trees
 Divide-and-conquer and dramatic shrinking of the data size
Performance Improvement
 Projected DBs: partition the DB into a set of projected DBs, then
construct an FP-tree and mine it in each projected DB.
 Disk-resident FP-tree: store the FP-tree on hard disks by using a B+ tree
structure to reduce I/O cost.
 FP-tree materialization: a low ξ may usually satisfy most of the mining
queries in the FP-tree construction.
 FP-tree incremental update: how to update an FP-tree when there are new
data? Either reconstruct the FP-tree, or do not update the FP-tree.
  • 2. What is Information Extraction? Goal: Extract structured information from unstructured (or loosely formatted) text. Typical description of task:  Identify named entities  Identify relations between entities  Populate a database May also include:  Event extraction  Resolution of temporal expressions  Wrapper induction (automatic construction of templates) Applications:  Natural language understanding,  Question-answering, summarization, etc.
  • 3. Information Extraction  IE extracts pieces of information that are salient to the user's needs  Find namedentities such as persons and organizations  Find find attributes of those entities or events they participate in  ContrastIR, which indicates which documents need to be read by a user  Links between the extracted information and the original documents are maintained to allow the user to reference context.
  • 4. Schematic view of the Information Extraction Process
  • 6. Relevant IE Definitions Entities:  Entities are the basic building blocks that can be found in text documents.(An object of interest)  Examples: people, companies, locations, genes, and drugs. Attributes:  Attributes are features of the extracted entities. (A property of an entity such as its name, alias, descriptor or type)  Examples: the title of a person, the age of a person, and the type of an organization.
  • 7. Relevant IE Definitions Facts:  Facts are the relations that exist between entities. (a relationship held between two or more entities such as the position of a person in a company)  Example: Employment relationship between a person and a company or phosphorylation between two proteins. Events:  An event is an activity or occurrence of interest in which entities participate  An activity involving several entities such as a terrorist act, airline crash, management change, new product introduction a merger between two companies, a birthday and so on.
  • 8.
  • 9. IE - Method  Extract raw text(html, pdf, ps, gif.)  Tokenize  Detect term boundaries  We extracted alpha 1 type XIII collagen from …  Their house council recommended…  Detect sentence boundaries  Tag parts of speech (POS)  John/noun saw/verb Mary/noun.  Tag named entities  Person, place, organization, gene, chemical.  Parse  Determine co-reference  Extract knowledge
  • 10. Architecture: Components of IE Systems  Core linguistic components, adapted to or be useful for NLP tasks in general  IE-specific components, address the core IE tasks.  Domain-Independent  Domain-specific components  The following steps are performed in Domain-Independent part:  Meta-data analysis:  Extraction of the title, body, structure of the body (identification of paragraphs), and the date of the document.  Tokenization:  Segmentation of the text into word-like units, called tokens and classification of their type, e.g., identification of capitalized words, words written in lowercase letters, hyphenated words, punctuation signs, numbers, etc.
  • 11. Architecture: Components of IE Systems  Morphological analysis:  Extraction of morphological information from tokens which constitute potential word forms-the base form(or lemma), part of speech, other morphological tags depending on the part of speech.  e.g., verbs have features such as tense, mood, aspect, person, etc.  Words which are ambiguous with respect to certain morphological categories may undergo disambiguation. Typically part-of-speech disambiguation is performed.  Sentence/Utterance boundary detection:  Segmentation of text into a sequence of sentences or utterances, each of which is represented as a sequence of lexical items together with their features.  Common Named-entity extraction:  Detection of domain-independent named entities, such as temporal expressions, numbers and currency, geographical references, etc.
  • 12. Architecture: Components of IE Systems  Phrase recognition:  Recognition of small-scale, local structures such as noun phrases, verb groups, prepositional phrases, acronyms, and abbreviations.  Syntactic analysis:  Computation of a dependency structure (parse tree) of the sentence based on the sequence of lexical items and small-scale structures.  Syntactic analysis may be deep or shallow.  In the former case, compute all possible interpretations (parse trees) and grammatical relations within the sentence.  In the latter case, the analysis is restricted to identification of non-recursive structures or structures with limited amount of structural recursion, which can be identified with a high degree of certainty, and linguistic phenomena which cause problems (ambiguities) are not handled and represented with underspecified structures.
  • 13. Architecture: Components of IE Systems  The core IE tasks:  NER,  Co-reference resolution, and  Detection of relations and events Typically domain-specific, and are supported by domain-specific system components and resources.  Domain-specific processing is also supported on a lower level by detection of specialized terms in text.  Architecture: IE System  In the domain specific core of the processing chain, a NER component is applied to identify the entities relevant in a given domain.  Patterns may then be applied to:  Identify text fragments, which describe the target relations and events, and  Extract the key attributes to fill the slots in the template representing the relation/event.
  • 14. IE System - Architecture
  • 15. Typical Architecture of an Information Extraction System
  • 16. Architecture: Components of IE Systems  A co-reference component identifies mentions that refer to the same entity.  Partially-filled templates are fused and validated using domain-specific inference rules in order to create full-fledged relation/event descriptions.  Several software packages to provide various tools that can be used in the process of developing an IE system, ranging from core linguistic processing modules (e.g., language detectors, sentence splitters), to general IE-oriented NLP frameworks.
  • 17. IE Task Types  Named Entity Recognition (NER)  Co-reference Resolution (CO)  Relation Extraction (RE)  Event Extraction (EE)
  • 18. Named Entity Recognition  Named Entity Recognition (NER) addresses the problem of the identification (detection) and classification of predefined types of named entities,  Such as organizations (e.g., ‘World Health Organisation’), persons (e.g., ‘Mohamad Gouse’), place names (e.g., ‘the Baltic Sea’), temporal expressions (e.g., ‘15 January 1984’), numerical and currency expressions (e.g., ‘20 MillionEuros’), etc.  NER task include extracting descriptive information from the text about the detected entities through filling of a small-scale template.  Example, in the case of persons, it may include extracting the title, position, nationality, gender, and other attributes of the person.  NER also involves lemmatization (normalization) of the named entities, which is particularly crucial in highly inflective languages.  Example in Polish there are six inflected forms of the name ‘Mohamad Gouse’ depending on grammatical case: ‘Mohamad Gouse’ (nominative), ‘Mohamad Gouseego’ (genitive), Mohamad Gouseemu (dative), ‘Mohamad Gouseiego’ (accusative), Mohamad Gousem (instrumental), Mohamad Gousem (locative), Mohamad Gouse(vocative).
  • 19. Co-Reference  Co-reference Resolution (CO) requires the identification of multiple (coreferring) mentions of the same entity in the text.  Entity mentions can be:  (a) Named, in case an entity is referred to by name  e.g., ‘General Electric’ and ‘GE’ may refer to the same real-world entity.  (b) Pronominal, in case an entity is referred to with a pronoun  e.g., in ‘John bought food. But he forgot to buy drinks.’, the pronoun he refers to John.  (c) Nominal, in case an entity is referred to with a nominal phrase  e.g., in ‘Microsoft revealed its earnings. The company also unveiled future plans.’ the definite noun phrase The company refers to Microsoft.  (d) Implicit, as in case of using zero-anaphora1  e.g., in the Italian text fragment ‘OEBerlusconii ha visitato il luogo del disastro. i Ha sorvolato, con l’elicottero.’  (Berlusconi has visited the place of disaster. [He] flew over with a helicopter.) the second sentence does not have an explicit realization of the reference to Berlusconi.
  • 20. Relation Extraction  Relation Extraction (RE) is the task of detecting and classifying predefined relationships between entities identified in text.  For example:  EmployeeOf(Steve Jobs,Apple): a relation between a person and an organisation, extracted from ‘Steve Jobs works for Apple’  LocatedIn(Smith,New York): a relation between a person and location, extracted from ‘Mr. Smith gave a talk at the conference in New York’,  SubsidiaryOf(TVN,ITI Holding): a relation between two companies, extracted from ‘Listed broadcaster TVN said its parent company, ITI Holdings, is considering various options for the potential sale.  The set of relations that may be of interest is unlimited, the set of relations within a given task is predefined and fixed, as part of the specification of the task.
  • 21. Event Extraction  Event Extraction (EE) refers to the task of identifying events in free text and deriving detailed and structured information about them, ideally identifying who did what to whom, when, where, through what methods (instruments), and why.  Usually, event extraction involves extraction of several entities and relationships between them.  For instance, extraction of information on terrorist attacks from the text fragment ‘Masked gunmen armed with assault rifles and grenades attacked a wedding party in mainly Kurdish southeast Turkey, killing at least 44 people.’  Involves identification of perpetrators (masked gunmen), victims (people), number of killed/injured (at least 44), weapons and means used (rifles and grenades), and location (southeast Turkey).  Another example is the extraction of information on new joint ventures, where the aim is to identify the partners, products, profits and capitalization of the joint venture.  EE is considered to be the hardest of the four IE tasks.
  • 22. IE Subtask: Named Entity Recognition  Detect and classify all proper names mentioned in text  What is a proper name? Depends on application.  People, places, organizations, times, amounts, etc.  Names of genes and proteins  Names of college courses
  • 23. NER Example  Find extent of each mention  Classify each mention  Sources of ambiguity  Different strings that map to the same entity  Equivalent strings that map to different entities (e.g., U.S. Grant)
  • 24. Approaches to NER  Early systems: hand-written rules  Statistical systems  Supervised learning (HMMs, Decision Trees, MaxEnt, SVMs, CRFs)  Semi-supervised learning (bootstrapping)  Unsupervised learning (rely on lexical resources, lexical patterns, and corpus statistics)
  • 25. A Sequence-Labeling Approach using CRFs  Input: Sequence of observations (tokens/words/text)  Output: Sequence of states (labels/classes)  B: Begin  I: Inside  O: Outside  Some evidence that including L (Last) and U (Unit length) is advantageous (Ratinov and Roth 09)  CRFs defines a conditional probability p(Y|X) over label sequences Y given an observation sequence X  No effort wasted modeling the observations (in contrast to joint models like HMMs)  Arbitrary features of the observations may be captured by the model
  • 26. Linear Chain CRFs  Simplest and most common graph structure, used for sequence modeling  Inference can be done efficiently using dynamic programming O(|X||Y|2)
  • 28. NER Features  Several feature families used, all time-shifted by -2, -1, 0, 1, 2:  The word itself  Capitalization and digit patterns (shape patterns)  8 lexicons entered by hand (e.g., honorifics, days, months)  15 lexicons obtained from web sites (e.g., countries, publicly-traded companies, surnames, stopwords, universities)  25 lexicons automatically induced from the web (people names, organizations, NGOs, nationalities)
  • 29. Limitations of Conventional NER(and IE)  Supervised learning  Expensive  Inconsistent  Worse for relations and events!  Fixed, narrow, pre-specified sets of entity types  Small, homogeneous corpora (newswire, seminar announcements)
  • 30. Evaluating Named Entity Recognition  Recall that recall is the ratio of the number of correctly labeled responses to the total that should have been labeled.  Precision is the ratio of the number of correctly labeled responses to the total labeled.  The F-measure provides a way to combine these two measures into a single metric. key correct N N recall  incorrectcorrect correct NN N precision   recallprecision recallprecision F    2 2 )1(  
  • 31. What is Relation Extraction?  Typically defined as identifying relations between two entities Relations Subtypes Examples Affiliations Personal Organizational Artifactual married to, mother of spokesman for, president of owns, invented, produces Geospatial Proximity Directional near, on outskirts southeast of Part-of Organizational Political a unit of, parent of annexed, acquired
  • 32. Typical (Supervised) Approach  FindEntities( ): Named entity recognizer  Related?( ): Binary classier that says whether two entities are involved in a relation  ClassifyRelation( ): Classier that labels relations discovered by Related?( )
  • 34. NELL: Never-Ending Language Learner NELL: Can computers learn to read?  Goal: create a system that learns to read the web  Reading task: Extract facts from text found on the web  Learning task: Iteratively improve reading competence.  http://rtw.ml.cmu.edu/rtw/
  • 35. Approach  Inputs  Ontology with target categories and relations (i.e., predicates)  Small number of seed examples for each  Set of constraints that couple the predicates  Large corpus of unlabeled documents  Output: new predicate instances  Semi-supervised bootstrap learning methods  Couple the learning of functions to constrain the problem  Exploit redundancy of information on the web.
  • 37. Types of Coupling 1. Mutual Exclusion (output constraint)  Mutually exclusive predicates can't both be satisfied by the same input x  E.g., x cannot be a Person and a Sport 2. Relation Argument Type-Checking (compositional constraint)  Arguments of relations declared to be of certain categories  E.g., CompanyIsInEconomicSector(Company, EconomicSector) 3. Unstructured and Semi-Structured Text Features (multi-view-agreement constraint)  Look at different views (like co-training)  Require classifiers agree  E.g., freeform textual contexts and semi-structured contexts
  • 39. Coupled Pattern Learner (CPL)  Free-text extractor that learns contextual patterns to extract predicate instances  Uses mutual-exclusion and type-checking constraints to filter candidate instances  Ranks instances and patterns by leveraging redundancy: if an instance or pattern occurs more frequently, it is ranked higher
  • 40. Coupled SEAL (CSEAL)  SEAL (Set Expander for Any Language) is a wrapper induction algorithm  Operates over semi-structured text such as web pages  Constructs page-specific extraction rules (wrappers) that are human- and markup-language independent  CSEAL adds mutual-exclusion and type-checking constraints
  • 41. CSEAL Wrappers  Seeds: Ford, Nissan, Toyota  arg1 is a placeholder for extracting instances
  • 42. Open IE and TextRunner  Motivations:  Web corpora are massive, introducing scalability concerns  Relations of interest are unanticipated, diverse, and abundant  Use of “heavy” linguistic technology (NERs and parsers) doesn't work well at this scale  Input: a large, heterogeneous Web corpus  9M web pages, 133M sentences  No pre-specified set of relations  Output: huge set of extracted relations  60.5M tuples, 11.3M high-probability tuples  Tuples are indexed for searching
  • 43. TextRunner Architecture  Learner outputs a classifier that labels trustworthy extractions  Extractor finds and outputs trustworthy extractions  Assessor normalizes and scores the extractions
  • 44. Architecture: Self-Supervised Learner 1. Automatically labels training data  Uses a parser to induce dependency structures  Parses a small corpus of several thousand sentences  Identifies and labels a set of positive and negative extractions using relation-independent heuristics  An extraction is a tuple t = (e_i, r_i,j, e_j)  Entities are base noun phrases  Uses the parse to identify potential relations 2. Trains a classifier  Domain-independent, simple non-parse features  E.g., POS tags, phrase chunks, regexes, stopwords, etc.
  • 45. Architecture: Single-Pass Extractor 1. POS tag each word 2. Identify entities using lightweight NP chunker 3. Identify relations 4. Classify them
  • 46. Architecture: Redundancy-Based Assessor  Take the tuples and perform  Normalization, deduplication, synonym resolution  Assessment  Number of distinct sentences from which each extraction was found serves as a measure of confidence  Entities and relations indexed using Lucene
  • 47. Template Filling  The task of template filling is to find documents that evoke such situations and then fill the slots in templates with appropriate material.  These slot fillers may consist of  Text segments extracted directly from the text, or  Concepts that have been inferred from text elements via some additional processing (times, amounts, entities from an ontology, etc.).
  • 48. Applications of IE  Infrastructure for IR and for Categorization  Information Routing  Event Based Summarization  Automatic Creation of Databases  Company acquisitions  Sports scores  Terrorist activities  Job listings  Corporate titles and addresses
  • 49. Inductive Algorithms for IE  Rule Induction algorithms produce symbolic IE rules based on a corpus of annotated documents.  WHISK  BWI  The (LP)2 Algorithm  The inductive algorithms are suitable for semi-structured domains, where the rules are fairly simple, whereas when dealing with free text documents (such as news articles) the probabilistic algorithms perform much better.
  • 50. WHISK  WHISK is a supervised learning algorithm that uses hand-tagged examples for learning information extraction rules.  Works for structured, semi-structured and free text.  Extracts both single-slot and multi-slot information.  Doesn't require syntactic preprocessing for structured and semi-structured text; a syntactic analyzer and semantic tagger are recommended for free text.  The extraction pattern learned by WHISK takes the form of a limited regular expression, balancing the tradeoff between expressiveness and efficiency.  Example: the IE task of extracting neighborhood, number of bedrooms and price from text
  • 51. WHISK  An example from the Rental Ads domain  An example extraction pattern that can be learned by WHISK is *(Neighborhood) *(Bedroom) * ‘$’(Number) where Neighborhood, Bedroom, and Number are semantic classes specified by domain experts.  WHISK learns the extraction rules using a top-down covering algorithm.  The algorithm begins learning a single rule by starting with an empty rule;  it then adds one term at a time until either no negative examples are covered by the rule or the pre-pruning criterion has been satisfied.
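A rough feel for such a pattern can be had with an ordinary regular expression. The character classes below are hypothetical stand-ins for the expert-defined semantic classes, and the non-greedy `.*?` plays the role of WHISK's unlimited-jump `*`:

```python
import re

# Hypothetical stand-ins for the Bedroom and Number semantic classes:
BEDROOM = r'(\d+)\s*BR'   # e.g. "3 BR"
NUMBER = r'(\d+)\s*\$'    # a number immediately before '$'

# WHISK-style pattern "* (Bedroom) * '$'(Number)" rendered as one regex:
rule = re.compile(r'.*?' + BEDROOM + r'.*?' + NUMBER)

ad = "3 BR, upper flr of turn of ctry. Incl gar, grt N. Hill loc 995$."
m = rule.search(ad)
print({'Bedrooms': m.group(1), 'Price': m.group(2)})
# {'Bedrooms': '3', 'Price': '995'}
```

A real WHISK rule additionally constrains the context around each slot; the sketch only shows how the parenthesized classes become numbered extraction slots.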
  • 52.  We add terms to specialize the rule in order to reduce its Laplacian expected error, defined as Laplacian = (e + 1) / (n + 1), where e is the number of negative extractions and n is the number of positive extractions on the training instances.  Example:  From the text “3 BR, upper flr of turn of ctry. Incl gar, grt N. Hill loc 995$. (206)-999-9999,” the rule would extract the frame Bedrooms – 3, Price – 995.  The “*” character in the pattern will match any number of characters (an unlimited jump).  Patterns enclosed in parentheses become numbered elements in the output pattern, and hence (Digit) is $1 and (Number) is $2.
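The Laplacian expected error is a one-liner; for example, a rule making no negative extractions over 9 positive ones scores (0 + 1) / (9 + 1) = 0.1:

```python
def laplacian_error(e, n):
    """Laplacian expected error of a rule: e negative (incorrect)
    extractions against n positive extractions on the training set."""
    return (e + 1) / (n + 1)

print(laplacian_error(0, 9))  # 0.1 -- a reliable rule
print(laplacian_error(2, 9))  # 0.3 -- same coverage, but with 2 errors
```

The +1 smoothing keeps a rule with very few extractions from looking perfectly reliable.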
  • 53. Boosted Wrapper Induction (BWI)  BWI is a system that utilizes wrapper induction techniques for traditional Information Extraction.  IE is treated as a classification problem that entails trying to approximate two boundary functions Xbegin(i) and Xend(i).  Xbegin(i) is equal to 1 if the ith token starts a field that is part of the frame to be extracted and 0 otherwise.  Xend(i) is defined in a similar way for tokens that end a field.  The learning algorithm approximates each X function by taking a set of pairs of the form (i, X(i)) as training data.
  • 54.  Each field is extracted by a wrapper W = ⟨F, A, H⟩ where  F is a set of begin boundary detectors  A is a set of end boundary detectors  H(k) is the probability that the field has length k  A boundary detector is just a sequence of tokens with wildcards (a kind of regular expression). The wrapper scores a candidate field spanning tokens i through j as W(i, j) = 1 if F(i) · A(j) · H(j − i + 1) exceeds a threshold, and 0 otherwise; this product is a naïve Bayesian approximation of the probability.  The BWI algorithm learns the two detectors by using a greedy algorithm that extends the prefix and suffix patterns while there is an improvement in the accuracy.  The combined detector scores F(i) = Σk c_k F_k(i) and A(i) = Σk a_k A_k(i) are generated from the individual detectors by using the AdaBoost algorithm.  The detector pattern can include specific words and regular expressions that work on a set of wildcards such as <num>, <Cap>, <LowerCase>, <Punctuation> and <Alpha>.
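A minimal sketch of the scoring, with hand-set detector outputs standing in for the AdaBoost-learned ones (all numbers here are illustrative, not from a trained BWI model):

```python
def wrapper_score(F, A, H, i, j):
    """Naive-Bayes-style score F(i) * A(j) * H(j - i + 1) for a candidate
    field spanning tokens i..j; BWI accepts the field if this is high enough."""
    return F.get(i, 0.0) * A.get(j, 0.0) * H.get(j - i + 1, 0.0)

# Toy detector outputs (in BWI these come from boosted boundary detectors):
F = {2: 0.9}   # begin-boundary confidence by token index
A = {4: 0.8}   # end-boundary confidence by token index
H = {3: 0.5}   # probability that the field has a given length

print(wrapper_score(F, A, H, 2, 4))  # 0.9 * 0.8 * 0.5
print(wrapper_score(F, A, H, 0, 4))  # 0.0 -- no begin detector fires at 0
```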
  • 55. (LP)2 Algorithm  The (LP)2 algorithm learns from an annotated corpus and induces two sets of rules:  Tagging rules, generated by a bottom-up generalization process  Correction rules, which correct mistakes and omissions made by the tagging rules.  A tagging rule is a pattern that contains conditions on words preceding the place where a tag is to be inserted and conditions on the words that follow the tag.  Conditions can be either words, lemmas, lexical categories (such as digit, noun, verb, etc), case (lower or upper), or semantic categories (such as time-id, cities, etc).  The (LP)2 algorithm is a covering algorithm that tries to cover all training examples.  The initial tagging rules are generalized by dropping conditions.
  • 56. IE and Text Summarization  User’s perspective,  IE can be glossed as "I know what specific pieces of information I want–just find them for me!",  Summarization can be glossed as "What’s in the text that is interesting?".  Technically, from the system builder’s perspective, the two applications blend into each other.  The most pertinent technical aspects are:  Are the criteria of interestingness specified at run-time or by the system builder?  Is the input a single document or multiple documents?  Is the extracted information manipulated, either by simple content delineation routines or by complex inferences, or just delivered verbatim?  What is the grain size of the extracted units of information–individual entities and events, or blocks of text?  Is the output formulated in language, or in a computer-internal knowledge representation?
  • 57. Text Summarization  An information access technology that, given a document or sets of related documents, extracts the most important content from the source(s) taking into account the user or task at hand, and presents this content as a well-formed and concise text
  • 58. Text Summarization Techniques  Topic Representation  Influence of Context  Indicator Representations  Pattern Extraction
  • 59. Text Summarization Input: one or more text documents Output: paragraph length summary  Sentence extraction is the standard method  Using features such as key words, sentence position in document, cue phrases  Identify sentences within documents that are salient  Extract and string sentences together  Machine learning for extraction  Corpus of document/summary pairs  Learn the features that best determine important sentences  Summarization of scientific articles
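A toy illustration of feature-based sentence extraction (keyword frequency plus a position bonus; the scoring weights and stopword list are arbitrary choices for the sketch, not from a real system):

```python
import re
from collections import Counter

STOP = {'the', 'a', 'an', 'of', 'to', 'and', 'in', 'is', 'it', 'that'}

def extract_summary(text, n=2):
    """Score sentences by average content-word frequency, with a small
    bonus for the leading sentence, and return the top n in document order."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(w for w in re.findall(r'[a-z]+', text.lower())
                   if w not in STOP)
    def score(pair):
        idx, sent = pair
        tokens = re.findall(r'[a-z]+', sent.lower())
        kw = sum(freq[t] for t in tokens if t not in STOP)
        return kw / max(len(tokens), 1) + (0.5 if idx == 0 else 0.0)
    top = sorted(sorted(enumerate(sentences), key=score, reverse=True)[:n])
    return ' '.join(s for _, s in top)

text = "Apples are tasty. Apples grow on trees. Bananas are yellow."
print(extract_summary(text))  # Apples are tasty. Bananas are yellow.
```

A learned extractor would replace the hand-set position bonus and keyword weights with feature weights trained on document/summary pairs.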
  • 60. A Summarization Machine [diagram] Input: DOC (or MULTIDOCS) plus QUERY; output: EXTRACTS or ABSTRACTS. Dimensions shown: Extract vs. Abstract; Indicative vs. Informative; Generic vs. Query-oriented; Background vs. Just the news; compression from Headline / Very Brief / Brief to Long (10%, 50%, 100%). Intermediate representations: case frames, templates, core concepts, core events, relationships, clause fragments, index terms.
  • 61. The Modules of the Summarization Machine [diagram] A DOC (or MULTIDOC) is processed by EXTRACTION and INTERPRETATION modules, producing extracts and intermediate representations (case frames, templates, core concepts, core events, relationships, clause fragments, index terms); GENERATION and FILTERING then produce the final abstracts and extracts.
  • 62. What is Summarization?  Data as input (database, software trace, expert system), text summary as output  Text as input (one or more articles), paragraph summary as output  Multimedia in input or output  Summaries must convey maximal information in minimal space  Involves: Three stages (typically)  Content identification  Find/Extract the most important material  Conceptual organization  Realization
  • 63. Types of summaries  Purpose  Indicative, informative, and critical summaries  Form  Extracts (representative paragraphs/sentences/phrases)  Abstracts: “a concise summary of the central subject matter of a document”.  Dimensions  Single-document vs. multi-document  Context  Query-specific vs. query-independent  Generic vs. query-oriented  provides author’s view vs. reflects user’s interest.
  • 64. Genres  Headlines  Outlines  Minutes  Biographies  Abridgments  Sound bites  Movie summaries  Chronologies, etc.
  • 65. Aspects that Describe Summaries  Input  subject type: domain  genre: newspaper articles, editorials, letters, reports...  form: regular text structure, free-form  source size: single doc, multiple docs (few,many)  Purpose  situation: embedded in larger system (MT, IR) or not?  audience: focused or general  usage: IR, sorting, skimming...  Output  completeness: include all aspects, or focus on some?  format: paragraph, table, etc.  style: informative, indicative, aggregative, critical...
  • 66. Single Document Summarization System Architecture [pipeline diagram] Input: single document → Extraction (extracted sentences) → Sentence reduction → Sentence combination → Generation → Output: summary. Supporting resources: corpus, decomposition, lexicon, parser, co-reference.
  • 67. Multi-Document Summarization  Monitor variety of online information sources  News, multilingual  Email  Gather information on events across source and time  Same day, multiple sources  Across time  Summarize  Highlighting similarities, new information, different perspectives, user specified interests in real-time
  • 68. Example System: SUMMARIST Three stages: 1. Topic Identification Modules: Positional Importance, Cue Phrases (under construction), Word Counts, Discourse Structure (under construction), ... 2. Topic Interpretation Modules: Concept Counting /Wavefront, Concept Signatures (being extended) 3. Summary Generation Modules (not yet built): Keywords, Template Gen, Sent. Planner & Realizer SUMMARY = TOPIC ID + INTERPRETATION + GENERATION
  • 69. Topic Interpretation  From extract to abstract: topic interpretation or concept fusion.  Experiment (Marcu, 98):  Got 10 newspaper texts, with human abstracts.  Asked 14 judges to extract corresponding clauses from texts, to cover the same content.  Compared word lengths of extracts to abstracts: extract_length ≈ 2.76 × abstract_length !!
  • 70. Some Types of Interpretation  Concept generalization: Sue ate apples, pears, and bananas → Sue ate fruit  Meronymy replacement: Both wheels, the pedals, saddle, chain… → the bike  Script identification: He sat down, read the menu, ordered, ate, paid, and left → He ate at the restaurant  Metonymy: A spokesperson for the US Government announced that… → Washington announced that...
  • 71. General Aspects of Interpretation  Interpretation occurs at the conceptual level... …words alone are polysemous (bat → animal and sports instrument) and combine for meaning (an alleged murderer is not necessarily a murderer).  For interpretation, you need world knowledge... …the fusion inferences are not in the text!
  • 72. Pattern Extraction  Extract a pattern for each event in training data  part of speech & mention tags  Example: “Japanese political leaders”  Text: Japanese political leaders; entity tags: GPE _ PER; POS: NN JJ NN; extracted pattern: GPE JJ PER
  • 73. Summarization - Scope  Data preparation:  Collect large sets of texts with abstracts, all genres.  Build large corpora of <Text, Abstract, Extract> tuples.  Investigate relationships between extracts and abstracts (using <Extract, Abstract> tuples).  Types of summary:  Determine characteristics of each type.  Topic Identification:  Develop new identification methods (discourse, etc.).  Develop heuristics for method combination (train heuristics on <Text, Extract> tuples).
  • 74. Summarization - Scope  Concept Interpretation (Fusion):  Investigate types of fusion (semantic, evaluative…).  Create large collections of fusion knowledge/rules (e.g., signature libraries, generalization and partonymic hierarchies, metonymy rules…).  Study incorporation of User’s knowledge in interpretation.  Generation:  Develop Sentence Planner rules for dense packing of content into sentences (using <Extract, Abstract> pairs).  Evaluation:  Develop better evaluation metrics, for types of summaries.
  • 75. Apriori Algorithm  In computer science and data mining, Apriori is a classic algorithm for learning association rules.  Apriori is designed to operate on databases containing transactions.  Examples: collections of items bought by customers, or details of website visits.  The algorithm attempts to find subsets which are common to at least a minimum number C (the cutoff, or confidence threshold) of the itemsets.  Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data.  The algorithm terminates when no further successful extensions are found.  Apriori uses breadth-first search and a hash tree structure to count candidate item sets efficiently.
  • 76. Find rules in two stages Agrawal et al. divided the problem of finding good rules into two phases: 1. Find all itemsets with a specified minimal support (coverage). An itemset is just a specific set of items, e.g. {apples, cheese}. The Apriori algorithm can efficiently find all itemsets whose coverage is above a given minimum. 2. Use these itemsets to help generate interesting rules. Having done stage 1, we have considerably narrowed down the possibilities, and can do reasonably fast processing of the large itemsets to generate candidate rules.
  • 77. Terminology k-itemset : a set of k items. E.g. {beer, cheese, eggs} is a 3-itemset {cheese} is a 1-itemset {honey, ice-cream} is a 2-itemset support: an itemset has support s% if s% of the records in the DB contain that itemset. minimum support: the Apriori algorithm starts with the specification of a minimum level of support, and will focus on itemsets with this level or above.
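Support is easy to compute directly; a small sketch over a toy transaction DB built from the itemsets above:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

db = [{'beer', 'cheese', 'eggs'},
      {'cheese', 'honey'},
      {'beer', 'cheese'},
      {'honey', 'ice-cream'}]

print(support({'cheese'}, db))          # 0.75
print(support({'beer', 'cheese'}, db))  # 0.5
```

With a minimum support of 50%, both {cheese} and {beer, cheese} would count as large itemsets in this toy DB.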
  • 78. Terminology large itemset: doesn't mean an itemset with many items. It means one whose support is at least minimum support. Lk : the set of all large k-itemsets in the DB. Ck : a set of candidate large k-itemsets. The algorithm generates this set, which contains all the k-itemsets that might be large, and then filters it down to Lk.
  • 79. Terminology sets: Let A be a set (A = {cat, dog}) and let B be a set (B = {dog, eel, rat}) and let C = {eel, rat} I use ‘A + B’ to mean A union B. So A + B = {cat, dog, eel, rat} When X is a subset of Y, I use Y – X to mean the set of things in Y which are not in X. E.g. B – C = {dog}
  • 80. Apriori Algorithm
Find all large 1-itemsets
For (k = 2; while Lk-1 is non-empty; k++) {
  Ck = apriori-gen(Lk-1)
  For each c in Ck, initialise c.count to zero
  For all records r in the DB {
    Cr = subset(Ck, r)
    For each c in Cr, c.count++
  }
  Set Lk := all c in Ck whose count >= minsup
} /* end */
Return all of the Lk sets. The algorithm returns all of the (non-empty) Lk sets, which gives us an excellent start in finding interesting rules (although the large itemsets themselves will usually be very interesting and useful).
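A compact, runnable rendering of the loop above, where apriori-gen joins Lk-1 with itself and prunes candidates having an infrequent (k-1)-subset. The transaction DB is the one used later in the frequent-pattern example, with minimum support count 3:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: support_count} for all large itemsets."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                      # find all large 1-itemsets
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    L = {s: c for s, c in counts.items() if c >= minsup}
    result, k = dict(L), 2
    while L:                                    # while Lk-1 is non-empty
        # apriori-gen: join Lk-1 with itself, keep k-sized unions...
        Ck = {a | b for a, b in combinations(L, 2) if len(a | b) == k}
        # ...and prune candidates having an infrequent (k-1)-subset
        Ck = {c for c in Ck
              if all(frozenset(s) in L for s in combinations(c, k - 1))}
        counts = {c: 0 for c in Ck}
        for r in transactions:                  # count candidates in the DB
            for c in Ck:
                if c <= r:
                    counts[c] += 1
        L = {s: c for s, c in counts.items() if c >= minsup}
        result.update(L)
        k += 1
    return result

db = [['f', 'a', 'c', 'd', 'g', 'i', 'm', 'p'],
      ['a', 'b', 'c', 'f', 'l', 'm', 'o'],
      ['b', 'f', 'h', 'j', 'o'],
      ['b', 'c', 'k', 's', 'p'],
      ['a', 'f', 'c', 'e', 'l', 'p', 'm', 'n']]
large = apriori(db, 3)
print(sorted(''.join(sorted(s)) for s in large if len(s) >= 3))
# ['acf', 'acfm', 'acm', 'afm', 'cfm']
```

Note the many passes over the DB: one per itemset size k, which is exactly the cost FP-growth is designed to avoid.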
  • 81. Example: Generation of candidate itemsets and frequent itemsets, where the minimum support count is 2.
  • 82. Apriori Merits/Demerits  Merits  Uses large itemset property  Easily parallelized  Easy to implement  Demerits  Assumes transaction database is memory resident.  Requires many database scans.
  • 83. Summary  Association Rules form a widely applied data mining approach.  Association Rules are derived from frequent itemsets.  The Apriori algorithm is an efficient algorithm for finding all frequent itemsets.  The Apriori algorithm implements level-wise search using the frequent item property.  The Apriori algorithm can be additionally optimized.  There are many measures for association rules.
  • 85. Frequent Pattern Mining: An Example Given a transaction database DB and a minimum support threshold ξ, find all frequent patterns (item sets) with support no less than ξ.
Input DB (minimum support ξ = 3):
TID 100: {f, a, c, d, g, i, m, p}
TID 200: {a, b, c, f, l, m, o}
TID 300: {b, f, h, j, o}
TID 400: {b, c, k, s, p}
TID 500: {a, f, c, e, l, p, m, n}
Output: all frequent patterns, i.e., f, a, …, fa, fac, fam, fm, am, …
Problem statement: how can we efficiently find all frequent patterns?
  • 86.  Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure  highly compacted, but complete for frequent pattern mining  avoid costly repeated database scans  Develop an efficient, FP-tree-based frequent pattern mining method (FP- growth)  A divide-and-conquer methodology: decompose mining tasks into smaller ones  Avoid candidate generation: sub-database test only. Overview of FP-Growth: Ideas
  • 88. Construct FP-tree Two Steps: 1. Scan the transaction DB for the first time, find frequent items (single item patterns) and order them into a list L in frequency descending order. e.g., L={f:4, c:4, a:3, b:3, m:3, p:3} In the format of (item-name, support) 2. For each transaction, order its frequent items according to the order in L; Scan DB the second time, construct FP-tree by putting each frequency ordered transaction onto it.
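The two scans above can be sketched as follows (ties among equally frequent items may be ordered differently than on the slides):

```python
from collections import Counter

def order_transactions(db, minsup):
    """First scan: build L, the frequent items in descending frequency.
    Second scan: keep only frequent items in each transaction, in L's order."""
    freq = Counter(item for t in db for item in t)
    L = [item for item, c in sorted(freq.items(), key=lambda kv: -kv[1])
         if c >= minsup]
    rank = {item: r for r, item in enumerate(L)}
    ordered = [[i for i in sorted(t, key=lambda x: rank.get(x, len(rank)))
                if i in rank] for t in db]
    return L, ordered

db = [['f', 'a', 'c', 'd', 'g', 'i', 'm', 'p'],
      ['a', 'b', 'c', 'f', 'l', 'm', 'o'],
      ['b', 'f', 'h', 'j', 'o'],
      ['b', 'c', 'k', 's', 'p'],
      ['a', 'f', 'c', 'e', 'l', 'p', 'm', 'n']]
L, ordered = order_transactions(db, 3)
print(ordered[0])  # ['f', 'c', 'a', 'm', 'p']
```

Each reordered transaction is then inserted into the FP-tree as one root-to-leaf path, sharing prefixes with previously inserted transactions.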
  • 89. FP-tree Example: step 1 Step 1: Scan the DB for the first time to generate L (a by-product of the first scan of the database).
TID 100: {f, a, c, d, g, i, m, p}; TID 200: {a, b, c, f, l, m, o}; TID 300: {b, f, h, j, o}; TID 400: {b, c, k, s, p}; TID 500: {a, f, c, e, l, p, m, n}
Item frequencies: f: 4, c: 4, a: 3, b: 3, m: 3, p: 3 → L = {f:4, c:4, a:3, b:3, m:3, p:3}
  • 90. FP-tree Example: step 2 Step 2: scan the DB for the second time, ordering the frequent items in each transaction:
TID 100: {f, a, c, d, g, i, m, p} → {f, c, a, m, p}
TID 200: {a, b, c, f, l, m, o} → {f, c, a, b, m}
TID 300: {b, f, h, j, o} → {f, b}
TID 400: {b, c, k, s, p} → {c, b, p}
TID 500: {a, f, c, e, l, p, m, n} → {f, c, a, m, p}
  • 91. FP-tree Example: step 2 Step 2: construct the FP-tree. Inserting {f, c, a, m, p} gives the path root → f:1 → c:1 → a:1 → m:1 → p:1. Inserting {f, c, a, b, m} increments the shared prefix to f:2 → c:2 → a:2 and adds the new branch b:1 → m:1 beside the existing m:1 → p:1. NOTE: Each transaction corresponds to one path in the FP-tree
  • 92. FP-tree Example: step 2 Step 2: construct the FP-tree (continued). Inserting {f, b} increments f to f:3 and adds b:1 directly under f. Inserting {c, b, p} starts a new branch from the root: c:1 → b:1 → p:1. Node-links connect all nodes carrying the same item-name.
  • 93. Construction Example Final FP-tree: root → f:4 → c:3 → a:3 → m:2 → p:2, with side branches a:3 → b:1 → m:1 and f:4 → b:1, plus a second root branch c:1 → b:1 → p:1. Header table: items f, c, a, b, m, p, each with the head of its node-link chain.
  • 94. FP-Tree Definition  FP-tree is a frequent pattern tree.  Formally, FP-tree is a tree structure defined below: 1. One root labeled as “null”, a set of item prefix sub-trees as the children of the root, and a frequent-item header table. 2. Each node in the item prefix sub-trees has three fields:  item-name: registers which item this node represents,  count: the number of transactions represented by the portion of the path reaching this node,  node-link: links to the next node in the FP-tree carrying the same item-name, or null if there is none. 3. Each entry in the frequent-item header table has two fields,  item-name, and  head of node-link, which points to the first node in the FP-tree carrying the item-name.
  • 95. Advantages of the FP-tree Structure  The most significant advantage of the FP-tree  Scan the DB only twice and twice only.  Completeness:  The FP-tree contains all the information related to mining frequent patterns (given the min-support threshold).  Compactness:  The size of the tree is bounded by the occurrences of frequent items  The height of the tree is bounded by the maximum number of items in a transaction
  • 97. Mining Frequent Patterns Using FP-tree  General idea (divide-and-conquer) Recursively grow frequent patterns using the FP-tree: looking for shorter ones recursively and then concatenating the suffix:  For each frequent item, construct its conditional pattern base, and then its conditional FP-tree;  Repeat the process on each newly created conditional FP-tree until the resulting FP-tree is empty, or it contains only one path (single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)
  • 98. 3 Major Steps Starting the processing from the end of list L: Step 1: Construct conditional pattern base for each item in the header table Step 2: Construct conditional FP-tree from each conditional pattern base Step 3: Recursively mine conditional FP-trees and grow frequent patterns obtained so far. If the conditional FP-tree contains a single path, simply enumerate all the patterns
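The three steps can be sketched without materializing the physical tree: the recursion below works directly on conditional pattern bases kept as (prefix-path, count) lists, which is the same divide-and-conquer idea. Run on the slide's DB with ξ = 3 it recovers all 18 frequent patterns, including fcam:

```python
from collections import Counter

def fp_growth(transactions, minsup):
    """Pattern growth over conditional pattern bases (projected DBs);
    returns {frozenset(pattern): support}."""
    def mine(db, suffix, out):
        # db: list of (items_in_L_order, count)
        freq = Counter()
        for items, count in db:
            for item in items:
                freq[item] += count
        for item, sup in freq.items():
            if sup < minsup:
                continue
            pattern = suffix | {item}
            out[pattern] = sup
            # Step 1: conditional pattern base = prefix paths of `item`
            cond = [(items[:items.index(item)], count)
                    for items, count in db
                    if item in items and items.index(item) > 0]
            # Steps 2+3: recurse on the conditional base
            if cond:
                mine(cond, pattern, out)
        return out

    # First scan: global frequency order L (descending support)
    freq = Counter(i for t in transactions for i in t)
    order = sorted(freq.items(), key=lambda kv: -kv[1])
    rank = {i: r for r, (i, c) in enumerate(order) if c >= minsup}
    db = [(tuple(sorted((i for i in t if i in rank), key=rank.get)), 1)
          for t in transactions]
    return mine([row for row in db if row[0]], frozenset(), {})

db = [['f', 'a', 'c', 'd', 'g', 'i', 'm', 'p'],
      ['a', 'b', 'c', 'f', 'l', 'm', 'o'],
      ['b', 'f', 'h', 'j', 'o'],
      ['b', 'c', 'k', 's', 'p'],
      ['a', 'f', 'c', 'e', 'l', 'p', 'm', 'n']]
patterns = fp_growth(db, 3)
print(len(patterns), patterns[frozenset('fcam')])  # 18 3
```

The real FP-growth gains its efficiency from the shared-prefix compression of the tree; this sketch trades that compression for brevity while keeping the recursion structure.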
  • 99. Step 1: Construct Conditional Pattern Base  Starting at the bottom of the frequent-item header table in the FP-tree  Traverse the FP-tree by following the link of each frequent item  Accumulate all of the transformed prefix paths of that item to form a conditional pattern base
Conditional pattern bases (read off the FP-tree of slide 93 via its header table):
p: fcam:2, cb:1
m: fca:2, fcab:1
b: fca:1, f:1, c:1
a: fc:3
c: f:3
f: { }
  • 100. Properties of FP-Tree  Node-link property  For any frequent item ai, all the possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's head in the FP-tree header.  Prefix path property  To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P need to be accumulated, and its frequency count should carry the same count as node ai.
  • 101. Step 2: Construct Conditional FP-tree  For each pattern base  Accumulate the count for each item in the base  Construct the conditional FP-tree for the frequent items of the pattern base  Example: m's conditional pattern base is fca:2, fcab:1; accumulating counts gives f:3, c:3, a:3, b:1, and dropping the infrequent b yields the m-conditional FP-tree: root → f:3 → c:3 → a:3
  • 102. Step 3: Recursively mine the conditional FP-tree  Conditional FP-tree of “m”: (fca:3). Adding “f”, “c”, or “a” gives the conditional FP-trees of “fm”: 3, “cm”: (f:3), and “am”: (fc:3). Adding “f” to “cm” gives “fcm”: 3; adding “c” or “f” to “am” gives “cam”: (f:3) and “fam”: 3; adding “f” to “cam” gives “fcam”: 3. Each step yields a frequent pattern, culminating in the frequent pattern fcam.
  • 103. Principles of FP-Growth  Pattern growth property  Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B.  Is “fcabm” a frequent pattern?  “fcab” is a branch of m's conditional pattern base  “b” is NOT frequent in transactions containing “fcab”  “bm” is NOT a frequent itemset.
  • 104. Conditional Pattern Bases and Conditional FP-Trees (in order of L)
Item | Conditional pattern base | Conditional FP-tree
f | Empty | Empty
c | {(f:3)} | {(f:3)}|c
a | {(fc:3)} | {(f:3, c:3)}|a
b | {(fca:1), (f:1), (c:1)} | Empty
m | {(fca:2), (fcab:1)} | {(f:3, c:3, a:3)}|m
p | {(fcam:2), (cb:1)} | {(c:3)}|p
  • 105. Single FP-tree Path Generation  Suppose an FP-tree T has a single path P.  The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P  Example: the m-conditional FP-tree is the single path root → f:3 → c:3 → a:3, so all frequent patterns concerning m are the combinations of {f, c, a} concatenated with m: m, fm, cm, am, fcm, fam, cam, fcam
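The single-path case is just an enumeration of subsets, e.g. for the m-conditional FP-tree:

```python
from itertools import combinations

def single_path_patterns(path, suffix):
    """All patterns from a single-path conditional FP-tree: every
    combination of the path's items, concatenated with the suffix."""
    return [set(combo) | set(suffix)
            for r in range(len(path) + 1)
            for combo in combinations(path, r)]

# The m-conditional FP-tree is the single path f -> c -> a:
pats = single_path_patterns(['f', 'c', 'a'], ['m'])
print(len(pats))  # 8: m, fm, cm, am, fcm, fam, cam, fcam
```

A path of k items yields 2^k patterns, which is why the recursion can stop as soon as a conditional tree degenerates to a single path.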
  • 106. Summary of FP-Growth Algorithm  Mining frequent patterns can be viewed as first mining 1-itemset and progressively growing each 1-itemset by mining on its conditional pattern base recursively  Transform a frequent k-itemset mining problem into a sequence of k frequent 1- itemset mining problems via a set of conditional pattern bases
  • 107. Efficiency Analysis Facts: usually 1. The FP-tree is much smaller than the size of the DB 2. A pattern base is smaller than the original FP-tree 3. A conditional FP-tree is smaller than its pattern base  The mining process works on a set of usually much smaller pattern bases and conditional FP-trees  Divide-and-conquer, with a dramatic scale of shrinking
  • 108. Performance Improvement
 Projected DBs: partition the DB into a set of projected DBs, then construct and mine an FP-tree in each projected DB.
 Disk-resident FP-tree: store the FP-tree on hard disk using a B+-tree structure to reduce I/O cost.
 FP-tree materialization: a low ξ may usually satisfy most of the mining queries in the FP-tree construction.
 FP-tree incremental update: how to update an FP-tree when there is new data? Reconstruct the FP-tree, or do not update the FP-tree.