INFORMATION EXTRACTION
What is Information Extraction?
Goal:
Extract structured information from unstructured (or loosely formatted) text.
Typical description of task:
 Identify named entities
 Identify relations between entities
 Populate a database
May also include:
 Event extraction
 Resolution of temporal expressions
 Wrapper induction (automatic construction of templates)
Applications:
 Natural language understanding,
 Question-answering, summarization, etc.
Information Extraction
 IE extracts pieces of information that are salient to the user's needs
 Find named entities such as persons and organizations
 Find attributes of those entities or events they participate in
 Contrast with IR, which merely indicates which documents need to be read by
a user
 Links between the extracted information and the original documents
are maintained to allow the user to reference context.
Schematic view of the Information
Extraction Process
Information Extraction
Relevant IE Definitions
Entities:
 Entities are the basic building blocks that can be found in text
documents (an object of interest)
 Examples: people, companies, locations, genes, and drugs.
Attributes:
 Attributes are features of the extracted entities. (A property of an entity
such as its name, alias, descriptor or type)
 Examples: the title of a person, the age of a person, and the type of an
organization.
Relevant IE Definitions
Facts:
 Facts are the relations that exist between entities. (a relationship held
between two or more entities such as the position of a person in a
company)
 Example: Employment relationship between a person and a company
or phosphorylation between two proteins.
Events:
 An event is an activity or occurrence of interest in which entities
participate
 An activity involving several entities such as a terrorist act, airline crash,
management change, new product introduction a merger between two
companies, a birthday and so on.
IE - Method
 Extract raw text (from HTML, PDF, PS, GIF, etc.)
 Tokenize
 Detect term boundaries
 We extracted alpha 1 type XIII collagen from …
 Their house council recommended…
 Detect sentence boundaries
 Tag parts of speech (POS)
 John/noun saw/verb Mary/noun.
 Tag named entities
 Person, place, organization, gene, chemical.
 Parse
 Determine co-reference
 Extract knowledge
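The steps above can be sketched in miniature. The tokenizer and entity tagger below are crude, hypothetical rule-based stand-ins for the trained components a real pipeline would use; they are only meant to show the shape of the processing chain.

```python
import re

def tokenize(text):
    # Split into word tokens and punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def tag_named_entities(tokens):
    """Label capitalized, non-sentence-initial tokens as PERSON.
    A crude illustrative heuristic, not a real NER model."""
    tags = []
    for i, tok in enumerate(tokens):
        if tok[0].isupper() and i > 0 and tokens[i - 1] not in {".", "!", "?"}:
            tags.append((tok, "PERSON"))
        else:
            tags.append((tok, "O"))
    return tags

tokens = tokenize("John saw Mary in Boston.")
print(tag_named_entities(tokens))
```

Each downstream step (parsing, co-reference, knowledge extraction) would consume the output of the previous one in the same fashion.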
Architecture: Components of
IE Systems
 Core linguistic components, adapted to or directly useful for NLP tasks in general
 IE-specific components, which address the core IE tasks.
 Domain-Independent
 Domain-specific components
 The following steps are performed in Domain-Independent part:
 Meta-data analysis:
 Extraction of the title, body, structure of the body (identification of
paragraphs), and the date of the document.
 Tokenization:
 Segmentation of the text into word-like units, called tokens and
classification of their type, e.g., identification of capitalized words,
words written in lowercase letters, hyphenated words, punctuation
signs, numbers, etc.
Architecture: Components of
IE Systems
 Morphological analysis:
 Extraction of morphological information from tokens which constitute
potential word forms: the base form (or lemma), part of speech, and other
morphological tags depending on the part of speech.
 e.g., verbs have features such as tense, mood, aspect, person, etc.
 Words which are ambiguous with respect to certain morphological categories
may undergo disambiguation. Typically part-of-speech disambiguation is
performed.
 Sentence/Utterance boundary detection:
 Segmentation of text into a sequence of sentences or utterances, each of which
is represented as a sequence of lexical items together with their features.
 Common Named-entity extraction:
 Detection of domain-independent named entities, such as temporal
expressions, numbers and currency, geographical references, etc.
Architecture: Components of
IE Systems
 Phrase recognition:
 Recognition of small-scale, local structures such as noun phrases, verb groups,
prepositional phrases, acronyms, and abbreviations.
 Syntactic analysis:
 Computation of a dependency structure (parse tree) of the sentence based on the
sequence of lexical items and small-scale structures.
 Syntactic analysis may be deep or shallow.
 In the former case, compute all possible interpretations (parse trees) and
grammatical relations within the sentence.
 In the latter case, the analysis is restricted to identification of non-recursive
structures or structures with limited amount of structural recursion, which
can be identified with a high degree of certainty, and linguistic phenomena
which cause problems (ambiguities) are not handled and represented with
underspecified structures.
Architecture: Components of
IE Systems
 The core IE tasks:
 NER,
 Co-reference resolution, and
 Detection of relations and events
These tasks are typically domain-specific, and are supported by domain-specific
system components and resources.
 Domain-specific processing is also supported on a lower level by detection of
specialized terms in text.
 Architecture: IE System
 In the domain specific core of the processing chain, a NER component is
applied to identify the entities relevant in a given domain.
 Patterns may then be applied to:
 Identify text fragments, which describe the target relations and events, and
 Extract the key attributes to fill the slots in the template representing the
relation/event.
IE System - Architecture
Typical Architecture of an Information
Extraction System
Architecture: Components of
IE Systems
 A co-reference component identifies mentions that refer to the same entity.
 Partially-filled templates are fused and validated using domain-specific
inference rules in order to create full-fledged relation/event descriptions.
 Several software packages provide various tools that can be used in the
process of developing an IE system, ranging from core linguistic processing
modules (e.g., language detectors, sentence splitters) to general IE-oriented
NLP frameworks.
IE Task Types
 Named Entity Recognition (NER)
 Co-reference Resolution (CO)
 Relation Extraction (RE)
 Event Extraction (EE)
Named Entity Recognition
 Named Entity Recognition (NER) addresses the problem of the identification
(detection) and classification of predefined types of named entities,
 Such as organizations (e.g., ‘World Health Organisation’), persons (e.g., ‘Mohamad
Gouse’), place names (e.g., ‘the Baltic Sea’), temporal expressions (e.g., ‘15 January
1984’), numerical and currency expressions (e.g., ‘20 Million Euros’), etc.
 The NER task may also include extracting descriptive information from the text about
the detected entities through filling of a small-scale template.
 For example, in the case of persons, this may include the title, position,
nationality, gender, and other attributes of the person.
 NER also involves lemmatization (normalization) of the named entities, which is
particularly crucial in highly inflective languages.
 For example, in Polish there are seven inflected forms of the name ‘Mohamad Gouse’
depending on grammatical case: ‘Mohamad Gouse’ (nominative), ‘Mohamad
Gouseego’ (genitive), ‘Mohamad Gouseemu’ (dative), ‘Mohamad Gouseiego’
(accusative), ‘Mohamad Gousem’ (instrumental), ‘Mohamad Gousem’ (locative),
‘Mohamad Gouse’ (vocative).
Co-Reference
 Co-reference Resolution (CO) requires the identification of multiple (coreferring) mentions of
the same entity in the text.
 Entity mentions can be:
 (a) Named, in case an entity is referred to by name
 e.g., ‘General Electric’ and ‘GE’ may refer to the same real-world entity.
 (b) Pronominal, in case an entity is referred to with a pronoun
 e.g., in ‘John bought food. But he forgot to buy drinks.’, the pronoun he refers to John.
 (c) Nominal, in case an entity is referred to with a nominal phrase
 e.g., in ‘Microsoft revealed its earnings. The company also unveiled future plans.’ the
definite noun phrase The company refers to Microsoft.
 (d) Implicit, as in the case of zero-anaphora
 e.g., in the Italian text fragment ‘Berlusconi ha visitato il luogo del disastro. Ha
sorvolato con l’elicottero.’
 (Berlusconi has visited the place of the disaster. [He] flew over with a helicopter.) the
second sentence does not have an explicit realization of the reference to
Berlusconi.
Relation Extraction
 Relation Extraction (RE) is the task of detecting and classifying predefined
relationships between entities identified in text.
 For example:
 EmployeeOf(Steve Jobs,Apple): a relation between a person and an
organisation, extracted from ‘Steve Jobs works for Apple’
 LocatedIn(Smith,New York): a relation between a person and location,
extracted from ‘Mr. Smith gave a talk at the conference in New York’,
 SubsidiaryOf(TVN,ITI Holding): a relation between two companies,
extracted from ‘Listed broadcaster TVN said its parent company, ITI
Holdings, is considering various options for the potential sale.’
 While the set of relations that may be of interest is unlimited, the set of relations
within a given task is predefined and fixed, as part of the specification of the
task.
Event Extraction
 Event Extraction (EE) refers to the task of identifying events in free text and
deriving detailed and structured information about them, ideally identifying who
did what to whom, when, where, through what methods (instruments), and why.
 Usually, event extraction involves extraction of several entities and relationships
between them.
 For instance, extraction of information on terrorist attacks from the text
fragment ‘Masked gunmen armed with assault rifles and grenades attacked a
wedding party in mainly Kurdish southeast Turkey, killing at least 44 people.’
 Involves identification of perpetrators (masked gunmen), victims (people),
number of killed/injured (at least 44), weapons and means used (rifles and
grenades), and location (southeast Turkey).
 Another example is the extraction of information on new joint ventures, where
the aim is to identify the partners, products, profits and capitalization of the joint
venture.
 EE is considered to be the hardest of the four IE tasks.
IE Subtask: Named Entity Recognition
 Detect and classify all proper names mentioned in text
 What is a proper name? Depends on application.
 People, places, organizations, times, amounts, etc.
 Names of genes and proteins
 Names of college courses
NER Example
 Find extent of each mention
 Classify each mention
 Sources of ambiguity
 Different strings that map to the same entity
 Equivalent strings that map to different entities (e.g., U.S. Grant)
Approaches to NER
 Early systems: hand-written rules
 Statistical systems
 Supervised learning (HMMs, Decision Trees, MaxEnt, SVMs, CRFs)
 Semi-supervised learning (bootstrapping)
 Unsupervised learning (rely on lexical resources, lexical patterns, and
corpus statistics)
A Sequence-Labeling Approach using
CRFs
 Input: Sequence of observations (tokens/words/text)
 Output: Sequence of states (labels/classes)
 B: Begin
 I: Inside
 O: Outside
 Some evidence that including L (Last) and U (Unit length) is
advantageous (Ratinov and Roth 09)
 A CRF defines a conditional probability p(Y|X) over label sequences Y
given an observation sequence X
 No effort wasted modeling the observations (in contrast to joint
models like HMMs)
 Arbitrary features of the observations may be captured by the model
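Whatever model produces the B/I/O labels, the label sequence must be decoded back into entity spans. Below is a minimal sketch of that decoding step, assuming simple B-TYPE / I-TYPE / O tags (not part of any particular CRF library):

```python
def bio_to_spans(tokens, labels):
    """Decode a B/I/O label sequence into (entity_type, start, end) spans,
    where end is exclusive."""
    spans, start, etype = [], None, None
    for i, lab in enumerate(labels + ["O"]):  # sentinel flushes the last span
        if lab.startswith("B-") or lab == "O" or (lab.startswith("I-") and lab[2:] != etype):
            if start is not None:
                spans.append((etype, start, i))
            start, etype = (i, lab[2:]) if lab.startswith("B-") else (None, None)
        # an I- tag continuing the current entity needs no action
    return spans

tokens = ["Steve", "Jobs", "works", "for", "Apple"]
labels = ["B-PER", "I-PER", "O", "O", "B-ORG"]
print(bio_to_spans(tokens, labels))  # [('PER', 0, 2), ('ORG', 4, 5)]
```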
Linear Chain CRFs
 Simplest and most common graph structure, used for
sequence modeling
 Inference can be done efficiently using dynamic
programming in O(|X||Y|²) time
Linear Chain CRFs
NER Features
 Several feature families used, all time-shifted by -2, -1, 0, 1, 2:
 The word itself
 Capitalization and digit patterns (shape patterns)
 8 lexicons entered by hand (e.g., honorifics, days, months)
 15 lexicons obtained from web sites (e.g., countries, publicly-traded
companies, surnames, stopwords, universities)
 25 lexicons automatically induced from the web (people names,
organizations, NGOs, nationalities)
Limitations of Conventional
NER(and IE)
 Supervised learning
 Expensive
 Inconsistent
 Worse for relations and events!
 Fixed, narrow, pre-specified sets of entity types
 Small, homogeneous corpora (newswire, seminar announcements)
Evaluating Named Entity Recognition
 Recall that recall is the ratio of the number of correctly labeled responses to the
total that should have been labeled.
 Precision is the ratio of the number of correctly labeled responses to the total
labeled.
 The F-measure provides a way to combine these two measures into a single
metric.
recall = N_correct / N_key

precision = N_correct / (N_correct + N_incorrect)

F = ((β² + 1) × precision × recall) / (β² × precision + recall)
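The recall, precision, and F-measure definitions above translate directly into code; a small sketch with illustrative counts:

```python
def precision(n_correct, n_incorrect):
    # fraction of labeled responses that are correct
    return n_correct / (n_correct + n_incorrect)

def recall(n_correct, n_key):
    # fraction of gold-standard ("key") items that were found
    return n_correct / n_key

def f_measure(p, r, beta=1.0):
    # weighted harmonic mean; beta=1 gives the common F1 score
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

# A system labels 8 entities, 6 of them correctly; the key contains 10.
p = precision(6, 2)   # 0.75
r = recall(6, 10)     # 0.6
print(round(f_measure(p, r), 3))  # 0.667
```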
What is Relation Extraction?
 Typically defined as identifying relations between two entities
Relations     Subtypes         Examples
Affiliations  Personal         married to, mother of
              Organizational   spokesman for, president of
              Artifactual      owns, invented, produces
Geospatial    Proximity        near, on outskirts
              Directional      southeast of
Part-of       Organizational   a unit of, parent of
              Political        annexed, acquired
Typical (Supervised) Approach
 FindEntities( ): named entity recognizer
 Related?( ): binary classifier that says whether two entities are involved in
a relation
 ClassifyRelation( ): classifier that labels relations discovered by
Related?( )
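The three-stage pipeline above can be sketched as follows. The three functions here are hypothetical stubs standing in for trained models; only the control flow reflects the approach.

```python
def find_entities(sentence):
    # stand-in for a real NER component: capitalized tokens
    return [tok for tok in sentence.split() if tok[0].isupper()]

def related(e1, e2, sentence):
    # stand-in for the binary Related? classifier
    return True

def classify_relation(e1, e2, sentence):
    # stand-in for the relation-label classifier
    return "EmployeeOf"

def extract_relations(sentence):
    entities = find_entities(sentence)
    triples = []
    for i, e1 in enumerate(entities):
        for e2 in entities[i + 1:]:
            if related(e1, e2, sentence):
                triples.append((classify_relation(e1, e2, sentence), e1, e2))
    return triples

print(extract_relations("Steve Jobs works for Apple"))
```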
Typical (Semi-Supervised) Approach
NELL: Never-Ending Language
Learner
NELL: Can computers learn to read?
 Goal: create a system that learns to read the web
 Reading task: Extract facts from text found on the web
 Learning task: Iteratively improve reading competence.
 http://rtw.ml.cmu.edu/rtw/
Approach
 Inputs
 Ontology with target categories and relations (i.e., predicates)
 Small number of seed examples for each
 Set of constraints that couple the predicates
 Large corpus of unlabeled documents
 Output: new predicate instances
 Semi-supervised bootstrap learning methods
 Couple the learning of functions to constrain the problem
 Exploit redundancy of information on the web.
Coupled Semi-Supervised Learning
Types of Coupling
1. Mutual Exclusion (output constraint)
 Mutually exclusive predicates can't both be satisfied by the same input x
 E.g., x cannot be a Person and a Sport
2. Relation Argument Type-Checking (compositional constraint)
 Arguments of relations declared to be of certain categories
 E.g., CompanyIsInEconomicSector(Company, EconomicSector)
3. Unstructured and Semi-Structured Text Features
(multi-view-agreement constraint)
 Look at different views (like co-training)
 Require classifiers agree
 E.g., freeform textual contexts and semi-structured contexts
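The first two coupling constraints can be sketched as simple filters over candidate extractions. The mini-ontology below is invented for illustration; NELL's real ontology and scoring are far richer.

```python
# Hypothetical mini-ontology: mutually exclusive category pairs and
# declared argument types for one relation.
MUTEX = {("Person", "Sport"), ("Sport", "Person")}
RELATION_TYPES = {"CompanyIsInEconomicSector": ("Company", "EconomicSector")}

def violates_mutex(candidate_labels):
    """Reject a candidate labeled with two mutually exclusive categories."""
    return any((a, b) in MUTEX for a in candidate_labels for b in candidate_labels)

def type_checks(relation, arg1_type, arg2_type):
    """Relation argument type-checking: do the argument categories match
    the relation's declared signature?"""
    return RELATION_TYPES.get(relation) == (arg1_type, arg2_type)

print(violates_mutex({"Person", "Sport"}))                                  # True
print(type_checks("CompanyIsInEconomicSector", "Company", "EconomicSector"))  # True
```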
System Architecture
Coupled Pattern Learner (CPL)
 Free-text extractor that learns contextual patterns to extract predicate
instances
 Use mutual exclusion and type-checking constraints to filter candidates
instances
 Rank instances and patterns by leveraging redundancy: if an instance or
pattern occurs more frequently, it's ranked higher
Coupled SEAL (CSEAL)
 SEAL (Set Expander for Any Language) is a wrapper induction algorithm
 Operates over semi-structured text such as web pages
 Constructs page-specific extraction rules (wrappers) that are human- and
markup-language independent
 CSEAL adds mutual-exclusion and type-checking constraints
CSEAL Wrappers
 Seeds: Ford, Nissan, Toyota
 arg1 is a placeholder for extracting instances
Open IE and TextRunner
 Motivations:
 Web corpora are massive, introducing scalability concerns
 Relations of interest are unanticipated, diverse, and abundant
 Use of “heavy” linguistic technology (NER systems and parsers) doesn't work
well
 Input: a large, heterogeneous Web corpus
 9M web pages, 133M sentences
 No pre-specified set of relations
 Output: huge set of extracted relations
 60.5M tuples, 11.3M high-probability tuples
 Tuples are indexed for searching
TextRunner Architecture
 Learner outputs a classifier that labels trustworthy extractions
 Extractor finds and outputs trustworthy extractions
 Assessor normalizes and scores the extractions
Architecture: Self-Supervised Learner
1. Automatically labels training data
 Uses a parser to induce dependency structures
 Parses a small corpus of several thousand sentences
 Identifies and labels a set of positive and negative extractions using
relation-independent heuristics
 An extraction is a tuple t = (e_i, r_{i,j}, e_j)
 Entities are base noun phrases
 Uses parse to identify potential relations
2. Trains a classifier
 Domain-independent, simple non-parse features
 E.g., POS tags, phrase chunks, regexes, stopwords, etc.
Architecture: Single-Pass Extractor
1. POS tag each word
2. Identify entities using lightweight NP chunker
3. Identify relations
4. Classify them
Architecture: Redundancy-Based
Assessor
 Take the tuples and perform
 Normalization, deduplication, synonym resolution
 Assessment
 Number of distinct sentences from which each extraction was found serves
as a measure of confidence
 Entities and relations indexed using Lucene
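The redundancy-based confidence idea above can be sketched in a few lines: a tuple seen in more distinct sentences scores higher. Normalization, deduplication, and synonym resolution are omitted here, and the tuples are invented examples.

```python
from collections import Counter

# One entry per distinct sentence an extraction was found in.
extractions = [
    ("Apple", "acquired", "NeXT"),       # sentence 1
    ("Apple", "acquired", "NeXT"),       # sentence 2
    ("Paris", "capital_of", "France"),   # sentence 3
]

counts = Counter(extractions)
ranked = sorted(counts.items(), key=lambda kv: -kv[1])
print(ranked[0])  # (('Apple', 'acquired', 'NeXT'), 2)
```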
Template Filling
 The task of template filling is to find documents that evoke such situations
and then fill the slots in templates with appropriate material.
 These slot fillers may consist of
 Text segments extracted directly from the text, or
 Concepts that have been inferred from text elements via some
additional processing (times, amounts, entities from an ontology, etc.).
Applications of IE
 Infrastructure for IR and for Categorization
 Information Routing
 Event Based Summarization
 Automatic Creation of Databases
 Company acquisitions
 Sports scores
 Terrorist activities
 Job listings
 Corporate titles and addresses
Inductive Algorithms for IE
 Rule Induction algorithms produce symbolic IE rules based on a corpus of
annotated documents.
 WHISK
 BWI
 The (LP)2 Algorithm
 The inductive algorithms are suitable for semi-structured domains, where
the rules are fairly simple, whereas when dealing with free text documents
(such as news articles) the probabilistic algorithms perform much better.
WHISK
 WHISK is a supervised learning algorithm that uses hand-tagged examples
for learning information extraction rules.
 Works for structured, semi-structured and free text.
 Extract both single-slot and multi-slot information.
 Doesn’t require syntactic preprocessing for structured and semi-structured
text; a syntactic analyzer and semantic tagger are recommended for free text.
 The extraction pattern learned by WHISK is in the form of limited regular
expression, considering tradeoff between expressiveness and efficiency.
 Example: IE task of extracting neighborhood, number of bedrooms and
price from the text
WHISK
 An Example from the Rental Ads domain
 An example extraction pattern which can be learned by WHISK is,
*(Neighborhood) *(Bedroom) * ‘$’(Number)
Neighborhood, Bedroom, and Number – Semantic classes specified by
domain experts.
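A WHISK pattern of this shape can be approximated as a regular expression. The alternations standing in for the semantic classes below are hypothetical; in WHISK they would come from the domain expert's class definitions.

```python
import re

# Hypothetical expansions of the semantic classes in the pattern
# *(Neighborhood) *(Bedroom) * '$'(Number)
NEIGHBORHOOD = r"(Capitol Hill|Queen Anne|Fremont)"
BEDROOM = r"(\d+\s*BR)"
NUMBER = r"(\d+)"

# '*' in WHISK is an unlimited (lazy) skip, rendered here as '.*?'
rule = re.compile(rf".*?{NEIGHBORHOOD}.*?{BEDROOM}.*?\${NUMBER}")

m = rule.search("Spacious Capitol Hill apt, 2 BR, hardwood floors, $700/mo")
print(m.groups())  # ('Capitol Hill', '2 BR', '700')
```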
 WHISK learns the extraction rules using a top-down covering algorithm.
 The algorithm begins learning a single rule by starting with an empty rule;
 Then add one term at a time until either no negative examples are covered
by the rule or the pre-pruning criterion has been satisfied.
 We add terms to specialize the rule in order to reduce its Laplacian error.
The Laplacian expected error is defined as shown below, where e is the
number of negative extractions and n is the number of positive extractions
on the training instances.
 Example:
 For instance, from the text “3 BR, upper flr of turn of ctry. Incl gar, grt N. Hill
loc 995$. (206)-999-9999,” the rule would extract the frame Bedrooms – 3,
Price – 995.
 The “*” char in the pattern will match any number of characters (unlimited
jump).
 Patterns enclosed in parentheses become numbered elements in the output
pattern; hence (Digit) is $1 and (Number) is $2.
Laplacian = (e + 1) / (n + 1)
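The Laplacian expected error, (e + 1) / (n + 1), can be used to compare candidate rule specializations; a minimal sketch:

```python
def laplacian_error(e, n):
    """Laplacian expected error of a rule: e is the number of negative
    extractions, n the number of positive extractions it makes."""
    return (e + 1) / (n + 1)

# WHISK prefers the specialization with the lower expected error:
print(laplacian_error(0, 9))  # 0.1
print(laplacian_error(2, 9))  # 0.3
```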
Boosted Wrapper Induction (BWI)
 The BWI is a system that utilizes wrapper induction techniques for
traditional Information Extraction.
 IE is treated as a classification problem that entails trying to approximate
two boundary functions Xbegin(i ) and Xend(i ).
 Xbegin(i ) is equal to 1 if the ith token starts a field that is part of the frame to
be extracted and 0 otherwise.
 Xend(i ) is defined in a similar way for tokens that end a field.
 The learning algorithm approximates each X function by taking a set of
pairs of the form {(i, X(i))} as training data.
 Each field is extracted by a wrapper W = <F, A, H> where
 F is a set of begin boundary detectors
 A is a set of end boundary detectors
 H(k) is the probability that the field has length k
 A boundary detector is just a sequence of tokens with wildcards (a kind of
regular expression).
 W(i, j) is a naïve Bayesian approximation of the probability that a field
begins at token i and ends at token j.
 The BWI algorithm learns two detectors by using a greedy algorithm that extends the
prefix and suffix patterns while there is an improvement in the accuracy.
 The sets F(i) and A(i) are generated from the detectors by using the AdaBoost
algorithm.
 The detector pattern can include specific words and regular expressions that work on a
set of wildcards such as <num>, <Cap>, <LowerCase>, <Punctuation> and <Alpha>.
W(i, j) = F(i) · A(j) · H(j − i + 1), and 0 otherwise

F(i) = Σ_k C_F^k F_k(i)
A(i) = Σ_k C_A^k A_k(i)
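A toy rendering of the wrapper score: W(i, j) multiplies the begin-detector confidence F(i), the end-detector confidence A(j), and the length probability H(j − i + 1). The numeric values below are invented for illustration; in BWI, F, A, and H are learned with AdaBoost.

```python
# Hypothetical learned scores, keyed by token position / field length.
F = {2: 0.8}   # F(i): confidence that token i begins the field
A = {4: 0.7}   # A(j): confidence that token j ends the field
H = {3: 0.6}   # H(k): probability that the field is k tokens long

def W(i, j):
    """W(i, j) = F(i) * A(j) * H(j - i + 1); 0 where any factor is absent."""
    return F.get(i, 0.0) * A.get(j, 0.0) * H.get(j - i + 1, 0.0)

print(round(W(2, 4), 3))  # 0.336
print(W(0, 4))            # 0.0
```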
(LP)2 Algorithm
 The (LP)2 algorithm learns from an annotated corpus and induces two sets
of rules:
 Tagging rules, generated by a bottom-up generalization process
 Correction rules, which correct mistakes and omissions made by the
tagging rules.
 A tagging rule is a pattern that contains conditions on words preceding the
place where a tag is to be inserted and conditions on the words that follow
the tag.
 Conditions can be words, lemmas, lexical categories (such as digit,
noun, or verb), case (lower or upper), or semantic categories (such as
time-id or cities).
 The (LP)2 algorithm is a covering algorithm that tries to cover all training
examples.
 The initial tagging rules are generalized by dropping conditions.
IE and Text Summarization
 User’s perspective,
 IE can be glossed as "I know what specific pieces of information I want–just
find them for me!",
 Summarization can be glossed as "What’s in the text that is interesting?".
 Technically, from the system builder’s perspective, the two applications blend into
each other.
 The most pertinent technical aspects are:
 Are the criteria of interestingness specified at run-time or by the system builder?
 Is the input a single document or multiple documents?
 Is the extracted information manipulated, either by simple content delineation
routines or by complex inferences, or just delivered verbatim?
 What is the grain size of the extracted units of information–individual entities
and events, or blocks of text?
 Is the output formulated in language, or in a computer-internal knowledge
representation?
Text Summarization
 An information access technology that given a
document or sets of related documents, extracts the
most important content from the source(s) taking into
account the user or task at hand, and presents this
content in a well formed and concise text
Text Summarization Techniques
 Topic Representation
 Influence of Context
 Indicator Representations
 Pattern Extraction
Text Summarization
Input: one or more text documents
Output: paragraph length summary
 Sentence extraction is the standard method
 Using features such as key words, sentence position in document,
cue phrases
 Identify sentences within documents that are salient
 Extract and string sentences together
 Machine learning for extraction
 Corpus of document/summary pairs
 Learn the features that best determine important sentences
 Summarization of scientific articles
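Sentence extraction as described above can be sketched with two of the listed features: keyword frequency and sentence position. The scoring weights here are arbitrary illustrative choices, not learned values.

```python
import re
from collections import Counter

def summarize(text, n=1):
    """Extract the n highest-scoring sentences, in document order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(idx, sent):
        # keyword score, dampened by position (earlier sentences favored)
        kw = sum(freq[w] for w in re.findall(r"\w+", sent.lower()))
        return kw / (1 + idx * 0.1)

    ranked = sorted(enumerate(sentences), key=lambda p: -score(*p))
    chosen = sorted(idx for idx, _ in ranked[:n])
    return " ".join(sentences[i] for i in chosen)

doc = "Apple acquired NeXT. The weather was nice. Apple shipped products."
print(summarize(doc))  # Apple acquired NeXT.
```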
A Summarization Machine
[Diagram: the summarization machine takes one or more documents (DOC,
MULTIDOCS) plus a QUERY and produces extracts or abstracts. Output varies by
type (indicative vs. informative), orientation (generic vs. query-oriented,
background vs. just-the-news), and target length (headline, very brief, brief,
long; 10%, 50%, 100%). Intermediate representations include case frames,
templates, core concepts, core events, relationships, clause fragments, and
index terms.]
The Modules of the Summarization
Machine
[Diagram: DOC → EXTRACTION → extracts; extracts → INTERPRETATION → case
frames, templates, core concepts, core events, relationships, clause
fragments, index terms; these → GENERATION → abstracts; FILTERING reduces
multi-document input to MULTIDOC EXTRACTS.]
What is Summarization?
 Data as input (database, software trace, expert system), text summary as output
 Text as input (one or more articles), paragraph summary as output
 Multimedia in input or output
 Summaries must convey maximal information in minimal space
 Involves: Three stages (typically)
 Content identification
 Find/Extract the most important material
 Conceptual organization
 Realization
Types of summaries
 Purpose
 Indicative, informative, and critical summaries
 Form
 Extracts (representative paragraphs/sentences/phrases)
 Abstracts: “a concise summary of the central subject matter of a
document”.
 Dimensions
 Single-document vs. multi-document
 Context
 Query-specific vs. query-independent
 Generic vs. query-oriented
 provides author’s view vs. reflects user’s interest.
Genres
 Headlines
 Outlines
 Minutes
 Biographies
 Abridgments
 Sound bites
 Movie summaries
 Chronologies, etc.
Aspects that Describe Summaries
 Input
 subject type: domain
 genre: newspaper articles, editorials, letters, reports...
 form: regular text structure, free-form
 source size: single doc, multiple docs (few,many)
 Purpose
 situation: embedded in larger system (MT, IR) or not?
 audience: focused or general
 usage: IR, sorting, skimming...
 Output
 completeness: include all aspects, or focus on some?
 format: paragraph, table, etc.
 style: informative, indicative, aggregative, critical...
Single Document Summarization
System Architecture
[Diagram: input document → extraction → extracted sentences → sentence
reduction → sentence combination → generation → output summary; supporting
resources include a corpus, decomposition rules, a lexicon, a parser, and a
co-reference component.]
Multi-Document Summarization
 Monitor variety of online information sources
 News, multilingual
 Email
 Gather information on events across source and time
 Same day, multiple sources
 Across time
 Summarize
 Highlighting similarities, new information, different perspectives,
user specified interests in real-time
Example System: SUMMARIST
Three stages:
1. Topic Identification Modules: Positional Importance, Cue Phrases (under
construction), Word Counts, Discourse Structure (under construction), ...
2. Topic Interpretation Modules: Concept Counting /Wavefront, Concept
Signatures (being extended)
3. Summary Generation Modules (not yet built): Keywords, Template Gen, Sent.
Planner & Realizer
SUMMARY = TOPIC ID + INTERPRETATION + GENERATION
 From extract to abstract:
topic interpretation or concept fusion.
 Experiment (Marcu, 98):
 Got 10 newspaper texts, with
human abstracts.
 Asked 14 judges to extract
corresponding clauses from texts, to
cover the same content.
 Compared word lengths of extracts
to abstracts: extract_length ≈ 2.76 ×
abstract_length !!
Topic Interpretation
[Diagram: a long source text is condensed, via topic interpretation, into a
much shorter text.]
Some Types of Interpretation
 Concept generalization:
Sue ate apples, pears, and bananas → Sue ate fruit
 Meronymy replacement:
Both wheels, the pedals, saddle, chain… → the bike
 Script identification:
He sat down, read the menu, ordered, ate, paid, and left → He ate at the
restaurant
 Metonymy:
A spokesperson for the US Government announced that… → Washington
announced that...
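Concept generalization can be sketched with a hypernym table; the tiny table below is invented for illustration, where a real system would consult a resource such as WordNet.

```python
# Hypothetical hypernym table.
HYPERNYM = {"apples": "fruit", "pears": "fruit", "bananas": "fruit"}

def generalize(items):
    """If every item shares a single hypernym, replace the list with it;
    otherwise leave the list unchanged."""
    parents = {HYPERNYM.get(it) for it in items}
    if len(parents) == 1 and None not in parents:
        return parents.pop()
    return items

print(generalize(["apples", "pears", "bananas"]))  # fruit
print(generalize(["apples", "shoes"]))             # ['apples', 'shoes']
```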
General Aspects of Interpretation
 Interpretation occurs at the conceptual level...
…words alone are polysemous (bat → animal or sports
instrument) and combine for meaning (alleged murderer ≠
murderer).
 For interpretation, you need world knowledge...
…the fusion inferences are not in the text!
 Extract a pattern for each event in training data
 part of speech & mention tags
 Example: Japanese political leaders → GPE JJ PER

Text:     Japanese   political   leaders
Ents:     GPE        -           PER
POS:      NN         JJ          NN
Pattern:  GPE        JJ          PER
Pattern Extraction
Summarization - Scope
 Data preparation:
 Collect large sets of texts with abstracts, all genres.
 Build large corpora of <Text, Abstract, Extract> tuples.
 Investigate relationships between extracts and abstracts (using <Extract,
Abstract> tuples).
 Types of summary:
 Determine characteristics of each type.
 Topic Identification:
 Develop new identification methods (discourse, etc.).
 Develop heuristics for method combination (train heuristics on <Text,
Extract> tuples).
Summarization - Scope
 Concept Interpretation (Fusion):
 Investigate types of fusion (semantic, evaluative…).
 Create large collections of fusion knowledge/rules (e.g., signature
libraries, generalization and partonymic hierarchies, metonymy
rules…).
 Study incorporation of User’s knowledge in interpretation.
 Generation:
 Develop Sentence Planner rules for dense packing of content into
sentences (using <Extract, Abstract> pairs).
 Evaluation:
 Develop better evaluation metrics, for types of summaries.
Apriori Algorithm
 In computer science and data mining, Apriori is a classic algorithm for
learning association rules.
 Apriori is designed to operate on databases containing transactions.
 For example, collections of items bought by customers, or details of website
visits.
 The algorithm attempts to find subsets which are common to at least a
minimum number C (the cutoff, or confidence threshold) of the itemsets.
 Apriori uses a "bottom up" approach, where frequent subsets are extended
one item at a time (a step known as candidate generation), and groups of
candidates are tested against the data.
 The algorithm terminates when no further successful extensions are found.
 Apriori uses breadth-first search and a hash tree structure to count candidate
item sets efficiently.
Find rules in two stages
Agrawal et al. divided the problem of finding good rules into two phases:
1. Find all itemsets with a specified minimal support (coverage). An itemset
is just a specific set of items, e.g. {apples, cheese}. The Apriori algorithm
can efficiently find all itemsets whose coverage is above a given
minimum.
2. Use these itemsets to help generate interesting rules. Having done stage
1, we have considerably narrowed down the possibilities, and can do
reasonably fast processing of the large itemsets to generate candidate
rules.
Terminology
k-itemset : a set of k items. E.g.
{beer, cheese, eggs} is a 3-itemset
{cheese} is a 1-itemset
{honey, ice-cream} is a 2-itemset
support: an itemset has support s% if s% of the records in the DB contain that
itemset.
minimum support: the Apriori algorithm starts with the specification of a
minimum level of support, and will focus on itemsets with this level or
above.
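Computing the support of an itemset is a one-liner over the transaction database; the toy database below reuses the itemset examples above.

```python
# Toy transaction database built from the terminology examples.
DB = [
    {"beer", "cheese", "eggs"},
    {"beer", "cheese"},
    {"honey", "ice-cream"},
    {"beer", "eggs"},
]

def support(itemset, db):
    """Fraction of records in the DB that contain every item in the itemset."""
    return sum(itemset <= t for t in db) / len(db)

print(support({"beer", "cheese"}, DB))  # 0.5
```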
Terminology
large itemset: doesn’t mean an itemset with many items. It means one
whose support is at least minimum support.
Lk : the set of all large k-itemsets in the DB.
Ck : a set of candidate large k-itemsets. In the algorithm we will look at, it
generates this set, which contains all the k-itemsets that might be large,
and then eventually generates the set above.
Terminology
sets: Let A be a set (A = {cat, dog}) and
let B be a set (B = {dog, eel, rat}) and
let C = {eel, rat}
I use ‘A + B’ to mean A union B.
So A + B = {cat, dog, eel, rat}
When X is a subset of Y, I use Y – X to mean the set of things in Y which
are not in X.
E.g. B – C = {dog}
Apriori Algorithm
Find all large 1-itemsets
For (k = 2; Lk-1 is non-empty; k++)
{ Ck = apriori-gen(Lk-1)
  For each c in Ck, initialise c.count to zero
  For all records r in the DB
  { Cr = subset(Ck, r); For each c in Cr, c.count++ }
  Set Lk := all c in Ck whose count >= minsup
} /* end -- return all of the Lk sets */
The algorithm returns all of the (non-empty) Lk sets, which gives us an
excellent start in finding interesting rules (although the large itemsets
themselves will usually be very interesting and useful).
Example: Generation of candidate itemsets and frequent
itemsets, where the minimum support count is 2.
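The pseudocode above might be realized as the following Python sketch, where apriori-gen is implemented as a self-join of L(k-1) followed by the Apriori prune step; the transaction data are invented for illustration:

```python
from itertools import combinations

def apriori(db, minsup):
    """Return all large itemsets (support count >= minsup) as frozensets."""
    # Find all large 1-itemsets
    counts = {}
    for r in db:
        for item in r:
            c = frozenset([item])
            counts[c] = counts.get(c, 0) + 1
    large = {s for s, n in counts.items() if n >= minsup}
    all_large, k = set(large), 2
    while large:                                   # while L(k-1) is non-empty
        # apriori-gen: join step ...
        cands = {a | b for a in large for b in large if len(a | b) == k}
        # ... and prune step: every (k-1)-subset must itself be large
        cands = {c for c in cands
                 if all(frozenset(s) in large for s in combinations(c, k - 1))}
        # one DB pass to count the surviving candidates
        counts = {c: sum(c <= r for r in db) for c in cands}
        large = {c for c, n in counts.items() if n >= minsup}
        all_large |= large
        k += 1
    return all_large

# Hypothetical transaction DB, minimum support count 3
db = [{"bread", "milk"},
      {"bread", "diaper", "beer", "eggs"},
      {"milk", "diaper", "beer", "cola"},
      {"bread", "milk", "diaper", "beer"},
      {"bread", "milk", "diaper", "cola"}]
result = apriori(db, minsup=3)
# {diaper, beer} is large (count 3); {bread, milk, diaper} is not (count 2)
```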
Apriori Merits/Demerits
 Merits
 Uses large itemset property
 Easily parallelized
 Easy to implement
 Demerits
 Assumes transaction database is memory resident.
 Requires many database scans.
Summary
 Association Rules are a widely applied data mining approach.
 Association Rules are derived from frequent itemsets.
 The Apriori algorithm is an efficient algorithm for finding all frequent
itemsets.
 The Apriori algorithm implements level-wise search using the frequent
itemset property.
 The Apriori algorithm can be additionally optimized.
 There are many measures for association rules.
FP-Growth Algorithm
Frequent Pattern Mining:
An Example
Given a transaction database DB and a minimum support threshold ξ, find
all frequent patterns (item sets) with support no less than ξ.
Input:
DB:
TID Items bought
100 {f, a, c, d, g, i, m, p}
200 {a, b, c, f, l, m, o}
300 {b, f, h, j, o}
400 {b, c, k, s, p}
500 {a, f, c, e, l, p, m, n}
Minimum support: ξ = 3
Output: all frequent patterns, i.e., f, a, …, fa, fac, fam, fm, am, …
Problem Statement: How to efficiently find all frequent patterns?
Overview of FP-Growth: Ideas
 Compress a large database into a compact, Frequent-Pattern tree (FP-tree)
structure
 highly compacted, but complete for frequent pattern mining
 avoid costly repeated database scans
 Develop an efficient, FP-tree-based frequent pattern mining method
(FP-growth)
 A divide-and-conquer methodology: decompose mining tasks into
smaller ones
 Avoid candidate generation: sub-database test only.
FP-tree:
Construction and Design
Construct FP-tree
Two Steps:
1. Scan the transaction DB for the first time, find frequent items (single
item patterns) and order them into a list L in frequency descending order.
e.g., L={f:4, c:4, a:3, b:3, m:3, p:3}
In the format of (item-name, support)
2. For each transaction, order its frequent items according to the order in L;
Scan DB the second time, construct FP-tree by putting each frequency
ordered transaction onto it.
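The two steps above can be sketched compactly in Python. The class and variable names are my own, and ties among equally frequent items may be ordered differently than in the slide's L:

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fptree(db, minsup):
    # Step 1: first DB scan -- frequent items, ordered into L by descending frequency
    freq = Counter(i for t in db for i in t)
    L = [i for i, c in freq.most_common() if c >= minsup]
    # Step 2: second DB scan -- insert each frequency-ordered transaction as a path
    root, header = Node(None, None), {i: [] for i in L}   # header keeps node-links
    for t in db:
        node = root
        for item in [i for i in L if i in t]:
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)     # link the new node for this item
            else:
                node.children[item].count += 1  # shared prefix: just bump the count
            node = node.children[item]
    return root, header, L

# The example DB from the slides, minimum support 3
db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
root, header, L = build_fptree(db, minsup=3)
# f heads 4 transactions along one path; p occurs 3 times across two branches
```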
FP-tree Example: step 1
Step 1: Scan DB for the first time to generate L (a by-product of the first
scan of the database).
TID Items bought
100 {f, a, c, d, g, i, m, p}
200 {a, b, c, f, l, m, o}
300 {b, f, h, j, o}
400 {b, c, k, s, p}
500 {a, f, c, e, l, p, m, n}
L:
Item frequency
f 4
c 4
a 3
b 3
m 3
p 3
FP-tree Example: step 2
TID Items bought (ordered) frequent items
100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}
200 {a, b, c, f, l, m, o} {f, c, a, b, m}
300 {b, f, h, j, o} {f, b}
400 {b, c, k, s, p} {c, b, p}
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}
Step 2: scan the DB for the second time, order frequent items
in each transaction
FP-tree Example: step 2
Step 2: construct FP-tree
After inserting transaction 100 ({f, c, a, m, p}), the tree is a single path
from the root: {} -> f:1 -> c:1 -> a:1 -> m:1 -> p:1.
After inserting transaction 200 ({f, c, a, b, m}), the shared prefix counts
become f:2, c:2, a:2, and a new branch b:1 -> m:1 splits off below a.
NOTE: Each transaction corresponds to one path in the FP-tree.
FP-tree Example: step 2 (continued)
After inserting transaction 300 ({f, b}), f's count becomes f:3 and a new
branch b:1 hangs directly off f.
After inserting transaction 400 ({c, b, p}), a second path
{} -> c:1 -> b:1 -> p:1 starts from the root.
After inserting transaction 500 ({f, c, a, m, p}), the main path counts
become f:4 -> c:3 -> a:3 -> m:2 -> p:2.
Nodes carrying the same item are connected by node-links.
Construction Example
Final FP-tree: root {} with children f:4 and c:1; under f:4, the path
c:3 -> a:3 with m:2 -> p:2 and b:1 -> m:1 below a, plus a separate b:1
child of f; under c:1, the path b:1 -> p:1.
Header Table: the items f, c, a, b, m, p, each with the head of its
node-link chain.
FP-Tree Definition
 FP-tree is a frequent pattern tree.
 Formally, FP-tree is a tree structure defined below:
1. One root labeled as "null", a set of item prefix sub-trees as the children of
the root, and a frequent-item header table.
2. Each node in the item prefix sub-trees has three fields:
 item-name: registers which item this node represents,
 count: the number of transactions represented by the portion of the path
reaching this node,
 node-link: links to the next node in the FP-tree carrying the same
item-name, or null if there is none.
3. Each entry in the frequent-item header table has two fields,
 item-name, and
 head of node-link that points to the first node in the FP-tree carrying the
item-name.
Advantages of the FP-tree Structure
 The most significant advantage of the FP-tree
 Scan the DB twice, and only twice.
 Completeness:
 The FP-tree contains all the information related to mining frequent patterns
(given the min-support threshold).
 Compactness:
 The size of the tree is bounded by the occurrences of frequent items
 The height of the tree is bounded by the maximum number of items in a
transaction
FP-growth:
Mining Frequent Patterns
Using FP-tree
Mining Frequent Patterns Using FP-tree
 General idea (divide-and-conquer)
Recursively grow frequent patterns using the FP-tree: looking for shorter
ones recursively and then concatenating the suffix:
 For each frequent item, construct its conditional pattern base, and then its
conditional FP-tree;
 Repeat the process on each newly created conditional FP-tree until the
resulting FP-tree is empty, or it contains only one path (single path will
generate all the combinations of its sub-paths, each of which is a frequent
pattern)
3 Major Steps
Starting the processing from the end of list L:
Step 1:
Construct conditional pattern base for each item in the header table
Step 2:
Construct conditional FP-tree from each conditional pattern base
Step 3:
Recursively mine conditional FP-trees and grow frequent patterns
obtained so far. If the conditional FP-tree contains a single path, simply
enumerate all the patterns
Step 1: Construct Conditional Pattern Base
 Starting at the bottom of frequent-item header table in the FP-tree
 Traverse the FP-tree by following the link of each frequent item
 Accumulate all of transformed prefix paths of that item to form a
conditional pattern base
Conditional pattern bases (read from the final FP-tree by following each
item's node-links from the header table):
item  cond. pattern base
p     fcam:2, cb:1
m     fca:2, fcab:1
b     fca:1, f:1, c:1
a     fc:3
c     f:3
f     { }
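Step 1 can also be sketched without walking the tree: aggregating the per-transaction prefixes of an item (in L-order) yields the same counts as accumulating the item's transformed prefix paths via node-links, since path counts are just sums over transactions. A sketch using the slide's example DB and L (function name is my own):

```python
from collections import Counter

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
L = ["f", "c", "a", "b", "m", "p"]        # frequency-descending order from the example

def cond_pattern_base(item, db, L):
    """Prefix paths preceding `item` in each frequency-ordered transaction."""
    base = Counter()
    for t in db:
        ordered = [i for i in L if i in t]   # keep only frequent items, in L-order
        if item in ordered:
            prefix = tuple(ordered[:ordered.index(item)])
            if prefix:
                base[prefix] += 1
    return base

cond_pattern_base("m", db, L)   # Counter({('f','c','a'): 2, ('f','c','a','b'): 1})
```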
Properties of FP-Tree
 Node-link property
 For any frequent item ai, all the possible frequent patterns that contain ai
can be obtained by following ai's node-links, starting from ai's head in
the FP-tree header.
 Prefix path property
 To calculate the frequent patterns for a node ai in a path P, only the
prefix sub-path of ai in P need to be accumulated, and its frequency
count should carry the same count as node ai.
Step 2: Construct Conditional FP-tree
 For each pattern base
 Accumulate the count for each item in the base
 Construct the conditional FP-tree for the frequent items of the pattern
base
m's conditional pattern base: fca:2, fcab:1.
Accumulating counts over this base gives f:3, c:3, a:3 (b:1 falls below the
minimum support), so the m-conditional FP-tree is the single path
{} -> f:3 -> c:3 -> a:3.
Step 3: Recursively mine the conditional FP-tree
Mining m's conditional FP-tree (pattern base fca:3) recursively adds "a",
"c", and "f" to grow the patterns:
 conditional FP-tree of "am": (fc:3); of "cm": (f:3); of "fm": 3
 conditional FP-tree of "cam": (f:3); of "fam": 3; of "fcm": 3
 conditional FP-tree of "fcam": 3
Each conditional FP-tree yields a frequent pattern: m, am, cm, fm, cam,
fam, fcm, fcam.
Principles of FP-Growth
 Pattern growth property
 Let α be a frequent itemset in DB, B be α's conditional pattern base,
and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β
is frequent in B.
 Is "fcabm" a frequent pattern?
 "fcab" is a branch of m's conditional pattern base
 "b" is NOT frequent in transactions containing "fcab"
 "bm" is NOT a frequent itemset.
Conditional Pattern Bases and
Conditional FP-Tree
Item  Conditional pattern base    Conditional FP-tree
f     Empty                       Empty
c     {(f:3)}                     {(f:3)}|c
a     {(fc:3)}                    {(f:3, c:3)}|a
b     {(fca:1), (f:1), (c:1)}     Empty
m     {(fca:2), (fcab:1)}         {(f:3, c:3, a:3)}|m
p     {(fcam:2), (cb:1)}          {(c:3)}|p
(items listed in the order of L)
Single FP-tree Path Generation
 Suppose an FP-tree T has a single path P.
 The complete set of frequent patterns of T can be generated by enumeration of
all the combinations of the sub-paths of P
m-conditional FP-tree: {} -> f:3 -> c:3 -> a:3
All frequent patterns concerning m: combination of {f, c, a} and m:
m,
fm, cm, am,
fcm, fam, cam,
fcam
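The single-path case can be sketched directly; the path below is the m-conditional FP-tree from the example (variable names are my own):

```python
from itertools import combinations

path = [("f", 3), ("c", 3), ("a", 3)]    # single path of the m-conditional FP-tree
suffix = ("m",)

patterns = [(suffix, 3)]                  # the suffix item itself is frequent
for k in range(1, len(path) + 1):
    for combo in combinations(path, k):
        items = tuple(i for i, _ in combo) + suffix
        supp = min(c for _, c in combo)   # support = min count along the chosen nodes
        patterns.append((items, supp))
# yields m, fm, cm, am, fcm, fam, cam, fcam -- all with support 3
```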
Summary of FP-Growth Algorithm
 Mining frequent patterns can be viewed as first mining 1-itemset and
progressively growing each 1-itemset by mining on its conditional pattern base
recursively
 Transform a frequent k-itemset mining problem into a sequence of k frequent 1-
itemset mining problems via a set of conditional pattern bases
Efficiency Analysis
Facts: usually
1. FP-tree is much smaller than the size of the DB
2. Pattern base is smaller than original FP-tree
3. Conditional FP-tree is smaller than pattern base
 Mining process works on a set of usually much smaller pattern
bases and conditional FP-trees
 Divide-and-conquer and dramatic shrinking of the data size
Performance Improvement
 Projected DBs: partition the DB into a set of projected DBs, then
construct an FP-tree and mine it in each projected DB.
 Disk-resident FP-tree: store the FP-tree on hard disks by using a B+ tree
structure to reduce I/O cost.
 FP-tree materialization: a low ξ may usually satisfy most of the mining
queries in the FP-tree construction.
 FP-tree incremental update: how to update an FP-tree when there are new
data? Either reconstruct the FP-tree, or do not update the FP-tree.
  • 2. What is Information Extraction? Goal: Extract structured information from unstructured (or loosely formatted) text. Typical description of task:  Identify named entities  Identify relations between entities  Populate a database May also include:  Event extraction  Resolution of temporal expressions  Wrapper induction (automatic construction of templates) Applications:  Natural language understanding,  Question-answering, summarization, etc.
  • 3. Information Extraction  IE extracts pieces of information that are salient to the user's needs  Find namedentities such as persons and organizations  Find find attributes of those entities or events they participate in  ContrastIR, which indicates which documents need to be read by a user  Links between the extracted information and the original documents are maintained to allow the user to reference context.
  • 4. Schematic view of the Information Extraction Process
  • 6. Relevant IE Definitions Entities:  Entities are the basic building blocks that can be found in text documents.(An object of interest)  Examples: people, companies, locations, genes, and drugs. Attributes:  Attributes are features of the extracted entities. (A property of an entity such as its name, alias, descriptor or type)  Examples: the title of a person, the age of a person, and the type of an organization.
  • 7. Relevant IE Definitions Facts:  Facts are the relations that exist between entities. (a relationship held between two or more entities such as the position of a person in a company)  Example: Employment relationship between a person and a company or phosphorylation between two proteins. Events:  An event is an activity or occurrence of interest in which entities participate  An activity involving several entities such as a terrorist act, airline crash, management change, new product introduction a merger between two companies, a birthday and so on.
  • 8.
  • 9. IE - Method  Extract raw text(html, pdf, ps, gif.)  Tokenize  Detect term boundaries  We extracted alpha 1 type XIII collagen from …  Their house council recommended…  Detect sentence boundaries  Tag parts of speech (POS)  John/noun saw/verb Mary/noun.  Tag named entities  Person, place, organization, gene, chemical.  Parse  Determine co-reference  Extract knowledge
  • 10. Architecture: Components of IE Systems  Core linguistic components, adapted to or be useful for NLP tasks in general  IE-specific components, address the core IE tasks.  Domain-Independent  Domain-specific components  The following steps are performed in Domain-Independent part:  Meta-data analysis:  Extraction of the title, body, structure of the body (identification of paragraphs), and the date of the document.  Tokenization:  Segmentation of the text into word-like units, called tokens and classification of their type, e.g., identification of capitalized words, words written in lowercase letters, hyphenated words, punctuation signs, numbers, etc.
  • 11. Architecture: Components of IE Systems  Morphological analysis:  Extraction of morphological information from tokens which constitute potential word forms-the base form(or lemma), part of speech, other morphological tags depending on the part of speech.  e.g., verbs have features such as tense, mood, aspect, person, etc.  Words which are ambiguous with respect to certain morphological categories may undergo disambiguation. Typically part-of-speech disambiguation is performed.  Sentence/Utterance boundary detection:  Segmentation of text into a sequence of sentences or utterances, each of which is represented as a sequence of lexical items together with their features.  Common Named-entity extraction:  Detection of domain-independent named entities, such as temporal expressions, numbers and currency, geographical references, etc.
  • 12. Architecture: Components of IE Systems  Phrase recognition:  Recognition of small-scale, local structures such as noun phrases, verb groups, prepositional phrases, acronyms, and abbreviations.  Syntactic analysis:  Computation of a dependency structure (parse tree) of the sentence based on the sequence of lexical items and small-scale structures.  Syntactic analysis may be deep or shallow.  In the former case, compute all possible interpretations (parse trees) and grammatical relations within the sentence.  In the latter case, the analysis is restricted to identification of non-recursive structures or structures with limited amount of structural recursion, which can be identified with a high degree of certainty, and linguistic phenomena which cause problems (ambiguities) are not handled and represented with underspecified structures.
  • 13. Architecture: Components of IE Systems  The core IE tasks:  NER,  Co-reference resolution, and  Detection of relations and events Typically domain-specific, and are supported by domain-specific system components and resources.  Domain-specific processing is also supported on a lower level by detection of specialized terms in text.  Architecture: IE System  In the domain specific core of the processing chain, a NER component is applied to identify the entities relevant in a given domain.  Patterns may then be applied to:  Identify text fragments, which describe the target relations and events, and  Extract the key attributes to fill the slots in the template representing the relation/event.
  • 14. IE System - Architecture
  • 15. Typical Architecture of an Information Extraction System
  • 16. Architecture: Components of IE Systems  A co-reference component identifies mentions that refer to the same entity.  Partially-filled templates are fused and validated using domain-specific inference rules in order to create full-fledged relation/event descriptions.  Several software packages to provide various tools that can be used in the process of developing an IE system, ranging from core linguistic processing modules (e.g., language detectors, sentence splitters), to general IE-oriented NLP frameworks.
  • 17. IE Task Types  Named Entity Recognition (NER)  Co-reference Resolution (CO)  Relation Extraction (RE)  Event Extraction (EE)
  • 18. Named Entity Recognition  Named Entity Recognition (NER) addresses the problem of the identification (detection) and classification of predefined types of named entities,  Such as organizations (e.g., ‘World Health Organisation’), persons (e.g., ‘Mohamad Gouse’), place names (e.g., ‘the Baltic Sea’), temporal expressions (e.g., ‘15 January 1984’), numerical and currency expressions (e.g., ‘20 MillionEuros’), etc.  NER task include extracting descriptive information from the text about the detected entities through filling of a small-scale template.  Example, in the case of persons, it may include extracting the title, position, nationality, gender, and other attributes of the person.  NER also involves lemmatization (normalization) of the named entities, which is particularly crucial in highly inflective languages.  Example in Polish there are six inflected forms of the name ‘Mohamad Gouse’ depending on grammatical case: ‘Mohamad Gouse’ (nominative), ‘Mohamad Gouseego’ (genitive), Mohamad Gouseemu (dative), ‘Mohamad Gouseiego’ (accusative), Mohamad Gousem (instrumental), Mohamad Gousem (locative), Mohamad Gouse(vocative).
  • 19. Co-Reference  Co-reference Resolution (CO) requires the identification of multiple (coreferring) mentions of the same entity in the text.  Entity mentions can be:  (a) Named, in case an entity is referred to by name  e.g., ‘General Electric’ and ‘GE’ may refer to the same real-world entity.  (b) Pronominal, in case an entity is referred to with a pronoun  e.g., in ‘John bought food. But he forgot to buy drinks.’, the pronoun he refers to John.  (c) Nominal, in case an entity is referred to with a nominal phrase  e.g., in ‘Microsoft revealed its earnings. The company also unveiled future plans.’ the definite noun phrase The company refers to Microsoft.  (d) Implicit, as in case of using zero-anaphora1  e.g., in the Italian text fragment ‘OEBerlusconii ha visitato il luogo del disastro. i Ha sorvolato, con l’elicottero.’  (Berlusconi has visited the place of disaster. [He] flew over with a helicopter.) the second sentence does not have an explicit realization of the reference to Berlusconi.
  • 20. Relation Extraction  Relation Extraction (RE) is the task of detecting and classifying predefined relationships between entities identified in text.  For example:  EmployeeOf(Steve Jobs,Apple): a relation between a person and an organisation, extracted from ‘Steve Jobs works for Apple’  LocatedIn(Smith,New York): a relation between a person and location, extracted from ‘Mr. Smith gave a talk at the conference in New York’,  SubsidiaryOf(TVN,ITI Holding): a relation between two companies, extracted from ‘Listed broadcaster TVN said its parent company, ITI Holdings, is considering various options for the potential sale.  The set of relations that may be of interest is unlimited, the set of relations within a given task is predefined and fixed, as part of the specification of the task.
  • 21. Event Extraction  Event Extraction (EE) refers to the task of identifying events in free text and deriving detailed and structured information about them, ideally identifying who did what to whom, when, where, through what methods (instruments), and why.  Usually, event extraction involves extraction of several entities and relationships between them.  For instance, extraction of information on terrorist attacks from the text fragment ‘Masked gunmen armed with assault rifles and grenades attacked a wedding party in mainly Kurdish southeast Turkey, killing at least 44 people.’  Involves identification of perpetrators (masked gunmen), victims (people), number of killed/injured (at least 44), weapons and means used (rifles and grenades), and location (southeast Turkey).  Another example is the extraction of information on new joint ventures, where the aim is to identify the partners, products, profits and capitalization of the joint venture.  EE is considered to be the hardest of the four IE tasks.
  • 22. IE Subtask: Named Entity Recognition  Detect and classify all proper names mentioned in text  What is a proper name? Depends on application.  People, places, organizations, times, amounts, etc.  Names of genes and proteins  Names of college courses
  • 23. NER Example  Find extent of each mention  Classify each mention  Sources of ambiguity  Different strings that map to the same entity  Equivalent strings that map to different entities (e.g., U.S. Grant)
  • 24. Approaches to NER  Early systems: hand-written rules  Statistical systems  Supervised learning (HMMs, Decision Trees, MaxEnt, SVMs, CRFs)  Semi-supervised learning (bootstrapping)  Unsupervised learning (rely on lexical resources, lexical patterns, and corpus statistics)
  • 25. A Sequence-Labeling Approach using CRFs  Input: Sequence of observations (tokens/words/text)  Output: Sequence of states (labels/classes)  B: Begin  I: Inside  O: Outside  Some evidence that including L (Last) and U (Unit length) is advantageous (Ratinov and Roth 09)  CRFs defines a conditional probability p(Y|X) over label sequences Y given an observation sequence X  No effort wasted modeling the observations (in contrast to joint models like HMMs)  Arbitrary features of the observations may be captured by the model
  • 26. Linear Chain CRFs  Simplest and most common graph structure, used for sequence modeling  Inference can be done efficiently using dynamic programming O(|X||Y|2)
  • 28. NER Features  Several feature families used, all time-shifted by -2, -1, 0, 1, 2:  The word itself  Capitalization and digit patterns (shape patterns)  8 lexicons entered by hand (e.g., honorifics, days, months)  15 lexicons obtained from web sites (e.g., countries, publicly-traded companies, surnames, stopwords, universities)  25 lexicons automatically induced from the web (people names, organizations, NGOs, nationalities)
  • 29. Limitations of Conventional NER(and IE)  Supervised learning  Expensive  Inconsistent  Worse for relations and events!  Fixed, narrow, pre-specified sets of entity types  Small, homogeneous corpora (newswire, seminar announcements)
  • 30. Evaluating Named Entity Recognition  Recall that recall is the ratio of the number of correctly labeled responses to the total that should have been labeled.  Precision is the ratio of the number of correctly labeled responses to the total labeled.  The F-measure provides a way to combine these two measures into a single metric. key correct N N recall  incorrectcorrect correct NN N precision   recallprecision recallprecision F    2 2 )1(  
  • 31. What is Relation Extraction?  Typically defined as identifying relations between two entities Relations Subtypes Examples Affiliations Personal Organizational Artifactual married to, mother of spokesman for, president of owns, invented, produces Geospatial Proximity Directional near, on outskirts southeast of Part-of Organizational Political a unit of, parent of annexed, acquired
  • 32. Typical (Supervised) Approach  FindEntities( ): Named entity recognizer  Related?( ): Binary classier that says whether two entities are involved in a relation  ClassifyRelation( ): Classier that labels relations discovered by Related?( )
  • 34. NELL: Never-Ending Language Learner NELL: Can computers learn to read?  Goal: create a system that learns to read the web  Reading task: Extract facts from text found on the web  Learning task: Iteratively improve reading competence.  http://rtw.ml.cmu.edu/rtw/
  • 35. Approach  Inputs  Ontology with target categories and relations (i.e., predicates)  Small number of seed examples for each  Set of constraints that couple the predicates  Large corpus of unlabeled documents  Output: new predicate instances  Semi-supervised bootstrap learning methods  Couple the learning of functions to constrain the problem  Exploit redundancy of information on the web.
  • 37. Types of Coupling 1. Mutual Exclusion (output constraint)  Mutually exclusive predicates can't both be satisfied by the same input x  E.g., x cannot be a Person and a Sport 2. Relation Argument Type-Checking (compositional constraint)  Arguments of relations declared to be of certain categories  E.g., CompanyIsInEconomicSector(Company, EconomicSector) 3. Unstructured and Semi-Structured Text Features (multi-view-agreement constraint)  Look at different views (like co-training)  Require classifiers agree  E.g., freeform textual contexts and semi-structured contexts
  • 39. Coupled Pattern Learner (CPL)  Free-text extractor that learns contextual patterns to extract predicate instances  Uses mutual-exclusion and type-checking constraints to filter candidate instances  Ranks instances and patterns by leveraging redundancy: if an instance or pattern occurs more frequently, it is ranked higher
  • 40. Coupled SEAL (CSEAL)  SEAL (Set Expander for Any Language) is a wrapper induction algorithm  Operates over semi-structured text such as web pages  Constructs page-specific extraction rules (wrappers) that are human- and markup-language independent  CSEAL adds mutual-exclusion and type-checking constraints
  • 41. CSEAL Wrappers  Seeds: Ford, Nissan, Toyota  arg1 is a placeholder for extracting instances
  • 42. Open IE and TextRunner  Motivations:  Web corpora are massive, introducing scalability concerns  Relations of interest are unanticipated, diverse, and abundant  Use of “heavy” linguistic technology (NERs and parsers) doesn't work well at this scale  Input: a large, heterogeneous Web corpus  9M web pages, 133M sentences  No pre-specified set of relations  Output: huge set of extracted relations  60.5M tuples, 11.3M high-probability tuples  Tuples are indexed for searching
  • 43. TextRunner Architecture  Learner outputs a classifier that labels trustworthy extractions  Extractor finds and outputs trustworthy extractions  Assessor normalizes and scores the extractions
  • 44. Architecture: Self-Supervised Learner 1. Automatically labels training data  Uses a parser to induce dependency structures  Parses a small corpus of several thousand sentences  Identifies and labels a set of positive and negative extractions using relation-independent heuristics  An extraction is a tuple t = (e_i, r_i,j, e_j)  Entities are base noun phrases  Uses the parse to identify potential relations 2. Trains a classifier  Domain-independent, simple non-parse features  E.g., POS tags, phrase chunks, regexes, stopwords, etc.
  • 45. Architecture: Single-Pass Extractor 1. POS tag each word 2. Identify entities using lightweight NP chunker 3. Identify relations 4. Classify them
  • 46. Architecture: Redundancy-Based Assessor  Take the tuples and perform  Normalization, deduplication, synonym resolution  Assessment  Number of distinct sentences from which each extraction was found serves as a measure of confidence  Entities and relations indexed using Lucene
  • 47. Template Filling  The task of template filling is to find documents that evoke such situations and then fill the slots in templates with appropriate material.  These slot fillers may consist of  Text segments extracted directly from the text, or  Concepts that have been inferred from text elements via some additional processing (times, amounts, entities from an ontology, etc.).
  • 48. Applications of IE  Infrastructure for IR and for Categorization  Information Routing  Event Based Summarization  Automatic Creation of Databases  Company acquisitions  Sports scores  Terrorist activities  Job listings  Corporate titles and addresses
  • 49. Inductive Algorithms for IE  Rule Induction algorithms produce symbolic IE rules based on a corpus of annotated documents.  WHISK  BWI  The (LP)2 Algorithm  The inductive algorithms are suitable for semi-structured domains, where the rules are fairly simple, whereas when dealing with free text documents (such as news articles) the probabilistic algorithms perform much better.
  • 50. WHISK  WHISK is a supervised learning algorithm that uses hand-tagged examples for learning information extraction rules.  Works for structured, semi-structured and free text.  Extracts both single-slot and multi-slot information.  Doesn't require syntactic preprocessing for structured and semi-structured text; a syntactic analyzer and semantic tagger are recommended for free text.  The extraction pattern learned by WHISK takes the form of a limited regular expression, balancing the tradeoff between expressiveness and efficiency.  Example: the IE task of extracting neighborhood, number of bedrooms and price from text
  • 51. WHISK  An example from the Rental Ads domain  An example extraction pattern that can be learned by WHISK is *(Neighborhood) *(Bedroom) * ‘$’(Number) where Neighborhood, Bedroom, and Number are semantic classes specified by domain experts.  WHISK learns the extraction rules using a top-down covering algorithm.  The algorithm begins learning a single rule by starting with an empty rule;  it then adds one term at a time until either no negative examples are covered by the rule or the pre-pruning criterion has been satisfied.
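A rough feel for such a pattern can be had with an ordinary regular expression. The character classes below are hypothetical stand-ins for the expert-defined semantic classes, and the non-greedy `.*?` plays the role of WHISK's unlimited-jump `*`:

```python
import re

# Hypothetical stand-ins for the Bedroom and Number semantic classes:
BEDROOM = r'(\d+)\s*BR'   # e.g. "3 BR"
NUMBER = r'(\d+)\s*\$'    # a number immediately before '$'

# WHISK-style pattern "* (Bedroom) * '$'(Number)" rendered as one regex:
rule = re.compile(r'.*?' + BEDROOM + r'.*?' + NUMBER)

ad = "3 BR, upper flr of turn of ctry. Incl gar, grt N. Hill loc 995$."
m = rule.search(ad)
print({'Bedrooms': m.group(1), 'Price': m.group(2)})
# {'Bedrooms': '3', 'Price': '995'}
```

A real WHISK rule additionally constrains the context around each slot; the sketch only shows how the parenthesized classes become numbered extraction slots.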
  • 52.  We add terms to specialize the rule in order to reduce its Laplacian expected error, defined as Laplacian = (e + 1) / (n + 1), where e is the number of negative extractions and n is the number of positive extractions on the training instances.  Example:  From the text “3 BR, upper flr of turn of ctry. Incl gar, grt N. Hill loc 995$. (206)-999-9999,” the rule would extract the frame Bedrooms – 3, Price – 995.  The “*” character in the pattern will match any number of characters (an unlimited jump).  Patterns enclosed in parentheses become numbered elements in the output pattern, and hence (Digit) is $1 and (Number) is $2.
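The Laplacian expected error is a one-liner; for example, a rule making no negative extractions over 9 positive ones scores (0 + 1) / (9 + 1) = 0.1:

```python
def laplacian_error(e, n):
    """Laplacian expected error of a rule: e negative (incorrect)
    extractions against n positive extractions on the training set."""
    return (e + 1) / (n + 1)

print(laplacian_error(0, 9))  # 0.1 -- a reliable rule
print(laplacian_error(2, 9))  # 0.3 -- same coverage, but with 2 errors
```

The +1 smoothing keeps a rule with very few extractions from looking perfectly reliable.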
  • 53. Boosted Wrapper Induction (BWI)  BWI is a system that utilizes wrapper induction techniques for traditional Information Extraction.  IE is treated as a classification problem that entails trying to approximate two boundary functions Xbegin(i) and Xend(i).  Xbegin(i) is equal to 1 if the ith token starts a field that is part of the frame to be extracted and 0 otherwise.  Xend(i) is defined in a similar way for tokens that end a field.  The learning algorithm approximates each X function by taking a set of pairs of the form (i, X(i)) as training data.
  • 54.  Each field is extracted by a wrapper W = ⟨F, A, H⟩ where  F is a set of begin boundary detectors  A is a set of end boundary detectors  H(k) is the probability that the field has length k  A boundary detector is just a sequence of tokens with wildcards (a kind of regular expression). The wrapper scores a candidate field spanning tokens i through j as W(i, j) = 1 if F(i) · A(j) · H(j − i + 1) exceeds a threshold, and 0 otherwise; this product is a naïve Bayesian approximation of the probability.  The BWI algorithm learns the two detectors by using a greedy algorithm that extends the prefix and suffix patterns while there is an improvement in the accuracy.  The combined detector scores F(i) = Σk c_k F_k(i) and A(i) = Σk a_k A_k(i) are generated from the individual detectors by using the AdaBoost algorithm.  The detector pattern can include specific words and regular expressions that work on a set of wildcards such as <num>, <Cap>, <LowerCase>, <Punctuation> and <Alpha>.
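A minimal sketch of the scoring, with hand-set detector outputs standing in for the AdaBoost-learned ones (all numbers here are illustrative, not from a trained BWI model):

```python
def wrapper_score(F, A, H, i, j):
    """Naive-Bayes-style score F(i) * A(j) * H(j - i + 1) for a candidate
    field spanning tokens i..j; BWI accepts the field if this is high enough."""
    return F.get(i, 0.0) * A.get(j, 0.0) * H.get(j - i + 1, 0.0)

# Toy detector outputs (in BWI these come from boosted boundary detectors):
F = {2: 0.9}   # begin-boundary confidence by token index
A = {4: 0.8}   # end-boundary confidence by token index
H = {3: 0.5}   # probability that the field has a given length

print(wrapper_score(F, A, H, 2, 4))  # 0.9 * 0.8 * 0.5
print(wrapper_score(F, A, H, 0, 4))  # 0.0 -- no begin detector fires at 0
```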
  • 55. (LP)2 Algorithm  The (LP)2 algorithm learns from an annotated corpus and induces two sets of rules:  Tagging rules, generated by a bottom-up generalization process  Correction rules, which correct mistakes and omissions made by the tagging rules.  A tagging rule is a pattern that contains conditions on words preceding the place where a tag is to be inserted and conditions on the words that follow the tag.  Conditions can be either words, lemmas, lexical categories (such as digit, noun, verb, etc), case (lower or upper), or semantic categories (such as time-id, cities, etc).  The (LP)2 algorithm is a covering algorithm that tries to cover all training examples.  The initial tagging rules are generalized by dropping conditions.
  • 56. IE and Text Summarization  User’s perspective,  IE can be glossed as "I know what specific pieces of information I want–just find them for me!",  Summarization can be glossed as "What’s in the text that is interesting?".  Technically, from the system builder’s perspective, the two applications blend into each other.  The most pertinent technical aspects are:  Are the criteria of interestingness specified at run-time or by the system builder?  Is the input a single document or multiple documents?  Is the extracted information manipulated, either by simple content delineation routines or by complex inferences, or just delivered verbatim?  What is the grain size of the extracted units of information–individual entities and events, or blocks of text?  Is the output formulated in language, or in a computer-internal knowledge representation?
  • 57. Text Summarization  An information access technology that, given a document or sets of related documents, extracts the most important content from the source(s) taking into account the user or task at hand, and presents this content as a well-formed and concise text
  • 58. Text Summarization Techniques  Topic Representation  Influence of Context  Indicator Representations  Pattern Extraction
  • 59. Text Summarization Input: one or more text documents Output: paragraph length summary  Sentence extraction is the standard method  Using features such as key words, sentence position in document, cue phrases  Identify sentences within documents that are salient  Extract and string sentences together  Machine learning for extraction  Corpus of document/summary pairs  Learn the features that best determine important sentences  Summarization of scientific articles
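A toy illustration of feature-based sentence extraction (keyword frequency plus a position bonus; the scoring weights and stopword list are arbitrary choices for the sketch, not from a real system):

```python
import re
from collections import Counter

STOP = {'the', 'a', 'an', 'of', 'to', 'and', 'in', 'is', 'it', 'that'}

def extract_summary(text, n=2):
    """Score sentences by average content-word frequency, with a small
    bonus for the leading sentence, and return the top n in document order."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(w for w in re.findall(r'[a-z]+', text.lower())
                   if w not in STOP)
    def score(pair):
        idx, sent = pair
        tokens = re.findall(r'[a-z]+', sent.lower())
        kw = sum(freq[t] for t in tokens if t not in STOP)
        return kw / max(len(tokens), 1) + (0.5 if idx == 0 else 0.0)
    top = sorted(sorted(enumerate(sentences), key=score, reverse=True)[:n])
    return ' '.join(s for _, s in top)

text = "Apples are tasty. Apples grow on trees. Bananas are yellow."
print(extract_summary(text))  # Apples are tasty. Bananas are yellow.
```

A learned extractor would replace the hand-set position bonus and keyword weights with feature weights trained on document/summary pairs.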
  • 60. A Summarization Machine [diagram] Input: DOC (or MULTIDOCS) plus QUERY; output: EXTRACTS or ABSTRACTS. Dimensions shown: Extract vs. Abstract; Indicative vs. Informative; Generic vs. Query-oriented; Background vs. Just the news; compression from Headline / Very Brief / Brief to Long (10%, 50%, 100%). Intermediate representations: case frames, templates, core concepts, core events, relationships, clause fragments, index terms.
  • 61. The Modules of the Summarization Machine [diagram] A DOC (or MULTIDOC) is processed by EXTRACTION and INTERPRETATION modules, producing extracts and intermediate representations (case frames, templates, core concepts, core events, relationships, clause fragments, index terms); GENERATION and FILTERING then produce the final abstracts and extracts.
  • 62. What is Summarization?  Data as input (database, software trace, expert system), text summary as output  Text as input (one or more articles), paragraph summary as output  Multimedia in input or output  Summaries must convey maximal information in minimal space  Involves: Three stages (typically)  Content identification  Find/Extract the most important material  Conceptual organization  Realization
  • 63. Types of summaries  Purpose  Indicative, informative, and critical summaries  Form  Extracts (representative paragraphs/sentences/phrases)  Abstracts: “a concise summary of the central subject matter of a document”.  Dimensions  Single-document vs. multi-document  Context  Query-specific vs. query-independent  Generic vs. query-oriented  provides author’s view vs. reflects user’s interest.
  • 64. Genres  Headlines  Outlines  Minutes  Biographies  Abridgments  Sound bites  Movie summaries  Chronologies, etc.
  • 65. Aspects that Describe Summaries  Input  subject type: domain  genre: newspaper articles, editorials, letters, reports...  form: regular text structure, free-form  source size: single doc, multiple docs (few,many)  Purpose  situation: embedded in larger system (MT, IR) or not?  audience: focused or general  usage: IR, sorting, skimming...  Output  completeness: include all aspects, or focus on some?  format: paragraph, table, etc.  style: informative, indicative, aggregative, critical...
  • 66. Single Document Summarization System Architecture [pipeline diagram] Input: single document → Extraction (extracted sentences) → Sentence reduction → Sentence combination → Generation → Output: summary. Supporting resources: corpus, decomposition, lexicon, parser, co-reference.
  • 67. Multi-Document Summarization  Monitor variety of online information sources  News, multilingual  Email  Gather information on events across source and time  Same day, multiple sources  Across time  Summarize  Highlighting similarities, new information, different perspectives, user specified interests in real-time
  • 68. Example System: SUMMARIST Three stages: 1. Topic Identification Modules: Positional Importance, Cue Phrases (under construction), Word Counts, Discourse Structure (under construction), ... 2. Topic Interpretation Modules: Concept Counting /Wavefront, Concept Signatures (being extended) 3. Summary Generation Modules (not yet built): Keywords, Template Gen, Sent. Planner & Realizer SUMMARY = TOPIC ID + INTERPRETATION + GENERATION
  • 69. Topic Interpretation  From extract to abstract: topic interpretation or concept fusion.  Experiment (Marcu, 98):  Got 10 newspaper texts, with human abstracts.  Asked 14 judges to extract corresponding clauses from texts, to cover the same content.  Compared word lengths of extracts to abstracts: extract_length ≈ 2.76 × abstract_length !!
  • 70. Some Types of Interpretation  Concept generalization: Sue ate apples, pears, and bananas → Sue ate fruit  Meronymy replacement: Both wheels, the pedals, saddle, chain… → the bike  Script identification: He sat down, read the menu, ordered, ate, paid, and left → He ate at the restaurant  Metonymy: A spokesperson for the US Government announced that… → Washington announced that...
  • 71. General Aspects of Interpretation  Interpretation occurs at the conceptual level... …words alone are polysemous (bat → animal and sports instrument) and combine for meaning (an alleged murderer is not necessarily a murderer).  For interpretation, you need world knowledge... …the fusion inferences are not in the text!
  • 72. Pattern Extraction  Extract a pattern for each event in training data  part of speech & mention tags  Example: “Japanese political leaders”  Text: Japanese political leaders; entity tags: GPE _ PER; POS: NN JJ NN; extracted pattern: GPE JJ PER
  • 73. Summarization - Scope  Data preparation:  Collect large sets of texts with abstracts, all genres.  Build large corpora of <Text, Abstract, Extract> tuples.  Investigate relationships between extracts and abstracts (using <Extract, Abstract> tuples).  Types of summary:  Determine characteristics of each type.  Topic Identification:  Develop new identification methods (discourse, etc.).  Develop heuristics for method combination (train heuristics on <Text, Extract> tuples).
  • 74. Summarization - Scope  Concept Interpretation (Fusion):  Investigate types of fusion (semantic, evaluative…).  Create large collections of fusion knowledge/rules (e.g., signature libraries, generalization and partonymic hierarchies, metonymy rules…).  Study incorporation of User’s knowledge in interpretation.  Generation:  Develop Sentence Planner rules for dense packing of content into sentences (using <Extract, Abstract> pairs).  Evaluation:  Develop better evaluation metrics, for types of summaries.
  • 75. Apriori Algorithm  In computer science and data mining, Apriori is a classic algorithm for learning association rules.  Apriori is designed to operate on databases containing transactions.  Examples: collections of items bought by customers, or details of website visits.  The algorithm attempts to find subsets which are common to at least a minimum number C (the cutoff, or confidence threshold) of the itemsets.  Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data.  The algorithm terminates when no further successful extensions are found.  Apriori uses breadth-first search and a hash tree structure to count candidate item sets efficiently.
  • 76. Find rules in two stages Agrawal et al. divided the problem of finding good rules into two phases: 1. Find all itemsets with a specified minimal support (coverage). An itemset is just a specific set of items, e.g. {apples, cheese}. The Apriori algorithm can efficiently find all itemsets whose coverage is above a given minimum. 2. Use these itemsets to help generate interesting rules. Having done stage 1, we have considerably narrowed down the possibilities, and can do reasonably fast processing of the large itemsets to generate candidate rules.
  • 77. Terminology k-itemset : a set of k items. E.g. {beer, cheese, eggs} is a 3-itemset {cheese} is a 1-itemset {honey, ice-cream} is a 2-itemset support: an itemset has support s% if s% of the records in the DB contain that itemset. minimum support: the Apriori algorithm starts with the specification of a minimum level of support, and will focus on itemsets with this level or above.
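Support is easy to compute directly; a small sketch over a toy transaction DB built from the itemsets above:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

db = [{'beer', 'cheese', 'eggs'},
      {'cheese', 'honey'},
      {'beer', 'cheese'},
      {'honey', 'ice-cream'}]

print(support({'cheese'}, db))          # 0.75
print(support({'beer', 'cheese'}, db))  # 0.5
```

With a minimum support of 50%, both {cheese} and {beer, cheese} would count as large itemsets in this toy DB.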
  • 78. Terminology large itemset: doesn't mean an itemset with many items. It means one whose support is at least minimum support. Lk : the set of all large k-itemsets in the DB. Ck : a set of candidate large k-itemsets. The algorithm generates this set, which contains all the k-itemsets that might be large, and then filters it down to Lk.
  • 79. Terminology sets: Let A be a set (A = {cat, dog}) and let B be a set (B = {dog, eel, rat}) and let C = {eel, rat} I use ‘A + B’ to mean A union B. So A + B = {cat, dog, eel, rat} When X is a subset of Y, I use Y – X to mean the set of things in Y which are not in X. E.g. B – C = {dog}
  • 80. Apriori Algorithm
Find all large 1-itemsets
For (k = 2; while Lk-1 is non-empty; k++) {
  Ck = apriori-gen(Lk-1)
  For each c in Ck, initialise c.count to zero
  For all records r in the DB {
    Cr = subset(Ck, r)
    For each c in Cr, c.count++
  }
  Set Lk := all c in Ck whose count >= minsup
} /* end */
Return all of the Lk sets. The algorithm returns all of the (non-empty) Lk sets, which gives us an excellent start in finding interesting rules (although the large itemsets themselves will usually be very interesting and useful).
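A compact, runnable rendering of the loop above, where apriori-gen joins Lk-1 with itself and prunes candidates having an infrequent (k-1)-subset. The transaction DB is the one used later in the frequent-pattern example, with minimum support count 3:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: support_count} for all large itemsets."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                      # find all large 1-itemsets
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    L = {s: c for s, c in counts.items() if c >= minsup}
    result, k = dict(L), 2
    while L:                                    # while Lk-1 is non-empty
        # apriori-gen: join Lk-1 with itself, keep k-sized unions...
        Ck = {a | b for a, b in combinations(L, 2) if len(a | b) == k}
        # ...and prune candidates having an infrequent (k-1)-subset
        Ck = {c for c in Ck
              if all(frozenset(s) in L for s in combinations(c, k - 1))}
        counts = {c: 0 for c in Ck}
        for r in transactions:                  # count candidates in the DB
            for c in Ck:
                if c <= r:
                    counts[c] += 1
        L = {s: c for s, c in counts.items() if c >= minsup}
        result.update(L)
        k += 1
    return result

db = [['f', 'a', 'c', 'd', 'g', 'i', 'm', 'p'],
      ['a', 'b', 'c', 'f', 'l', 'm', 'o'],
      ['b', 'f', 'h', 'j', 'o'],
      ['b', 'c', 'k', 's', 'p'],
      ['a', 'f', 'c', 'e', 'l', 'p', 'm', 'n']]
large = apriori(db, 3)
print(sorted(''.join(sorted(s)) for s in large if len(s) >= 3))
# ['acf', 'acfm', 'acm', 'afm', 'cfm']
```

Note the many passes over the DB: one per itemset size k, which is exactly the cost FP-growth is designed to avoid.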
  • 81. Example: Generation of candidate itemsets and frequent itemsets, where the minimum support count is 2.
  • 82. Apriori Merits/Demerits  Merits  Uses large itemset property  Easily parallelized  Easy to implement  Demerits  Assumes transaction database is memory resident.  Requires many database scans.
  • 83. Summary  Association Rules form a widely applied data mining approach.  Association Rules are derived from frequent itemsets.  The Apriori algorithm is an efficient algorithm for finding all frequent itemsets.  The Apriori algorithm implements level-wise search using the frequent item property.  The Apriori algorithm can be additionally optimized.  There are many measures for association rules.
  • 85. Frequent Pattern Mining: An Example Given a transaction database DB and a minimum support threshold ξ, find all frequent patterns (item sets) with support no less than ξ.
Input DB (minimum support ξ = 3):
TID 100: {f, a, c, d, g, i, m, p}
TID 200: {a, b, c, f, l, m, o}
TID 300: {b, f, h, j, o}
TID 400: {b, c, k, s, p}
TID 500: {a, f, c, e, l, p, m, n}
Output: all frequent patterns, i.e., f, a, …, fa, fac, fam, fm, am, …
Problem statement: how can we efficiently find all frequent patterns?
  • 86.  Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure  highly compacted, but complete for frequent pattern mining  avoid costly repeated database scans  Develop an efficient, FP-tree-based frequent pattern mining method (FP- growth)  A divide-and-conquer methodology: decompose mining tasks into smaller ones  Avoid candidate generation: sub-database test only. Overview of FP-Growth: Ideas
  • 88. Construct FP-tree Two Steps: 1. Scan the transaction DB for the first time, find frequent items (single item patterns) and order them into a list L in frequency descending order. e.g., L={f:4, c:4, a:3, b:3, m:3, p:3} In the format of (item-name, support) 2. For each transaction, order its frequent items according to the order in L; Scan DB the second time, construct FP-tree by putting each frequency ordered transaction onto it.
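The two scans above can be sketched as follows (ties among equally frequent items may be ordered differently than on the slides):

```python
from collections import Counter

def order_transactions(db, minsup):
    """First scan: build L, the frequent items in descending frequency.
    Second scan: keep only frequent items in each transaction, in L's order."""
    freq = Counter(item for t in db for item in t)
    L = [item for item, c in sorted(freq.items(), key=lambda kv: -kv[1])
         if c >= minsup]
    rank = {item: r for r, item in enumerate(L)}
    ordered = [[i for i in sorted(t, key=lambda x: rank.get(x, len(rank)))
                if i in rank] for t in db]
    return L, ordered

db = [['f', 'a', 'c', 'd', 'g', 'i', 'm', 'p'],
      ['a', 'b', 'c', 'f', 'l', 'm', 'o'],
      ['b', 'f', 'h', 'j', 'o'],
      ['b', 'c', 'k', 's', 'p'],
      ['a', 'f', 'c', 'e', 'l', 'p', 'm', 'n']]
L, ordered = order_transactions(db, 3)
print(ordered[0])  # ['f', 'c', 'a', 'm', 'p']
```

Each reordered transaction is then inserted into the FP-tree as one root-to-leaf path, sharing prefixes with previously inserted transactions.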
  • 89. FP-tree Example: step 1 Step 1: Scan the DB for the first time to generate L (a by-product of the first scan of the database).
TID 100: {f, a, c, d, g, i, m, p}; TID 200: {a, b, c, f, l, m, o}; TID 300: {b, f, h, j, o}; TID 400: {b, c, k, s, p}; TID 500: {a, f, c, e, l, p, m, n}
Item frequencies: f: 4, c: 4, a: 3, b: 3, m: 3, p: 3 → L = {f:4, c:4, a:3, b:3, m:3, p:3}
  • 90. FP-tree Example: step 2 Step 2: scan the DB for the second time, ordering the frequent items in each transaction:
TID 100: {f, a, c, d, g, i, m, p} → {f, c, a, m, p}
TID 200: {a, b, c, f, l, m, o} → {f, c, a, b, m}
TID 300: {b, f, h, j, o} → {f, b}
TID 400: {b, c, k, s, p} → {c, b, p}
TID 500: {a, f, c, e, l, p, m, n} → {f, c, a, m, p}
  • 91. FP-tree Example: step 2 Step 2: construct the FP-tree. Inserting {f, c, a, m, p} gives the path root → f:1 → c:1 → a:1 → m:1 → p:1. Inserting {f, c, a, b, m} increments the shared prefix to f:2 → c:2 → a:2 and adds the new branch b:1 → m:1 beside the existing m:1 → p:1. NOTE: Each transaction corresponds to one path in the FP-tree
  • 92. FP-tree Example: step 2 Step 2: construct the FP-tree (continued). Inserting {f, b} increments f to f:3 and adds b:1 directly under f. Inserting {c, b, p} starts a new branch from the root: c:1 → b:1 → p:1. Node-links connect all nodes carrying the same item-name.
  • 93. Construction Example Final FP-tree: root → f:4 → c:3 → a:3 → m:2 → p:2, with side branches a:3 → b:1 → m:1 and f:4 → b:1, plus a second root branch c:1 → b:1 → p:1. Header table: items f, c, a, b, m, p, each with the head of its node-link chain.
  • 94. FP-Tree Definition  FP-tree is a frequent pattern tree.  Formally, FP-tree is a tree structure defined below: 1. One root labeled as “null”, a set of item prefix sub-trees as the children of the root, and a frequent-item header table. 2. Each node in the item prefix sub-trees has three fields:  item-name: registers which item this node represents,  count: the number of transactions represented by the portion of the path reaching this node,  node-link: links to the next node in the FP-tree carrying the same item-name, or null if there is none. 3. Each entry in the frequent-item header table has two fields,  item-name, and  head of node-link, which points to the first node in the FP-tree carrying the item-name.
  • 95. Advantages of the FP-tree Structure  The most significant advantage of the FP-tree  Scan the DB only twice and twice only.  Completeness:  The FP-tree contains all the information related to mining frequent patterns (given the min-support threshold).  Compactness:  The size of the tree is bounded by the occurrences of frequent items  The height of the tree is bounded by the maximum number of items in a transaction
  • 97. Mining Frequent Patterns Using FP-tree  General idea (divide-and-conquer) Recursively grow frequent patterns using the FP-tree: looking for shorter ones recursively and then concatenating the suffix:  For each frequent item, construct its conditional pattern base, and then its conditional FP-tree;  Repeat the process on each newly created conditional FP-tree until the resulting FP-tree is empty, or it contains only one path (single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)
  • 98. 3 Major Steps Starting the processing from the end of list L: Step 1: Construct conditional pattern base for each item in the header table Step 2: Construct conditional FP-tree from each conditional pattern base Step 3: Recursively mine conditional FP-trees and grow frequent patterns obtained so far. If the conditional FP-tree contains a single path, simply enumerate all the patterns
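The three steps can be sketched without materializing the physical tree: the recursion below works directly on conditional pattern bases kept as (prefix-path, count) lists, which is the same divide-and-conquer idea. Run on the slide's DB with ξ = 3 it recovers all 18 frequent patterns, including fcam:

```python
from collections import Counter

def fp_growth(transactions, minsup):
    """Pattern growth over conditional pattern bases (projected DBs);
    returns {frozenset(pattern): support}."""
    def mine(db, suffix, out):
        # db: list of (items_in_L_order, count)
        freq = Counter()
        for items, count in db:
            for item in items:
                freq[item] += count
        for item, sup in freq.items():
            if sup < minsup:
                continue
            pattern = suffix | {item}
            out[pattern] = sup
            # Step 1: conditional pattern base = prefix paths of `item`
            cond = [(items[:items.index(item)], count)
                    for items, count in db
                    if item in items and items.index(item) > 0]
            # Steps 2+3: recurse on the conditional base
            if cond:
                mine(cond, pattern, out)
        return out

    # First scan: global frequency order L (descending support)
    freq = Counter(i for t in transactions for i in t)
    order = sorted(freq.items(), key=lambda kv: -kv[1])
    rank = {i: r for r, (i, c) in enumerate(order) if c >= minsup}
    db = [(tuple(sorted((i for i in t if i in rank), key=rank.get)), 1)
          for t in transactions]
    return mine([row for row in db if row[0]], frozenset(), {})

db = [['f', 'a', 'c', 'd', 'g', 'i', 'm', 'p'],
      ['a', 'b', 'c', 'f', 'l', 'm', 'o'],
      ['b', 'f', 'h', 'j', 'o'],
      ['b', 'c', 'k', 's', 'p'],
      ['a', 'f', 'c', 'e', 'l', 'p', 'm', 'n']]
patterns = fp_growth(db, 3)
print(len(patterns), patterns[frozenset('fcam')])  # 18 3
```

The real FP-growth gains its efficiency from the shared-prefix compression of the tree; this sketch trades that compression for brevity while keeping the recursion structure.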
  • 99. Step 1: Construct Conditional Pattern Base  Starting at the bottom of the frequent-item header table in the FP-tree  Traverse the FP-tree by following the link of each frequent item  Accumulate all of the transformed prefix paths of that item to form a conditional pattern base
Conditional pattern bases (read off the FP-tree of slide 93 via its header table):
p: fcam:2, cb:1
m: fca:2, fcab:1
b: fca:1, f:1, c:1
a: fc:3
c: f:3
f: { }
  • 100. Properties of FP-Tree  Node-link property  For any frequent item ai, all the possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's head in the FP-tree header.  Prefix path property  To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P need to be accumulated, and its frequency count should carry the same count as node ai.
  • 101. Step 2: Construct Conditional FP-tree  For each pattern base  Accumulate the count for each item in the base  Construct the conditional FP-tree for the frequent items of the pattern base  Example: m's conditional pattern base is fca:2, fcab:1; accumulating counts gives f:3, c:3, a:3, b:1, and dropping the infrequent b yields the m-conditional FP-tree: root → f:3 → c:3 → a:3
  • 102. Step 3: Recursively mine the conditional FP-tree  Conditional FP-tree of “m”: (fca:3). Adding “f”, “c”, or “a” gives the conditional FP-trees of “fm”: 3, “cm”: (f:3), and “am”: (fc:3). Adding “f” to “cm” gives “fcm”: 3; adding “c” or “f” to “am” gives “cam”: (f:3) and “fam”: 3; adding “f” to “cam” gives “fcam”: 3. Each step yields a frequent pattern, culminating in the frequent pattern fcam.
  • 103. Principles of FP-Growth  Pattern growth property  Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B.  Is “fcabm” a frequent pattern?  “fcab” is a branch of m's conditional pattern base  “b” is NOT frequent in transactions containing “fcab”  “bm” is NOT a frequent itemset.
  • 104. Conditional Pattern Bases and Conditional FP-Trees (in order of L)
Item | Conditional pattern base | Conditional FP-tree
f | Empty | Empty
c | {(f:3)} | {(f:3)}|c
a | {(fc:3)} | {(f:3, c:3)}|a
b | {(fca:1), (f:1), (c:1)} | Empty
m | {(fca:2), (fcab:1)} | {(f:3, c:3, a:3)}|m
p | {(fcam:2), (cb:1)} | {(c:3)}|p
  • 105. Single FP-tree Path Generation  Suppose an FP-tree T has a single path P.  The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P  Example: the m-conditional FP-tree is the single path root → f:3 → c:3 → a:3, so all frequent patterns concerning m are the combinations of {f, c, a} concatenated with m: m, fm, cm, am, fcm, fam, cam, fcam
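The single-path case is just an enumeration of subsets, e.g. for the m-conditional FP-tree:

```python
from itertools import combinations

def single_path_patterns(path, suffix):
    """All patterns from a single-path conditional FP-tree: every
    combination of the path's items, concatenated with the suffix."""
    return [set(combo) | set(suffix)
            for r in range(len(path) + 1)
            for combo in combinations(path, r)]

# The m-conditional FP-tree is the single path f -> c -> a:
pats = single_path_patterns(['f', 'c', 'a'], ['m'])
print(len(pats))  # 8: m, fm, cm, am, fcm, fam, cam, fcam
```

A path of k items yields 2^k patterns, which is why the recursion can stop as soon as a conditional tree degenerates to a single path.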
  • 106. Summary of FP-Growth Algorithm  Mining frequent patterns can be viewed as first mining 1-itemset and progressively growing each 1-itemset by mining on its conditional pattern base recursively  Transform a frequent k-itemset mining problem into a sequence of k frequent 1- itemset mining problems via a set of conditional pattern bases
  • 107. Efficiency Analysis Facts: usually 1. The FP-tree is much smaller than the size of the DB 2. A pattern base is smaller than the original FP-tree 3. A conditional FP-tree is smaller than its pattern base  The mining process works on a set of usually much smaller pattern bases and conditional FP-trees  Divide-and-conquer, with a dramatic scale of shrinking
  • 108. Performance Improvement
 Projected DBs: partition the DB into a set of projected DBs, then construct and mine an FP-tree in each projected DB.
 Disk-resident FP-tree: store the FP-tree on hard disk using a B+-tree structure to reduce I/O cost.
 FP-tree materialization: a low ξ may usually satisfy most of the mining queries in the FP-tree construction.
 FP-tree incremental update: how to update an FP-tree when there is new data? Reconstruct the FP-tree, or do not update the FP-tree.