Natural Language Processing
for the Semantic Web
University of Babylon
College of Information Technology
Department of Software
Supervised by
Prof. Dr. Assad Sabah
Presented by PhD postgraduate student
Alyaa Talib Raheem
• INTRODUCTION
• INFORMATION EXTRACTION
• AMBIGUITY
• APPROACHES TO LINGUISTIC PROCESSING
• NLP PIPELINES
• TOKENIZATION
• SENTENCE SPLITTING
• POS TAGGING
• MORPHOLOGICAL ANALYSIS AND STEMMING
• SYNTACTIC PARSING
• CHUNKING
• Named Entity Recognition and Classification
• CHALLENGES IN NERC
• APPROACHES TO NERC AND PIPELINES
• Relation Extraction
• RELATION EXTRACTION PIPELINE
• THE ROLE OF KNOWLEDGE BASES IN RELATION EXTRACTION
INTRODUCTION
 Natural Language Processing (NLP) is the automatic processing of text written in
natural (human) languages (English, French, Chinese, etc.), as opposed to artificial
languages such as programming languages, with the aim of getting machines to “understand” it.
 It is also known as Computational Linguistics (CL) or Natural Language Engineering
(NLE).
 NLP includes a wide range of tasks:
low-level tasks, such as segmenting text into sentences and words.
high-level complex applications, such as semantic annotation and opinion mining.
 NLP techniques provide a way to enhance web data with semantics, for example by
linking entities such as “Barack Obama” to concepts such as “Politician,” and by extracting
relations describing how entities relate to one another.
 One of the problems of NLP is that tools almost always need adapting to specific
domains and tasks, and for real-world applications this is often easier with rule-based
systems.
INFORMATION EXTRACTION
 Information extraction is the process of extracting information from unstructured text
and turning it into structured data.
 The information contained in the structured knowledge base can then be used
as a resource for other tasks, such as answering natural language queries or
improving on standard search engines with deeper or more implicit forms of
knowledge than that expressed in text.
 The information contained in text is of several types:
1- Named entities (NEs), such as persons, locations, and organizations, typically
expressed as proper names.
2- Temporal expressions, such as dates and times, which are also often considered as
named entities.
INFORMATION EXTRACTION
 Named entities are connected together by means of relations. A more complex
type of information is the event, which can be seen as a group of relations grounded
in time.
 Information extraction is a difficult process because there are many ways of
expressing the same facts.
INFORMATION EXTRACTION
Information extraction typically consists of a sequence of tasks: linguistic
pre-processing, named entity recognition (NER), relation extraction, and event
extraction.
AMBIGUITY
 Ambiguous language means that more than one interpretation is possible,
either syntactically or semantically.
 Computers cannot easily rely on world knowledge and common sense, so
they have to use statistical or other techniques to resolve ambiguity.
 Some kinds of text, such as newspaper headlines and messages on social
media, are often designed to be deliberately ambiguous for entertainment
value or to make them more memorable.
 Some classic examples of ambiguity are shown below:
 There are two main kinds of approach to linguistic processing tasks: a knowledge-based
approach and a machine learning approach.
APPROACHES TO LINGUISTIC PROCESSING
Table 1: knowledge-based vs. machine learning approaches to NLP
 An NLP pre-processing pipeline, as shown in Figure 2.1, typically consists of the
following components:
 There are tools for chunking and parsing: identifying units such as noun and verb
phrases in the case of chunking, or performing a more detailed analysis of grammatical
structure in the case of full parsing. Examples include GATE, Stanford CoreNLP, OpenNLP, and NLTK.
NLP PIPELINES
 Concerning toolkits:
1- GATE provides a number of open-source linguistic pre-processing components and contains
a ready-made pipeline for Information Extraction, called ANNIE, under the LGPL license.
2- Stanford CoreNLP is another open-source annotation pipeline framework, available
under the GPL license, which can perform all the core linguistic processing, via a
simple Java API.
3-OpenNLP is an open-source machine learning-based toolkit for language processing,
which uses maximum entropy and Perceptron-based classifiers. It is freely available
under the Apache license.
4- NLTK is an open-source Python-based toolkit, available under the Apache license. It
provides a number of different variants for some components, both rule-based and
learning-based.
NLP PIPELINES
 Tokenization is the task of splitting the input text into very simple units, called
tokens, which generally correspond to words, numbers, and symbols, and are
typically separated by white space in English.
 It is important to use a high-quality tokenizer, as errors are likely to affect the
results of all subsequent NLP components in the pipeline.
 Commonly distinguished types of tokens are numbers, symbols (e.g., $, %),
punctuation, and words of different kinds, e.g., uppercase, lowercase, mixed case.
TOKENIZATION
 Several tools are available for tokenization, for example:
1- GATE’s ANNIE Tokenizer relies on a set of regular expression rules which are then
compiled into a finite-state machine (http://gate.ac.uk).
2- The PTBTokenizer is an efficient, fast, and deterministic tokenizer, which forms part of the
suite of Stanford CoreNLP tools.
3- NLTK also has several similar tokenizers to ANNIE, one based on regular expressions,
written in Python. http://www.nltk.org/
TOKENIZATION
Example sentence: There were 250 cyber-crime offences
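A minimal tokenization sketch using NLTK's rule-based Treebank-style tokenizer, applied to the example sentence above (the exact token boundaries depend on the tokenizer chosen):

```python
from nltk.tokenize import TreebankWordTokenizer

text = "There were 250 cyber-crime offences"
tokens = TreebankWordTokenizer().tokenize(text)  # rule-based, Penn Treebank conventions
print(tokens)
# typically: ['There', 'were', '250', 'cyber-crime', 'offences']
```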
 Sentence detection (or sentence splitting) is the task of separating text into its
constituent sentences.
 This typically involves determining whether punctuation, such as full stops,
commas, exclamation marks, and question marks, denotes the end of a sentence or
something else (quoted speech, abbreviations, etc.).
Mr. Barack Obama and his wife, Mrs. Michel, reside in the United States of
America.
 More complex cases arise when the text contains tables, titles, formulae, or other
formatting markup; these are usually the biggest source of error.
 Sentence splitters generally make use of already tokenized text. Examples include
GATE’s ANNIE Sentence Splitter, OpenNLP, NLTK (which uses the Punkt sentence
segmenter), and Stanford CoreNLP.
SENTENCE SPLITTING
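A minimal sentence-splitting sketch with NLTK's Punkt segmenter, using the abbreviation example above (assuming the Punkt model has been downloaded; the second sentence is added only for illustration):

```python
import nltk

nltk.download("punkt", quiet=True)  # Punkt sentence model (newer NLTK versions may also need "punkt_tab")

text = ("Mr. Barack Obama and his wife, Mrs. Michel, reside in the United States of "
        "America. The abbreviations above must not be treated as sentence boundaries.")
for sentence in nltk.sent_tokenize(text):
    print(sentence)
# Punkt treats the periods after "Mr." and "Mrs." as abbreviation markers, not sentence ends.
```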
 Part-of-Speech (POS) tagging is concerned with tagging words with their part of
speech, e.g., noun, verb, adjective.
 These basic linguistic categories are typically divided into quite fine-grained tags,
distinguishing for instance between singular and plural nouns, and different tenses
of verbs.
 The POS tag is determined by taking into account not just the word itself, but also
the context in which it appears. For example, the word love could be a noun or a
verb depending on the context (I love fish vs. Love is all you need).
 The figure shows an example of some POS-tagged text, using the PTB tag set.
POS TAGGING
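A minimal POS-tagging sketch using NLTK's default English tagger (model names may vary slightly across NLTK versions); it illustrates how love receives different PTB tags in different contexts:

```python
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)  # default English tagger model

print(nltk.pos_tag(["I", "love", "fish"]))
# e.g., [('I', 'PRP'), ('love', 'VBP'), ('fish', 'NN')]  -> love tagged as a verb

print(nltk.pos_tag(["Love", "is", "all", "you", "need"]))
# here "Love" is typically tagged as a noun (NN/NNP)
```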
 Morphological analysis essentially concerns the identification and classification of the
linguistic units of a word, typically breaking the word down into its root form and an
affix.
 For example, the verb walked comprises a root form walk and an affix -ed.
 In English, morphological analysis is typically applied to verbs and nouns, whose
inflected forms are created in two main ways (listed below).
 Morphological analyzers for English are often rule-based; for example, plural nouns
are typically created by adding -s or -es to the end of the singular noun.
MORPHOLOGICAL ANALYSIS AND STEMMING
1- By adding a suffix to the root form (e.g., walk, walked; box, boxes).
2- By an internal modification such as a vowel change (e.g., run, ran; goose, geese).
1- MORPHOLOGICAL ANALYSIS
 Stemmers produce the stem form of each word, e.g., driving and drivers have
the stem drive, whereas morphological analysis tends to produce the
root/lemma forms of the words and their affixes, e.g., drive and driver for the
above examples, with affixes -ing and -s respectively.
MORPHOLOGICAL ANALYSIS AND STEMMING
2- STEMMING
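A minimal sketch contrasting stemming and lemmatization using NLTK's PorterStemmer and WordNetLemmatizer (exact stem forms depend on the stemmer used):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lexicon used by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming strips affixes; the result need not be a dictionary word.
for word in ["driving", "drivers", "walked", "boxes"]:
    print(word, "->", stemmer.stem(word))

# Lemmatization (closer to full morphological analysis) returns the root/lemma,
# given the part of speech ("v" = verb, "n" = noun).
print(lemmatizer.lemmatize("walked", pos="v"))  # walk
print(lemmatizer.lemmatize("geese", pos="n"))   # goose
```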
SYNTACTIC PARSING
 Syntactic parsing is concerned with analyzing sentences to derive their
syntactic structure according to a grammar.
 Essentially, parsing explains how different elements in a sentence are related
to each other, such as how the subject and object of a verb are connected.
 There are many different syntactic theories in computational linguistics,
which posit different kinds of syntactic structures.
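A minimal constituency-parsing sketch using a toy context-free grammar with NLTK's chart parser (the grammar and sentence are illustrative assumptions, not a realistic grammar):

```python
import nltk

# A toy context-free grammar covering the example sentence only.
grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N  -> 'dog' | 'man'
V  -> 'saw'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog saw a man".split()):
    tree.pretty_print()  # constituency tree relating subject, verb, and object
```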
CHUNKING
 Chunkers, also sometimes called shallow parsers, perform chunking: the process of
extracting phrases from unstructured text. Chunking is very important when you
want to extract information from text, such as locations or person names.
 Tools for chunking can be subdivided into Noun Phrase (NP) Chunkers and
Verb Phrase (VP) Chunkers.
CHUNKING
 Verb phrase chunks may even include negative elements such as might not have
bought or didn’t buy. An example of chunker output combining both
noun and verb phrase chunking is shown in Figure 2.11.
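A minimal NP-chunking sketch using NLTK's RegexpParser with a simple hand-written chunk grammar (the grammar and example sentence are illustrative assumptions):

```python
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)

# Chunk grammar: an NP is an optional determiner, any adjectives, then one or more nouns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

sentence = "The quick brown fox saw the lazy dog"
tagged = nltk.pos_tag(sentence.split())
print(chunker.parse(tagged))
# NP chunks such as (NP The/DT quick/JJ brown/JJ fox/NN) and (NP the/DT lazy/JJ dog/NN)
```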
Named Entity Recognition and
Classification
Named Entity Recognition and Classification
 Named entity recognition and classification (NERC), also called entity
identification or entity extraction, is a natural language processing (NLP)
technique that automatically identifies named entities in a text and classifies them
into predefined categories.
 Entities can be names of people, organizations, locations, times, quantities,
monetary values, percentages, and more.
 Extracting the main entities in a text helps sort unstructured data and detect
important information, which is crucial if you have to deal with large datasets.
 It can be subdivided into three tasks:
1- Named entity recognition (NER), which involves identifying the boundaries of an
NE.
2- Named entity classification (NEC), which involves detecting the class or type of the NE.
Named Entity Recognition and Classification
3- Named entity linking (NEL), which recognizes whether a named entity mention in a text
corresponds to any NE in a reference knowledge base.
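A minimal NERC sketch using NLTK's ne_chunk, which performs recognition and classification on POS-tagged tokens (model/corpus names may vary slightly across NLTK versions):

```python
import nltk

# One-time model/corpus downloads
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

sentence = "Barack Obama visited the University of Sheffield in May."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)  # recognition + classification (PERSON, ORGANIZATION, GPE, ...)
print(tree)
```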
Named Entity Recognition and Classification
 NERC is the starting point for many more complex applications and
tasks such as ontology building, relation extraction, question
answering, information extraction, information retrieval, machine
translation, and semantic annotation.
 The standard kind of 5- or 7-class entity recognition problem is now
often less useful, which in turn means that new paradigms are
required. In some cases, such as the recognition of Twitter user
names, the distinction between classes such as Organization and Location has
become blurred even for a human, and is no longer always useful.
CHALLENGES IN NERC
 One of the main challenges of NERC is to distinguish between named
entities and entity types: named entities, such as Obama, are instances of types
(classes) such as Person or Politician.
 Another challenge is to recognize NE boundaries correctly.
Example: in Sir Robert Walpole was a British statesman, should the entity be Robert
Walpole or Sir Robert Walpole?
 Ambiguities are one of the biggest challenges for NERC systems. These can
affect both the recognition and the classification component, and
sometimes even both simultaneously.
Example: the word May can be a proper noun (named entity) or a common
noun (not an entity, as in the verbal use you may go), but even when a proper
noun, it can fall into various categories (month of the year, part of a person’s
name (and furthermore a first name or surname), or part of an organization
name).
APPROACHES TO NERC AND PIPELINES
 Approaches to NERC can be roughly divided into
1- Rule- or pattern-based,
2- Machine learning or statistical extraction methods.
 An example of a typical processing pipeline for NERC is shown in Figure 3.1.
 A typical NERC processing pipeline consists of linguistic pre-processing
(tokenization, sentence splitting, POS tagging), followed by entity finding
using gazetteers and grammars, and then co-reference resolution.
APPROACHES TO NERC
1 RULE-BASED APPROACHES TO NERC
 Rule-based approaches use contextual information to help determine whether
candidate entities from the gazetteers are valid, or to extend the set of
candidates.
 Linguistic rule-based methods for NERC, such as those used in ANNIE,
GATE’s information extraction system, typically comprise a combination of
gazetteer lists and hand-coded pattern-matching rules.
 Gazetteer lists are designed for annotating simple, regular features such as
known names of companies, locations, days of the week, famous people, etc.
Note: Using gazetteers alone is insufficient for recognizing and classifying
entities, because many names are too ambiguous (e.g., “London”
could be part of an Organization name, a Person name, or just the Location).
APPROACHES TO NERC
1 RULE-BASED APPROACHES TO NERC
 Pattern-matching rules:
 Using pattern matching for NERC requires the development of patterns over
multi-faceted structures that consider many different properties of words, such
as orthography (capitalization), morphology, part-of-speech information and so
on.
 A typical simple pattern-matching rule might try to match all university names,
e.g., University of Sheffield, University of Bristol.
 A more complex rule might try to identify the name of any organization by
looking for a keyword from a gazetteer list such as Company, Organization,
Business, School, etc.
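A minimal illustrative sketch (in Python rather than GATE's JAPE rule language) of combining a small gazetteer with a pattern-matching rule for university names; the gazetteer entries and regular expression are assumptions:

```python
import re

# Tiny gazetteer of known location names (illustrative entries only)
location_gazetteer = {"Sheffield", "Bristol", "London"}

# Pattern rule: "University of <Capitalized word>" suggests an Organization name.
university_rule = re.compile(r"\bUniversity of ([A-Z][a-z]+)\b")

def find_universities(text):
    """Return Organization candidates; gazetteer membership adds confidence."""
    entities = []
    for match in university_rule.finditer(text):
        city = match.group(1)
        entities.append({
            "text": match.group(0),
            "type": "Organization",
            "city_in_gazetteer": city in location_gazetteer,
        })
    return entities

print(find_universities("She studied at the University of Sheffield, then the University of Bristol."))
```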
APPROACHES TO NERC
2-SUPERVISED LEARNING METHODS FOR NERC
 Rule-based methods were historically followed by supervised learning approaches.
 Supervised learning approaches learn weights for features, based on their
probability of appearing with negative vs. positive training examples for specific NE
types.
The general supervised learning approach consists of five stages:
• linguistic pre-processing.
• feature extraction.
• training models on training data.
• applying models to test data.
• post-processing the results to tag the documents.
APPROACHES TO NERC
SUPERVISED LEARNING METHODS FOR NERC
 Popular features include:
• Morphological features: capitalization, occurrence of special characters (e.g., $, %);
• Part-of-speech features: tags of the occurrence;
• Context features: words and POS of words in a window around the occurrence,
usually of 1–3 words;
• Gazetteer features: appearance in NE gazetteers;
• Syntactic features: features based on parse of sentence;
• Word representation features: features based on unsupervised training on unlabeled
text using, e.g., Brown clustering or word embeddings.
 Statistical NERC approaches use a variety of models, such as Hidden Markov
Models (HMMs), Maximum Entropy models, Support Vector Machines (SVMs),
Conditional Random Fields (CRFs), or neural networks.
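A minimal sketch of token-level feature extraction of the kind listed above (the feature names and tiny gazetteer are illustrative assumptions); the resulting feature dictionaries would then be passed to a statistical model such as a CRF or SVM:

```python
def token_features(tokens, pos_tags, index, gazetteer):
    """Illustrative feature dictionary for the token at `index` (context window of +/-1 word)."""
    word = tokens[index]
    return {
        "word.lower": word.lower(),
        "word.is_capitalized": word[0].isupper(),        # morphological feature
        "word.has_special_char": any(c in "$%" for c in word),
        "pos": pos_tags[index],                          # part-of-speech feature
        "in_gazetteer": word in gazetteer,               # gazetteer feature
        "prev_word": tokens[index - 1].lower() if index > 0 else "<START>",            # context
        "next_word": tokens[index + 1].lower() if index < len(tokens) - 1 else "<END>",
    }

tokens = ["Barack", "Obama", "visited", "London"]
pos_tags = ["NNP", "NNP", "VBD", "NNP"]
gazetteer = {"London", "Paris"}
print(token_features(tokens, pos_tags, 0, gazetteer))
```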
Relation Extraction
Relation Extraction
 Relation extraction (RE) is the task of extracting links between entities, which
builds on the task of NERC.
 Typical relation types include birthdate(PER, DATE) and founder-of(PER, ORG), with
instances such as birthdate(John Smith, 1985-01-01) and
founder-of(Bill Gates, Microsoft).
 It can be subdivided into three tasks:
1- Relation argument identification (finding the boundaries of the arguments),
2- Relation argument classification (assigning types to the arguments)
3- Relation classification (assigning a type to the relation).
Relation Extraction
 The first two tasks are generally approached using NERC. For semantic
annotation, a further step is to link relation arguments to entries in a knowledge
base using named entity linking (NEL) methods.
 One of the problems RE approaches face is the vast difference between relation
schemas; unlike for NER, there is no standard small set of types that is shared
across systems.
 Another problem is that relation types can overlap or even entail one another,
e.g., the ceo-of(PER, ORG) relation is completely subsumed by the employee-of(PER, ORG)
relation, whereas there is only a strong overlap, but no entailment,
between country-of-birth(PER, LOC) and country-of-residence(PER, LOC).
RELATION EXTRACTION PIPELINE
 A typical relation extraction pipeline is shown in Figure 4.1. Note that there are
several variations of this approach, as described in the following sections.
THE ROLE OF KNOWLEDGE BASES IN RELATION EXTRACTION
 Relation extraction is generally defined as extracting relation mentions with
their arguments from text. For traditional relation extraction, relation types and
their arguments are defined in a schema, whereas for open information
extraction, relation types are not pre-defined.
 An example of a template for a binary relation would be Person, born on, Date,
e.g., birthdate(John Smith, 1985-01-01), though relations can have more than two
arguments, such as Government positions held.
 Relation extraction has many challenges in addition to those of the NERC task:
1- The main challenge is that relations can be expressed in different ways. For
instance, born on can be expressed as was born on, birthdate is on, or saw the light
of day for the first time on.
THE ROLE OF KNOWLEDGE BASES IN RELATION EXTRACTION
3- Relation expressions are also not always unique to one relation; for example, works at
can imply employee of or CEO of.
4- Some relation expressions are also very vague, for instance:
Alfred Hitchcock’s “The Birds” was very popular.
Note: Although relation extraction is often solved in the form of a pipeline
architecture, as shown in the previous figure, this can lead to errors being propagated. If
an error is made at an early stage of the pipeline, it cannot be corrected later.
RELATION SCHEMAS
 There are two kinds of information that need to be described for relation
extraction.
First, we need information about classes (e.g., artist, track) and their relationships
(e.g., released-track). This kind of information is published as a schema.
 Second, we need information about instances of these classes (e.g., David
Bowie, Changes), which can be published in a dataset.
 Some websites contain semantic markup, often using http://schema.org/, but do
not publish this in a separate dataset.
 For the purpose of relation extraction, schemas serve a similar purpose to locally
defined templates, but they have a clear advantage in the way data is described,
using unique identifiers for entities called uniform resource identifiers (URIs).
RELATION EXTRACTION METHODS
Relation extraction methods include:
1- Rule-based methods
2- Supervised methods
3- Semi-supervised bootstrapping
methods.
5- Unsupervised/Open IE methods.
6- Distantly supervised approaches
7- Universal schema
1- RULE-BASED APPROACHES
 Rule-based relation extraction approaches make use of domain knowledge,
which is encoded as relation extraction rules.
 There are two different types of rule-based approaches:
1- Stand-alone: relies on a rule grammar to encode complex rules and
dependencies between them. An example of such a rule is BandMember
followed within 30 characters by Instrument. For both BandMember and
Instrument, pre-compiled gazetteers plus regular expressions are used to
recognize them.
 A downside of such rule-based approaches is that they are not able to
generalize to unseen textual patterns, and generally have poor recall.
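A minimal illustrative sketch of such a stand-alone rule, expressed as a regular expression over text with toy gazetteers (the entity lists, the plays relation name, and the example sentence are assumptions):

```python
import re

# Toy gazetteers (illustrative entries only)
band_members = ["John Lennon", "Ringo Starr"]
instruments = ["guitar", "drums", "piano"]

member_re = "|".join(map(re.escape, band_members))
instrument_re = "|".join(map(re.escape, instruments))

# Rule: a BandMember followed within 30 characters by an Instrument -> plays(member, instrument)
rule = re.compile(rf"({member_re}).{{0,30}}?({instrument_re})")

text = "John Lennon played the guitar, while Ringo Starr preferred the drums."
for member, instrument in rule.findall(text):
    print(f"plays({member}, {instrument})")
```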
1- RULE-BASED APPROACHES
2- Rule learning for inference: learns rules to complement other relation extraction
approaches. This starts with a pair of entities that are known to be in a relation according
to a (seed) knowledge base, then performs a random walk over the knowledge
graph to find other paths that connect these entities. Thus it can learn that if two
people share a child, they are very likely to be married, or that people often
study at the same university as their siblings.
 A downside of using rules learned for inference purposes is that rules learned
on a small knowledge base might not generalize to new relations.
 Only rules learned based on a large enough amount of evidence should
ideally be used.
2- UNSUPERVISED APPROACHES
 Unsupervised relation extraction approaches became popular soon after supervised
systems, with open information extraction systems such as TextRunner , ReVerb,
and OLLIE.
 Open IE can thus be seen as a subgroup of unsupervised approaches; this means
such systems have to infer the classes of entities, and the relations between them.
 TextRunner: was the first fully implemented and evaluated Open IE system. It
learns a Conditional Random Field (CRF) model for relations, classes, and entities
from a corpus using a relation-independent extraction model.
 ReVerb addresses two shortcomings of previous Open IE systems: incoherent
extractions and uninformative extractions. Incoherent extractions occur when the
extracted relation phrase has no meaningful interpretation.
3- DISTANT SUPERVISION APPROACHES
 Distant supervision is a method for automatically annotating training data using
existing relations in knowledge bases.
 Mintz et al. define the distant supervision assumption as:
“If two entities participate in a relation, any sentence that contains those two entities might
express that relation.”
 The approach has all the advantages of supervised learning (high-precision
extraction, output with respect to an extraction schema), and additional advantages,
since no manual effort is required to label training data. Extraction performance
is slightly lower than for supervised approaches, due to incorrectly labeled
training examples.
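A minimal sketch of the distant-supervision labeling step under the assumption above (the knowledge base entries and sentences are illustrative; the second sentence shows how noisy labels arise):

```python
# Seed knowledge base: known relation instances (from the examples above)
knowledge_base = {
    ("Bill Gates", "Microsoft"): "founder-of",
    ("John Smith", "1985-01-01"): "birthdate",
}

sentences = [
    "Bill Gates founded Microsoft in 1975.",           # genuinely expresses founder-of
    "Bill Gates gave a talk at Microsoft yesterday.",   # contains both entities but not the relation (noise)
    "John Smith was born on 1985-01-01.",
]

# Distant supervision assumption: any sentence containing both entities of a known
# relation instance is labeled as a (possibly noisy) positive example of that relation.
training_examples = []
for sentence in sentences:
    for (subject, obj), relation in knowledge_base.items():
        if subject in sentence and obj in sentence:
            training_examples.append((sentence, subject, obj, relation))

for example in training_examples:
    print(example)
```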
Thank You