Summary of Multilingual Natural Language Processing Applications: From Theory to Practice

MULTILINGUAL NATURAL LANGUAGE PROCESSING
APPLICATIONS: FROM THEORY TO PRACTICE
OCTOBER 2017
Mashael Alduwais

OVERVIEW
Multilingual Natural Language Processing Applications: From Theory to Practice
Edited by Daniel M. Bikel and Imed Zitouni
IBM Press, 2012
Two Parts:
I. Theory: 7 chapters
II. Practice: 9 chapters
10/30/2017 MASHAEL ALDUWAIS

ABOUT THE AUTHORS
Daniel M. Bikel
 Current Position: Research Scientist @ Google
 Previous: LinkedIn, Google, IBM
 Education: Harvard University, University of
Pennsylvania
 Interest: Syntax/parsing, information extraction,
multilingual systems, NLP systems design,
machine learning toolkits, language modeling.
Imed Zitouni
 Current Position: Principal Researcher @ Microsoft
 Previous: IBM, Bell Labs, DIALOCA
 Education: Université Henri Poincaré, Nancy
 Interest: natural language processing, language
modeling, spoken dialog systems, speech
recognition, and machine learning.

BOOK CONTENT
Part I: Theory
 Chapter 1 Finding the Structure of Words
 Chapter 2 Finding the Structure of Documents
 Chapter 3 Syntax
 Chapter 4 Semantic Parsing
 Chapter 5 Language Modeling
 Chapter 6 Recognizing Textual Entailment
 Chapter 7 Multilingual Sentiment and
Subjectivity Analysis
Part II: Practice
 Chapter 8 Entity Detection and Tracking
 Chapter 9 Relations and Events
 Chapter 10 Machine Translation
 Chapter 11 Multilingual Information Retrieval
 Chapter 12 Multilingual Automatic
Summarization
 Chapter 13 Question Answering
 Chapter 14 Distillation
 Chapter 15 Spoken Dialog Systems
 Chapter 16 Combining Natural Language
Processing Engines

CHAPTER 1. FINDING THE STRUCTURE OF WORDS
تركيب الكلمات
Morphological parsing: discovery of word structure
 Tokens: words
 In Arabic, certain tokens (called clitics) are concatenated in writing with the preceding or following ones,
possibly changing their forms as well.
 Lexemes: the concept behind a linguistic form and the set of alternative forms that can express it.
 Lexical categories: verbs, nouns, adjectives, conjunctions, particles, or other parts of speech.
 Inflection: converting a lexeme into its other forms, e.g., turning singular into plural.
 Morphemes: structural components of word form (segments or morphs). Ex: dis-agree-ment-s
 Typology: divides languages into groups by characterizing the prevalent morphological
phenomena in those languages. Ex: Isolating, Synthetic, Agglutinative, Fusional.

Issues and Challenges:
 Irregularity: word forms that are not described by a prototypical linguistic model.
 Ambiguity: word forms that can be understood in multiple ways out of the context of their discourse.
 Productivity: Is the inventory of words in a language finite, or is it unlimited?
Morphological Models:
 Dictionary Lookup
 Finite-State Morphology
 Unification-Based Morphology
 Functional Morphology
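As a rough illustration of the simplest model in this list, dictionary lookup, the sketch below maps word forms directly to their analyses. The tiny lexicon, the feature labels, and the `analyze` function are assumptions made for this example and are not taken from the book.

```python
# Toy dictionary-lookup morphological analyzer (illustrative only).
# The lexicon entries and feature labels are hypothetical.
LEXICON = {
    "mouse": [("mouse", "NOUN", "singular")],
    "mice": [("mouse", "NOUN", "plural")],        # irregular plural
    "saw": [("see", "VERB", "past"),              # ambiguous word form
            ("saw", "NOUN", "singular")],
}

def analyze(word_form):
    """Return every (lexeme, category, feature) analysis listed for a word form."""
    return LEXICON.get(word_form.lower(), [])

print(analyze("mice"))
print(analyze("saw"))
print(analyze("blorp"))  # unlisted form: lookup alone cannot handle productivity
```

The three calls mirror the challenges listed above: an irregular form, an ambiguous form, and a form that a finite word list does not cover.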

CHAPTER 2. FINDING THE STRUCTURE OF DOCUMENTS
تركيب النص
Some NLP tasks use sentences as the basic processing unit:
 Parsing, machine translation, automatic speech recognition (ASR) systems, and semantic role
labeling
Sentence boundary detection (sentence segmentation): Automatically segmenting a
sequence of word tokens into sentence units.
Topic segmentation (discourse or text segmentation): Automatically dividing a stream
of text or speech into topically homogeneous blocks.
A boundary classification problem:
 Depending on the type of input (i.e., text versus speech), different features may be used.
 Performance has improved by exploiting very high-dimensional feature sets.
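To make the boundary classification view concrete, here is a minimal sketch for text input: each token is a candidate boundary, a few features are extracted for it, and a decision function labels it. The abbreviation list, the feature names, and the hand-written `is_boundary` rule (standing in for a trained classifier) are all assumptions for illustration.

```python
# Hypothetical abbreviation list; a real system would learn such cues from data.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def boundary_features(tokens, i):
    """Features describing the candidate boundary after tokens[i]."""
    tok = tokens[i]
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "ends_with_punct": tok[-1] in ".?!",
        "is_abbreviation": tok.lower() in ABBREVIATIONS,
        "next_is_capitalized": nxt[:1].isupper(),
        "is_last_token": i == len(tokens) - 1,
    }

def is_boundary(feats):
    """Hand-written stand-in for a trained boundary classifier."""
    if not feats["ends_with_punct"] or feats["is_abbreviation"]:
        return False
    return feats["next_is_capitalized"] or feats["is_last_token"]

def split_sentences(text):
    """Segment whitespace-separated tokens into sentence units."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if is_boundary(boundary_features(tokens, i)):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith arrived. He was late."))
```

For speech input, the same framing would apply with different features (for example pause duration), which this sketch does not attempt.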

CHAPTER 3. SYNTAX
النحو
Syntactic parsing (syntax analysis): discovering the various predicate-argument
dependencies that may exist in a sentence.
 Parse natural language text to provide syntactic trees.
 Recursively partition the words in the sentence into individual phrases such as verb phrases or noun phrases.
 Used for text-to-speech, machine translation, summarization, and paraphrasing applications.

Treebanks:
 A collection of sentences (an annotated text corpus) in which each sentence is provided with a complete
syntactic analysis.
 The syntactic analysis for each sentence has been judged by a human expert.
 A style book or set of annotation guidelines is typically written before the annotation process
to ensure a consistent scheme of annotation throughout the treebank.
 Two main approaches to construct treebanks: dependency graphs and phrase structure.
Challenges:
 Ambiguity: choosing from an exponentially large number of alternative analyses.
 Language issues: tokenization, case, encoding, word segmentation and morphology.
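One way to get a feel for the "exponentially large number of alternative analyses" is to count the unlabeled binary bracketings of a sentence, which is given by the Catalan numbers. This is only a back-of-the-envelope sketch: a real grammar rules out many bracketings and adds labels, so the function name and the framing are assumptions for illustration.

```python
from math import comb

def binary_bracketings(n_words):
    """Number of distinct binary bracketings of a sequence of n_words words.

    This equals the Catalan number C(k) = comb(2k, k) / (k + 1) with k = n_words - 1.
    """
    if n_words < 1:
        raise ValueError("need at least one word")
    k = n_words - 1
    return comb(2 * k, k) // (k + 1)

for n in (2, 5, 10, 20):
    print(n, binary_bracketings(n))
```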

CHAPTER 4. SEMANTIC PARSING
التحليل الدلالي
Semantic parsing: identifying meaning chunks contained in an information signal in
an attempt to transform it into some data structure that can be manipulated by a
computer to perform higher level tasks.
Two types of representations:
 Deep semantic parsing: taking natural language input and transforming it into a meaning
representation. Domain-dependent.
 Problem: reusability of the representation across domains is very limited.
 Shallow semantic parsing: deals with the four main aspects of language: structural ambiguity,
word sense, entity and event recognition, and predicate argument structure recognition.
General-purpose.
 Problem: difficult to construct a general-purpose ontology.

A semantic theory should be able to:
1. Explain sentences having ambiguous meanings. For example, it should account for the
fact that the word bill in the sentence The bill is large is ambiguous in the sense that it
could represent money or the beak of a bird.
2. Resolve the ambiguities of words in context. For example, if the same sentence is
extended to form The bill is large but need not be paid, then the theory should be able
to disambiguate the monetary meaning of bill.
3. Identify meaningless but syntactically well-formed sentences, such as the famous
example by Chomsky: Colorless green ideas sleep furiously.
4. Identify syntactically or transformationally unrelated paraphrases of a concept
having the same semantic content.

Semantic parsing can be considered as part of semantic interpretation.
Requirements for Semantic Interpretation:
 Structural Ambiguity: transforming a sentence into its underlying syntactic representation.
 Word Sense: the same word type can carry different senses in different contexts.
 Ex: She nailed the loose arm of the chair with a hammer. vs. She went to the beauty salon to get a manicure.
 Entity and Event Resolution: named entity recognition and coreference resolution.
 Predicate-Argument Structure: identifying how the entities participate in these events.
 Can be defined as the identification of who did what to whom, when, where, why, and how.
 Meaning Representation: build a semantic representation that can then be manipulated by
algorithms to various application ends (called deep representation). A domain-specific
approach.

CHAPTER 5. LANGUAGE MODELING
نمذجة اللغة
A statistical model that assigns a probability to a sentence.
 Specifies the a priori probability of a particular word sequence in the language of interest.
 Given an alphabet or inventory of units Σ and a sequence W = w1w2 ...wt ∈ Σ∗, a language
model can be used to compute the probability of W based on parameters previously
estimated from a training set.
Language models are usually used as a component of speech recognition and machine translation systems.
They are also a standard tool in information retrieval, spell correction, summarization, authorship
identification, and document classification.

n-Gram Models: assume that all previous words except for the (n − 1) words directly preceding
the current word are irrelevant for predicting the current word, or, alternatively, that
they are equivalent.
Evaluation criteria: coverage rate, perplexity.
Language Model Adaptation: designing and tuning a language model such that it
performs well on a new test set for which little equivalent training data is available.
 Methods: mixture language models, topic-dependent language models, trigger models.
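The n-gram assumption and the perplexity criterion can be sketched with a bigram model (n = 2). The example below assumes add-one (Laplace) smoothing and a two-sentence toy training set; both are choices made for this illustration rather than anything prescribed by the book, and words outside the training vocabulary are not handled properly.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def perplexity(sentence, unigrams, bigrams):
    """Perplexity of one sentence: exp of the negative mean log-probability."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = sum(
        math.log(bigram_prob(p, w, unigrams, bigrams))
        for p, w in zip(words, words[1:])
    )
    return math.exp(-log_prob / (len(words) - 1))

# Toy training data (hypothetical).
uni, bi = train_bigram(["the cat sat", "the dog sat"])
print(perplexity("the cat sat", uni, bi))   # word order seen in training
print(perplexity("sat cat the", uni, bi))   # scrambled order scores worse
```

Lower perplexity means the model finds the word sequence less surprising, which is why the scrambled sentence receives the higher value.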

Types of Language Models (other than n-gram language models):
 Class-Based Language Models
 Variable-Length Language Models
 Discriminative Language Models
 Syntax-Based Language Models
 MaxEnt Language Models
 Factored Language Models
 Bayesian Topic-Based Language Models
 Neural Network Language Models

Language Modeling Problems:
 Language-Specific Modeling Problems:
 In Arabic, decomposition may be required. Integrating morphological information into the language
model is helpful for modeling dialectal Arabic.
 Spoken versus Written Languages:
 Many of the world’s 6,900 languages are spoken languages, that is, languages without a writing
system (dialects).
 In this case, the only way of obtaining language model training data is to manually transcribe the
language or dialect. This is a costly and time-consuming process because it involves (i) the
development of a writing standard, (ii) training native speakers to use the writing system consistently
and accurately, and (iii) the actual transcription effort.
 Where text resources can be obtained for the language in question (e.g., from the web), they will
need to be normalized, which can also be a laborious process.

CHAPTER 6. RECOGNIZING TEXTUAL ENTAILMENT
التعرف على التضمين النصي
Textual entailment is defined as a directional relationship between pairs of text
expressions, denoted by T, the entailing text, and H, the entailed hypothesis. We say
that T entails H if the meaning of H can be inferred from the meaning of T, as would
typically be interpreted by people.
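The directional nature of the T/H relationship can be illustrated with a deliberately naive word-overlap baseline: it checks how much of H is covered by T, so swapping the two changes the result. This heuristic, its threshold, and the example sentences are assumptions for illustration; it ignores negation, word order, and paraphrase, and is not the approach described in the book.

```python
def overlap_score(text, hypothesis):
    """Fraction of hypothesis word types that also occur in the text."""
    t_words = set(text.lower().split())
    h_words = set(hypothesis.lower().split())
    if not h_words:
        return 1.0
    return len(h_words & t_words) / len(h_words)

def entails(text, hypothesis, threshold=0.8):
    """Naive decision rule: predict entailment when coverage of H by T is high."""
    return overlap_score(text, hypothesis) >= threshold

T = "john bought a red car yesterday"
H = "john bought a car"
print(entails(T, H))  # T covers all of H
print(entails(H, T))  # the reverse direction is not covered
```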
Applications of Textual Entailment Solutions:
 Summarization.
 Exhaustive Search for Relations
 Question Answering
 Machine Translation

CHAPTER 7. MULTILINGUAL SENTIMENT AND
SUBJECTIVITY ANALYSIS
تحليل المشاعر والتحليل الذاتي للغات المتعددة
Subjectivity classification: labels text as either subjective or objective.
Sentiment classification: classifies subjective text as either positive, negative, or
neutral.
 Used in automatic expressive text-to-speech synthesis, tracking sentiment timelines in online
forums and news, and mining opinions from product reviews.
Tools: two main types:
I. Rule-based systems: relying on manually or semi-automatically constructed lexicons. Ex:
OpinionFinder.
II. Machine learning classifiers: trained on opinion-annotated corpora. Ex: Wiebe, Bruce,
and O’Hara.
Corpora: subjectivity and sentiment annotated corpora used to train automatic
classifiers, and as resources to extract opinion mining lexicons.

Lexicons:
 OpinionFinder: contains 6,856 unique entries, out of which 990 are multiword expressions.
Each entry is also associated with a polarity label, indicating whether the corresponding
word or phrase is positive, negative, or neutral.
 General Inquirer: a dictionary of about 10,000 words grouped into about 180 categories,
which have been widely used for content analysis. It includes semantic classes (e.g., animate,
human), verb classes (e.g., negatives, becoming verbs), cognitive orientation classes (e.g.,
causal, knowing, perception), and others. Two of the largest categories in the General
Inquirer are the valence classes, which form a lexicon of 1,915 positive words and 2,291
negative words.
 SentiWordNet: built on top of WordNet; it assigns each WordNet synset a score
triplet (positive, negative, and objective), indicating the strength of each of these three
properties for the words in the synset.

Word- and Phrase-Level Annotations: three main directions:
i. manual annotations, which involve human judgment of selected words and phrases,
ii. automatic annotations based on knowledge sources such as dictionaries,
iii. automatic annotations based on information derived from corpora.
Sentence-Level Annotations: corpus annotations are often required either as an end goal for
various text-processing applications (e.g., mining opinions from the Web, classification of
reviews into positive and negative), or as an intermediate step toward building automatic
subjectivity and sentiment classifiers. Two methods:
i. dictionary-based, consisting of rule-based classifiers relying on lexicons,
ii. corpus-based, consisting of machine learning classifiers trained on preexisting annotated
data.
Document-Level Annotations: applications, such as review classification or web opinion
mining, often require corpus-level annotations of subjectivity and polarity.
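A dictionary-based, rule-based sentence classifier of the kind mentioned above can be sketched in a few lines. The six-word lexicon and the decision rules are hypothetical; real lexicons such as those listed earlier are far larger and carry richer labels.

```python
# Hypothetical mini-lexicon mapping opinion words to a polarity of +1 or -1.
POLARITY = {
    "good": 1, "great": 1, "excellent": 1,
    "bad": -1, "poor": -1, "terrible": -1,
}

def classify_sentence(sentence):
    """Label a sentence as objective, positive, negative, or neutral."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    hits = [POLARITY[w] for w in words if w in POLARITY]
    if not hits:
        return "objective"   # no opinion clue found in the lexicon
    score = sum(hits)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"         # subjective, but the clues cancel out

print(classify_sentence("The screen is great."))
print(classify_sentence("The phone weighs 150 grams."))
```

The first step (any lexicon hit or none) plays the role of subjectivity classification, and the sign of the score plays the role of sentiment classification.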

CHAPTER 8. ENTITY DETECTION AND TRACKING
التعرف على أسماء الأعلام ومتابعتها
Mention detection:
 Detecting the boundary of a mention and optionally identifying the semantic type (e.g.,
PERSON or ORGANIZATION) and other attributes (e.g., named, nominal, or pronominal).
 Closely related to named entity recognition.
 Mentions: any instances of textual references to objects or abstractions, which can be either
named (e.g., John Mayor), nominal (e.g., the president), or pronominal (e.g., she, it).
 Can be formulated as a classification problem by assigning a label to each token in the text.
Coreference resolution:
 Clustering mentions referring to the same entity into equivalence classes.
 Machine learning-based approaches: learn a model from training data that assigns a score
to a pair of mentions indicating the likelihood that the two mentions refer to the same entity.
Mentions are then clustered into entities on the basis of mention-pair scores.
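The last step, clustering mentions into entities from mention-pair scores, might look like the sketch below. The score table stands in for a trained mention-pair model, and the threshold and the simple "link every pair above the threshold" strategy are assumptions for illustration.

```python
def cluster_mentions(mentions, pair_score, threshold=0.5):
    """Group mentions into entities by linking pairs whose score passes a threshold."""
    parent = list(range(len(mentions)))

    def find(i):
        # Follow parent links to the cluster representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            if pair_score(mentions[i], mentions[j]) >= threshold:
                parent[find(j)] = find(i)   # merge the two clusters

    clusters = {}
    for i, mention in enumerate(mentions):
        clusters.setdefault(find(i), []).append(mention)
    return list(clusters.values())

# Hypothetical scores standing in for a trained mention-pair model.
SCORES = {
    ("John Smith", "he"): 0.9,
    ("John Smith", "the president"): 0.7,
    ("John Smith", "IBM"): 0.1,
    ("he", "the president"): 0.6,
    ("he", "IBM"): 0.0,
    ("the president", "IBM"): 0.2,
}

def toy_score(m1, m2):
    return SCORES.get((m1, m2), SCORES.get((m2, m1), 0.0))

print(cluster_mentions(["John Smith", "he", "the president", "IBM"], toy_score))
```

The result is a set of equivalence classes: one entity containing the named, pronominal, and nominal mentions of the same person, and a second entity for the organization.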

CHAPTER 9. RELATIONS AND EVENTS
العلاقات والأحداث
Relation Extraction Systems: systems capable of finding semantic relations among entities.
 Relation extraction can be considered as a multiclass classification problem, with several classes of
features including structural, lexical, entity-based, syntactic, and semantic.
Relation Extraction Types:
 Extracting relations typically associated with lexical ontologies, such as meronymy, hyponymy, and
troponymy;
 Extracting relations similar in nature, such as detecting that verb1 expresses the same concept as
verb2 but in a stronger fashion;
 Finding enablement, that is, detecting that the action expressed by verb1 is a
prerequisite for the action expressed by verb2; and
 Identifying general semantic links between potentially heterogeneous entities, such as employment
relations between people and companies, cause of death relations between diseases and
people, or ownership of one entity (such as a company) by another.

Relation types in the National Institute of Standards and Technology (NIST) ACE evaluations:
 PHYS (physical): A spatial relation denoting that a person is located at or near a facility or a
location.
 PART-WHOLE: A spatial relation denoting that a facility, a location, or a geopolitical entity (GPE) is a part of another
facility.
 PER-SOC (personal-social): Personal-social relations capture links between people. Relations can
be business-related or family-based.
 ORG-AFF (organization-affiliation): This type of relation pertains to connections between persons
and organizations. A person could be employed by an organization or could be a member.
 GEN-AFF (general-affiliation): citizenship, residence in a country, religious affiliation, and
ethnicity.
 ART (artifact): A relation between a user, inventor, or manufacturer and the artifact itself.
 METONYMY: A relation between two different aspects of the same underlying entity.

Events: an event denotes any change of state in the world that is described using natural
language text.
Event extraction: the use of any algorithm to extract a structured representation of
that change of state, crucially including the entities involved.

CHAPTER 10. MACHINE TRANSLATION
الترجمة الآلية
Machine translation: converting text in one language into another while preserving its meaning.
Research started in the 1940s. The most profound change, the introduction of statistical
methods at IBM, can be dated back to 1988.
Statistical Machine Translation:
 Using large corpora of translated texts, typically many millions of words.
 Learn the rules of translation from corpora and provide the basis for a decoding algorithm
that finds the best translation for a given input sentence
Machine translation is being integrated into various applications: crosslingual
information retrieval, speech translation, and tools for translators.

Word Alignment: Learning translation rules from a parallel corpus.
 Unsupervised learning problem.
 A word-aligned parallel corpus allows the estimation of phrase-based and tree-based
models and other approaches.
Evaluation:
 Human Assessment: ask human judges if the output constitutes a correct translation. Is it
fluent? Is the translation adequate?
 Automatic Evaluation Metrics: run similarity measures between the MT output and the reference
translations, counting matches, insertions, and deletions. In evaluation campaigns for evaluation
metrics, different metric developers compete for the highest correlation with human judges.
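The match, insertion, and deletion counts behind such automatic metrics can be obtained from a minimum-edit-distance word alignment, and dividing the total edit cost by the reference length gives a word error rate. The sketch below is one simple similarity measure of this kind, chosen here for illustration; it is not claimed to be the metric used in any particular evaluation campaign, and it assumes a single, non-empty reference.

```python
def align_counts(hypothesis, reference):
    """Minimum-edit alignment of hypothesis words against reference words.

    Returns (cost, matches, substitutions, insertions, deletions).
    """
    hyp, ref = hypothesis.split(), reference.split()
    # table[i][j] holds the best counts for hyp[:i] versus ref[:j].
    table = [[None] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    table[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, len(hyp) + 1):      # extra hypothesis words are insertions
        c, m, s, ins, d = table[i - 1][0]
        table[i][0] = (c + 1, m, s, ins + 1, d)
    for j in range(1, len(ref) + 1):      # missing reference words are deletions
        c, m, s, ins, d = table[0][j - 1]
        table[0][j] = (c + 1, m, s, ins, d + 1)
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            c, m, s, ins, d = table[i - 1][j - 1]
            if hyp[i - 1] == ref[j - 1]:
                diagonal = (c, m + 1, s, ins, d)
            else:
                diagonal = (c + 1, m, s + 1, ins, d)
            c, m, s, ins, d = table[i - 1][j]
            insertion = (c + 1, m, s, ins + 1, d)
            c, m, s, ins, d = table[i][j - 1]
            deletion = (c + 1, m, s, ins, d + 1)
            table[i][j] = min(diagonal, insertion, deletion, key=lambda t: t[0])
    return table[len(hyp)][len(ref)]

def word_error_rate(hypothesis, reference):
    """Edit cost divided by the number of reference words."""
    return align_counts(hypothesis, reference)[0] / len(reference.split())

print(align_counts("the cat sat on mat", "the cat sat on the mat"))
print(word_error_rate("the cat sat on mat", "the cat sat on the mat"))
```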

Current Research:
 The development of models that more closely mirror linguistic understanding of language,
 The application of novel machine learning methods to the estimation problem of learning
translation rules from the data, and
 The attempts to exploit various types of data sources, which are often not in the desired
domain or may not be even proper sentence-by-sentence translations at all.

Linguistic Challenges:
 Lexical Choice: a word sense disambiguation problem. n-gram language models effectively capture
local context information that is very useful for making the right lexical choice.
 Morphology: when translating into morphologically rich languages, it is often not clear from
the local context which morphological variant to choose.
 Word Order: To define which of the entities mentioned in the sentence is the subject and
which are the objects and what their roles are, languages such as English use word order.
Future Directions:
 The estimation of parameter values in MT models.
 Syntactic models
 Using comparable or purely monolingual data instead of parallel data.
 Integrating statistical machine translation into other information processing applications.

CHAPTER 11. MULTILINGUAL INFORMATION RETRIEVAL
استرجاع المعلومات متعدد اللغات
Importance:
 Improvements in machine translation (MT) have fostered the development of effective multilingual
retrieval systems.
 The growing number of non-English Internet users and non-English content on the Web.
 Advent of Web 2.0 technologies.
Crosslingual information retrieval (CLIR):
 Retrieving documents relevant to a given query in some language (query language) from a
collection of documents in some other language (collection language).
 Approaches: Translation-Based Approaches, Inter-lingual Document Representations.
Multilingual information retrieval (MLIR):
 Involves corpora containing documents written in different languages.
 MLIR requires different index organization and relevance computation strategies than CLIR.

Evaluation:
 Metrics: Relevance Assessments, precision and recall.
 Evaluation Campaigns: Text REtrieval Conference (TREC), Crosslingual Evaluation Forum
(CLEF), NII Test Collection for IR Systems (NTCIR), Forum for Information Retrieval Evaluation
(FIRE).
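Given relevance assessments for a query, precision and recall reduce to simple set arithmetic, as the sketch below shows. The document identifiers are hypothetical, and the function treats the result list as an unranked set.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)   # retrieved documents judged relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical system output and relevance assessments for one query.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).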
Parallel Corpora: JRC-Acquis, Multext Dataset, Canadian Hansards, Europarl.
Tools, Software, and Resources:
 Preprocessing: Content Analysis Toolkit (Tika), Snowball Stemmer, HTML Parser, BananaSplit.
 IR Frameworks: Lucene, Terrier and Lemur.
 Evaluation: trec_eval.

CHAPTER 12. MULTILINGUAL AUTOMATIC
SUMMARIZATION
التلخيص الآلي متعدد اللغات
In multilingual summarization, texts written in multiple languages are used by
summarization systems.
Types of summary:
 An informative summary is a compressed version of the original covering the most important
facts reported in the input text(s) (e.g., summary of a journal article).
 An indicative summary covers topics in the input text without providing further details (e.g.,
keywords for scientific papers).
 An evaluative summary gives an opinion on the input text most often by comparing it to
similar documents.
 An elaborative summary can provide more details of parts of a large document or the
document linked to by the current document to help navigation through large documents or
linked collections such as Wikipedia.

Crosslingual summarization: the input is spread out over multiple source languages, and the
resulting summary is presented in one (or more) target languages.
 Requires the integration of multiple source documents coming from different languages.
 Named entities are often transcribed differently in different languages, which complicates coreference
resolution.
 Languages encode number and gender agreement differently (e.g., English lacks grammatical
gender), which complicates anaphora resolution.
Evaluation:
 Extrinsic evaluations measure the usefulness of summaries by measuring how much they can
help in performing another information-processing task.
 Intrinsic evaluations measure and reflect summary quality and can be used in various stages
in a summarization development cycle.

Summarization systems are divided into three stages: analysis, transformation, and realization.
1. For the analysis stage, summarization systems may represent the text in the form of a graph.
This may be a linguistically motivated discourse tree or a matrix representation based on
sentence-to-sentence similarity.
2. The transformation process can be carried out via graph-based algorithms such as PageRank or by
machine learning–based classifiers that learn to classify sentences according to their relevancy.
3. For the realization of the summary, multilingual approaches have to face many language-dependent
challenges such as tokenization, anaphoric expressions, and discourse structure.
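The first two stages can be sketched together: build a sentence-to-sentence similarity matrix, then run a PageRank-style iteration over it and keep the top-ranked sentences. The Jaccard word-overlap similarity, the damping factor, and the iteration count are assumptions made for this illustration, and the realization stage is reduced to printing the selected sentences in their original order.

```python
def sentence_similarity(a, b):
    """Jaccard overlap of word types; a stand-in for richer similarity measures."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def rank_sentences(sentences, damping=0.85, iterations=50):
    """PageRank-style scores over the sentence similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    # Analysis: similarity matrix with a zero diagonal (no self-links).
    sim = [[sentence_similarity(sentences[i], sentences[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    # Transformation: iterate the weighted PageRank update.
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = sum(
                scores[j] * sim[j][i] / out_weight[j]
                for j in range(n) if out_weight[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

def summarize(sentences, k=1):
    """Return the k best-ranked sentences in their original order."""
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

docs = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "stock prices fell sharply",
]
print(summarize(docs, k=2))
```

Sentences that share vocabulary with many others accumulate score, while an unrelated sentence only receives the baseline share.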

CHAPTER 13. QUESTION ANSWERING
الإجابة على الأسئلة
QA: Retrieve answers to user questions from information sources.
Follows a pipeline layout consisting of components for:
1. Transforming questions into search engine queries,
2. Retrieving related text using existing IR systems,
3. Extracting and scoring candidate answers.
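The three pipeline components can be sketched as three small functions chained together. Everything here is a toy stand-in: the stopword list, the keyword-overlap retrieval (in place of an existing IR system), the capitalized-token candidate extractor, and the three passages are all assumptions for illustration.

```python
import re

# Hypothetical stopword list used to turn a question into keywords.
STOPWORDS = {"what", "is", "the", "of", "who", "which", "a", "an", "in", "are"}

def question_to_query(question):
    """Stage 1: keep content words as search keywords."""
    words = re.findall(r"\w+", question.lower())
    return [w for w in words if w not in STOPWORDS]

def retrieve(query, passages, top_k=2):
    """Stage 2: rank passages by keyword overlap (stand-in for an IR system)."""
    scored = [(sum(w in p.lower() for w in query), p) for p in passages]
    scored = [sp for sp in scored if sp[0] > 0]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return scored[:top_k]

def extract_candidates(query, scored_passages):
    """Stage 3: score capitalized tokens that are not query words."""
    scores = {}
    for passage_score, passage in scored_passages:
        for tok in re.findall(r"[A-Z]\w+", passage):
            if tok.lower() not in query:
                scores[tok] = scores.get(tok, 0) + passage_score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def answer(question, passages):
    """Run the three stages and return the best candidate, if any."""
    query = question_to_query(question)
    candidates = extract_candidates(query, retrieve(query, passages))
    return candidates[0][0] if candidates else None

# Hypothetical information source.
PASSAGES = [
    "Ankara is the capital of Turkey.",
    "Istanbul is the largest city in Turkey.",
    "Paris is the capital of France.",
]
print(answer("What is the capital of Turkey?", PASSAGES))
```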
Questions are classified with regard to their expected answer:
 Factoid questions, which ask for concise answers such as named entities (e.g., What is the capital of Turkey?).
 List questions, which seek lists of such factoid answers (e.g., Which countries are in NATO?).
 Attempts have been made to tackle questions with complex answers, such as:
 Definitional questions requesting information on a given topic, including biographies for people (e.g., Who is Albert Einstein?),
 Relationship questions (e.g., What is the relationship between the Taliban and Al-Qaeda?),
 Opinion questions (e.g., What do people like about IKEA?).

Future Directions:
 Reliable confidence estimates for the top answers.
 Crosslingual QA systems that translate answers back to the language in which the question
was asked.
 General-purpose QA algorithms and techniques that can be adapted rapidly to new tasks
and achieve high performance across different domains.
 QA systems that provide complex answers.
 How and why questions seeking explanations or justifications
 Yes–no questions requiring a system to determine whether the combined knowledge in the available information
sources entails a hypothesis.
 Deeper NLP techniques to find answers in sources that lack semantic redundancy.
 QA systems that support user interactions and information sources in different languages.

CHAPTER 14. DISTILLATION
الاستخلاص
Distillation queries can be complex and require complex answers.
 For example: Describe the reactions of <COUNTRY> to <EVENT>.
The Rosetta Consortium Distillation System: built as part of the GALE program. The system is
designed to answer distillation queries run against a large corpus composed of text documents and
audio recordings in multiple languages: English, Arabic, and Mandarin. Text sources are assumed to
belong to two main categories: structured and unstructured.
Three Stages:
 Document preparation: recordings are transcribed, and text and transcripts in foreign languages are
translated into English. Tokenization, part-of-speech (POS) tagging, parsing, mention detection, and semantic
role labeling, which rely on maximum entropy (MaxEnt) models, are then performed.
 Indexing: documents are indexed using an open source search engine, Lucene.
 Query answering: takes as input a GALE-style query, and returns a list of main snippets with associated
supporting snippets and citations, sorted in decreasing order of relevance to the query. The architecture of
the system consists of five stages: query preprocessing, document retrieval, snippet filtering, snippet
processing, and planning.

Challenges:
 The lack of publicly available corpora for measuring the progress of the field.
 The difficulty and cost of evaluating the outputs of distillation systems due to the lack of
automatic metrics.

CHAPTER 15. SPOKEN DIALOG SYSTEMS
أنظمة الحوار الآلي
A spoken dialog system is a complex machine that
manages goal-oriented user interactions.
Functional architecture:
 Speech recognition and understanding module: to
assign one or more semantic tags to each speech
input.
 Speech generation module: Rule-based grammar is
used, which encodes both the syntax and semantics of
possible utterances.
 Dialog manager: uses a finite-state machine approach
by explicitly encoding the whole interaction into what
is generally known as call-flow.
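A finite-state dialog manager of this kind can be sketched as a transition table from (state, semantic tag) to the next state. The states, tags, and the balance-inquiry scenario below are hypothetical; the semantic tags are assumed to come from the speech recognition and understanding module.

```python
# Hypothetical call-flow: each state maps a semantic tag to the next state.
CALL_FLOW = {
    "greeting":       {"check_balance": "ask_account", "goodbye": "end"},
    "ask_account":    {"account_number": "report_balance"},
    "report_balance": {"check_balance": "ask_account", "goodbye": "end"},
}

def next_state(state, semantic_tag):
    """Follow the call-flow; stay in the same state (re-prompt) on unexpected input."""
    return CALL_FLOW.get(state, {}).get(semantic_tag, state)

def run_dialog(tags, start="greeting"):
    """Return the sequence of states visited for a sequence of semantic tags."""
    state = start
    visited = [state]
    for tag in tags:
        state = next_state(state, tag)
        visited.append(state)
        if state == "end":
            break
    return visited

print(run_dialog(["check_balance", "account_number", "goodbye"]))
```

Because the whole interaction is encoded explicitly, any input the table does not anticipate simply leaves the system in its current state.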

CHAPTER 16. COMBINING NATURAL LANGUAGE
PROCESSING ENGINES
الجمع بين محركات معالجة اللغة الطبيعية
Many engines are now attaining accuracy sufficient to enable combining them to
serve more complex tasks than were possible before.
 Example applications: semantic search, enterprise reporting and other business intelligence,
question answering, medical-abstract mining, and crosslingual search, audio/video search
and cataloging, speech-to-speech translation, and foreign broadcast news analysis.
 Applications like these share many common engines, such as speaker identification, speech-
to-text, text tokenization, grammatical parsing, named entity detection, coreference analysis,
part-of-speech labeling, and translation.
Aggregation poses several challenges: Heterogeneous computing environments,
Remote operation, Data formats, Exception handling.
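Two of these challenges, a shared data format and exception handling, can be illustrated with a toy aggregator that passes one annotation dictionary through a list of engines and records failures instead of aborting. The engines and the dictionary layout are hypothetical and do not reflect the API of any real framework.

```python
def tokenize(doc):
    """Toy tokenization engine: adds a 'tokens' annotation."""
    doc["tokens"] = doc["text"].split()
    return doc

def tag_capitalized(doc):
    """Toy 'named entity' engine: marks capitalized tokens."""
    doc["entities"] = [t for t in doc["tokens"] if t[:1].isupper()]
    return doc

def failing_engine(doc):
    """Simulates an engine that is unreachable at run time."""
    raise RuntimeError("remote engine unavailable")

def run_pipeline(text, engines):
    """Run engines in order over a shared document, collecting errors."""
    doc = {"text": text, "errors": []}
    for engine in engines:
        try:
            doc = engine(doc)
        except Exception as exc:
            # Record the failure and continue with the remaining engines.
            doc["errors"].append(f"{engine.__name__}: {exc}")
    return doc

result = run_pipeline("IBM hired John", [tokenize, failing_engine, tag_capitalized])
print(result["entities"], result["errors"])
```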

Desired Attributes of Architectures for Aggregating Speech and NLP Engines:
 Flexible, Distributed Componentization.
 Computational Efficiency.
 Data-Manipulation Capabilities.
 Robust Processing.
Frameworks that support integration into more complex applications:
 UIMA
 GATE: General Architecture for Text Engineering
 InfoSphere Streams

THANK YOU FOR YOUR TIME AND
ATTENTION
