MULTILINGUAL NATURAL LANGUAGE PROCESSING
APPLICATIONS: FROM THEORY TO PRACTICE
OCTOBER 2017
Mashael Alduwais
OVERVIEW
Multilingual Natural Language Processing Applications:
From Theory to Practice
Edited by Daniel M. Bikel and Imed Zitouni
IBM Press, 2012
Two Parts:
I. Theory: 7 chapters
II. Practice: 9 chapters
10/30/2017 MASHAEL ALDUWAIS 2
ABOUT THE AUTHORS
Daniel M. Bikel
 Current Position: Research Scientist @ Google
 Previous: LinkedIn, Google, IBM
 Education: Harvard University, University of
Pennsylvania
 Interest: Syntax/parsing, information extraction,
multilingual systems, NLP systems design,
machine learning toolkits, language modeling.
Imed Zitouni
 Current Position: Principal Researcher @ Microsoft
 Previous: IBM, Bell-Labs, DIALOCA
 Education: Université Henri Poincaré, Nancy
 Interest: natural language processing, language
modeling, spoken dialog systems, speech
recognition, and machine learning.
BOOK CONTENT
Part I: Theory
 Chapter 1 Finding the Structure of Words
 Chapter 2 Finding the Structure of Documents
 Chapter 3 Syntax
 Chapter 4 Semantic Parsing
 Chapter 5 Language Modeling
 Chapter 6 Recognizing Textual Entailment
 Chapter 7 Multilingual Sentiment and
Subjectivity Analysis
Part II: Practice
 Chapter 8 Entity Detection and Tracking
 Chapter 9 Relations and Events
 Chapter 10 Machine Translation
 Chapter 11 Multilingual Information Retrieval
 Chapter 12 Multilingual Automatic
Summarization
 Chapter 13 Question Answering
 Chapter 14 Distillation
 Chapter 15 Spoken Dialog Systems
 Chapter 16 Combining Natural Language
Processing Engines
CHAPTER 1. FINDING THE STRUCTURE OF WORDS
تركيب الكلمات
Morphological parsing: discovery of word structure
 Tokens: words
 In Arabic, certain tokens (called clitics) are concatenated in writing with the preceding or the following ones, possibly changing
their forms as well.
 Lexemes: the concept behind a linguistic form and the set of alternative forms that can express it.
 Lexical categories of verbs, nouns, adjectives, conjunctions, particles, or other parts of speech.
 Turning singular into plural
 Morphemes: structural components of word form (segments or morphs). Ex: dis-agree-ment-s
 Typology: divides languages into groups by characterizing the prevalent morphological
phenomena in those languages. Ex: Isolating, Synthetic, Agglutinative, Fusional.
Issues and Challenges:
 Irregularity: word forms that are not described by a prototypical linguistic model.
 Ambiguity: word forms that can be understood in multiple ways out of the context of their discourse.
 Productivity: Is the inventory of words in a language finite, or is it unlimited?
Morphological Models:
 Dictionary Lookup
 Finite-State Morphology
 Unification-Based Morphology
 Functional Morphology
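As a rough illustration of the simplest of these models, the sketch below combines dictionary lookup with affix stripping to segment a word form into morphs such as dis-agree-ment-s. The tiny stem and affix lists (`STEMS`, `PREFIXES`, `SUFFIXES`) are hypothetical toy data, not resources described in the book, and a real analyzer would also need to handle spelling changes and ambiguity.

```python
# Toy morphological segmenter: dictionary lookup plus affix stripping.
# STEMS, PREFIXES and SUFFIXES are hypothetical illustration data.
STEMS = {"agree", "walk", "kind"}
PREFIXES = ["dis", "un"]
SUFFIXES = ["ment", "ness", "ed", "s"]

def segment(word):
    """Return a list of morphs for `word`, or None if no analysis is found."""
    if word in STEMS:                      # plain dictionary lookup
        return [word]
    for prefix in PREFIXES:                # try stripping one prefix
        if word.startswith(prefix):
            rest = segment(word[len(prefix):])
            if rest is not None:
                return [prefix] + rest
    for suffix in SUFFIXES:                # try stripping one suffix
        if word.endswith(suffix):
            rest = segment(word[:-len(suffix)])
            if rest is not None:
                return rest + [suffix]
    return None

print("-".join(segment("disagreements")))
```

The function returns only the first analysis it finds, so it sidesteps the ambiguity issue listed above rather than solving it.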
CHAPTER 2. FINDING THE STRUCTURE OF DOCUMENTS
تركيب النص
Some (NLP) tasks use sentences as the basic processing unit:
 Parsing, machine translation, automatic speech recognition (ASR) systems, and semantic role
labeling
Sentence boundary detection (sentence segmentation): Automatically segmenting a
sequence of word tokens into sentence units.
Topic segmentation (discourse or text segmentation): Automatically dividing a stream
of text or speech into topically homogeneous blocks.
A boundary classification problem:
 Depending on the type of input (i.e., text versus speech), different features may be used.
 Performance has improved by exploiting very high-dimensional feature sets.
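To make the boundary classification view concrete, here is a minimal sketch for written text: every token is a candidate boundary, a few features are extracted for it, and a decision function labels it. The abbreviation list and the hand-written rule in `is_boundary` are assumptions standing in for a trained classifier and a much larger feature set.

```python
ABBREVIATIONS = {"dr", "mr", "mrs", "e.g", "i.e"}  # hypothetical toy list

def boundary_features(tokens, i):
    """Features for the candidate boundary after tokens[i]."""
    prev = tokens[i].rstrip(".").lower()
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "ends_with_period": tokens[i].endswith("."),
        "prev_is_abbreviation": prev in ABBREVIATIONS,
        "next_is_capitalized": nxt[:1].isupper(),
        "is_last_token": nxt == "",
    }

def is_boundary(features):
    """Hand-written stand-in for a trained boundary classifier."""
    if not features["ends_with_period"] or features["prev_is_abbreviation"]:
        return False
    return features["next_is_capitalized"] or features["is_last_token"]

def split_sentences(text):
    """Segment whitespace-tokenized text into sentence units."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if is_boundary(boundary_features(tokens, i)):
            sentences.append(" ".join(current))
            current = []
    if current:                            # keep any trailing fragment
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith arrived. He sat down."))
```

For speech input the same framing would apply, but the features would come from sources such as pauses and prosody instead of punctuation.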
CHAPTER 3. SYNTAX
النحو
Syntactic parsing (syntax analysis): discovering the various predicate-argument
dependencies that may exist in a sentence.
 Parse natural language text to provide syntactic trees.
 Recursively partition the words in the sentence into individual phrases such as verb phrases or noun phrases.
 Used for text-to-speech, machine translation, summarization, and paraphrasing applications.
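The recursive partition into phrases can be pictured as a tree data structure. The sketch below stores a phrase-structure analysis as nested tuples and reads phrases back out of it; the example sentence, its labels, and the helper names are illustrative assumptions, not output of any parser discussed in the book.

```python
# A phrase-structure tree as nested tuples: (label, child, child, ...).
# Leaves are plain strings. The sentence and its analysis are illustrative.
TREE = ("S",
        ("NP", ("DT", "the"), ("NN", "dog")),
        ("VP", ("VBD", "chased"),
               ("NP", ("DT", "a"), ("NN", "cat"))))

def leaves(tree):
    """Return the words covered by `tree`, left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words

def phrases(tree, label):
    """Return the word span of every subtree carrying the given label."""
    if isinstance(tree, str):
        return []
    found = [" ".join(leaves(tree))] if tree[0] == label else []
    for child in tree[1:]:
        found.extend(phrases(child, label))
    return found

print(phrases(TREE, "NP"))
```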
Treebanks:
 A collection of sentences (an annotated text corpus) in which each sentence is provided with a
complete syntactic analysis.
 The syntactic analysis for each sentence has been judged by a human expert.
 A style book or set of annotation guidelines is typically written before the annotation process
to ensure a consistent scheme of annotation throughout the treebank.
 Two main approaches to construct treebanks: dependency graphs and phrase structure.
Challenges:
 Ambiguity: choosing from an exponentially large number of alternative analyses.
 Language issues: tokenization, case, encoding, word segmentation and morphology.
CHAPTER 4. SEMANTIC PARSING
التحليل الدلالي
Semantic parsing: identifying meaning chunks contained in an information signal in
an attempt to transform it into some data structure that can be manipulated by a
computer to perform higher level tasks.
Two types of representations:
 Deep semantic parsing: taking natural language input and transforming it into a meaning
representation. Domain-dependent.
 Problem: reusability of the representation across domains is very limited.
 Shallow semantic parsing: deals with the four main aspects of language: structural ambiguity,
word sense, entity and event recognition, and predicate argument structure recognition.
General-purpose.
 Problem: difficult to construct a general-purpose ontology.
A semantic theory should be able to:
1. Explain sentences having ambiguous meanings. For example, it should account for the
fact that the word bill in the sentence The bill is large is ambiguous in the sense that it
could represent money or the beak of a bird.
2. Resolve the ambiguities of words in context. For example, if the same sentence is
extended to form The bill is large but need not be paid, then the theory should be able
to disambiguate the monetary meaning of bill.
3. Identify meaningless but syntactically well-formed sentences, such as the famous
example by Chomsky: Colorless green ideas sleep furiously.
4. Identify syntactically or transformationally unrelated paraphrases of a concept
having the same semantic content.
Semantic parsing can be considered as part of semantic interpretation.
Requirements for Semantic Interpretation:
 Structural Ambiguity: transforming a sentence into its underlying syntactic representation.
 Word Sense: the same word type can carry different senses in different contexts.
 Ex: She nailed the loose arm of the chair with a hammer. vs. She went to the beauty salon to get a manicure.
 Entity and Event Resolution: named entity recognition and coreference resolution.
 Predicate-Argument Structure: identifying how the entities participate in these events.
 Can be defined as the identification of who did what to whom, when, where, why, and how
 Meaning Representation: build a semantic representation that can then be manipulated by
algorithms to various application ends (called deep representation). A domain-specific
approach.
CHAPTER 5. LANGUAGE MODELING
نمذجة اللغة
A language model is a statistical model that assigns a probability to a sentence.
 Specifies the a priori probability of a particular word sequence in the language of interest.
 Given an alphabet or inventory of units Σ and a sequence W = w_1 w_2 ... w_t ∈ Σ*, a language
model can be used to compute the probability of W based on parameters previously
estimated from a training set.
Language models are usually combined with other models in speech recognition and machine translation.
They are also a standard tool in information retrieval, spell correction, summarization, authorship
identification, and document classification.
n-Gram Models: assume that all previous words except for the (n − 1) words directly preceding
the current word are irrelevant for predicting the current word, or, alternatively, that
they are equivalent.
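Written out with the notation W = w_1 ... w_t used earlier in this chapter, the chain rule gives the exact probability of the sequence, and the n-gram assumption truncates each history to the previous n − 1 words:

```latex
P(W) = \prod_{i=1}^{t} P(w_i \mid w_1 \dots w_{i-1})
     \approx \prod_{i=1}^{t} P(w_i \mid w_{i-n+1} \dots w_{i-1})
```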
Evaluation criteria: coverage rate, perplexity.
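Both criteria can be sketched on a bigram model. The code below is a minimal illustration, assuming whitespace tokenization, `<s>`/`</s>` padding, and add-one smoothing (none of which the slides specify); the function names are hypothetical.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one (Laplace) smoothed estimate of P(word | prev)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def perplexity(sentence, unigrams, bigrams):
    """exp of the average negative log-probability per predicted token."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = sum(math.log(bigram_prob(p, w, unigrams, bigrams))
                   for p, w in zip(words, words[1:]))
    return math.exp(-log_prob / (len(words) - 1))

def coverage_rate(sentence, unigrams):
    """Fraction of test tokens that were seen in the training data."""
    words = sentence.split()
    return sum(w in unigrams for w in words) / len(words)

unigrams, bigrams = train_bigram(["the cat sat", "the dog sat"])
print(perplexity("the cat sat", unigrams, bigrams))
print(perplexity("sat cat the", unigrams, bigrams))
print(coverage_rate("the bird sat", unigrams))
```

A lower perplexity means the model finds the test text less surprising, so the word order seen in training scores better than the scrambled one.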
Language Model Adaptation: designing and tuning a language model such that it
performs well on a new test set for which little equivalent training data is available.
 Methods: Mixture language models, topic-dependent language model, trigger models.
Types of Language Models: other than n-gram language model
 Class-Based Language Models
 Variable-Length Language Models
 Discriminative Language Models
 Syntax-Based Language Models
 MaxEnt Language Models
 Factored Language Models
 Bayesian Topic-Based Language Models
 Neural Network Language Models
Language Modeling Problems:
 Language-Specific Modeling Problems:
 In Arabic, morphological decomposition may be required. Integrating morphological information into the language
model is helpful for modeling dialectal Arabic.
 Spoken versus Written Languages:
 Many of the world’s 6,900 languages are spoken languages, that is, languages without a writing
system (dialects).
 In this case, the only way of obtaining language model training data is to manually transcribe the
language or dialect. This is a costly and time-consuming process because it involves (i) the
development of a writing standard, (ii) training native speakers to use the writing system consistently
and accurately, and (iii) the actual transcription effort.
 When some text resources can be obtained for the language in question (e.g., from the web), they will
need to be normalized, which can also be a laborious process.
CHAPTER 6. RECOGNIZING TEXTUAL ENTAILMENT
التعرف على التضمين النصي
Textual entailment is defined as a directional relationship between pairs of text
expressions, denoted by T, the entailing text, and H, the entailed hypothesis. We say
that T entails H if the meaning of H can be inferred from the meaning of T, as would
typically be interpreted by people.
Applications of Textual Entailment Solutions:
 Summarization
 Exhaustive Search for Relations
 Question Answering
 Machine Translation
CHAPTER 7. MULTILINGUAL SENTIMENT AND
SUBJECTIVITY ANALYSIS
تحليل المشاعر والتحليل الذاتي للغات المتعددة
Subjectivity classification: labels text as either subjective or objective.
Sentiment classification: classifies subjective text as either positive, negative, or
neutral.
 Used in automatic expressive text-to-speech synthesis, tracking sentiment timelines in online
forums and news, and mining opinions from product reviews.
Tools: two main types of tools:
I. Rule-based systems: relying on manually or semi-automatically constructed lexicons. Ex:
OpinionFinder.
II. Machine learning classifiers: trained on opinion-annotated corpora. Ex: Wiebe, Bruce,
and O’Hara.
Corpora: subjectivity- and sentiment-annotated corpora are used to train automatic
classifiers and serve as resources for extracting opinion mining lexicons.
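A rule-based system of the first kind can be sketched in a few lines. The polarity lexicon and negator list below are hypothetical toy data (a real system would load a resource such as the OpinionFinder lexicon), and the negation rule is a simplifying assumption. The function first decides subjectivity (any opinion word present) and then polarity.

```python
# Hypothetical toy polarity lexicon and negator list for illustration only.
LEXICON = {"good": 1, "great": 1, "excellent": 1,
           "bad": -1, "poor": -1, "terrible": -1}
NEGATORS = {"not", "never", "no"}

def classify(text):
    """Return 'objective', 'positive', 'negative', or 'neutral'."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    score, hits = 0, 0
    for i, token in enumerate(tokens):
        if token in LEXICON:
            polarity = LEXICON[token]
            if i > 0 and tokens[i - 1] in NEGATORS:   # flip after a negator
                polarity = -polarity
            score += polarity
            hits += 1
    if hits == 0:
        return "objective"        # no opinion words found
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("The screen is not good."))
```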
Lexicons:
 OpinionFinder: contains 6,856 unique entries, out of which 990 are multiword expressions.
Each entry is also associated with a polarity label, indicating whether the corresponding
word or phrase is positive, negative, or neutral.
 General Inquirer: a dictionary of about 10,000 words grouped into about 180 categories,
which has been widely used for content analysis. It includes semantic classes (e.g., animate,
human), verb classes (e.g., negatives, becoming verbs), cognitive orientation classes (e.g.,
causal, knowing, perception), and others. Two of the largest categories in the General
Inquirer are the valence classes, which form a lexicon of 1,915 positive words and 2,291
negative words.
 SentiWordNet: built on top of WordNet, it assigns each synset in WordNet a score
triplet (positive, negative, and objective), indicating the strength of each of these three
properties for the words in the synset.
Word- and Phrase-Level Annotations: three main directions:
i. manual annotations, which involve human judgment of selected words and phrases,
ii. automatic annotations based on knowledge sources such as dictionaries,
iii. automatic annotations based on information derived from corpora.
Sentence-Level Annotations: corpus annotations are often required either as an end goal for
various text-processing applications (e.g., mining opinions from the Web, classification of
reviews into positive and negative), or as an intermediate step toward building automatic
subjectivity and sentiment classifiers. Two methods:
i. dictionary-based, consisting of rule-based classifiers relying on lexicons,
ii. corpus-based, consisting of machine learning classifiers trained on preexisting annotated
data.
Document-Level Annotations: applications, such as review classification or web opinion
mining, often require corpus-level annotations of subjectivity and polarity.
CHAPTER 8. ENTITY DETECTION AND TRACKING
التعرف على أسماء الأعلام ومتابعتها
Mention detection:
 Detecting the boundary of a mention and optionally identifying the semantic type (e.g.,
PERSON or ORGANIZATION) and other attributes (e.g., named, nominal, or pronominal).
 Closely related to named entity recognition.
 Mentions: any instances of textual references to objects or abstractions, which can be either
named (e.g., John Mayor), nominal (e.g., the president), or pronominal (e.g., she, it).
 Can be formulated as a classification problem by assigning a label to each token in the text.
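One common way to set up the per-token classification is a BIO encoding, where each token is labeled as beginning (B), inside (I), or outside (O) a mention. The slides do not name an encoding, so BIO is an assumption here, and the span format `(start, end, type)` is a hypothetical convention for the sketch.

```python
def to_bio(tokens, mentions):
    """Convert mention spans to one BIO label per token.

    `mentions` is a list of (start, end, type) with `end` exclusive.
    """
    labels = ["O"] * len(tokens)
    for start, end, mention_type in mentions:
        labels[start] = "B-" + mention_type
        for i in range(start + 1, end):
            labels[i] = "I-" + mention_type
    return labels

tokens = ["John", "Mayor", "joined", "IBM", "yesterday"]
mentions = [(0, 2, "PERSON"), (3, 4, "ORGANIZATION")]
print(list(zip(tokens, to_bio(tokens, mentions))))
```

A trained tagger would predict these labels from token features; the function above only shows the target representation it learns to produce.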
Coreference resolution:
 Clustering mentions referring to the same entity into equivalence classes.
 Machine learning-based approaches: learn a model from training data that assigns a score
to a pair of mentions indicating the likelihood that the two mentions refer to the same entity.
Mentions are then clustered into entities on the basis of mention-pair scores.
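The clustering step can be sketched as follows, assuming the mention-pair scores have already been produced by some model. The threshold, the transitive-closure strategy, and the scores in the example are all illustrative assumptions; other clustering strategies are possible.

```python
def cluster_mentions(mentions, pair_scores, threshold=0.5):
    """Group mentions whose pairwise score exceeds `threshold`.

    Links are closed transitively with a small union-find structure.
    """
    parent = {m: m for m in mentions}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]   # path halving
            m = parent[m]
        return m

    for (a, b), score in pair_scores.items():
        if score > threshold:
            parent[find(a)] = find(b)       # merge the two entities

    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return sorted(clusters.values())

mentions = ["John Mayor", "the president", "he", "IBM", "the company"]
scores = {("John Mayor", "the president"): 0.9,   # hypothetical model scores
          ("the president", "he"): 0.8,
          ("IBM", "the company"): 0.7,
          ("he", "IBM"): 0.1}
print(cluster_mentions(mentions, scores))
```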
CHAPTER 9. RELATIONS AND EVENTS
العلاقات والأحداث
Relation Extraction Systems: systems capable of finding semantic relations among entities.
 Relation extraction can be considered as a multiclass classification problem, with several classes of
features including structural, lexical, entity-based, syntactic, and semantic.
Relation Extraction Types:
 Extracting relations typically associated with lexical ontologies, such as meronymy, hyponymy, and
troponymy;
 Extracting relations similar in nature, such as detecting that verb1 expresses the same concept as
verb2 but in a stronger fashion; and
 Finding enablement, that is, detecting that the action expressed by verb1 is a
prerequisite for the action expressed by verb2.
 Identifying general semantic links between potentially heterogeneous entities, such as employment
relations between people and companies, cause of death relations between diseases and
people, or ownership of one entity (such as a company) by another.
Relation types in the National Institute of Standards and Technology (NIST) ACE evaluations:
 PHYS (physical): A spatial relation denoting that a person is located at or near a facility, or a
location.
 PART-WHOLE: A spatial relation denoting that a facility, a location, or a GPE (geopolitical entity) is a part of another
facility.
 PER-SOC (personal-social): Personal-social relations capture links between people. Relations can
be business-related or family-based.
 ORG-AFF (organization-affiliation): This type of relation pertains to connections between persons
and organizations. A person could be employed by an organization or could be a member.
 GEN-AFF (general-affiliation): citizenship, residence in a country, religious affiliation, and
ethnicity.
 ART (artifact): A relation between a user, inventor, or manufacturer and the artifact itself.
 METONYMY: A relation between two different aspects of the same underlying entity.
Events: an event denotes any change of state in the world that is described using natural
language text.
Event extraction: the use of any algorithm to extract a structured representation of
that change of state, crucially including the entities involved.
CHAPTER 10. MACHINE TRANSLATION
الترجمة الآلية
Machine translation: converting text in one language into another while preserving its meaning.
Research started in the 1940s. The most profound change, the move to statistical methods
pioneered at IBM, can be dated back to 1988.
Statistical Machine Translation:
 Using large corpora of translated texts, typically many millions of words.
 Learn the rules of translation from corpora and provide the basis for a decoding algorithm
that finds the best translation for a given input sentence
Machine translation is being integrated into various applications: crosslingual
information retrieval, speech translation, and tools for translators.
Word Alignment: Learning translation rules from a parallel corpus.
 Unsupervised learning problem.
 A word-aligned parallel corpus allows the estimation of phrase-based and tree-based
models and other approaches.
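As one concrete instance of unsupervised word alignment, the sketch below runs expectation-maximization for IBM Model 1 on a three-sentence toy corpus. It is a simplified illustration under stated assumptions: there is no NULL word, no smoothing, and the corpus is invented for the example; the slides do not say which alignment model the chapter uses.

```python
from collections import defaultdict

def ibm_model1(corpus, iterations=10):
    """Estimate t(f | e) with EM; `corpus` is a list of (e_words, f_words)."""
    f_vocab = {f for _, fs in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))      # uniform start
    for _ in range(iterations):
        count = defaultdict(float)                   # expected counts c(f, e)
        total = defaultdict(float)                   # expected counts c(e)
        for es, fs in corpus:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)    # E-step normalizer
                for e in es:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        for (f, e), c in count.items():              # M-step
            t[(f, e)] = c / total[e]
    return t

def align(es, fs, t):
    """Link each foreign word to its most probable English word."""
    return [(f, max(es, key=lambda e: t[(f, e)])) for f in fs]

corpus = [("the house".split(), "das Haus".split()),
          ("the book".split(), "das Buch".split()),
          ("a book".split(), "ein Buch".split())]
t = ibm_model1(corpus)
print(align("the house".split(), "das Haus".split(), t))
```

No alignment links are given as input; the model recovers them only from co-occurrence across sentence pairs, which is what makes the problem unsupervised.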
Evaluation:
 Human Assessment: ask human judges if the output constitutes a correct translation. Is it
fluent? Is the translation adequate?
 Automatic Evaluation Metrics: run similarity measures between the MT output and the
reference translations, counting matches, insertions, and deletions. Evaluation campaigns
for metrics let different metric developers compete for the highest correlation with human
judgments.
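The counting idea behind such metrics can be sketched with a word-level minimum edit distance between one MT output and one reference. This is an illustration of the general principle, not a specific published metric; substitutions are counted as a fourth category, which is an addition to the three counts named above.

```python
def edit_counts(hypothesis, reference):
    """Count word matches, substitutions, insertions and deletions
    along one minimum-edit-distance alignment."""
    hyp, ref = hypothesis.split(), reference.split()
    rows, cols = len(ref) + 1, len(hyp) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # reference word missing
                             dist[i][j - 1] + 1)         # extra output word
    counts = {"match": 0, "substitution": 0, "insertion": 0, "deletion": 0}
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:                                # trace one best path back
        if (i > 0 and j > 0 and
                dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            counts["match" if ref[i - 1] == hyp[j - 1] else "substitution"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletion"] += 1
            i -= 1
        else:
            counts["insertion"] += 1
            j -= 1
    return counts

print(edit_counts("the cat sat on mat", "the cat sat on the mat"))
```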
Current Research:
 The development of models that more closely mirror linguistic understanding of language,
 The application of novel machine learning methods to the estimation problem of learning
translation rules from the data, and
 The attempts to exploit various types of data sources, which are often not in the desired
domain or may not be even proper sentence-by-sentence translations at all.
Linguistic Challenges:
 Lexical Choice: word sense disambiguation. n-gram language models try to capture
local context information, which is very useful for making the right lexical choice.
 Morphology: when translating into morphologically rich languages, it is often not clear from
the local context which morphological variant to choose.
 Word Order: To define which of the entities mentioned in the sentence is the subject and
which are the objects and what their roles are, languages such as English use word order.
Future Directions:
 The estimation of parameter values in MT models.
 Syntactic models
 Using comparable or purely monolingual data instead of parallel data.
 Integrating statistical machine translation into other information processing applications.
CHAPTER 11. MULTILINGUAL INFORMATION RETRIEVAL
استرجاع المعلومات متعدد اللغات
Importance:
 Improvements in machine translation (MT) have fostered the development of effective multilingual
retrieval systems.
 The growing number of non-English Internet users and non-English content on the Web.
 Advent of Web 2.0 technologies.
Crosslingual information retrieval (CLIR):
 Retrieving documents relevant to a given query in some language (query language) from a
collection of documents in some other language (collection language).
 Approaches: Translation-Based Approaches, Inter-lingual Document Representations.
Multilingual information retrieval (MLIR):
 Involves corpora containing documents written in different languages.
 MLIR requires different index organization and relevance computation strategies than CLIR.
Evaluation:
 Metrics: Relevance Assessments, precision and recall.
 Evaluation Campaigns: Text REtrieval Conference (TREC), Crosslingual Evaluation Forum
(CLEF), NII Test Collection for IR Systems (NTCIR), Forum for Information Retrieval Evaluation
(FIRE).
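Given relevance assessments for a query, precision and recall reduce to simple set arithmetic. The sketch below uses hypothetical document identifiers and ignores ranking, which the rank-based measures used in evaluation campaigns take into account.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical system output and relevance assessments for one query.
print(precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"]))
```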
Parallel Corpora: JRC-Acquis, Multext Dataset, Canadian Hansards, Europarl.
Tools, Software, and Resources:
 Preprocessing: Content Analysis Toolkit (Tika), Snowball Stemmer, HTML Parser, BananaSplit.
 IR Frameworks: Lucene, Terrier, and Lemur.
 Evaluation: trec_eval.
CHAPTER 12. MULTILINGUAL AUTOMATIC
SUMMARIZATION
التلخيص الآلي متعدد اللغات
In multilingual summarization, texts written in multiple languages are used by
summarization systems.
Types of summary:
 An informative summary is a compressed version of the original covering the most important
facts reported in the input text(s) (e.g., summary of a journal article).
 An indicative summary covers topics in the input text without providing further details (e.g.,
keywords for scientific papers).
 An evaluative summary gives an opinion on the input text most often by comparing it to
similar documents.
 An elaborative summary can provide more details of parts of a large document or the
document linked to by the current document to help navigation through large documents or
linked collections such as Wikipedia.
Crosslingual summarization: the input documents are spread out over multiple source languages, and the
resulting summary is presented in one (or more) target languages.
 Requires the integration of multiple source documents coming from different languages.
 Named entities are often transcribed differently in different languages, which complicates
coreference resolution.
 Languages encode number and gender agreement differently (for example, English lacks grammatical
gender), which complicates anaphora resolution.
Evaluation:
 Extrinsic evaluations measure the usefulness of summaries by measuring how much they can
help in performing another information-processing task.
 Intrinsic evaluations measure and reflect summary quality and can be used in various stages
in a summarization development cycle.
Summarization systems are divided into three stages:
1. Analysis: summarization systems may represent the text in the form of a graph. This may be a
linguistically motivated discourse tree or a matrix representation based on
sentence-to-sentence similarity.
2. Transformation: can be carried out via graph-based algorithms such as PageRank or by machine
learning-based classifiers that learn to classify sentences according to their relevancy.
3. Realization: multilingual approaches have to face many language-dependent challenges such as
tokenization, anaphoric expressions, and discourse structure when realizing the summary.
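The first two stages can be sketched together: build a sentence-to-sentence similarity matrix, then score sentences with a PageRank-style power iteration and keep the top ones. The word-overlap similarity, damping factor, and iteration count are assumptions chosen for the illustration, and realization is reduced to returning the selected sentences in document order.

```python
def sentence_similarity(a, b):
    """Word-overlap (Jaccard) similarity between two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def rank_sentences(sentences, damping=0.85, iterations=50):
    """PageRank-style power iteration over the sentence similarity graph."""
    n = len(sentences)
    sim = [[sentence_similarity(a, b) if i != j else 0.0
            for j, b in enumerate(sentences)]
           for i, a in enumerate(sentences)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n + damping * sum(
                      scores[j] * sim[j][i] / out_weight[j]
                      for j in range(n) if out_weight[j] > 0)
                  for i in range(n)]
    return scores

def summarize(sentences, k=1):
    """Return the k highest-scoring sentences in their original order."""
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
    return [sentences[i] for i in sorted(top)]

sentences = ["the cat sat on the mat",
             "the cat sat on the rug",
             "the cat sat quietly",
             "stocks fell sharply today"]
print(summarize(sentences, k=3))
```

The off-topic sentence shares no words with the others, so it receives no score from its neighbors and is the one left out.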
CHAPTER 13. QUESTION ANSWERING
الإجابة على الأسئلة
QA: Retrieve answers to user questions from information sources.
A QA system follows a pipeline layout consisting of components for:
1. Transforming questions into search engine queries,
2. Retrieving related text using existing IR systems,
3. Extracting and scoring candidate answers.
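The three components can be strung together in a toy pipeline. Everything below is a deliberately naive stand-in: the stopword list, the term-overlap retrieval, and the "capitalized word" answer heuristic are hypothetical simplifications, where a real system would use an IR engine and named entity recognition.

```python
from collections import Counter

STOPWORDS = {"what", "is", "the", "of", "which", "who", "in", "are", "a"}

def to_query(question):
    """Step 1: turn the question into a bag of search terms."""
    words = question.lower().rstrip("?").split()
    return {w for w in words if w not in STOPWORDS}

def retrieve(query, passages, k=2):
    """Step 2: rank passages by the number of query terms they contain."""
    def overlap(passage):
        return len(query & set(passage.lower().replace(".", "").split()))
    return sorted(passages, key=overlap, reverse=True)[:k]

def extract_answers(query, passages):
    """Step 3: score capitalized words that are not query terms."""
    candidates = Counter()
    for passage in passages:
        for word in passage.replace(".", "").split():
            if word[0].isupper() and word.lower() not in (query | STOPWORDS):
                candidates[word] += 1
    return candidates.most_common()

def answer(question, passages):
    """Run the three steps and return the top candidate, if any."""
    query = to_query(question)
    ranked = extract_answers(query, retrieve(query, passages))
    return ranked[0][0] if ranked else None

passages = ["Ankara is the capital of Turkey.",
            "Istanbul is the largest city in Turkey.",
            "The capital of Turkey is Ankara.",
            "Paris is the capital of France."]
print(answer("What is the capital of Turkey?", passages))
```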
Questions are classified with regard to their expected answer:
 factoid questions, which ask for concise answers such as named entities (e.g., What is the capital of Turkey?),
 list questions, which seek lists of such factoid answers (e.g., Which countries are in NATO?).
 Attempts have been made to tackle questions with complex answers, such as:
 definitional questions requesting information on a given topic, including biographies for people (e.g., Who is Albert Einstein?),
 relationship questions (e.g., What is the relationship between the Taliban and Al-Qaeda?),
 opinion questions (e.g., What do people like about IKEA?).
Future Directions:
 Reliable confidence estimates for the top answers.
 Crosslingual QA systems that translate answers back to the language in which the question
was asked.
 General-purpose QA algorithms and techniques that can be adapted rapidly to new tasks
and achieve high performance across different domains.
 QA systems that provide complex answers.
 How and why questions seeking explanations or justifications
 Yes–no questions requiring a system to determine whether the combined knowledge in the available information
sources entails a hypothesis.
 Deeper NLP techniques to find answers in sources that lack semantic redundancy.
 QA systems that support user interactions and information sources in different languages.
CHAPTER 14. DISTILLATION
الاستخلاص
Distillation queries can be complex and require complex answers.
 For example: Describe the reactions of <COUNTRY> to <EVENT>.
The Rosetta Consortium Distillation System: built as part of the GALE program. The system is
designed to answer distillation queries run against a large corpus composed of text documents and
audio recordings in multiple languages: English, Arabic, and Mandarin. Text sources are assumed to
belong to two main categories: structured and unstructured.
Three Stages:
 Document preparation: recordings are transcribed, and text and transcripts in foreign languages are
translated into English. Tokenization, part-of-speech (POS) tagging, parsing, mention detection, and semantic
role labeling are then performed, relying on maximum entropy (MaxEnt) models.
 Indexing: documents are indexed using an open source search engine, Lucene.
 Query answering: takes as input a GALE-style query, and returns a list of main snippets with associated
supporting snippets and citations, sorted in decreasing order of relevance to the query. The architecture of
the system consists of five stages: query preprocessing, document retrieval, snippet filtering, snippet
processing, and planning.
Challenges:
 The lack of publicly available corpora for measuring the progress of the field.
 The difficulty and cost of evaluating the outputs of distillation systems due to the lack of
automatic metrics.
CHAPTER 15. SPOKEN DIALOG SYSTEMS
أنظمة الحوار الآلي
A spoken dialog system is a complex machine that
manages goal-oriented user interactions.
Functional architecture:
 Speech recognition and understanding module: to
assign one or more semantic tags to each speech
input.
 Speech generation module: Rule-based grammar is
used, which encodes both the syntax and semantics of
possible utterances.
 Dialog manager: uses a finite-state machine approach
by explicitly encoding the whole interaction into what
is generally known as call-flow.
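A finite-state dialog manager of this kind can be sketched as a transition table keyed by dialog state and by the semantic tag assigned to the user's speech input. The states, tags, and prompts below describe a hypothetical banking call-flow invented for the illustration.

```python
# Hypothetical call-flow: each state maps a semantic tag to the next state.
CALL_FLOW = {
    "greeting":       {"check_balance": "ask_account", "goodbye": "end"},
    "ask_account":    {"account_number": "report_balance"},
    "report_balance": {"check_balance": "ask_account", "goodbye": "end"},
}
PROMPTS = {
    "greeting": "How can I help you?",
    "ask_account": "What is your account number?",
    "report_balance": "Your balance is available. Anything else?",
    "end": "Goodbye.",
}

def next_state(state, semantic_tag):
    """Follow the transition for `semantic_tag`; stay put if none exists."""
    return CALL_FLOW.get(state, {}).get(semantic_tag, state)

def run_dialog(semantic_tags, state="greeting"):
    """Return the system prompts produced for a sequence of user tags."""
    prompts = [PROMPTS[state]]
    for tag in semantic_tags:
        state = next_state(state, tag)
        prompts.append(PROMPTS[state])
        if state == "end":
            break
    return prompts

print(run_dialog(["check_balance", "account_number", "goodbye"]))
```

Staying in the same state on an unexpected tag means the system simply repeats its prompt, which is the simplest possible error-handling policy for such a call-flow.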
CHAPTER 16. COMBINING NATURAL LANGUAGE
PROCESSING ENGINES
الجمع بين محركات معالجة اللغة الطبيعية
Many engines are now attaining accuracy sufficient to enable combining them to
serve more complex tasks than were possible before.
 Example applications: semantic search, enterprise reporting and other business intelligence,
question answering, medical-abstract mining, and crosslingual search, audio/video search
and cataloging, speech-to-speech translation, and foreign broadcast news analysis.
 Applications like these share many common engines, such as speaker identification, speech-
to-text, text tokenization, grammatical parsing, named entity detection, coreference analysis,
part-of-speech labeling, and translation.
Aggregation poses several challenges: Heterogeneous computing environments,
Remote operation, Data formats, Exception handling.
Desired Attributes of Architectures for Aggregating Speech and NLP Engines:
 Flexible, Distributed Componentization.
 Computational Efficiency.
 Data-Manipulation Capabilities.
 Robust Processing.
Frameworks that support integration into more complex applications:
 UIMA
 GATE: General Architecture for Text Engineering
 InfoSphere Streams
THANK YOU FOR YOUR TIME AND
ATTENTION

Information retrieval based on word sens 1
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI) International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
NL Context Understanding 23(6)
NL Context Understanding 23(6)NL Context Understanding 23(6)
NL Context Understanding 23(6)
 
A neural probabilistic language model
A neural probabilistic language modelA neural probabilistic language model
A neural probabilistic language model
 
Crafting Your Customized Legal Mastery: A Guide to Building Your Private LLM
Crafting Your Customized Legal Mastery: A Guide to Building Your Private LLMCrafting Your Customized Legal Mastery: A Guide to Building Your Private LLM
Crafting Your Customized Legal Mastery: A Guide to Building Your Private LLM
 
REPORT.doc
REPORT.docREPORT.doc
REPORT.doc
 
Principles of parameters
Principles of parametersPrinciples of parameters
Principles of parameters
 
Applied Genre Analysis A Multi-Perspective Model
Applied Genre Analysis  A Multi-Perspective ModelApplied Genre Analysis  A Multi-Perspective Model
Applied Genre Analysis A Multi-Perspective Model
 
Domain Specific Terminology Extraction (ICICT 2006)
Domain Specific Terminology Extraction (ICICT 2006)Domain Specific Terminology Extraction (ICICT 2006)
Domain Specific Terminology Extraction (ICICT 2006)
 
ARTIFICIAL INTELLEGENCE AND MACHINE LEARNING.pptx
ARTIFICIAL INTELLEGENCE AND MACHINE LEARNING.pptxARTIFICIAL INTELLEGENCE AND MACHINE LEARNING.pptx
ARTIFICIAL INTELLEGENCE AND MACHINE LEARNING.pptx
 
Processing Written English
Processing Written EnglishProcessing Written English
Processing Written English
 

Plus de iwan_rg

Automatic text simplification evaluation aspects
Automatic text simplification  evaluation aspectsAutomatic text simplification  evaluation aspects
Automatic text simplification evaluation aspects
iwan_rg
 
محاضرة المدونات اللغوية وأدواتها
محاضرة المدونات اللغوية وأدواتهامحاضرة المدونات اللغوية وأدواتها
محاضرة المدونات اللغوية وأدواتها
iwan_rg
 
التقرير السنوي لمجموعة إيوان البحثية لعام 1434 هـ - 1435هـ
التقرير السنوي لمجموعة إيوان البحثية لعام 1434 هـ -  1435هـالتقرير السنوي لمجموعة إيوان البحثية لعام 1434 هـ -  1435هـ
التقرير السنوي لمجموعة إيوان البحثية لعام 1434 هـ - 1435هـ
iwan_rg
 

Plus de iwan_rg (20)

Automatic text simplification evaluation aspects
Automatic text simplification  evaluation aspectsAutomatic text simplification  evaluation aspects
Automatic text simplification evaluation aspects
 
تلخيص كتاب مقدمة في معالجة اللغة العربية
تلخيص كتاب مقدمة في معالجة اللغة العربيةتلخيص كتاب مقدمة في معالجة اللغة العربية
تلخيص كتاب مقدمة في معالجة اللغة العربية
 
Building theoretical models using structured equation modeling
Building theoretical models using structured equation modelingBuilding theoretical models using structured equation modeling
Building theoretical models using structured equation modeling
 
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
 
Introduction to Arabic natural language processing (Infographics)
Introduction to Arabic natural language processing (Infographics)Introduction to Arabic natural language processing (Infographics)
Introduction to Arabic natural language processing (Infographics)
 
التقرير السنوي لمجموعة إيوان البحثية 1437هـ-1438هـ
التقرير السنوي لمجموعة إيوان البحثية 1437هـ-1438هـالتقرير السنوي لمجموعة إيوان البحثية 1437هـ-1438هـ
التقرير السنوي لمجموعة إيوان البحثية 1437هـ-1438هـ
 
CHOOSING RESEARCH TOPICS AND WRITING RESEARCH PAPERS
CHOOSING RESEARCH TOPICS AND WRITING RESEARCH PAPERSCHOOSING RESEARCH TOPICS AND WRITING RESEARCH PAPERS
CHOOSING RESEARCH TOPICS AND WRITING RESEARCH PAPERS
 
التقرير السنوي لمجموعة إيوان البحثية 1436هـ-1437هـ
التقرير السنوي لمجموعة إيوان البحثية 1436هـ-1437هـالتقرير السنوي لمجموعة إيوان البحثية 1436هـ-1437هـ
التقرير السنوي لمجموعة إيوان البحثية 1436هـ-1437هـ
 
مركز تميز الحوسبة العربية المتقدمة
مركز تميز  الحوسبة العربية المتقدمةمركز تميز  الحوسبة العربية المتقدمة
مركز تميز الحوسبة العربية المتقدمة
 
P05- DINA: A Multi-Dialect Dataset for Arabic Emotion Analysis
P05- DINA: A Multi-Dialect Dataset for Arabic Emotion Analysis P05- DINA: A Multi-Dialect Dataset for Arabic Emotion Analysis
P05- DINA: A Multi-Dialect Dataset for Arabic Emotion Analysis
 
P03- MANDIAC: A Web-based Annotation System For Manual Arabic Diacritization
P03- MANDIAC: A Web-based Annotation System For Manual Arabic Diacritization P03- MANDIAC: A Web-based Annotation System For Manual Arabic Diacritization
P03- MANDIAC: A Web-based Annotation System For Manual Arabic Diacritization
 
P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation
P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation
P04- Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation
 
P02- Towards a New Arabic Corpus of Dyslexic Texts
P02- Towards a New Arabic Corpus of Dyslexic TextsP02- Towards a New Arabic Corpus of Dyslexic Texts
P02- Towards a New Arabic Corpus of Dyslexic Texts
 
P01- Toward a rich Arabic Speech Parallel Corpus for Algerian sub-Dialects
P01- Toward a rich Arabic Speech Parallel Corpus for Algerian sub-Dialects P01- Toward a rich Arabic Speech Parallel Corpus for Algerian sub-Dialects
P01- Toward a rich Arabic Speech Parallel Corpus for Algerian sub-Dialects
 
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...
Keynote - Computational Processing of Arabic Dialects: Challenges, Advances a...
 
OSACT2 LREC 2016 workshop proceedings
OSACT2 LREC 2016 workshop proceedingsOSACT2 LREC 2016 workshop proceedings
OSACT2 LREC 2016 workshop proceedings
 
محاضرة المدونات اللغوية وأدواتها
محاضرة المدونات اللغوية وأدواتهامحاضرة المدونات اللغوية وأدواتها
محاضرة المدونات اللغوية وأدواتها
 
لغويات المدونة الحاسوبية
لغويات المدونة الحاسوبيةلغويات المدونة الحاسوبية
لغويات المدونة الحاسوبية
 
iWAN Annual Report 1435/1436H
 iWAN Annual Report 1435/1436H iWAN Annual Report 1435/1436H
iWAN Annual Report 1435/1436H
 
التقرير السنوي لمجموعة إيوان البحثية لعام 1434 هـ - 1435هـ
التقرير السنوي لمجموعة إيوان البحثية لعام 1434 هـ -  1435هـالتقرير السنوي لمجموعة إيوان البحثية لعام 1434 هـ -  1435هـ
التقرير السنوي لمجموعة إيوان البحثية لعام 1434 هـ - 1435هـ
 

Dernier

Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Kandungan 087776558899
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 

Dernier (20)

Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projects2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projects
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
22-prompt engineering noted slide shown.pdf
22-prompt engineering noted slide shown.pdf22-prompt engineering noted slide shown.pdf
22-prompt engineering noted slide shown.pdf
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
 

Summary of Multilingual Natural Language Processing Applications: From Theory to Practice

  • 1. MULTILINGUAL NATURAL LANGUAGE PROCESSING APPLICATIONS: FROM THEORY TO PRACTICE
      • October 2017, Mashael Alduwais
  • 2. OVERVIEW
      • Multilingual Natural Language Processing Applications: From Theory to Practice
      • Edited by Daniel M. Bikel and Imed Zitouni
      • IBM Press, 2012
      • Two parts:
          • I. Theory: 7 chapters
          • II. Practice: 9 chapters
      • 10/30/2017 MASHAEL ALDUWAIS 2
  • 3. ABOUT THE AUTHORS
      • Daniel M. Bikel
          • Current position: Research Scientist at Google
          • Previous: LinkedIn, Google, IBM
          • Education: Harvard University, University of Pennsylvania
          • Interests: syntax/parsing, information extraction, multilingual systems, NLP systems design, machine learning toolkits, language modeling
      • Imed Zitouni
          • Current position: Principal Researcher at Microsoft
          • Previous: IBM, Bell Labs, DIALOCA
          • Education: Université Henri Poincaré, Nancy
          • Interests: natural language processing, language modeling, spoken dialog systems, speech recognition, machine learning
  • 4. BOOK CONTENT
      • Part I: Theory
          • Chapter 1: Finding the Structure of Words
          • Chapter 2: Finding the Structure of Documents
          • Chapter 3: Syntax
          • Chapter 4: Semantic Parsing
          • Chapter 5: Language Modeling
          • Chapter 6: Recognizing Textual Entailment
          • Chapter 7: Multilingual Sentiment and Subjectivity Analysis
      • Part II: Practice
          • Chapter 8: Entity Detection and Tracking
          • Chapter 9: Relations and Events
          • Chapter 10: Machine Translation
          • Chapter 11: Multilingual Information Retrieval
          • Chapter 12: Multilingual Automatic Summarization
          • Chapter 13: Question Answering
          • Chapter 14: Distillation
          • Chapter 15: Spoken Dialog Systems
          • Chapter 16: Combining Natural Language Processing Engines
  • 5. CHAPTER 1. FINDING THE STRUCTURE OF WORDS
      • Morphological parsing: the discovery of word structure.
      • Tokens: words.
          • In Arabic, certain tokens are concatenated in writing with the preceding or following ones, possibly changing their forms as well; these are called clitics.
      • Lexemes: the concept behind a linguistic form, together with the set of alternative forms that can express it.
          • Lexical categories: verbs, nouns, adjectives, conjunctions, particles, and other parts of speech.
          • Inflecting a lexeme, e.g., turning a singular into a plural.
      • Morphemes: the structural components of a word form (segments or morphs). Ex: dis-agree-ment-s.
      • Typology: divides languages into groups by characterizing the prevalent morphological phenomena in those languages. Ex: isolating, synthetic, agglutinative, fusional.
  • 6. CHAPTER 1. FINDING THE STRUCTURE OF WORDS
      • Issues and challenges:
          • Irregularity: word forms that are not described by a prototypical linguistic model.
          • Ambiguity: word forms that can be understood in multiple ways out of the context of their discourse.
          • Productivity: is the inventory of words in a language finite, or is it unlimited?
      • Morphological models:
          • Dictionary lookup
          • Finite-state morphology
          • Unification-based morphology
          • Functional morphology
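To make the morpheme example dis-agree-ment-s concrete, here is a minimal sketch of a dictionary-lookup style segmenter. The prefix, suffix, and stem inventories are hypothetical toy data invented for illustration, not taken from the book, and a realistic model would also have to handle the irregularity and ambiguity noted above.

```python
# Hypothetical toy inventories; a real system would use a full lexicon.
PREFIXES = ("dis", "un", "re")
SUFFIXES = ("s", "ment", "ing", "ed")
STEMS = {"agree", "pay", "move"}


def split_suffixes(tail):
    """Split `tail` into a sequence of known suffixes, or return None."""
    if not tail:
        return []
    for suffix in SUFFIXES:
        if tail.startswith(suffix):
            rest = split_suffixes(tail[len(suffix):])
            if rest is not None:
                return [suffix] + rest
    return None


def segment(word):
    """Return the morphs of `word` as a list, or None if no analysis is found."""
    # Try the analysis without a prefix first, then with each known prefix.
    for prefix in ("",) + PREFIXES:
        if not word.startswith(prefix):
            continue
        body = word[len(prefix):]
        for stem in STEMS:
            if body.startswith(stem):
                suffixes = split_suffixes(body[len(stem):])
                if suffixes is not None:
                    return ([prefix] if prefix else []) + [stem] + suffixes
    return None


print(segment("disagreements"))  # ['dis', 'agree', 'ment', 's']
```

The sketch returns only the first analysis it finds; an ambiguous word form would need all analyses returned and ranked in context.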
  • 7. CHAPTER 2. FINDING THE STRUCTURE OF DOCUMENTS
      • Some NLP tasks use sentences as the basic processing unit: parsing, machine translation, automatic speech recognition (ASR), and semantic role labeling.
      • Sentence boundary detection (sentence segmentation): automatically segmenting a sequence of word tokens into sentence units.
      • Topic segmentation (discourse or text segmentation): automatically dividing a stream of text or speech into topically homogeneous blocks.
      • Both can be treated as a boundary classification problem:
          • Depending on the type of input (text versus speech), different features may be used.
          • Performance has improved by exploiting very high-dimensional feature sets.
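The boundary classification view of sentence segmentation can be sketched as follows. This is an illustration under stated assumptions: the hand-written rule in `is_boundary` stands in for a trained classifier, and the abbreviation list and the two features used (terminal punctuation, capitalization of the next token) are hypothetical choices rather than the feature sets discussed in the book.

```python
# Hypothetical abbreviation list used as a feature; not from the book.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e."}


def is_boundary(token, next_token):
    """Decide whether a sentence ends after `token` (stand-in for a classifier)."""
    if not token.endswith((".", "?", "!")):
        return False
    if token.lower() in ABBREVIATIONS:
        return False
    # End of input or a following capitalized word counts as evidence of a boundary.
    return next_token is None or next_token[0].isupper()


def split_sentences(text):
    """Segment whitespace-separated tokens into sentence units."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if is_boundary(token, next_token):
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing material that lacks final punctuation
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Dr. Smith arrived. He was late."))
# ['Dr. Smith arrived.', 'He was late.']
```

For speech input there is no punctuation or capitalization to rely on, so the features would have to be replaced, for example by pause or prosodic cues.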
  • 8. CHAPTER 3. SYNTAX
      • Syntactic parsing (syntax analysis): discovering the various predicate-argument dependencies that may exist in a sentence.
          • Parses natural language text to produce syntactic trees.
          • Recursively partitions the words of the sentence into individual phrases, such as verb phrases or noun phrases.
          • Used in text-to-speech, machine translation, summarization, and paraphrasing applications.
  • 9. CHAPTER 3. SYNTAX
      • Treebanks:
          • A collection of sentences in which each sentence is given a complete syntactic analysis (an annotated text corpus).
          • The syntactic analysis of each sentence has been judged by a human expert.
          • A style book or set of annotation guidelines is typically written before the annotation process to ensure a consistent annotation scheme throughout the treebank.
          • Two main approaches to constructing treebanks: dependency graphs and phrase structure.
      • Challenges:
          • Ambiguity: choosing from an exponentially large number of alternative analyses.
          • Language issues: tokenization, case, encoding, word segmentation, and morphology.
  • 10. CHAPTER 4. SEMANTIC PARSING
      • Semantic parsing: identifying the meaning chunks contained in an information signal in an attempt to transform it into a data structure that a computer can manipulate to perform higher-level tasks.
      • Two types of representations:
          • Deep semantic parsing: takes natural language input and transforms it into a meaning representation. Domain-dependent.
              • Problem: reusability of the representation across domains is very limited.
          • Shallow semantic parsing: deals with the four main aspects of language: structural ambiguity, word sense, entity and event recognition, and predicate-argument structure recognition. General-purpose.
              • Problem: it is difficult to construct a general-purpose ontology.
  • 11. CHAPTER 4. SEMANTIC PARSING
      • A semantic theory should be able to:
          • 1. Explain sentences having ambiguous meanings. For example, it should account for the fact that the word "bill" in the sentence "The bill is large" is ambiguous: it could mean money owed or the beak of a bird.
          • 2. Resolve the ambiguities of words in context. For example, if the same sentence is extended to "The bill is large but need not be paid", the theory should be able to select the monetary meaning of "bill".
          • 3. Identify meaningless but syntactically well-formed sentences, such as Chomsky's famous example "Colorless green ideas sleep furiously."
          • 4. Identify syntactically or transformationally unrelated paraphrases of a concept that have the same semantic content.
  • 12. CHAPTER 4. SEMANTIC PARSING
      • Semantic parsing can be considered part of semantic interpretation.
      • Requirements for semantic interpretation:
          • Structural ambiguity: transforming a sentence into its underlying syntactic representation.
          • Word sense: the same word type is used with different meanings in different contexts.
              • Ex: "She nailed the loose arm of the chair with a hammer." vs. "She went to the beauty salon to get a manicure."
          • Entity and event resolution: named entity recognition and coreference resolution.
          • Predicate-argument structure: identifying how the entities participate in the events.
              • Can be defined as the identification of who did what to whom, when, where, why, and how.
          • Meaning representation: building a semantic representation that algorithms can then manipulate for various application ends (called a deep representation). A domain-specific approach.
  • 13. CHAPTER 5. LANGUAGE MODELING
      • A language model (LM) is a statistical model that assigns a probability to a sentence.
          • It specifies the a priori probability of a particular word sequence in the language of interest.
          • Given an alphabet or inventory of units Σ and a sequence W = w_1 w_2 ... w_t ∈ Σ*, a language model can be used to compute the probability of W based on parameters previously estimated from a training set.
      • Language models are usually used as a component of speech recognition and machine translation systems.
      • They are also a standard tool in information retrieval, spelling correction, summarization, authorship identification, and document classification.
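As a worked form of the definition above, the probability of the sequence W can be written with the chain rule, and an n-gram model then truncates each history to the n - 1 preceding words. This is the standard textbook formulation, written here with the slide's symbols W, w_i, and t; the slide itself does not spell it out.

```latex
\begin{align}
P(W) &= \prod_{i=1}^{t} P(w_i \mid w_1 \dots w_{i-1}) \\
     &\approx \prod_{i=1}^{t} P(w_i \mid w_{i-n+1} \dots w_{i-1})
\end{align}
```

The first line is exact; the second is the modeling assumption, and its parameters are the conditional probabilities estimated from the training set.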
  • 14. CHAPTER 5. LANGUAGE MODELING
      • n-gram models: assume that all previous words except the (n - 1) words directly preceding the current word are irrelevant for predicting the current word or, alternatively, that they are equivalent.
      • Evaluation criteria: coverage rate, perplexity.
      • Language model adaptation: designing and tuning a language model so that it performs well on a new test set for which little equivalent training data is available.
          • Methods: mixture language models, topic-dependent language models, trigger models.
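The n-gram assumption and the perplexity criterion can be illustrated with a small bigram (n = 2) model. This is a sketch under assumptions not made in the slides: add-one smoothing is chosen only for simplicity, the `<s>` and `</s>` boundary markers and all function names are illustrative, and words outside the training vocabulary are not handled specially.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Collect context counts, bigram counts, and the vocabulary."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])              # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return contexts, bigrams, vocab


def bigram_prob(prev, word, model):
    """P(word | prev) with add-one smoothing (an assumption of this sketch)."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


def perplexity(sentence, model):
    """Perplexity of one sentence: exp of the negative mean log-probability."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    pairs = list(zip(tokens, tokens[1:]))
    log_prob = sum(math.log(bigram_prob(p, w, model)) for p, w in pairs)
    return math.exp(-log_prob / len(pairs))


model = train_bigram(["the cat sat", "the dog sat"])
# A word order seen in training should score a lower perplexity than a scrambled one.
print(perplexity("the cat sat", model) < perplexity("sat cat the", model))  # True
```

Lower perplexity means the model is less surprised by the test text, which is why it serves as an evaluation criterion alongside coverage rate.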
  • 15. CHAPTER 5. LANGUAGE MODELING
      • Types of language models other than n-gram models:
          • Class-based language models
          • Variable-length language models
          • Discriminative language models
          • Syntax-based language models
          • MaxEnt language models
          • Factored language models
          • Bayesian topic-based language models
          • Neural network language models
  • 16. CHAPTER 5. LANGUAGE MODELING
      • Language-specific modeling problems:
          • In Arabic, morphological decomposition may be required. Integrating morphological information into the language model is helpful for modeling dialectal Arabic.
      • Spoken versus written languages:
          • Many of the world's 6,900 languages are spoken languages, that is, languages without a writing system (dialects).
          • In this case, the only way of obtaining language model training data is to manually transcribe the language or dialect. This is costly and time-consuming because it involves (i) developing a writing standard, (ii) training native speakers to use the writing system consistently and accurately, and (iii) the actual transcription effort.
          • Where text resources can be obtained for the language in question (e.g., from the web), they need to be normalized, which can also be a laborious process.
  • 17. CHAPTER 6. RECOGNIZING TEXTUAL ENTAILMENT
      • Textual entailment is defined as a directional relationship between pairs of text expressions, denoted by T, the entailing text, and H, the entailed hypothesis. T entails H if the meaning of H can be inferred from the meaning of T, as it would typically be interpreted by people.
      • Applications of textual entailment solutions:
          • Summarization
          • Exhaustive search for relations
          • Question answering
          • Machine translation
  • 18. CHAPTER 7. MULTILINGUAL SENTIMENT AND SUBJECTIVITY ANALYSIS
      • Subjectivity classification: labels text as either subjective or objective.
      • Sentiment classification: classifies subjective text as positive, negative, or neutral.
          • Used in automatic expressive text-to-speech synthesis, tracking sentiment timelines in online forums and news, and mining opinions from product reviews.
      • Tools, of two main types:
          • I. Rule-based systems, which rely on manually or semi-automatically constructed lexicons. Ex: OpinionFinder.
          • II. Machine learning classifiers, which are trained on opinion-annotated corpora. Ex: Wiebe, Bruce, and O'Hara.
      • Corpora: subjectivity- and sentiment-annotated corpora are used to train automatic classifiers and as resources from which to extract opinion-mining lexicons.
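A rule-based system of the first type can be sketched in a few lines: look each word up in a polarity lexicon, call the text objective if nothing matches, and otherwise decide the sentiment from the summed polarities. The six-word lexicon below is a hypothetical toy, not an excerpt from OpinionFinder, and the summing rule is an assumption made for illustration.

```python
# Hypothetical toy lexicon: word -> polarity (+1 positive, -1 negative).
LEXICON = {
    "good": 1, "great": 1, "excellent": 1,
    "bad": -1, "poor": -1, "terrible": -1,
}


def classify(sentence):
    """Label a sentence as objective, positive, negative, or neutral."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    hits = [LEXICON[w] for w in words if w in LEXICON]
    if not hits:
        return "objective"      # subjectivity step: no opinion words found
    score = sum(hits)           # sentiment step: combine the polarities
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify("The camera is great!"))            # positive
print(classify("Terrible battery, poor screen."))  # negative
print(classify("The phone has a screen."))         # objective
```

A sketch like this ignores negation and multiword expressions, which is one reason classifiers trained on annotated corpora are used as the second type of tool.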
  • 19. CHAPTER 7. MULTILINGUAL SENTIMENT AND SUBJECTIVITY ANALYSIS
      • Lexicons:
          • OpinionFinder: contains 6,856 unique entries, of which 990 are multiword expressions. Each entry is associated with a polarity label indicating whether the corresponding word or phrase is positive, negative, or neutral.
          • General Inquirer: a dictionary of about 10,000 words grouped into about 180 categories, widely used for content analysis. It includes semantic classes (e.g., animate, human), verb classes (e.g., negatives, becoming verbs), cognitive orientation classes (e.g., causal, knowing, perception), and others. Two of its largest categories are the valence classes, which form a lexicon of 1,915 positive words and 2,291 negative words.
          • SentiWordNet: built on top of WordNet; assigns each WordNet synset a score triplet (positive, negative, objective) indicating the strength of each of these three properties for the words in the synset.
  • 20. CHAPTER 7. MULTILINGUAL SENTIMENT AND SUBJECTIVITY ANALYSIS
      • Word- and phrase-level annotations, in three main directions:
          • i. manual annotations, which involve human judgment of selected words and phrases;
          • ii. automatic annotations based on knowledge sources such as dictionaries;
          • iii. automatic annotations based on information derived from corpora.
      • Sentence-level annotations: corpus annotations are often required either as an end goal for text-processing applications (e.g., mining opinions from the Web, classifying reviews as positive or negative) or as an intermediate step toward building automatic subjectivity and sentiment classifiers. Two methods:
          • i. dictionary-based, consisting of rule-based classifiers relying on lexicons;
          • ii. corpus-based, consisting of machine learning classifiers trained on preexisting annotated data.
      • Document-level annotations: applications such as review classification or web opinion mining often require corpus-level annotations of subjectivity and polarity.
  • 21. CHAPTER 8. ENTITY DETECTION AND TRACKING
      • Mention detection:
          • Detecting the boundary of a mention and optionally identifying its semantic type (e.g., PERSON or ORGANIZATION) and other attributes (e.g., named, nominal, or pronominal).
          • Closely related to named entity recognition.
          • Mentions: any instances of textual references to objects or abstractions, which can be named (e.g., John Mayor), nominal (e.g., the president), or pronominal (e.g., she, it).
          • Can be formulated as a classification problem by assigning a label to each token in the text.
      • Coreference resolution:
          • Clustering mentions that refer to the same entity into equivalence classes.
          • Machine learning-based approaches learn a model from training data that assigns a score to a pair of mentions, indicating the likelihood that the two mentions refer to the same entity. Mentions are then clustered into entities on the basis of the mention-pair scores.
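The mention-pair approach to coreference can be sketched as a scoring function plus a clustering step. Everything here is illustrative: `toy_score` is a hypothetical stand-in for a learned pairwise model, the 0.5 threshold is arbitrary, and linking every pair above the threshold with union-find is just one simple clustering strategy, not necessarily the one the chapter describes.

```python
def cluster_mentions(mentions, score, threshold=0.5):
    """Group mentions into entities from pairwise scores, using union-find."""
    parent = list(range(len(mentions)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link each mention to every earlier mention it is likely coreferent with.
    for j in range(len(mentions)):
        for i in range(j):
            if score(mentions[i], mentions[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for idx, mention in enumerate(mentions):
        clusters.setdefault(find(idx), []).append(mention)
    return list(clusters.values())


def toy_score(a, b):
    """Hypothetical stand-in for a learned model: 1.0 if the last words match."""
    return 1.0 if a.lower().split()[-1] == b.lower().split()[-1] else 0.0


mentions = ["John Smith", "the president", "Smith", "President"]
print(cluster_mentions(mentions, toy_score))
# [['John Smith', 'Smith'], ['the president', 'President']]
```

Because the links are transitive, one wrong high-scoring pair can merge two entities, which is a known weakness of clustering purely on mention-pair scores.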
  • 22. CHAPTER 9. RELATIONS AND EVENTS
      • Relation extraction systems: systems capable of finding semantic relations among entities.
          • Relation extraction can be considered a multiclass classification problem, with several classes of features including structural, lexical, entity-based, syntactic, and semantic.
      • Relation extraction types:
          • Extracting relations typically associated with lexical ontologies, such as meronymy, hyponymy, and troponymy.
          • Extracting relations similar in nature, such as detecting that verb1 expresses the same concept as verb2 but in a stronger fashion.
          • Finding similarity enablement, that is, detecting that the action expressed by verb1 is a prerequisite for the action expressed by verb2.
          • Identifying general semantic links between potentially heterogeneous entities, such as employment relations between people and companies, cause-of-death relations between diseases and people, or ownership of one entity (such as a company) by another.
  • 23. CHAPTER 9. RELATIONS AND EVENTS
      • Relation types in the National Institute of Standards and Technology (NIST) ACE evaluations:
          • PHYS (physical): a spatial relation denoting that a person is located at or near a facility or a location.
          • PART-WHOLE: a spatial relation denoting that a facility, a location, or a GPE (geopolitical entity) is part of another facility.
          • PER-SOC (personal-social): captures links between people. Relations can be business-related or family-based.
          • ORG-AFF (organization-affiliation): pertains to connections between persons and organizations. A person could be employed by an organization or could be a member of it.
          • GEN-AFF (general-affiliation): citizenship, residence in a country, religious affiliation, and ethnicity.
          • ART (artifact): a relation between a user, inventor, or manufacturer and the artifact itself.
          • METONYMY: a relation between two different aspects of the same underlying entity.
  • 24. CHAPTER 9. RELATIONS AND EVENTS
      • Event: any change of state in the world that is described using natural language text.
      • Event extraction: the use of any algorithm to extract a structured representation of that change of state, crucially including the entities involved.
  • 25. CHAPTER 10. MACHINE TRANSLATION ‫اآللية‬ ‫جمة‬‫الت‬ converting text in one language into another while preserving its meaning. Research started in the 1940s at IBM. Most profound change can be dated back to 1988. Statistical Machine Translation:  Using large corpora of translated texts, typically many millions of words.  Learn the rules of translation from corpora and provide the basis for a decoding algorithm that finds the best translation for a given input sentence Machine translation is being integrated into various applications: crosslingual information retrieval, speech translation, and tools for translators. 10/30/2017 MASHAEL ALDUWAIS 25
  • 26. CHAPTER 10. MACHINE TRANSLATION
      Word Alignment: learning translation rules from a parallel corpus.
        • An unsupervised learning problem.
        • A word-aligned parallel corpus allows the estimation of phrase-based models, tree-based models, and other approaches.
      Evaluation:
        • Human assessment: human judges are asked whether the output constitutes a correct translation. Is it fluent? Is it adequate?
        • Automatic evaluation metrics: these compute a similarity measure between the MT output and the reference translations by counting matches, insertions, and deletions. Evaluation campaigns are held for the metrics themselves, in which metric developers compete for the highest correlation with human judges.
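      The idea of counting matches, insertions, and deletions between MT output and a reference can be sketched with a word-level edit distance. This is a minimal illustration in the spirit of word error rate, not a metric the slides name; the function name, the whitespace tokenization, and the example sentences are assumptions.

```python
def edit_counts(hypothesis: str, reference: str) -> dict:
    """Align MT output with one reference and count word-level edit operations."""
    hyp, ref = hypothesis.split(), reference.split()
    # table[i][j] = (edits, matches, substitutions, insertions, deletions)
    # for the first i reference words against the first j hypothesis words.
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0, 0)
    for j in range(1, len(hyp) + 1):  # extra hypothesis words are insertions
        e, m, s, ins, d = table[0][j - 1]
        table[0][j] = (e + 1, m, s, ins + 1, d)
    for i in range(1, len(ref) + 1):  # missing reference words are deletions
        e, m, s, ins, d = table[i - 1][0]
        table[i][0] = (e + 1, m, s, ins, d + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, m, s, ins, d = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, m + 1, s, ins, d)      # words match
            else:
                diagonal = (e + 1, m, s + 1, ins, d)  # substitution
            e, m, s, ins, d = table[i][j - 1]
            insertion = (e + 1, m, s, ins + 1, d)
            e, m, s, ins, d = table[i - 1][j]
            deletion = (e + 1, m, s, ins, d + 1)
            # Keep the alignment with the fewest edits.
            table[i][j] = min(diagonal, insertion, deletion, key=lambda t: t[0])
    edits, matches, subs, ins, dels = table[-1][-1]
    return {
        "edits": edits,
        "matches": matches,
        "substitutions": subs,
        "insertions": ins,
        "deletions": dels,
        "error_rate": edits / len(ref) if ref else 0.0,
    }


result = edit_counts("the cat sat on mat", "the cat sat on the mat")
```

      In this example the hypothesis is missing one reference word, so the alignment contains five matches and one deletion. Real metrics typically use several references and n-gram statistics rather than a single alignment.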
  • 27. CHAPTER 10. MACHINE TRANSLATION
      Current Research:
        • The development of models that more closely mirror a linguistic understanding of language.
        • The application of novel machine learning methods to the estimation problem of learning translation rules from the data.
        • Attempts to exploit various types of data sources, which are often not in the desired domain or may not even be proper sentence-by-sentence translations at all.
  • 28. CHAPTER 10. MACHINE TRANSLATION
      Linguistic Challenges:
        • Lexical choice: a word sense disambiguation problem. n-gram language models effectively capture local context information that is very useful for making the right lexical choice.
        • Morphology: when translating into morphologically rich languages, it is often not clear from the local context which morphological variant to choose.
        • Word order: languages such as English use word order to define which of the entities mentioned in the sentence is the subject, which are the objects, and what their roles are.
      Future Directions:
        • The estimation of parameter values in MT models.
        • Syntactic models.
        • Using comparable or purely monolingual data instead of parallel data.
        • Integrating statistical machine translation into other information processing applications.
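      To illustrate how an n-gram language model can help with lexical choice, the sketch below scores two candidate translations with a bigram model and picks the one whose local word context is more familiar. The tiny training corpus, the add-one smoothing, and all function names are assumptions made for this example only; production systems train on far larger data with better smoothing.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count bigrams and their left contexts over a list of sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return contexts, bigrams, len(vocab)


def log_prob(sentence, model):
    """Log probability of a sentence under the bigram model with add-one smoothing."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


# Made-up monolingual training data in the target language.
corpus = [
    "he likes strong tea",
    "she drinks strong coffee",
    "a powerful engine",
    "the powerful computer",
]
model = train_bigram_model(corpus)

# Two candidate translations that differ only in one lexical choice.
candidates = ["she drinks strong tea", "she drinks powerful tea"]
best = max(candidates, key=lambda c: log_prob(c, model))
```

      The model prefers "strong tea" because that word pair was seen in the training data, while "powerful tea" was not.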
  • 29. CHAPTER 11. MULTILINGUAL INFORMATION RETRIEVAL
      Importance:
        • Improvements in machine translation (MT) have fostered the development of effective multilingual retrieval systems.
        • The number of non-English Internet users and the amount of non-English content on the Web are growing.
        • The advent of Web 2.0 technologies.
      Crosslingual information retrieval (CLIR):
        • Retrieving documents relevant to a given query in some language (the query language) from a collection of documents in some other language (the collection language).
        • Approaches: translation-based approaches and interlingual document representations.
      Multilingual information retrieval (MLIR):
        • Involves corpora containing documents written in different languages.
        • Requires different index organization and relevance computation strategies than CLIR.
  • 30. CHAPTER 11. MULTILINGUAL INFORMATION RETRIEVAL
      Evaluation:
        • Metrics: relevance assessments, precision, and recall.
        • Evaluation campaigns: Text REtrieval Conference (TREC), Cross-Language Evaluation Forum (CLEF), NII Test Collection for IR Systems (NTCIR), Forum for Information Retrieval Evaluation (FIRE).
      Parallel Corpora: JRC-Acquis, Multext Dataset, Canadian Hansards, Europarl.
      Tools, Software, and Resources:
        • Preprocessing: Content Analysis Toolkit (Tika), Snowball Stemmer, HTML Parser, BananaSplit.
        • IR frameworks: Lucene, Terrier, and Lemur.
        • Evaluation: trec_eval.
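      Precision and recall, listed above as evaluation metrics, can be computed from the set of documents a system retrieved and the set that human relevance assessments marked as relevant. The function name and document identifiers below are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids returned by the system
    relevant:  document ids judged relevant by assessors
    """
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)  # relevant documents that were found
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical query: the system returns four documents, three are relevant overall.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
```

      Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3). Evaluation tools used in the campaigns above report many further measures on ranked lists.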
  • 31. CHAPTER 12. MULTILINGUAL AUTOMATIC SUMMARIZATION
      In multilingual summarization, summarization systems work on texts written in multiple languages.
      Types of summary:
        • An informative summary is a compressed version of the original that covers the most important facts reported in the input text(s) (e.g., the summary of a journal article).
        • An indicative summary covers the topics in the input text without providing further details (e.g., keywords for scientific papers).
        • An evaluative summary gives an opinion on the input text, most often by comparing it to similar documents.
        • An elaborative summary provides more details on parts of a large document, or on a document linked to by the current document, to help navigation through large documents or linked collections such as Wikipedia.
  • 32. CHAPTER 12. MULTILINGUAL AUTOMATIC SUMMARIZATION
      Crosslingual summarization: the source documents are spread out over multiple source languages, and the resulting summary is presented in one (or more) target languages.
        • Requires the integration of multiple source documents coming from different languages.
        • Named entities are often transcribed differently in different languages (a coreference resolution problem).
        • Languages encode number and gender agreement differently; English, for example, lacks grammatical gender (an anaphora resolution problem).
      Evaluation:
        • Extrinsic evaluations measure the usefulness of summaries by how much they help in performing another information-processing task.
        • Intrinsic evaluations measure summary quality directly and can be used at various stages of a summarization development cycle.
  • 33. CHAPTER 12. MULTILINGUAL AUTOMATIC SUMMARIZATION
      Summarization systems are divided into three stages:
        1. Analysis: summarization systems may represent the text in the form of a graph. This may be a linguistically motivated discourse tree or a matrix representation based on sentence-to-sentence similarity.
        2. Transformation: this can be carried out via graph-based algorithms such as PageRank or by machine-learning-based classifiers that learn to classify sentences according to their relevance.
        3. Realization: when producing the summary, multilingual approaches have to face many language-dependent challenges such as tokenization, anaphoric expressions, and discourse structure.
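      The analysis and transformation stages can be sketched together: build a sentence-to-sentence similarity matrix, then run a PageRank-style iteration over it and keep the highest-ranked sentences. This is a simplified illustration under stated assumptions: Jaccard word overlap as the similarity measure, whitespace tokenization, a damping factor of 0.85, and made-up function names and sentences. It is not the specific system described in the book.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sets of words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with PageRank over a sentence similarity graph."""
    n = len(sentences)
    words = [set(s.lower().split()) for s in sentences]
    # Analysis: similarity matrix with a zero diagonal (no self-links).
    similarity = [
        [jaccard(words[i], words[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # Row-normalize; a sentence similar to nothing links uniformly to all.
    transition = []
    for row in similarity:
        total = sum(row)
        transition.append([v / total for v in row] if total > 0 else [1.0 / n] * n)
    # Transformation: power iteration.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * transition[j][i] for j in range(n))
            for i in range(n)
        ]
    return scores


def summarize(sentences, k=1):
    """Return the k best-ranked sentences in their original order."""
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


document = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stock prices fell sharply today",
]
scores = rank_sentences(document)
summary = summarize(document, k=2)
```

      The third sentence shares no words with the others, so it receives the lowest score and is left out of the two-sentence summary. The realization stage, where the language-dependent challenges arise, is not covered by this sketch.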
  • 34. CHAPTER 13. QUESTION ANSWERING
      QA: retrieve answers to user questions from information sources.
      A QA system follows a pipeline layout consisting of components for:
        1. Transforming questions into search engine queries.
        2. Retrieving related text using existing IR systems.
        3. Extracting and scoring candidate answers.
      Questions are classified with regard to their expected answer:
        • Factoid questions, which ask for concise answers such as named entities (e.g., What is the capital of Turkey?).
        • List questions, which seek lists of such factoid answers (e.g., Which countries are in NATO?).
        • Attempts have been made to tackle questions with complex answers, such as definitional questions requesting information on a given topic, including biographies for people (e.g., Who is Albert Einstein?).
        • Relationship questions (e.g., What is the relationship between the Taliban and Al-Qaeda?).
        • Opinion questions (e.g., What do people like about IKEA?).
  • 35. CHAPTER 13. QUESTION ANSWERING
      [Figure-only slide; no text in the transcript.]
  • 36. CHAPTER 13. QUESTION ANSWERING
      Future Directions:
        • Reliable confidence estimates for the top answers.
        • Crosslingual QA systems that translate answers back to the language in which the question was asked.
        • General-purpose QA algorithms and techniques that can be adapted rapidly to new tasks and achieve high performance across different domains.
        • QA systems that provide complex answers.
        • How and why questions seeking explanations or justifications.
        • Yes-no questions requiring a system to determine whether the combined knowledge in the available information sources entails a hypothesis.
        • Deeper NLP techniques to find answers in sources that lack semantic redundancy.
        • QA systems that support user interactions and information sources in different languages.
  • 37. CHAPTER 14. DISTILLATION
      Distillation queries can be complex and require complex answers.
        • For example: Describe the reactions of <COUNTRY> to <EVENT>.
      The Rosetta Consortium Distillation System: built as part of the GALE program. The system is designed to answer distillation queries run against a large corpus composed of text documents and audio recordings in multiple languages: English, Arabic, and Mandarin. Text sources are assumed to belong to two main categories: structured and unstructured.
      Three Stages:
        • Document preparation: recordings are transcribed, and text and transcripts in foreign languages are translated into English. Tokenization, part-of-speech (POS) tagging, parsing, mention detection, and semantic role labeling are then performed using maximum entropy (MaxEnt) models.
        • Indexing: documents are indexed using an open source search engine, Lucene.
        • Query answering: takes a GALE-style query as input and returns a list of main snippets with associated supporting snippets and citations, sorted in decreasing order of relevance to the query.
      The query-answering architecture consists of five stages: query preprocessing, document retrieval, snippet filtering, snippet processing, and planning.
  • 38. CHAPTER 14. DISTILLATION
      Challenges:
        • The lack of publicly available corpora for measuring the progress of the field.
        • The difficulty and cost of evaluating the outputs of distillation systems, due to the lack of automatic metrics.
  • 39. CHAPTER 15. SPOKEN DIALOG SYSTEMS
      A spoken dialog system is a complex machine that manages goal-oriented user interactions.
      Functional architecture:
        • Speech recognition and understanding module: assigns one or more semantic tags to each speech input.
        • Speech generation module: uses a rule-based grammar that encodes both the syntax and the semantics of possible utterances.
        • Dialog manager: uses a finite-state machine approach, explicitly encoding the whole interaction into what is generally known as a call-flow.
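      A call-flow can be pictured as a table of states, each with a prompt and a set of transitions keyed by the semantic tag that the understanding module assigns to the user's input. The sketch below is a toy banking example; the state names, tags, prompts, and the rule that an unexpected tag keeps the dialog in the same state are all assumptions made for illustration.

```python
# Hypothetical call-flow: state -> prompt and allowed transitions.
CALL_FLOW = {
    "greeting": {
        "prompt": "How can I help you?",
        "next": {"check_balance": "ask_account", "goodbye": "end"},
    },
    "ask_account": {
        "prompt": "Please say your account number.",
        "next": {"account_number": "report_balance"},
    },
    "report_balance": {
        "prompt": "Here is your balance. Anything else?",
        "next": {"check_balance": "ask_account", "goodbye": "end"},
    },
    "end": {"prompt": "Goodbye.", "next": {}},
}


def step(state, semantic_tag):
    """Move to the next state; stay put (re-prompt) on an unexpected tag."""
    return CALL_FLOW[state]["next"].get(semantic_tag, state)


def run_dialog(semantic_tags, state="greeting"):
    """Feed a sequence of semantic tags through the call-flow and return the visited states."""
    visited = [state]
    for tag in semantic_tags:
        state = step(state, tag)
        visited.append(state)
    return visited


path = run_dialog(["check_balance", "account_number", "goodbye"])
```

      Because every transition is written out explicitly, the designer controls the whole interaction, which is the defining property of the finite-state approach named on the slide.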
  • 40. CHAPTER 16. COMBINING NATURAL LANGUAGE PROCESSING ENGINES
      Many engines now attain accuracy sufficient to combine them for more complex tasks than were previously possible.
        • Example applications: semantic search, enterprise reporting and other business intelligence, question answering, medical-abstract mining, crosslingual search, audio/video search and cataloging, speech-to-speech translation, and foreign broadcast news analysis.
        • Applications like these share many common engines, such as speaker identification, speech-to-text, text tokenization, grammatical parsing, named entity detection, coreference analysis, part-of-speech labeling, and translation.
      Aggregation poses several challenges: heterogeneous computing environments, remote operation, data formats, and exception handling.
  • 41. CHAPTER 16. COMBINING NATURAL LANGUAGE PROCESSING ENGINES
      Desired Attributes of Architectures for Aggregating Speech and NLP Engines:
        • Flexible, distributed componentization.
        • Computational efficiency.
        • Data-manipulation capabilities.
        • Robust processing.
      Frameworks that support integration into more complex applications:
        • UIMA: Unstructured Information Management Architecture
        • GATE: General Architecture for Text Engineering
        • InfoSphere Streams
  • 42. THANK YOU FOR YOUR TIME AND ATTENTION