2. OVERVIEW
Multilingual Natural Language Processing
Applications: From Theory to Practice
Edited by Daniel M. Bikel and Imed Zitouni
IBM Press, © 2012
Two Parts:
I. Theory: 7 chapters
II. Practice: 9 chapters
10/30/2017 MASHAEL ALDUWAIS 2
3. ABOUT THE AUTHORS
Daniel M. Bikel
Current Position: Research Scientist @ Google
Previous: LinkedIn, Google, IBM
Education: Harvard University, University of
Pennsylvania
Interest: Syntax/parsing, information extraction,
multilingual systems, NLP systems design,
machine learning toolkits, language modeling.
Imed Zitouni
Current Position: Principal Researcher @ Microsoft
Previous: IBM, Bell Labs, DIALOCA
Education: Université Henri Poincaré, Nancy
Interest: natural language processing, language
modeling, spoken dialog systems, speech
recognition, and machine learning.
4. BOOK CONTENT
Part I: Theory
Chapter 1 Finding the Structure of Words
Chapter 2 Finding the Structure of Documents
Chapter 3 Syntax
Chapter 4 Semantic Parsing
Chapter 5 Language Modeling
Chapter 6 Recognizing Textual Entailment
Chapter 7 Multilingual Sentiment and
Subjectivity Analysis
Part II: Practice
Chapter 8 Entity Detection and Tracking
Chapter 9 Relations and Events
Chapter 10 Machine Translation
Chapter 11 Multilingual Information Retrieval
Chapter 12 Multilingual Automatic
Summarization
Chapter 13 Question Answering
Chapter 14 Distillation
Chapter 15 Spoken Dialog Systems
Chapter 16 Combining Natural Language
Processing Engines
5. CHAPTER 1. FINDING THE STRUCTURE OF WORDS
(Arabic title: Word Structure)
Morphological parsing: discovery of word structure
Tokens: words
In Arabic, certain tokens (called clitics) are concatenated in writing with the preceding or the following ones,
possibly changing their forms as well.
Lexemes: the concept behind a linguistic form and the set of alternatives that can express it.
Lexical categories of verbs, nouns, adjectives, conjunctions, particles, or other parts of speech.
Turning singular into plural
Morphemes: structural components of word form (segments or morphs). Ex: dis-agree-ment-s
Typology: divides languages into groups by characterizing the prevalent morphological
phenomena in those languages. Ex: Isolating, Synthetic, Agglutinative, Fusional.
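The morpheme example above (dis-agree-ment-s) can be illustrated with a toy affix-stripping segmenter; the prefix and suffix inventories below are illustrative stand-ins, not a real morphological lexicon:

```python
# Toy morphological segmenter: greedily strips known prefixes and
# suffixes. The affix lists are illustrative, not a real lexicon.
PREFIXES = ["dis", "un", "re"]
SUFFIXES = ["s", "ment", "ing", "ed"]

def segment(word):
    """Peel affixes off both ends, leaving the stem in the middle."""
    prefixes, suffixes = [], []
    changed = True
    while changed:
        changed = False
        for p in PREFIXES:
            if word.startswith(p) and len(word) > len(p) + 2:
                prefixes.append(p)
                word = word[len(p):]
                changed = True
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s) + 2:
                suffixes.append(s)
                word = word[:-len(s)]
                changed = True
    return prefixes + [word] + list(reversed(suffixes))

print(segment("disagreements"))  # ['dis', 'agree', 'ment', 's']
```

Real morphological analyzers use a lexicon and finite-state rules rather than greedy stripping, which fails on forms like "sing" or "red".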
6. CHAPTER 1. FINDING THE STRUCTURE OF WORDS
(Arabic title: Word Structure)
Issues and Challenges:
Irregularity: word forms that are not described by a prototypical linguistic model.
Ambiguity: word forms that can be understood in multiple ways out of the context of their discourse.
Productivity: Is the inventory of words in a language finite, or is it unlimited?
Morphological Models:
Dictionary Lookup
Finite-State Morphology
Unification-Based Morphology
Functional Morphology
7. CHAPTER 2. FINDING THE STRUCTURE OF DOCUMENTS
(Arabic title: Text Structure)
Some (NLP) tasks use sentences as the basic processing unit:
Parsing, machine translation, automatic speech recognition (ASR) systems, and semantic role
labeling
Sentence boundary detection (sentence segmentation): Automatically segmenting a
sequence of word tokens into sentence units.
Topic segmentation (discourse or text segmentation): Automatically dividing a stream
of text or speech into topically homogeneous blocks.
A boundary classification problem:
Depending on the type of input (i.e., text versus speech), different features may be used.
Performance has improved by exploiting very high-dimensional feature sets.
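As a sketch of boundary classification over text input, a rule-based classifier can decide, for each punctuation-bearing token, whether it really ends a sentence using simple local features (the abbreviation list here is illustrative):

```python
# Sketch of sentence boundary detection as boundary classification:
# each '.', '?' or '!' token is a candidate boundary, decided by
# simple local features. The abbreviation list is illustrative.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def is_boundary(token, next_token):
    """Classify a punctuation-bearing token as sentence-final or not."""
    if not token.endswith((".", "?", "!")):
        return False
    if token.lower() in ABBREVIATIONS:          # known abbreviation
        return False
    if next_token is not None and next_token[:1].islower():
        return False                            # next word not capitalized
    return True

def sentence_split(tokens):
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if is_boundary(tok, nxt):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(sentence_split("Dr. Smith arrived late. He left early.".split()))
```

Statistical systems replace these hand-written rules with a classifier over the same kinds of features (and prosodic features for speech input).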
8. CHAPTER 3. SYNTAX
(Arabic title: Syntax)
Syntax parsing (syntactic analysis): discovering the various predicate-argument
dependencies that may exist in a sentence.
Parses natural language text to produce syntactic trees.
Recursively partitions the words in the sentence into phrases such as noun phrases and verb phrases.
Used in text-to-speech, machine translation, summarization, and paraphrasing applications.
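A phrase-structure tree of the kind such parsers produce can be represented as nested (label, children) pairs; this small sketch just recovers the sentence (the tree's yield) by recursion:

```python
# A phrase-structure parse represented as nested (label, children) pairs,
# the kind of tree a treebank-trained parser would output; this sketch
# recovers the sentence (the tree's yield) by recursion.
tree = ("S",
        [("NP", [("PRP", ["She"])]),
         ("VP", [("VBD", ["saw"]),
                 ("NP", [("DT", ["the"]), ("NN", ["dog"])])])])

def leaves(node):
    """Recursively collect the words at the tree's leaves."""
    label, children = node
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words

print(" ".join(leaves(tree)))  # She saw the dog
```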
9. CHAPTER 3. SYNTAX
(Arabic title: Syntax)
Treebanks:
A collection of sentences where each sentence is paired with a complete syntactic analysis
(an annotated text corpus).
The syntactic analysis for each sentence has been judged by a human expert.
A style book or set of annotation guidelines is typically written before the annotation process
to ensure a consistent scheme of annotation throughout the treebank.
Two main approaches to construct treebanks: dependency graphs and phrase structure.
Challenges:
Ambiguity: choosing from an exponentially large number of alternative analyses.
Language issues: tokenization, case, encoding, word segmentation and morphology.
10. CHAPTER 4. SEMANTIC PARSING
(Arabic title: Semantic Analysis)
Semantic parsing: identifying meaning chunks contained in an information signal in
an attempt to transform it into some data structure that can be manipulated by a
computer to perform higher level tasks.
Two types of representations:
Deep semantic parsing: taking natural language input and transforming it into a meaning
representation. Domain-dependent.
Problem: reusability of the representation across domains is very limited.
Shallow semantic parsing: deals with the four main aspects of language: structural ambiguity,
word sense, entity and event recognition, and predicate argument structure recognition.
General-purpose.
Problem: difficult to construct a general-purpose ontology.
11. CHAPTER 4. SEMANTIC PARSING
(Arabic title: Semantic Analysis)
A semantic theory should be able to:
1. Explain sentences having ambiguous meanings. For example, it should account for the
fact that the word bill in the sentence The bill is large is ambiguous in the sense that it
could represent money or the beak of a bird.
2. Resolve the ambiguities of words in context. For example, if the same sentence is
extended to form The bill is large but need not be paid, then the theory should be able
to disambiguate the monetary meaning of bill.
3. Identify meaningless but syntactically well-formed sentences, such as the famous
example by Chomsky: Colorless green ideas sleep furiously.
4. Identify syntactically or transformationally unrelated paraphrases of a concept
having the same semantic content.
12. CHAPTER 4. SEMANTIC PARSING
(Arabic title: Semantic Analysis)
Semantic parsing can be considered as part of semantic interpretation.
Requirements for Semantic Interpretation:
Structural Ambiguity: transforming a sentence into its underlying syntactic representation.
Word Sense: the same word type is used in different contexts.
EX: She nailed the loose arm of the chair with a hammer. VS. She went to the beauty salon to get a manicure.
Entity and Event Resolution: named entity recognition and coreference resolution.
Predicate-Argument Structure: identifying the participants of the entities in these events.
Can be defined as the identification of who did what to whom, when, where, why, and how
Meaning Representation: build a semantic representation that can then be manipulated by
algorithms to various application ends (called deep representation). A domain-specific
approach.
13. CHAPTER 5. LANGUAGE MODELING
(Arabic title: Language Modeling)
A statistical model that assigns a probability to a sentence.
Specifies the a priori probability of a particular word sequence in the language of interest.
Given an alphabet or inventory of units Σ and a sequence W = w1w2 ...wt ∈ Σ∗, a language
model can be used to compute the probability of W based on parameters previously
estimated from a training set.
LM is usually combined in speech recognition, machine translation.
A standard tool in information retrieval, spell correction, summarization, authorship
identification, and document classification.
14. CHAPTER 5. LANGUAGE MODELING
(Arabic title: Language Modeling)
n-Gram Models: assume that all previous words except for the (n − 1) words directly preceding
the current word are irrelevant for predicting the current word or, alternatively, that
they are equivalent to it.
Evaluation criteria: coverage rate, perplexity.
Language Model Adaptation: designing and tuning a language model such that it
performs well on a new test set for which little equivalent training data is available.
Methods: Mixture language models, topic-dependent language model, trigger models.
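The n-gram idea and the perplexity criterion can be sketched with a bigram model and add-one smoothing on a toy corpus (real systems use larger n and better smoothing such as Kneser-Ney):

```python
import math
from collections import Counter

# Bigram language model with add-one smoothing on a toy corpus; a sketch
# only (real systems use larger n and more refined smoothing).
corpus = [["<s>", "the", "cat", "sat", "</s>"],
          ["<s>", "the", "dog", "sat", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
V = len(unigrams)  # vocabulary size, used for add-one smoothing

def prob(prev, word):
    """Smoothed bigram probability P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

def perplexity(sentence):
    """Inverse probability of the sentence, normalized by length."""
    logp = sum(math.log(prob(sentence[i], sentence[i + 1]))
               for i in range(len(sentence) - 1))
    return math.exp(-logp / (len(sentence) - 1))

print(perplexity(["<s>", "the", "cat", "sat", "</s>"]))
```

A sentence made of bigrams seen in training gets a lower perplexity than a scrambled one, which is exactly what the evaluation criterion measures.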
15. CHAPTER 5. LANGUAGE MODELING
(Arabic title: Language Modeling)
Types of Language Models: other than n-gram language model
Class-Based Language Models
Variable-Length Language Models
Discriminative Language Models
Syntax-Based Language Models
MaxEnt Language Models
Factored Language Models
Bayesian Topic-Based Language Models
Neural Network Language Models
16. CHAPTER 5. LANGUAGE MODELING
(Arabic title: Language Modeling)
Language Modeling Problems:
Language-Specific Modeling Problems:
In Arabic, decomposition may be required. Integrating morphological information into the language
model is helpful for modeling dialectal Arabic.
Spoken versus Written Languages:
Many of the world's roughly 6,900 languages are spoken languages or dialects without usable text
resources: they either lack a writing system entirely or lack text in a standardized written form.
In the first case, the only way of obtaining language model training data is to manually transcribe the
language or dialect. This is a costly and time-consuming process because it involves (i) the
development of a writing standard, (ii) training native speakers to use the writing system consistently
and accurately, and (iii) the actual transcription effort. In the second case, those text resources that
can be obtained for the language in question (e.g., from the web) will need to be normalized, which
can also be a laborious process.
17. CHAPTER 6. RECOGNIZING TEXTUAL ENTAILMENT
(Arabic title: Recognizing Textual Entailment)
Textual entailment is defined as a directional relationship between pairs of text
expressions, denoted by T, the entailing text, and H, the entailed hypothesis. We say
that T entails H if the meaning of H can be inferred from the meaning of T, as would
typically be interpreted by people.
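A naive baseline for the T/H decision, often used as a starting point in RTE work, checks lexical overlap: predict entailment when most content words of H also appear in T. The stopword list and threshold below are arbitrary choices for illustration:

```python
# Naive lexical-overlap entailment baseline: T entails H if most content
# words of H also occur in T. Stopwords and threshold are illustrative.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "by", "of", "in", "to"}

def overlap_entails(t, h, threshold=0.8):
    t_words = {w.lower().strip(".,") for w in t.split()} - STOPWORDS
    h_words = {w.lower().strip(".,") for w in h.split()} - STOPWORDS
    if not h_words:
        return True
    return len(h_words & t_words) / len(h_words) >= threshold

t = "Google acquired YouTube in 2006 for $1.65 billion."
h = "YouTube was acquired by Google."
print(overlap_entails(t, h))  # True
```

The baseline ignores word order and negation ("Google did not acquire…" would still pass), which is why real systems add syntactic and semantic checks.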
Applications of Textual Entailment Solutions:
Summarization.
Exhaustive Search for Relations
Question Answering
Machine Translation
18. CHAPTER 7. MULTILINGUAL SENTIMENT AND
SUBJECTIVITY ANALYSIS
(Arabic title: Multilingual Sentiment and Subjectivity Analysis)
Subjectivity classification: labels text as either subjective or objective.
Sentiment classification: classifies subjective text as either positive, negative, or
neutral.
Used in automatic expressive text-to-speech synthesis, tracking sentiment timelines in online
forums and news, and mining opinions from product reviews.
Tools: two main types of tools:
I. Rule-based systems: relying on manually or semi-automatically constructed lexicons. Ex:
OpinionFinder.
II. Machine learning classifiers: trained on opinion-annotated corpora (e.g., the classifier of
Wiebe, Bruce, and O'Hara).
Corpora: subjectivity and sentiment annotated corpora used to train automatic
classifiers, and as resources to extract opinion mining lexicons.
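A minimal rule-based classifier in the spirit of lexicon-driven tools such as OpinionFinder can be sketched as follows; the tiny lexicon here is illustrative only:

```python
# Minimal lexicon-based sentiment classifier (rule-based approach);
# the tiny positive/negative word lists are illustrative only.
POSITIVE = {"good", "great", "excellent", "love", "nice"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "poor"}

def classify(text):
    words = [w.lower().strip(".,!?") for w in text.split()]
    # Booleans act as 0/1, so this sums +1 per positive, -1 per negative.
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("A great phone with excellent battery life."))  # positive
```

Machine-learning classifiers replace the fixed lexicon with weights learned from opinion-annotated corpora.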
19. CHAPTER 7. MULTILINGUAL SENTIMENT AND
SUBJECTIVITY ANALYSIS
(Arabic title: Multilingual Sentiment and Subjectivity Analysis)
Lexicons:
OpinionFinder: contains 6,856 unique entries, out of which 990 are multiword expressions.
Each entry is also associated with a polarity label, indicating whether the corresponding
word or phrase is positive, negative, or neutral.
General Inquirer: a dictionary of about 10,000 words grouped into about 180 categories,
which have been widely used for content analysis. It includes semantic classes (e.g., animate,
human), verb classes (e.g., negatives, becoming verbs), cognitive orientation classes (e.g.,
causal, knowing, perception), and others. Two of the largest categories in the General
Inquirer are the valence classes, which form a lexicon of 1,915 positive words and 2,291
negative words.
SentiWordNet: Built on top of WordNet, which assigns each synset in WordNet with a score
triplet (positive, negative, and objective), indicating the strength of each of these three
properties for the words in the synset.
20. CHAPTER 7. MULTILINGUAL SENTIMENT AND
SUBJECTIVITY ANALYSIS
(Arabic title: Multilingual Sentiment and Subjectivity Analysis)
Word- and Phrase-Level Annotations: three main directions:
i. manual annotations, which involve human judgment of selected words and phrases,
ii. automatic annotations based on knowledge sources such as dictionaries,
iii. automatic annotations based on information derived from corpora.
Sentence-Level Annotations: corpus annotations are often required either as an end goal for
various text-processing applications (e.g., mining opinions from the Web, classification of
reviews into positive and negative), or as an intermediate step toward building automatic
subjectivity and sentiment classifiers. Two methods:
i. dictionary-based, consisting of rule-based classifiers relying on lexicons,
ii. corpus-based, consisting of machine learning classifiers trained on preexisting annotated
data.
Document-Level Annotations: applications, such as review classification or web opinion
mining, often require corpus-level annotations of subjectivity and polarity.
21. CHAPTER 8. ENTITY DETECTION AND TRACKING
(Arabic title: Entity Detection and Tracking)
Mention detection:
Detecting the boundary of a mention and optionally identifying the semantic type (e.g.,
PERSON or ORGANIZATION) and other attributes (e.g., named, nominal, or pronominal).
Close to named entity recognition.
Mentions: any instances of textual references to objects or abstractions, which can be either
named (e.g., John Mayor), nominal (e.g., the president), or pronominal (e.g., she, it).
Can be formulated as a classification problem by assigning a label to each token in the text.
Coreference resolution:
Clustering mentions referring to the same entity into equivalence classes.
Machine learning-based approaches: learn a model from training data that assigns a score
to a pair of mentions indicating the likelihood that the two mentions refer to the same entity.
Mentions are then clustered into entities on the basis of mention-pair scores.
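The mention-pair approach can be sketched as: score every pair with a (here, stubbed) model and merge pairs above a threshold into entity clusters with union-find:

```python
# Sketch of coreference resolution from mention-pair scores: a trained
# model would supply score(m1, m2); the stub here just matches strings.
# Pairs scoring above a threshold are merged into entities via union-find.
def score(m1, m2):
    """Stand-in for a learned mention-pair scoring model."""
    return 1.0 if m1.lower() == m2.lower() else 0.0

def cluster(mentions, threshold=0.5):
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            if score(mentions[i], mentions[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i, m in enumerate(mentions):
        clusters.setdefault(find(i), []).append(m)
    return list(clusters.values())

print(cluster(["Obama", "the president", "obama", "Microsoft"]))
```

A real model would also score nominal and pronominal mentions ("the president" ↔ "Obama") using contextual and semantic features, not just string match.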
22. CHAPTER 9. RELATIONS AND EVENTS
(Arabic title: Relations and Events)
Relation Extraction Systems: systems capable of finding semantic relations among entities.
Relation extraction can be considered as multiclass classification problem, with several classes of
features including structural, lexical, entity-based, syntactic, and semantic.
Relation Extraction Types:
Extracting relations typically associated with lexical ontologies, such as meronymy, hyponymy, and
troponymy;
Extracting relations of similarity, such as detecting that verb1 expresses the same concept as
verb2 but in a stronger fashion;
Finding enablement, that is, detecting that the action expressed by verb1 is a
prerequisite for the action expressed by verb2; and
Identifying general semantic links between potentially heterogeneous entities, such as employment
relations between people and companies, cause of death relations between diseases and
people, or ownership of one entity (such as a company) by another.
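The multiclass formulation needs per-pair features; this sketch extracts a few lexical and structural features for two entity mentions (the feature names, spans, and helper are illustrative — real systems add entity types, parse-tree paths, and semantic features):

```python
# Sketch of per-pair feature extraction for relation classification;
# feature names and spans are illustrative (real systems also use
# entity-based, syntactic, and semantic features).
def relation_features(tokens, e1_span, e2_span):
    """e1_span and e2_span are (start, end) token indices, e1 before e2."""
    between = tokens[e1_span[1]:e2_span[0]]
    return {
        "e1_head": tokens[e1_span[1] - 1].lower(),              # lexical
        "e2_head": tokens[e2_span[1] - 1].lower(),              # lexical
        "words_between": "_".join(w.lower() for w in between),  # lexical
        "num_between": len(between),                            # structural
    }

print(relation_features("John works for IBM".split(), (0, 1), (3, 4)))
```

A multiclass classifier (e.g., MaxEnt or an SVM) trained on such feature dictionaries then predicts the relation label, here plausibly an employment relation.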
23. CHAPTER 9. RELATIONS AND EVENTS
(Arabic title: Relations and Events)
Relation types defined in the National Institute of Standards and Technology (NIST) ACE evaluations:
PHYS (physical): A spatial relation denoting that a person is located at or near a facility or a
location.
PART-WHOLE: A relation denoting that a facility, a location, or a GPE (geopolitical entity) is part of
another such entity.
PER-SOC (personal-social): Personal-social relations capture links between people; relations can
be business related or family based.
ORG-AFF (organization-affiliation): This type of relation pertains to connections between persons
and organizations. A person could be employed by an organization or could be a member.
GEN-AFF (general-affiliation): citizenship, residence in a country, religious affiliation, and
ethnicity.
ART (artifact): A relation between a user, inventor, or manufacturer and the artifact itself.
METONYMY: A relation between two different aspects of the same underlying entity.
24. CHAPTER 9. RELATIONS AND EVENTS
(Arabic title: Relations and Events)
Event: any change of state in the world that is described using natural
language text.
Event extraction: the use of an algorithm to extract a structured representation of
that change of state, crucially including the entities involved.
25. CHAPTER 10. MACHINE TRANSLATION
(Arabic title: Machine Translation)
Machine translation: converting text in one language into another while preserving its meaning.
Research started in the 1940s. The most profound change can be dated back to
1988, when statistical approaches were introduced at IBM.
Statistical Machine Translation:
Using large corpora of translated texts, typically many millions of words.
Learn the rules of translation from corpora and provide the basis for a decoding algorithm
that finds the best translation for a given input sentence
Machine translation is being integrated into various applications: crosslingual
information retrieval, speech translation, and tools for translators.
26. CHAPTER 10. MACHINE TRANSLATION
(Arabic title: Machine Translation)
Word Alignment: Learning translation rules from a parallel corpus.
Unsupervised learning problem.
A word-aligned parallel corpus allows the estimation of phrase-based and tree-based
models and other approaches.
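Unsupervised word alignment can be illustrated with IBM Model 1 trained by EM on a toy parallel corpus; this is a compact sketch of the standard algorithm (real corpora have millions of sentence pairs, and the full model also allows NULL alignments):

```python
from collections import defaultdict

# Compact sketch of IBM Model 1 word-alignment training with EM on a
# toy English-German parallel corpus.
corpus = [("the house".split(), "das haus".split()),
          ("the book".split(), "das buch".split()),
          ("a book".split(), "ein buch".split())]

t = defaultdict(lambda: 0.25)  # t(e|f), initialized uniformly

for _ in range(10):  # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for e_sent, f_sent in corpus:  # E-step: collect expected counts
        for e in e_sent:
            z = sum(t[(e, f)] for f in f_sent)
            for f in f_sent:
                c = t[(e, f)] / z
                count[(e, f)] += c
                total[f] += c
    for (e, f) in count:  # M-step: re-estimate t(e|f)
        t[(e, f)] = count[(e, f)] / total[f]

print(round(t[("house", "haus")], 2))
```

After a few iterations the probability mass concentrates on the correct pairs (the ↔ das, house ↔ haus, book ↔ buch), even though no alignments were given.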
Evaluation:
Human Assessment: ask human judges if the output constitutes a correct translation. Is it
fluent? Is the translation adequate?
Automatic Evaluation Metrics: evaluation campaigns for evaluation metrics, where different
metric developers compete for the highest correlation with human judges. These metrics run
similarity measures between the MT output and the reference translations, counting matches,
insertions, and deletions.
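The match/insertion/deletion counting is typically done with a word-level edit-distance dynamic program, the core of metrics such as WER and TER:

```python
# Word-level edit distance between an MT output and a reference
# translation; the same dynamic program (counting matches, substitutions,
# insertions, and deletions) underlies metrics such as WER and TER.
def word_edit_distance(hyp, ref):
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[m][n]

hyp = "the cat sat on mat".split()
ref = "the cat sat on the mat".split()
print(word_edit_distance(hyp, ref))  # 1: one word missing from the output
```

BLEU, by contrast, counts n-gram matches against one or more references rather than edit operations.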
27. CHAPTER 10. MACHINE TRANSLATION
(Arabic title: Machine Translation)
Current Research:
The development of models that more closely mirror linguistic understanding of language,
The application of novel machine learning methods to the estimation problem of learning
translation rules from the data, and
The attempts to exploit various types of data sources, which are often not in the desired
domain or may not be even proper sentence-by-sentence translations at all.
28. CHAPTER 10. MACHINE TRANSLATION
(Arabic title: Machine Translation)
Linguistic Challenges:
Lexical Choice: word sense disambiguation; n-gram language models try to capture
local context information that is very useful for making the right lexical choice.
Morphology: when translating into morphologically rich languages, it is often not clear from
the local context which morphological variant to choose.
Word Order: To define which of the entities mentioned in the sentence is the subject and
which are the objects and what their roles are, languages such as English use word order.
Future Directions:
The estimations of parameter values in MT models.
Syntactic models
Using comparable or purely monolingual data instead of parallel data.
Integrating statistical machine translation into other information processing applications.
29. CHAPTER 11. MULTILINGUAL INFORMATION RETRIEVAL
(Arabic title: Multilingual Information Retrieval)
Importance:
Improvements in machine translation (MT) have fostered the development of effective multilingual
retrieval systems.
The growing number of non-English Internet users and non-English content on the Web.
Advent of Web 2.0 technologies.
Crosslingual information retrieval (CLIR):
Retrieving documents relevant to a given query in some language (query language) from a
collection of documents in some other language (collection language).
Approaches: Translation-Based Approaches, Inter-lingual Document Representations.
Multilingual information retrieval (MLIR):
Involves corpora containing documents written in different languages.
MLIR requires different index organization and relevance computation strategies than CLIR.
30. CHAPTER 11. MULTILINGUAL INFORMATION RETRIEVAL
(Arabic title: Multilingual Information Retrieval)
Evaluation:
Metrics: Relevance Assessments, precision and recall.
Evaluation Campaigns: Text REtrieval Conference (TREC), Crosslingual Evaluation Forum
(CLEF), NII Test Collection for IR Systems (NTCIR), Forum for Information Retrieval Evaluation
(FIRE).
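Set-based precision and recall, the core metrics used in these campaigns, can be computed directly from relevance assessments (the document IDs below are illustrative):

```python
# Set-based precision and recall for a retrieval run against relevance
# assessments; document IDs are illustrative.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(["d1", "d3", "d5", "d7"], ["d1", "d2", "d3"])
print(p, r)  # 0.5 and roughly 0.67
```

Ranked evaluation (mean average precision, as reported by trec_eval) extends this by computing precision at each relevant document's rank.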
Parallel Corpora: JRC-Acquis, Multext Dataset, Canadian Hansards, Europarl.
Tools, Software, and Resources:
Preprocessing: Content Analysis Toolkit (Tika), Snowball Stemmer, HTML Parser, BananaSplit.
IR Frameworks: Lucene, Terrier and Lemur.
Evaluation: TREC eval.
31. CHAPTER 12. MULTILINGUAL AUTOMATIC
SUMMARIZATION
(Arabic title: Multilingual Automatic Summarization)
In multilingual summarization, texts written in multiple languages are used by
summarization systems.
Types of summary:
An informative summary is a compressed version of the original covering the most important
facts reported in the input text(s) (e.g., the summary of a journal article).
An indicative summary covers topics in the input text without providing further details (e.g.,
keywords for scientific papers).
An evaluative summary gives an opinion on the input text most often by comparing it to
similar documents.
An elaborative summary can provide more details of parts of a large document or the
document linked to by the current document to help navigation through large documents or
linked collections such as Wikipedia.
32. CHAPTER 12. MULTILINGUAL AUTOMATIC
SUMMARIZATION
(Arabic title: Multilingual Automatic Summarization)
Crosslingual summarization: the input documents are spread over multiple source languages, and the
resulting summary is presented in one (or more) target languages.
Requires the integration of multiple source documents coming from different languages
Named entities are often transcribed differently in different languages (coreference
resolution)
Languages encode number and gender agreement differently; English, for example, lacks grammatical
gender (anaphora resolution).
Evaluation:
Extrinsic evaluations measure the usefulness of summaries by measuring how much they can
help in performing another information-processing task.
Intrinsic evaluations measure and reflect summary quality and can be used in various stages
in a summarization development cycle.
33. CHAPTER 12. MULTILINGUAL
AUTOMATIC SUMMARIZATION
(Arabic title: Multilingual Automatic Summarization)
Summarization systems are divided into three stages:
1. For the analysis stage, summarization systems may represent the text in the form of a graph.
This may be a linguistically motivated discourse tree or a matrix representation based on
sentence-to-sentence similarity.
2. The transformation process can be carried out via graph-based algorithms such as PageRank
or by machine learning-based classifiers that learn to classify sentences according to their
relevancy.
3. For the realization of the summary, multilingual approaches have to face many
language-dependent challenges such as tokenization, anaphoric expressions, and discourse
structure.
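Stages 1 and 2 can be sketched as a TextRank-style ranker: build a sentence graph from word-overlap similarities and run a PageRank-style iteration to score sentences (a minimal sketch; real systems use better similarity functions and language-aware preprocessing):

```python
# TextRank-style extractive ranking: sentences are graph nodes, word
# overlap gives edge weights, and a PageRank-style power iteration
# produces the sentence scores.
def similarity(s1, s2):
    """Jaccard word overlap between two sentences."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / (len(w1 | w2) or 1)

def rank_sentences(sentences, damping=0.85, iters=30):
    n = len(sentences)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sim[i][j] = similarity(sentences[i], sentences[j])
    scores = [1.0 / n] * n
    for _ in range(iters):  # power iteration, PageRank-style
        new = []
        for i in range(n):
            s = sum(sim[j][i] / sum(sim[j]) * scores[j]
                    for j in range(n) if sim[j][i] > 0)
            new.append((1 - damping) / n + damping * s)
        scores = new
    return sorted(zip(scores, sentences), reverse=True)

ranked = rank_sentences(["the cat sat on the mat",
                         "the cat chased the mouse",
                         "quantum physics is hard"])
print(ranked[0][1])  # the highest-scoring sentence
```

Sentences that overlap with many others accumulate score, while the off-topic sentence ends up ranked last; taking the top-k sentences yields the extractive summary.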
34. CHAPTER 13. QUESTION ANSWERING
(Arabic title: Question Answering)
QA: Retrieve answers to user questions from information sources.
Follows a pipeline layout consisting of components for
1. Transforming questions into search engine queries
2. Retrieving related text using existing IR systems,
3. Extracting and scoring candidate answers.
Questions are classified with regard to their expected answer,
factoid questions, which ask for concise answers such as named entities (e.g., What is the capital of Turkey?),
list questions seeking lists of such factoid answers (e.g., Which countries are in NATO?).
Attempts have been made to tackle questions with complex answers, such as definitional questions requesting
information on a given topic, including biographies for people (e.g., Who is Albert Einstein?),
relationship questions (e.g., What is the relationship between the Taliban and Al-Qaeda?),
opinion questions (e.g., What do people like about IKEA?).
36. CHAPTER 13. QUESTION ANSWERING
(Arabic title: Question Answering)
Future Directions:
Reliable confidence estimates for the top answers.
Crosslingual QA systems that translate answers back to the language in which the question
was asked.
General-purpose QA algorithms and techniques that can be adapted rapidly to new tasks
and achieve high performance across different domains.
QA systems that provide complex answers.
How and why questions seeking explanations or justifications
Yes–no questions requiring a system to determine whether the combined knowledge in the available information
sources entails a hypothesis.
Deeper NLP techniques to find answers in sources that lack semantic redundancy.
QA systems that support user interactions and information sources in different languages.
37. CHAPTER 14. DISTILLATION
(Arabic title: Distillation)
Distillation queries can be complex and require complex answers.
For example: Describe the reactions of <COUNTRY> to <EVENT>.
The Rosetta Consortium Distillation System: built as part of the GALE program. The system is
designed to answer distillation queries run against a large corpus composed of text documents and
audio recordings in multiple languages: English, Arabic, and Mandarin. Text sources are assumed to
belong to two main categories: structured and unstructured.
Three Stages:
Document preparation: recordings are transcribed, and text and transcripts in foreign languages are
translated into English. Tokenization, part-of-speech (POS) tagging, parsing, mention detection, and
semantic role labeling, which rely on maximum entropy (MaxEnt) models, are then performed.
Indexing: documents are indexed using an open source search engine, Lucene.
Query answering: takes as input a GALE-style query, and returns a list of main snippets with associated
supporting snippets and citations, sorted in decreasing order of relevance to the query. The architecture of
the system consists of five stages: query preprocessing, document retrieval, snippet filtering, snippet
processing, and planning.
38. CHAPTER 14. DISTILLATION
(Arabic title: Distillation)
Challenges
The lack of publicly available corpora for measuring the progress of the field.
The difficulty and cost of evaluating the outputs of distillation systems due to the lack of
automatic metrics.
39. CHAPTER 15. SPOKEN DIALOG SYSTEMS
(Arabic title: Spoken Dialog Systems)
A spoken dialog system is a complex machine that
manages goal-oriented user interactions.
Functional architecture:
Speech recognition and understanding module: to
assign one or more semantic tags to each speech
input.
Speech generation module: Rule-based grammar is
used, which encodes both the syntax and semantics of
possible utterances.
Dialog manager: uses a finite-state machine approach
by explicitly encoding the whole interaction into what
is generally known as call-flow.
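The call-flow idea can be sketched as an explicit state machine driven by the semantic tags from the understanding module; the states, tags, and prompts below are invented for illustration:

```python
# Minimal finite-state dialog manager: the call-flow is an explicit state
# graph, and transitions are keyed on the semantic tag produced by the
# understanding module. States, tags, and prompts are invented here.
CALL_FLOW = {
    "greeting": {"prompt": "Where are you flying to?",
                 "next": {"city": "confirm"}},
    "confirm":  {"prompt": "Did you say {slot}? (yes/no)",
                 "next": {"yes": "done", "no": "greeting"}},
    "done":     {"prompt": "Booking your flight. Goodbye!",
                 "next": {}},
}

def run_dialog(inputs):
    """Drive the call-flow with scripted (semantic_tag, value) inputs."""
    state, slot, transcript = "greeting", None, []
    for tag, value in inputs:
        transcript.append(CALL_FLOW[state]["prompt"].format(slot=slot))
        nxt = CALL_FLOW[state]["next"].get(tag)
        if nxt is None:  # tag not accepted in this state: reprompt
            transcript.append("Sorry, I did not understand.")
            continue
        if value is not None:
            slot = value
        state = nxt
    transcript.append(CALL_FLOW[state]["prompt"].format(slot=slot))
    return transcript

print(run_dialog([("city", "Paris"), ("yes", None)])[-1])
```

Encoding the whole interaction explicitly like this is simple and predictable but scales poorly, which is what motivates statistical dialog managers.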
40. CHAPTER 16. COMBINING NATURAL LANGUAGE
PROCESSING ENGINES
(Arabic title: Combining Natural Language Processing Engines)
Many engines are now attaining accuracy sufficient to enable combining them to
serve more complex tasks than were possible before.
Example applications: semantic search, enterprise reporting and other business intelligence,
question answering, medical-abstract mining, and crosslingual search, audio/video search
and cataloging, speech-to-speech translation, and foreign broadcast news analysis.
Applications like these share many common engines, such as speaker identification, speech-
to-text, text tokenization, grammatical parsing, named entity detection, coreference analysis,
part-of-speech labeling, and translation.
Aggregation poses several challenges: Heterogeneous computing environments,
Remote operation, Data formats, Exception handling.
41. CHAPTER 16. COMBINING NATURAL LANGUAGE
PROCESSING ENGINES
(Arabic title: Combining Natural Language Processing Engines)
Desired Attributes of Architectures for Aggregating Speech and NLP Engines:
Flexible, Distributed Componentization.
Computational Efficiency.
Data-Manipulation Capabilities.
Robust Processing.
Frameworks that support integration into more complex applications:
UIMA
GATE: General Architecture for Text Engineering
InfoSphere Streams