Electrical & Computer Engineering: An International Journal (ECIJ) Vol.6, No.3/4, December 2017
DOI : 10.14810/ecij.2017.6401
USING MACHINE LEARNING TO BUILD A
SEMI-INTELLIGENT BOT
Ali Rahmani, Patrick Laffitte, Raja Haddad and Yassin Chabeb
Research and development entity, Data Science team, Palo IT, Paris, France
ABSTRACT
Nowadays, real-time and intelligent systems offer more and more control interfaces based on voice or natural-language recognition. Robots and drones will soon be controlled mainly by voice. Other robots will integrate bots to interact with their users, which can be useful both in industry and in entertainment. At first, researchers focused on "ontology reasoning". Given the technical constraints involved in processing ontologies, an interesting alternative has emerged in recent years: building a machine-learning model that connects human language to a knowledge base (based, for example, on RDF). In this paper we present our contribution to building a bot, usable on real-time systems and drones/robots, with recent machine learning technologies.
KEYWORDS
Real-time systems, Intelligent systems, Machine learning, Bot
1. INTRODUCTION
We present here our contributions within Palo IT [17] over a year of research and development activities. The main part of the R&D entity works on Data Science trends and intelligent systems based on machine learning technologies. The aim of this project was to create a semi-intelligent bot able to analyse facts, reason and answer questions using machine learning methods. This paper provides an overview of our work on this project and consists of four parts. In the first part, we present the context of the project, the problems addressed, some related works and the objectives. In the second part, we detail the tasks that we dealt with during our research project. The implementation and testing of different text-mining methods, applied to text data in French, and the test results are detailed in the third part. Finally, we conclude with our analysis of what we acquired through this project and its future scope.
2. GLOBAL OVERVIEW
We present here the context, the issues, some related work, and our objectives.
2.1. Context
The amount of text data on the web, or stored by companies, is growing continuously. In order to exploit this wealth, it is essential to extract knowledge from such data. The discipline dealing with this type of data, called "Text Mining", includes several problems such as search indexing of documents, summary generation, creation of bots, etc. The work done during our project is part of Palo IT's textual and data analysis research and aims to create a semi-intelligent bot. For this purpose an internal R&D project was launched. This project, "PetiText" (petit = small, hence "SmallText" in English), is based on analysing and reasoning on short sentences to detect new facts and answer questions. It involves an analysis of data from text corpora which allows to:
• extract targeted, sorted and high-value information for companies using algorithms
• search for similarities and identify causal relationships between different facts
• detect behaviours and intentions
• answer decision-makers' questions
• guide marketing actions and set up alerts on devices.
2.2. Issues
Faced with the growing demand of Palo IT customers to extract knowledge from their textual data, the PetiText R&D project was launched. Indeed, these customers possess documents and tools for collecting customer and employee reviews and complaints; hence the need to design and implement a tool for analysing this type of data. Text data, poorly exploited by most companies, represents a wealth of information: its analysis is a means of decision support and a strategic asset for companies. A study of existing Text Mining products shows a major flaw for processing text data written in French. This defect consists of the almost total absence of open-source libraries incorporating the semantics of the French language. Indeed, unlike for English, we found that most libraries and tools used globally for this type of problem (such as Clips [1], NLTK [2], etc.) are not reliable when it comes to dealing with French documents. For these reasons it was decided to build a new tool, combining several text analysis methods, that handles the French language and allows the machine to reason like a two-year-old child.
2.3. Related works
Some authors have proposed to deal with these issues through deep learning and ontology reasoning. In [14], Patrick Hohenecker and Thomas Lukasiewicz, from the Department of Computer Science of the University of Oxford, introduce a new model for statistical relational learning built upon deep recursive neural networks, and give experimental evidence that it can easily compete with, or even outperform, existing logic-based reasoners on the task of ontology reasoning. Other authors have recently proposed in [15] a model that builds an abstract knowledge graph of the entities and relations present in a document, which can then be used to answer questions about the document. It is trained end-to-end: the only supervision to the model is in the form of correct answers to the questions. Thuy Vu and D. Stott Parker [16] describe a technique for adding contextual distinctions to word embeddings by extending the usual embedding process into two phases. The first phase resembles existing methods, but also constructs K classifications of concepts. The second phase uses these classifications to develop refined K embeddings for words, namely word K-embeddings. We propose to complete these propositions with an approach that connects human language and knowledge bases (we start here with French, but the approach should carry over to other languages).
2.4. Objectives
The aim of our project was to carry out all the steps of creating a semi-intelligent bot. This bot learns facts from existing textual resources by conducting a thorough analysis. It must then be able to deduce new facts and answer open questions. To achieve this, a combination of different methods of textual data analysis was used. These methods can be grouped along three axes:
• Frequency analysis: using metrics based on the detection of global information and characteristics of a text (keywords, rare words, etc.).
• Knowledge analysis: based on keyword analysis and knowledge mapping, for the classification of subjects or the extraction of knowledge rules (logical rules).
• Semantic analysis: based on the analysis of context and emotions to contextualize a given text.
2.5. Technological choices and human resources
"PetiText" is a project that is part of Palo IT's Data Science activities, led by three PhDs: a data science expert as supervisor, Mr. Patrick LAFFITTE, together with Mrs. Raja HADDAD and Mr. Yassin CHABEB. Thanks to the wealth of existing Python libraries dedicated to machine learning, the choice of that language was obvious. Regarding data storage, we used ZODB (Zope Object DataBase), a hierarchical and object-oriented database that can store data as Python objects. We used Gensim [3] and Scikit-learn [4], two Python libraries that implement various machine learning methods and facilitate the statistical treatment of data. These learning methods and statistical computations require considerable hardware resources, due to the volume of data to process and especially the computation time; therefore, two remote OVH machines were rented, with the following configurations: Machine n°1: 8 CPUs, 16 GB of RAM and a GPU; Machine n°2: 16 CPUs and 128 GB of RAM.
3. TASKS CARRIED OUT
During our project, we participated in the implementation of several tasks on the textual analysis of sentences from corpora of documents, in order to create an intelligent bot capable of answering questions in real-time interaction. The approach on which the bot is based would also make it possible to exploit the roughly 80% of stored data that is currently not exploited by enterprises' businesses, and to generate many hidden facts from it.
3.1. Drawing conclusions from a set of sentences
The objective of this task is to create and implement a formal logic model. This was achieved by combining the results of two tools. The first applies a logical model to a set of sentences modelled as relationships between objects; these relationships were extracted using the second tool, CoreNLP. Appendix A shows an example of the application of our logic model on a set of sentences about the family domain.
3.1.1. Building a logic model
This model is based on interpreting the world as a set of facts, and every fact is the relationship
between two or more objects. Knowing that the objects of a sentence are a fact (a relationship), it
is sufficient to apply logical rules that we have defined to derive and generate new information
between different objects.
A simple example:
➢ Man is a creature. [Fact in input]
➢ John is a man. [Fact in input]
✓ John is a creature. [Generated fact]
To extract new information from a given set of facts, we have implemented a set of logical rules.
When a rule can generate a new fact (called conclusion), or a hypothesis. Each hypothesis can
become a conclusion if new facts arrive and validate it. The logical rules that we have defined in
this part:
• Conclusions:
- If obj1 has obj2, Then obj2 is part of obj1.
- If obj1 is obj2 AND obj2 is obj3, Then obj1 is obj3.
- If obj1 is obj2 OR obj3, Then obj1 is obj2 OR obj1 is obj3.
- If obj1 is obj2 OR obj3, AND obj2 is obj2-1 OR obj2-2, AND obj3 is obj3-1 OR obj3-2, Then obj1 is obj2-1 OR obj2-2 OR obj3-1 OR obj3-2.
- If obj1 is obj2 OR obj3, AND obj2 is obj4, AND obj3 is obj5, Then obj1 is obj4 OR obj5.
- If obj1 is obj2 AND obj3 is obj2, Then obj1 AND obj3 are obj2.
• Hypotheses:
- If obj1 is obj2 OR obj3, AND obj3 is obj2, Then obj3 is probably obj1.
- If obj1 is obj2 OR obj3, AND obj4 is obj5 AND obj2, Then obj4 is probably obj1.
- If obj1 is obj2 OR obj3, AND obj2 is probably obj4, Then obj1 is probably obj4.
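As a minimal sketch of how one such rule can be mechanized (this is our own illustration, not the paper's implementation: the triple representation and function name are invented), the transitivity rule "If obj1 is obj2 AND obj2 is obj3, Then obj1 is obj3" can be iterated to a fixed point:

```python
# Sketch: forward-chaining the "is" transitivity rule over fact triples.
def apply_is_a_transitivity(facts):
    """Facts are ('is', subject, object) triples; returns newly derived facts."""
    known = set(facts)
    derived = set()
    changed = True
    while changed:  # iterate to a fixed point so chains of any length resolve
        changed = False
        for (_, a, b) in list(known):
            for (_, c, d) in list(known):
                if b == c:
                    new = ('is', a, d)
                    if new not in known:
                        known.add(new)
                        derived.add(new)
                        changed = True
    return derived

facts = {('is', 'John', 'man'), ('is', 'man', 'creature')}
print(apply_is_a_transitivity(facts))  # {('is', 'John', 'creature')}
```

This reproduces the "John is a creature" example above; the other conclusion and hypothesis rules would each add a similar matching clause.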
3.1.2. CoreNLP
The construction of logic models from the input sentences first needs a level of understanding derived from syntactic and grammatical analysis. For this we used Stanford University's natural language processing (NLP) tool, CoreNLP [5]. CoreNLP fetches the lemmas of words: it identifies their base form from plurals, conjugations and grammatical declensions. It also exposes the overall structure of the sentence, analysing subject, verb, etc., and acts as a parser that adapts to turns of phrase. Figure 1 shows an example of a sentence analysed with CoreNLP: the edges represent the overall structure of the sentence, the blue tags represent the lemmas of the different words, and the red tags give the grammatical class of each word.
Figure 1. Example of CoreNLP analysis result
3.2. Recovering Data (scraping [6] with XPath [7])
Web scraping is a set of techniques for extracting the contents of a source (a website or other). The goal is to transform the retrieved data for use:
• for rapid integration between applications (when no API is available);
• or to store the data in a database to be analysed later.
In this project we used web scraping to store as many French-language definitions as possible. This is what gives our bot its intelligent aspect: for it to answer questions, the meaning of words and phrases is essential, hence the choice of dictionary and synonym websites. To scrape definitions from the various dictionary sites we used XPath. This library allows Python to extract information (elements, attributes, comments, etc.) from a document through the formulation of expressions comprising:
• an axis (child or parent);
• a node (name or type);
• one or more predicates (optional).
These definitions and synonyms served as a learning base allowing the bot to learn the meaning of words and phrases in different contexts.
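A minimal sketch of such XPath extraction with the lxml library follows; the HTML markup and class names here are invented for illustration, as real dictionary pages differ per site:

```python
# Sketch of XPath-based extraction of word/definition pairs with lxml.
from lxml import html

page = """
<html><body>
  <div class="entry"><span class="word">famille</span>
    <p class="definition">Ensemble des personnes unies par le sang.</p></div>
  <div class="entry"><span class="word">ancetre</span>
    <p class="definition">Personne dont on descend.</p></div>
</body></html>
"""

tree = html.fromstring(page)
# Each XPath expression combines an axis (descendant, '//'),
# a node test (span, p) and a predicate ([@class="..."]).
words = tree.xpath('//span[@class="word"]/text()')
defs = tree.xpath('//p[@class="definition"]/text()')
dictionary = dict(zip(words, defs))
print(dictionary["famille"])
```

In practice the page would be fetched over HTTP and the resulting pairs persisted to the database for later learning.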
3.3. Develop statistical models and learning
During this research project we implemented and tested several text-data learning models, such as TF-IDF and word embeddings. Applying these two models gave us a clear idea of the use cases for each.
3.3.1. TF-IDF
Extracting relevant information from textual sources relies on statistical models. These are used to detect rare words (hence the most significant ones) and to eliminate less significant ones, such as stop-words, which do not depend on the context. The most commonly used technique for this is TF-IDF [8] (Term Frequency - Inverse Document Frequency), computed as tf-idf(t, d, D) = tf(t, d) × idf(t, D), with idf(t, D) = log(N / |{d ∈ D : t appears in d}|), where:
• tf(t, d): the frequency of the term t in the document d.
• D: the set of all documents.
• N: the total number of documents in the corpus (N = |D|).
• d: a document of D.
• t: a term in a document.
3.3.2. Word embeddings
To build a learning model based on word embeddings, every sentence is converted into a vector of real values. A model based on a succession of several layers of neural networks is then applied to detect semantics, contexts and the relationships between them, and to classify new texts through unsupervised learning. To apply word embeddings to a textual corpus we used different Python libraries, such as Gensim (word embeddings) and Scikit-learn, which offer all the necessary methods (Doc2Vec, Word2Vec, ...).
3.3.3. TF-IDF versus Word embeddings
We assessed the reliability of the two learning models cited earlier. The evaluation was performed on the 20 Newsgroups data set: 20,000 items in 20 categories (1,000 items per category). In this test, TF-IDF reaches a score of 58% while word embeddings reach 98%. This result supported our choice of word embeddings as the learning model for our bot.
4. INTELLIGENT BOT
To develop an intelligent (or semi-intelligent) bot that can analyse facts, learn and answer questions, we combined the different parts detailed in the previous section. Our bot thus combines quality text processing, classical logic and machine learning (semantics). It mainly combines three components: logic, semantics and a training/learning base. The bot must be logical, intelligent and autonomous, and must render services. In our case the services are to answer questions and to generate, from the facts given, the different contexts, conclusions and assumptions that enrich its knowledge base. Aside from the natural language processing, which must run automatically as new facts arrive, several challenges need to be resolved to reach reliable conclusions.
4.1. The logic
The bot must be logical in its computations and answers, like a human; that is why a conventional logic model was developed. This model validates, or not, the facts that are available, based on the rules of basic classical logic. This makes it possible to reason about the facts and to generate conclusions to be potentially added to the bot's knowledge base. Before moving to the logical model, the facts, which are simply sentences, must first be processed and decomposed by a natural language processing tool. Regarding complexity and execution time, the logic model benefits from the simplicity of classical logic rules and is very fast, because the facts are relatively simple sentences composed of a single verb, subject, complement and conjunction.
4.2. The basic knowledge and autonomy of the bot
The knowledge base is made up of all the facts and conclusions that are valid. It is enriched as facts arrive and conclusions are generated. The bot is then autonomous and no longer depends on human intervention, as textual data sources feed it with new facts. Figure 2 summarizes the fact validation process. Still, when the bot is launched for the first time, it must already have the basic knowledge it needs in order to learn, so that it can detect and recognize context and thus work on the facts that arrive. Since the bot is meant to reason and learn like a young boy, the most logical starting point is a knowledge base built on dictionary definitions (for each word, several definitions depending on context) and synonyms. These were collected by scraping several specialized websites. The initial knowledge base contains 38,565 definitions covering 226,303 words.
Figure 2. New facts validation process
4.3. The intelligence
Now that we have a logic model and an initial knowledge base, the challenge is to find a technique that allows the bot to detect the context of newly arriving facts and to rank them; this is exactly where word embeddings are used. A recent Deep Learning idea is that the approximate meaning of a word or sentence can be represented as a vector in a multidimensional space, where nearer vectors represent similar meanings. To do so, we used Gensim, a Python library designed to automatically extract semantic topics from documents as efficiently as possible. Gensim is designed to process raw, unstructured digital text. The algorithms in Gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation and random projections, describe the semantic structure of documents; this structure is extracted by examining the statistical patterns of co-occurrences of words in a corpus of training documents. These algorithms are unsupervised, which means that no human input is required: the only input is the corpus of text documents used to train the model. Gensim allows to:
• collect and process semantically similar documents;
• analyse text documents for their semantic structure;
• perform scalable statistical semantics.
4.3.1. Word Embeddings [9, 10]
Word embeddings are one of the most exciting areas of research in the field of Deep Learning. A word embedding WE: words → ℝⁿ is a parameterized function that maps the words of a language to high-dimensional vectors (typically of size 100, 200, up to 500). Essentially, each word is represented by a numerical vector. For example, we might find (poire means pear in French):
• WE("pair") = (0.2, -0.4, 0.7, ...)
• WE("paire") = (0.2, -0.1, 0.7, ...)
• WE("pear") = (0.0, -0.3, 0.1, ...)
• WE("poire") = (0.1, -0.1, 0.2, ...)
The purpose and usefulness of word embeddings consist in grouping the vectors of similar words in a vector space; mathematically, this amounts to detecting similarities between different vectors. These are numerical representations describing characteristics of the word, such as its context. Word embeddings have several variants, including:
• Word2vec
• Doc2vec
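The vector similarity mentioned above is typically measured with the cosine. A minimal sketch, using the first three components of the example vectors above (truncated, so the values are only illustrative):

```python
# Sketch: cosine similarity between (truncated) example embedding vectors.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

we = {
    "pair":  (0.2, -0.4, 0.7),
    "paire": (0.2, -0.1, 0.7),
    "pear":  (0.0, -0.3, 0.1),
    "poire": (0.1, -0.1, 0.2),
}

# "pair" and "paire" have nearly aligned vectors, so their cosine is high.
print(cosine(we["pair"], we["paire"]))
```

A cosine of 1 means identical direction (maximal similarity); values near 0 mean unrelated vectors.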
4.3.1.1 Word2vec [11]
Word2vec is a two-layer neural network that processes text. Its input is a text corpus and its output is a set of vectors encoding semantic features of the words in the corpus. Word2vec is not a deep neural network in itself, but it is very useful because it turns text into a numerical form (vectors) that deep networks can understand. Figure 3 summarizes the process used in the word2vec algorithm. Words can be considered as discrete states; we then simply look for the transition probabilities between these states, i.e. the probability that they occur together. In this case, words appearing in similar contexts get close vectors (the closer the cosine is to 1, the more similar the contexts of the words).
Figure 3. Description of the word2vec steps (Gensim [3])
In Mikolov's [12] introduction to word2vec learning, each word is mapped to a single vector, represented by a column in a matrix. The column is indexed by the position of the word in the vocabulary. The concatenation, or the sum, of the vectors is then used as a feature for predicting the next word in a sentence. Figure 4 gives an example of word2vec concatenation.
Figure 4. Word2vec example [12]
Given enough data, usages and contexts, word2vec can make very accurate guesses about the meaning of a word based on its past appearances. These guesses can be used to establish associations between words in terms of vectors. For example:
W('woman') - W('man') ≈ W('queen') - W('king')
Word2vec represents words through the other words it detects around them in the input corpus. Word2vec includes two methods:
1. Continuous Bag of Words model (CBOW):
• The context (surrounding words) is used to predict the target word.
2. Skip-gram with negative sampling, or skip-gram:
• A word is used to predict a target context (surrounding words).
• This method can also work well with a small amount of training data, and it can represent words or short sentences well.
Figure 5. Architecture of the CBOW and Skip-gram methods. w(t) is the current word; w(t-1), w(t-2), ... are the words surrounding it.
4.3.1.2 Doc2vec [13]
Doc2vec (Paragraph2vec) adapts the word2vec algorithm (it is a generalization of word2vec) to the unsupervised learning of continuous representations for larger blocks of text, such as sentences, paragraphs or entire documents.
Figure 6. Description of doc2vec [3]
Doc2vec performs learning on a large set of documents and creates a vector-space model in which each document is represented by a vector composed from its word vectors. To obtain degrees of similarity, the "most_similar" method uses the cosine between vectors: the closer the cosine is to 1, the higher the similarity. Figure 6 illustrates the steps of the doc2vec algorithm. To apply Doc2Vec, two methods can be used:
• the distributed memory model (DM)
• the distributed bag of words (DBOW)
a. Distributed memory model (DM)
This method considers the paragraph vector together with the vectors of the paragraph's words (word2vec) to predict the next word in a text. Using the distributed memory model (DM) comprises:
• randomly assigning a paragraph vector to each document;
• predicting the next word using the word context plus the paragraph vector;
• sliding the context window over the document while the paragraph vector stays fixed (hence "distributed memory").
Figure 7 illustrates the operation of the distributed memory model.
Figure 7. DM model of doc2vec [12]
b. DBOW: Distributed Bag Of Words
This method (DBOW) ignores the context words at the input: a single paragraph vector predicts the words in a small window. It requires less storage and is very similar to the skip-gram method of word2vec [12]. DBOW is less efficient than DM; however, the combination of the two methods, DM + DBOW, is the best way to build a Doc2vec model. As shown in Figure 8, the DBOW method involves:
• using only paragraph vectors (no word2vec);
• taking a window of words in a paragraph and randomly predicting which word, using paragraph vectors (ignoring word order).
Figure 8. DBOW model of doc2vec [12]
4.4. Tests and applications
As part of the "PetiText" project, the doc2vec model is built from all the available dictionaries of definitions, where each definition is tagged with a word. We build the model using the DBOW method. To assign a tag to a new definition, the model infers its vector and returns the tag of the definitions having the highest cosine similarity.
4.4.1. Learning times and scores
With a machine having 16 CPUs and 128 GB of RAM, the learning time of the model on 38,565 definitions covering 226,303 words varies proportionally with the model parameters: the size of the generated vectors and the number of learning iterations. Table 1 shows the results of the different tests we performed. We note that the learning time and its quality are proportional to the number of iterations of the algorithm. We also find that the best results are obtained with 200 iterations and vectors of size 200. Figure 9 illustrates the evolution of the score according to the number of iterations.

Table 1. Overview of learning times and scores of different models.

Vector size   Iterations   Learning time   Score
100           50           8 min.          77%
100           100          16 min.         79%
100           200          32 min.         80%
200           50           9 min.          74%
200           100          17 min.         80%
200           200          32 min.         84%

Figure 9. Scores obtained according to the size of the vectors and the number of iterations
4.4.2. Assignment example of a tag to a new sentence
Considering the example of a family word definition:
[ 'Generation', 'successive', 'down', 'ancestor', 'lined']
The sentence has been normalized by removing stop
words are lemmatized.
The vector inferred by the model parameter
Electrical & Computer Engineering: An International Journal (ECIJ) Vol.6, No.3/4, December 2017
With a machine having 16 CPUs and RAM 128Gb, the learning period of the model on 38,565
definitions of 226,303 words varies proportionally to the model parameters, the size of the
generated vectors and the number of iteration of learning.
Table 1 shows the results of different tests we performed. We note that the learning time and its
quality are proportional to the number of iterations of the algorithm. We also find that the best
results are obtained with 200 iterations and vectors of size 200. Figure 9 illustrates the evolution
of the score according to the number of iterations.
Table 1. Overview of learning times and scores of different models.
Size vectors Iterations Learning Time Score
50 8 min. 77%
100 16 min. 79%
200 32 min. 80%
50 9 min. 74%
100 17 min. 80%
200 32 min. 84%
Figure 8. Scores obtained according to the size of the vectors and the number of iterations
Assignment example of a tag to a new sentence
example of a family word definition:
[ 'Generation', 'successive', 'down', 'ancestor', 'lined']
The sentence has been normalized by removing stop-words, verbs are put in the infinitive and
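This normalization step (stop-word removal, verbs to the infinitive, lemmatization) can be sketched as below. The stop-word list and lemma map are toy assumptions for illustration; the project relied on full NLP tooling such as CLiPS pattern [1] for this stage.

```python
import re

# Toy French stop-word list and lemma map (illustrative assumptions only;
# real lemmatization was done with dedicated NLP tools).
STOP_WORDS = {"la", "le", "les", "de", "des", "un", "une", "en", "et"}
LEMMAS = {"successives": "successive", "générations": "génération"}

def normalize(sentence):
    """Lowercase, tokenize, drop stop-words, then map tokens to lemmas."""
    tokens = re.findall(r"[a-zàâçéèêëîïôöûùüÿ'-]+", sentence.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]

print(normalize("La suite des générations successives"))
# → ['suite', 'génération', 'successive']
```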
The vector inferred by the model with parameter vector size 200 corresponds to:
[-2.53752857e-01 -2.71043032e-02 4.33574356e-02 -9.83970612e-02 2.55723894e-01 -7.85913542e-02 -5.09732738e...]
Figure 10 shows the tags that the different models have found by calculating the cosine between
the inferred vector of the previous sentence and the vectors of all the definitions in the vector
space model.
Figure 10. Search results on the context of the preceding sentence regarding the word family
For this example, we notice that the models having size vectors 200 and number of iterations 100
and 200 are the best performers, since they are the only ones that returned the tag
"family", which corresponds to the definition of the input sentence.
Figure 11 is a screenshot of our first bot prototype. It is a French PetiText reasoner about three
universes/contexts: family, abstract objects, biological organisms.
Then, we choose the family definition context and we list facts that we give to the bot.
Fact 1: a person is a man or a woman
Fact 2: a woman is female
…
In Figure 12, the bot starts reasoning on the facts in order, generating conclusions and
hypotheses:
Generated fact 1: parents and kids are parts of the same family
Generated fact 2: a father is a male
…
The same process was tested in real-time interaction and it works; we can also add new facts at
the bottom of the screen.
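The reasoning loop illustrated above (facts in, generated facts out) behaves like forward chaining: rules are applied to the known facts until no new conclusion appears. A minimal sketch, using transitivity of "is-a" facts as the single rule and a toy fact base (assumed for illustration, not the bot's actual knowledge base):

```python
def forward_chain(facts):
    """Derive new (x, z) 'is-a' facts by transitivity until a fixpoint."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(derived):
            for (a, b) in list(derived):
                if y == a and (x, b) not in derived:
                    derived.add((x, b))  # x is-a y and y is-a b, so x is-a b
                    changed = True
    return derived

# Toy 'is-a' facts in the spirit of the session above.
facts = {("father", "parent"), ("parent", "person"), ("person", "man_or_woman")}
closure = forward_chain(facts)
print(("father", "person") in closure)  # → True
```

The bot's real engine works over the PetiText definition contexts rather than bare pairs, but the fixpoint loop is the same idea.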
Figure 11. First and second screens of the bot prototype
Figure 12. Reasoning screen
5. CONCLUSIONS
We have presented here the different steps and tools combined in order to build a semi-intelligent
bot based on machine learning technologies. With a solid foundation of learning, a logic model,
word embeddings with good scores (≈ 84%), and current and future improvements, we hope to
combine the three parts in order to have a functional bot. We use advanced tools and
technologies that are very recent and widespread in Data Science: the Python programming
language, Jupyter Notebook for a complete development environment, Gensim for word
embeddings, and advanced natural language processing tools such as CLiPS and Stanford
CoreNLP. During the project, we used notably methods of classification and clustering,
Data Mining and Text Mining, Sentiment Analysis, and evaluations of the quality of classifiers
(Recall, Precision, F-Measure). We also explored current systems, languages and paradigms for
Big Data and advanced Big Data analytics, which showed us the world of Big Data along with
many use cases and market opportunities. We have tested some Big Data architectures, including
Hadoop and Spark, with the Python, Java and Scala languages.
As the subject of the project is part of the R&D activities, the ultimate goal was clear and
understandable, but in practice, problems arose in understanding how to achieve the objectives.
Indeed, understanding how to extract the value and meaning of a text is not easy. Then follows
the difficulty of understanding how the chosen methods work, which algorithms to use and
when to use them. Development work and tests, reading papers and publications of other
researchers, and studying the documentation were all done in order to understand the subject, to
progress and to obtain good results (≈ 84%), which was not the case at the beginning (≈ 60%).
After months of data processing (scraping, natural language processing, data cleaning, data
standardization...) and the development of logic and learning models (word embeddings), the
first satisfactory results/scores were obtained. We now plan to make improvements and to use
new techniques in the coming months. We will use LSTM (Long Short-Term Memory), an
architecture of recurrent neural networks (RNN), which should further improve the quality of
prediction and classification. We also plan to finalise the integration of the bot within a Parrot
drone that we control by voice, thanks to a previous research project, in order to complete a
global interactive real-time interface between humans and drones/robots [18].
ACKNOWLEDGEMENTS
We would like to thank everyone who helped us during the current year.
REFERENCES
[1] CLiPS, https://www.clips.uantwerpen.be/PAGES/PATTERN-FR.
[2] NLTK, http://www.nltk.org/
[3] Gensim, https://radimrehurek.com/project/gensim/
[4] Scikit learn, http://scikit-learn.org/stable/
[5] CoreNLP, https://stanfordnlp.github.io/CoreNLP/
[6] Web scraping, https://www.webharvy.com/articles/what-is-web-scraping.html
[7] XPath, https://www.w3.org/TR/1999/REC-xpath-19991116/
[8] TF-IDF, http://www.tfidf.com/
[9] Quoc V. Le & Tomas Mikolov (2014) Distributed Representations of Sentences and Documents.
CoRR abs/1405.4053
[10] Word-Embeddings, http://sanjaymeena.io/tech/word-embeddings/
[11] Word2vec and Doc2vec, http://gensim.narkive.com/RavqZorK/gensim-4914-graphic-representations-
of-word2vec-and-doc2vec
[12] Quoc V. Le & Tomas Mikolov (2014) Distributed Representations of Sentences and Documents.
CoRR abs/1405.4053
[13] Representations of word2vec and doc2vec, http://gensim.narkive.com/RavqZorK/gensim-4914-
graphic-representations-of-word2vec-and-doc2vec
[14] Patrick Hohenecker & Thomas Lukasiewicz (2017) “Deep Learning for Ontology Reasoning”, CoRR
abs/1705.10342
[15] Trapit Bansal, Arvind Neelakantan & Andrew McCallum (2017) “RelNet: End-to-end Modeling of
Entities & Relations”, University of Massachusetts Amherst, CoRR abs/1706.07179.
[16] Thuy Vu & Douglas Stott Parker, (2016) “$K$-Embeddings: Learning Conceptual Embeddings for
Words using Context”, HLT-NAACL, pp 1262-1267
[17] Palo IT, http://palo-it.com/
[18] Voice IT, https://github.com/Palo-IT/voice-IT
AUTHORS
Ali Rahmani, data engineer, Palo IT
Patrick Laffitte, PhD and data science expert, Palo IT
Raja Haddad, PhD and data scientist, Palo IT
Yassin Chabeb, PhD, R&D Consultant, Palo IT

4th International Conference on Electrical Engineering (ELEC 2020)
 
Electrical & Computer Engineering: An International Journal (ECIJ)
Electrical & Computer Engineering: An International Journal (ECIJ)Electrical & Computer Engineering: An International Journal (ECIJ)
Electrical & Computer Engineering: An International Journal (ECIJ)
 
4th International Conference on Bioscience & Engineering (BIEN 2020)
4th International Conference on Bioscience & Engineering (BIEN 2020) 4th International Conference on Bioscience & Engineering (BIEN 2020)
4th International Conference on Bioscience & Engineering (BIEN 2020)
 
Electrical & Computer Engineering: An International Journal (ECIJ)
Electrical & Computer Engineering: An International Journal (ECIJ)Electrical & Computer Engineering: An International Journal (ECIJ)
Electrical & Computer Engineering: An International Journal (ECIJ)
 
Ecij cfp
Ecij cfpEcij cfp
Ecij cfp
 
GRID SIDE CONVERTER CONTROL IN DFIG BASED WIND SYSTEM USING ENHANCED HYSTERES...
GRID SIDE CONVERTER CONTROL IN DFIG BASED WIND SYSTEM USING ENHANCED HYSTERES...GRID SIDE CONVERTER CONTROL IN DFIG BASED WIND SYSTEM USING ENHANCED HYSTERES...
GRID SIDE CONVERTER CONTROL IN DFIG BASED WIND SYSTEM USING ENHANCED HYSTERES...
 
UNION OF GRAVITATIONAL AND ELECTROMAGNETIC FIELDS ON THE BASIS OF NONTRADITIO...
UNION OF GRAVITATIONAL AND ELECTROMAGNETIC FIELDS ON THE BASIS OF NONTRADITIO...UNION OF GRAVITATIONAL AND ELECTROMAGNETIC FIELDS ON THE BASIS OF NONTRADITIO...
UNION OF GRAVITATIONAL AND ELECTROMAGNETIC FIELDS ON THE BASIS OF NONTRADITIO...
 
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
 
MODELING AND SIMULATION OF SOLAR PHOTOVOLTAIC APPLICATION BASED MULTILEVEL IN...
MODELING AND SIMULATION OF SOLAR PHOTOVOLTAIC APPLICATION BASED MULTILEVEL IN...MODELING AND SIMULATION OF SOLAR PHOTOVOLTAIC APPLICATION BASED MULTILEVEL IN...
MODELING AND SIMULATION OF SOLAR PHOTOVOLTAIC APPLICATION BASED MULTILEVEL IN...
 
Investigation of Interleaved Boost Converter with Voltage multiplier for PV w...
Investigation of Interleaved Boost Converter with Voltage multiplier for PV w...Investigation of Interleaved Boost Converter with Voltage multiplier for PV w...
Investigation of Interleaved Boost Converter with Voltage multiplier for PV w...
 
A COMPARISON BETWEEN SWARM INTELLIGENCE ALGORITHMS FOR ROUTING PROBLEMS
A COMPARISON BETWEEN SWARM INTELLIGENCE ALGORITHMS FOR ROUTING PROBLEMSA COMPARISON BETWEEN SWARM INTELLIGENCE ALGORITHMS FOR ROUTING PROBLEMS
A COMPARISON BETWEEN SWARM INTELLIGENCE ALGORITHMS FOR ROUTING PROBLEMS
 

Dernier

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 

Dernier (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT

mining - applied to text data in French - and the test results are detailed in the third part of this paper. Finally, we conclude with our analysis of what we have acquired through this project and its future scope.

2. GLOBAL OVERVIEW

We present here the context, some related work, the issues, and our objectives.

2.1. Context

The amount of text data on the web or stored by companies is growing continuously. In order to exploit this wealth, it is essential to extract knowledge from such data. The discipline dealing
with this type of data is called "Text Mining"; it includes several issues such as search indexing of documents, summary generation, creation of bots, etc. The work done during our project is part of the enrichment of Palo IT's textual and data analysis research. It aims to create a semi-intelligent bot. For this purpose, an internal R&D project was launched. This project, "PetiText" (translated "SmallText" in English; petit = small), is based on the analysis of, and reasoning on, short sentences to detect new facts and answer questions. It involves an analysis of data from text corpora which allows to:
• extract targeted, sorted and value-added information for companies using algorithms;
• search for similarities and identify causal relationships between different facts;
• detect behaviours and intentions;
• answer questions of policy makers;
• guide marketing actions and set up alerts to devices.

2.2. Issues

Faced with the growing demand from Palo IT customers to extract knowledge from their textual data, the PetiText R&D project was launched. Indeed, these customers possess documents and tools for collecting reviews and complaints from customers or employees; hence the need to design and implement a tool for analysing this type of data. Text data, poorly used by most companies, represents a wealth of information. Its analysis is a means of decision support and a strategic asset for companies. A study of existing Text Mining products shows a major flaw in the processing of text data written in French: the almost total absence of "open source" libraries incorporating the semantics of the French language. Indeed, unlike for English, we found that most libraries and tools used globally to treat this type of problem (such as Clips [1], NLTK [2], etc.) are not reliable when it comes to dealing with French documents.
For these reasons it was decided to set up a new tool that combines several text analysis methods, handles the French language, and allows the machine to reason like a two-year-old child.

2.3. Related works

Some authors have proposed to deal with those issues through deep learning and ontology reasoning. This is the case of Patrick Hohenecker and Thomas Lukasiewicz [14], from the Department of Computer Science of the University of Oxford, who introduce a new model for statistical relational learning built upon deep recursive neural networks, and give experimental evidence that it can easily compete with, or even outperform, existing logic-based reasoners on the task of ontology reasoning. Other authors have recently proposed in [15] a model that builds an abstract knowledge graph of the entities and relations present in a document, which can then be used to answer questions about the document. It is trained end-to-end: the only supervision given to the model is in the form of correct answers to the questions. Thuy Vu and D. Stott Parker [16] describe a technique for adding contextual distinctions to word embeddings by extending the usual embedding process into two phases. The first phase resembles existing methods, but also constructs K classifications of concepts. The second phase uses these classifications to develop refined K embeddings for words, namely word K-embeddings. We propose to
complete these propositions with an approach connecting human language and knowledge bases (here we start with French, but the approach should apply in the same way to other languages).

2.4. Objectives

The aim of our project was to take part in all the steps of creating a semi-intelligent bot. This bot will learn facts from existing textual resources by conducting a thorough analysis. It must then be able to deduce new facts and answer open questions. To achieve this, a combination of different methods of textual data analysis was used. These methods can be grouped into three axes:
• Frequency analysis: using metrics based on the detection of global information and characteristics of a text (keywords, rare words, etc.).
• Knowledge analysis: based on keyword analysis and knowledge mapping, for a classification of subjects or an extraction of knowledge rules (logical rules).
• Semantic analysis: based on the analysis of context and emotions to contextualize a given text.

2.5. Technological choices and human resources

"PetiText" is a project that is part of the Data Science activities of Palo IT, led by three PhDs: a data science expert as supervisor, Mr. Patrick LAFFITTE, then Mrs. Raja HADDAD and Mr. Yassin CHABEB. Thanks to the wealth of existing Python libraries dedicated to machine learning, the choice of that language was obvious. Regarding data storage, we used ZODB (Zope Object DataBase), a hierarchical and object-oriented database that can store data as Python objects. We used Gensim [3] and Scikit-learn [4], two Python libraries that implement various machine learning methods and facilitate the statistical treatment of data. These learning methods and statistical computations require considerable material resources, due to the volume of data to process and especially the computation time; therefore, two remote OVH machines were rented.
These machines have the following configurations: Machine n°1: 8 CPUs, 16 GB of RAM and a GPU; Machine n°2: 16 CPUs and 128 GB of RAM.

3. TASKS CARRIED OUT

During our project, we took part in the implementation of several tasks on the textual analysis of sentences from corpora of documents, in order to create an intelligent bot capable of answering questions in real-time interaction. The approach on which the bot is based would also allow exploiting the roughly 80% of stored data that is not used by the enterprises' businesses, and generating many hidden facts from it.

3.1. Drawing conclusions from a set of sentences

The objective of this task is to create and implement a formal logic model. This was achieved by combining the results of two tools. The first is used to apply a logical model to a set of sentences modelled as relationships between objects. These relationships were extracted through the use of the second tool, CoreNLP. Appendix A shows an example of application of our logic model on a set of sentences about the family domain.
3.1.1. Building a logic model

This model is based on interpreting the world as a set of facts, where every fact is a relationship between two or more objects. Knowing that the objects of a sentence form a fact (a relationship), it is sufficient to apply the logical rules that we have defined to derive and generate new information between different objects. A simple example:
➢ Man is a creature. [Fact in input]
➢ John is a man. [Fact in input]
✓ John is a creature. [Generated fact]
To extract new information from a given set of facts, we have implemented a set of logical rules. A rule can generate either a new fact (called a conclusion) or a hypothesis. Each hypothesis can become a conclusion if new facts arrive and validate it. The logical rules that we have defined in this part are:
• Conclusions:
- If obj1 has obj2, then obj2 is part of obj1.
- If obj1 is obj2 AND obj2 is obj3, then obj1 is obj3.
- If obj1 is obj2 OR obj3, then obj1 is obj2 OR obj1 is obj3.
- If obj1 is obj2 OR obj3, AND obj2 is obj2-1 OR obj2-2, AND obj3 is obj3-1 OR obj3-2, then obj1 is obj2-1 OR obj2-2 OR obj3-1 OR obj3-2.
- If obj1 is obj2 OR obj3, AND obj2 is obj4, AND obj3 is obj5, then obj1 is obj4 OR obj5.
- If obj1 is obj2 AND obj3 is obj2, then obj1 AND obj3 are obj2.
• Hypotheses:
- If obj1 is obj2 OR obj3, AND obj3 is obj2, then obj3 is probably obj1.
- If obj1 is obj2 OR obj3, AND obj4 is obj5 AND obj2, then obj4 is probably obj1.
- If obj1 is obj2 OR obj3, AND obj2 is probably obj4, then obj1 is probably obj4.
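The transitivity rule above ("If obj1 is obj2 AND obj2 is obj3, then obj1 is obj3") can be sketched as a simple forward-chaining loop. This is an illustrative reconstruction under our own assumptions (facts represented as (subject, relation, object) triples), not the project's actual code:

```python
# Minimal forward chaining over "is" facts, illustrating the
# transitivity rule: if (a is b) and (b is c), conclude (a is c).
def derive(facts):
    """facts: set of (subject, "is", object) triples; returns the closure."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        new = set()
        for (a, r1, b) in known:
            for (b2, r2, c) in known:
                if r1 == r2 == "is" and b == b2 and (a, "is", c) not in known:
                    new.add((a, "is", c))
        if new:
            known |= new
            changed = True
    return known

facts = {("man", "is", "creature"), ("John", "is", "man")}
closure = derive(facts)
# the generated fact ("John", "is", "creature") is now in the closure
```

Iterating until no new fact appears is what lets chains longer than two facts resolve; hypotheses would be handled the same way, but stored separately until validated.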
3.1.2. CoreNLP

The construction of logic models from the input sentences needs a first level of understanding coming from a syntactic and grammatical analysis. For this we used a natural language processing (NLP) tool, CoreNLP [5] from Stanford University. CoreNLP fetches the lemmas of words: it identifies their basic form under plural, conjugation or grammatical declinations. It also extracts the overall structure of the sentence, analysing subject, verb, etc., and acts as a parser adapting to the turns of phrases. Figure 1 shows an example of the analysis of a sentence using CoreNLP: the edges represent the overall structure of the sentence, the blue tags represent the lemmas of the different words, and the red ones give the grammatical class of each word.

Figure 1. Example of CoreNLP analysis result

3.2. Recovering data (scraping [6] with XPath [7])

Web Scraping is a set of techniques to extract the contents of a source (a website or other). The goal is to transform the retrieved data for use:
• for rapid integration between applications (when no API is available);
• or to store this data in a database to be analysed later.
In this project we used this technique to store a maximum of French-language definitions, in order to integrate the intelligent aspect into our bot. For the bot to answer questions, the meaning of words and phrases is very important, so websites offering dictionaries of definitions and synonyms were chosen. To scrape definitions from the various dictionary sites we used XPath. This library allows Python to extract information (elements, attributes, comments, etc.) from a document through the formulation of expressions including:
• an axis (child or parent);
• a node (name or type);
• one or more predicates (optional).
These definitions and synonyms have served as a learning base to allow the bot to learn the meaning of words and phrases in different contexts.
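As an illustration of the XPath expressions described above, the following sketch extracts definition and synonym entries with the standard library's xml.etree.ElementTree, which supports a subset of XPath. The markup and element names here are invented for the example; a real dictionary site would need its own expressions (and a library like lxml for full XPath support):

```python
import xml.etree.ElementTree as ET

# Invented sample markup standing in for a scraped dictionary page.
page = """
<html>
  <body>
    <div class="entry" id="chat">
      <span class="definition">Petit mammifere domestique.</span>
      <span class="synonym">matou</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(page)
# XPath: any descendant <span> node whose class attribute (predicate)
# equals "definition" or "synonym".
definitions = [s.text for s in root.findall(".//span[@class='definition']")]
synonyms = [s.text for s in root.findall(".//span[@class='synonym']")]
```

The `.//` prefix is the descendant axis, `span` the node test, and `[@class='…']` the predicate, matching the three components listed above.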
3.3. Developing statistical and learning models

During this research project we implemented and tested several learning models for text data, such as TF-IDF and word embeddings. Applying these two models allowed us to get a clear idea of the use cases for each of them.

3.3.1. TF-IDF

The extraction of relevant information from textual sources is based on statistical models. These are used to detect rare words (therefore the most significant ones) and to eliminate less significant ones, such as stop-words, which do not depend on the context. The most commonly used technique for this is TF-IDF [8] (Term Frequency - Inverse Document Frequency), where the weight of a term grows with its frequency in a document and shrinks with the number of documents that contain it:

tf-idf(t, d) = f(t, d) × log(N / |{d ∈ D : t ∈ d}|)

with:
• f: the frequency of the term t in the document;
• D: the set of all documents;
• N: the total number of documents in the corpus (N = |D|);
• d: a document among all documents D;
• t: a term in a document.

3.3.2. Word embeddings

To build a learning model based on word embeddings, every sentence is converted into a vector of real values. We then apply a model based on a succession of several neural network layers to detect semantics, contexts and the relationships between them, and to classify new texts through unsupervised learning. To apply word embeddings on a textual corpus we used different Python libraries such as Gensim (word embeddings) and Scikit-learn. They offer all the necessary methods (Doc2Vec, Word2Vec, ...).

3.3.3. TF-IDF versus word embeddings

We assessed the reliability of the two learning models cited earlier. This evaluation was applied on a data set (20 Newsgroups) of 20,000 items in 20 categories (1,000 items per category). In this test, TF-IDF reaches a score of 58% while the word embeddings model gives a score of 98%. This result supported our choice of word embeddings as the learning model for our bot.
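The TF-IDF formula of section 3.3.1 can be computed directly; the following sketch implements it in pure Python on a toy corpus invented for the example (it is not the project's code, which relied on Scikit-learn):

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """Weight of `term` in `doc`: raw frequency times inverse document frequency."""
    f = Counter(doc)[term]                         # f: frequency of t in d
    n_containing = sum(term in d for d in corpus)  # |{d in D : t in d}|
    if f == 0 or n_containing == 0:
        return 0.0
    return f * math.log(len(corpus) / n_containing)

# Toy corpus: each document is a list of tokens.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
# "the" appears in every document, so its idf is log(3/3) = 0;
# "dog" appears in only one document, so it gets a higher weight.
```

A term present in every document gets weight 0 (log(N/N) = 0), which is how frequent stop-words are pushed aside without a hand-written stop-word list.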
4. INTELLIGENT BOT

To develop an intelligent (or semi-intelligent) bot that can analyse facts, learn and answer questions, we combined the different parts detailed in the previous section. Our bot thus combines the quality of text processing, classical logic and machine learning (semantics). It is mainly built from three components: logic, semantics and a training/learning base. The bot must be logical, intelligent and autonomous, and must render services. In our case, the services are answering questions and generating, from the given facts, different contexts, conclusions and assumptions to enrich its knowledge base. Aside from the natural language processing that must run automatically when new facts arrive, several challenges need to be resolved to reach reliable conclusions.

4.1. The logic

The bot must be logical in its calculations and answers, like a human; that is why a conventional logic model was developed. This model validates (or invalidates) the facts that are available, based on the rules of basic classical logic. It reasons about the facts and generates conclusions to be potentially added to the knowledge base of the bot. Before moving to the logic model, the facts, which are simply sentences, must first be processed and decomposed by a natural language processing tool. Regarding complexity and execution time, the logic model benefits from the simplicity of classical logic rules and is very fast, because the facts are relatively simple sentences composed of a single verb, subject, complement and conjunction.

4.2. The basic knowledge and autonomy of the bot

The knowledge base is made up of all the conclusions and facts that are valid. It is enriched as facts arrive and conclusions are generated. The bot is then autonomous and no longer depends on human intervention, as textual data sources feed it with new facts.
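The kind of rule-based reasoning described in section 4.1 can be illustrated by a minimal forward-chaining sketch. The triple representation and the rules below are hypothetical stand-ins: the actual model works on facts decomposed from French sentences by the NLP layer.

```python
# Hypothetical facts and implication rules, encoded as triples.
facts = {("socrates", "is_a", "man")}
rules = [
    # if X is_a man then X is_a person
    (("is_a", "man"), ("is_a", "person")),
    # if X is_a person then X is_a mortal
    (("is_a", "person"), ("is_a", "mortal")),
]

def forward_chain(facts, rules):
    """Apply the rules repeatedly until no new conclusion can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for (pred, obj), (new_pred, new_obj) in rules:
            for subj, p, o in list(derived):
                if (p, o) == (pred, obj):
                    conclusion = (subj, new_pred, new_obj)
                    if conclusion not in derived:
                        derived.add(conclusion)
                        changed = True
    return derived

kb = forward_chain(facts, rules)
print(("socrates", "is_a", "mortal") in kb)  # True
```

The derived conclusions play the role of the generated facts that enrich the knowledge base.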
Figure 2 summarizes the fact-validation process. When the bot is launched for the first time, it must already have a basic knowledge on which to train, so that it can detect and recognize contexts and thus be able to work on the facts that arrive. Since the bot must reason and learn like a human, the most logical starting point is a knowledge base built from dictionary definitions (for each word, several definitions depending on the context) and synonyms. These are collected by the scraping technique from several specialized websites. The initial knowledge base contains 38,565 definitions of 226,303 words.

Figure 2. New facts validation process

4.3. The intelligence

Now that we have a logic model and an initial knowledge base, the challenge is to find a technique that allows the bot to detect the context of newly arriving facts and to rank them; this is exactly where word embeddings come in. The recent Deep Learning idea is that the
approximate meaning of a word or sentence can be represented as a vector in a multidimensional space, where nearby vectors represent similar meanings. To do so, we used Gensim, a Python library designed to automatically extract semantic topics from documents as efficiently as possible. Gensim is designed to process raw, unstructured digital text. The algorithms in Gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation and random projections, describe the semantic structure of documents. This structure is extracted by examining the statistical patterns of co-occurrences of words in a corpus of training documents. These algorithms are unsupervised, which means that no human input is required: the only input is the corpus of text documents used to train the model. Gensim allows one to:
• collect and process semantically similar documents;
• analyse text documents for their semantic structure;
• build scalable statistical semantics.

4.3.1. Word embeddings [9, 10]

Word embeddings are one of the most exciting areas of research in the field of Deep Learning. A word embedding WE: words → ℝⁿ is a parameterized function that maps the words of a given language to high-dimensional vectors (typically of size 100, 200 or 500). Essentially, each word is represented by a numerical vector. For example, we might find (poire means pear in French):
• WE("pair") = (0.2, -0.4, 0.7, ...)
• WE("paire") = (0.2, -0.1, 0.7, ...)
• WE("pear") = (0.0, -0.3, 0.1, ...)
• WE("poire") = (0.1, -0.1, 0.2, ...)
The purpose and usefulness of word embeddings is to group the vectors of similar words in a vector space; mathematically, this allows similarities between different vectors to be detected. These are numerical representations describing the characteristics of the word, such as its context.
Word embeddings have several variations, including:
• Word2vec
• Doc2vec

4.3.1.1 Word2vec [11]

Word2vec is a two-layer neural network that processes text. Its input is a text corpus and its output is a set of vectors that encode semantic features of the words in the corpus. Word2vec is not a deep neural network in itself, but it is very useful because it turns text into a numerical form (vectors) that deep networks can understand. Figure 3 summarizes the process used in the word2vec algorithm. Words can be considered as discrete states; the algorithm then simply searches for the transition probabilities between these states, such as the probability that they occur
together. In this case we obtain close vectors for words that appear in similar contexts (the closer the cosine is to 1, the more similar the contexts of these words are).

Figure 3. Description of the steps of word2vec (Gensim [3])

In the Mikolov [12] introduction to learning word2vec, each word is mapped to a single vector, represented by a column in a matrix. The column is indexed by the position of the word in the vocabulary. The concatenation, or the sum, of the vectors is then used as a feature for predicting the next word in a sentence. Figure 4 gives an example of word2vec.

Figure 4. Word2vec example [12]

If we take into consideration enough data, usage and contexts, word2vec can make very accurate assumptions about the meaning of a word based on its past appearances. These assumptions can be used to establish associations between words in terms of vectors. For example:

W('woman') - W('man') ≈ W('queen') - W('king')
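This kind of vector arithmetic can be demonstrated with a few lines of pure Python. The 3-dimensional vectors below are invented for illustration only; real word2vec embeddings have hundreds of dimensions.

```python
import math

# Toy 3-dimensional embeddings, invented for illustration only.
WE = {
    "man":   [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "pear":  [0.0, 0.3, 0.1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy(a, b, c):
    """Return the word whose vector is closest to WE[b] - WE[a] + WE[c]."""
    target = [x - y + z for x, y, z in zip(WE[b], WE[a], WE[c])]
    candidates = {w: vec for w, vec in WE.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

# king - man + woman lands nearest to queen in this toy space:
print(analogy("man", "king", "woman"))  # queen
```

Real systems (Gensim's `most_similar`, for instance) perform exactly this nearest-neighbour search by cosine, only over a far larger vocabulary.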
Word2vec represents words by the other words it detects around them in the input corpus. Word2vec includes two methods:
1. Continuous Bag of Words model (CBOW):
• The context (surrounding words) is used to predict the target word.
2. Skip-gram with negative sampling, or skip-gram:
• A word is used to predict a target context (its surrounding words).
• This method can work well with a small amount of training data; it can also represent rare words or short phrases.

Figure 5. Architecture of the CBOW and Skip-gram methods. w(t) is the current word; w(t-1), w(t-2), ... are the words surrounding it.

4.3.1.2 Doc2vec [13]

Doc2vec (Paragraph2vec) adapts the word2vec algorithm (it is a generalization of word2vec) to the unsupervised learning of continuous representations for larger blocks of text, such as sentences, paragraphs or entire documents.
Figure 6. Description of doc2vec [3]

Doc2vec performs learning on a large set of documents and creates a vector-space model in which each document is represented by a vector composed from the vectors of its words. To obtain degrees of similarity, the method most_similar uses the cosine between vectors: the closer the cosine is to 1, the higher the similarity. Figure 6 illustrates the steps of the doc2vec algorithm. To apply Doc2vec, two methods can be used:
• the distributed memory model (DM)
• the distributed bag of words (DBOW)

a. Distributed memory model (DM)

It combines the paragraph vector with the vectors of the paragraph's words (word2vec) to predict the next word in a text. Using the distributed memory model (DM) comprises:
• randomly assigning a paragraph vector to each document;
• predicting the next word using the context words plus the paragraph vector;
• sliding the context window over the document while the paragraph vector stays fixed (hence distributed memory).
Figure 7 illustrates the operation of the distributed memory model.
Figure 7. DM model of doc2vec [12]

b. DBOW: Distributed Bag Of Words

This method (DBOW) ignores the context words at the input: a single paragraph vector is used to predict the words in a small window. This method requires less storage and is very similar to the skip-gram method of word2vec [12]. It is less efficient than DM; however, the combination of the two methods, DM + DBOW, is the best way to build a Doc2vec model. As shown in Figure 8, the DBOW method involves:
• using only paragraph vectors (no word2vec);
• taking a window of words in a paragraph and randomly predicting which words appear, using the paragraph vectors (ignoring word order).

Figure 8. DBOW model of doc2vec [12]

4.4. Tests and applications

As part of the project "PetiText", the doc2vec model is built from all the available dictionaries of definitions, where each definition is tagged with a word. We proceed to the construction of the model using the DBOW method. To assign a tag to a new definition, the model infers its vector and returns the tag of the definitions having the highest cosine.
4.4.1. Learning time and scores

With a machine having 16 CPUs and 128 GB of RAM, the learning time of the model on 38,565 definitions of 226,303 words varies proportionally to the model parameters, the size of the generated vectors and the number of learning iterations. Table 1 shows the results of the different tests we performed. We note that the learning time and the quality of learning are proportional to the number of iterations of the algorithm. We also find that the best results are obtained with 200 iterations and vectors of size 200. Figure 9 illustrates the evolution of the score according to the number of iterations.

Table 1. Overview of learning times and scores of different models.

Size of vectors   Iterations   Learning time   Score
100               50           8 min.          77%
100               100          16 min.         79%
100               200          32 min.         80%
200               50           9 min.          74%
200               100          17 min.         80%
200               200          32 min.         84%

Figure 9. Scores obtained according to the size of the vectors and the number of iterations

4.4.2. Example of assigning a tag to a new sentence

Consider the example of a definition of the word family: ['Generation', 'successive', 'down', 'ancestor', 'lined']. The sentence has been normalized: stop-words are removed, verbs are put in the infinitive and the words are lemmatized. The vector inferred by the model with vector size 200 corresponds to:
[-2.53752857e-01 -2.71043032e-02 4.33574356e-02 -9.83970612e-02 2.55723894e-01 -7.85913542e-02 ...]

Figure 9 shows the tags that the different models found by calculating the cosine between the inferred vector of the previous sentence and the vectors of all the definitions in the vector-space model.

Figure 9. Search results on the context of the preceding sentence regarding the word family

For this example, we notice that the models with vectors of size 200 and 100 or 200 iterations perform best, since they are the only ones that returned the tag "family", which corresponds to the definition of the input sentence.

Figure 10 is a screenshot of our first bot prototype. It is a French PetiText reasoner about three universes/contexts: family, abstract objects and biological organisms. We choose the family definition context, then we list the facts that we give to the bot:

Fact 1: a person is a man or a woman
Fact 2: a woman is female
…

In Figure 11, the bot starts reasoning on the facts in order and generating conclusions and hypotheses:

Generated fact 1: parents and kids are parts of the same family
Generated fact 2: a father is a male
…

The same process was tested in real-time interaction and it works; we can also add new facts at the bottom of the screen.
Figure 10. First and second screens of the bot prototype

Figure 11. Reasoning screen

5. CONCLUSIONS

We have presented here the different steps and tools combined in order to build a semi-intelligent bot based on machine learning technologies. With a solid foundation of learning, a logic model, a
word embeddings model with good scores (≈ 84%) and current and future improvements, we hope to combine the three parts into a fully functional bot. We used advanced and very recent tools and technologies that are widespread in Data Science: the Python programming language, Jupyter notebooks for a complete development environment, Gensim for word embeddings, and advanced natural language processing tools such as CLiPS and Stanford CoreNLP. During the project, we used in particular classification and clustering methods, Data Mining and Text Mining, Sentiment Analysis, and evaluations of classifier quality (Recall, Precision, F-Measure). We also explored current systems, languages and paradigms for Big Data and advanced Big Data analytics, which showed us the world of Big Data and many of its use cases and market opportunities. We tested some big data architectures, including Hadoop and Spark, with the Python, Java and Scala languages.

As the subject of the project is part of R&D activities, the ultimate goal was clear and understandable, but in practice, problems arose in understanding how to achieve the objectives. Indeed, understanding how to extract the value and meaning of a text is not easy. Then follows the difficulty of understanding how the methods work, which algorithms to use and when to use them. Development work and tests, reading papers and publications of other researchers, and studying the documentation were all done in order to understand the subject, to progress and to obtain good results (≈ 84%), which was not the case at the beginning (≈ 60%). After months of data processing (scraping, natural language processing, data cleaning, data standardization...) and the development of logic and learning models (word embeddings), the first satisfactory results and scores were obtained.
Now, we plan to make improvements and to use new techniques in the coming months. We will use LSTM (Long Short-Term Memory), an architecture of recurrent neural networks (RNN), which should further improve the quality of prediction and classification. We also plan to finalise the integration of the bot within a Parrot drone that we controlled by voice in a previous research project, in order to complete a global interactive real-time interface between humans and drones/robots [18].

ACKNOWLEDGEMENTS

We would like to thank everyone who helped us during the current year.

REFERENCES

[1] CLiPS, https://www.clips.uantwerpen.be/PAGES/PATTERN-FR
[2] NLTK, http://www.nltk.org/
[3] Gensim, https://radimrehurek.com/project/gensim/
[4] Scikit-learn, http://scikit-learn.org/stable/
[5] CoreNLP, https://stanfordnlp.github.io/CoreNLP/
[6] Web scraping, https://www.webharvy.com/articles/what-is-web-scraping.html
[7] XPath, https://www.w3.org/TR/1999/REC-xpath-19991116/
[8] TF-IDF, http://www.tfidf.com/
[9] Quoc V. Le & Tomas Mikolov (2014) Distributed Representations of Sentences and Documents. CoRR abs/1405.4053
[10] Word-Embeddings, http://sanjaymeena.io/tech/word-embeddings/
[11] Word2vec and Doc2vec, http://gensim.narkive.com/RavqZorK/gensim-4914-graphic-representations-of-word2vec-and-doc2vec
[12] Quoc V. Le & Tomas Mikolov (2014) Distributed Representations of Sentences and Documents. CoRR abs/1405.4053
[13] Representations of word2vec and doc2vec, http://gensim.narkive.com/RavqZorK/gensim-4914-graphic-representations-of-word2vec-and-doc2vec
[14] Patrick Hohenecker & Thomas Lukasiewicz (2017) "Deep Learning for Ontology Reasoning", CoRR abs/1705.10342
[15] Trapit Bansal, Arvind Neelakantan & Andrew McCallum (2017) "RelNet: End-to-end Modeling of Entities & Relations", University of Massachusetts Amherst, CoRR abs/1706.07179
[16] Thuy Vu & Douglas Stott Parker (2016) "K-Embeddings: Learning Conceptual Embeddings for Words using Context", HLT-NAACL, pp 1262-1267
[17] Palo IT, http://palo-it.com/
[18] Voice IT, https://github.com/Palo-IT/voice-IT

AUTHORS

Ali Rahmani, data engineer, Palo IT
Patrick Laffitte, PhD and data science expert, Palo IT
Raja Haddad, PhD and data scientist, Palo IT
Yassin Chabeb, PhD, R&D Consultant, Palo IT