TURKISH LANGUAGE MODELING
Chaza Alkis, Abdurrahim Derric
Department of Computer Engineering
Yildiz Technical University, 34220 Istanbul, Türkiye
shaza.alqays@hotmail.com, abdelrahimdarrige@gmail.com
Abstract—Our project is about guessing the correct missing word in a given sentence. To find or guess the missing word, we use two main methods: one is statistical language modeling, while the other is neural language models. Statistical language modeling depends on the frequency of relations between words, and here we use a Markov chain. Neural language models use artificial neural networks and deep learning; here we use BERT, the state of the art in language modeling, provided by Google.
Keywords—Statistical Language Modelling, Neural Language Models, Markov Chain, Artificial Neural Networks, Deep Learning, BERT.
I. INTRODUCTION
Our project is a technique for guessing the appropriate word in a given sentence. To obtain good results, we studied several models and tested them on the Turkish language, including statistical language modeling and neural language models.
II. LANGUAGE MODELING
Language modeling is central to many important natural
language processing tasks.
III. STATISTICAL LANGUAGE MODELING
A statistical language model (SLM) is a probability distribution over sequences of words.
The language model learns the probability of word occurrences from examples of text. Simpler models may look at the context of a short sequence of words, while larger models may work at the level of sentences or paragraphs. Most commonly, language models operate at the word level.
A language model can be developed and used standalone, for example to generate new sequences of text that appear to come from the set of documents.
Language modeling is an essential problem for a wide range of natural language processing tasks. More practically, language models are used at the front or the back of a more sophisticated model for a task that requires understanding the language.
Developing better language models often results in models that perform better on the intended natural language processing task. This is the motivation for developing better and more accurate language models [1].
IV. NEURAL LANGUAGE MODELS
Recently, the use of neural networks in the development of language models has become so popular that it may now be the preferred approach.
The use of neural networks in language modeling is often
called Neural Language Modeling, or NLM for short.
Neural network approaches achieve better results than classical methods both as standalone language models and when incorporated into larger models for challenging tasks such as speech recognition and machine translation.
The main reason behind the improvements in performance
may be the ability of the method to generalize.
Specifically, a word embedding is adopted that uses a real-valued vector to represent each word in a projected vector space. This representation of words, learned from their usage, allows words with a similar meaning to have a similar representation.
This generalization is not easily achieved with the symbolic representations used in classical statistical language models.
Furthermore, the distributed representation approach allows the embedding representation to scale better with vocabulary size. Classical methods that use one discrete representation per word struggle with the curse of dimensionality: larger and larger vocabularies lead to longer and sparser representations.
The neural network approach to language modeling can be
described using the three following model properties:
• Associate each word in the vocabulary with a
distributed word feature vector.
• Express the joint probability function of word
sequences in terms of the feature vectors of these
words in the sequence.
• Learn simultaneously the word feature vector and
the parameters of the probability function.
This represents a relatively simple model where both
representation and probability model are learned together
directly from raw text data.
Recently, neural network based approaches have begun to consistently outperform classical statistical approaches.
V. MODELS STUDY
A. Markov chain
A Markov chain is a stochastic model describing a
sequence of possible events in which the probability of each
event depends only on the state attained in the previous
event.
More formally, a discrete-time Markov chain is a sequence of random variables X1, X2, X3, ... that satisfies the Markov property: the probability of moving from the current state to the next state depends only on the current state.
In terms of probability distributions, given that the system is in some state at time n, the conditional distribution of the state at the next instant, n + 1, is conditionally independent of the states of the system at times 1, 2, ..., n - 1.
This can be written as follows:
Pr(Xn+1 = x | X1 = x1, X2 = x2, ..., Xn = xn) = Pr(Xn+1 = x | Xn = xn)
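In our word-guessing setting, the states are words. A minimal sketch of a first-order, word-level Markov chain estimated from bigram counts follows; the tiny Turkish corpus and the function names are our own illustration, not the project code.

```python
from collections import Counter, defaultdict

# Toy corpus; in the project a large Turkish corpus would be used instead.
corpus = "bugün hava çok güzel . bugün hava çok soğuk .".split()

# Count bigrams: transitions from the current word to the next word.
transitions = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    transitions[current_word][next_word] += 1

def next_word_distribution(current_word):
    """Pr(X_{n+1} = x | X_n = current_word): depends only on the current state."""
    counts = transitions[current_word]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_distribution("çok"))  # {'güzel': 0.5, 'soğuk': 0.5}
```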
1) Markov chain graph representation: Markov chains are often represented using directed graphs. The nodes in the graph represent the possible states of the random variable, while the edges represent the probability that the system moves from one state to another at the next time step.
For example, in weather forecasting there are three possible states for the random variable Weather = {Sunny, Rainy, Snowy}, and a possible Markov chain can be represented as shown in Figure 1.
Figure 1 Markov chain graph representation
One of the main points to understand about Markov chains is that they model the outcomes of a sequence of random variables over time. The nodes in the graph represent the different weather conditions, and the edges between them show the probability that the next random variable takes each possible state, given the state of the current random variable. Self-loops show the probability that the model remains in its current state.
In the Markov chain above, the observed state of the current random variable is Sunny. The probability that the random variable takes the state Sunny at the next instant is 0.8. It may also take the state Rainy with a probability of 0.19, or Snowy with a probability of 0.01.
2)Parameterization of Markov chains: Another way to
represent state transitions is to use a transition matrix.
The transition matrix, as the name implies, uses a tabular
representation of the transition probabilities.
The following table shows the transition matrix for the
Markov chain shown in Figure 1. The probability values
represent the probability of the system going from the state
in the row to the states mentioned in the columns, see Table
1.
Table 1 Transition matrix
state    sunny   rainy   snowy
sunny    0.8     0.19    0.01
rainy    0.2     0.7     0.1
snowy    0.1     0.2     0.7
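The same transition matrix can be held in a small lookup structure; the sketch below is our own illustration of how a transition probability would be read off the table.

```python
# Transition matrix from Table 1: rows are the current state, columns the next state.
transition_matrix = {
    "sunny": {"sunny": 0.8, "rainy": 0.19, "snowy": 0.01},
    "rainy": {"sunny": 0.2, "rainy": 0.7, "snowy": 0.1},
    "snowy": {"sunny": 0.1, "rainy": 0.2, "snowy": 0.7},
}

def transition_probability(current_state, next_state):
    """Probability of moving from current_state to next_state in one step."""
    return transition_matrix[current_state][next_state]

print(transition_probability("sunny", "rainy"))  # 0.19
```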
B. BERT
Bidirectional Encoder Representations from Transformers
(BERT) is a technique for NLP (Natural Language
Processing) pre-training developed by Google.
Modern deep learning based NLP models see benefits from much larger amounts of data, improving when trained on millions, or billions, of annotated training examples. To help close this gap in data, researchers have developed a variety of techniques, such as BERT, for training general-purpose language models using the enormous amount of unannotated text on the web (known as pre-training).
1) Why BERT is different: BERT is the first unsupervised, deeply bidirectional language representation, pre-trained using only a plain text corpus.
For example, in the sentence "I accessed the bank account", a unidirectional contextual model would represent "bank" based on "I accessed the" but not "account". BERT, however, represents "bank" using both its previous and next context - "I accessed the ... account" - starting from the very bottom of the deep neural network, making it deeply bidirectional [2].
2) Masked language modeling: BERT has been pre-trained on masked language modeling and next sentence prediction (next sentence prediction will be explained in the next section).
Standard language modeling is the task of predicting the next word given a sequence of words. In masked language modeling, instead of predicting every next token, a percentage of the input tokens is masked at random and only those masked tokens are predicted.
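This masked-token objective is exactly what makes BERT suitable for our missing-word task: the missing position is replaced with [MASK] and the model ranks candidate words for it. As a minimal sketch, a pre-trained Turkish BERT could be queried through the Hugging Face transformers fill-mask pipeline; the checkpoint name below is a publicly available Turkish model used for illustration, not the model trained in this work.

```python
from transformers import pipeline

# Assumed public checkpoint for illustration; our own model is trained separately on Turkish text.
fill_mask = pipeline("fill-mask", model="dbmdz/bert-base-turkish-cased")

# [MASK] marks the missing word to be guessed.
for prediction in fill_mask("Bugün hava çok [MASK]."):
    print(prediction["token_str"], prediction["score"])
```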
The masked words are not always replaced with the mask token [MASK], because then the masked tokens would never be seen before fine-tuning. Therefore, as sketched below:
• 15% of the tokens are chosen at random;
• 80% of the time the chosen tokens are actually replaced with the token [MASK];
• 10% of the time they are replaced with a random token;
• 10% of the time they are left unchanged.
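A minimal sketch of this masking rule (our own paraphrase of the procedure above, not the original training code):

```python
import random

def mask_tokens(tokens, vocabulary, mask_prob=0.15):
    """BERT-style masking: 15% of tokens chosen; of those, 80% -> [MASK], 10% -> random token, 10% unchanged."""
    masked = list(tokens)
    targets = {}  # position -> original token that the model must predict
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = token
            roll = random.random()
            if roll < 0.8:
                masked[i] = "[MASK]"
            elif roll < 0.9:
                masked[i] = random.choice(vocabulary)
            # else: keep the original token unchanged
    return masked, targets

tokens = "the man went to the store".split()
print(mask_tokens(tokens, vocabulary=tokens))
```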
3) Next sentence prediction: In next sentence prediction, the model receives a pair of sentences and learns to predict whether the second sentence actually follows the first one in the original text, for example:
Input = [CLS] the man went to [MASK] store [SEP]
he bought a gallon [MASK] milk [SEP]
Label = IsNext
Input = [CLS] the man [MASK] to the store [SEP]
penguin [MASK] are flight ##less birds [SEP]
Label = NotNext
Training data for this task can easily be generated from any monolingual corpus. It is useful because many downstream tasks, such as question answering and natural language inference, require understanding the relationship between two sentences.
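A hedged sketch of how such sentence pairs could be generated from a list of consecutive sentences (our own illustration, not the original preprocessing code):

```python
import random

def make_nsp_pairs(sentences):
    """Build (sentence_a, sentence_b, label) pairs: about half IsNext, half NotNext."""
    pairs = []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            # A real implementation would avoid accidentally picking the true next sentence.
            random_sentence = random.choice(sentences)
            pairs.append((sentences[i], random_sentence, "NotNext"))
    return pairs

sentences = [
    "the man went to the store",
    "he bought a gallon of milk",
    "penguins are flightless birds",
]
print(make_nsp_pairs(sentences))
```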
4) Input text representation before feeding to BERT: The input representation used by BERT can encode a single text sentence as well as a pair of sentences (for example, [Question, Answer]) in a single sequence of tokens.
• The first token of every input sequence is the
special classification token – [CLS]. This token is
used in classification tasks as an aggregate of the
entire sequence representation. It is ignored in
non-classification tasks.
• For single text sentence tasks, this [CLS] token is
followed by the WordPiece tokens and the separator
token – [SEP],
[CLS] my cat is very good [SEP]
• For sentence pair tasks, the WordPiece tokens of the
two sentences are separated by another [SEP] token.
This input sequence also ends with the [SEP] token,
[CLS] my cat is cute [SEP] he likes play ##ing [SEP]
• A segment embedding indicating sentence A or sentence B is added to each token. Segment embeddings are similar to token embeddings, but with a vocabulary of size 2.
• A positional embedding is also added to each token
to indicate its position in the sequence.
BERT uses WordPiece tokenization. The vocabulary is initialized with all the individual characters of the language, and then the most frequent/likely combinations of symbols are iteratively added to the vocabulary.
Any word that does not occur in the vocabulary is broken down into sub-words greedily. For example, if play, ##ing, and ##ed are present in the vocabulary but playing and played are OOV words, then they will be broken down into play + ##ing and play + ##ed respectively (## is used to mark sub-words).
The maximum sequence length of the input is 512 tokens [3].
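This input format can be reproduced with the Hugging Face BertTokenizer; a brief sketch follows, using the public English checkpoint only to illustrate the format (the project itself works with a Turkish vocabulary).

```python
from transformers import BertTokenizer

# Assumed public checkpoint, used here only to illustrate the input format.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Sentence pair input: [CLS] sentence A [SEP] sentence B [SEP]
encoding = tokenizer("my cat is cute", "he likes playing")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(encoding["token_type_ids"])  # segment ids: 0 for sentence A tokens, 1 for sentence B tokens

# Out-of-vocabulary words are broken down greedily into sub-words prefixed with ##
print(tokenizer.tokenize("unaffable"))
```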
VI. RESULTS ANALYSIS
A. Markov chain model dataset size effect comparison
Here we compare the effect of the dataset size. We notice that using a larger dataset gives a slight improvement. This is an expected result, because some words may not be found in a small dataset, and the larger the dataset, the more words we find. The best results are obtained with order 1, see Figure 2.
Figure 2 20K - 40K - 100K datasets comparison
B. Smoothing algorithms comparison
Smoothing here means backing off: the result is searched for at the third order, then at the second order, and finally at the first order. We noticed that this has a good effect on the results, see Figure 3.
Figure 3 Smoothed - Unsmoothed algorithms comparison
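A hedged sketch of this backoff idea; the count tables below are toy assumptions, not the project's actual data structures.

```python
# Toy count tables; in the project these would be estimated from the Turkish corpus.
trigram_counts = {("hava", "çok"): {"güzel": 3, "soğuk": 1}}
bigram_counts = {"çok": {"güzel": 4, "iyi": 2}}
unigram_counts = {"bir": 10, "ve": 8}

def backoff_predict(context):
    """Guess the next word: try order 3 first, back off to order 2, then order 1."""
    if len(context) >= 2 and tuple(context[-2:]) in trigram_counts:
        candidates = trigram_counts[tuple(context[-2:])]
    elif context and context[-1] in bigram_counts:
        candidates = bigram_counts[context[-1]]
    else:
        candidates = unigram_counts
    # Return the most frequent continuation found at the highest available order.
    return max(candidates, key=candidates.get)

print(backoff_predict(["hava", "çok"]))   # 'güzel' (found at order 3)
print(backoff_predict(["çok"]))           # 'güzel' (backed off to order 2)
print(backoff_predict(["bilinmeyen"]))    # 'bir'   (backed off to order 1)
```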
C. BERT model results comparison
Here we compare the BERT results on the 20K, 40K, and 100K datasets; we notice that the largest dataset has the strongest effect, see Figure 4.
Figure 4 BERT results comparison
D. BERT vs Google Multilingual
Here we compare our BERT model with Google's multilingual model. We notice that the multilingual model gives much lower results than our BERT model, and it is unsuccessful in finding the missing word because it covers more than 100 languages and cannot focus on one language. Our BERT model, in contrast, is trained on the Turkish language alone, so its ability to link Turkish words and the meanings between sentences is stronger, and this is the reason for the big difference in results, see Figure 5.
Figure 5 BERT vs Multilingual
E. Comparison of statistical language modeling and neural
language model
Here we examine the effect of the training dataset size on each model, and the accuracy of each, by comparing top-1 and top-5 results.
Comparing Markov and BERT in Figure 6, we see that BERT gives better results than the Markov chain as the dataset size grows; in this figure we use three datasets and observe their effect.
Figure 6 BERT vs Markov Chain
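Top-1 and top-5 accuracy here mean the fraction of test sentences whose missing word is the model's first candidate, or appears among its first five candidates. A small sketch of that evaluation follows; the predictor and example data are hypothetical.

```python
def topk_accuracy(examples, predict_topk, k=5):
    """examples: (sentence_with_mask, true_word) pairs; predict_topk returns a ranked candidate list."""
    top1_hits = topk_hits = 0
    for sentence, true_word in examples:
        candidates = predict_topk(sentence)[:k]
        top1_hits += int(candidates[:1] == [true_word])
        topk_hits += int(true_word in candidates)
    n = len(examples)
    return top1_hits / n, topk_hits / n

# Hypothetical predictor that always returns the same ranked candidate list.
def dummy_predict(sentence):
    return ["güzel", "soğuk", "iyi", "kötü", "sıcak"]

examples = [
    ("bugün hava çok [MASK]", "güzel"),
    ("bugün hava çok [MASK]", "soğuk"),
]
print(topk_accuracy(examples, dummy_predict))  # (0.5, 1.0)
```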
VII. CONCLUSION
From our study and previous studies, we notice that statistical language modeling, although it is considered an old technique compared to BERT's deep learning model, still gives good results.
We notice that BERT, although it is a deep learning model, did not succeed much, because the language contains hundreds of thousands of words; these words may be nouns or verbs with different forms, and they can appear in different positions in the sentence, which gives us millions of possibilities.
So, from statistical language modeling to neural language models, the models guess correctly in approximately 30 to 40 percent of cases.
Based on the graphs that we extracted from our study, we see that the size of the dataset greatly affects the probability of a correct guess, so in the future a larger dataset and new techniques that improve the computer's understanding of the language could be used. In return, however, increasing the size of the dataset leads to an increase in computation; for example, when the size of the dataset was 100K, the running time was approximately 56 hours. Assuming a dataset of a million examples, the operation would be expected to stretch to months on current processors.
REFERENCES
[1] J. Brownlee. (2017) Gentle introduction to statistical language modeling and neural language models. [Online]. Available: https://machinelearningmastery.com/statistical-language-modeling-and-neural-language-models/
[2] J. Devlin and M.-W. Chang. (2018) Open sourcing BERT: State-of-the-art pre-training for natural language processing. [Online]. Available: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
[3] Y. Seth. (2019) BERT explained. [Online]. Available: https://yashuseth.blog/2019/06/12/bert-explained-faqs-understand-bert-working/