1. Adrian Iftene¹, Diana Trandabăţ¹,²
{adiftene, dtrandabat}@info.uaic.ro
¹ Faculty of Computer Science, “Al. I. Cuza” University of Iasi
² Romanian Academy, Iasi Branch
2 July, KEPT 2009, Cluj-Napoca
5. Step 1 - The initial text is split into sentences, and the sentences are then split into words
Step 2 - For every word without diacritics, we search DBPF for the corresponding possible values
◦ If the current word contains none of the letters “a, i, s, t”, we search for the word itself in DBFP or in Ro-Wikipedia
◦ If the current word contains one or more of the letters “a, i, s, t”, we search in DBFP or in Ro-Wikipedia using a pattern obtained from the initial word, in which every letter that may carry a diacritic (a, i, s, t) is replaced with its possible values (“a” is replaced by (ă|â|a), “i” by (î|i), “s” by (ş|s), “t” by (t|ţ))
◦ For example, for word = “fata” the pattern is “f(ă|â|a)(t|ţ)(ă|â|a)”
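The pattern construction in Step 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_pattern` and `matching_forms` are hypothetical, the lexicon is a plain in-memory list standing in for DBFP or Ro-Wikipedia, and only lowercase input is handled.

```python
import re

# Substitution table taken from the slide: each base letter maps to
# the alternation of forms it may stand for.
ALTERNATIVES = {
    "a": "(ă|â|a)",
    "i": "(î|i)",
    "s": "(ş|s)",
    "t": "(t|ţ)",
}


def build_pattern(word):
    """Replace every a, i, s, t with its alternation; escape other characters."""
    return "".join(ALTERNATIVES.get(ch, re.escape(ch)) for ch in word)


def matching_forms(word, lexicon):
    """Return the lexicon entries that match the pattern of `word` in full.

    `lexicon` is a hypothetical stand-in for the real lookup resource.
    """
    regex = re.compile(build_pattern(word))
    return [entry for entry in lexicon if regex.fullmatch(entry)]


print(build_pattern("fata"))
print(matching_forms("fata", ["fată", "faţă", "fata", "fete"]))
```

A word with none of the four letters yields a pattern equal to the word itself, so the same lookup covers both bullets of Step 2.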
Iftene, Trandabăţ, KEPT 2009
6. Step 3 - We build a query to search for web pages that contain similar sentences (this step receives the sentences that contain words with multiple forms in DBFP)
7. Step 4 - We retrieve from the web the first 10 relevant pages returned by Google
Step 5 - From the downloaded sites we select only pages with text and ignore image, font, and configuration files. During selection we identify the “correct” files with diacritics and concatenate them into one file
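One possible reading of the selection in Step 5 is sketched below. The slides do not say how “correct” files are recognised, so the two criteria used here (a text-like file extension and the presence of at least one Romanian diacritic) are assumptions, as are the names `TEXT_EXTENSIONS`, `has_diacritics`, and `build_corpus`.

```python
from pathlib import PurePosixPath

# Assumption: these extensions mark pages with text; everything else
# (images, fonts, configuration files) is ignored.
TEXT_EXTENSIONS = {".html", ".htm", ".txt"}
DIACRITICS = set("ăâîşţĂÂÎŞŢ")


def has_diacritics(text):
    """True if the text contains at least one Romanian diacritic."""
    return any(ch in DIACRITICS for ch in text)


def build_corpus(files):
    """Concatenate the contents of the files judged usable.

    `files` maps a file name to its already decoded text content.
    """
    kept = [
        content
        for name, content in files.items()
        if PurePosixPath(name).suffix.lower() in TEXT_EXTENSIONS
        and has_diacritics(content)
    ]
    return "\n".join(kept)


downloaded = {
    "a.html": "Şcoala începe",
    "b.css": "font: ă",
    "c.txt": "Scoala incepe",
}
print(build_corpus(downloaded))
```

Here `b.css` is dropped because of its extension and `c.txt` because it contains no diacritics, so only the first file reaches the concatenated corpus.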
8. Step 6 - Using the file built at Step 5, we identify the most appropriate form for words with multiple forms. We build the same kind of patterns as at Step 2 b) ii. and identify, for every word, its possible forms and their relative positions in the concatenated file
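A minimal sketch of the lookup in Step 6 follows. It assumes that a “position” is the index of a token in the concatenated file after lowercasing and splitting on non-word characters; the slides do not define it more precisely. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Same substitution table as in Step 2.
ALTERNATIVES = {
    "a": "(ă|â|a)",
    "i": "(î|i)",
    "s": "(ş|s)",
    "t": "(t|ţ)",
}


def build_pattern(word):
    """Replace every a, i, s, t with its alternation; escape other characters."""
    return "".join(ALTERNATIVES.get(ch, re.escape(ch)) for ch in word)


def form_positions(word, corpus):
    """Map each form of `word` found in `corpus` to its token positions."""
    tokens = re.findall(r"\w+", corpus.lower())
    regex = re.compile(build_pattern(word.lower()))
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        if regex.fullmatch(token):
            positions[token].append(index)
    return dict(positions)


corpus = "Şcoala începe sâmbătă. Sâmbăta e liber."
print(form_positions("sambata", corpus))
```

Each key of the returned dictionary corresponds to one candidate form, and its list of positions is what a layer for that form would hold.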
9. Let the sentence S consist of the words $w_1, w_2, \ldots, w_n$
We denote by $f_i$ the current form of word $w_i$ and by $p_{i1}, p_{i2}, \ldots, p_{it_i}$ the positions on its associated layer
With these notations, a full path from the first layer (corresponding to the first word of the sentence) to the last layer (corresponding to the last word of the sentence) can be written as
$FP = (p_{1i_1}, p_{2i_2}, \ldots, p_{ni_n})$
10. From now on, our goal is to find a full path across the current layers with minimal length
For that we build
11. An example is presented below for the sentence “Scoala incepe sambata”, which has two possible solutions:
Şcoala începe sâmbătă. (School starts this Saturday.)
Şcoala începe sâmbăta. (School (usually) starts on Saturdays.)
12. Step 7 - Context improvement:
◦ The backward rule
◦ The forward rule
◦ The maximization rule
13. To evaluate the system's performance, we used a large file containing the Calimera Guidelines (14,148 sentences).
14. The paper presents a method to restore diacritics using contexts found on the web
The system's accuracy is similar to that of existing systems, but its main advantage comes from the fact that it uses resources and tools available for free.
We also tested our algorithm on other languages, such as French and German, and the results are very promising
Editor's notes
For every word of the initial sentence we build layers of positions in the following manner: each form found in DBPF is placed on a different layer, and on every layer we place the positions of the corresponding form.
For the initial sentence we consider an ordered set of layers associated with each of its words. A path between two layers is an ordered set of positions, one from every layer between the two considered layers. A full path from the first layer (corresponding to the first word of the sentence) to the last layer (corresponding to the last word of the sentence) has consecutive positions from every layer.
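The search for a minimal-length full path can be illustrated with a small dynamic-programming sketch. The notes do not define “length”, so the cost used here is an assumption: the step from position p on one layer to position q on the next costs |q - p - 1|, which is zero exactly when the two positions are consecutive. The function name `shortest_full_path` and the input layout are hypothetical.

```python
def shortest_full_path(layers):
    """Pick one position per word so that the summed gap cost is minimal.

    `layers[k]` maps each candidate form of word k to its positions in
    the concatenated file. Returns (total_cost, [chosen form per word]).
    Assumed cost of a step from p to q: abs(q - p - 1).
    """
    if not layers or any(not any(ps for ps in layer.values()) for layer in layers):
        raise ValueError("every word needs at least one candidate position")

    # best maps (form, position) on the current layer to
    # (cost of the cheapest path ending there, forms chosen so far).
    best = {
        (form, pos): (0, [form])
        for form, positions in layers[0].items()
        for pos in positions
    }
    for layer in layers[1:]:
        new_best = {}
        for form, positions in layer.items():
            for pos in positions:
                cost, path = min(
                    (
                        (prev_cost + abs(pos - prev_pos - 1), prev_path)
                        for (_, prev_pos), (prev_cost, prev_path) in best.items()
                    ),
                    key=lambda item: item[0],
                )
                new_best[(form, pos)] = (cost, path + [form])
        best = new_best
    return min(best.values(), key=lambda item: item[0])


layers = [
    {"şcoala": [0, 10]},
    {"începe": [1, 11]},
    {"sâmbătă": [2], "sâmbăta": [30]},
]
print(shortest_full_path(layers))
```

In this toy input the positions 0, 1, 2 are consecutive, so the path through “sâmbătă” has cost zero and is preferred over the one through “sâmbăta” at position 30.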
The backward rule searches previously solved sentences to see which forms were already used for words with multiple forms.
The forward rule puts the sentence on hold until the following sentences are solved. We then use the forms identified there in unclear situations.
Another rule is the maximization rule. It can be used when we have a high level of confidence in the form identified for some words, and we decide to use the same form of these words in other sentences from a specified “neighborhood”.
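As an illustration of the backward rule only, the sketch below keeps a history of the forms chosen in already solved sentences and reuses the most frequent one for a later ambiguous word. Choosing by frequency is an assumption, since the notes do not say how competing earlier forms are ranked; the class `BackwardRule` and its methods are hypothetical names.

```python
from collections import Counter, defaultdict

# Maps each diacritic letter back to its base letter.
STRIP = str.maketrans("ăâîşţ", "aaist")


class BackwardRule:
    """Remember forms used in solved sentences and reuse them later."""

    def __init__(self):
        # base word (no diacritics) -> counts of the forms chosen so far
        self.history = defaultdict(Counter)

    def record(self, form):
        """Register a form chosen in an already solved sentence."""
        self.history[form.translate(STRIP)][form] += 1

    def resolve(self, word):
        """Return the most frequently used form for `word`, or None."""
        seen = self.history.get(word)
        if not seen:
            return None
        return seen.most_common(1)[0][0]


rule = BackwardRule()
for form in ["sâmbătă", "sâmbătă", "sâmbăta"]:
    rule.record(form)
print(rule.resolve("sambata"))
```

A forward rule could reuse the same history after the following sentences are solved, and a maximization rule could restrict `record` to forms chosen with high confidence.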