Recovering Diacritics using Wikipedia and Google


Faculty of Computer Science

• Motivation
• The system
• Steps performed
• Results
• Conclusions

• Ro-Wikipedia was used in CLEF 2007
◦ 1.43 GB
◦ 121,832 files
Iftene, Trandabăţ, KEPT 2009

Step 1 - The initial text is split into sentences, and the sentences are further split into words
Step 2 - For every word without diacritics, we look up its possible forms in DBPF
◦ If the current word contains none of the letters “a, i, s, t”, we search for the word itself in DBFP or in Ro-Wikipedia
◦ If the current word contains one or more of the letters “a, i, s, t”, we search in DBFP or in Ro-Wikipedia using a pattern obtained from the initial word, in which every letter that can take a diacritic (a, i, s, t) is replaced by its possible forms (“a” is replaced by (ă|â|a), “i” by (î|i), “s” by (ş|s), “t” by (t|ţ))
◦ For example, for word = “fata” the pattern = “f(ă|â|a)(t|ţ)(ă|â|a)”
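
The pattern construction in Step 2 can be sketched in a few lines of Python. This is only an illustration: the names `ALTERNATIVES` and `build_pattern` are hypothetical, and the sketch assumes lowercase input, since the slides list only the lowercase letters.

```python
import re

# Letters that may carry a diacritic, with the alternations listed in Step 2.
ALTERNATIVES = {
    "a": "(ă|â|a)",
    "i": "(î|i)",
    "s": "(ş|s)",
    "t": "(t|ţ)",
}

def build_pattern(word):
    """Replace every a, i, s, t with the alternation of its possible forms."""
    # Any other character is escaped so that it only matches itself.
    return "".join(ALTERNATIVES.get(ch, re.escape(ch)) for ch in word)

print(build_pattern("fata"))  # f(ă|â|a)(t|ţ)(ă|â|a)
```

The resulting string is an ordinary regular expression, so `re.fullmatch(build_pattern("fata"), "faţă")` succeeds, as it does for “fată” and “fata”.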

Step 3 - We build a query to search for web pages that contain similar sentences (this step receives the sentences that contain words with multiple forms in DBFP)

Step 4 - We retrieve from the web the first 10 relevant pages returned by Google
Step 5 - From the downloaded sites we select only pages with text and ignore files with images, fonts, and configuration settings. In the selection process we identify the “correct” files with diacritics and concatenate them into one file
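
The slides do not say how a file is judged “correct”. As one possible reading, the sketch below keeps a page only if a minimum share of its letters are Romanian diacritics. The threshold `min_ratio`, the helper names, and the decision to accept both the cedilla and the comma-below variants of s and t are all assumptions made for illustration.

```python
# Cedilla forms (as used in the slides) plus the comma-below forms.
ROMANIAN_DIACRITICS = set("ăâîşţĂÂÎŞŢșțȘȚ")

def diacritic_ratio(text):
    """Share of alphabetic characters that are Romanian diacritics."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in ROMANIAN_DIACRITICS for ch in letters) / len(letters)

def build_corpus(pages, min_ratio=0.01):
    """Concatenate the pages that appear to be written with diacritics."""
    kept = [page for page in pages if diacritic_ratio(page) >= min_ratio]
    return "\n".join(kept)

pages = ["Scoala incepe sambata", "Şcoala începe sâmbătă"]
print(build_corpus(pages))  # only the second page is kept
```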

Step 6 - Using the file built at Step 5, we identify the most appropriate form for words with multiple forms. We build the same kind of patterns as at Step 2 b) ii. and identify, for every word, the possible forms and their relative positions in the concatenated file
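
A minimal sketch of the position lookup in Step 6, assuming that a “position” is the index of a word in the concatenated text (the slides do not define the unit). The function name `form_positions` and the sample text are hypothetical.

```python
import re

def form_positions(pattern, text):
    """Map each form matching the pattern to its word indices in the text."""
    regex = re.compile(pattern, re.IGNORECASE)
    positions = {}
    # \w+ splits the text into words; Unicode letters are included.
    for index, token in enumerate(re.findall(r"\w+", text)):
        if regex.fullmatch(token):
            positions.setdefault(token.lower(), []).append(index)
    return positions

text = "Fata vede o fată. O faţă veselă."
print(form_positions("f(ă|â|a)(t|ţ)(ă|â|a)", text))
# {'fata': [0], 'fată': [3], 'faţă': [5]}
```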

• Let the sentence S consist of the words $w_1, w_2, \ldots, w_n$
• We denote by $f_i$ the current form of word $w_i$ and by $p_{i1}, p_{i2}, \ldots, p_{it_i}$ the positions on its associated layer
• With this notation, a full path from the first layer (corresponding to the first word of the sentence) to the last layer (corresponding to the last word of the sentence) can be written as
$FP = (p_{1i_1}, p_{2i_2}, \ldots, p_{ni_n})$

• From now on, our goal is to find a full path of minimal length between the current layers
• For that we build

• An example is presented below for the sentence “Scoala incepe sambata”, which has two possible solutions:
• Şcoala începe sâmbătă. (School starts this Saturday).
• Şcoala începe sâmbăta. ((Usually) the school starts Saturday).
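
The search for a minimal-length full path can be illustrated on this example. The sketch below is not the authors' implementation: it assumes that the length of a path is the sum of the absolute differences between consecutive chosen positions, it tries every combination of forms, and the positions in `candidates` are made-up toy data.

```python
from itertools import product

def min_path_length(layers):
    """Smallest total gap of a full path taking one position per layer.

    Assumption: length = sum of |p[k+1] - p[k]| over consecutive layers.
    Returns None when some layer is empty (no full path exists).
    """
    if not layers or any(not layer for layer in layers):
        return None
    # best[p] = cheapest path that ends at position p of the current layer
    best = {p: 0 for p in layers[0]}
    for layer in layers[1:]:
        best = {p: min(cost + abs(p - q) for q, cost in best.items())
                for p in layer}
    return min(best.values())

def best_forms(candidates):
    """Choose one form per word so that the minimal full path is shortest.

    candidates: one dict per word, mapping each form to its positions.
    """
    best_choice, best_len = None, None
    for forms in product(*(c.keys() for c in candidates)):
        layers = [c[f] for c, f in zip(candidates, forms)]
        length = min_path_length(layers)
        if length is not None and (best_len is None or length < best_len):
            best_choice, best_len = forms, length
    return best_choice, best_len

# Hypothetical word positions in the concatenated file (toy data).
candidates = [
    {"Şcoala": [10, 57]},
    {"începe": [11, 40]},
    {"sâmbătă": [12], "sâmbăta": [90]},
]
print(best_forms(candidates))  # (('Şcoala', 'începe', 'sâmbătă'), 2)
```

With these toy positions the three words “Şcoala începe sâmbătă” occur next to each other (positions 10, 11, 12), so that combination gives the shortest path. Enumerating all combinations is exponential in the number of ambiguous words; it is used here only to keep the sketch short.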

• Step 7 - Context improvement:
◦ The backward rule
◦ The forward rule
◦ The maximization rule

• To evaluate the system's performance, we used a large file containing the Calimera Guidelines (14,148 sentences).

• The paper presents a method to restore diacritics using contexts found on the web
• The system's accuracy is similar to that of existing systems, but its main advantage comes from the fact that it uses resources and tools available for free.
• We also tested our algorithm on other languages, such as French and German, and the results are very promising

Adrian Iftene (1), Diana Trandabăţ (1, 2)
{adiftene, dtrandabat}@info.uaic.ro
(1) Faculty of Computer Science, “Al. I. Cuza” University of Iasi
(2) Romanian Academy, Iasi Branch
2 July, KEPT 2009, Cluj-Napoca


Editor's notes

For every word of the initial sentence we build layers holding its positions, as follows: at every moment, each form found in DBPF is placed on a different layer, and every layer holds the positions of the corresponding form.
For the initial sentence we consider an ordered set of layers, one associated with each of its words. A path between two layers is an ordered set of positions, one from every layer between the two. A full path from the first layer (corresponding to the first word of the sentence) to the last layer (corresponding to the last word of the sentence) has consecutive positions from every layer.
The backward rule searches the previously solved sentences to see which forms were already used for words with multiple forms. The forward rule puts the current sentence on hold until the following sentences are solved; the forms identified there are then used in the unclear cases. Another possible rule is the maximization rule: when we have a high level of confidence in the form identified for some words, we decide to use the same form for these words in other sentences from a specified “neighborhood”.
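
The backward and forward rules described above can be sketched with a small memory of earlier decisions. This is an illustration under assumptions, not the authors' code: the class `FormMemory`, the function `resolve`, and the choice of the most frequent earlier form are all hypothetical.

```python
from collections import Counter, defaultdict

class FormMemory:
    """Remembers which diacritic form was chosen for each plain word."""

    def __init__(self):
        self._counts = defaultdict(Counter)

    def record(self, plain, form):
        """Store one decision taken in an already solved sentence."""
        self._counts[plain][form] += 1

    def lookup(self, plain):
        """Most frequently chosen form so far, or None if never seen."""
        counts = self._counts.get(plain)
        if not counts:
            return None
        return counts.most_common(1)[0][0]

def resolve(ambiguous_words, memory):
    """Split words into those the memory settles and those still pending."""
    solved, pending = {}, []
    for word in ambiguous_words:
        form = memory.lookup(word)
        if form is None:
            pending.append(word)  # forward rule: wait for later sentences
        else:
            solved[word] = form   # backward rule: reuse an earlier decision
    return solved, pending

memory = FormMemory()
memory.record("sambata", "sâmbătă")
print(resolve(["sambata", "fata"], memory))
# ({'sambata': 'sâmbătă'}, ['fata'])
```

Words left in `pending` would be retried with `resolve` after later sentences have added their decisions to the memory, which mirrors the waiting step of the forward rule.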
