A Focused Crawler for Romanian Words Discovery
Authors: Ionuț-Gabriel Radu, Traian Rebedea (traian.rebedea@cs.pub.ro)
University Politehnica of Bucharest
Overview
• Introduction
• Objective
• RWScraper
• Related Work
• RWScraper: Implementation
• Results
• Conclusions
19.08.15 RoEduNet Conference 2014 – Chișinău, R. Moldova 2
Introduction
• All natural languages are subject to change
over time
• As the Web becomes more prevalent, it also
constitutes a major source for identifying
language evolution
• Due to large amounts of Romanian web
content, the rate of change has increased
significantly
Objective
• To provide a mechanism to identify new
words (e.g. neologisms) that have entered the
Romanian language
• Develop a specialized (focused) web crawler
for analyzing Romanian web pages and
identifying new words
Focused Web Crawling
• Crawling the web with a specific purpose:
– “Focus” the spiders to specific content (e.g.
people search, scientific publications, products,
etc.)
– Ignore other web pages
and domains
Solution: RWScraper
• RWScraper (Romanian Word Scraper) is able
to solve the following problems:
– Identify Romanian texts;
– Distinguish between proper names and common
nouns;
– Create a database with new words along with
context information and metadata. In order to
identify new
– Discover the most frequent spelling errors in
Romanian online texts.
RWScraper – Text Processing
• Each word discovered in a Romanian text is looked up in
the database provided by www.dexonline.ro, which
contains definitions from several Romanian
dictionaries (DEX, DOOM, etc.)
• Text Processing Pipeline
– Text Normalization
– Language Validation
– Sentence Segmentation
– Sentence-Level Language
Identification
– Word Tokenization
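The pipeline stages above can be sketched in a few lines of Python. This is only an illustrative sketch, not RWScraper's actual code: the function names are hypothetical, the mapping of cedilla letters (ş, ţ) to the comma-below forms (ș, ț) is an assumed normalization step, and the two language-related stages are left out here.

```python
import re
import unicodedata

# Assumption: legacy cedilla letters (ş ţ Ş Ţ) are mapped to the
# comma-below letters (ș ț Ș Ț) during normalization.
CEDILLA_TO_COMMA = str.maketrans("\u015f\u0163\u015e\u0162",
                                 "\u0219\u021b\u0218\u021a")


def normalize(text):
    """Text normalization: NFC form, unified diacritics, single spaces."""
    text = unicodedata.normalize("NFC", text).translate(CEDILLA_TO_COMMA)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize(sentence):
    """Word tokenization: runs of letters, keeping hyphenated forms whole."""
    return re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", sentence)


if __name__ == "__main__":
    raw = "Aceasta  este o \u0163ar\u0103.\nMergem \u00eentr-o zi!"
    for sentence in split_sentences(normalize(raw)):
        print(tokenize(sentence))
```

The regular-expression splitter will break on abbreviations such as "etc."; a production pipeline would need a more careful segmenter.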
Related Work:
Neologisms Identification
• A study for Japanese:
– Scanning existing Japanese corpora for possible "new" words,
typically by processing the texts through segmentation software
and dealing with the "out-of-lexicon" problem
– Simulating the Japanese morphological processes to create
possible new words and then testing for their presence in large
corpora
• Identification of lexical discriminants (e.g. termed, called,
known as) and punctuation discriminants (e.g. single and
double quotes) for introducing new words
– This method is able to identify a significantly smaller number of
potential new words due to the limited number of lexical
discriminant patterns.
• Using data about the frequency of word usage over time
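The discriminant-based method can be illustrated with two regular expressions, one for the lexical cues listed above (termed, called, known as) and one for quoted single words. This is a rough sketch under assumptions: the pattern details and the function name candidate_words are hypothetical, and the approach will produce false positives (for example "called the").

```python
import re

# Lexical discriminants: a cue phrase followed by the candidate word.
LEXICAL = re.compile(
    r"\b(?:termed|called|known as)\s+[\"']?([^\W\d_]+)", re.IGNORECASE)

# Punctuation discriminants: a single word wrapped in single or double
# quotes; the lookarounds avoid matching apostrophes inside words.
QUOTED = re.compile(r"(?<!\w)[\"']([^\W\d_]+)[\"'](?!\w)")


def candidate_words(text):
    """Return the lowercased candidate new words found in the text."""
    found = [m.group(1) for m in LEXICAL.finditer(text)]
    found += [m.group(1) for m in QUOTED.finditer(text)]
    return sorted({word.lower() for word in found})


if __name__ == "__main__":
    sample = ('The habit, known as doomscrolling, is common. '
              'Some label it "infodoom".')
    print(candidate_words(sample))
```

Because only a handful of patterns are checked, the recall is low, which matches the limitation noted above.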
Related Work:
Language Identification
• Common Words Methods
– Store and use a list with the most frequent words for each language
• Unique Letter Combinations
– Database with the most frequent sequences of letters in a language,
not necessarily valid words
– The main disadvantage: the poor performance on short texts
– The main advantage: it does not require word tokenization
• Language Identification Using N-Grams
– Every language has several specific, frequently used character n-grams
– For a particular language L, the ordered n-gram dictionary is called the
n-gram language profile
– For a new text, we compute the distance to all computed language
profiles
• Markov Models for Language Identification
– A word can be represented as a Markov chain in which letters are the
states
– Compute a Markov model for each language
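The n-gram profile approach can be sketched as follows. The slides do not name the distance measure, so this sketch assumes the rank-based "out-of-place" distance commonly used with ranked n-gram profiles; the function names, the profile size (top_k) and the toy training sentences are all hypothetical.

```python
from collections import Counter


def ngram_profile(text, n_values=(2, 3), top_k=300):
    """Map the top_k most frequent character n-grams to their rank."""
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + n]
                     for n in n_values
                     for i in range(len(text) - n + 1))
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; unseen n-grams get the maximum penalty."""
    max_penalty = len(lang_profile)
    return sum(abs(rank - lang_profile[gram]) if gram in lang_profile
               else max_penalty
               for gram, rank in doc_profile.items())


def identify(text, lang_profiles):
    """Return the language whose profile is closest to the text."""
    doc = ngram_profile(text)
    return min(lang_profiles,
               key=lambda lang: out_of_place(doc, lang_profiles[lang]))


if __name__ == "__main__":
    # Toy training data; real profiles need much larger corpora.
    profiles = {
        "ro": ngram_profile("aceasta este o propoziție scrisă în limba română"),
        "en": ngram_profile("this is a sentence written in the english language"),
    }
    print(identify("aceasta este o propoziție scrisă în limba română", profiles))
```

With profiles this small the result is only reliable on text close to the training data; the short-text weakness mentioned above applies here as well.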
RWScraper: Implementation
• RWScraper is a focused crawler for Romanian
web pages
• Developed using Scrapy, an open-source scraping
framework written in Python
• It uses three main concepts:
– Spiders: responsible for defining rules to restrict the
crawled content to our area of interest
– Items: data we want to scrape from the web pages
– Pipelines: text processing tasks that act on the
crawled web resources
RWScraper Language Validation
• Divide the texts into two categories:
– Diacritics-free texts – DIAFREE
– Genuine Romanian texts – GEN
• 6.40% of the characters in the Romanian texts of
the ro_eu_parliament corpus are diacritics
• One of the problems with this approach is that 4.14%
of texts contained ș, â, and î. Unfortunately, there are
also other languages that possess these diacritics
• Romanian is the only language that uses ț and ă
• Our assumption: if a text has over 600 characters and
no ț/ă are found
– Then it is DIAFREE
– Otherwise it is GEN
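The DIAFREE/GEN rule above is simple enough to write down directly. In this sketch the function name is hypothetical, and two details are assumptions not stated on the slide: uppercase letters are counted, and the legacy cedilla form ţ is treated as equivalent to ț.

```python
import unicodedata

# ț and ă in both cases, plus the cedilla variants ţ/Ţ (an assumption,
# since many web pages still use the legacy code points).
ROMANIAN_ONLY = set("\u021b\u021a\u0103\u0102\u0163\u0162")


def classify_text(text, min_chars=600):
    """Return "DIAFREE" for long texts without ț/ă, otherwise "GEN"."""
    text = unicodedata.normalize("NFC", text)
    has_marker = any(ch in ROMANIAN_ONLY for ch in text)
    if len(text) > min_chars and not has_marker:
        return "DIAFREE"
    return "GEN"


if __name__ == "__main__":
    print(classify_text("a" * 700))              # long, no ț/ă
    print(classify_text("\u0103" + "a" * 700))   # long, contains ă
    print(classify_text("scurt"))                # too short to decide
```

Note that texts of 600 characters or fewer fall into GEN by default under this rule, whether or not they contain diacritics.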
RWScraper Language Validation
• Build language profiles, consisting of:
– Character bigrams and trigrams frequency
– Common words frequency
– Diacritics frequency
– Rare characters frequency
– Double consonant frequency
– Single quotes frequency
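Some of the profile features listed above can be computed as simple relative frequencies. The sketch below covers four of them and leaves out the n-gram and common-word counts; the function name, the choice of k, q, w, y as "rare" characters and the normalization by text length are all assumptions.

```python
import re

DIACRITICS = set("ăâîșțĂÂÎȘȚ")
RARE_CHARACTERS = set("kqwyKQWY")  # assumed: letters found mostly in loanwords
DOUBLE_CONSONANT = re.compile(r"([bcdfghjklmnpqrstvwxz])\1", re.IGNORECASE)


def profile_features(text):
    """Return per-character frequencies for a few profile features."""
    size = max(len(text), 1)  # avoid division by zero on empty input
    return {
        "diacritics": sum(ch in DIACRITICS for ch in text) / size,
        "rare_characters": sum(ch in RARE_CHARACTERS for ch in text) / size,
        "double_consonants": len(DOUBLE_CONSONANT.findall(text)) / size,
        "single_quotes": text.count("'") / size,
    }


if __name__ == "__main__":
    for name, value in profile_features("passing it's").items():
        print(f"{name}: {value:.3f}")
```

A profile built this way for a reference Romanian corpus could then be compared feature by feature with the profile of a candidate text.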
Results: Language Validation
• 105 texts are divided into: 20 Romanian with diacritics (RO1-RO20),
20 Romanian without diacritics (RO21-RO40), 20 Italian, 15 English,
10 Spanish, 5 Latin, 5 French, 5 Turkish, 3 Catalan, and 2 Aromanian
• The size of the texts varied from 9 KB to 2.5 MB, the average
size being 253.4 KB
• Average scores for the discriminator function
– Lower score means higher probability for the text to be written in
Romanian
– Used to set the discriminant threshold to 0.77, which separates
Romanian from non-Romanian texts
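Since a lower score means a higher probability of Romanian, the decision reduces to a comparison against the 0.77 value. The function name is hypothetical, and using a strict comparison at exactly 0.77 is an assumption, as the slide does not say how the boundary case is handled.

```python
ROMANIAN_THRESHOLD = 0.77  # value reported on the slide


def is_romanian(score, threshold=ROMANIAN_THRESHOLD):
    """Lower discriminator scores indicate Romanian text."""
    return score < threshold  # strict comparison is an assumption


if __name__ == "__main__":
    for score in (0.40, 0.77, 1.20):
        print(score, is_romanian(score))
```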
Results
• Processed 264,328 online documents
– Only 12,555 documents contained new words
• From this set of texts, we extracted 698,341 phrases
– Only 47,363 phrases contained new words
• Discovered 53,724 new words
– 21,343 are proper names
• The remaining tokens are common words and they are
divided into the following main categories:
– Misspelled words (approximately 35%)
– Technical words (approximately 15%)
– Argotic words (approximately 10%)
– Clitics, regionalisms, archaisms, and alternative forms of
existing words account for the rest (approximately 40%)
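The figures above imply a few derived quantities, worked out in the short script below. The per-category counts are only rough estimates, because they are obtained from the rounded percentages on the slide; the unit of the 698,341 figure is assumed to be phrases, as suggested by the bullet that follows it.

```python
documents_total = 264_328
documents_with_new_words = 12_555
phrases_total = 698_341          # unit assumed to be phrases
phrases_with_new_words = 47_363
new_words = 53_724
proper_names = 21_343

# Tokens left after removing proper names: 53,724 - 21,343 = 32,381.
common_words = new_words - proper_names

# Approximate shares reported on the slide.
shares = {"misspelled": 0.35, "technical": 0.15,
          "argotic": 0.10, "other": 0.40}
estimates = {name: round(common_words * share)
             for name, share in shares.items()}

document_rate = documents_with_new_words / documents_total
phrase_rate = phrases_with_new_words / phrases_total

if __name__ == "__main__":
    print(f"common-word tokens: {common_words}")
    for name, count in estimates.items():
        print(f"  {name}: about {count}")
    print(f"documents with new words: {document_rate:.2%}")
    print(f"phrases with new words: {phrase_rate:.2%}")
```

Fewer than one document in twenty contributed a new word, which gives a sense of how much crawling is needed per discovered word.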
Results
• Most frequent new words
Conclusions
• RWScraper is a simple system for discovering new
Romanian words
• The project has also managed to create a large
database of Romanian words extracted from the
WWW
– Statistics about common proper names, frequent spelling
mistakes and newly-invented words
• There are several elements that could be further
improved
– The accuracy of the NLP components used by the system
– A more pertinent analysis of the words identified by the
system
Thank you!
Questions?
Discussion
This work has been funded by the
Sectorial Operational Programme
Human Resources Development
2007-2013 of the Romanian Ministry
of European Funds through the
Financial Agreement
POSDRU/159/1.5/S/132397

Istoria Web-ului - part 1 (2) - tentativ How to Web 2009
 
Istoria Web-ului - part 1 - tentativ How to Web 2009
Istoria Web-ului - part 1 - tentativ How to Web 2009Istoria Web-ului - part 1 - tentativ How to Web 2009
Istoria Web-ului - part 1 - tentativ How to Web 2009
 
Algorithm Design and Complexity - Course 12
Algorithm Design and Complexity - Course 12Algorithm Design and Complexity - Course 12
Algorithm Design and Complexity - Course 12
 
Algorithm Design and Complexity - Course 11
Algorithm Design and Complexity - Course 11Algorithm Design and Complexity - Course 11
Algorithm Design and Complexity - Course 11
 
Algorithm Design and Complexity - Course 10
Algorithm Design and Complexity - Course 10Algorithm Design and Complexity - Course 10
Algorithm Design and Complexity - Course 10
 
Algorithm Design and Complexity - Course 9
Algorithm Design and Complexity - Course 9Algorithm Design and Complexity - Course 9
Algorithm Design and Complexity - Course 9
 
Algorithm Design and Complexity - Course 8
Algorithm Design and Complexity - Course 8Algorithm Design and Complexity - Course 8
Algorithm Design and Complexity - Course 8
 
Algorithm Design and Complexity - Course 6
Algorithm Design and Complexity - Course 6Algorithm Design and Complexity - Course 6
Algorithm Design and Complexity - Course 6
 
Algorithm Design and Complexity - Course 5
Algorithm Design and Complexity - Course 5Algorithm Design and Complexity - Course 5
Algorithm Design and Complexity - Course 5
 
Algorithm Design and Complexity - Course 4 - Heaps and Dynamic Progamming
Algorithm Design and Complexity - Course 4 - Heaps and Dynamic ProgammingAlgorithm Design and Complexity - Course 4 - Heaps and Dynamic Progamming
Algorithm Design and Complexity - Course 4 - Heaps and Dynamic Progamming
 
Algorithm Design and Complexity - Course 3
Algorithm Design and Complexity - Course 3Algorithm Design and Complexity - Course 3
Algorithm Design and Complexity - Course 3
 

Dernier

ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...JojoEDelaCruz
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSMae Pangan
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfPatidar M
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptxmary850239
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxVanesaIglesias10
 
Millenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxMillenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxJanEmmanBrigoli
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 

Dernier (20)

ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHS
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdf
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptx
 
Millenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxMillenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptx
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 

A Focused Crawler for Romanian Words Discovery

  • 1. A Focused Crawler for Romanian Words Discovery
    Authors: Ionuț-Gabriel Radu, Traian Rebedea (traian.rebedea@cs.pub.ro)
    University Politehnica of Bucharest
  • 2. Overview
    • Introduction
    • Objective
    • RWScraper
    • Related Work
    • RWScraper: Implementation
    • Results
    • Conclusions
    19.08.15 RoEduNet Conference 2014 – Chișinău, R. Moldova
  • 3. Introduction
    • All natural languages are subject to change over time
    • As the Web becomes more prevalent, it also constitutes a major source for identifying language evolution
    • Due to large amounts of Romanian web content, the rate of change has increased significantly
  • 4. Objective
    • To provide a mechanism to identify new words (e.g. neologisms) that entered the Romanian language
    • Develop a specialized (focused) web crawler for analyzing Romanian web pages and identifying new words
  • 5. Focused Web Crawling
    • Crawling the web with a specific purpose:
      – “Focus” the spiders to specific content (e.g. people search, scientific publications, products, etc.)
      – Ignore other web pages and domains
  • 6. Solution: RWScraper
    • RWScraper (Romanian Word Scraper) is able to solve the following problems:
      – Identify Romanian texts;
      – Distinguish between proper names and common nouns;
      – Create a database with new words along with context information and metadata. In order to identify new
      – Discover the most frequent spelling errors in Romanian online texts.
  • 7. RWScraper – Text Processing
    • Each word discovered in a Romanian text is looked up in the database provided by www.dexonline.ro, which contains definitions from several Romanian dictionaries (DEX, DOOM, etc.)
    • Text Processing Pipeline
      – Text Normalization
      – Language Validation
      – Sentence Segmentation
      – Sentence-Level Language Identification
      – Word Tokenization
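The pipeline steps on slide 7 could be sketched roughly as follows. This is a minimal illustration, not the RWScraper code: the regular expressions, the mapping of legacy cedilla letters to comma-below letters, and the tiny `KNOWN_WORDS` set (a hypothetical stand-in for the dexonline.ro database) are all assumptions, and the language validation steps are left out.

```python
import re
import unicodedata

# Hypothetical stand-in for the dexonline.ro word database.
KNOWN_WORDS = {"acesta", "este", "un", "text", "scurt"}

# Assumption: legacy cedilla letters (s/t with cedilla) are mapped to the
# comma-below letters used in standard Romanian orthography.
CEDILLA_TO_COMMA = str.maketrans(
    "\u015f\u0163\u015e\u0162", "\u0219\u021b\u0218\u021a"
)

def normalize(text):
    """Text normalization: NFC form plus the cedilla mapping above."""
    return unicodedata.normalize("NFC", text).translate(CEDILLA_TO_COMMA)

def split_sentences(text):
    """Sentence segmentation: naive split after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Word tokenization: lowercase runs of letters, hyphens allowed inside."""
    return re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", sentence.lower())

def unknown_words(text):
    """Return (word, sentence) pairs for words missing from KNOWN_WORDS."""
    found = []
    for sentence in split_sentences(normalize(text)):
        for word in tokenize(sentence):
            if word not in KNOWN_WORDS:
                found.append((word, sentence))
    return found
```

Keeping the sentence next to each unknown word mirrors the "context information" mentioned on slide 6.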
  • 8. Related Work: Neologisms Identification
    • A study for Japanese:
      – Scanning existing Japanese corpora for possible "new" words, typically by processing the texts through segmentation software and dealing with the "out-of-lexicon" problem
      – Simulating the Japanese morphological processes to create new possible words and then testing for their presence in large corpora
    • Identification of lexical discriminants (e.g. termed, called, known as) and punctuation discriminants (e.g. single and double quotes) for introducing new words
      – This method is able to identify a significantly smaller number of potential new words due to the limited number of lexical discriminant patterns.
    • Using data about the frequency of word usage over time
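The lexical-discriminant idea from slide 8 can be illustrated with a small regular expression. This is only a sketch under assumptions: it uses the three English discriminants quoted on the slide, accepts an optional opening quote, and captures a single following word; `candidates` is a hypothetical helper name.

```python
import re

# Opening quote characters that may precede the introduced word.
QUOTES = "\"'\u201c\u2018"

# Lexical discriminants taken from the slide: termed, called, known as.
PATTERN = re.compile(
    r"\b(?:termed|called|known as)\s+[" + QUOTES + r"]?([^\W\d_][\w-]*)",
    re.IGNORECASE,
)

def candidates(text):
    """Return the words that directly follow a lexical discriminant."""
    return [match.group(1) for match in PATTERN.finditer(text)]
```

A pattern like this also captures articles ("called a ..."), so a real system would need filtering; the small number of such patterns is the limitation the slide points out.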
  • 9. Related Work: Language Identification
    • Common Words Methods
      – Store and use a list with the most frequent words for each language
    • Unique Letter Combinations
      – Database with the most frequent sequences of letters in a language, not necessarily valid words
      – The main disadvantage: the poor performance on short texts
      – The main advantage: it does not require word tokenization
    • Language Identification Using N-Grams
      – Every language has several specific frequently used character n-grams
      – For a particular language L, the n-gram ordered dictionary is called n-gram language profile
      – For a new text, we compute the distance to all computed language profiles
    • Markov Models for Language Identification
      – The word can be represented as a Markov chain where letters are states
      – Compute a Markov model for each language
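The n-gram profile method on slide 9 can be sketched as below. The slide does not say which distance is used, so this example assumes a rank-based "out-of-place" distance, one common choice for comparing ranked n-gram profiles; the profile size `top_k` and all function names are illustrative.

```python
from collections import Counter

def ngram_profile(text, n_values=(2, 3), top_k=300):
    """Map the top_k most frequent character n-grams to their rank (0 = most frequent)."""
    text = " ".join(text.lower().split())
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language get a fixed penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )

def identify(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

In practice each language profile would be built from a large corpus; short inputs give unstable rankings, which matches the weakness on short texts noted on the slide.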
  • 10. RWScraper: Implementation
    • RWScraper is a focused crawler for Romanian web pages
    • Developed using Scrapy: open-source scraping framework in Python
    • It uses three main concepts:
      – Spiders: responsible for defining rules to restrict the crawled content to our area of interest
      – Items: data we want to scrape from the web pages
      – Pipelines: text processing tasks that act on the crawled web resources
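To make the "Pipelines" concept concrete: a Scrapy item pipeline component is a plain Python class with a `process_item(self, item, spider)` method, so a sketch can be written without importing Scrapy. The class name and the `text` field below are hypothetical, and the item is assumed to be a plain dict; this is not taken from the RWScraper source.

```python
class WhitespaceNormalizationPipeline:
    """Collapse runs of whitespace in the scraped text before later stages."""

    def process_item(self, item, spider):
        # `text` is a hypothetical item field holding the page's extracted text.
        item["text"] = " ".join(item.get("text", "").split())
        # Returning the item hands it on to the next pipeline component.
        return item
```

In a Scrapy project such a class would be enabled through the `ITEM_PIPELINES` setting, with the spider yielding the items that reach it.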
  • 11. RWScraper Language Validation
    • Divide the texts into two categories:
      – Diacritics-free texts – DIAFREE
      – Genuine Romanian texts – GEN
    • 6.40% of the characters in the Romanian texts of the ro_eu_parliament corpus are diacritics
    • One of the problems with this approach is that 4.14% of texts contained ș, â, and î. Unfortunately, there are also other languages that possess these diacritics
    • Romanian is the only language that uses ț and ă
    • Our assumption: if a text has over 600 characters and contains no ț or ă
      – Then it is DIAFREE
      – Otherwise it is GEN
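The DIAFREE/GEN rule on slide 11 translates almost directly into code. The sketch below assumes that uppercase Ț and Ă count as markers too and that "over 600 characters" means strictly more than 600; the function name is illustrative.

```python
# Marker letters from the slide (t with comma below, a with breve),
# plus their uppercase forms as an assumption.
MARKERS = set("\u021b\u0103\u021a\u0102")

def classify_diacritics(text, min_length=600):
    """Return "DIAFREE" for long texts with no marker letters, else "GEN"."""
    has_marker = any(ch in MARKERS for ch in text)
    if len(text) > min_length and not has_marker:
        return "DIAFREE"
    return "GEN"
```

Under this rule, texts of 600 characters or fewer always fall into GEN, since the absence of a marker letter in a short text is weak evidence.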
  • 12. RWScraper Language Validation
    • Build language profiles, consisting of:
      – Character bigrams and trigrams frequency
      – Common words frequency
      – Diacritics frequency
      – Rare characters frequency
      – Double consonant frequency
      – Single quotes frequency
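Three of the profile features on slide 12 can be computed as simple relative frequencies. The sketch below is an assumption about how such features might be defined (counts divided by text length); the n-gram, common-word, and rare-character features are omitted because the slide does not define the word and character lists they rely on.

```python
import re

# Lowercase Romanian diacritics: a-breve, a-circumflex, i-circumflex,
# s-comma, t-comma.
DIACRITICS = set("\u0103\u00e2\u00ee\u0219\u021b")

# A consonant immediately followed by the same consonant (ASCII letters only).
DOUBLE_CONSONANT = re.compile(r"([b-df-hj-np-tv-z])\1")

def profile_features(text):
    """Return per-character frequencies for three of the slide's features."""
    lowered = text.lower()
    n = max(len(text), 1)  # avoid division by zero on empty input
    return {
        "diacritics": sum(ch in DIACRITICS for ch in lowered) / n,
        "double_consonants": len(DOUBLE_CONSONANT.findall(lowered)) / n,
        "single_quotes": text.count("'") / n,
    }
```

Double consonants and single quotes are rare in Romanian but common in, for example, Italian or English, which is presumably why they help separate Romanian texts without diacritics from other languages.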
  • 13. Results: Language Validation
    • 105 texts are divided into: 20 Romanian with diacritics (RO1–RO20), 20 Romanian without diacritics (RO21–RO40), 20 Italian, 15 English, 10 Spanish, 5 Latin, 5 French, 5 Turkish, 3 Catalan, and 2 Aromanian texts
    • The size of the texts varied from 9 KB to 2.5 MB, the average size being 253.4 KB
    • Average scores for the discriminator function
      – Lower score means higher probability for the text to be written in Romanian
      – Used to set the discriminant score to 0.77 to separate between Romanian and non-Romanian texts
  • 14. Results
    • Processed 264,328 online documents
      – Only 12,555 documents contained new words
    • From this set of texts, we extracted 698,341 phrases
      – Only 47,363 phrases contained new words
    • Discovered 53,724 new words
      – 21,343 are proper names
    • The remaining tokens are common words and they are divided into the following main categories:
      – Misspelled words (approximately 35%)
      – Technical words (approximately 15%)
      – Argotic words (approximately 10%)
      – Clitics, regionalisms, archaisms, alternative forms for existing words account for the rest (approximately 40%)
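The figures on slide 14 imply a few derived numbers that are easy to check. The short calculation below uses only the counts reported on the slide; it assumes the 698,341 extracted units are phrases, and the per-category counts are rough estimates because the slide gives the percentages only approximately.

```python
documents_total = 264_328
documents_with_new_words = 12_555
phrases_total = 698_341  # assumption: the extracted units are phrases
phrases_with_new_words = 47_363
new_words = 53_724
proper_names = 21_343

# New words that are not proper names, i.e. the "remaining tokens".
common_words = new_words - proper_names

# Share of documents and phrases that contained at least one new word.
document_share = 100 * documents_with_new_words / documents_total
phrase_share = 100 * phrases_with_new_words / phrases_total

# Approximate category sizes from the slide's rounded percentages.
category_shares = {"misspelled": 0.35, "technical": 0.15, "argotic": 0.10, "other": 0.40}
category_estimates = {
    name: round(common_words * share) for name, share in category_shares.items()
}

print(f"common words: {common_words}")
print(f"documents with new words: {document_share:.2f}%")
print(f"phrases with new words: {phrase_share:.2f}%")
print(category_estimates)
```

Roughly one document in twenty contained a new word, and about 32,000 of the discovered words are not proper names.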
  • 15. Results
    • Most frequent new words
  • 16. Conclusions
    • RWScraper is a simple new Romanian words discovery system
    • The project has also managed to create a large database of Romanian words extracted from the WWW
      – Statistics about common proper names, frequent spelling mistakes and newly-invented words
    • There are several elements that could be further improved
      – The accuracy of the NLP components used by the system
      – A more pertinent analysis of the words identified by the system
  • 17. Thank you! Questions? Discussion
    This work has been funded by the Sectorial Operational Programme Human Resources Development 2007-2013 of the Romanian Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/132397