SlideShare une entreprise Scribd logo
1  sur  13
Vector Norms And OOV Words
 Visualize vector norms vs term-frequency (count)
 FastText Norm vs TF ~ Word2Vec Norm vs TF
 Norm ~ context specificity?
 non-averaged norm ~ non-english word indicator
Vaclav Kosar
 studied physics
 develops software
 dabbles in ML
 works for “Time is Ltd”
startup (internal comms
analytics)
@
Word embeddings intro
 Word is represented by an
100 dim vector
 Representation is trained or
statistical
 Word similarity via cosine
similarity (angle)
 Vector norm is length of a
vector
Word2vec norms (Schakel 2015)
 (Schakel 2015) Measuring Word Significance
using Distributed Representations of Words
 Word significance ~ context distinctiveness
 Document term frequency vs global TF (tf-idf)
 word2vec norms ~ word significance (within
training corpus and frequency band)
Word2vec norms (Schakel 2015)
 Corpus mainly about Q Mechanics
 Longest vectors in the high-tf
bands:
 “inflation” (v=4.64, tf= 571)
 “sitter” (v= 3.81, tf= 1490) “de
Sitter”
 “holes” (v= 3.41, tf=2465) “black
holes”
 Word vector length versus term
frequency of all words in the hep-th
vocabulary.
 ngram is subsequence within word
 fastText(word) = (vword + Σg ∈ ngrams(word) vg)/ (1 + |ngrams(word)|)
 If the word is not present in the dictionary (OOV) only n-grams
vectors are used.
 To study OOV words removed asymmetry by only utilizing
word’s n-grams vectors
 10k most common english words to contrast them with FT
vocabulary including dataset artifacts (non-words)
FastText: Standard Vector Norm
 Standard
 4 clusters
with unknown
meaning
FastText: No-Ngrams Norm
 Only whole-
word vectors
used
 Norm ~ word
significance
 Shape similar
to word2vec
FastText: NG_Norm
 Only ngram
vectors used
 No ngram count
averaging
 common words in
narrow norm-band
FastText: Hypo and Hypernyms
 67 hypo-hypernyms
norms pairs
 Disregarding TF
no_ngram_norm
most predictive
 Filtering to max
30% away TF to 7
samples standard
norm most
predictive
hyper hypo standard_norm no_ngram_norm ng_norm count
month January -25.4 11.5 25.4 55.6
month February -41.7 12.5 26.8 44.6
month March 19.6 7.9 19.6 62.8
month April 23.0 9.9 23.0 60.6
month May 110.6 6.4 14.7 104.7
month June 60.5 9.2 21.8 57.3
color red 103.4 -2.5 4.6 -15.4
color black -5.6 -3.8 -5.6 23.1
color pink 47.1 1.4 7.5 -119.8
color yellow -28.8 1.5 -0.3 -111.3
color cyan 61.0 17.2 22.4 -197.7
color violet -33.9 9.6 -5.4 -193.6
… … … … … …
average 2.6 8.1 7.7 -105.1
counts 43.9 77.3 65.2
counts selected 43.9 77.3 65.2
FastText: Vocab vs Common
 NG_Norm
distributions
 Bayes
approach
FastText: Splitting To English Words
 Algo:
 Join 2
common
words
 Generate all
possible
splits
 Decide
looking at
norms
 48% accuracy
word1 word2 norm1 norm2 prob1 prob2 prob
i nflationlithium 0 4.20137 0 0.000397 0
in flationlithium 0 4.40944 0 0.000519 0
inf lationlithium 1.88772 3.86235 0.010414 0.000741 7.721472E-06
infl ationlithium 2.29234 4.04391 0.053977 0.000428 2.308942E-05
infla tionlithium 2.24394 4.74456 0.052467 0 0
inflat ionlithium 2.55929 3.45802 0.048715 0.002442 0.0001189513
inflati onlithium 3.10228 3.55187 0.007973 0.001767 1.408828E-05
inflatio nlithium 3.34667 3.26616 0.003907 0.003159 1.234263E-05
inflation lithium 2.87083 2.73886 0.017853 0.035389 0.0006318213
inflationl ithium 3.36933 2.35156 0.002887 0.053333 0.0001539945
inflationli thium 3.73344 2.21766 0.001283 0.052467 6.730259E-05
inflationlit hium 4.16165 1.66477 9.6E-05 0.004324 4.139165E-07
inflationlith ium 4.40217 1.59184 0.000519 0.002212 1.147982E-06
inflationlithi um 4.71089 0 0 0 0
inflationlithiu m 4.91263 0 0.000213 0 0
Conclusion
 Visualized FastText vector norms vs term-
frequency
 Standard Norm vs Term-Frequency plot
interesting clustering of common words
 No-NGram Norm vs TF is shaped as Word2Vec
Norm vs TF
 The word significance ~ No-N-Gram Norm.

Contenu connexe

Tendances (8)

lesson 3 presentation of data and frequency distribution
lesson 3 presentation of data and frequency distributionlesson 3 presentation of data and frequency distribution
lesson 3 presentation of data and frequency distribution
 
frequency distribution table
frequency distribution tablefrequency distribution table
frequency distribution table
 
Chapter 3: Prsentation of Data
Chapter 3: Prsentation of DataChapter 3: Prsentation of Data
Chapter 3: Prsentation of Data
 
2.1 frequency distributions, histograms, and related topics
2.1 frequency distributions, histograms, and related topics2.1 frequency distributions, histograms, and related topics
2.1 frequency distributions, histograms, and related topics
 
frequency distribution
 frequency distribution frequency distribution
frequency distribution
 
Classes
ClassesClasses
Classes
 
Frequency distribution & graph
Frequency distribution & graphFrequency distribution & graph
Frequency distribution & graph
 
Psychological Statistics Chapter 2
Psychological Statistics Chapter 2Psychological Statistics Chapter 2
Psychological Statistics Chapter 2
 

Similaire à FastText Vector Norms And OOV Words

RANLP 2013: DutchSemcor in quest of the ideal corpus
RANLP 2013: DutchSemcor in quest of the ideal corpusRANLP 2013: DutchSemcor in quest of the ideal corpus
RANLP 2013: DutchSemcor in quest of the ideal corpus
Rubén Izquierdo Beviá
 
Source of DATA
Source of DATASource of DATA
Source of DATA
Nahid Amin
 
Tabulation of Data, Frequency Distribution, Contingency table
Tabulation of Data, Frequency Distribution, Contingency tableTabulation of Data, Frequency Distribution, Contingency table
Tabulation of Data, Frequency Distribution, Contingency table
Jagdish Powar
 
AMTA'2008 translation universals
AMTA'2008 translation universalsAMTA'2008 translation universals
AMTA'2008 translation universals
Naveed Afzal
 
Building an Inflectional Stemmer for Bulgarian
Building an Inflectional Stemmer for BulgarianBuilding an Inflectional Stemmer for Bulgarian
Building an Inflectional Stemmer for Bulgarian
Svetlin Nakov
 

Similaire à FastText Vector Norms And OOV Words (20)

1. descriptive statistics
1. descriptive statistics1. descriptive statistics
1. descriptive statistics
 
B.Ed _ Test _ and _ Measurement _ record
B.Ed _ Test _ and _ Measurement _ recordB.Ed _ Test _ and _ Measurement _ record
B.Ed _ Test _ and _ Measurement _ record
 
Development and validation of a vocabulary size test of multiword expressions
Development and validation of a vocabulary size test of multiword expressionsDevelopment and validation of a vocabulary size test of multiword expressions
Development and validation of a vocabulary size test of multiword expressions
 
Lecture 1.pptx
Lecture 1.pptxLecture 1.pptx
Lecture 1.pptx
 
SP and R.pptx
SP and R.pptxSP and R.pptx
SP and R.pptx
 
MT SUMMIT2013 poster boaster slides.Language-independent Model for Machine Tr...
MT SUMMIT2013 poster boaster slides.Language-independent Model for Machine Tr...MT SUMMIT2013 poster boaster slides.Language-independent Model for Machine Tr...
MT SUMMIT2013 poster boaster slides.Language-independent Model for Machine Tr...
 
ACL-WMT Poster.A Description of Tunable Machine Translation Evaluation System...
ACL-WMT Poster.A Description of Tunable Machine Translation Evaluation System...ACL-WMT Poster.A Description of Tunable Machine Translation Evaluation System...
ACL-WMT Poster.A Description of Tunable Machine Translation Evaluation System...
 
20- Tabular & Graphical Presentation of data(UG2017-18).ppt
20- Tabular & Graphical Presentation of data(UG2017-18).ppt20- Tabular & Graphical Presentation of data(UG2017-18).ppt
20- Tabular & Graphical Presentation of data(UG2017-18).ppt
 
20- Tabular & Graphical Presentation of data(UG2017-18).ppt
20- Tabular & Graphical Presentation of data(UG2017-18).ppt20- Tabular & Graphical Presentation of data(UG2017-18).ppt
20- Tabular & Graphical Presentation of data(UG2017-18).ppt
 
Frequency distribution 6
Frequency distribution 6Frequency distribution 6
Frequency distribution 6
 
RANLP2013: DutchSemCor, in Quest of the Ideal Sense Tagged Corpus
RANLP2013: DutchSemCor, in Quest of the Ideal Sense Tagged CorpusRANLP2013: DutchSemCor, in Quest of the Ideal Sense Tagged Corpus
RANLP2013: DutchSemCor, in Quest of the Ideal Sense Tagged Corpus
 
Frequency Distribution
Frequency DistributionFrequency Distribution
Frequency Distribution
 
Data and Statistics
Data and StatisticsData and Statistics
Data and Statistics
 
RANLP 2013: DutchSemcor in quest of the ideal corpus
RANLP 2013: DutchSemcor in quest of the ideal corpusRANLP 2013: DutchSemcor in quest of the ideal corpus
RANLP 2013: DutchSemcor in quest of the ideal corpus
 
Apurba Paul - 2015 - Identification and Classification of Emotional Key Phras...
Apurba Paul - 2015 - Identification and Classification of Emotional Key Phras...Apurba Paul - 2015 - Identification and Classification of Emotional Key Phras...
Apurba Paul - 2015 - Identification and Classification of Emotional Key Phras...
 
Source of DATA
Source of DATASource of DATA
Source of DATA
 
Tabulation of Data, Frequency Distribution, Contingency table
Tabulation of Data, Frequency Distribution, Contingency tableTabulation of Data, Frequency Distribution, Contingency table
Tabulation of Data, Frequency Distribution, Contingency table
 
3_-frequency_distribution.pptx
3_-frequency_distribution.pptx3_-frequency_distribution.pptx
3_-frequency_distribution.pptx
 
AMTA'2008 translation universals
AMTA'2008 translation universalsAMTA'2008 translation universals
AMTA'2008 translation universals
 
Building an Inflectional Stemmer for Bulgarian
Building an Inflectional Stemmer for BulgarianBuilding an Inflectional Stemmer for Bulgarian
Building an Inflectional Stemmer for Bulgarian
 

Plus de Vaclav Kosar

Plus de Vaclav Kosar (6)

Conversation with-search-engines (Ren et al. 2020)
Conversation with-search-engines (Ren et al. 2020)Conversation with-search-engines (Ren et al. 2020)
Conversation with-search-engines (Ren et al. 2020)
 
Simulation of Soft Photon Calorimeter @ 2011 JINR, Dubna Student Practice
Simulation of Soft Photon Calorimeter @ 2011 JINR, Dubna Student PracticeSimulation of Soft Photon Calorimeter @ 2011 JINR, Dubna Student Practice
Simulation of Soft Photon Calorimeter @ 2011 JINR, Dubna Student Practice
 
Spline 0.3 and Plans for 0.4
Spline 0.3 and Plans for 0.4 Spline 0.3 and Plans for 0.4
Spline 0.3 and Plans for 0.4
 
Spline 0.3 User Guide
Spline 0.3 User GuideSpline 0.3 User Guide
Spline 0.3 User Guide
 
Spline 2 - Vision and Architecture Overview
Spline 2 - Vision and Architecture OverviewSpline 2 - Vision and Architecture Overview
Spline 2 - Vision and Architecture Overview
 
Spline: Data Lineage For Spark Structured Streaming
Spline: Data Lineage For Spark Structured StreamingSpline: Data Lineage For Spark Structured Streaming
Spline: Data Lineage For Spark Structured Streaming
 

Dernier

Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
levieagacer
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Sérgio Sacani
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
MohamedFarag457087
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
Areesha Ahmad
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
NazaninKarimi6
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 

Dernier (20)

Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
 
Introduction to Viruses
Introduction to VirusesIntroduction to Viruses
Introduction to Viruses
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
chemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdfchemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdf
 
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 

FastText Vector Norms And OOV Words

  • 1. Vector Norms And OOV Words  Visualize vector norms vs term-frequency (count)  FastText Norm vs TF ~ Word2Vec Norm vs TF  Norm ~ context specificity?  non-averaged norm ~ non-english word indicator
  • 2. Vaclav Kosar  studied physics  develops software  dabbles in ML  works for “Time is Ltd” startup (internal comms analytics) @
  • 3. Word embeddings intro  Word is represented by an 100 dim vector  Representation is trained or statistical  Word similarity via cosine similarity (angle)  Vector norm is length of a vector
  • 4. Word2vec norms (Schakel 2015)  (Schakel 2015) Measuring Word Significance using Distributed Representations of Words  Word significance ~ context distinctiveness  Document term frequency vs global TF (tf-idf)  word2vec norms ~ word significance (within training corpus and frequency band)
  • 5. Word2vec norms (Schakel 2015)  Corpus mainly about Q Mechanics  Longest vectors in the high-tf bands:  “inflation” (v=4.64, tf= 571)  “sitter” (v= 3.81, tf= 1490) “de Sitter”  “holes” (v= 3.41, tf=2465) “black holes”  Word vector length versus term frequency of all words in the hep-th vocabulary.
  • 6.  ngram is subsequence within word  fastText(word) = (vword + Σg ∈ ngrams(word) vg)/ (1 + |ngrams(word)|)  If the word is not present in the dictionary (OOV) only n-grams vectors are used.  To study OOV words removed asymmetry by only utilizing word’s n-grams vectors  10k most common english words to contrast them with FT vocabulary including dataset artifacts (non-words)
  • 7. FastText: Standard Vector Norm  Standard  4 clusters with unknown meaning
  • 8. FastText: No-Ngrams Norm  Only whole- word vectors used  Norm ~ word significance  Shape similar to word2vec
  • 9. FastText: NG_Norm  Only ngram vectors used  No ngram count averaging  common words in narrow norm-band
  • 10. FastText: Hypo and Hypernyms  67 hypo-hypernyms norms pairs  Disregarding TF no_ngram_norm most predictive  Filtering to max 30% away TF to 7 samples standard norm most predictive hyper hypo standard_norm no_ngram_norm ng_norm count month January -25.4 11.5 25.4 55.6 month February -41.7 12.5 26.8 44.6 month March 19.6 7.9 19.6 62.8 month April 23.0 9.9 23.0 60.6 month May 110.6 6.4 14.7 104.7 month June 60.5 9.2 21.8 57.3 color red 103.4 -2.5 4.6 -15.4 color black -5.6 -3.8 -5.6 23.1 color pink 47.1 1.4 7.5 -119.8 color yellow -28.8 1.5 -0.3 -111.3 color cyan 61.0 17.2 22.4 -197.7 color violet -33.9 9.6 -5.4 -193.6 … … … … … … average 2.6 8.1 7.7 -105.1 counts 43.9 77.3 65.2 counts selected 43.9 77.3 65.2
  • 11. FastText: Vocab vs Common  NG_Norm distributions  Bayes approach
  • 12. FastText: Splitting To English Words  Algo:  Join 2 common words  Generate all possible splits  Decide looking at norms  48% accuracy word1 word2 norm1 norm2 prob1 prob2 prob i nflationlithium 0 4.20137 0 0.000397 0 in flationlithium 0 4.40944 0 0.000519 0 inf lationlithium 1.88772 3.86235 0.010414 0.000741 7.721472E-06 infl ationlithium 2.29234 4.04391 0.053977 0.000428 2.308942E-05 infla tionlithium 2.24394 4.74456 0.052467 0 0 inflat ionlithium 2.55929 3.45802 0.048715 0.002442 0.0001189513 inflati onlithium 3.10228 3.55187 0.007973 0.001767 1.408828E-05 inflatio nlithium 3.34667 3.26616 0.003907 0.003159 1.234263E-05 inflation lithium 2.87083 2.73886 0.017853 0.035389 0.0006318213 inflationl ithium 3.36933 2.35156 0.002887 0.053333 0.0001539945 inflationli thium 3.73344 2.21766 0.001283 0.052467 6.730259E-05 inflationlit hium 4.16165 1.66477 9.6E-05 0.004324 4.139165E-07 inflationlith ium 4.40217 1.59184 0.000519 0.002212 1.147982E-06 inflationlithi um 4.71089 0 0 0 0 inflationlithiu m 4.91263 0 0.000213 0 0
  • 13. Conclusion  Visualized FastText vector norms vs term- frequency  Standard Norm vs Term-Frequency plot interesting clustering of common words  No-NGram Norm vs TF is shaped as Word2Vec Norm vs TF  The word significance ~ No-N-Gram Norm.