SlideShare une entreprise Scribd logo
1  sur  17
Vectorization 
Core Concepts in Data Mining
Topic Index 
• Why Vectorization? 
• Vector Space Model 
• Bag of Words 
• TF-IDF 
• N-Grams 
• Kernel Hashing
“How is it possible for a slow, tiny brain, whether biological or 
electronic, to perceive, understand, predict, and manipulate a 
world far larger and more complicated than itself?” 
--- Peter Norvig, “Artificial Intelligence: A Modern Approach” 
WHY VECTORIZATION?
Classic Scenario: 
“Classify some tweets 
for positive vs 
negative sentiment”
What Needs to Happen? 
• Need each tweet as some structure that can be 
fed to a learning algorithm 
– To represent the knowledge of “negative” vs 
“positive” tweet 
• How does that happen? 
– We need to take the raw text and convert it into what 
is called a “vector” 
• Vector relates to the fundamentals of linear 
algebra 
– “Solving sets of linear equations”
Wait. What’s a Vector Again? 
• An array of floating point numbers 
• Represents data 
– Text 
– Audio 
– Image 
• Example: 
–[ 1.0, 0.0, 1.0, 0.5 ]
“I am putting myself to the fullest possible use, which is 
all I think that any conscious entity can ever hope to do.” 
--- Hal, 2001 
VECTOR SPACE MODEL
Vector Space Model 
• Common way of vectorizing text 
– every possible word is mapped to a specific integer 
• If we have a large enough array then every word 
fits into a unique slot in the array 
– value at that index is the number of the times the 
word occurs 
• Most often our array size is less than our corpus 
vocabulary 
– so we have to have a “vectorization strategy” to 
account for this
Text Can Include Several Stages 
• Sentence Segmentation 
– can skip straight to tokenization depending on use case 
• Tokenization 
– find individual words 
• Lemmatization 
– finding the base or stem of words 
• Removing Stop words 
– “the”, “and”, etc 
• Vectorization 
– we take the output of the process and make an array of 
floating point values
“A man who carries a cat by the tail learns something he can 
learn in no other way.” 
--- Mark Twain 
TEXT VECTORIZATION STRATEGIES
Bag of Words 
• A group of words or a document is represented as a bag 
– or “multi-set” of its words 
• Bag of words is a list of words and their word counts 
– simplest vector model 
– but can end up using a lot of columns due to number of words 
involved. 
• Grammar and word ordering is ignored 
– but we still track how many times the word occurs in the 
document 
• has been used most frequently in the document 
classification 
– and information retrieval domains.
Term frequency inverse document 
frequency (TF-IDF) 
• Fixes some issues with “bag of words” 
• allows us to leverage the information about 
how often a word occurs in a document (TF) 
– while considering the frequency of the word in the 
corpus to control for the facet that some words 
will be more common than others (IDF) 
• more accurate than the basic bag of words 
model 
– but computationally more expensive
TF-IDF Formula 
• wi = TFi * IDFi 
• TFi(t) 
– = (Number of times term t appears in a document) / 
(Total number of terms in the document). 
• IDFi = log (N / Dfi) 
– N is total documents in corpus 
– Dfi is documents containing the term t
N-grams 
• A group of words in a sequence is called an n-gram 
• A single word can be called a unigram 
• Two words like “Coca Cola” can be considered 
a single unit and called a bigram 
• Three and more terms can be called trigrams, 
4-grams, 5-grams and so on and so forth
N-Grams Usage 
• If we combine the unigrams and bigrams from a document and 
generate weights using TF-IDF 
– will end up with large vectors with many meaningless bigrams 
– having large weights on account of their large IDF 
• Can pass n-gram through something called a log-likelihood test 
– which can determine whether two words occurred together rather by 
chance, or because they form a significant unit 
– It selects the most significant ones and prunes away the least 
significant ones 
• Using the remaining n-grams, TF-IDF weighting scheme is applied 
and vectors are produced 
– In this way, significant bigrams like “Coca Cola” can be more properly 
accounted for in a TF-IDF weighting.
Kernel Hashing 
• When we want to vectorize the data in a single 
pass 
– making it a “just in time” vectorizer. 
• Can be used when we want to vectorize text right 
before we feed it to our learning algorithm. 
• We come up with a fixed sized vector that is 
typically smaller than the total possible words 
that we could index or vectorize 
– Then we use a hash function to create an index into 
the vector.
More Kernel Hashing 
• Advantage to use kernel hashing is that we don’t 
need the pre-cursor pass like we do with TF-IDF 
– but we run the risk of having collisions between words 
• The reality is that these collisions occur very 
infrequently 
– and don’t have a noticeable impact on learning 
performance 
• For more reading: 
– http://jeremydhoon.github.com/2013/03/19/abusing-hash- 
kernels-for-wildly-unprincipled-machine-learning/

Contenu connexe

Tendances

Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
Yasir Khan
 

Tendances (20)

HDFS Architecture
HDFS ArchitectureHDFS Architecture
HDFS Architecture
 
Hadoop 1.x vs 2
Hadoop 1.x vs 2Hadoop 1.x vs 2
Hadoop 1.x vs 2
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
Data Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsData Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlations
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATA
 
Word2Vec
Word2VecWord2Vec
Word2Vec
 
Building a Neural Machine Translation System From Scratch
Building a Neural Machine Translation System From ScratchBuilding a Neural Machine Translation System From Scratch
Building a Neural Machine Translation System From Scratch
 
Apriori algorithm
Apriori algorithmApriori algorithm
Apriori algorithm
 
Apriori algorithm
Apriori algorithmApriori algorithm
Apriori algorithm
 
Apriori Algorithm
Apriori AlgorithmApriori Algorithm
Apriori Algorithm
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming Algorithms
 
Association Analysis in Data Mining
Association Analysis in Data MiningAssociation Analysis in Data Mining
Association Analysis in Data Mining
 
Data Mining: Association Rules Basics
Data Mining: Association Rules BasicsData Mining: Association Rules Basics
Data Mining: Association Rules Basics
 
Deep Learning Tutorial
Deep Learning TutorialDeep Learning Tutorial
Deep Learning Tutorial
 
TensorFlow and Keras: An Overview
TensorFlow and Keras: An OverviewTensorFlow and Keras: An Overview
TensorFlow and Keras: An Overview
 
Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)
 
Information retrieval 9 tf idf weights
Information retrieval 9 tf idf weightsInformation retrieval 9 tf idf weights
Information retrieval 9 tf idf weights
 
Performance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning AlgorithmsPerformance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning Algorithms
 
Introduction to Transformer Model
Introduction to Transformer ModelIntroduction to Transformer Model
Introduction to Transformer Model
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 

En vedette

Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTKStatistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Olivier Grisel
 

En vedette (15)

Vectorization - Georgia Tech - CSE6242 - March 2015
Vectorization - Georgia Tech - CSE6242 - March 2015Vectorization - Georgia Tech - CSE6242 - March 2015
Vectorization - Georgia Tech - CSE6242 - March 2015
 
The road to the launch of vectoring in Belgium
The road to the launch of vectoring in BelgiumThe road to the launch of vectoring in Belgium
The road to the launch of vectoring in Belgium
 
Vectorization
VectorizationVectorization
Vectorization
 
California Theft Crimes: Understanding Offenses and Penalties
California Theft Crimes: Understanding Offenses and PenaltiesCalifornia Theft Crimes: Understanding Offenses and Penalties
California Theft Crimes: Understanding Offenses and Penalties
 
Data Insights - sentiXchange
Data Insights - sentiXchangeData Insights - sentiXchange
Data Insights - sentiXchange
 
Xgboost
XgboostXgboost
Xgboost
 
Natural Language Processing and Machine Learning
Natural Language Processing and Machine LearningNatural Language Processing and Machine Learning
Natural Language Processing and Machine Learning
 
Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and Phrases
 
Sentiment Analysis
Sentiment AnalysisSentiment Analysis
Sentiment Analysis
 
NLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in PythonNLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in Python
 
Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelines
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text Mining
 
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTKStatistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
 
QCon Rio - Machine Learning for Everyone
QCon Rio - Machine Learning for EveryoneQCon Rio - Machine Learning for Everyone
QCon Rio - Machine Learning for Everyone
 

Similaire à Intro to Vectorization Concepts - GaTech cse6242

NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptxNLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
rohithprabhas1
 
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
gagravarr
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Lucidworks
 

Similaire à Intro to Vectorization Concepts - GaTech cse6242 (20)

Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with Solr
 
What might a spoken corpus tell us about language
What might a spoken corpus tell us about languageWhat might a spoken corpus tell us about language
What might a spoken corpus tell us about language
 
Text Mining
Text MiningText Mining
Text Mining
 
Automated Abstracts and Big Data
Automated Abstracts and Big DataAutomated Abstracts and Big Data
Automated Abstracts and Big Data
 
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptxNLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...
 
Data Mining Email SPam Detection PPT WITH Algorithms
Data Mining Email SPam Detection PPT WITH AlgorithmsData Mining Email SPam Detection PPT WITH Algorithms
Data Mining Email SPam Detection PPT WITH Algorithms
 
2_text operationinformation retrieval. ppt
2_text operationinformation retrieval. ppt2_text operationinformation retrieval. ppt
2_text operationinformation retrieval. ppt
 
Amharic document clustering
Amharic document clusteringAmharic document clustering
Amharic document clustering
 
Taming Text
Taming TextTaming Text
Taming Text
 
Text similarity measures
Text similarity measuresText similarity measures
Text similarity measures
 
Natural Language Processing using Java
Natural Language Processing using JavaNatural Language Processing using Java
Natural Language Processing using Java
 
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
 
Cork AI Meetup Number 3
Cork AI Meetup Number 3Cork AI Meetup Number 3
Cork AI Meetup Number 3
 
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffnL6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
 
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
 
Information retrieval 10 tf idf and bag of words
Information retrieval 10 tf idf and bag of wordsInformation retrieval 10 tf idf and bag of words
Information retrieval 10 tf idf and bag of words
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
 
Multi task learning stepping away from narrow expert models 7.11.18
Multi task learning stepping away from narrow expert models 7.11.18Multi task learning stepping away from narrow expert models 7.11.18
Multi task learning stepping away from narrow expert models 7.11.18
 

Plus de Josh Patterson

Modeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural NetworksModeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural Networks
Josh Patterson
 
Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2
Josh Patterson
 

Plus de Josh Patterson (20)

Patterson Consulting: What is Artificial Intelligence?
Patterson Consulting: What is Artificial Intelligence?Patterson Consulting: What is Artificial Intelligence?
Patterson Consulting: What is Artificial Intelligence?
 
What is Artificial Intelligence
What is Artificial IntelligenceWhat is Artificial Intelligence
What is Artificial Intelligence
 
Smart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVecSmart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVec
 
Deep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVecDeep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVec
 
Deep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the EnterpriseDeep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the Enterprise
 
Modeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural NetworksModeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural Networks
 
Building Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JBuilding Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4J
 
How to Build Deep Learning Models
How to Build Deep Learning ModelsHow to Build Deep Learning Models
How to Build Deep Learning Models
 
Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015
 
Enterprise Deep Learning with DL4J
Enterprise Deep Learning with DL4JEnterprise Deep Learning with DL4J
Enterprise Deep Learning with DL4J
 
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015Deep Learning Intro - Georgia Tech - CSE6242 - March 2015
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015
 
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
 
Georgia Tech cse6242 - Intro to Deep Learning and DL4J
Georgia Tech cse6242 - Intro to Deep Learning and DL4JGeorgia Tech cse6242 - Intro to Deep Learning and DL4J
Georgia Tech cse6242 - Intro to Deep Learning and DL4J
 
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on HadoopHadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
 
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARNMLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
 
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNHadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
 
Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2
 
Knitting boar - Toronto and Boston HUGs - Nov 2012
Knitting boar - Toronto and Boston HUGs - Nov 2012Knitting boar - Toronto and Boston HUGs - Nov 2012
Knitting boar - Toronto and Boston HUGs - Nov 2012
 
LA HUG Dec 2011 - Recommendation Talk
LA HUG Dec 2011 - Recommendation TalkLA HUG Dec 2011 - Recommendation Talk
LA HUG Dec 2011 - Recommendation Talk
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
 

Dernier

In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
HyderabadDolls
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
HyderabadDolls
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 

Dernier (20)

In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?
 
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime GiridihGiridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 

Intro to Vectorization Concepts - GaTech cse6242

  • 2. Topic Index • Why Vectorization? • Vector Space Model • Bag of Words • TF-IDF • N-Grams • Kernel Hashing
  • 3. “How is it possible for a slow, tiny brain, whether biological or electronic, to perceive, understand, predict, and manipulate a world far larger and more complicated than itself?” --- Peter Norvig, “Artificial Intelligence: A Modern Approach” WHY VECTORIZATION?
  • 4. Classic Scenario: “Classify some tweets for positive vs negative sentiment”
  • 5. What Needs to Happen? • Need each tweet as some structure that can be fed to a learning algorithm – To represent the knowledge of “negative” vs “positive” tweet • How does that happen? – We need to take the raw text and convert it into what is called a “vector” • Vector relates to the fundamentals of linear algebra – “Solving sets of linear equations”
  • 6. Wait. What’s a Vector Again? • An array of floating point numbers • Represents data – Text – Audio – Image • Example: –[ 1.0, 0.0, 1.0, 0.5 ]
  • 7. “I am putting myself to the fullest possible use, which is all I think that any conscious entity can ever hope to do.” --- Hal, 2001 VECTOR SPACE MODEL
  • 8. Vector Space Model • Common way of vectorizing text – every possible word is mapped to a specific integer • If we have a large enough array then every word fits into a unique slot in the array – value at that index is the number of the times the word occurs • Most often our array size is less than our corpus vocabulary – so we have to have a “vectorization strategy” to account for this
  • 9. Text Can Include Several Stages • Sentence Segmentation – can skip straight to tokenization depending on use case • Tokenization – find individual words • Lemmatization – finding the base or stem of words • Removing Stop words – “the”, “and”, etc • Vectorization – we take the output of the process and make an array of floating point values
  • 10. “A man who carries a cat by the tail learns something he can learn in no other way.” --- Mark Twain TEXT VECTORIZATION STRATEGIES
  • 11. Bag of Words • A group of words or a document is represented as a bag – or “multi-set” of its words • Bag of words is a list of words and their word counts – simplest vector model – but can end up using a lot of columns due to number of words involved. • Grammar and word ordering is ignored – but we still track how many times the word occurs in the document • has been used most frequently in the document classification – and information retrieval domains.
  • 12. Term frequency inverse document frequency (TF-IDF) • Fixes some issues with “bag of words” • allows us to leverage the information about how often a word occurs in a document (TF) – while considering the frequency of the word in the corpus to control for the facet that some words will be more common than others (IDF) • more accurate than the basic bag of words model – but computationally more expensive
  • 13. TF-IDF Formula • wi = TFi * IDFi • TFi(t) – = (Number of times term t appears in a document) / (Total number of terms in the document). • IDFi = log (N / Dfi) – N is total documents in corpus – Dfi is documents containing the term t
  • 14. N-grams • A group of words in a sequence is called an n-gram • A single word can be called a unigram • Two words like “Coca Cola” can be considered a single unit and called a bigram • Three and more terms can be called trigrams, 4-grams, 5-grams and so on and so forth
  • 15. N-Grams Usage • If we combine the unigrams and bigrams from a document and generate weights using TF-IDF – will end up with large vectors with many meaningless bigrams – having large weights on account of their large IDF • Can pass n-gram through something called a log-likelihood test – which can determine whether two words occurred together rather by chance, or because they form a significant unit – It selects the most significant ones and prunes away the least significant ones • Using the remaining n-grams, TF-IDF weighting scheme is applied and vectors are produced – In this way, significant bigrams like “Coca Cola” can be more properly accounted for in a TF-IDF weighting.
  • 16. Kernel Hashing • When we want to vectorize the data in a single pass – making it a “just in time” vectorizer. • Can be used when we want to vectorize text right before we feed it to our learning algorithm. • We come up with a fixed sized vector that is typically smaller than the total possible words that we could index or vectorize – Then we use a hash function to create an index into the vector.
  • 17. More Kernel Hashing • Advantage to use kernel hashing is that we don’t need the pre-cursor pass like we do with TF-IDF – but we run the risk of having collisions between words • The reality is that these collisions occur very infrequently – and don’t have a noticeable impact on learning performance • For more reading: – http://jeremydhoon.github.com/2013/03/19/abusing-hash- kernels-for-wildly-unprincipled-machine-learning/